tokenize

part of shnell – a source to source compiler enhancement tool

© Jens Gustedt, 2019

Tokenize the input stream

This script tokenizes a byte stream according to C’s rules for the different kinds of tokens.

After tokenization, the tokens are surrounded by spaces and by a number of markers (composed of control characters, see below), so that other scripts can easily process them word by word. This is done in such a way that the spacing of the source can mostly be reconstructed after parsing.

Interfaces

Unicode

Tokenization of Unicode code points is delegated to sed, so any character classification your sed knows about should work. This usually works well for the “Basic Multilingual Plane” (BMP), which contains most modern scripts of existing languages. Beware, though, that C restricts which Unicode code points are permitted in identifiers, since it only allows alphanumeric characters. E.g., a Greek alpha, α, is fine, whereas an infinity symbol, ∞, is not permitted.

We also recognize certain Unicode punctuators that represent mathematical operators. Their use should make your source easier to read and also avoid possible ambiguities (for the human reader) in cases where several punctuators are adjacent, or where the same punctuator (such as *) can have very different meanings depending on the context. The following are the sed replacement patterns that are used.

opUtf8="
 #### comparisons
 s@ ≤ @ <= @g
 s@ ≥ @ >= @g
 s@ ≟ @ == @g
 s@ ≡ @ == @g
 s@ ≠ @ != @g
 #### Boolean logic
 # Beware, & on RHS is special for sed
 s@ ¬ @ ! @g
 s@ ‼ @ !! @g
 s@ ∨ @ || @g
 s@ ∧ @ \\&\\& @g
 #### set operations
 # Beware, & on RHS is special for sed
 s@ ∪ @ | @g
 s@ ∩ @ \\& @g
 s@ ∁ @ ~ @g
 s@ ⌫ @ << @g
 s@ ⌦ @ >> @g
 #### arithmetic
 # The second line for division is in fact the Unicode point x2215,
 # which happens to have the same glyph as the usual division character
 # x2F.
 s@ × @ * @g
 s@ ÷ @ / @g
 s@ ∕ @ / @g
 s@ − @ - @g
 #### assignment ops
 # Beware, & on RHS is special for sed
 s@ ∪= @ |= @g
 s@ ∩= @ \\&= @g
 s@ ×= @ *= @g
 s@ ÷= @ /= @g
 s@ ∕= @ /= @g
 s@ ⌫= @ <<= @g
 s@ ⌦= @ >>= @g
 #### syntax specials
 s@ … @ ... @g
 s@ → @ -> @g
 #### attributes
 s@ ⟦ @ [[ @g
 s@ ⟧ @ ]] @g"
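As a minimal sketch of how such patterns behave, the following applies a small subset of them to one line of already space-separated tokens. The variable `ops` and the function `to_ascii` are hypothetical names used only for this illustration; they are not part of the script. Note that each pattern expects a space on both sides of the punctuator, and that `\\&` inside double quotes reaches sed as `\&`, a literal ampersand.

```shell
#!/bin/sh
# Hypothetical subset of the replacement patterns, written the same way
# as in opUtf8 (double quotes, so \\& becomes \& for sed).
ops="
 s@ ≤ @ <= @g
 s@ ∧ @ \\&\\& @g
 s@ → @ -> @g"

# Hypothetical helper: translate Unicode punctuators on stdin to ASCII.
to_ascii () {
    sed "$ops"
}

# Tokens must already be surrounded by spaces for the patterns to match.
printf '%s\n' ' if ( a ≤ b ∧ p → next ) ' | to_ascii
# prints " if ( a <= b && p -> next ) "
```

A punctuator that is glued to its neighbours, such as `a≤b`, is left untouched by these patterns, which is why they are only useful on a stream whose tokens have already been separated by spaces.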
 

Coding and configuration

The following code is needed to enable the sh-module framework.

SRC="$_" . "${0%%/${0##*/}}/import.sh"
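The nested parameter expansion in that line computes the directory in which the running script lives: `${0##*/}` is the base name of `$0`, and `${0%%/…}` strips that base name, together with the preceding slash, from the end. The sketch below applies the same expansion to an argument instead of `$0`; the function name `dir_of` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: directory part of a path, using the same
# expansion as above with $1 in place of $0.
dir_of () {
    printf '%s\n' "${1%%/${1##*/}}"
}

dir_of /usr/local/bin/tokenize   # prints /usr/local/bin
dir_of ./tokenize                # prints .
```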

Imports

The following sh-modules are imported:

Details

All of this can be tuned by assigning different values to the markers.
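The markers are set with the `${name:-default}` idiom, so a value that is already present in the environment takes precedence over the built-in default. A minimal sketch of that behaviour follows; the variable `demo_mark` and both control characters are assumptions chosen for the illustration, not the script's real defaults.

```shell
#!/bin/sh
# Hypothetical default: the control character SOH (octal 001).
default_mark="$(printf '\001')"

# Case 1: nothing set beforehand, so the default is used.
unset demo_mark
demo_mark="${demo_mark:-$default_mark}"
first="$demo_mark"

# Case 2: the caller chose STX (octal 002) beforehand, and it is kept.
demo_mark="$(printf '\002')"
demo_mark="${demo_mark:-$default_mark}"
second="$demo_mark"
```

After this runs, `first` holds the default and `second` holds the caller's choice, which is the mechanism that lets a marker be overridden by exporting it before the script is started.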

For procedural reasons, each of the following must be a single control character:

export markl="${markl:-}"
export markr="${markr:-}"
export markpr="${markpr:-}"
export markc="${markc:-}"
export markib="${markib:-}"
export markit="${markit:-}"
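One reason for the single-character restriction is that IFS and tools such as tr treat their argument as a set of individual characters, not as a string. The following sketch splits a record on a one-character marker; the marker value (US, octal 037) and the names `sep`, `record` and `count_fields` are assumptions for illustration only.

```shell
#!/bin/sh
sep="$(printf '\037')"              # hypothetical marker: US (unit separator)
record="int${sep}x${sep}=${sep}42"  # four fields joined by the marker

# Hypothetical helper: count the fields of "$1" when split on the
# single character "$2".
count_fields () {
    old_ifs="$IFS"
    IFS="$2"
    set -f          # no pathname expansion while splitting
    set -- $1       # unquoted on purpose: split on IFS
    set +f
    IFS="$old_ifs"
    echo "$#"
}

count_fields "$record" "$sep"               # prints 4
printf '%s\n' "$record" | tr "$sep" ' '     # prints "int x = 42"
```

With a two-character marker, IFS would split on either character separately, so the fields would no longer correspond to the marked tokens.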

The following can be encoded as two control characters each, since they don’t need to appear in character classes or in IFS.

export markp="${markp:-}"
export markplus="${markplus:-}"
export markminus="${markminus:-}"
export marksm="${marksm:-}"
export marksl="${marksl:-}"
export marksr="${marksr:-}"
export markcm="${markcm:-}"
export markcl="${markcl:-}"
export markcr="${markcr:-}"
export markbb="${markbb:-}"
export marksb="${marksb:-}"
export markbt="${markbt:-}"
export markdo="${markdo:-}"
export markha="${markha:-}"

These are part of identifiers, but might also become separators inside numbers in the future. Therefore we replace them with a string of hexadecimal characters.

export markun="${markun:-AbCDeF$$fEdCBa}"
export markCL="${markCL:-}"
export markCR="${markCR:-}"
export markCMP="${markCMP:-}"
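The default for markun embeds `$$`, the process ID of the shell, between two fixed runs of letters, so the replacement string is unlikely to collide with anything in the source. As a sketch only, and assuming that markun stands in for the underscore character (the text above does not say so explicitly), a round trip could look like this. The names `mark`, `hide_underscores` and `show_underscores` are hypothetical.

```shell
#!/bin/sh
# Same shape as the markun default; $$ expands to the shell's process ID.
mark="AbCDeF$$fEdCBa"

# Assumption: the marker replaces "_" while tokens are processed ...
hide_underscores () { sed "s/_/${mark}/g"; }
# ... and is turned back into "_" afterwards.
show_underscores () { sed "s/${mark}/_/g"; }

printf '%s\n' 'my_var_1' | hide_underscores | show_underscores
# prints my_var_1
```

Because the marker consists only of letters and digits, the intermediate text still looks like a single identifier to tools that classify characters.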