tokenize

part of shnell – a source to source compiler enhancement tool

Tokenize the input stream

This script provides means to tokenize a byte stream according to C’s rules about different kinds of tokens:

number: Beware that number tokens are not always proper numbers, but only a superset which is sufficiently distinct from the identifiers. We also recognize the i suffix of complex floating point literals and rewrite this to a portable form.
identifier: All words that start with an alphabetic character or an underscore and continue with such characters or decimal digits.
punctuator: We recognize all C punctuators, plus the bang-bang operator !! and opening and closing double-brackets [[ and ]] for attributes. See punctuators and the Unicode section below for details.
string and character literals: These are encoded and escaped specially such that no blank character is visible inside such a token, and such that there is no direct contact between alphanumeric characters and punctuation.
comment tokens: These are encoded in a similar way, so that later parsing does not match anything inside them.
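
The number and identifier categories above can be approximated with two regular expressions. The following is only a sketch: the helper name classify_word is hypothetical, and the number pattern is the preprocessing-number grammar of C, assumed here as a stand-in for whatever superset tokenize actually uses.

```shell
# Hypothetical helper, not part of tokenize: classify a single word.
# The "number" pattern follows C's preprocessing-number grammar, which is a
# superset of the valid numeric literals (an assumption for this sketch).
classify_word() {
  if printf '%s\n' "$1" | grep -Eq '^\.?[0-9]([eEpP][-+]|[0-9A-Za-z_.])*$'; then
    echo number
  elif printf '%s\n' "$1" | grep -Eq '^[[:alpha:]_][[:alnum:]_]*$'; then
    echo identifier
  else
    echo other
  fi
}

classify_word 0x1p-3    # number
classify_word 12abc     # number (in the superset, although not valid C)
classify_word _tmp1     # identifier
classify_word '->'      # other
```

The second call shows why number tokens are "not always proper numbers": the pattern only has to keep them apart from identifiers.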

After this tokenization, such tokens are surrounded by spaces and a number of markers (composed of control characters, see below), so that other scripts can then easily treat them word by word. This is done in such a way that the spacing of the source can mostly be reconstructed after parsing.

Interfaces

tokenize and untokenize: functions that tokenize the input stream and remove the markers again, respectively.
${sedmark} and ${sedunmark}: temporary files that contain the regular expressions that sed uses to do these two jobs.
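
The idea behind marking and unmarking can be illustrated with a toy round trip. This is a minimal sketch under stated assumptions: toy_mark, toy_unmark and the two control characters are hypothetical stand-ins, and the sketch handles only four punctuators, whereas the real tokenize and untokenize functions cover all token kinds.

```shell
# Hypothetical markers for this sketch only.
ml=$(printf '\001')   # left marker
mr=$(printf '\002')   # right marker

# Surround each of the punctuators ( ) , ; with a marked space on both sides.
toy_mark()   { sed "s/[(),;]/ ${ml}&${mr} /g"; }

# Remove exactly the inserted spaces again; original spaces stay in place.
toy_unmark() { sed "s/ ${ml}//g; s/${mr} //g"; }

printf '%s\n' 'f( a,b );' | toy_mark | toy_unmark   # f( a,b );
```

Because every inserted space sits next to a marker, the unmarking step can tell it apart from a space that was already present in the source.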

Unicode

Tokenization of Unicode code points is delegated to sed, so all character classification your sed knows about should work. Usually this should work well for the “Basic Multilingual Plane” (BMP). Most modern scripts for existing languages fall into that category. Beware, though, that C restricts which Unicode code points are permitted in identifiers, since it only allows alphanumeric characters. For example, a Greek alpha, α, is fine, whereas an infinity symbol, ∞, is not permitted.

We also recognize certain Unicode punctuators that represent mathematical operators. Their use should make your source easier to read and also avoid possible ambiguities (for the human reader) in cases where several punctuators are adjacent or where the same punctuator (such as *) can have very different meanings according to the context. The following are the sed replacement patterns that are used.

opUtf8="
 #### comparisons
 s@ ≤ @ <= @g
 s@ ≥ @ >= @g
 s@ ≟ @ == @g
 s@ ≡ @ == @g
 s@ ≠ @ != @g
 #### Boolean logic
 # Beware, & on RHS is special for sed
 s@ ¬ @ ! @g
 s@ ‼ @ !! @g
 s@ ∨ @ || @g
 s@ ∧ @ \\&\\& @g
 #### set operations
 # Beware, & on RHS is special for sed
 s@ ∪ @ | @g
 s@ ∩ @ \\& @g
 s@ ∁ @ ~ @g
 s@ ⌫ @ << @g
 s@ ⌦ @ >> @g
 #### arithmetic
 # The second line for division is in fact the Unicode point x2215,
 # which happens to have the same glyph as the usual division character
 # x2F.
 s@ × @ * @g
 s@ ÷ @ / @g
 s@ ∕ @ / @g
 s@ − @ - @g
 #### assignment ops
 # Beware, & on RHS is special for sed
 s@ ∪= @ |= @g
 s@ ∩= @ \\&= @g
 s@ ×= @ *= @g
 s@ ÷= @ /= @g
 s@ ∕= @ /= @g
 s@ ⌫= @ <<= @g
 s@ ⌦= @ >>= @g
 #### syntax specials
 s@ … @ ... @g
 s@ → @ -> @g
 #### attributes
 s@ ⟦ @ [[ @g
 s@ ⟧ @ ]] @g"
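
To see the effect of such patterns, a few of them can be applied to a line whose tokens are already separated by spaces. The names demo_rules and demo_ops are hypothetical, and the sketch uses only three of the rules above. It quotes them with single quotes, so a single backslash before & suffices, whereas the double-quoted original needs two.

```shell
# Hypothetical subset of the replacement rules, for illustration only.
demo_rules='s@ ≤ @ <= @g
s@ ∧ @ \&\& @g
s@ → @ -> @g'

# Apply the rules to standard input.
demo_ops() { sed "$demo_rules"; }

printf '%s\n' 'if ( a ≤ b ∧ p → x ) f();' | demo_ops
# prints: if ( a <= b && p -> x ) f();
```

The patterns require a space on both sides of the operator, so they only fire on input in which the operators already stand as separate words.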

Coding and configuration

The following code is needed to enable the sh-module framework.

SRC="$_" . "${0%%/${0##*/}}/import.sh"
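
The nested parameter expansion in that line computes the directory part of $0 without calling an external tool. The following sketch isolates just this step; the function script_dir and the sample paths are hypothetical and only serve to show the expansion at work.

```shell
# Hypothetical helper: strip the last path component the same way the
# expansion ${0%%/${0##*/}} does for $0.
script_dir() {
  p=$1
  # ${p##*/} is the file name; removing "/<file name>" leaves the directory.
  printf '%s\n' "${p%%/${p##*/}}"
}

script_dir /usr/local/bin/tokenize   # /usr/local/bin
script_dir ./tokenize                # .
```

The result is then used as the location from which import.sh is sourced.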

Imports

The following sh-modules are imported:

Details

All of this can be tuned by assigning different values to the markers.

For procedural reasons, each of the following must be a single control character:

Markers for inserted spaces surrounding tokens, left or right

export markl="${markl:-}"
export markr="${markr:-}"

for character and string prefixes

export markpr="${markpr:-}"

A marker for a double colon. If you leave this alone, the only effect is that spaces around :: (as in [[ gnu :: deprecated]]) will be removed.

export markc="${markc:-}"

for internal blanks

export markib="${markib:-}"

for internal tabs

export markit="${markit:-}"
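
Since every marker is assigned with the ${name:-default} form, a value that is already set in the environment takes precedence over the built-in default. The sketch below shows that pattern together with a check for the single-control-character requirement. All names in it (demo_default, demo_mark, check_marker) and the chosen character are hypothetical; tokenize sets its own defaults.

```shell
# Hypothetical default for illustration; tokenize sets its own defaults.
demo_default=$(printf '\001')

# Keep a value inherited from the environment, else fall back to the default.
demo_mark="${demo_mark:-$demo_default}"

# Hypothetical helper: report whether the argument is exactly one
# control character.
check_marker() {
  case $1 in
    [[:cntrl:]]) echo ok ;;
    *)           echo bad ;;
  esac
}

check_marker "$demo_mark"   # ok, provided demo_mark was not preset otherwise
check_marker "ab"           # bad
```

A check of this kind could be run before overriding one of the markers above, since they have to fit into character classes and IFS.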

The following can now be encoded as two control characters; they don’t need to appear in character classes or IFS.

Transient markers for periods, plus and minus within a number. These are removed at the end of tokenization, so you probably don’t want to mess with this.

export markp="${markp:-}"
export markplus="${markplus:-}"
export markminus="${markminus:-}"

string literal markers, mid, left and right

export marksm="${marksm:-}"
export marksl="${marksl:-}"
export marksr="${marksr:-}"

character literal markers, mid, left and right

export markcm="${markcm:-}"
export markcl="${markcl:-}"
export markcr="${markcr:-}"

for double backslash

export markbb="${markbb:-}"

for single backslash

export marksb="${marksb:-}"

for a backtick

export markbt="${markbt:-}"

for a dollar

export markdo="${markdo:-}"

for a hash

export markha="${markha:-}"

for an underscore

Underscores are part of identifiers, but might also become separators inside numbers in the future. We therefore replace them with a sequence of hexadecimal digit characters.

export markun="${markun:-AbCDeF$$fEdCBa}"
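
The default value embeds the process ID ($$) between two runs of hexadecimal letters, so the replacement consists only of characters that can also occur inside a number. How tokenize applies the marker is not shown here; the following is a minimal sketch with hypothetical names (mark_un, hide_underscores, show_underscores) of replacing underscores and restoring them.

```shell
# Same shape as the default above; $$ expands to the ID of the current shell.
mark_un="AbCDeF$$fEdCBa"

# Replace every underscore by the marker, and back again.
hide_underscores() { sed "s/_/${mark_un}/g"; }
show_underscores() { sed "s/${mark_un}/_/g"; }

printf '%s\n' 'size_t my_var = 1_000;' | hide_underscores | show_underscores
# prints: size_t my_var = 1_000;
```

The round trip is only safe as long as the marker sequence does not occur in the source itself, which the embedded process ID makes unlikely.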

comment markers left, right, //-encoding

export markCL="${markCL:-}"
export markCR="${markCR:-}"
export markCMP="${markCMP:-}"