part of shnell – a source to source compiler enhancement tool
© Jens Gustedt, 2019
Tokenize the input stream
This script provides the means to tokenize a byte stream according to C’s rules about the different kinds of tokens:
- number: Beware that number tokens are not always proper numbers, but only a superset that is sufficiently distinct from the identifiers. We also recognize the i suffix of complex floating point literals and rewrite it to a portable form.
- identifier: All words that start with an alphabetic character or underscore and continue with the same or with decimal digits.
- punctuator: We recognize all C punctuators, plus the bang-bang operator !! and the opening and closing double brackets [[ and ]] for attributes. See punctuators and the Unicode section below for details.
- string and character literals: These are encoded and escaped specially, such that no blank character is visible inside such a token and such that there is no direct contact between alphanumeric characters and punctuation.
- comment tokens: These are encoded in a similar way, such that no parsing will match them.
After this tokenization, such tokens are surrounded by spaces and by a number of markers (composed of control characters, see below), such that other scripts can then easily treat them on a word basis. This is done in such a way that the spacing of the source can mostly be reconstructed after parsing.
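The reason for hiding blanks inside literals can be illustrated with a minimal sketch. It is not the script's own implementation: the marker value (the unit separator, octal 037) and all `demo_` names are assumptions made for this example only. It shows that a string literal with hidden blanks stays a single word under the shell's word splitting, and that the blanks can be restored afterwards.

```shell
# Hypothetical marker for internal blanks: the unit separator, octal 037.
# The real marker is whatever the script's configuration assigns.
demo_ib="$(printf '\037')"

# Hide the blanks inside one literal, so that it survives word splitting.
demo_hide_blanks () {
    printf '%s\n' "$1" | tr ' ' "${demo_ib}"
}

# Restore the blanks after processing.
demo_show_blanks () {
    printf '%s\n' "$1" | tr "${demo_ib}" ' '
}

# Count the words of a line; the unquoted $1 is split on blanks on purpose.
demo_count_words () {
    set -- $1
    echo "$#"
}

demo_count_words '"hello wide world"'                          # 3 words
demo_count_words "$(demo_hide_blanks '"hello wide world"')"    # 1 word
```

Because the marker is a control character that does not normally occur in C source, the round trip through `demo_hide_blanks` and `demo_show_blanks` gives back the original text.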
Interfaces
- tokenize and untokenize, which are functions that perform the tokenization of the input stream and remove the markers, respectively.
- ${sedmark} and ${sedunmark}, which are temporary files that contain regular expressions for sed to do the job.
Unicode
Tokenization of Unicode code points is delegated to sed, so all character classification that your sed knows about should work. Usually this works well for the “Basic Multilingual Plane” (BMP). Most modern scripts for existing languages fall into that category. Beware, though, that C restricts which Unicode code points are permitted in identifiers, since it only allows alphanumeric characters. E.g., a Greek alpha, α, is fine, whereas an infinity symbol, ∞, is not permitted.
We also recognize certain Unicode punctuators that represent mathematical operators. Their use should make your source easier to read and also avoid possible ambiguities (for the human reader) in cases where several punctuators are adjacent or where the same punctuator (such as *) can have very different meanings according to the context. The following are the sed replacement patterns that are used.
opUtf8="
#### comparisons
s@ ≤ @ <= @g
s@ ≥ @ >= @g
s@ ≟ @ == @g
s@ ≡ @ == @g
s@ ≠ @ != @g
#### Boolean logic
# Beware, & on RHS is special for sed
s@ ¬ @ ! @g
s@ ‼ @ !! @g
s@ ∨ @ || @g
s@ ∧ @ \\&\\& @g
#### set operations
# Beware, & on RHS is special for sed
s@ ∪ @ | @g
s@ ∩ @ \\& @g
s@ ∁ @ ~ @g
s@ ⌫ @ << @g
s@ ⌦ @ >> @g
#### arithmetic
# The second line for division is in fact the Unicode point x2215,
# which happens to have the same glyph as the usual division character
# x2F.
s@ × @ * @g
s@ ÷ @ / @g
s@ ∕ @ / @g
s@ − @ - @g
#### assignment ops
# Beware, & on RHS is special for sed
s@ ∪= @ |= @g
s@ ∩= @ \\&= @g
s@ ×= @ *= @g
s@ ÷= @ /= @g
s@ ∕= @ /= @g
s@ ⌫= @ <<= @g
s@ ⌦= @ >>= @g
#### syntax specials
s@ … @ ... @g
s@ → @ -> @g
#### attributes
s@ ⟦ @ [[ @g
s@ ⟧ @ ]] @g"
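As a minimal sketch of how such patterns behave, the following applies a small subset of them with sed. The names `demo_ops` and `demo_unicode_ops` are hypothetical and not part of the script. Note that each pattern only matches an operator that has a space on both sides, which is the situation after tokenization.

```shell
# Hypothetical subset of the replacement patterns, for illustration only.
# Inside double quotes, \\& reaches sed as \&, that is, a literal ampersand.
demo_ops="
s@ ≤ @ <= @g
s@ ≠ @ != @g
s@ ∧ @ \\&\\& @g"

# Apply the patterns to standard input.
demo_unicode_ops () {
    sed -e "${demo_ops}"
}

printf '%s\n' 'if ( a ≤ b ∧ b ≠ c ) f();' | demo_unicode_ops
# prints: if ( a <= b && b != c ) f();
```

An operator that directly touches its operands, as in `a≤b`, is left unchanged by these patterns.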
Coding and configuration
The following code is needed to enable the sh-module framework.
SRC="$_" . "${0%%/${0##*/}}/import.sh"
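The nested parameter expansion in that line computes the directory of the running script, so that import.sh is found next to it. A minimal sketch of the same expansion, applied to an ordinary argument instead of $0 (the function name `demo_script_dir` is hypothetical):

```shell
# Print the directory part of a path, using the same expansions as above.
demo_script_dir () {
    path="$1"
    # ${path##*/} removes everything up to the last slash: the file name.
    base="${path##*/}"
    # ${path%%/${base}} then removes the trailing "/<file name>".
    printf '%s\n' "${path%%/${base}}"
}

demo_script_dir /usr/local/bin/tokenize.sh
# prints: /usr/local/bin
```

If the path contains no slash at all, nothing is removed and the name itself is printed.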
Imports
The following sh-modules are imported:
Details
All of this can be tuned by assigning different values to the markers.
For procedural reasons the following must be a single control character, each:
- Markers for inserted spaces surrounding tokens, left or right
export markl="${markl:-}"
export markr="${markr:-}"
- for character and string prefixes
export markpr="${markpr:-}"
- A marker for a double colon. If you leave this alone, the only effect is that spaces around :: (as in [[ gnu :: deprecated]]) will be removed
export markc="${markc:-}"
- for internal blanks
export markib="${markib:-}"
- for internal tabs
export markit="${markit:-}"
The following can be encoded as two control characters; they don’t need to appear in character classes or in IFS.
- Transient markers for periods, plus and minus within a number. These are removed at the end of tokenization, so you probably don’t want to mess with this.
export markp="${markp:-}"
export markplus="${markplus:-}"
export markminus="${markminus:-}"
- string literal markers, mid, left and right
export marksm="${marksm:-}"
export marksl="${marksl:-}"
export marksr="${marksr:-}"
- character literal markers, mid, left and right
export markcm="${markcm:-}"
export markcl="${markcl:-}"
export markcr="${markcr:-}"
- for double backslash
export markbb="${markbb:-}"
- for single backslash
export marksb="${marksb:-}"
- for a backtick
export markbt="${markbt:-}"
- for a dollar
export markdo="${markdo:-}"
- for a hash
export markha="${markha:-}"
- for an underscore
These are part of identifiers, but might also be separators inside numbers in the future. Therefore we use hex characters to replace them.
export markun="${markun:-AbCDeF$$fEdCBa}"
- comment markers left, right, //-encoding
export markCL="${markCL:-}"
export markCR="${markCR:-}"
export markCMP="${markCMP:-}"
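All of the assignments above use the ${name:-default} idiom, so a value that is already set in the environment takes precedence over the default. The following minimal sketch shows this with a hypothetical variable `demo_mark`, which is not used by the script. It borrows the default of the underscore marker, in which $$ expands to the process ID of the shell.

```shell
# Same idiom as the marker assignments, with a hypothetical variable name.
demo_init_marker () {
    export demo_mark="${demo_mark:-AbCDeF$$fEdCBa}"
}

# Without a prior value, the default is used; $$ is the shell's process ID.
unset demo_mark
demo_init_marker
printf '%s\n' "${demo_mark}"

# A value assigned beforehand is kept unchanged.
demo_mark="XyZ"
demo_init_marker
printf '%s\n' "${demo_mark}"
# prints: XyZ
```

Since the `:-` form also treats an empty value as unset, a marker cannot be overridden with the empty string in this way.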