There is a wide range of extensions to the preprocessor that eĿlipsis provides. Available for all languages are
There are also some that are specific to C that instrument some punctuators, namely local unique identifiers with __LOC_NEW and __LOC, and the defer feature.
A directive is what in C would be the commands to the preprocessor that are introduced with a #
character at the beginning of a line, such as in
EĿlipsis allows to extend this concept to other programming or text-processing languages by providing a syntax that replaces the #
character that is used in C and that is then more adapted to the respective context. Nevertheless, in the following we mostly use C syntax for our examples and talk of directives with their C names, e.g #define.
An overview of the directives that are implemented by eĿlipsis is aggregated in a pseudo source file directives.c. Among the extensions are directives that
These new directives are probably best observed in action, see below.
C compilers and preprocessors generally have comand line options to add directories to their include search paths. EL' lipsis has an extension that directly works with the #include directive, by using file names that end with a slash.
With this feature configuration of the preprocessor may be simply kept in a header file for the respective language.
C's syntax using <
and >
for includes is rather weird and does not fit well into the rest of the language concerning tokenization. After all, these are operators, and in particular it is not easy to comprehend if and when arguments to the #include directive are expanded. Also, since other languages such as html even use these punctuators themselves as brackets, we cannot generalize their use.
Therefor we add syntax to avoid part of the problem: we have ⟨
and ⟩
(MATHEMATICAL LEFT and RIGHT ANGLE BRACKET) to take that role for all languages.
In contexts where include file names are expanded or where other complex preprocessor features are used it can be convenient to have empty strings (""
and ⟨ ⟩
(or <>
for c)). They have the same effect as including an empty file.
These attributes are borrowed from C23's #embed directive. Prefix and suffix are such that they execute their contents as directives before or after the inclusion of the source file. This is particularly interesting for the #bind directive
this defines foo
for the duration of the inclusion to the indicated value and then undefines it at the end.
bind
or foo
exe already defined as macros weird things might happen. Use the #include_source directive, instead, where possible.In a similar approach we also provide the length attribute. it accounts for offsets in lines, and not in bytes as for #embed.
For timing reasons this #embed attribute had not been included into C23. We also introduce it to #include where, as for length, it accounts for offsets in lines, and not in bytes.
The virtual include fine ellipsis-builtins.dirs has the description of builtins that are provided as macro interfaces. In general, they have a behavior that is a bit different from "normal" macros, for example those that are function-like do not expand their argument list but treat that list directly, or they refer to external state such as times or a counter.
The include file ellipsis-predefined.dirs contains several macro definitions that provide features that can be implemented directly as macros in eĿlipsis, but which you probably don't want to implement for each project from scratch.
The directory The include directory contains several headers that provide more sophisticated features, in particular list processing with ellipsis-foreach.dirs and iteration with ellipsis-do.dirs. All of these are independent of the language that is processed.
Internally, eĿlipsis encodes most punctuators that are used for operators with a single code point. That concerns those punctuators that C has as digraphs with the appropriate Unicode code point, e.g. the ugly <=
becomes ≤
. As an extension, eĿlipsis accepts these operators in all languages and in particular you may use them in preprocessor expressions.
When the input tokens are filtered by a C-like preprocessor nominals (called identifiers in C) that correspond to the name of a macro are expanded. C only has two exceptions from this:
ID
, other occurences of that same name ID
are not expanded. So preprocessing has no normal recursion. (But see Tail recursion, below.)ID
of a functional macro (that is a macro with a parameter list) occurs and that name is not followed by an opening parenthesis (possibly preceded by whitespace) then this token is not expanded, either.The lack of the possibility to protect other nominals can result in a lot of irritation in contexts where names of identifiers are glued together by means of the token join operator ⨝
(or ##
in C). You'd see things such as the following quite commonly in preprocessor oriented C headers:
The user then expects the following
to expand to
But in fact it wouldn't always. If this code is located in a header and another header is included that would define hurlipurtz
to expand to schnurtz
, the above function declaration would read.
not at all what was intended. Obviously for the given example this is highly unlikely but the smaller the name components that are used get, the higher the probability of collision. For example a lot of code would use the token _
to glue together two name components, so whenever some smart colleague thinks they should overload that token with something funny, you are in trouble.
To protect against this effect, as an extension, eĿlipsis adds a third method for identifier protection, the keep and peek brackets. Rewritten with these, the above definition would look
That is, here the funny brackets (BOTTOM LEFT HALF BRACKET and BOTTOM RIGHT HALF BRACKET) protect the nominal hurlipurtz
to be expanded when MYFUNC
is used. If you are programming in C and are allergic to Unicode characters you could also use the digraphs <|
and |>
that replace them.
These brackets have the following properties
The latter can be in particular be useful for long macro definition that go over multiple lines
or equivalently
or even
Such a definition ensures that users text editor basically sees a function, and thus code indentation and highlighting should work as expected. The nested used of keep and peek brackets then also guarantees that the local identifiers a
, b
and tmp
are not accidentally overwritten during macro expansion. So using this macro as SWAP(fine, well)
would expand to
For the original C and C++ preprocessors the only way to change the internal state are directives:
As already seen above, eĿlipsis adds several directives that are variants of the above, and that also change the internal state.
With the addition of the __COUNTER__ macro, gcc has changed the game. Indeed, this macro increments a counter at each invocation, and thus changes the preprocessor state at each time.
Our builtins __INCREMENT__, __DECREMENT__, __INSTANT__ and __CLEAR__ move this concept on a higher level, namely
0
.0
if it has not been set before and then perform their specific action. For example, you'd use __INSTANT__ to query the current value or set it to 0
if it had not been set before. In contrast to that __CLEAR__ will erase a current value, if it exist, and set and return 0
unconditionally.These features are used in several header files, for general use or specialized for use in C. Most work by using counters to construct hierarchic unique identifiers for which the generation pattern may follow syntactic properties of the target language, see ellipsis-unique.dirs, ellipsis-defer.h or ellipsis-trigger.h.
We instrument the following punctuators for C because they have key roles in the grammatical structure of the language:
punctuator | roles | nesting level |
---|---|---|
{ | start of compound statement | __BRACE_LEVEL__ |
start of struc or union | ||
start of initializer | ||
} | end of compound statement | |
end of struc or union | ||
end of initializer | ||
[ | start of index | __BRACKET_LEVEL__ |
start of designator | ||
start of array length | ||
] | end of index | |
end of designator | ||
end of array length | ||
= | start of initializer | |
assignment | ||
; | end of statement | |
end of declaration | ||
separation for for loops | ||
[[ or ⟦ | start of attribute | __ATTR_LEVEL__, __BRACKET_LEVEL__ |
]] or ⟧ | end of attribute | |
:: | attribute separator | |
name composition |
This instrumentation is achieved by making these tokens "nominals" that can then be defined as macros. The default macro for each of these resolves to the punctuator itself and also maintains counters for the nesting level where indicated.
Note that __ATTR_LEVEL__ should only be 0
or 1
. Also, attributes increment and decrement the level for brackets, such that we know if a ]]
actually closes two array subscripts or if it closes an attribute.
The result of a ::
token will be different according to context: if __ATTR_LEVEL__ is 0 it serves for name composition (generally resulting in a replacement by a ‿
) and for [[prefix::suffix]]
vendor attributes otherwise.
You cannot undef these builtin macros, don't even try. But you could, in principle, add features to them:
This would add the printf
call after each occurrence of a {
in the source, even for struct
definitions or initializers. So this is probably not what you want.
We currently use this for two C extension that otherwise would not easily be implementable in the preprocessor:
C has not really an identifier hierarchy and so it is easy to shut yourself by defining macros that interact badly with some code that you don't control. The header ellipsis-loc.h provides tools to circumvent this problem by creating unique identifiers for the whole translation unit that can be referred to locally.
As an example take a function that needs to name a parameter because it is used as an array length in other parameter declarations. With __LOC_NEW and __LOC this becomes as simple as in
Here the preprocessor then chooses a unique name for __LOC_NEW that then is used two times by __LOC(). Which particular identifier is chosen is not our concern, we just have the guaranty that it doesn't interact with other identifiers in the same unit.
There is a proposal for an extension of the C language called defer, that is prospected to either be published in a technical specification or the next revision, C2Y, of the C standard:
__attribute__((cleanup(...)))
Through defer
, r1Our implementation follows the simplest version that is described here, only some additional restrictions apply, see below.
This feature is motivated by code patterns as the following, which use goto
jumps to ensure that certain cleanup code snippets are executed in the correct order and under the right circumstances:
Such spaghetti patterns are quite commonly used in C because they currently provide the only systematic tool to do such cleanup. The pattern has several disadvantages
The defer feature avoids all of these. Using this inside a compound statement indicates that that code is to be executed at the end of that compound statement:
By using hierarchic identifiers, we manage to implement the static aspects of this feature, namely to suspend the execution of specific code snippets until the end of the surrounding block.
The semantics of the example are as if we had written something similar to the following.
Note that the goto DEFER3
statement in the before last line of the compound statement starts a chain that walks backwards to the code snippets that had been introduced by the defer clauses. These clauses are protected from being executed when we pass through the block by hiding them inside if (false)
statements. By such a trick the defer code is only executed when jumped to directly.
Evidently, the code that is really produced when you use this feature is a bit more complicated, but I hope you get the idea.
The example above does not allow for us to jump into or out of the compound statement; only a normal return
statement would guarantee to execute all deferred statements.
The other features that allow to finish a particular construct early are break
and continue
. With this implementation in the preprocessor we are only able to count nested compound statements reliably; other nested statements such as towers of if
, switch
and for
are not so easy to detect. Overall we impose the following restrictions.
defer
implementation, all loop and switch bodies must be compound statements. The implementation tries to enforce this. You'll get a lot of warnings for code that doesn't stick to this rule.return
correctly in macros we have to ensure that the return value is collected in some local variable before we unwind all the deferred blocks. To do so, our macros need the return type of the function which unfortunately is not available directly in C. The purpose of the invocation of DEFER_TYPE(int)
is to provide that information to the macros. The return macro is programmed in such a way that it only changes the semantic to the defer-semantic when there are indeed defer statements; if not, a plain old return
remains in the replacement code after preprocessing. If you want to emphasize that a void
function uses defer, the macro DEFER_TYPE
may be used with an empty argument list.defer
into the corresponding anchor by means of jump statements are prohibited. The detection uses the compiler's support for variably modified types to avoid such jumps.break
, continue
and return
are captured at compile time. Others, goto
, longjmp
or exit
and similar are in general not possible without jeopardizing the cleanup mechanism.As an exception of the latter, the very first defer in a function: a non-regular exit or jump from there cannot jump over another deferred statement. For example the following is valid
With this code a termination of main
by a return will not necessarily terminate the execution but only the thread that corresponds to main
. Here the macro defer_return_value provides a read-only access to the return value that had been collected at the return statement. Similarly to the example above that first defer may contain calls to exit
, quick_exit
, long_jmp
or other functions that transfer control out of the function.
EĿlipsis provides access to some of the state of preprocessing, in particular it traces the include hierarchy for nested includes. There is one primitive, __INCLUDE_DEPTH__, that allows to track this, and several other builtins that are derived from it.
The top-level source file inherits a "project name" from the environment variable __PROJECT__, that, if set, should be an identifier (such as ellipsis
) or a list of identifiers separated by a ∷
token (such as ellipsis∷token
). Other files that are then subsequently included can use this name to define a __UNIT__ name (most of the time abbreviated as ¤, CURRENCY SIGN, U00A4), for example by appending a new component. If the project is ellipsis∷token
the #unit directive
would place the code inside the ellipsis∷token∷array
unit. The punctuator ∷
(PROPORTION, U2237) is reserved by eĿlipsis for this purpose. According to the parsed language, it might map to another punctuator.
The current unit name is then available as __UNIT__ and the currency character ¤
is available as a shortcut. For example ¤∷grow
would then refer to the function ellipsis∷token∷array∷grow
.
If the unit feature is used as above to define the unit name (derived with ¤¤∷something
, the __PARENT_UNIT__ and ¤¤ still refer to the unit from which the current unit is deduced. For example in the ellipsis∷token∷array
unit the PARENT_UNIT macro refers to ellipsis∷token
and ¤¤∷cat
would be ellipsis∷token∷cat
.
An example for this mechanism is the builtin header file ellipsis-move.h. It sets the unit name to the one of its includer with the function name move
appended:
Then the inline
function definition uses the macro ELLIPSIS_MOVE
to move the contents of the second pointer to the first. If the first already was pointing to some object that object is deleted by using the function ¤¤∷delete
which is assumed to exist.
Seen in the context of the ellipsis‿str8 type, expanded function interface looks like ellipsis‿str8::ellipsis‿str8‿move
Another way of achieving the same can be seen here:
Instead of setting the current unit to the new name, we just inherit the parent name; during the include of this header ¤ is set to ¤¤. Indeed, the result is very similar
In C, the token ::
has an ambiguous role, because it represents name composition most naturally in a similar way that someone coming from C++ would expect. So, in most contexts the token ::
is equivalent to ∷
, such that you may choose to input the feature the way you prefer. The only context where it is not is within attributes (in the form of [[something]]
or ⟦something⟧
). Here we have to be more careful: we can't mangle names such as [[gnu::something]]
because we have to transfer them as such to the real C compiler at the end.
In the context where it is understood as name composer, the ∷
punctuator token is simply replaced by a ‿
(UNDERTIE, U203F) character; the resulting combination of identifers with such characters are then valid identifiers for C23 and can be used by compilers as such. For example the above function ellipsis∷token∷array∷grow
(or ellipsis::token::array::grow
) appears as ellipsis‿token‿array‿grow
in intermediate output and for the linker. This is possible because the undertie character is allowed for identifiers, but fortunately yet rarely used by applications.
There is some support for ordinary macro recursion in eĿlipsis, provided by the header ellipsis-recursive.dirs. This only works for direct recursion (a macro calling itself directly) and is also limited to a small recursion depth. This is so, because of the expansion rules of preprocessing (in particular for the __VA_ARGS__
pseudo-parameter) such a recursive macro quickly blows up to have quadratic complexity, which is not the right thing to do for text processing.
Instead we provide a mechanism to program tail recursion directly, without having to name or expand __VA_ARGS__
at any point. This works with a mechanism that is similar to C23's __VA_OPT__
, a pseudo parameter that is only active if the macro has been called with a non-empty variable argument list. The syntax is
where IDENTIFIER
is the name of a macro to be invoked. This name can be omitted, and then the call refers to the current macro, instead.
results in
Here the first part of the definition, X;
, appends a semicolon to the first argument. The construct __VA_TAIL__(MAKE_SEMI)
then calls the same macro again with the remaining argument list, only that for doing so, that argument list is not expanded again, but just passed through.
To simplify writing such a macro and to possibly accelerate its expansion a bit, the name of the macro can be omitted. The construct then refers to the macro where it is found. So an equivalent definition of the above could be:
where the redundant information of the name is omitted from the tail call.
results in the three lines
This is a bit more complicated. First, if this macro is called with just one argument, the whole part within the __VA_OPT__
construct is not used; the definition is then as if it were just "(X)"
and so the first invocation results in "(A)"
.
Now, if there is an additional argument, this is as if it had been defined (X+__VA_TAIL__(MAKE_PAREN))
. So there is prefix "(X+"
before the tail call and a suffix ")"
after it. The result is then nicely appending the prefixes to the left (three times with arguments A
, B
and C
) and stacking the suffixes to the right, here resulting in a sequence of closing parenthesis.
results in the three lines
The identifier that appears in the __VA_TAIL__
construct needs not to be the same as the name of the macro that is defined. For example here are two macros LEFT
and RIGHT
that call each other.
The result is a ping pong pattern where the odd numbers are pushed to the left and the even numbers to the right.
The __VA_TAIL__
construct is handled as the before last step of macro replacement, namely after all arguments have been expanded, but before the final expansion of the overall result.
Thus the following where the parameter F
of PARENS
is used as the argument to __VA_TAIL__
is valid and results to an invocation of whatever name is found after parameters have been expanded.
This leads to:
Note that the effect is to apply REVERT
to "1, 2, 3"
twice, so the result is the same tokens in the initial order.
A macro that contains this construct has to fulfill the following requirements:
__VA_TAIL__
construct must only appear once.Also it is recommended that
__VA_TAIL__
construct and the use of __VA_ARGS__
are exclusive.This is because this would lead to an expansion of __VA_ARGS__
at every recursion level, thus quadratic complexity.
The easiest way to imagine what is going on with this feature is to imagine it as follows.
__VA_TAIL__
nominal is encountered during that expansion, it is passed through.__VA_TAIL__(INDENTIFIER)
are replaced as if IDENTIFIER(__VA_ARGS__)
is invoked, only that the __VA_ARGS__
and the result are not expanded.__VA_TAIL__(IDENTIFIER)
had been found.So if there are several nested tail calls the result is a list of prefixes for each argument, followed by a reverted list of suffixes, also possibly depending on the arguments.
This is not an extension as such, in principle even in C as of today, an include file may include itself. Only that it is quite tedious to limit the recursion depth; see this blog post for a proof of concept.
EĿlipsis changes this because it allows to keep track of state and also to do arithmetic that changes its state. For example the generic includes ellipsis-foreach.dirs and ellipsis-do.dirs use this to provide relatively efficient iteration features.
Consider the Fibonacci.h example:
You should be able to process this with something
and in a blink of an eye this should print out a code snippet with the firs 94 Fibonacci numbers, similar to the following.
This header uses two #include_directives to include itself recursively, namely by indicating __FILE__ as the include file.
FIB0
, FIB1
, FIB2
and FIBONACCI
. After that first inclusion, four lines of C that invoke the macro FIBONACCI()
produce the output that we have seen above.During the setup phase we define the macros that are needed during the subsequent recursive calls. All four macros use
()
; otherwise the identifier is left as is in the token sequence.The macros are set to the defining two values of the Fibonacci sequence, namely 0U
and 1U
for FIB0()
and FIB1()
, respectively. FIBONACCI()
will (essentially) hold the currently known sequence, so the initial value is 0U, 1U
, plus some weird comments to beautify the output a bit (see operators above).
Then the source calls itself recursively. Therefore include_directives is prefixed with #expand to make sure that the __FILE__ macro is expanded to the corresponding string.
Note that when returning from that first include_directives directive, FIBONACCI()
is used twice. First, it is used within __NARGS__ to compute the number of elements in the list (for me this results in the literal 94
) and then to insert the whole list into the output.
When terminating the #ifndef clause, all macros that had been bound with #bind are automatically undefined, so the macro space of the source that includes this recursive one is not polluted.
First, a termination of the recursion is ensured by testing
this checks if the next value FIB1() + FIB0()
would overflow. Since we don't actually want the computation to overflow, ever, we subtract the smaller value FIB0()
from the maximal value that eĿlipsis can handle. Note that 1U
is an unsigned 1
and -1U
, the negative of that, seen as an unsigned, is a very big number. (The preprocessor only has one type of signed and unsigned numbers each, corresponding to the types intmax_t
and uintmax_t
in C.)
If FIB1()
is still less than the bound that eĿlipsis can handle, we do the processing; if it is not, nothing is done and the inclusion of the source ends.
The important computational part of the recursion #bind's FIB2()
to a new value, namely FIB0() + FIB1()
. To ensure that the value is really computed and stored this uses
to expand the sum to a decimal unsigned integer literal. Again the #expand prefix to #bind ensures that all of this is expanded in place and only assigned to FIB2()
thereafter. As written
would already expand FIB2
due to the following parenthesis; therefore FIB2
is #undef on the preceding line, such that the #bind directive (after expansion) only sees the FIB2
identifier and not a macro.
Then, the expanded #gather directive appends a comma, some comments and the contents of FIB2()
that we just computed to the FIBONACCI
macro. Follows the update of the FIB0()
and FIB1()
macros to the new values and down we go into the rabbit hole.
Note that this is a method to compute the Fibonacci numbers with linear cost. In contrast to that, the usual recursion that is often taken as an example in introductory courses uses double recursion by calling itself recursively twice, once for the LHS of the plus and a second time for the RHS.
Here, we only have a single trailing recursion. Because the overhead on each recursion level is constant, the overall cost is linear in the number of Fibonacci numbers that is produced.
So the disadvantage that we still have with eĿlipsis is that such a loop needs recursion (here a depth of 92 on my 64 bit machine). On the other hand each recursion level is relatively cheap, so a reasonable number of iterations (up to some hundred or even thousand) can be easily handled.