eĿlipsis
a language independent preprocessor
 
The implementation

One of the goals is to show that modern C can be used to efficiently implement a tool such as a preprocessor without compromising much on safety. Efficient is meant here in a relatively broad sense:

  • The development was quite fast; time was probably mostly limited by my own pedantry.
  • EĿlipsis compiles in 21 seconds on my laptop (30 kLOC, with full optimization and static analysis included).
  • It then processes the type-generic code to generate parts of its own source code in about 3 seconds.

Moving to a parallel build improves this to 7 and 1 – 2 seconds, respectively.

Unicode everywhere

EĿlipsis uses Unicode as its base and provides means to use real Unicode more widely even in the source code that it processes.

  • The multi-byte input and output encodings are supposed to be UTF-8. If your platform needs something else, you'd have to transcode your source on input or output. There are a lot of tools out there that do that well. Modern C is still not well equipped to handle this consistently otherwise. Please get rid of other encodings, they are clearly outdated nowadays, and we have other fish to fry than running after legacy encodings.
  • The internal encoding is UTF-32, and all argumentation or recognition of special features is based on that. In particular, we use the properties of Unicode characters such as being white space, parts of identifiers, punctuators and so on.
  • Source and resource file names for #include and #embed, respectively, are supposed to be UTF-8 encoded.
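
The step from UTF-8 input to the internal UTF-32 representation can be sketched as follows. This is a minimal illustration and not the eĿlipsis decoder: the name utf8_decode and its interface are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the n >= 1 bytes at s.
 * Returns the number of bytes consumed, or 0 if the input is truncated,
 * overlong, a surrogate or otherwise not valid UTF-8. */
static size_t utf8_decode(size_t n, unsigned char const s[static 1],
                          uint_least32_t cp[static 1]) {
    unsigned char b = s[0];
    size_t len;
    uint_least32_t c;
    if (b < 0x80)      { len = 1; c = b; }
    else if (b < 0xC2) return 0;              /* continuation byte or overlong lead */
    else if (b < 0xE0) { len = 2; c = b & 0x1F; }
    else if (b < 0xF0) { len = 3; c = b & 0x0F; }
    else if (b < 0xF5) { len = 4; c = b & 0x07; }
    else return 0;
    if (n < len) return 0;                    /* truncated sequence */
    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if ((len == 3 && c < 0x800) || (len == 4 && c < 0x10000)) return 0;
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}
```

Once the input is available as UTF-32 code points, questions such as "is this white space" or "may this start an identifier" become simple lookups per code point.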

Based on that, we allow a use of Unicode that is more comfortable and aligned with use in languages in general. For example we use normal characters for features that are represented in Unicode such as ≤, ∨, ¬, → or …. Once you get used to these, you'll probably ask how you were able to tolerate the crude digraph replacements <=, ||, !, -> or ... for so long.

If you are still living in the last millennium and don't know how to configure your keyboard to work comfortably with such characters (an argument that I hear more often than you think) you may still use the old digraphs. Nobody will be forced.

Also in our own source we use characters such as π or φ to name features that are usually noted with these characters in math.

When producing intermediate C sources that are to be compiled by traditional compilers, we translate the punctuators as mentioned above back to the digraphs. We don't do a similar translation of code points that appear in identifiers, though; we suppose that compilers that support C23 are able to deal with these in one way or another. It would probably not be difficult to add an option to eĿlipsis that would translate these to the ugly \u or \U forms (such as \u03c0 for π), though. If this tempts you as a side project, be my guest.
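
For those tempted by that side project, the formatting step could look roughly like this. It is only a sketch under assumptions: ucn_format is a hypothetical name, nothing like it is claimed to exist in eĿlipsis, and the check whether a code point is permitted in an identifier is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the universal character name of cp into buf,
 * which has room for size characters. Returns the number of characters
 * written (without the terminating null), or -1 if cp is out of range or
 * buf is too small. */
static int ucn_format(uint_least32_t cp, size_t size, char buf[static 1]) {
    if (cp > 0x10FFFF) return -1;
    int len = (cp <= 0xFFFF)
        ? snprintf(buf, size, "\\u%04lx", (unsigned long)cp)   /* short form */
        : snprintf(buf, size, "\\U%08lx", (unsigned long)cp);  /* long form */
    return (len < 0 || (size_t)len >= size) ? -1 : len;
}
```

Called with the code point 0x3C0 of π, this writes the six characters \u03c0 from the example above.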

See also
Punctuators

Using C23 features

Enumeration types with specified base type

C23 offers a new feature where you can fix the base type of an enumeration type. For example we have

enum category : unsigned char {
… here are the enumeration constants …
};
enum ellipsis_fibfac : size_t {
… here are the enumeration constants …
};

meaning that constants, variables or members of ellipsis‿category only take up one byte, and that enumeration constants of type ellipsis‿fibfac always behave like a size_t.

Additionally, we have an include feature, ellipsis-enum-xcode.h, that wraps declarations of enumerations and adds some basic functionality to them.

Named constants with constexpr

Before C23, named constants could only be defined for integer types (not wider than int) by abusing the enum feature. With the new constexpr feature we are able to define compile time constants of any type. For example

static constexpr auto φ = 7540113804746346429.0L / 12200160415121876738.0L;

defines a compile time constant of type long double const for the golden ratio φ. Such constants are not only compile time constants but also objects, so taking the address, such as in &φ, is possible. Regardless of the context in which the above declaration is found, the static storage specifier ensures that the storage duration is static.

The deprecated attribute for controlled access

C23 has a new feature set called attributes. There are not yet many standard attributes defined in C23 and we only use one of these repeatedly, [[deprecated]].

As the name indicates, this attribute marks the declaration to which it is applied as deprecated, and the compiler issues a warning if you use that declaration anyhow. We use this to mark structure members that are considered internal and which a user should not access.

This is for example used for the dictionary structure ellipsis‿token‿dictionary, see the source of ellipsis-tdict.h and ellipsis-tdict.c. This has the following properties:

  • The structure is complete, so instances can be allocated as normal.
  • Access to such objects is only granted through the function interfaces that are provided in the header.
  • These functions ensure that the different members are always consistent.
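
One way to arrange this is sketched below. All names (counter, COUNTER_PRIVATE, COUNTER_IMPLEMENTATION) are hypothetical and the mechanism is an assumption for illustration; the real ellipsis-tdict.h may be organized differently. The idea is that the attribute is only switched off in the translation unit that implements the interface. Here the macro COUNTER_IMPLEMENTATION is defined directly so that the example is one self-contained file.

```c
#include <stdbool.h>

/* In a real project only the implementation file would define this,
 * before including the header part that follows. */
#define COUNTER_IMPLEMENTATION

#ifdef COUNTER_IMPLEMENTATION
# define COUNTER_PRIVATE /* no attribute: the implementation may use the members */
#else
# define COUNTER_PRIVATE [[deprecated("internal member, use the counter_* functions")]]
#endif

/* The structure is complete, so users can declare objects of this type. */
typedef struct counter counter;
struct counter {
    COUNTER_PRIVATE unsigned hits;
    COUNTER_PRIVATE unsigned limit;
};

/* The function interfaces keep the invariant hits <= limit. */
static inline void counter_init(counter c[static 1], unsigned limit) {
    c->hits = 0;
    c->limit = limit;
}

/* Count one event; returns false once the limit is reached. */
static inline bool counter_hit(counter c[static 1]) {
    if (c->hits >= c->limit) return false;
    ++c->hits;
    return true;
}

static inline unsigned counter_hits(counter const c[static 1]) {
    return c->hits;
}
```

User code that does not define COUNTER_IMPLEMENTATION can still write counter c; and call the functions, but an expression such as c.hits then draws a deprecation warning from a C23 compiler.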
See also
The deprecated attribute in C23 does much more than marking obsolescence

New keywords

There are new keywords for central language features.

  • nullptr avoids using bizarrely typed features such as NULL, or even more weirdly 0, as initializers for pointers.
  • false, true and the type bool finally give a satisfactory interface to a Boolean type.
  • thread_local provides variables with thread storage duration.

Macros with conditional expansion

Programming with variadic macros becomes much easier with the __VA_OPT__ feature that provides a simple in-macro conditional.
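
A small sketch of what this in-macro conditional looks like; the macro names HAS_ARGS and FORMAT are made up for this example and are not taken from the eĿlipsis sources.

```c
#include <stdio.h>

/* Expands to (0 + 1) if at least one argument is passed, to (0) otherwise. */
#define HAS_ARGS(...) (0 __VA_OPT__(+ 1))

/* Format into a char array. The comma in front of the variadic arguments
 * only appears if there are any. BUF must be an array, not a pointer,
 * because sizeof is applied to it. */
#define FORMAT(BUF, FMT, ...) \
    snprintf((BUF), sizeof(BUF), FMT __VA_OPT__(,) __VA_ARGS__)
```

With these definitions both FORMAT(buf, "plain") and FORMAT(buf, "%d-%d", 4, 2) expand to valid calls, without the trailing-comma problem that such macros had before __VA_OPT__.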

See also
The new VA_OPT feature in C23

Type inference

There are three new features in C23, namely typeof, typeof_unqual and auto, that can be used to infer a type from an expression and to ensure that two features automatically have consistent types, regardless of type changes that may occur at some distant code location. This is primarily used for the generic code that is found in the sources/generate directory.

Enforcing non-null pointer arguments with [static]

For several revisions now, C has allowed [static] in function parameters to describe a pointer to an array object with a minimal number of elements. Such parameters may then be assumed to be non-null. Only recently have the static analyzers that are built into compilers become capable of taking this information into account and of giving useful feedback to the programmer.

We use this feature throughout and check all warnings that, for example, gcc produces. This has found a lot of problematic places where the API design had to be clarified, namely where it had to be decided whether a function interface might receive a null pointer or not. So now all pointer parameters that are not supposed to receive a null pointer use [static]. Those that may, use * notation. In all cases where the analyzer is not able to prove that an argument is non-null, a null-pointer check is inserted that terminates execution if it triggers.
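
The three situations can be sketched as follows. The function names are hypothetical and only serve this example; they are not part of the eĿlipsis API.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* name points to at least one char, so it must not be null. */
static size_t name_length(char const name[static 1]) {
    size_t len = 0;
    while (name[len]) ++len;
    return len;
}

/* name may be null: plain pointer notation, and the function deals with it. */
static size_t name_length_or_zero(char const* name) {
    return name ? name_length(name) : 0;
}

/* The pointer comes from somewhere the analyzer cannot see through:
 * check it and terminate if it is null. */
static size_t name_length_checked(char const* name) {
    if (!name) {
        fputs("unexpected null pointer\n", stderr);
        abort();
    }
    return name_length(name);
}
```

A call such as name_length(0) is the kind of code that the compiler can then diagnose, whereas name_length_or_zero(0) is fine by design.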

Flexible array members for type-generic array and string types

Flexible array members are not new to C23, but they are probably not exploited to their full potential. We provide a generic interface for such arrays, see also Generic programming with XFiles, below, that is efficient and hopefully easy to use.
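
For readers who have not used the feature, here is a minimal, non-generic sketch of a structure with a flexible array member. The type dbl_array and the function dbl_array_alloc are assumptions for this illustration; the interfaces listed below are the real ones and differ from this.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct dbl_array dbl_array;
struct dbl_array {
    size_t length;
    double array[];  /* flexible array member, sized at allocation time */
};

/* Allocate an array with length elements, all set to 0.0.
 * Returns a null pointer if the size would overflow or allocation fails. */
static dbl_array* dbl_array_alloc(size_t length) {
    if (length > (SIZE_MAX - sizeof(dbl_array)) / sizeof(double)) return NULL;
    dbl_array* a = malloc(sizeof *a + length * sizeof a->array[0]);
    if (!a) return NULL;
    a->length = length;
    for (size_t i = 0; i < length; ++i) a->array[i] = 0.0;
    return a;
}
```

Length and data live in one allocation, so a single free releases both and the length always travels with the array.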

See also
ellipsis_str8
ellipsis_str32
ellipsis_carray
ellipsis_sarray
ellipsis_tarray
Initialization, allocation and effective type
stdc-init.h

Fibonacci hash functions for an efficient dictionary type

A Fibonacci hash function uses an approximation of the golden ratio φ to spread consecutive hash keys (such as "variable123" and "variable124") uniformly over a hash array. This function has well understood properties and can be implemented quite efficiently. The only real constraint that has to be ensured is that the integer approximation φ₀ of the golden ratio φ is co-prime to n, the size of the hash array. We ensure that property for a new hash array by testing values successively until this condition holds.

Such a hash array is then the basis for our dictionary type ellipsis‿token‿dictionary. Here an appropriate Fibonacci hash factor is recomputed each time that the dictionary is resized.
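
The search for a suitable factor can be sketched as below. This is not the eĿlipsis implementation: fib_factor and fib_hash are hypothetical names, keys are assumed to be integers already, and n is assumed to lie between 1 and 2^32 - 1 so that the product cannot overflow 64 bits.

```c
#include <stdint.h>

/* Greatest common divisor by Euclid's algorithm. */
static uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Start near n*φ (with φ ≈ 0.618) and test values successively until
 * one is co-prime to n. For n >= 2 the loop stops at n-1 at the latest,
 * because n-1 is always co-prime to n. */
static uint64_t fib_factor(uint64_t n) {
    uint64_t f = (uint64_t)((double)n * 0.6180339887498949);
    if (f == 0) f = 1;
    while (gcd_u64(f, n) != 1) ++f;
    return f;
}

/* Map key to a slot in [0, n); factor must come from fib_factor(n). */
static uint64_t fib_hash(uint64_t key, uint64_t factor, uint64_t n) {
    return ((key % n) * factor) % n;
}
```

Because the factor is co-prime to n, multiplication modulo n is a bijection, so any n consecutive keys land in n distinct slots. After a resize to a new n, fib_factor has to be called again, which corresponds to the recomputation mentioned above.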

Controlled single shot initialization and cleanup

eĿlipsis has a relatively complicated chain of dependencies between different translation units. Namely the support for different languages and different parts of the lexer is initialized dynamically. According to the chosen language more and more features are added to global arrays that hold the strings that are recognized as punctuators or specials.

Since this is dynamic and eĿlipsis is mildly multi-threaded, we have to ensure that the initialization happens only once (see ONCE_DEFINE) and that there are no memory leaks. The tools for such a consistent initialization are provided by ellipsis-once.h.

See also
Simple TU initialization and cleanup handling with dependencies

Initialization dependencies

As a result eĿlipsis has a relatively complicated dependency between translation units for initialization. This is because different units initialize global data such as keywords, token names or punctuators dynamically at startup. These dependencies are handled with a dependency mechanism as described above.

Here rectangular boxes correspond to identified initialization features. These are colored red if they use ONCE_DEFINE_STRONG, black otherwise.

Figure: initialization dependency graph (dot_init-dependenies.png)

Generic programming with XFiles

The implementation of eĿlipsis reuses an old technique of generic programming in C, best named "XFiles". It consists in including a specific include file that is parameterized with some macros. For example, an include file "my_fa_struct_xfile.c" to define a structure with a flexible array member (FA) could contain code that is parameterized by a type BASE_TYPE and a name FA_TYPE. It would then be included such as in

#define BASE_TYPE double
#define FA_TYPE my_double_arr
#include "my_fa_struct_xfile.c"
#undef BASE_TYPE
#undef FA_TYPE

In fact, eĿlipsis itself proposes several extensions that help programming with XFiles. For example the above would typically be coded with eĿlipsis as

#include_source "my_fa_struct_xfile.c" \
prefix(bind BASE_TYPE double) \
prefix(bind FA_TYPE my_double_arr)

Here, using the prefix attribute with bind (instead of define) ensures that the macro definitions are only active during the inclusion. The #undef directives from above are no longer necessary. include_source (instead of include) inhibits the expansion of the line; thereby arguments to the prefix attributes are not expanded and are used for the binding as is.
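
To make the technique more concrete, here is what the contents of "my_fa_struct_xfile.c" might look like. The file is hypothetical, so everything between the two marker comments is an assumption, including the helper macros XFILE_PASTE and XFILE_PASTE_. To keep the example in one file, the XFile text is written inline where the traditional #include directive would be.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Parameters provided by the including file. */
#define BASE_TYPE double
#define FA_TYPE my_double_arr

/* ---- start of the assumed contents of "my_fa_struct_xfile.c" ---- */
#define XFILE_PASTE_(A, B) A ## _ ## B
#define XFILE_PASTE(A, B) XFILE_PASTE_(A, B)

typedef struct FA_TYPE FA_TYPE;
struct FA_TYPE {
    size_t length;
    BASE_TYPE array[];
};

/* Expands to a function named after FA_TYPE, here my_double_arr_alloc. */
static FA_TYPE* XFILE_PASTE(FA_TYPE, alloc)(size_t length) {
    if (length > (SIZE_MAX - sizeof(FA_TYPE)) / sizeof(BASE_TYPE)) return NULL;
    FA_TYPE* p = calloc(1, sizeof *p + length * sizeof(BASE_TYPE));
    if (p) p->length = length;
    return p;
}
/* ---- end of the assumed contents ---- */

#undef BASE_TYPE
#undef FA_TYPE
```

Including the same XFile again with BASE_TYPE bound to int and FA_TYPE bound to another name would produce a second, independent structure type and allocation function.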

To be able to bootstrap the compilation of eĿlipsis, the sources are organized in two levels. The "normal" 1st-order C sources are already partially expanded, such that you may compile eĿlipsis with any modern C compiler. But these C sources are themselves produced by eĿlipsis from 2nd-order sources that contain special directives for eĿlipsis. Once eĿlipsis is operational on a new machine, processing these 2nd-order sources should produce exactly the same 1st-order sources; git status should not show any differences.