C*

From XionKB
Revision as of 22:31, 21 December 2024 by Alexander
C*
Paradigm: imperative, procedural, structured
Designed by: Alexander Nicholi
First appeared: December 2020
Typing discipline: static, strong, manifest, nominal, concrete
Filename extensions: .cst, .hst
Influenced by: Ada, C, Thinking Machines C*, D, Go
Influenced: C~, C♭

C* (pronounced C star) is an imperative, procedural, mechanicalist systems programming language created by Alexander Nicholi. It facilitates comprehensive compile-time guarantees of fully arbitrary mutability of state. Work on it began in early 2020 and has been ongoing ever since; publications first started appearing towards the end of that year. The name C* is meant to "point to" the aspects of C which have been overlooked or even derided by programming language theorists, chiefly its embodiment of data-oriented design and self-evident semantics.

C* was created as a result of informatics research conducted by its creator that uncovered a new paradigm of programming called mechanicalism, a school of thought about computing architecture that draws on such concepts as data-oriented design, in direct contrast to functionalism, the generic, extensible kind of programming previously taken for granted as universal. C* leans into a property of C that researcher Stephen Kell calls communicativity[1], radically reforming its abstract machine model and introducing several new features that give programmers more expressive power without compromising the bit-precise yet portable niche that C occupies. In a nutshell, it is a more canonical language for generalised bare-metal software, such as drivers and kernels.

Overview

Until now, managing complexity in practice has always been achieved through genericism. C* rejects this prescription, and instead capitalises on the explicit semantics of C. Genericism is anathema to systems programming, because it inherently obfuscates a program as a means to compartmentalise complexity. This does not address the complexity in a way that programmers can positively appreciate; rather, it tries to "do away" with it and lets them pretend it is something abstract when it is not. The complexity itself is already more than enough for a human brain to handle – this abstract metaprogramming is surely a denial of the system in any real mind and makes for bad systems.

Instead, C* capitalises on C's communicativity to make obvious and clear the details of a system. It then provides a slew of new semantic mechanisms for constraining valid state, and a specification oriented around bits alone instead of abstract objects or octets of any length. This is called law & order and it is the key feature of C*.

General semantics

There are many common words in informatics that primary literature on C* has to be careful with. Such effort plays a large part in substantiating the design philosophy of the language as well as its general adherence to mechanicalism as a school of thought. Among other terms, this includes:

  • avoiding the term function to refer to callables, instead using routine
  • using octet to refer to the magnitude of data, reserving byte only for the mass of data (see Octet, not byte)

C* also adds many new terms that build upon the existing lexicon of our field, including:

  • deeper elucidation of the term marshalling with regard to data validation in addition to mere serialisation
  • a new term suite to refer to semantically parallelisable statements joined by the comma operator in place of statement terminators
  • segment routines, or simply segroutines, referring to labels inside routines with external visibility for jumping into

Changes from C

Like C, C* is an imperative programming language in the ALGOL tradition. C* was derived specifically from ANSI C, that is, the C language as standardised by the American National Standards Institute's working group X3J11[2]. From C, it inherits the following characteristics:

  • a full set of control flow keywords
  • all arithmetic and bitwise operators present in C
  • subroutines and procedures
  • the C preprocessor (CPP)
  • the concept of the "compilation unit"

However, C is often described most illustratively by the features one might expect of a language that it lacks, and C* is characteristically no different. There are many high-level constructs that the C* language will never provide, including:

  • nested subroutine declarations
  • object-oriented programming facilities, including
    • classes (or any form of non-POD structure really)
    • parametric polymorphism
    • operator overloading
    • constructors/destructors
    • methods
  • garbage collection
  • lambdas
  • templates or generics
  • reflection
  • concurrency
  • module declaration system (imports)
  • test harnessing
  • line comments
  • strong typing

C* provides many great additions and changes to ANSI C instead. Changes and removals include:

  • removal of all built-in types save for void
  • removal of all support for all source encodings other than ASCII*
  • removal of trigraphs
  • change the meaning of sizeof( ) to be denominated in bits rather than octets

The changes are modest compared to the many fantastic additions C* brings to the language, including:

  • law & order
    • marshalling for run-time law enforcement
    • transient variable lifetime traversal for compile-time law enforcement
  • one fundamental primitive type, the bit
  • bit-oriented struct definitions
  • attributes for declaring complex behaviour about types for the compiler to implement
  • the underscore pronoun, which serves many purposes
  • flex structs
  • explicit padding
  • struct synonyms
  • legal enums
  • enumerated unions
  • union alignment
  • union punning
  • Unicode literals
  • binary numeric literals
  • transmogrification
  • multiple return values
  • explicit inlining
  • code transclusion
  • several new arithmetic operators

Law & order

Laws

C* provides law & order through a few new keywords and a key concept. First among these is, of course, the law keyword, which defines and optionally names constraints to be used on data types. This is semantically accomplished through a series of boolean expressions, like so:

/* anonymous law applied to type */
law : s32
{
_ >= 0;
_ < 1000;
};


/* named law */
law leet
{
_ == 1337;
};


/* applying previously declared law */
law leet : u16;

These laws are enforced upon the data types they apply to at compile time through an exhaustive program analysis. The compiler works backwards to create a control flow tree representing a transient variable lifetime, and exhaustively validates the initialisation and modifications of that transient variable against the laws enacted upon it. This is made practical by formalising the boundaries of the compilation unit as a border between "native" and "foreign" code; the native side of that border is called the total system. Data which is confined to this total system gains the performance benefit of fully arbitrary validity checking at compile time.
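
For readers more familiar with ISO C, a law can be approximated as a predicate over a value. The following is a minimal sketch in plain C, not C*; the name law_s32_holds is hypothetical, and it assumes that every expression in a law must hold at once. It mirrors the anonymous law on s32 shown above, with the difference that C* is described as proving the predicate at compile time, whereas this sketch can only evaluate it at run time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical C analogue of: law : s32 { _ >= 0; _ < 1000; };
 * Assumption: all expressions of a law must hold, so they are joined with && */
static bool law_s32_holds( int32_t value )
{
	return value >= 0 && value < 1000;
}
```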

Marshalling

To deal with foreign code, C* provides a mechanism called marshalling. This is a definition of marshalling expanded from its current meaning in computer science as a synonym for serialisation, to also include the act of validating data being serialised according to arbitrary schemas, or in the case of C*, arbitrary laws. All subroutines that are callable from outside the total system must provide marshalling blocks for validating their variables, like so:

typedef bit[8] mybyte;
typedef bit[32] u32;

law : mybyte
{
_ < 255;
_ != 0;
};


void foo( mybyte a, u32 c )
{

/* marshalling happens one parameter at a time */
marshal a
{
if(a == 0)
{

/* a MUST be set to a valid value through marshalling
* but, we can check around that, smartly */

a = 1;
break;
}


/* exit the routine otherwise */
return;
}


/* this is the minimum required
* if ANY laws are enacted upon u32, this will fail to compile */

marshal c
{ }

/* alternatively, this minimal marshalling will do law checks
* and return upon any failures, since marshal blocks are only
* entered when the runtime checks for the laws fail */

marshal c
{
return;
}
}

Marshal blocks can only reference the parameter they are marshalling. They may declare and modify local variables with automatic storage duration, and may only call pure routines with such parameters.
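
The control flow of the marshal block for a can be imitated in ISO C as an explicit run-time check at the routine boundary. This is only a sketch under stated assumptions: foo_boundary, mybyte_law_holds and the seen out-parameter are hypothetical names invented for illustration, and C has no way to make the compiler insist that the check exists, which is the point of the C* feature.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical analogue of: law : mybyte { _ < 255; _ != 0; }; */
static bool mybyte_law_holds( uint8_t a )
{
	return a < 255 && a != 0;
}

/* Returns true if the routine body ran, false if it exited early.
 * *seen receives the value of a that the body would observe. */
static bool foo_boundary( uint8_t a, uint8_t * seen )
{
	if( !mybyte_law_holds( a ) )
	{
		/* stands in for the marshal block: entered only on failure */
		if( a == 0 )
		{
			a = 1; /* repair the value, then carry on */
		}
		else
		{
			return false; /* exit the routine otherwise */
		}
	}

	*seen = a;
	return true;
}
```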

Transient variable lifetime traversal

Transient variable lifetime is a term coined to refer to the ephemeral object of interest in performing the C* compiler's most valuable task: compile-time law enforcement. It refers to the exhaustive graphing of data as it flows through various names in all possible call graphs of a program. In a nutshell, we imagine a "variable" as a kind of ephemeral object that "travels" around the program, being modified and passed on. Consider the following C* code:

typedef bit[32] myu32;
typedef bit[32] u32;

/* Must be less than 100 and cannot ever equal 17 */
law : myu32
{
_ < 100;
_ != 17;
};


/* Fibonacci sequence will satisfy both of those constraints, but how do we know? */
void fibonacci( void )
{

u32 i;
myu32 t0, t1;
u32 tn;

t0 = 0;
t1 = 1;

/* print the first two terms */
fprintf( stdout, "Fibonacci series: %d, %d", t0, t1 );

/* print 3rd to 12th terms */
for(i = 2; i < 12; ++i)
{

tn = t0 + t1;
fprintf( stdout, ", %d", tn );
t0 = t1;
t1 = tn;
}
}

Going through the Fibonacci sequence, we know that if we limit the number of terms to 12, we will never reach 100. But how does the C* compiler break this down?

It evaluates the possible values of each variable term that it is enforcing at every point they are modified, in an exhaustive recursive fashion. This means that the algorithmic complexity of verification is proportional to the algorithmic complexity of the program being verified. The verification algorithm will first minimise the possible program space by factoring in all constant values, which in the routine above is very helpful.

In cases where the output of the routine depends on outside variables, the laws applied to the incoming parameters are assumed to hold either directly or by marshalling, but beyond that, it will assume worst values for the type's size. In the case of complex algorithms, it will often happen that it is not trivial to guarantee the validity of a given combination of laws; for example, if a foreign n was given of type u32, it may require brute force search to ensure that some other variable dependent on n never equals 17.
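
Because every input to the Fibonacci routine is a constant, the check reduces to replaying the loop and testing each value that lands in a law-bound variable. The sketch below does this in ISO C as a rough illustration of the idea; it is not the verification algorithm of any C* compiler, and the names myu32_law_holds and fibonacci_obeys_law are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical analogue of: law : myu32 { _ < 100; _ != 17; }; */
static bool myu32_law_holds( uint32_t v )
{
	return v < 100 && v != 17;
}

/* Replays the loop from fibonacci( ) for a given number of terms and
 * checks every value assigned to t0 or t1 against the law */
static bool fibonacci_obeys_law( uint32_t terms )
{
	uint32_t t0 = 0, t1 = 1, tn, i;

	if( !myu32_law_holds( t0 ) || !myu32_law_holds( t1 ) )
	{
		return false;
	}

	for( i = 2; i < terms; ++i )
	{
		tn = t0 + t1;
		t0 = t1; /* t1 has already been checked */
		t1 = tn;

		if( !myu32_law_holds( t1 ) )
		{
			return false;
		}
	}

	return true;
}
```

With 12 terms the largest value produced is 89, so the law holds; with 13 terms the next value is 144, which violates the first constraint.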

The default behaviour of the C* compiler in situations like these is to error out, asking the programmer to give it more certainty about the data it is dealing with. Practically speaking, this involves creating narrower types with stricter laws. For instance, if you want to be sure a 40-bit integer never overflows via multiplication, you need to make sure the bit sizes of the types multiplied to create it, summed together, do not exceed 40 bits. Like so:

typedef bit[64] outint;
typedef bit[64] term0;
typedef bit[64] term1;

law : outint
{
_ <= 0xFFFFFFFFFF;
};


law : term0
{
_ <= 0xFFFFFF;
};


law : term1
{
_ <= 0xFFFF;
};


void mysubroutine( void )
{

outint a;
term0 b = /* ... */;
term1 c = /* ... */;

/* This is OK */
a = b * c;
}

If the above code was modified to have laws that permit any valid addition or subtraction but not multiplication (ergo, the laws are only enough to allow linear mixing, not quadratic), then a = b + c would still be valid, but the compiler would error out if it found a = b * c. The precautionary principle is in play.
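
The arithmetic behind the multiplication example can be checked directly in ISO C. This sketch takes the upper bounds from the three laws above and compares worst-case products against the 40-bit limit; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bounds taken from the laws on term0, term1 and outint above */
#define TERM0_MAX  UINT64_C( 0xFFFFFF )     /* 24 bits */
#define TERM1_MAX  UINT64_C( 0xFFFF )       /* 16 bits */
#define OUTINT_MAX UINT64_C( 0xFFFFFFFFFF ) /* 40 bits */

/* The worst-case product fits because 24 + 16 <= 40 */
static bool product_fits( void )
{
	return TERM0_MAX * TERM1_MAX <= OUTINT_MAX;
}

/* Two 24-bit terms break the guarantee because 24 + 24 > 40 */
static bool wider_product_fits( void )
{
	return TERM0_MAX * TERM0_MAX <= OUTINT_MAX;
}
```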

However, it will be possible to put the compiler into that brute force mode, potentially at great computational cost, in order to arrive definitively at an answer to that question. This is accomplished using a framework of satisfiability solver programs, which provide a bitcode proof that can be saved by a programmer for trivial verification of its satisfiability once the solution is found.

Introducing the transient variable lifetime to this approach means that we transcend callsite boundaries within the total system to thoroughly simulate all subroutines in a program as one big meta-routine. This means that we can get more information about possible states than is possible when marshalling without attached formal proofs. Data confined within a total system has a far smaller number of possible states. More precisely, the number of possible states it has is directly proportional to the number of changes it has. The larger the program, the longer it takes to validate, but that does not scale exponentially in its own right. It merely follows the algorithmic complexity of the program being validated.

Concrete type system

C* has no abstract type system, not even a weak one as provided by ANSI C. Instead, it has a simple yet rigorous concrete type system based on three fundamental primitive types: bit, void and fifo. They are considered fundamental because they are built into the language, and primitive as they are elementary types (as opposed to complex ones created by structure and bifurcated by dot notation). More generally, the bit is "something", while void is "nothing", and fifo is a secret third thing currently only valid in the context of transmogrification for the transient transit of data.

C* uses its radically simplified set of primitive types as a basis for a powerfully expressive complex type system that far outshines that provided in C. Enumerations, structures, and unions have all received major semantic changes at the outset, and on top of this provide a host of new expressions that are not possible in C. Many expressions are entirely new to the imperative paradigm thanks to the conceptual distinction between functionalism and mechanicalism mentioned already. In other words, techniques and concepts previously only possible in the abstract through functional programming are now accessible in a concrete way.

Enumerations

Enumerations have received comparatively modest treatment in the design of C*. They still behave as they do in C, with one major conceptual difference: enumerations do not have an implicit typing of int (or any implicit typing at all for that matter). Instead, the values of enumerations in C* hold fully arbitrary integers and floating-point numbers, using a big integer implementation for the former and a Type I Unum implementation for the latter. This is made practical by the architecture of the Oración assembler backend that the Sirius C* compiler will use. Enumerations do not have a C♯ style "namespacing" effect, so they put their symbols into the main symbol namespace like everything else does in a C compilation unit.

Concrete enumerations

Since enums in C* are ephemeral by default, it is useful then to have a non-ephemeral enum variant that does indeed carry a concrete type (ergo, a definite size). We call these concrete enumerations and they are denoted in a familiar syntax borrowed from C++:

enum : u32
{
PRIMA,
SECUNDA,
MAX_MYENUM
};
/* all of the above enum identifiers are u32s */

Concrete enumerations have a set of anonymous members that fill in every possible number representable by their underlying type not already named. They are semantically interchangeable with each other so long as they have the same underlying type, unless they are also legal enumerations.

Legal enumerations

C* introduces a variant of the typical C enumeration called the legal enumeration, distinguished by the composite opening keyword enum law as opposed to just enum. These work the same as normal enumerations with one major difference: the values of the enumeration's members cannot be set. This means that legal enumerations always start at zero, increment by one, and never hold non-integers. This has two benefits: first, it aids in compile-time law enforcement, and second, it enables the sizeof( ) expression to be taken from the enumeration, yielding a headcount of how many members it has. This obviates the need for manually specifying the common pattern of MAX_* as the final member of an enumeration denoting its size. Observe:

enum law types
{
FIRST,
SECOND,
THIRD
};

enum law err
{
/* Not legal: */
FIRST = 0,
/* also not legal: */
FOOBAR = 42
/* legal enums cannot have their members set to arbitrary values */
};

/* this gives you the common pattern of a final enum member MAX_* */
/* sizeof(enum law types) == 3 */

Legal enumerations can also be combined with concrete enumerations to further aid the transient variable lifetime analyser and achieve more comprehensive law enforcement.
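
For comparison, this is the ISO C pattern that sizeof( ) on a legal enumeration is meant to replace. It is plain C, and the identifiers MAX_TYPES and type_names are chosen only for illustration.

```c
/* Plain C: the member count has to be carried by a sentinel member */
enum types
{
	FIRST,
	SECOND,
	THIRD,
	MAX_TYPES /* equals 3 only while no member is given an explicit value */
};

/* A typical use: sizing a lookup table by the number of members */
static const char * const type_names[MAX_TYPES] =
{
	"first", "second", "third"
};
```

The sentinel silently reports the wrong count as soon as any member is assigned a value by hand, which is the failure that forbidding assignment in legal enumerations rules out.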

Structures

C* has radically changed the semantics of structures to be oriented and denominated in bits, rather than members and octets with implicit padding.

Inline structures

Inline structures are a syntactic sugar that makes it practical to define new primitive types. They are constituted by a struct definition that has one and only one member, the name of which is the pronoun _. With this, all data typed to such a structure will not use dot notation to access the data, but will access it directly as a primitive type. Observe:

typedef struct
{
bit[8] _;
}
u8;

/* this is how we do it */
u8 foo = 255;
foo = 254;

Explicit padding

This is one of the more radical departures C* makes not just from C but even from related languages based on C: implicit "padding" does not exist in the abstract machine model for C*. Since the compiler will never be permitted to insert padding surreptitiously on its own, it is up to the programmer to perform this explicitly. Explicit padding is constituted by structure members named by the pronoun _ where there is more than one member.
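
As a rough illustration in ISO C rather than C*, the struct below spells out its own padding so that the layout the programmer wrote is the layout the compiler produces. The member name pad_ is a stand-in for the C* pronoun _, and the offsets noted in the comments are an assumption about common ABIs, since ISO C itself still permits a compiler to insert padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of a struct whose padding is written out.
 * On common ABIs: octet at offset 0, alpha at 4, beta at 8, 16 octets total */
struct padded
{
	uint8_t  octet;
	uint8_t  pad_[3]; /* 24 bits of explicit padding */
	uint32_t alpha;
	uint64_t beta;
};
```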

Structure synonyms

C* provides a way to declare distinct structs to be "synonyms" of each other, meaning that they can be treated interchangeably in subroutine calls. This is only permitted in cases where, differences in ordering and explicit padding aside, they are semantically identical. Therefore, struct synonyms are a way to automate logic-free transmogrification of structured data. The syntax is as follows:

struct prima
{
bit[8] octet;
bit[24] _;
bit[32] alpha;
bit[64] beta;
};


struct secunda {
bit[64] beta;
bit[32] alpha;
bit[8] octet;
};


/* declare them synonyms */
struct secunda : struct prima;

/* directionality does not matter */
struct prima : struct secunda;
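
In ISO C, passing one of these structures where the other is expected needs a hand-written conversion. The sketch below shows that member-by-member copy, which is the kind of boilerplate a synonym declaration is described as automating; it is plain C, and prima_to_secunda and pad_ are hypothetical names.

```c
#include <stdint.h>

struct prima
{
	uint8_t  octet;
	uint8_t  pad_[3]; /* stands in for bit[24] _ */
	uint32_t alpha;
	uint64_t beta;
};

struct secunda
{
	uint64_t beta;
	uint32_t alpha;
	uint8_t  octet;
};

/* Copies each named member across; ordering and padding are ignored */
static struct secunda prima_to_secunda( struct prima p )
{
	struct secunda s;

	s.beta  = p.beta;
	s.alpha = p.alpha;
	s.octet = p.octet;
	return s;
}
```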

Attributes

An implementation will have a known list of attributes that convey certain information about a new primitive. These are construed through a braced list of string literals at the end of the typedef's body, before the name, like so:

typedef struct
{
bit _[8] { "signed2" };
}
s8;

In this example, the attribute signed2 conveys that the number is signed using two's complement. This means the most significant bit of the type will be treated as a sign bit by the implementation using two's complement.

Some attributes and their provisions include:

Attribute      Description
signed1        Signed integers using one's complement
signed2        Signed integers using two's complement
ieee-bin16     IEEE 754 floating point binary16
ieee-bin32     IEEE 754 floating point binary32
ieee-bin64     IEEE 754 floating point binary64
ieee-bin128    IEEE 754 floating point binary128
ieee-bin256    IEEE 754 floating point binary256
bigint         Unlimited precision integer
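
To make the difference between the first two attributes concrete, the following ISO C sketch decodes the same eight bits under each interpretation. The function names are hypothetical and the code only illustrates the two number representations, not how a C* implementation would apply an attribute.

```c
#include <stdint.h>

/* Two's complement: the top bit carries a weight of -128 */
static int decode_signed2( uint8_t bits )
{
	return ( bits & 0x80 ) ? (int)bits - 256 : (int)bits;
}

/* One's complement: a negative value is the bitwise inverse of its magnitude */
static int decode_signed1( uint8_t bits )
{
	return ( bits & 0x80 ) ? -(int)( (uint8_t)~bits ) : (int)bits;
}
```

Under these definitions the pattern 0xFF is -1 in two's complement but negative zero in one's complement, which an int can only report as 0.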

Flex structs

One of the major limitations of C is its inability to parameterise the size of structure members as part of the overlying type signature. C has always been able to do this with constant expressions in its array syntax, and since C99 it can do this dynamically with VLAs. It just cannot do this with structures, but C* can. We call these flex structs.

Below is an example of how flex structs prove useful. It is an abbreviated implementation of a singly linked list node, first using the C-compatible pointer approach, and then using the C*-specific inline approach.

struct uni_1ll_node;

struct uni_1ll_node
{
u8 * data;
struct uni_1ll_node * next;
};


void foo( void *, struct uni_1ll_node );

struct uni_1ll_inode[_];

struct uni_1ll_inode[x]
{

u8 data[x];
struct uni_1ll_inode[x] * next;
};


void fooi( void *, struct uni_1ll_inode[64] );

The value of these semantics is obvious, as it makes it possible to avoid the indirection of having pointers, without requiring abuse of the preprocessor or template/macro metaprogramming. The number, being a compile-time constant, is simply worked into the final definition of the type as if it were present in the struct body itself, dictating its ultimate size and the resulting ABI requirements.
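
For contrast, this is roughly what the preprocessor-based workaround looks like in ISO C. The macro UNI_1LL_INODE is hypothetical; it stamps out a separately named struct for each size, so the size ends up in the type name rather than in a type parameter.

```c
#include <stdint.h>

/* Hypothetical macro that generates one struct per inline size */
#define UNI_1LL_INODE( x ) \
	struct uni_1ll_inode_##x \
	{ \
		uint8_t data[x]; \
		struct uni_1ll_inode_##x * next; \
	}

UNI_1LL_INODE( 64 );
UNI_1LL_INODE( 16 );
```

Every size used anywhere in the program needs its own expansion of the macro, and the resulting types share nothing but a naming convention.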

Structure punning

C* provides a feature called structure punning, which allows members of a structure definition to be "faked out" for constant data instead of holding a variable value as they usually do. The programmer can also decide whether to hold space for such data in memory, allowing it to be cast variable later on. Observe:

struct foo
{
u16 a;
u16 b;
/* punned without storage */
u32 sig = 0xDEADBEEF;
};


struct bar
{
u16 a;
u16 b;
/* punned with storage */
u32 sig := 0xDEADBEEF;
};


struct baz
{
u16 a;
u16 b;
/* not punned */
u32 sig;
};


extern struct foo x1;
/* this would cause UB due to potential lack of storage allocated to y */
struct bar y1 = (struct bar)x1;
/* same problem */
struct baz z1 = (struct baz)x1;

extern struct bar x2;
/* this is not UB as it merely leaves inaccessible the sig data
* however it can be confounding as .sig is no longer being accessed
* from memory as it was */

struct foo y2 = (struct foo)x2;
/* this is USEFUL as it makes the backed pun not punny */
struct baz z2 = (struct baz)x2;

extern struct baz x3;
/* this is not UB as it merely leaves inaccessible the sig data
* however it can be confounding as .sig is no longer being accessed
* from memory as it was */

struct foo y3 = (struct foo)x3;
/* this WILL cause .sig to be overwritten with 0xDEADBEEF, thereby
* destroying whatever variable data was stored there */

struct bar z3 = (struct bar)x3;

Unions

C* also provides several new semantics for unions, most of which help achieve common high-performance optimisation patterns that C programmers would ordinarily be forced to rely on messy CPP macros or assembly code to achieve.

Union alignment

It is helpful to be able to align members of a union relative to one another on a bit level, similar to paragraph alignment to the left or right. This obviates the need for cumbersome struct boilerplate that creates artificial alignment with other union members, and is more semantically straightforward. C* uses the >_> symbol following the member type name to signify rightward alignment (i.e. towards the least significant bit) with respect to the largest member, and likewise <_< to signify leftward alignment (i.e. towards the most significant bit). The default alignment (i.e. indeterminate alignment) can be denoted explicitly using the >_< symbol if desired.

typedef bit[16] myu16;
typedef bit[9] myu9;

union
{
/* largest member, no alignment needed, but given anyway */
myu16 >_< prima;
/* right align as carets point right */
myu9 >_> secunda;
};

Although alignment is not explicitly required by the language like padding is, the default alignment is indeterminate and the compiler reserves the right to align members however it pleases unless instructed otherwise with the above symbols.

Union punning

It is very useful to be able to pun the values of other union members in order to overload the bitfield in cases where one or more bits of a field may be zero and therefore usable for other purposes. A common example of this is flag storage in pointers, where a pointer may offer 1 or more bits on the least significant end that are always zero (guaranteed by either the hardware or by the allocator). Punning requires explicit union member alignment. Here is an example of a pointer type where the alignment is assumed to be at a minimum of 4, giving us two bits to use as flags:

typedef bit[32] ptri;

union
{
ptri ptr { _[0:1] = 0b00, flags = 0b00 };
bit[2] >_> flags;
};

This example is redundant but fully explains the feature at work here:

  1. we have the pointer itself, ptr
    • it's 32 bits wide as defined by the typedef for the sake of example
    • being the largest member, it needs no explicit alignment
  2. following its declaration is a braced list, which contains several items
    1. the pronoun, _, which refers to itself, ptr
      • the subscripting of the pronoun defines which range of bits we are dealing with
      • the "assignment" of the two least significant bits (as denoted before) to zero, meaning:
        • these bits of ptr will always read back as zero
        • writing to these bits has no effect
    2. flags, which refers to the member of that name in the union
      • setting it to zero, which causes ptr to be treated as if flags is always zero regardless of its actual value
        • this is redundant with the pronoun field given previously
        • if given by itself, it would cause reads of the two least significant bits to always return zero
        • however, it alone would not stop writing nonzero bits to it as part of a write to the ptr field
  3. finally we have the flags member field, which is a bit[2]
    • it lacks any punning annotations in a braced list, unlike ptr
    • it is explicitly aligned to the right, so that its bits correspond to the least significant bits of ptr
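
The same pointer-tagging pattern is usually written in ISO C with explicit masks. The sketch below is plain C with hypothetical names (FLAG_MASK, tag_pack, tag_ptr, tag_flags) and assumes addresses aligned to at least 4, as in the example above.

```c
#include <stdint.h>

#define FLAG_MASK ( (uintptr_t)0x3 ) /* the two least significant bits */

/* Combine a 4-aligned address with two flag bits */
static uintptr_t tag_pack( uintptr_t addr, unsigned flags )
{
	return ( addr & ~FLAG_MASK ) | ( (uintptr_t)flags & FLAG_MASK );
}

/* Reading ptr: the low two bits always read back as zero */
static uintptr_t tag_ptr( uintptr_t word )
{
	return word & ~FLAG_MASK;
}

/* Reading flags: right-aligned onto the low two bits */
static unsigned tag_flags( uintptr_t word )
{
	return (unsigned)( word & FLAG_MASK );
}
```

Every access has to remember to apply the mask by hand, whereas the union declaration above states the masking once as part of the type.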

Enumerated unions

C* provides a way to enumerate unions, providing some scaffolding to leverage unions in the conventional manner without imposing the semantic restrictions of "tagged unions" typical to other languages.

/* using the enum law example code from above ... */

struct a
{
enum law types t;
union b : t
{
u16 >_> x;
u32 >_> y;
u64 >_< z;
};
};

/* access syntax: */
a.t = FIRST;
/* this is accessing x within */
a.b = 0xFFDD;
/* you can bypass the enumeration of the union and modify directly */
a.b.y = 0xFFFFFFFF;
/* a.b.x will then be equal to 0xFFFF, not 0xFFDD */

Notably, this functionality can be used to create a kind of synthetic "optional" data type that respects C*'s paradigm:

enum law bool
{
FALSE,
TRUE
};

struct optional_foo
{
enum law bool is;
union val : is
{
bit nop { _ = 0 };
struct foo data;
};
};


/* optional_foo would then be accessed like so: */
extern struct optional_foo fooey;

switch(fooey.is)
{

case FALSE:
/* fooey.val is a bit that is always zero */
break;
case TRUE:
/* fooey.val is a struct foo */
break;
}
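
A comparable "optional" in ISO C is a hand-maintained tagged union. The sketch below is plain C under stated assumptions: struct foo is cut down to two members for brevity, and optional_foo_a_or is a hypothetical helper. Nothing in C ties the tag to the union, so keeping them in agreement is left to the programmer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct foo */
struct foo
{
	uint16_t a;
	uint16_t b;
};

struct optional_foo
{
	bool is; /* the tag: selects which union member is meaningful */
	union
	{
		uint8_t    nop;  /* placeholder when is == false */
		struct foo data; /* payload when is == true */
	} val;
};

/* Returns the payload's a member, or fallback when nothing is stored */
static uint16_t optional_foo_a_or( const struct optional_foo * o, uint16_t fallback )
{
	return o->is ? o->val.data.a : fallback;
}
```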

Dealing with data

C* has made many useful departures from the archaic models of C in how it conceptualises data for the programmer. It sports a new and much-simplified abstract model of computer memory. It also has new semantics for string and character literals that not only add Unicode support but do so in an encoding-agnostic manner that fully leverages C*'s powerful new type system. Encoding of literals and numeric constant data in general has been supercharged by the inclusion of a pushdown automaton that performs transmogrification of data in source format into its desired binary form in the final program.

Abstract memory model

The language considers three broad categories of memory. All storage falls into one of these categories, regardless of its mechanism of storage. In other words, this is distinct from the other memory distinction between "automatic" stack-allocated memory and "manual" heap-allocated memory.

Types of memory
Private memory: Same as in OpenCL parlance; in CUDA terms it may be called "registers" or "local memory"; in general-purpose CPU terms it is thread-local storage. Writable, and only accessible from a single execution context.
Shared memory: Same as in CUDA parlance; in OpenCL terms it is called "local memory"; in CPU terms it is typical, often heap-allocated memory. Writable and accessible from potentially multiple concurrent execution contexts; this is the only memory category that demands manual synchronisation.
Constant memory: Shared memory that is read-only for all execution contexts. This memory can be shared and used without need for synchronisation across multiple execution contexts, but is only modified at the point it was declared and initialised.

Literals

C* imposes several strong measures to help contain complexity in systems programming. Many of these show up in the specifics of construing data literally within source code using "literals".

C* does provide a binary literal notation that is identical to that of C++ and many other languages:

u32 foo = 0b10110011;

C* does not deviate from C in its syntax for octal literals.

There are literal notations for two kinds of "text": 7-bit "narrow" ASCII, and 21-bit "wide" Unicode. As in C, single quotes are used to construe character literals, and double quotes are used to construe string literals. C* uses the @ symbol prefixed to the opening marks to denote the literal as being Unicode instead of ASCII. Observe:

'a'; /* literal ASCII lowercase A (number 97) */
@'a'; /* literal Unicode lowercase A (U+0061) */
'\177'; /* ASCII DEL (number 127) */
@'\u2018'; /* Unicode opening single quote (U+2018) */
"Good morning, Vietnam!\n"; /* literal ASCII string */
@"Good morning, Vietnam!\n"; /* literal Unicode string */

Transmogrification

Notice
This feature is under heavy redevelopment. Its final form will probably be quite different from the work-in-progress you see here.

C*'s provisions for literals are usually not going to translate into their ideal storage medium as-is. Everything defaults to being bit-packed, including ASCII as 7-bit and Unicode as 21-bit, which is hostile to most CPU architectures. In order to help programmers work through such problems without the follies of metaprogramming, C* provides syntax for a kind of transmogrifier subroutine that is worked through to transform literals into their final form within the program.

Transmogrifier subroutines are the sole context for C*'s third fundamental primitive type, fifo, as well as two operators, <- and ->. Definitionally, a subroutine is a transmogrifier subroutine if its return type is fifo and its sole parameter is an anonymous fifo type. As used in the examples below, <- is the input operator, taking bits in from the input FIFO as the routine progresses, and -> is the output operator, streaming bits out as the routine progresses.

With these, it is possible, for example, to write a transmogrifier that takes a Unicode string literal and outputs it as UTF-8:

fifo ustr2utf8( fifo )
{

bit c[21];
u8 n;

/* take in 21 bits from input FIFO
* n reports how many bits were available */

c, n <- 21;

if(n < 21)
{

/* do something different, potentially */
}

/* ... implementation ... */

/* send out 8 bits */

c & 0xFF -> 8;
}


fifo str2utf8( fifo )
{

bit c[7];
u8 n;

c, n <- 7;

if(n == 0)
{

return;
}

else if(n < 7)
{

/* Houston... we have a problem */
}

/* 7-bit ASCII into 8-bit stream */
0 -> 1;
c -> 7;
}


const u8 * my_unicode = ustr2utf8@"\u201CBlah blah\u201D";
const u8 * my_ascii = str2utf8"Good morning, Vietnam!";
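
The elided implementation in ustr2utf8 would have to perform the standard UTF-8 encoding of each 21-bit code point. As an illustration of that step only, here is one way to write it in ISO C; the function utf8_encode and its buffer-based interface are assumptions made for this sketch and say nothing about how a transmogrifier would actually stream its output.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 octets needed).
 * Returns the number of octets written, or 0 if cp cannot be encoded. */
static size_t utf8_encode( uint32_t cp, uint8_t * out )
{
	if( cp < 0x80 )
	{
		out[0] = (uint8_t)cp;
		return 1;
	}

	if( cp < 0x800 )
	{
		out[0] = (uint8_t)( 0xC0 | ( cp >> 6 ) );
		out[1] = (uint8_t)( 0x80 | ( cp & 0x3F ) );
		return 2;
	}

	if( cp < 0x10000 )
	{
		if( cp >= 0xD800 && cp <= 0xDFFF )
		{
			return 0; /* surrogates are not valid scalar values */
		}

		out[0] = (uint8_t)( 0xE0 | ( cp >> 12 ) );
		out[1] = (uint8_t)( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
		out[2] = (uint8_t)( 0x80 | ( cp & 0x3F ) );
		return 3;
	}

	if( cp <= 0x10FFFF )
	{
		out[0] = (uint8_t)( 0xF0 | ( cp >> 18 ) );
		out[1] = (uint8_t)( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
		out[2] = (uint8_t)( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
		out[3] = (uint8_t)( 0x80 | ( cp & 0x3F ) );
		return 4;
	}

	return 0; /* beyond the Unicode range */
}
```

For the opening quote U+201C used in my_unicode, this yields the three octets E2 80 9C.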

Other features

Domains

What other languages often call "modules" or "namespaces" are provided by C* as domains. Domains are a simple semantic grouping tool for making coherent collections of symbols and identifiers. In contrast to C++ namespaces, they are not lexically "grouping", that is, they are merely declared to exist, and used in other declarations directly as desired. Observe:

domain sys;
domain sys.io, sys.mem, sys.utf;

using sys.io.printf; /* printf is now in scope unqualified */
using sys.io.printf = p; /* p refers to sys.io.printf now */
using struct sys.io.file; /* now struct file is in scope */
using struct sys.io.file = struct f; /* struct f declared */
using struct sys.io.file = f; /* ERROR: cannot cross namespaces */
typedef struct f f; /* if you really wanted to do that, this is how */

While in a vacuum, C*'s domains hardly justify their existence in light of the sufficiency of normal symbols as in ANSI C, the utility can be realised in how it makes possible smarter contextualisation of parameters for routine calls and structure initialisation, like so:

domain mylib;

enum mylib.foo
{
PRIMA,
SECUNDA
};

void mylib.bar( enum mylib.foo e );

/* regardless of the presence of using statements, the enum would be
* contextualised in the routine call so it never needs qualifying */

mylib.bar( PRIMA );

/* or, with using */
using mylib.bar;
bar( SECUNDA );
/* never brought in enum mylib.foo directly */

/* this can be avoided by globalising the call with a leading dot */

.mylib.bar( mylib.PRIMA );

/* or, with using */
using mylib.bar;
.bar( mylib.SECUNDA );

The main danger of domains is obfuscation of interface – for this reason, C* disallows using statements outside of block scope, and additionally forbids any form of "wildcard" selectors in using statements entirely. Since the above feature of soft contextualisation applies to all identifiers in a given domain, the application of using statements as a general "decluttering" is avoided and refitted solely as a tool for bringing desired subroutines into scope. In this spirit, C* mandates that using statements are hoisted to the top of the block scope, before all variable declarations.

Segment routines

C* provides a way to export labels inside routine bodies as ABI symbols, giving a routine multiple points of entry. This is useful for bypassing certain kinds of housekeeping code for performance reasons when one knows that the invariants established by such boilerplate already hold without it executing. Consider this:

void foo( knot20 *, u32 );
void foo:quick( );

void foo( knot20 * cord, u32 knotcount )
{
    u32 i, j;
    u32 olimit = knotcount - 1;
    u32 ilimit = 0x40000;

    goto algo;

quick::
    olimit = 0;
    ilimit = knotcount * 0x40000;

algo:
    for(j = 0; j <= olimit; ++j)
    {
        u32 * const d = (u32 *)cord[j];

        for(i = 0; i < ilimit; ++i)
        {
            d[i] ^= d[i];
        }
    }
}

There are several things being described here. At a high level, we are dealing with an algorithm that can work with modular memory – that is, memory that has been intelligently segmented to be digestible on processors with small memories (think 16-bit). This routine was heavily modified to algebraically move all of the differences in execution into different starting variables that cause the desired behaviour. The algorithm inside is illustrative: it merely XORs the input data with itself, zeroing it. The idea is to support bookkeeping that advances the algorithm's work linearly over one knot, and then adjusts the data pointer to work on the next knot. There is a catch, however: if we instead call into the :quick( ) segroutine, it will treat the first knot in the cord as the start of a contiguous block of knots, skipping all of the overhead of advancing from one knot to the next because we are told they immediately follow one another in memory.
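
Plain C has no way to export a label as a second entry point, but the same split can be approximated with two thin wrappers around a shared body. The sketch below is an emulation only: the knot typedef is a hypothetical stand-in for knot20 (assumed here to be a pointer to a block of 0x40000 32-bit words), and foo_core and foo_quick are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KNOT_WORDS 0x40000u  /* 32-bit words per knot, as in the listing */

typedef uint32_t *knot;      /* hypothetical stand-in for knot20 */

/* Shared body: the two entry points differ only in the limits they pass. */
static void foo_core(knot *cord, uint32_t olimit, uint32_t ilimit)
{
    uint32_t i, j;
    for (j = 0; j <= olimit; ++j) {
        uint32_t *const d = cord[j];
        for (i = 0; i < ilimit; ++i) {
            d[i] ^= d[i];  /* x ^ x == 0, so every word is cleared */
        }
    }
}

/* Normal entry: visit each knot through its own pointer. */
void foo(knot *cord, uint32_t knotcount)
{
    assert(knotcount >= 1);  /* knotcount - 1 would wrap around otherwise */
    foo_core(cord, knotcount - 1, KNOT_WORDS);
}

/* Stand-in for foo:quick( ): treat cord[0] as the start of one
 * contiguous run of knotcount knots. */
void foo_quick(knot *cord, uint32_t knotcount)
{
    assert(knotcount >= 1);
    foo_core(cord, 0, knotcount * KNOT_WORDS);
}
```

Unlike a real segroutine, each wrapper here costs an extra call unless the compiler inlines it, which is exactly the overhead the C* feature is meant to avoid.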

Some other details that are important include:

  • forward declaration of segroutines must always have an empty parameter list
    • segroutines always take the same number and types of parameters as their parent routine
  • at the assembly level, segroutine labels in the implementation imply a hidden stack allocation to make space for all of the variables hoisted and declared at the top of the routine
    • beware of compound declaration-definitions! at the start of a segroutine label, the variables are only declared, not initialised
  • the hidden stack allocation also implies a hidden goto inserted immediately before it, targeted to the position immediately after it
    • therefore, explicit gotos like in the example above incur no performance penalty, and give the programmer full control over expression differentiation in the rest of the routine

While this example merely shows different initial values of stack variables for the purposes of illustration, a more real-world implementation of a modular memory aware algorithm may instead insert machine-specific plumbing code, such as incrementing a segment register or switching active banks, while offering the segroutine as a bypass to this potentially costly part of execution in cases where it is known to not be needed.

Flexible anonymous typing

Since C* embodies the maxim of "data is all we have," it does not trip up the programmer when they use a variety of different phrasings of what boils down to the same underlying bit structure. In other words, it lacks the abstract type system enforcement typical of C, which might confound or prevent a programmer from working with their data in a self-evident way. C* will cause an error when two different types are assigned to one another, unless they are structure synonyms or there is an explicit cast. The compiler should also warn the programmer if they are casting a smaller variable into a larger one, as this may cause UB due to lack of allocated memory. Since the type system is so self-evident, detecting this is almost always easy to do. This self-evident approach to type identity is only constrained by the facilities of law and order which act upon the type names they are applied to.

Multiple return values

Even though C*'s concrete type system is highly syntactically flexible, so much so that it is easy to write up an anonymous structure as a return type and handle it without issue, the language nonetheless provides multiple return values directly without such boilerplate, in much the same fashion as seen in other programming languages. Types are comma separated, and the subroutine call site can receive any variety of them by comma, dropping unneeded values using the pronoun _.
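
For comparison, the structure-based boilerplate that this feature replaces looks roughly like the following in plain C. The names minmax, find_minmax and find_max are hypothetical and chosen only for illustration; the sketch shows the C workaround, not C* syntax.

```c
#include <assert.h>
#include <stddef.h>

/* The struct a C programmer writes by hand to return two values. */
struct minmax {
    int lo;
    int hi;
};

/* Returns the smallest and largest element of a non-empty array. */
struct minmax find_minmax(const int *v, size_t n)
{
    struct minmax r;
    size_t i;
    assert(n > 0);
    r.lo = v[0];
    r.hi = v[0];
    for (i = 1; i < n; ++i) {
        if (v[i] < r.lo) r.lo = v[i];
        if (v[i] > r.hi) r.hi = v[i];
    }
    return r;
}

/* A caller that wants only one value ignores the other member,
 * which is the job the _ pronoun does at a C* call site. */
int find_max(const int *v, size_t n)
{
    return find_minmax(v, n).hi;
}
```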

Statement suites

C* provides a high-level semantic parallelism with what are called statement suites, or simply suites. This is a maximisation of C's lack of ordering of expression evaluation (not to be confused with order of operator precedence): entire statements can be conjoined or "separated" using commas , instead of semicolons ;, destroying the ordering of their execution in the program semantics and allowing the statements to be executed in arbitrary time (ergo, in any order or all at once). Naturally, this precludes routines so grouped from having any data interdependency, so one cannot use the output or parameters of one function to feed another in the same suite. Despite this, statement suites prove to be the fundamental building block of fine-grained parallel computing in C* – they are conceptually analogous to the machinations of VLIW processors that dispatch several orders of logic at once.

int f1( void );
int, int f2( void );
int f3( void );

/* ANSI C approach: the compiler must guess it is parallelisable */

f1( ); f2( ); f3( );

/* C* statement suite approach: we say these can happen in any order */

f1( ), f2( ), f3( );

To cope with ambiguity in other situations where commas are used, one can use parentheses to disambiguate, like so:

/* capturing multiple return values */
f1( ), (a, b) = f2( ), f3( );

/* the parentheses of routine calls also disambiguate */
void g1( int, int );
g1( f1( ), f3( ) );
/* temporarily storing return values is necessary to forward multiple
 * return values into later subroutine calls */

a, b = f2( );
g1( a, b ); /* OK */
g1( f2( ) ); /* error, g1( ) expects 2 arguments, got only the first
              * value from f2( ) */

Explicit inlining

C* models subroutine inlining in reverse of C and most other languages. Instead of dictating the intent to inline at the callee's site, it is instead dictated at the caller's. As long as the subroutine is within the total system, it can be inlined using this technique. Furthermore, it is desirable to have a feeble compiler that always inlines upon request, and never does so otherwise, the opposite of what most C compilers do with the inline keyword (ignore it). Experienced systems programmers know this all too well, and in the real world, profile-guided manual optimisation is the name of the game anyway. So, this is a tool for that kind of task.

To achieve this, C* uses the back tick symbol (`) to prefix the subroutine identifier at the call site, like so:

void foo( void )
{
    /* ... */
}

void bar( void )
{
    /* this calls a proper separate subroutine implemented elsewhere */
    foo( );

    /* this inlines foo( ) right here, always */
    `foo( );
}

Code transclusion

C has long struggled to cope with the problem of inline assembly code, given the diversity of architectures and dialects, as well as the lack of a viable path to standardisation. C* attempts to solve this with a feature it calls code transclusion. Observe:

void foo( void )
{
    /* ... */

    bar!( );
}

In this code, bar is a symbol resolved like other routines and data. However, an exclamation point ! is placed between the name and the parentheses to mark that this is not a routine call and carries none of the usual implications for calling conventions. Instead, the contents of bar are transcluded into the point in foo( ) where it appears, which some programmers might call "naked" assembly in the old parlance. Since transclusions can never take parameters or offer return values, their forward declarations are neither necessary nor permitted.

In practice, bar might be written in a proper assembly language source file, and integrated in the build step along with the C* source and other sources.

New operators

C* introduces a menagerie of new arithmetic and logical operators.

Name Symbol Variant Notes
three-way compare <=>   the return value of this comparator is balanced tri-state logic represented as an ephemeral enumeration of three values, corresponding to open (high-Z), low and high circuit states respectively
minimum <? <?= the assignment variant has short-circuit logic: if the destination variable is smaller, it is left unchanged; otherwise, it is set to the smaller incoming value
maximum >? >?= the assignment variant has short-circuit logic: if the destination variable is larger, it is left unchanged; otherwise, it is set to the larger incoming value
count leading zeroes (unary) ^?    
count trailing zeroes (unary) ?^    
population count (unary) ^^    
arithmetic (signed) shift right >>> >>>=  
rotate left <<< <<<=  
short-circuit logical AND assignment   &&= this kind of assignment statement only sets the left-hand side variable if its contents are nonzero (truthy)
short-circuit logical OR assignment   ||= this kind of assignment statement only sets the left-hand side variable if its contents are zero (falsey)
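
As a rough guide to what several of these operators compute, here are portable C reference versions for 32-bit operands. Every function name is hypothetical, since C* spells these as operators, and the result for a zero operand in the two zero-counting functions (32) is an assumption the table does not settle. The three-way compare and the &&= and ||= assignments are left out because their exact encodings and evaluation rules are not given in enough detail here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a <? b and a >? b */
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* x <?= y : x is written only when the incoming value is smaller */
static void min_assign(uint32_t *x, uint32_t y)
{
    assert(x != NULL);
    if (y < *x) *x = y;
}

/* x >?= y : x is written only when the incoming value is larger */
static void max_assign(uint32_t *x, uint32_t y)
{
    assert(x != NULL);
    if (y > *x) *x = y;
}

/* ^? : count leading zeroes (assumed to give 32 for x == 0) */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    uint32_t mask;
    for (mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) ++n;
    return n;
}

/* ?^ : count trailing zeroes (assumed to give 32 for x == 0) */
static unsigned ctz32(uint32_t x)
{
    unsigned n = 0;
    uint32_t mask;
    for (mask = 1u; mask != 0 && (x & mask) == 0; mask <<= 1) ++n;
    return n;
}

/* ^^ : population count, clearing the lowest set bit each round */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    for (; x != 0; x &= x - 1) ++n;
    return n;
}

/* <<< : rotate left; the shift count is reduced modulo 32 */
static uint32_t rotl32(uint32_t x, unsigned s)
{
    s &= 31u;
    return s ? (uint32_t)((x << s) | (x >> (32u - s))) : x;
}

/* >>> : arithmetic shift right, built from unsigned shifts so it does not
 * rely on how the C implementation treats >> on negative values */
static int32_t asr32(int32_t x, unsigned s)
{
    uint32_t u = (uint32_t)x;
    s &= 31u;
    if (x < 0) return (int32_t)(uint32_t)~(~u >> s);
    return (int32_t)(u >> s);
}
```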

Division and modulus

C* also overloads the meaning of both the division operator / and the modulus operator % in a way that maintains semantic compatibility with C. It uses the multiple return values feature of C* borrowed from Go to make the following semantic equivalences:

/* these all have the same effect */
a = x / y;
a, _ = x / y;
b = x % y;
b, _ = x % y;

a, b = x / y;
b, a = x % y;

This was done because it is virtually universal that division is performed as a single operation with two output values (the quotient and the remainder). It is prudent to have the language reflect that mechanical reality.

Additionally, C* also irons out the semantics of division and modulus, so that integer division will always round towards zero, and modulus will behave consistently so that the result always carries the sign of the second operand.
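
The two rules above can be illustrated in standard C. The div( ) function from <stdlib.h> already returns quotient and remainder as a pair, with the quotient truncated towards zero; in C the remainder then takes the sign of the first operand. The helper mod_sign_of_divisor is a hypothetical name showing one way to obtain a result carrying the sign of the second operand instead. Which of the two values C* delivers as b in a, b = x / y is not spelled out in this section, so the sketch shows both without claiming either.

```c
#include <assert.h>
#include <stdlib.h>

/* Quotient and remainder from one operation; the quotient is
 * truncated towards zero, as C99 and later require. */
static div_t divmod_trunc(int x, int y)
{
    assert(y != 0);
    return div(x, y);
}

/* Modulus whose nonzero result always has the sign of the divisor y. */
static int mod_sign_of_divisor(int x, int y)
{
    int r;
    assert(y != 0);
    r = x % y;  /* in C99 the sign of r follows the dividend x */
    if (r != 0 && ((r < 0) != (y < 0))) {
        r += y; /* shift into the divisor's sign range */
    }
    return r;
}
```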

Source encoding

The language requires all source code to be ASCII compliant in its raw form. No other encodings of source text are supported, with one exception for doc comments: inside what C* considers doc comments (that is, comments that begin with /** and end with */), non-ASCII octets are permitted and are ignored like the rest of the content of the comment. This makes it possible to encode UTF-8 text in comments, for example, which is important for non-English languages.
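
A checker for this rule could be sketched in C as follows. The function name source_is_ascii_clean is hypothetical, and the scanner is deliberately naive: it assumes that the sequence /** always opens a doc comment, so it does not account for that sequence appearing inside a string literal or an ordinary comment.

```c
#include <stddef.h>

/* Returns 1 if every octet >= 0x80 lies inside a doc comment
 * (one opened by the three characters "/**" and closed by the
 * two characters that end any C comment), otherwise returns 0. */
int source_is_ascii_clean(const unsigned char *src, size_t len)
{
    size_t i = 0;
    int in_doc = 0;

    while (i < len) {
        if (!in_doc) {
            if (i + 2 < len && src[i] == '/' &&
                src[i + 1] == '*' && src[i + 2] == '*') {
                in_doc = 1;
                i += 2;  /* leave the second '*' so an empty comment can close */
                continue;
            }
            if (src[i] >= 0x80) return 0;  /* non-ASCII outside a doc comment */
            ++i;
        } else {
            if (i + 1 < len && src[i] == '*' && src[i + 1] == '/') {
                in_doc = 0;
                i += 2;
                continue;
            }
            ++i;  /* anything goes inside the doc comment */
        }
    }
    return 1;
}
```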

Identifier limits

When C was originally standardised by ANSI in the 1980s, the standard came with some very conservative translation limits on symbols and other identifiers:

  • 31 significant initial characters in an internal identifier or a macro name
  • 6 significant initial characters in an external identifier
  • 511 external identifiers in one translation unit
  • 127 identifiers with block scope declared in one block
  • 1024 macro identifiers simultaneously defined in one preprocessing translation unit

In the 1999 update ratified by ISO, the limits were increased:

  • 63 significant initial characters in an internal identifier or a macro name
  • 31 significant initial characters in an external identifier
  • 4095 external identifiers in one translation unit
  • 511 identifiers with block scope declared in one block
  • 4095 macro identifiers simultaneously defined in one preprocessing translation unit

As Mike Kinghan explained on Stack Overflow[3]:

There weren't any pitchforks on the lawn of the ANSI C committee when it stipulated 6 initial significant characters for external identifiers. That meant a conforming compiler could be implemented on IBM mainframes; and it need not be one to which the PDP-11 assembler would be inadequate and need not be able to emit code that couldn't even be linked with Fortan 77. It was a wholly unsensational choice.

Moreover:

An IBM 3380E hard disc unit, 1985, had a capacity of 5.0GB and cost around $120K; $270K in today's money. It had a transfer rate of 24Mbps, about 2% of what my laptop's HD delivers. With parameters like that, every byte that the system had to store, read or write, every disc rotation, every clock cycle, weighed on the bottom line. And this had always been the case, only more so. A miser-like economy of storage, at byte granularity, was ingrained in programming practice and those short public symbol names was just one ingrained expression of it. The problem was not, of course, that the puny, fabulously expensive mainframes and minis that dominated the culture and the counsels of the 1980s could not have supported languages, compilers, linkers and programming practices in which this miserly economy of storage (and everything else) was tossed away. Of course they could, if everybody had one, like a laptop or a mobile phone. What they couldn't do, without it, was support the huge multi-user workloads that they were bought to run. The software needed to be excruciatingly lean to do so much with so little.

Doing so much with so little was a practical matter in the 1980s, and while it has been outmoded by uncritically functionalist programming styles today, we understand mechanicalism as this very same practice elevated to a principle. It does not matter that a Nexus smartphone is a hundred times faster and a hundred times cheaper today than a mainframe was in 1985; waste is still waste.

As the preeminent mechanicalist systems programming language, C* also imposes limits on symbols and other identifiers. Specifically:

  • up to 4 levels of domain hierarchy including the final symbol
  • 15 significant initial characters in an internal identifier or a macro name
  • 15 significant initial characters in an externally visible identifier
    • with a maximised domain hierarchy usage this makes the maximum "fully-qualified name" size 60 characters.
  • 255 identifiers with block scope declared in one block
  • 65535 external identifiers in one translation unit
  • 65535 macro identifiers simultaneously defined in one preprocessing translation unit
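
The naming limits can be made concrete with a small C check. The 60-character figure is 4 components of 15 characters each, which suggests the separating dots are not counted. The function name_within_limits and the two macros are hypothetical, and the sketch assumes a fully-qualified name is written as dot-separated components, as in the domain examples elsewhere in this article.

```c
#include <stddef.h>

#define CSTAR_MAX_LEVELS 4   /* domain levels, including the final symbol */
#define CSTAR_MAX_CHARS  15  /* characters allowed per component */

/* Returns 1 if a dot-separated name has at most 4 components of at
 * most 15 characters each, and no empty component; otherwise 0. */
int name_within_limits(const char *name)
{
    size_t levels = 1;
    size_t run = 0;
    const char *p;

    for (p = name; *p != '\0'; ++p) {
        if (*p == '.') {
            if (run == 0) return 0;  /* empty component */
            ++levels;
            run = 0;
        } else {
            ++run;
            if (run > CSTAR_MAX_CHARS) return 0;
        }
    }
    return run > 0 && levels <= CSTAR_MAX_LEVELS;
}
```

A leading dot, which globalises a call in C*, is rejected here as an empty component; a fuller checker would strip it first.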

Furthermore, C* imposes a somewhat stricter rule on the meaning of these limits: conforming implementations must not permit symbols or other identifiers that exceed the limits defined above. Interworking with foreign code is provided by the extern ABI feature.

extern ABI

While C* provides the extern linkage modifier as it exists in C, and implies it onto non-static functions as C does, it also provides a C++-like ABI specifier suffix to this keyword. Not only does this allow implementations to expose different symbol mangling regimes opaquely to the programmer, in C* it also serves as the veneer to incorporate long foreign symbols into C*'s constraints and, depending on the API design at hand, its domain module system. extern ABI symbols are exempted from the identifier limits imposed in the rest of the language; they can be as long as the compiling machine's memory permits. At a minimum, conforming compilers must support the extern "C" ABI, but may opt to support others, such as their default C++ ABIs.

Work to be done

A comprehensive ABI

C* strives to provide the maximum possible power to exactly specify the function of a system. While there are many facilities for this "within the reservation", so to speak, much conceptual work still needs to be done about the Application Binary Interface in the popular sense of the term. Building up Oración should help finalise a comprehensive solution to this, so that it is easy for C* programmers to exactly specify the lingua franca of their programs in a machine-agnostic way.

What C* is not

Otherwise known as "criticisms from the dustbin", this is to be a collection of common criticisms and my answers to them. Programming language theory has been a notorious hotbed of intellectual rot, so creating a kind of critic's FAQ will help immensely in pre-empting the handful of questions that will no doubt be asked a thousand times over before it is all said and done. There are very good reasons for why everything in C* is the way that it is.

Glossary

A glossary of terms can help readers familiarise themselves with the radically different approach that C* takes in dealing with computing and systems theory. It can also serve as a stimulus for further expansion on such topics by the writers.

References

  1. "Some were meant for C: the endurance of an unmanageable language." Association for Computing Machinery. Retrieved 2024-02-01.
  2. "ANSI Standards Action Vol. 36, #48" (PDF). American National Standards Institute. 2005-12-02. Archived from the original on 2016-03-04. Retrieved 2009-08-06.
  3. "Why did ANSI only specify six characters for the minimum number of significant characters in an external identifier?". Stack Overflow. 2016-06-26. Archived from the original on 2024-11-14. Retrieved 2024-11-14.