Quasi - a tool for quasi-literate programming

Daniel Bradley

www.crossadaptive.com

Copyright

Copyright 2011, Daniel Robert Bradley.

Last updated

31 October 2011

License

This software is released under the terms of the GPLv3.

Download from:

http://www.quasi-literateprogramming.org/download/Quasi/quasi-src-1.0.7.tar.bz2

Source

View source:

http://www.quasi-literateprogramming.org/mtx/quasi.mtx.txt

Disclaimer

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Introduction

Donald Knuth coined the term "literate programming" to refer to a programming approach whereby a programmer develops a program "in the order demanded by logic and flow of their thoughts" [1]. Rather than produce source code that is commented with textual descriptions, a textual description is produced that describes the structure and semantics of code chunks embedded within the prose.

Tools can then be used to produce reader-friendly documentation woven from the source, as well as an executable/compilable tangled form. Knuth's original tool was called "Web" [2]; other, language-agnostic tools have since been developed [3].

The following code fragment from the literate programming Wikipedia page demonstrates how the "web" system works [1]. The text '<<Scan file>>=' defines a macro that is associated with the code that follows it.

	<<Scan file>>=
	while (1) {
		<<Fill buffer if it is empty; break at end of file>>
		c = *ptr++;
		if ( c > ' ' && c < 0177 ) {
			/* visible ASCII codes */
			if ( !in_word) {
				word_count++;
				in_word = 1;
			}
			continue;
		}
		if ( c == '\n' ) line_count++;
		else if ( c != ' ' && c != '\t') continue;
		in_word = 0;
			/* c is newline, space, or tab */
	}
	@

The macro '<<Scan file>>' could then be used in any other code chunk.

A problem with such an approach is that a reader may believe they fully understand the code they are reading while failing to notice a specific interaction between code chunks. To be sure they properly understand the interactions within the system, the reader would need to consult the tangled code.

A related problem is that there is no limitation on how macros are used, allowing code to be intermixed in arbitrary ways. Software developed using such a system may become increasingly hard to maintain as others are forced to edit the source files.

This document describes Quasi, a tool for quasi-literate programming. It has been developed in the spirit of Knuth's literate programming but, by providing a far less powerful tool, it also simplifies the process from the perspective of a maintenance programmer.

Background

While this tool was inspired by literate programming, it is derived from an earlier tool called "extract" that is used for extracting SQL definitions from web-application requirements documents for use in database initialisation scripts. The tool would scan text files and extract the text of pre-formatted sections that matched a user supplied pattern.

For example, this command would extract the following block of SQL:

	extract -p "users_table" source_file.txt >> output_file.sql
	~users_table~
	CREATE TABLE users
	(
	USER        INT(11)  NOT NULL AUTO_INCREMENT,
	PRIMARY KEY (USER)
	);
	~

This allowed the definitions of SQL tables and SQL Stored Procedures to be directly developed and documented within a requirements document. This provided the motivation for developing a similar tool to also extract source code from documentation.

Concept

Similar to "extract", "quasi" extracts sections of pre-formatted text from documentation and appends it to target text files. Unlike "extract", rather than matching a supplied pattern, the identifier in the pre-formatted text section is used as the file path of the target file relative to a user supplied base directory.

For example, this command would extract the following block of source code and append it to the file 'source/c/quasi.c':

	quasi source source/mtx/quasi.mtx
	~c/quasi.c~
	int main( int argc, char** argv )
	{
		return 0;
	}
	~

The tool sanitises filenames, ensuring that parent-directory ('..') components are not included and therefore that output files remain under the specified base directory. If the specified base directory already exists, the tool exits with an error unless the '-f' flag is passed as the first command argument.

	quasi -f source source/mtx/quasi.mtx

If the identifier of the pre-formatted block is prefixed by an exclamation mark, the file is truncated on opening. When a file is truncated in this manner, it is advisable that the code fragment be a comment warning that the file is generated:

	~!c/quasi.c~
	/*   !!!   Warning this file is auto-generated   !!!   */
	~

Quasi is implemented to process text files that use the MaxText text format [4]. If code fragments are not appropriate for the output documentation they can be commented using the standard MaxText commenting character, causing them to be ignored by MaxText, but still be processed by Quasi. This is useful for hiding code comments, or perhaps includes.

	!
	Include various standard includes.

	~!c/quasi.c!~
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	~
	!

The key difference between literate programming tools and Quasi is that Quasi forces the programmer to construct each target source file in a linear fashion; separate files may still be constructed in parallel. An additional expected benefit of this approach is that it encourages programmers to better modularise their software, as there is very little overhead in creating new files.

Example: quasi implementation

Quasi has now been reimplemented in pure C to maximise portability, allowing it to act as the foundation of an organisation's development tool set.

/*   !!!   Warning generated from mtx source files   !!!   */

Invocation

To ensure simplicity of implementation, Quasi is invoked with a simple command-line.

	quasi [-f] BASE_DIR INPUT_FILES...

The BASE_DIR argument specifies the directory that target files are created relative to -- if BASE_DIR already exists, Quasi exits returning an error, unless the '-f' flag is passed as the first argument. After BASE_DIR, one or more input files are specified for parsing.

Overview

The function declarations below show the overall structure of Quasi. First the arguments are processed (processArguments) and, if valid, the base directory is verified (canAccessDirectory) and created if needed. The command-line arguments that name source files are then processed (processSourceFiles), one file at a time (processFile). During processing, the output files are opened and closed (rejig) as necessary. Each time a target file is opened, the filename must be appropriately sanitised (generateSafeFilepath) and it must be determined whether to truncate the file or not (doWeTruncate).

int        processArguments( int argc, const char** argv );
int      canAccessDirectory( const char* baseDir, int force );
int      processSourceFiles( const char* baseDir, int first, int last, const char** files );
int             processFile( const char* baseDir, const char* sourceFile );
FILE*                 rejig( FILE* out, const char* baseDir, const char* line );
char*  generateSafeFilepath( const char* basedir, const char* line );
int            doWeTruncate( const char* line );

If inappropriate arguments are passed, Quasi prints a usage message (usage) and exits with an error. Similarly, if the base directory already exists and the '-f' flag was not passed, Quasi prints an error message (errorDirectoryExists) and exits with an error.

int                   usage();
int    errorDirectoryExists();

The following utility functions are also used (these are described in the appendix).

int       createDirectories(       char* safeFilePath );
int        directory_exists( const char* path );
char*       parentDirectory( const char* filepath );
char*              readline( FILE* stream );
char*            stringCopy( const char* aString );

During argument processing, the following global variables are initialised. FORCE is initialised as true (1) if the '-f' flag was passed; FIRST is initialised to indicate the first source file argument in argv; and BASE_DIR is initialised to the base directory argument.

int         FORCE;
int         FIRST;
const char* BASE_DIR;

The main function calls the appropriate functions as needed.

int main( int argc, const char** argv )
{
	int status = 0;

	if ( processArguments( argc, argv ) )
	{
		if ( canAccessDirectory( BASE_DIR, FORCE ) )
		{
			int last = argc - 1;
			status = processSourceFiles( BASE_DIR, FIRST, last, argv );
		}
		else
		{
			status = errorDirectoryExists();
		}
	}
	else
	{
		status = usage();
	}
	return !status;
}

Argument processing

If the minimum expected number of arguments is supplied, the arguments are processed. If the force argument is supplied, the global variable FORCE is set to true (1); then the global variable BASE_DIR is initialised to the next argument. Finally, if there is at least one file argument remaining, the global variable FIRST is initialised to identify it and the function returns true (1).

int processArguments( int argc, const char** argv )
{
	int status             = 0;
	int expected_arguments = 3;
	int i                  = 1;

	if ( argc >= expected_arguments )
	{
		if ( 0 == strcmp( argv[i], "-f" ) )
		{
			expected_arguments++;
			i++;
			FORCE = 1;
		}
		BASE_DIR = argv[i]; i++;
		FIRST    = i;
		status   = ( argc >= expected_arguments );
	}
	return status;	
}
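
The same parsing rules can also be sketched without global variables, which makes them easier to exercise in isolation. The following is only an illustrative sketch, not part of Quasi: the Options struct and the parseOptions function are hypothetical names, and the sketch assumes the command line 'quasi [-f] BASE_DIR INPUT_FILES...' described above.

```c
#include <string.h>

/* Hypothetical container for the values Quasi keeps in the globals
   FORCE, FIRST and BASE_DIR. */
typedef struct
{
	int         force;    /* 1 if '-f' was the first argument      */
	int         first;    /* index in argv of the first input file */
	const char* baseDir;  /* the base directory argument           */
} Options;

/* Returns 1 and fills 'opts' if the command line is valid, otherwise 0. */
int parseOptions( int argc, const char** argv, Options* opts )
{
	int i = 1;

	opts->force   = 0;
	opts->first   = 0;
	opts->baseDir = NULL;

	if ( (i < argc) && (0 == strcmp( argv[i], "-f" )) )
	{
		opts->force = 1;
		i++;
	}

	/* A base directory and at least one input file must remain. */
	if ( argc - i < 2 ) return 0;

	opts->baseDir = argv[i++];
	opts->first   = i;
	return 1;
}
```

With this shape, 'quasi -f source a.mtx' yields force set, baseDir "source" and first equal to 3, while 'quasi -f source' is rejected because no input file remains.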

Accessing the base directory

First, an attempt is made to create the base directory. True (1) is returned if this succeeds, or if the directory already existed and FORCE is true. Otherwise, the method returns false (0).

int canAccessDirectory( const char* baseDir, int force )
{
	int status = 0;

	if ( 0 == mkdir( baseDir, 0755 ) )
	{
		status = 1;
	}
	else
	{
		switch ( errno )
		{
		case EEXIST:
			status = force;
			break;
		}
	}
	return status;
}

Processing the source files

For each source file, the processFile function is called passing the baseDir and the source file.

int processSourceFiles( const char* baseDir, int first, int last, const char** files )
{
	int status = 1;
	int i;
	for ( i=first; i <= last; i++ )
	{
		status &= processFile( baseDir, files[i] );
	}
	return status;
}

Processing a file

This procedure processes an individual source file. The file stream in is opened for the duration of the procedure, while the out file stream is only opened during the processing of a pre-formatted text block.

The procedure reads lines from the in stream using the "readline" procedure. When a tilde (~) character is encountered, the system either opens, or closes, out by calling the "rejig" function, which rejigs the out stream. Each time a stream is closed a blank line is printed to the stream -- this allows the source to have spaces between chunks, while not having kludge whitespace in pre-formatted text blocks.

When a tilde character doesn't start the line and the out stream is an open (not NULL) stream, the line is written out to the stream.

If the out file stream is not null when the loop exits, it indicates that the previous pre-formatted block wasn't closed properly. This causes a warning message to be printed to stderr and the function returns false (0).

int processFile( const char* baseDir, const char* sourceFile )
{
	int status = 0;
	fprintf( stdout, "Processing: %s\n", sourceFile );

	FILE* in = fopen( sourceFile, "r" );
	if ( in )
	{
		FILE* out = NULL;

		char* line;
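		while ( (line = readline( in )) )
		{
			if ( '~' == line[0] ) {
				/* a tilde line ends the current block and may start a new one */
				if ( out ) fprintf( out, "\n" );
				out = rejig( out, baseDir, line );
			}
			else if ( out )
			{
				fprintf( out, "%s", line );
			}
			free( line );	/* readline allocates every line it returns */
		}
		fclose( in );

		/* success only if the last block was closed */
		status = (out == NULL);
		if ( out )
		{
			fclose( out );
			fprintf( stderr, "Warning: %s is unmatched '~'\n", sourceFile );
		}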
	}

	return status;
}
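
The open/close behaviour described above can be viewed as a small state machine driven by lines that start with a tilde. The sketch below isolates that logic from file handling. It is not Quasi's code: countBlockLines and opensBlock are hypothetical helpers, and opensBlock is a deliberately simplified stand-in for "rejig" that only checks for a period in the header line.

```c
#include <string.h>

/* Simplified stand-in for rejig: a header such as "~c/quasi.c~" opens a
   block, while a bare "~" (or a tag without a period) leaves it closed. */
static int opensBlock( const char* line )
{
	return ('~' == line[0]) && (NULL != strchr( line, '.' ));
}

/* Counts the lines that would be copied to target files.
   Sets *unterminated to 1 if the last block is never closed. */
int countBlockLines( const char** lines, int n, int* unterminated )
{
	int inBlock = 0;
	int copied  = 0;
	int i;

	for ( i = 0; i < n; i++ )
	{
		if ( '~' == lines[i][0] )
		{
			/* every tilde line ends the current block; it may start a new one */
			inBlock = opensBlock( lines[i] );
		}
		else if ( inBlock )
		{
			copied++;
		}
	}
	*unterminated = inBlock;
	return copied;
}
```

Prose lines and the contents of blocks whose header does not name a file are skipped, and a document that ends inside a block is reported through the unterminated flag, mirroring the warning case above.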

Rejig file output

This function closes out, then if the passed line contains a valid file name, opens and returns a new file stream. First, if "generateSafeFilepath" indicates a valid file name, any necessary directories are created, then the file is either opened or created. If "doWeTruncate" returns true, the file is truncated on open.

FILE* rejig( FILE* out, const char* basedir, const char* line )
{
	FILE* ret = NULL;

	if ( out ) fclose( out );

	char* safeFilePath = generateSafeFilepath( basedir, line );
	if ( safeFilePath && createDirectories( parentDirectory( safeFilePath ) ) )
	{
		if ( doWeTruncate( line ) )
		{
			ret = fopen( safeFilePath, "w" );
		}
		else
		{
			ret = fopen( safeFilePath, "a" );
		}
	}
	free( safeFilePath );

	return ret;
}

Generation of the safe filepath

The "generateSafeFilepath" procedure attempts to produce a safe file path by combining the passed base-dir with a file path extracted from the passed line. The passed line starts with a tilde character but may have anything else as well.

First, the line is checked to make sure it doesn't include '..', which could potentially reference a directory above the base dir. Next, the line is checked to ensure it includes a period, '.', either indicating a relative directory path, or the beginning of a file type suffix -- this is to avoid MaxText pre-formatted text tags. Then, 'strtok' is used to tokenise the line using the tilde as a delimiter. The first token returned is treated as the filepath -- after skipping any leading exclamation mark, it is checked to make sure it starts with an alphanumeric character (indicating an appropriate filename), then strtok is called again to verify that the token is followed by another tilde.

If there are any problems NULL is returned.

char* generateSafeFilepath( const char* basedir, const char* line )
{
	char* full = NULL;

	int len = strlen( basedir ) + strlen( line ) + 1;
	if ( NULL == strstr( line, ".." ) )
	{
		if ( NULL != strstr( line, "." ) )
		{
			char* test  = stringCopy( line );	
			char* token = strtok( test, "~" );

			if ( token && ('!' == token[0]) ) token++;

			if ( token && isalnum( (unsigned char) token[0] ) )
			{
				full = calloc( len, sizeof( char ) );
				strcpy( full, basedir );
				strcat( full, "/" );
				strcat( full, token );

				if ( NULL == strtok( NULL, "~" ) )
				{
					free( full );
					full = NULL;
				}
			}
			free( test );
		}
	}
	return full;
}
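
To make the acceptance rules concrete, the checks can be restated without strtok. The following is a hedged sketch rather than Quasi's implementation: safePathFromHeader is a hypothetical name, and, unlike the function above, it only requires a closing tilde and does not require any text after it.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of generateSafeFilepath that avoids strtok.
   Returns a newly allocated "basedir/path" string, or NULL if the
   header line is rejected. The caller frees the result. */
char* safePathFromHeader( const char* basedir, const char* line )
{
	const char* start;
	const char* end;
	char*       full;
	size_t      baseLen;
	size_t      pathLen;

	if ( '~' != line[0] )               return NULL;  /* not a header      */
	if ( NULL != strstr( line, ".." ) ) return NULL;  /* parent directory  */
	if ( NULL == strchr( line, '.' ) )  return NULL;  /* plain MaxText tag */

	start = line + 1;
	if ( '!' == *start ) start++;                     /* truncation marker */
	if ( !isalnum( (unsigned char) *start ) ) return NULL;

	end = strchr( start, '~' );                       /* closing tilde     */
	if ( NULL == end ) return NULL;

	baseLen = strlen( basedir );
	pathLen = (size_t)( end - start );
	full    = malloc( baseLen + 1 + pathLen + 1 );
	if ( NULL == full ) return NULL;

	memcpy( full, basedir, baseLen );
	full[baseLen] = '/';
	memcpy( full + baseLen + 1, start, pathLen );
	full[baseLen + 1 + pathLen] = '\0';
	return full;
}
```

Under these rules '~c/quasi.c~' and '~!c/quasi.c~' both map to 'source/c/quasi.c' for a base directory of 'source', while '~../etc/passwd~', '~html~' and a header with no closing tilde are rejected. A header that begins with '/' is also rejected, because the first character must be alphanumeric.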

Truncation

The output file is truncated on open if the target file name is preceded by an exclamation mark.

int doWeTruncate( const char* line )
{
	int truncate = 0;
	if ( 2 < strlen( line ) )
	{
		truncate = ('!' == line[1]);
	}
	return truncate;
}

Error messages

Incorrect arguments

If the program is invoked without the appropriate arguments the following usage message is printed to stderr.

	Usage:
		quasi [-f] BASE_DIR INPUT_FILES

int usage()
{
	const char* ch = "Usage:\n\t quasi [-f] BASE_DIR INPUT_FILES";
	fprintf( stderr, "%s\n", ch );
	return 0;
}

Directory already exists

If the force ('-f') flag hasn't been passed, and the base directory already exists, the following error message is printed to stderr.

	Error: directory already exists, or cannot be created!

int errorDirectoryExists()
{
	const char* ch = "Error: directory already exists, or cannot be created!";
	fprintf( stderr, "%s\n", ch );
	return 0;
}

Future Work

In the future, other open source projects, including MaxText and Build, will be rewritten using Quasi and released.

Appendix A

Quasi uses the following generic auxiliary functions.

Create directories

This is a recursive procedure that finds the first existing directory, then unwinds creating the necessary directories.

int createDirectories( char* dir )
{
	int success = 1;
	if ( ! directory_exists( dir ) )
	{
		success = 0;	/* stays 0 if the parent cannot be created */
		if ( createDirectories( parentDirectory( dir ) ) )
		{
			success = ! mkdir( dir, 0755 );
		}
	}
	free( dir );
	return success;
}
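
For comparison, the same effect can be achieved iteratively by creating each prefix of the path in turn. This is a minimal sketch under stated assumptions, not part of Quasi: makePath is a hypothetical name, and the code assumes a POSIX system providing mkdir and stat.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical iterative alternative to createDirectories: creates each
   prefix of 'path' in turn. Returns 1 if 'path' is a directory on return,
   otherwise 0. Unlike createDirectories, it does not free its argument. */
int makePath( const char* path )
{
	size_t len  = strlen( path );
	char*  copy = malloc( len + 1 );
	size_t i;
	int    ok = 1;

	if ( NULL == copy ) return 0;
	memcpy( copy, path, len + 1 );

	/* start at 1 so that a leading '/' is never cut off on its own */
	for ( i = 1; ok && (i <= len); i++ )
	{
		if ( ('/' == copy[i]) || ('\0' == copy[i]) )
		{
			char saved = copy[i];
			copy[i] = '\0';                  /* temporarily cut the path here */
			if ( (0 != mkdir( copy, 0755 )) && (EEXIST != errno) ) ok = 0;
			copy[i] = saved;
		}
	}

	if ( ok )
	{
		/* EEXIST is also reported for regular files, so confirm the result */
		struct stat st;
		ok = (0 == stat( path, &st )) && S_ISDIR( st.st_mode );
	}

	free( copy );
	return ok;
}
```

The recursive version walks up to the first existing ancestor before creating anything, whereas this sketch simply attempts every prefix and treats an "already exists" error as success.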

Directory exists

A simple procedure that determines whether a directory exists or not.

int directory_exists( const char* path )
{
	int status = 0;
	DIR* dir = opendir( path );
	if ( dir )
	{
		closedir( dir );
		status = 1;
	}
	return status;
}

Parent directory

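A simple wrapper around "dirname" that allocates and returns a string. Note: dirname may modify its argument, and may return a pointer into that argument or into its own static storage, so the following cannot be used on the const filepath:

	return stringCopy( dirname( filepath ) );

char* parentDirectory( const char* filepath )
{
	/* work on a scratch copy, then duplicate whatever dirname returns */
	char* scratch = stringCopy( filepath );
	char* ret     = stringCopy( dirname( scratch ) );
	free( scratch );
	return ret;
}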

Read line

The "readline" procedure reads individual characters into a character buffer -- each character is appended to the char string "line". When a newline character is encountered "line" is returned. When the stream is empty a NULL is returned.

If the line is longer than 1023 characters the buffer "line" is doubled in size using "realloc".

It was easier to implement this than to worry about portability: from memory, the equivalent POSIX function is implemented differently on different systems.

char* readline( FILE* stream )
{
	int  n     = 0;
	int  sz    = 1024;
	char ch[2] = { 0, 0 };
	char* line = calloc( sz, sizeof( char ) );

	int read;
	do
	{
		read = fread( ch, sizeof(char), 1, stream );
		if ( read )
		{
			switch ( *ch )
			{
			case '\n':
				line[n++] = *ch;
				line[n]   = '\0';
				read      = 0;
				break;
			default:
				line[n++] = *ch;
				line[n]   = '\0';
			}

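			/* grow while line[n] is still in bounds, so that the next
			   character and its terminator always fit */
			if ( n + 1 == sz )
			{
				sz  *= 2;
				line = realloc( line, sz );
			}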
		}
		
	}
	while ( 0 != read );

	if ( 0 == n )
	{
		free( line );
		line = NULL;
	}

	return line;
}
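
As an alternative that stays within standard C, the same contract can be sketched on top of fgets, reading a chunk at a time rather than one character at a time. This is an illustrative sketch, not Quasi's code: readLineChunked is a hypothetical name, the initial buffer size of 128 is arbitrary, and lines containing embedded NUL characters are not handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fgets-based variant of readline. Returns a newly allocated
   line including its trailing newline (if any), or NULL at end of file or
   if memory cannot be allocated. The caller frees the result. */
char* readLineChunked( FILE* stream )
{
	size_t sz   = 128;
	size_t n    = 0;
	char*  line = malloc( sz );

	if ( NULL == line ) return NULL;
	line[0] = '\0';

	/* fgets stores at most (sz - n - 1) characters plus the terminator */
	while ( fgets( line + n, (int)( sz - n ), stream ) )
	{
		n += strlen( line + n );
		if ( (n > 0) && ('\n' == line[n - 1]) ) break;   /* complete line */

		if ( n + 1 == sz )                               /* buffer full: grow */
		{
			char* bigger = realloc( line, sz * 2 );
			if ( NULL == bigger ) { free( line ); return NULL; }
			line = bigger;
			sz  *= 2;
		}
	}

	if ( 0 == n )
	{
		free( line );
		line = NULL;
	}
	return line;
}
```

As with "readline", a final line without a trailing newline is still returned, and NULL is returned only once the stream is exhausted.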

String copy

Returns a copy of the passed string.

char* stringCopy( const char* aString )
{
	char* copy = calloc( strlen( aString) + 1, sizeof( char ) );
	strcpy( copy, aString );
	return copy;
}

Bibliography

[1] Wikipedia: literate programming. http://en.wikipedia.org/wiki/Literate_programming
[2] The CWEB System of Structured Documentation http://www-cs-faculty.stanford.edu/~uno/cweb.html
[3] Noweb - A Simple, Extensible Tool for Literate Programming http://www.cs.tufts.edu/~nr/noweb/
[4] MaxText will be released publicly soon.