Dienstag, 8. Juli 2008

An analysis-friendly representation

As mentioned in my last post, C's declaration syntax is painful if you want to perform more complicated analysis tasks directly on the AST.

To illustrate the problem, consider for example the file scope declaration
static struct s { union b { int a,b; } z; } x(), *y;
There are several things mixed up here: a prototype of a function x, a definition of a pointer object y and the definition of two (file scope) composite types struct s and union b. If you're working with the syntax tree returned by the parser, detangling those things becomes a major headache.
For this reasons, CIL takes a quite radical approach, and makes you work on a semantic representation only, which eliminates many subtleties of the language.
Language.C will probably take the following approach:
SemanticRepr <= analyze-generate => AST <= parse-prettyprint => C-Source Code
We still expose the parser's AST and the corresponding pretty printer to the user, for those who need to work with or modify the syntax directly. Additionally, there is a semantic representation now, which gets rid of the syntactic oddities, has stronger types and is simpler to analyze.
The rational behind this decision is that
  • Users still have access to a full-scale C parser
  • Users still have fine grained pretty-printer control.
  • Extracting properties from the C code becomes much easier in the semantic representation 
  • Using generic programming strategies is feasible in the semantic representation
One thing that is tricky to solve is the handling of anonymous structures and unions - consider the declaration
struct { int a; } b,*c;
This declaration introduces an anonymous struct type, which is shared by the objects b and c, but cannot be defined on its own. To re-export such a declaration to C you either have to keep track of all objects which share the same anonymous type, or simply give a name to anonymous types, as CIL does.
Anyway, this approach looks promising and will hopefully make the library a good tool of choice for both syntax and semantic oriented C tasks.
There a still some open questions though:
  • Which source code locations should be kept in the semantic representation ? It is nice if we're able to point at the problematic source code location, but for some transformed entities this is no longer possible.
  • Should we try to share some parts of the AST and the semantic representation ? I believe we should design simplified representations of statements and expressions instead. 
  • What is the best way to link references and definitions of objects, functions and types ? Currently, identifiers are linked using namespace lookup maps, adopted from Manuel's c2hs framework. 
Development Notes:
  • Designed a semantic representation for declaration and types.
  • Wrote portions of the AST to semantic-representation translation, including normalization of type names.
  • happy's --strict flag does a great job improving the space performance of the parser (120M down to 60M on a problematic file)
  • To reduce the necessary boilerplate work, I'm now using (offline) Data.Derive to generate some boring instances (source code locations).

Keine Kommentare: