Pointerfull Haskell

Montag, 21. Juli 2008

Language.C: Analysing Definitions

Finally I have a (more or less) working implementation for analysing declarations and definitions in Language.C.
From a practical point of view this means one can now easily perform file-scope analysis, e.g. to analyze header files.

As the API needs some more revisions, and there are a few bugs to hunt left, I'll rather present an working example now, and defer explanations to the next post.

Consider the following C file


typedef struct t { char x; short y; } __attribute__((packed)) T;
union u { T x; T* y; };
struct s {
 struct k { short b1 : 8, b2: 9, b3: 8, b4 : 7;} x;
 union u a,*b;
};

We'd like to know which composite types (struct or union) are defined in this file, and how many bytes each of them occupies on our system.
Instead of guessing we'll ask gcc, in the same way as autoconf does.


> module Main where
> import System.Environment ; import System.IO
> import Control.Arrow      ; import Control.Monad
> import Data.Map (Map)     ; import qualified Data.Map as Map
> import Data.Maybe         ; import Data.Function (fix)
> import Data.Generics

The public API of Language.C consists of a just a few modules, and Language.C.hs itself exports everything needed for most tasks


> import Language.C                 -- Language.C.{Syntax,Pretty,Parser,InputStream}
> import Language.C.Analysis        -- the new analysis modules
> import Language.C.System.GCC      -- preprocessor used
> import Language.C.Analysis.Export -- [alpha, hence an extra import]

The overall strategy will be the following: We'll parse and analyze the file, get the definitions of structs and unions we're interested in, replace typedefs and enums in the type definitions, and then generate code for gcc.


> main :: IO ()
> main = do
>     let usage = error "Example Usage: ./ComputeSize -I/usr/include my_file.c"
>     (opts,c_file) <- liftM (init &&& last) getArgs
>
>     let compiler = newGCC "gcc"
>     ast <- parseFile compiler Nothing opts c_file >>= checkResult "[parsing]"
>
>     (globals,warnings) <- (runTrav_ >>> checkResult "[analysis]") $ analyseAST ast
>     mapM (hPutStrLn stderr . show) warnings
>
>     print $ pretty (generateSizeTests c_file globals)
>     where
>     checkResult :: (Show a) => String -> (Either a b) -> IO b
>     checkResult label = either (error . (label++) . show) return

To generate the C code, we get the composite type definitions we're interested in and those we need to compute the size. Then we define the types and finally emit the main function.


> generateSizeTests :: FilePath -> GlobalDecls -> CTranslUnit
> generateSizeTests c_file globals =
>       translUnit $
>          map defineComp referenced_comps ++
>          [ genSizeTest (Map.elems comps_of_interest) ]
>     where
>     comps = Map.mapMaybe fromComp (gTags globals)
>     comps_of_interest  = compsOfInterest c_file comps
>     referenced_comps   = computeRefClosure comps comps_of_interest
>     fromComp (CompTag struct_union) = Just struct_union
>     fromComp (EnumTag _) = Nothing
>     translUnit = flip CTranslUnit intn
> intn = mkUndefNodeInfo

We're only interested in those definitions in the given file which have a name

> compsOfInterest :: FilePath -> Map SUERef CompType -> Map SUERef CompType
> compsOfInterest c_file = Map.filterWithKey isNamed . Map.filter isInCFile
>     where
>     isInCFile = ((==c_file) . fileOfNode)
>     isNamed sue_ref _ =  case sue_ref of AnonymousType _ -> False; _ -> True

computeRefClosure is a little more complicated, so we'll continue with defining the composite types. As an enumeration always has int type, we simply replace its type with TyInt; furthermore we dereference all typedefs, using Data.Generics (everywhere).


> defineComp :: CompType -> CExtDecl
> defineComp ty = CDeclExt (CDecl (map CTypeSpec (exportCompType $ derefTypeDefs ty)) [] intn)
>    where
>    derefTypeDefs ty = everywhere (mkT derefTypeDef `extT` replaceEnum) ty
>    derefTypeDef (TypeDefType (TypeDefRef _ (Just ty) _)) = ty
>    derefTypeDef ty = ty
>    replaceEnum _ = TyIntegral TyInt
>    replaceEnum dty = dty

Generating the main file without hacking would currently be to much work, as there isn't much support for code generation.
Quasi-Quoting would really help here - for now we `simulate' it using pretty and parse...


> genSizeTest :: [CompType] -> CExtDecl
> genSizeTest tys = either (error.show) fromExtDecl $
>                   parseC (inputStreamFromString test) (Position "autogenerated" 1 1)
>     where
>     fromExtDecl (CTranslUnit [decl] _ ) = decl
>     fromExtDecl (CTranslUnit decls _) = error $ "Expected declaration" 
>     test = "int main() {" ++ concatMap checkSize tys ++ "}"
>     checkSize (CompType sue_ref tag _ _ _) =
>         let tag_str = show tag ++ " " ++ show sue_ref in
>         "printf(\""++ tag_str ++": %d\\n\",sizeof(" ++ tag_str ++ ")); ";

Running this program on the above input results in the following C file


struct k {
    short b1 : 8; short b2 : 9; short b3 : 8; short b4 : 7;
};
struct __attribute__((packed)) $1749 {
    char x; short y;
};
union u {
    __attribute__((packed)) struct $1749 x;
    __attribute__((packed)) struct $1749 * y;
};
struct s {
    struct k x; union u a; union u * b;
};
int main()
{
    printf("struct k: %d\n", sizeof(struct k));
    printf("struct s: %d\n", sizeof(struct s));
    printf("struct t: %d\n", sizeof(struct t));
    printf("union u: %d\n", sizeof(union u));
}

Now we can use gcc to compute the size of the structs:


/ComputeSize compute_size.c | gcc -x c -o compute_size - && ./compute_size;
struct k: 6
struct s: 16
struct t: 3
union u: 4

The missing computation of directly nested structs and unions concludes:


> computeRefClosure :: Map SUERef CompType -> Map SUERef CompType -> [CompType]
> computeRefClosure all_comps initial_comps =
>     fixCont addReferenced ([], Map.elems initial_comps, (Map.empty,Map.empty))
>     where
>     fixCont f = fix $ \close args ->
>         let args'@(result',todo',_) = f args 
>         in (if null todo' then reverse result' else close args')
>     addReferenced (result,[],ms) = (result,[],ms)
>     addReferenced (result,(t:ts),(visit,enter)) 
>        | Map.member (sueRef t) enter = (result,ts,(visit,enter))
>        | Map.member (sueRef t) visit = (t:result,ts,(visit,Map.insert (sueRef t) t enter))
>        | otherwise = let refd = referenced t
>                      in (result, refd++(t:ts), (Map.insert (sueRef t) t visit,enter))
>     referenced (CompType _ _ members _ _) = mapMaybe getRefdComp members
>     getRefdComp memberDecl = fromDirectType (declType memberDecl) >>= fromCompTy
>     fromCompTy (TyComp (CompTypeDecl ref _ _))
>         | (Just r) <- Map.lookup ref all_comps = Just r 
>         | otherwise = error $ "Internal Error: Could not find definition for "++show ref
>     fromCompTy _ = Nothing
>     fromDirectType (DirectType tyname _ _) = Just tyname
>     fromDirectType (TypeDefType (TypeDefRef _ ref _)) = (fromDirectType.fromJust) ref
>     fromDirectType _ = Nothing

I hope that this example gives an approximate feeling what the library can achieve right now.

Dienstag, 8. Juli 2008

An analysis-friendly representation

As mentioned in my last post, C's declaration syntax is painful if you want to perform more complicated analysis tasks directly on the AST.

To illustrate the problem, consider for example the file scope declaration

static struct s { union b { int a,b; } z; } x(), *y;

There are several things mixed up here: a prototype of a function x, a definition of a pointer object y and the definition of two (file scope) composite types struct s and union b. If you're working with the syntax tree returned by the parser, detangling those things becomes a major headache.

For this reasons, CIL takes a quite radical approach, and makes you work on a semantic representation only, which eliminates many subtleties of the language.

Language.C will probably take the following approach:

SemanticRepr <= analyze-generate => AST <= parse-prettyprint => C-Source Code

We still expose the parser's AST and the corresponding pretty printer to the user, for those who need to work with or modify the syntax directly. Additionally, there is a semantic representation now, which gets rid of the syntactic oddities, has stronger types and is simpler to analyze.

The rational behind this decision is that

Users still have access to a full-scale C parser
Users still have fine grained pretty-printer control.
Extracting properties from the C code becomes much easier in the semantic representation
Using generic programming strategies is feasible in the semantic representation

One thing that is tricky to solve is the handling of anonymous structures and unions - consider the declaration

struct { int a; } b,*c;

This declaration introduces an anonymous struct type, which is shared by the objects b and c, but cannot be defined on its own. To re-export such a declaration to C you either have to keep track of all objects which share the same anonymous type, or simply give a name to anonymous types, as CIL does.

Anyway, this approach looks promising and will hopefully make the library a good tool of choice for both syntax and semantic oriented C tasks.

There a still some open questions though:

Which source code locations should be kept in the semantic representation ? It is nice if we're able to point at the problematic source code location, but for some transformed entities this is no longer possible.
Should we try to share some parts of the AST and the semantic representation ? I believe we should design simplified representations of statements and expressions instead.
What is the best way to link references and definitions of objects, functions and types ? Currently, identifiers are linked using namespace lookup maps, adopted from Manuel's c2hs framework.

Development Notes:

Designed a semantic representation for declaration and types.
Wrote portions of the AST to semantic-representation translation, including normalization of type names.
happy's --strict flag does a great job improving the space performance of the parser (120M down to 60M on a problematic file)
To reduce the necessary boilerplate work, I'm now using (offline) Data.Derive to generate some boring instances (source code locations).

Dienstag, 1. Juli 2008

Last week on Language.C (1)

This week I tried to find a better representation for declarators, which was screwed up in the old AST from c2hs. I had another look at the CIL API, which is really nice, but quite different from the rather syntax oriented AST representation I'm working with. While CIL datatypes represent the semantics of a C file, Language.C's AST datatypes rather represent the syntax. I hope that this won't be a disadvantage, and that we can use view abstractions to make using the library as easy for the user as CIL is.

The documentation of the AST is now usable, and includes some examples for the trickier parts (declarators, intializers).

Due to a bug in ghc I switched to generating Data.Generic instances offline using Data.Derive.

I completed the implementation, which used to only derive gfoldl (but not toConstr etc.), and now it seems to be working nice. Currently, the Data instances are used for testing only, but I hope to play with Strafunsky a little bit next week.

I've also abstracted the notion of an InputStream - one can use either ByteString or String as input now (chosen at compile time). I didn't optimize the ByteString stuff, and therefore the performance improvement is rather small. But the interface abstraction is nice anyway.

Finally, I didn't complete porting the declaration analysis code from c2hs (oh my), but at least some parts of the framework (traversal monad) is finished. This should be finished in a few days though, I hope.

First things first ?

This summer I'm working on a C99 parser and pretty printer library for haskell, coined Language.C - which might sound a little boring and like a lot of work.

The good (and sometimes tricky) thing about it is that there was already an almost complete parser for C99 available - written for the c2hs project and buried within a large framework. The complexity of the framework made it really hard to use for anyone outside c2hs.

The goal of the GSoC project I'm working on is to pull out the C related components from c2hs and to create a complete, well-tested library for parsing and pretty printing, with a usable and documented API.

Complete ? One thing I did was to precisely define which subset of C the library will handle (currently, that's almost all of -std=gnu99). Furthermore, the parser should record everything parsed in such a way that it is possible to reconstruct the source file from the AST (including the somewhat nasty __attribute__ annotations).

Well-tested ? We use round-trip tests (parse = parse . prettyPrint . parse) on both real world libraries and the gcc-dg test-suite. Hereby, a Data.Generics-based AST comparison helps to avoid mistakes in the tests themselves. Furthermore, automated measurements should prevent performance bugs - it is easy to introduce them in haskell.

Of course it would also be nice to have a rich analysis framework, quasi-quotation, and similar things - of which only some can be implemented by me. But I hope that it will be easy to work on and extend the library in the future.

For now, I'd like to release a preview of the library soon, as I believe it is already usable for some tasks (the repo is at http://code.haskell.org/language-c/).