Lab05 - Scanning

Due: Tue, Mar 29, 2022 at 11:59 PM to Github Classroom Assignment

Background

Over the next few weeks, we will build a scanner, parser, and interpreter for a little language which demonstrates basic arithmetic, number bases, and bitwise operations.

Requirements

  1. For this lab, you will build a scanner for the language described below. We will address parsing and interpretation in subsequent assignments.
  2. You must provide a Makefile which builds an executable called lab05
  3. You must build the token table yourself, without using C strtok()
  4. Use the techniques we learned for reading files and analyzing text to build up a linked list of tokens
  5. You may use any of the code I showed, or you can make up your own if you prefer
  6. Print the token IDs and names out exactly as shown to get credit for correctness

Example Output

$ ./lab05 "1 + 2"
TK_INTLIT("1")
TK_PLUS("+")
TK_INTLIT("2")
TK_EOT("")

$ ./lab05 "(100 / 0x32) * 0b10"
TK_LPAREN("(")
TK_INTLIT("100")
TK_DIV("/")
TK_HEXLIT("32")
TK_RPAREN(")")
TK_MULT("*")
TK_BINLIT("10")
TK_EOT("")

We will define the set of acceptable tokens using Extend Backus-Naur Form (EBNF) Here is the EBNF for this lab:

tokenlist   = (token)*
token       = intlit | binlit | hexlit | symbol 
intlit      = digit (digit)*
binlit      = '0b' bindigit (bindigit)*
hexlit      = '0x' hexdigit (hexdigit)*
symbol      = '+' | '-' | '*' | '/' | '(' | ')'
digit       = '0' | ... | '9'
bindigit    = '1' | '0'
hexdigit    = digit | 'a' | ... | 'f' | 'A' | ... | 'F'
whitespace  = ' ' | '\t'  # Ignore

Here are some definitions you should use in your implementation:

#define SCAN_TOKEN_LEN 32

enum scan_token_enum {
    TK_INTLIT, /* -123, 415 */
    TK_BINLIT, /* 0b10, 0b1001 */
    TK_HEXLIT, /* 0x7f, 0x12ce */
    TK_LPAREN, /* ( */
    TK_RPAREN, /* ) */
    TK_PLUS,   /* + */
    TK_MINUS,  /* - */
    TK_MULT,   /* * */
    TK_DIV,    /* / */
    TK_EOT,    /* end of text */
};

#define SCAN_STRINGS {\
    "TK_INTLIT",\
    "TK_BINLIT",\
    "TK_HEXLIT",\
    "TK_LPAREN",\
    "TK_RPAREN",\
    "TK_PLUS",\
    "TK_MINUS",\
    "TK_MULT",\
    "TK_DIV",\
    "TK_EOT"\
};

struct scan_token_st {
    enum scan_token_enum id;
    char name[SCAN_TOKEN_LEN];
    struct scan_token_st *next;
};

struct scan_table_st {
    struct scan_token_st *head;
    int len;
};

Implementation Notes

  1. The way to think about implementing a scanner given the EBNF is to create a function that will scan one token in the input string each time it is called. You need to figure out the starting character for each of the elements on the right-hand side of the token rule in the EBNF:
     token       = intlit | binlit | hexlit | symbol
    
  2. For the one-character elements such as symbol, simply compare the next character in the input text to each of the possible one-character elements. If you find a match, then you create a corresponding token.
  3. For multi-character elements such as intlit, binlit, and hexlit you will first see if the current character matches the beginning of one of these rules:
  4. For an intlit, the first character must be a digit.
  5. For a binlit, the first character must be a ‘0’, and the second character must be a ‘b’
  6. For a hexlit, the first character must be a ‘0’, and the second character must be an ‘x’
  7. Once you have determined that you are scanning one of the literal types, you should scan acceptable characters for that literal type until you get to a character which is not acceptable for that literal type.
  8. Your scanner should ignore whitespace
  9. You implementation should have the following functions:
     void scan_table_init(struct scan_table_st *tt);
     void scan_table_scan(struct scan_table_st *tt, char *input);
     void scan_table_print(struct scan_table_st *tt);