Symbol Table Management
last updated
10/6/03
Part of compiler
The symbol table is accessed by most phases of a compiler, beginning within
the lexical analysis scanning. The symbol table carries the collected information
about each named object in the program to other phases of the compiler. As soon
as a named token is found, depending on the token type, a call to the symbol
table routines is made.
Tokens:
Keywords of the language, special characters/punctuation, user defined
identifiers used for different meanings, constants
Identifiers:
Identifiers are user defined
- variables, memory location labels
- parameter names
- names of subroutines (functions or procedures) which are associated with
entry points of the routines' code
- array or structure names
- field names; file names; enumerated type values
- constant names; type names
Need a global structure to store identifiers and their associated meanings
(attributes) for future reference. This structure provides more information that
the grammar and syntax directed translation of the language would not normally
be able to carry.
Symbol Table Requirements
fast lookup. The parser of a compiler is linear in speed. The table
lookup needs to be as fast as possible.
flexible in structure. The entries must contain all necessary
information, depending on usage of identifier
efficient use of space. This will require runtime allocation
(dynamic) of the space.
handle characteristics of language (e.g., scoping, implicit
declaration)
Scoping requires handling entry of a local block and exit. The latter
requires removal or hiding of the entries of the block.
Program components
Declarative statements- define identifiers and their attributes
Imperative statements - uses identifiers assuming their attributes
Languages and their explicit declarative statements
| Pascal, Ada |
VAR statement |
| C, C++, Java, PL/I |
statements beginning with a type |
| FORTRAN |
TYPE, DIMENSION statement |
| COBOL |
DATA DIVISION |
Implicit Declarations
Some languages have implicit declarations
FORTRAN or PL/I SUM = SUM + I (first encounter assumes declaration with
default attributes)
APL, BASIC, LISP (most interpreted languages)
Actions on declarations
Declarative statements generally do not translate to (or directly associated
with) any executable code. They may allocate space at designated times, however.
Symbols can be monitored, especially in languages that require declaration
statements
| FIRST ENCOUNTER |
SUBSEQUENT ENCOUNTER |
CONCLUSION |
| Declaration |
Reference (in imperative stmt) |
Continue--this is the expected
case |
| Declaration |
Declaration |
Multiple declarations error unless appropriate within scoping
rules
|
| Declaration |
None |
Unused warning message
|
|
Reference (in imperative statement)
|
NA |
Undefined error
|
String Management
How to efficiently store names of identifiers
1. Restrict length of identifier
- FORTRAN 1-6 characters
Pascal had 1-10 characters in initial design
- Advantages: structure consists of simple fixed sized string array;
easy lookup
- Disadvantages: limits programmer to choose effective names; may
waste space
2. Separate String Space
- Advantages: allows unlimited identifier names; doesn't waste space
- Disadvantages: extra memory reference for access
3. Dynamic allocation of strings
Advantages: don't need predefined space for strings
Disadvantages: complexity
Extend to other parts of the symbol table
Name Searching
Functions:
| Declaration of name |
- enter new name into table
- return error if already there
- add new attributes as found |
| Use |
- expect name to be found
- return position of entry in table |
| Entry of new scope |
- allow new declarations or redefined declarations |
| Exit of scope |
- delete entries |
Goals:
- efficiency of declaration entry, insertion of new name
- efficiency of retrieval, lookup
- sorted list of symbol table dump
Most important would be #2
Approaches:
n is the number of entries in the symbol table for the analysis below
n' is the number of entries in the current [local] block
1. Linear access
- Assuming error free source file
Time(declaration) = k or O(1) [k is some constant]
Time (reference) kn/2 or O(n)
- To check for errors
Time(declaration) ~= kn'/2
Time(reference) ~= kn/2
- Time(block entry) = O(1)
Time(block exit) = O(n') or O(1) if display used
Time(sort) = O(n log n)
2. Binary access
- data structure same as linear table
- as new name is entered, insert into block to maintain sorted order
- on lookup use binary search
- Time(declaration) = O(n')
Time(ref) = O(log n')
Time(sort) = 0 (already in order)
Time(entry) = O(1)
Time(exit) = O(1)
3. Tree access
- - assumes a more random encounter of names
B A C P C D B Q C D E
- - entries of symbol table are organized into tree structure
- - have forest of trees - each tree constitutes a block
- Average Time(decl) = O(log n)
Average Time (ref) = O(log n)
Worst case Time(decl) = T(ref) = O(n)
Time(sort) = 0
Time(entry) = O(1)
Time(exit) = O(n')
4. Hash tables
- use a hash function on name to find location in symbol table
- symbol table is organized like a table of stacks since some identifiers in
different block will collide
- Time(declaration) = O(1) or O(n')
Time(reference) = O(b) where b is the chain depth
Time(entry) = O(1)
Time(exit) = O(n) or O(n') if second link connects block's ids
Single pass vs multipass compiler:
single pass can discard the entries while
multipass will need to retain temporarily inaccessible entries