sqlite_parser/utils.py annotated source

Utility Functions and Constants

This module serves as the foundation of the SQLite parser, providing all the essential constants, enumerations, and lookup tables needed for tokenization and parsing.

Key Components

  1. TokenType Enum: Defines nearly 190 token types, including SQLite's 147 keywords, operators, delimiters, and literals
  2. Keyword Mapping: Fast O(1) lookup from string to token type
  3. Precedence Table: Defines operator precedence for expression parsing
  4. Operator Mappings: Convert tokens to AST operator nodes
  5. Helper Functions: Common operations like keyword checking and precedence lookup

The design follows separation of concerns: this module contains only static data and pure functions, with no parsing or lexing logic.

"""
Utility Functions and Constants for SQLite SQL Parser

Contains keyword lists, precedence tables, and helper functions.
"""

from enum import Enum, auto
from typing import Dict, Set
from .ast_nodes import BinaryOperator, UnaryOperator

Token Type Enumeration

The TokenType enum is the parser's vocabulary. It defines every distinct type of token that can appear in SQLite SQL, organized into logical categories:

  • Keywords: All 147 SQLite reserved words plus TRUE/FALSE
  • Operators: Arithmetic, comparison, bitwise, and special operators
  • Delimiters: Parentheses, brackets, commas, and punctuation
  • Literals: Numbers, strings, blobs, identifiers, and parameters
  • Special: EOF and NEWLINE for parser control flow

Using Python's auto() generates sequential integer values automatically, making it easy to add new token types without manual numbering.

class TokenType(Enum):
    """Token types for SQLite SQL"""
    # Keywords (147 SQLite keywords)
    ABORT = auto()
    ACTION = auto()
    ADD = auto()
    AFTER = auto()
    ALL = auto()
    ALTER = auto()
    ALWAYS = auto()
    ANALYZE = auto()
    AND = auto()
    AS = auto()
    ASC = auto()
    ATTACH = auto()
    AUTOINCREMENT = auto()
    BEFORE = auto()
    BEGIN = auto()
    BETWEEN = auto()
    BY = auto()
    CASCADE = auto()
    CASE = auto()
    CAST = auto()
    CHECK = auto()
    COLLATE = auto()
    COLUMN = auto()
    COMMIT = auto()
    CONFLICT = auto()
    CONSTRAINT = auto()
    CREATE = auto()
    CROSS = auto()
    CURRENT = auto()
    CURRENT_DATE = auto()
    CURRENT_TIME = auto()
    CURRENT_TIMESTAMP = auto()
    DATABASE = auto()
    DEFAULT = auto()
    DEFERRABLE = auto()
    DEFERRED = auto()
    DELETE = auto()
    DESC = auto()
    DETACH = auto()
    DISTINCT = auto()
    DO = auto()
    DROP = auto()
    EACH = auto()
    ELSE = auto()
    END = auto()
    ESCAPE = auto()
    EXCEPT = auto()
    EXCLUDE = auto()
    EXCLUSIVE = auto()
    EXISTS = auto()
    EXPLAIN = auto()
    FAIL = auto()
    FILTER = auto()
    FIRST = auto()
    FOLLOWING = auto()
    FOR = auto()
    FOREIGN = auto()
    FROM = auto()
    FULL = auto()
    GENERATED = auto()
    GLOB = auto()
    GROUP = auto()
    GROUPS = auto()
    HAVING = auto()
    IF = auto()
    IGNORE = auto()
    IMMEDIATE = auto()
    IN = auto()
    INDEX = auto()
    INDEXED = auto()
    INITIALLY = auto()
    INNER = auto()
    INSERT = auto()
    INSTEAD = auto()
    INTERSECT = auto()
    INTO = auto()
    IS = auto()
    ISNULL = auto()
    JOIN = auto()
    KEY = auto()
    LAST = auto()
    LEFT = auto()
    LIKE = auto()
    LIMIT = auto()
    MATCH = auto()
    MATERIALIZED = auto()
    NATURAL = auto()
    NO = auto()
    NOT = auto()
    NOTHING = auto()
    NOTNULL = auto()
    NULL = auto()
    NULLS = auto()
    OF = auto()
    OFFSET = auto()
    ON = auto()
    OR = auto()
    ORDER = auto()
    OTHERS = auto()
    OUTER = auto()
    OVER = auto()
    PARTITION = auto()
    PLAN = auto()
    PRAGMA = auto()
    PRECEDING = auto()
    PRIMARY = auto()
    QUERY = auto()
    RAISE = auto()
    RANGE = auto()
    RECURSIVE = auto()
    REFERENCES = auto()
    REGEXP = auto()
    REINDEX = auto()
    RELEASE = auto()
    RENAME = auto()
    REPLACE = auto()
    RESTRICT = auto()
    RETURNING = auto()
    RIGHT = auto()
    ROLLBACK = auto()
    ROW = auto()
    ROWID = auto()
    ROWS = auto()
    SAVEPOINT = auto()
    SELECT = auto()
    SET = auto()
    STRICT = auto()
    STORED = auto()
    TABLE = auto()
    TEMP = auto()
    TEMPORARY = auto()
    THEN = auto()
    TIES = auto()
    TO = auto()
    TRANSACTION = auto()
    TRIGGER = auto()
    UNBOUNDED = auto()
    UNION = auto()
    UNIQUE = auto()
    UPDATE = auto()
    USING = auto()
    VACUUM = auto()
    VALUES = auto()
    VIEW = auto()
    VIRTUAL = auto()
    WHEN = auto()
    WHERE = auto()
    WINDOW = auto()
    WITH = auto()
    WITHOUT = auto()

    # Special keywords
    TRUE = auto()
    FALSE = auto()

    # Operators
    PLUS = auto()              # +
    MINUS = auto()             # -
    STAR = auto()              # *
    SLASH = auto()             # /
    PERCENT = auto()           # %
    CONCAT = auto()            # ||
    EQUAL = auto()             # =
    EQUAL2 = auto()            # ==
    NOT_EQUAL = auto()         # !=
    NOT_EQUAL2 = auto()        # <>
    LESS = auto()              # <
    GREATER = auto()           # >
    LESS_EQUAL = auto()        # <=
    GREATER_EQUAL = auto()     # >=
    LEFT_SHIFT = auto()        # <<
    RIGHT_SHIFT = auto()       # >>
    AMPERSAND = auto()         # &
    PIPE = auto()              # |
    TILDE = auto()             # ~
    ARROW = auto()             # ->
    DOUBLE_ARROW = auto()      # ->>

    # Delimiters
    LPAREN = auto()            # (
    RPAREN = auto()            # )
    LBRACKET = auto()          # [
    RBRACKET = auto()          # ]
    COMMA = auto()             # ,
    SEMICOLON = auto()         # ;
    DOT = auto()               # .

    # Literals
    NUMBER = auto()            # 123, 123.45, 1.5e10
    STRING = auto()            # 'string'
    BLOB = auto()              # X'hex'
    IDENTIFIER = auto()        # name, [name], "name", `name`
    PARAMETER = auto()         # ?, ?123, :name, @name, $name

    # Special
    EOF = auto()
    NEWLINE = auto()
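The auto() mechanism above can be shown in isolation. The following is a toy enum (not the real TokenType) demonstrating that auto() hands out sequential integer values in declaration order:

```python
from enum import Enum, auto

# Toy enum mirroring how TokenType uses auto(): each member receives
# the next sequential integer, so declaration order is all that matters.
class MiniTokenType(Enum):
    SELECT = auto()   # value 1
    FROM = auto()     # value 2
    WHERE = auto()    # value 3

print(MiniTokenType.SELECT.value, MiniTokenType.WHERE.value)  # 1 3
```

Adding a new member anywhere in the class body just shifts the values that follow it, which is why no manual renumbering is ever needed.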

Keyword Lookup Dictionary

The KEYWORDS dictionary provides O(1) lookup from uppercase string to TokenType. This is used during lexing to distinguish keywords from identifiers.

Design Decisions

  • Case insensitivity: All keys are uppercase because SQLite keywords are case-insensitive. The lexer normalizes input before lookup.
  • Complete coverage: Includes all 147 official SQLite keywords from the SQLite documentation
  • Easy maintenance: The 1:1 mapping from string to enum makes it trivial to add new keywords

When the lexer encounters an identifier, it checks KEYWORDS to determine if it's actually a reserved word.

# Map keyword strings to token types
KEYWORDS: Dict[str, TokenType] = {
    'ABORT': TokenType.ABORT,
    'ACTION': TokenType.ACTION,
    'ADD': TokenType.ADD,
    'AFTER': TokenType.AFTER,
    'ALL': TokenType.ALL,
    'ALTER': TokenType.ALTER,
    'ALWAYS': TokenType.ALWAYS,
    'ANALYZE': TokenType.ANALYZE,
    'AND': TokenType.AND,
    'AS': TokenType.AS,
    'ASC': TokenType.ASC,
    'ATTACH': TokenType.ATTACH,
    'AUTOINCREMENT': TokenType.AUTOINCREMENT,
    'BEFORE': TokenType.BEFORE,
    'BEGIN': TokenType.BEGIN,
    'BETWEEN': TokenType.BETWEEN,
    'BY': TokenType.BY,
    'CASCADE': TokenType.CASCADE,
    'CASE': TokenType.CASE,
    'CAST': TokenType.CAST,
    'CHECK': TokenType.CHECK,
    'COLLATE': TokenType.COLLATE,
    'COLUMN': TokenType.COLUMN,
    'COMMIT': TokenType.COMMIT,
    'CONFLICT': TokenType.CONFLICT,
    'CONSTRAINT': TokenType.CONSTRAINT,
    'CREATE': TokenType.CREATE,
    'CROSS': TokenType.CROSS,
    'CURRENT': TokenType.CURRENT,
    'CURRENT_DATE': TokenType.CURRENT_DATE,
    'CURRENT_TIME': TokenType.CURRENT_TIME,
    'CURRENT_TIMESTAMP': TokenType.CURRENT_TIMESTAMP,
    'DATABASE': TokenType.DATABASE,
    'DEFAULT': TokenType.DEFAULT,
    'DEFERRABLE': TokenType.DEFERRABLE,
    'DEFERRED': TokenType.DEFERRED,
    'DELETE': TokenType.DELETE,
    'DESC': TokenType.DESC,
    'DETACH': TokenType.DETACH,
    'DISTINCT': TokenType.DISTINCT,
    'DO': TokenType.DO,
    'DROP': TokenType.DROP,
    'EACH': TokenType.EACH,
    'ELSE': TokenType.ELSE,
    'END': TokenType.END,
    'ESCAPE': TokenType.ESCAPE,
    'EXCEPT': TokenType.EXCEPT,
    'EXCLUDE': TokenType.EXCLUDE,
    'EXCLUSIVE': TokenType.EXCLUSIVE,
    'EXISTS': TokenType.EXISTS,
    'EXPLAIN': TokenType.EXPLAIN,
    'FAIL': TokenType.FAIL,
    'FILTER': TokenType.FILTER,
    'FIRST': TokenType.FIRST,
    'FOLLOWING': TokenType.FOLLOWING,
    'FOR': TokenType.FOR,
    'FOREIGN': TokenType.FOREIGN,
    'FROM': TokenType.FROM,
    'FULL': TokenType.FULL,
    'GENERATED': TokenType.GENERATED,
    'GLOB': TokenType.GLOB,
    'GROUP': TokenType.GROUP,
    'GROUPS': TokenType.GROUPS,
    'HAVING': TokenType.HAVING,
    'IF': TokenType.IF,
    'IGNORE': TokenType.IGNORE,
    'IMMEDIATE': TokenType.IMMEDIATE,
    'IN': TokenType.IN,
    'INDEX': TokenType.INDEX,
    'INDEXED': TokenType.INDEXED,
    'INITIALLY': TokenType.INITIALLY,
    'INNER': TokenType.INNER,
    'INSERT': TokenType.INSERT,
    'INSTEAD': TokenType.INSTEAD,
    'INTERSECT': TokenType.INTERSECT,
    'INTO': TokenType.INTO,
    'IS': TokenType.IS,
    'ISNULL': TokenType.ISNULL,
    'JOIN': TokenType.JOIN,
    'KEY': TokenType.KEY,
    'LAST': TokenType.LAST,
    'LEFT': TokenType.LEFT,
    'LIKE': TokenType.LIKE,
    'LIMIT': TokenType.LIMIT,
    'MATCH': TokenType.MATCH,
    'MATERIALIZED': TokenType.MATERIALIZED,
    'NATURAL': TokenType.NATURAL,
    'NO': TokenType.NO,
    'NOT': TokenType.NOT,
    'NOTHING': TokenType.NOTHING,
    'NOTNULL': TokenType.NOTNULL,
    'NULL': TokenType.NULL,
    'NULLS': TokenType.NULLS,
    'OF': TokenType.OF,
    'OFFSET': TokenType.OFFSET,
    'ON': TokenType.ON,
    'OR': TokenType.OR,
    'ORDER': TokenType.ORDER,
    'OTHERS': TokenType.OTHERS,
    'OUTER': TokenType.OUTER,
    'OVER': TokenType.OVER,
    'PARTITION': TokenType.PARTITION,
    'PLAN': TokenType.PLAN,
    'PRAGMA': TokenType.PRAGMA,
    'PRECEDING': TokenType.PRECEDING,
    'PRIMARY': TokenType.PRIMARY,
    'QUERY': TokenType.QUERY,
    'RAISE': TokenType.RAISE,
    'RANGE': TokenType.RANGE,
    'RECURSIVE': TokenType.RECURSIVE,
    'REFERENCES': TokenType.REFERENCES,
    'REGEXP': TokenType.REGEXP,
    'REINDEX': TokenType.REINDEX,
    'RELEASE': TokenType.RELEASE,
    'RENAME': TokenType.RENAME,
    'REPLACE': TokenType.REPLACE,
    'RESTRICT': TokenType.RESTRICT,
    'RETURNING': TokenType.RETURNING,
    'RIGHT': TokenType.RIGHT,
    'ROLLBACK': TokenType.ROLLBACK,
    'ROW': TokenType.ROW,
    'ROWID': TokenType.ROWID,
    'ROWS': TokenType.ROWS,
    'SAVEPOINT': TokenType.SAVEPOINT,
    'SELECT': TokenType.SELECT,
    'SET': TokenType.SET,
    'STORED': TokenType.STORED,
    'STRICT': TokenType.STRICT,
    'TABLE': TokenType.TABLE,
    'TEMP': TokenType.TEMP,
    'TEMPORARY': TokenType.TEMPORARY,
    'THEN': TokenType.THEN,
    'TIES': TokenType.TIES,
    'TO': TokenType.TO,
    'TRANSACTION': TokenType.TRANSACTION,
    'TRIGGER': TokenType.TRIGGER,
    'UNBOUNDED': TokenType.UNBOUNDED,
    'UNION': TokenType.UNION,
    'UNIQUE': TokenType.UNIQUE,
    'UPDATE': TokenType.UPDATE,
    'USING': TokenType.USING,
    'VACUUM': TokenType.VACUUM,
    'VALUES': TokenType.VALUES,
    'VIEW': TokenType.VIEW,
    'VIRTUAL': TokenType.VIRTUAL,
    'WHEN': TokenType.WHEN,
    'WHERE': TokenType.WHERE,
    'WINDOW': TokenType.WINDOW,
    'WITH': TokenType.WITH,
    'WITHOUT': TokenType.WITHOUT,
    'TRUE': TokenType.TRUE,
    'FALSE': TokenType.FALSE,
}
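In use, classification is one uppercase normalization followed by one dict lookup. The following is a self-contained sketch with a stand-in table and an illustrative function name (the module's actual lexer API may differ):

```python
# Stand-in keyword table; the real KEYWORDS maps strings to TokenType members.
KEYWORDS = {'SELECT': 'KW_SELECT', 'FROM': 'KW_FROM', 'WHERE': 'KW_WHERE'}

def classify_word(word: str) -> str:
    # Uppercase first: SQLite keywords are case-insensitive.
    return KEYWORDS.get(word.upper(), 'IDENTIFIER')

print(classify_word('select'))   # KW_SELECT
print(classify_word('user_id'))  # IDENTIFIER
```

Because dict lookup is O(1) on average, keyword classification costs the same whether the table holds 10 entries or 150.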

Operator Precedence Table

The PRECEDENCE dictionary defines the parsing priority for binary operators. Lower numbers bind less tightly (evaluate later), following the precedence climbing algorithm used in expression parsing.

Precedence Levels (from lowest to highest)

  1. OR (1): Logical OR - lowest precedence, binds loosest
  2. AND (2): Logical AND
  3. NOT (3): Logical negation (unary, but listed here for reference)
  4. Comparison (4): =, <, >, IS, IN, LIKE, BETWEEN, etc.
  5. Bitwise (5): Bit shifts and bitwise AND/OR
  6. Additive (6): +, -, || (string concatenation)
  7. Multiplicative (7): *, /, % - highest precedence, binds tightest

Example

In 1 + 2 * 3 OR 4 < 5 AND 6:

  1. 2 * 3 binds first (precedence 7)
  2. 1 + (2 * 3) follows (precedence 6)
  3. 4 < 5 is grouped next (precedence 4)
  4. AND joins the comparison with 6 (precedence 2)
  5. OR binds last (precedence 1)

Note: Special operators like BETWEEN, IN, and LIKE have their own parsing methods, but are listed here at precedence 4 for when they appear as binary operators.

# Operator precedence (lower number = lower precedence)
PRECEDENCE: Dict[TokenType, int] = {
    TokenType.OR: 1,
    TokenType.AND: 2,
    TokenType.NOT: 3,
    # Comparison operators
    TokenType.EQUAL: 4,
    TokenType.EQUAL2: 4,
    TokenType.NOT_EQUAL: 4,
    TokenType.NOT_EQUAL2: 4,
    TokenType.LESS: 4,
    TokenType.GREATER: 4,
    TokenType.LESS_EQUAL: 4,
    TokenType.GREATER_EQUAL: 4,
    TokenType.IS: 4,
    TokenType.IN: 4,
    TokenType.LIKE: 4,
    TokenType.GLOB: 4,
    TokenType.MATCH: 4,
    TokenType.REGEXP: 4,
    TokenType.BETWEEN: 4,
    # Bitwise
    TokenType.LEFT_SHIFT: 5,
    TokenType.RIGHT_SHIFT: 5,
    TokenType.AMPERSAND: 5,
    TokenType.PIPE: 5,
    # Arithmetic and string
    TokenType.PLUS: 6,
    TokenType.MINUS: 6,
    TokenType.CONCAT: 6,
    TokenType.STAR: 7,
    TokenType.SLASH: 7,
    TokenType.PERCENT: 7,
    # COLLATE handled specially
    # Unary handled specially
}
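A table of this shape is exactly what precedence climbing consumes. Below is a minimal, hypothetical sketch (string tokens instead of TokenType, tuples instead of AST nodes, not the module's actual parser) showing how the numbers drive the parse:

```python
# Stand-in precedence table, same shape as PRECEDENCE
# (lower number = binds looser).
PREC = {'OR': 1, 'AND': 2, '<': 4, '+': 6, '-': 6, '*': 7}

def parse_expr(tokens, pos=0, min_prec=1):
    """Precedence climbing over a flat token list; returns (tree, next_pos)."""
    lhs, pos = tokens[pos], pos + 1              # consume one operand
    while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # Left-associative: the right operand may only grab operators
        # that bind strictly tighter than op, hence PREC[op] + 1.
        rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse_expr(['1', '+', '2', '*', '3'])
print(tree)  # ('+', '1', ('*', '2', '3'))
```

The multiplication nests under the addition because precedence 7 beats 6, matching the worked example above.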

Associativity

SQL operators are left-associative by default, meaning a - b - c parses as (a - b) - c. This empty set is a placeholder for any right-associative operators that might be added in the future (though SQL has very few right-associative operators).

# Right-associative operators
RIGHT_ASSOCIATIVE: Set[TokenType] = set()  # Most operators in SQL are left-associative
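To make "left-associative" concrete: each new operator takes the everything-so-far tree as its left child. A tiny illustrative fold (hypothetical helper, not part of this module) shows the shape a - b - c parses to:

```python
def fold_left(operands, op='-'):
    # Left-associative: each step nests the previous result as the
    # left child, so ['a', 'b', 'c'] becomes ((a - b) - c).
    tree = operands[0]
    for rhs in operands[1:]:
        tree = (op, tree, rhs)
    return tree

print(fold_left(['a', 'b', 'c']))  # ('-', ('-', 'a', 'b'), 'c')
```

In a precedence climber, this behavior comes from bumping the minimum precedence by one when recursing for the right operand; a right-associative operator would keep it equal instead.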

Helper Functions

These utility functions provide clean abstractions over the lookup tables, making the lexer and parser code more readable and maintainable.

Each function has a single, clear responsibility:

  • Keyword checking and normalization
  • Operator classification and precedence lookup
  • Token-to-AST-node conversion

def is_keyword(word: str) -> bool:
    """Check if a word is a SQLite keyword"""
    return word.upper() in KEYWORDS


def normalize_keyword(word: str) -> str:
    """Normalize keyword to uppercase"""
    return word.upper()


def get_keyword_token_type(word: str) -> TokenType:
    """Get token type for keyword"""
    return KEYWORDS.get(word.upper(), TokenType.IDENTIFIER)


def is_binary_operator(token_type: TokenType) -> bool:
    """Check if token type is a binary operator"""
    return token_type in PRECEDENCE


def get_precedence(token_type: TokenType) -> int:
    """Get precedence for operator"""
    return PRECEDENCE.get(token_type, 0)


def is_right_associative(token_type: TokenType) -> bool:
    """Check if operator is right-associative"""
    return token_type in RIGHT_ASSOCIATIVE
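Called from lexer or parser code, the helpers read naturally. A self-contained sketch with stand-in tables (string keys rather than the real TokenType members, since this snippet does not import the module):

```python
# Stand-in tables shaped like KEYWORDS and PRECEDENCE above.
KEYWORDS = {'SELECT': 'SELECT', 'AND': 'AND'}
PRECEDENCE = {'AND': 2, 'PLUS': 6, 'STAR': 7}

def is_keyword(word: str) -> bool:
    return word.upper() in KEYWORDS

def get_precedence(token_type: str) -> int:
    # 0 means "not a binary operator", which stops a climbing loop
    # because every real operator has precedence >= 1.
    return PRECEDENCE.get(token_type, 0)

print(is_keyword('select'), get_precedence('STAR'), get_precedence('COMMA'))
# True 7 0
```

Returning 0 for unknown tokens is what lets the parser's expression loop terminate without a separate "is this an operator?" check at every step.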

Token-to-AST Operator Mappings

These dictionaries convert TokenTypes to AST operator enums, bridging the gap between the lexer (which produces tokens) and the parser (which builds AST nodes).

Binary Operators

Maps operator tokens to BinaryOperator enum values. For example, when the parser sees a TokenType.PLUS token, it converts it to BinaryOperator.ADD for the AST node.

This separation allows the same token type to map to different operators in different contexts (though SQL doesn't often need this flexibility).

# Token type to binary operator mapping
TOKEN_TO_BINARY_OP: Dict[TokenType, BinaryOperator] = {
    TokenType.PLUS: BinaryOperator.ADD,
    TokenType.MINUS: BinaryOperator.SUBTRACT,
    TokenType.STAR: BinaryOperator.MULTIPLY,
    TokenType.SLASH: BinaryOperator.DIVIDE,
    TokenType.PERCENT: BinaryOperator.MODULO,
    TokenType.EQUAL: BinaryOperator.EQUAL,
    TokenType.EQUAL2: BinaryOperator.EQUAL2,
    TokenType.NOT_EQUAL: BinaryOperator.NOT_EQUAL,
    TokenType.NOT_EQUAL2: BinaryOperator.NOT_EQUAL2,
    TokenType.LESS: BinaryOperator.LESS_THAN,
    TokenType.GREATER: BinaryOperator.GREATER_THAN,
    TokenType.LESS_EQUAL: BinaryOperator.LESS_EQUAL,
    TokenType.GREATER_EQUAL: BinaryOperator.GREATER_EQUAL,
    TokenType.AND: BinaryOperator.AND,
    TokenType.OR: BinaryOperator.OR,
    TokenType.AMPERSAND: BinaryOperator.BIT_AND,
    TokenType.PIPE: BinaryOperator.BIT_OR,
    TokenType.LEFT_SHIFT: BinaryOperator.LEFT_SHIFT,
    TokenType.RIGHT_SHIFT: BinaryOperator.RIGHT_SHIFT,
    TokenType.CONCAT: BinaryOperator.CONCAT,
    TokenType.IS: BinaryOperator.IS,
    TokenType.IN: BinaryOperator.IN,
    TokenType.LIKE: BinaryOperator.LIKE,
    TokenType.GLOB: BinaryOperator.GLOB,
    TokenType.MATCH: BinaryOperator.MATCH,
    TokenType.REGEXP: BinaryOperator.REGEXP,
}

Unary Operators

Maps unary operator tokens to UnaryOperator enum values. Unary operators appear before their operand (prefix position): -5, NOT condition, ~bitmask.

Note that + and - can be both unary and binary operators. The parser determines which based on context (whether there's a left operand).

# Token type to unary operator mapping
TOKEN_TO_UNARY_OP: Dict[TokenType, UnaryOperator] = {
    TokenType.PLUS: UnaryOperator.PLUS,
    TokenType.MINUS: UnaryOperator.MINUS,
    TokenType.NOT: UnaryOperator.NOT,
    TokenType.TILDE: UnaryOperator.BITNOT,
}
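The unary-vs-binary decision for + and - can be sketched with a hypothetical helper (illustrative names, not this module's parser): the operator is unary exactly when it appears where an operand is expected, i.e. at expression start or right after another operator, comma, or opening paren.

```python
def minus_is_unary(prev_token):
    """Hypothetical check: is '-' unary, given the token before it?

    At expression start (prev_token is None), or right after an operator,
    a comma, or an opening paren, no left operand exists, so '-' must be
    the unary prefix operator; otherwise it is binary subtraction.
    """
    return prev_token is None or prev_token in {'+', '-', '*', '/', '(', ','}

print(minus_is_unary(None))  # True:  -5
print(minus_is_unary('3'))   # False: 3 - 5
print(minus_is_unary('('))   # True:  (-5)
```

Once classified, the parser picks TOKEN_TO_UNARY_OP or TOKEN_TO_BINARY_OP accordingly; the two tables can safely share TokenType.PLUS and TokenType.MINUS because the choice has already been made from context.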