How to parse computer code, Advent of Code 2024 day 3
Having tackled some of the later Advent of Code challenges, I wanted to revisit Day 3, which presented an interesting parsing problem. The task involved extracting valid code from a noisy input, a great exercise in parser and lexer development. Join me as I explore my approach to this challenge.
When I first wrote about the ruler DSL, I relied on Hy for parsing. However, my recent exploration of generative AI has introduced a new parsing method: generated code using the funcparserlib library. This Advent of Code challenge allowed me to delve into the intricacies of funcparserlib and develop a much stronger grasp of the generated code's functionality.
Implementing the Lexer (Lexical Analysis)
The first step in processing our corrupted input is lexing (or tokenization). The lexer (or tokenizer) scans the input string and divides it into individual tokens, which are the basic building blocks for further processing. A token represents a meaningful unit in the input, categorized by its type. For this puzzle, we’re interested in these token types:
- Operators (OP): These represent function names, such as mul, do, and don't. For example, the input mul(2, 3) contains the operator token mul.
- Numbers: These are numerical values. For instance, in the input mul(2, 3), 2 and 3 would be recognized as number tokens.
- Commas: The comma character (,) acts as a separator between arguments.
- Parentheses: Opening ( and closing ) parentheses define the structure of the function calls.
- Gibberish: This category encompasses any characters or sequences of characters that don't match the other token types. This is where the "corrupted" part of the input comes in. For example, %$#@ or any random letters would be considered gibberish.
While funcparserlib often uses magic strings in its tutorials, I prefer a more structured approach. Magic strings can lead to typos and make it difficult to refactor code. Using an Enum to define token types offers several advantages: better readability, improved maintainability, and enhanced type safety. Here's how I defined the token types using an Enum:
from enum import Enum, auto

class Spec(Enum):
    OP = auto()
    NUMBER = auto()
    COMMA = auto()
    LPAREN = auto()
    RPAREN = auto()
    GIBBERISH = auto()
By using Spec.OP, Spec.NUMBER, etc., we avoid the ambiguity and potential errors associated with using plain strings.
To seamlessly integrate our Enum with funcparserlib, I created a small wrapper named TokenSpec_. It wraps the original TokenSpec from funcparserlib and simplifies token definition by accepting a value from our Spec Enum as the spec argument. Internally, it extracts the string representation of the enum (spec.name) and passes that, along with any other arguments, to the original TokenSpec.
from typing import Any

from funcparserlib.lexer import TokenSpec

def TokenSpec_(spec: Spec, *args: Any, **kwargs: Any) -> TokenSpec:
    return TokenSpec(spec.name, *args, **kwargs)
With the TokenSpec_ wrapper in place, we can define the tokenizer. We use make_tokenizer from funcparserlib to create a tokenizer that takes a list of TokenSpec_ definitions. Each definition specifies a token type (from our Spec Enum) and a regular expression to match it.
from funcparserlib.lexer import Token, make_tokenizer

def tokenize(input: str) -> tuple[Token, ...]:
    tokenizer = make_tokenizer(
        [
            TokenSpec_(
                Spec.OP, r"mul(?=\(\d{1,3},\d{1,3}\))|do(?=\(\))|don\'t(?=\(\))"
            ),
            TokenSpec_(Spec.NUMBER, r"\d{1,3}"),
            TokenSpec_(Spec.LPAREN, r"\("),
            TokenSpec_(Spec.RPAREN, r"\)"),
            TokenSpec_(Spec.COMMA, r","),
            TokenSpec_(Spec.GIBBERISH, r"[\s\S]"),
        ]
    )
    return tuple(
        token for token in tokenizer(input) if token.type != Spec.GIBBERISH.name
    )
The OP regular expression uses alternation (|) to match the different function formats. Specifically:
- mul(?=\(\d{1,3},\d{1,3}\)): Matches mul only if it's followed by parentheses containing two numbers separated by a comma. The positive lookahead assertion (?=...) ensures that the parentheses and numbers are present but are not consumed by the match.
- do(?=\(\)): Matches do only if followed by empty parentheses.
- don\'t(?=\(\)): Matches don't only if followed by empty parentheses.
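To see what the lookaheads buy us, here is a small standalone check using Python's re module; the sample string is my own and purely illustrative:

import re

# The same OP pattern used in the tokenizer above.
OP_PATTERN = r"mul(?=\(\d{1,3},\d{1,3}\))|do(?=\(\))|don\'t(?=\(\))"

# "mul[3,7]" is skipped because the lookahead demands parentheses with two
# numbers, and the "do" in "do_not_" is skipped because "()" does not
# immediately follow it.
print(re.findall(OP_PATTERN, "xmul(2,4)%mul[3,7]do_not_don't()do()"))
# Expected output: ['mul', "don't", 'do']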
Finally, the tokenize function filters out any GIBBERISH tokens during tokenization to focus on the relevant parts of the input for further processing.
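As a quick sanity check, this is how the tokenizer treats a short corrupted fragment (the sample string is mine, not the puzzle input); note how the malformed mul[3,7] leaves behind only stray NUMBER and COMMA tokens:

# Print the type and value of every token that survives the GIBBERISH filter.
for token in tokenize("xmul(2,4)%&mul[3,7]!don't()_mul(5,5)"):
    print(token.type, token.value)
# mul(2,4) and mul(5,5) come out as OP, LPAREN, NUMBER, COMMA, NUMBER, RPAREN
# sequences, and don't() as OP, LPAREN, RPAREN, while mul[3,7] contributes
# only the stray tokens NUMBER 3, COMMA and NUMBER 7.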
The process of interpreting code typically involves two main stages: lexical analysis (or lexing) and parsing. We've already implemented the first stage: our tokenize function acts as a lexer, taking the input string and converting it into a sequence of tokens. These tokens are the fundamental building blocks that the parser will use to understand the structure and meaning of the code. Now, let's explore how the parser uses these tokens.
Implementing the Parser
The tokens returned by the tokenize function are then sent to a parser for further processing. To bridge the gap between our Spec Enum and the tok function, we introduce a wrapper named tok_.
from funcparserlib.parser import Parser, tok

def tok_(spec: Spec, *args: Any, **kwargs: Any) -> Parser[Token, str]:
    return tok(spec.name, *args, **kwargs)
For example, if we have a Spec.NUMBER token, the returned parser will accept the token and return its value, as follows:

>>> from funcparserlib.lexer import Token
>>> number_parser = tok_(Spec.NUMBER)
>>> number_parser.parse([Token(Spec.NUMBER.name, '123')])
'123'
The returned value can then be transformed into the desired data type using the >> operator, as shown below:

>>> from funcparserlib.lexer import Token
>>> from ast import literal_eval
>>> number_parser = tok_(Spec.NUMBER) >> literal_eval
>>> number_parser.parse([Token(Spec.NUMBER.name, '123')])
123
Typically, it's best practice to use ast.literal_eval when parsing unknown input to avoid potential security vulnerabilities. However, the constraints of this particular Advent of Code puzzle—specifically, that all numbers are valid integers—allow us to use the more direct and efficient int function for converting string representations to integers.
number = tok_(Spec.NUMBER) >> int
We can define parsing rules to enforce specific token sequences and transform them into meaningful objects. For example, to parse a mul function call, we require the following sequence: the mul operator, a left parenthesis, a number, a comma, another number, and a right parenthesis. We then transform this sequence into a Mul object:
from abc import ABC
from dataclasses import dataclass

class Expr(ABC):
    pass

@dataclass
class Mul(Expr):
    alpha: int
    beta: int

    def evaluate(self) -> int:
        return self.alpha * self.beta

    def __str__(self) -> str:
        return f"mul({self.alpha},{self.beta})"
mul = (
    tok_(Spec.OP, "mul")
    + tok_(Spec.LPAREN)
    + number
    + tok_(Spec.COMMA)
    + number
    + tok_(Spec.RPAREN)
) >> (lambda elem: Mul(elem[2], elem[4]))
This rule combines the tok_ parsers for the required tokens (OP, LPAREN, COMMA, RPAREN) with the number parser. The >> operator then transforms the matched sequence into a Mul object, extracting the two numbers from the tuple elem at indices 2 and 4.
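Trying the rule on the output of our tokenizer (a throwaway example of my own):

>>> mul.parse(tokenize("mul(6,9)"))
Mul(alpha=6, beta=9)

Since Mul is a dataclass, the REPL shows its generated repr rather than the mul(...) string produced by __str__.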
We can apply the same principle to define parsing rules for the do and don't operations. These operations take no arguments (represented by empty parentheses) and are transformed into Condition objects:
@dataclass
class Condition(Expr):
    can_proceed: bool
do = (tok_(Spec.OP, "do") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
    lambda elem: Condition(True)
)
dont = (tok_(Spec.OP, "don't") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
    lambda elem: Condition(False)
)
The do rule creates a Condition object with can_proceed = True, while the don't rule creates one with can_proceed = False.
Finally, we combine these individual parsing rules (do, don't, and mul) into a single expr parser using the | (or) operator:
expr = do | dont | mul
This expr parser will attempt to match the input against each of the rules in turn, returning the result of the first successful match.
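A couple of quick checks with throwaway inputs of my own confirm that each rule claims the right OP token:

>>> expr.parse(tokenize("don't()"))
Condition(can_proceed=False)
>>> expr.parse(tokenize("mul(2,3)"))
Mul(alpha=2, beta=3)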
Our expr parser handles complete expressions like mul(2,3), do(), and don't(). However, the input might also contain individual tokens that are not part of these structured expressions. To handle these, we define a catch-all parser called everything:
everything = (
    tok_(Spec.NUMBER) | tok_(Spec.LPAREN) | tok_(Spec.RPAREN) | tok_(Spec.COMMA)
)
This parser uses the | (or) operator to match any single token of type NUMBER, LPAREN, RPAREN, or COMMA. It's essentially a way to capture any stray tokens that aren't part of a larger expression.
With all the components defined, we can now define what constitutes a complete program. A program consists of one or more “calls,” where a “call” is an expression potentially surrounded by stray tokens.
The call parser handles this structure: it matches any number of stray tokens (many(everything)), followed by a single expression (expr), followed by any number of additional stray tokens. The operator.itemgetter(1) function then extracts the matched expression from the resulting sequence.
import operator

from funcparserlib.parser import finished, many
call = many(everything) + expr + many(everything) >> operator.itemgetter(1)
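A short illustration with a made-up fragment: the stray tokens produced by the malformed mul[3,7] are swallowed by many(everything) and discarded by operator.itemgetter(1), leaving only the structured expression:

>>> call.parse(tokenize("mul[3,7]!mul(11,8)"))
Mul(alpha=11, beta=8)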
A full program, represented by the program parser, consists of zero or more calls, ensuring that the entire input is consumed by using the finished parser. The parsed result is then converted into a tuple of expressions.
program = many(call) + finished >> (lambda elem: tuple(elem[0]))
Finally, we group all these definitions into a parse function. This function takes a tuple of tokens as input and returns a tuple of parsed expressions. All the parsers are defined within the function body to prevent polluting the global namespace and because the number parser depends on the tok_ function.
def parse(tokens: tuple[Token, ...]) -> tuple[Expr, ...]:
    number = tok_(Spec.NUMBER) >> int
    everything = (
        tok_(Spec.NUMBER) | tok_(Spec.LPAREN) | tok_(Spec.RPAREN) | tok_(Spec.COMMA)
    )
    mul = (
        tok_(Spec.OP, "mul")
        + tok_(Spec.LPAREN)
        + number
        + tok_(Spec.COMMA)
        + number
        + tok_(Spec.RPAREN)
    ) >> (lambda elem: Mul(elem[2], elem[4]))
    do = (tok_(Spec.OP, "do") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
        lambda elem: Condition(True)
    )
    dont = (tok_(Spec.OP, "don't") + tok_(Spec.LPAREN) + tok_(Spec.RPAREN)) >> (
        lambda elem: Condition(False)
    )
    expr = do | dont | mul
    call = many(everything) + expr + many(everything) >> operator.itemgetter(1)
    program = many(call) + finished >> (lambda elem: tuple(elem[0]))
    return program.parse(tokens)
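Running the whole pipeline over a small corrupted fragment of my own shows the mix of Mul and Condition objects the rest of the solution will work with:

>>> parse(tokenize("a#mul(3,4)b!don't()mul(5,6)_do()mul(2,10)"))
(Mul(alpha=3, beta=4), Condition(can_proceed=False), Mul(alpha=5, beta=6), Condition(can_proceed=True), Mul(alpha=2, beta=10))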
Solving the Puzzle
With our parser in place, solving Part 1 is straightforward. We need to find all mul operations, perform the multiplications, and sum the results. We start by defining an evaluation function that handles Mul expressions:
from functools import reduce

def evaluate_skip_condition(expressions: tuple[Expr, ...]) -> int:
    return reduce(
        operator.add,
        (
            expression.evaluate()
            for expression in expressions
            if isinstance(expression, Mul)
        ),
    )
To solve Part 1, we tokenize and parse the input, then use the evaluate_skip_condition function we just defined to get the final result:
def part1(input: str) -> int:
    return evaluate_skip_condition(parse(tokenize(input.strip())))
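On the same made-up fragment used earlier, the three multiplications contribute 12, 30, and 20, so part1 sums them all regardless of the don't() in between:

>>> part1("a#mul(3,4)b!don't()mul(5,6)_do()mul(2,10)")
62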
For Part 2, we need to skip evaluating mul operations if a don't condition has been encountered. We define a new evaluation function, evaluate_with_condition, to handle this:
from functools import reduce
def evaluate_with_condition(expressions: tuple[Expr, ...]) -> int:
    def reducer(current: tuple[bool, int], incoming: Expr) -> tuple[bool, int]:
        condition, result = current
        match incoming:
            case Mul():
                return (
                    condition,
                    (result + incoming.evaluate()) if condition else result,
                )
            case Condition():
                return (incoming.can_proceed, result)
            case _:
                raise Exception("Unknown expression")

    return reduce(reducer, expressions, (True, 0))[-1]
This function uses reduce with a custom reducer function to maintain a running sum and a boolean flag (condition). The condition flag is updated when a Condition expression (do or don't) is encountered. Mul expressions are only evaluated and added to the sum if the condition is True.
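The post stops at the evaluation function, but a Part 2 wrapper would simply mirror part1. Here is a hypothetical part2 to round things off, evaluated on the same made-up fragment: don't() suppresses mul(5,6) and do() re-enables counting, leaving 12 + 20.

def part2(input: str) -> int:
    # Hypothetical wrapper, mirroring part1: only mul() calls enabled by the
    # most recent do()/don't() condition are summed.
    return evaluate_with_condition(parse(tokenize(input.strip())))

>>> part2("a#mul(3,4)b!don't()mul(5,6)_do()mul(2,10)")
32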
Previous Iteration
Initially, my approach to parsing involved two distinct passes. First, I would tokenize the entire input string, collecting all tokens regardless of their type. Then, in a separate step, I would perform a second tokenization and parsing specifically to identify and process mul operations.
# Simplified representation of the double parsing approach
tokens = tokenize(input) # First pass: all tokens
mul_tokens = tokenize_mul(input) # Second pass: only mul-related tokens
parsed_mul = parser_mul(mul_tokens) # Parse mul tokens
# ... further processing ...
The improved approach eliminates this redundancy by performing the tokenization and parsing in a single pass. We now have a single parser that handles all token types, including those related to mul, do, don't, and other individual tokens.
# Simplified representation of the single parsing approach
tokens = tokenize(input) # Single pass: all tokens
parsed_expressions = parse(tokens) # Parse all tokens in one go
# ... further processing ...
Instead of re-tokenizing the input to find mul operations, we leverage the token types identified during the initial tokenization. The parse function now uses these token types to directly construct the appropriate expression objects (Mul, Condition, etc.). This avoids the redundant scanning of the input and significantly improves efficiency.
That wraps up our parsing adventure for this week’s Advent of Code. While this post required a significant time commitment, the process of revisiting and solidifying my knowledge of lexing and parsing made it a worthwhile endeavour. This was a fun and insightful puzzle, and I’m eager to tackle more complex challenges in the coming weeks and share my learnings.
As always, thank you for reading, and I shall write again, next week.