Compilers — Lexical Analysis

Kevin Da Silva
4 min read · Jan 6, 2022


Hi everyone, I'm about to start what I think will be a new series on the basics of compilers and interpreters. Since I'm starting in the last week of December, this series will only wrap up in 2022. So let's get started with today's topic.

But first, what is a compiler made of?

A compiler has at least four phases: lexical analysis (today's topic), syntactic analysis, semantic analysis, and code generation (interpreters normally don't implement this last one). Both compilers and interpreters can have more phases, such as module-system aggregation or optimization passes, for example.

Best way to get started?

To get started and put the theory into a real-world application of compiler principles, we are going to build a lisp-like arithmetic expression interpreter in Elixir, though you can implement it in any programming language you want.

A lisp expression follows these rules:

(+ 5 5) // 10
open paren | arithmetic operation | number | number | close paren

and can also be nested:

(+ (* 3 2) 5) // 11

Following the parenthesis rule, 3 * 2 is solved first, and then 6 + 5 is evaluated, giving 11.

If you want to have some spoilers in the project, here's the link on GitHub.

What is lexical analysis?

Lexical analysis is basically a DFA (deterministic finite automaton): we take our source code as one huge string and break the program into tokens. Let's suppose we have a source file like:

// my_program.mylang
print "hello world"
sum = 1 + 2

A lexical analyzer would receive the following

"print \"hello world\"\nsum = 1 + 2"

And would read the whole string and break it into a list like this:

[{:fn_call, "print"}, {:string, "hello world"}, {:identifier, "sum"}, {:assign, "="}, {:number, "1"}, {:fn_call, "+"}, {:number, "2"}]

This is what in compilers is generally called a token list. Lexical analysis's main responsibility is to label every part of our program.

Notice that the + sign was labeled as a function call: in compilers, even arithmetic operators are functions (infix functions, to be precise; see Haskell for an example), which is the same thing as writing a sum(n1, n2) function.

Now look at our lisp expression (+ 5 5): the + operator works just like a sum(5, 5) function call, because in lisp the operator sits in a position that we call prefix.
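To make this concrete, here's a minimal sketch in Elixir, where the + operator is literally the function Kernel.+/2, so the infix form and a prefix-style call are the same thing:

```elixir
# Elixir's + operator is really the function Kernel.+/2,
# so infix 5 + 5 and a prefix-style call give the same result.
infix = 5 + 5
prefix = apply(Kernel, :+, [5, 5])

IO.inspect({infix, prefix})
# {10, 10}
```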

Now let's implement our lexical analysis in Elixir.

First, we create a new project. I will call it lisp_eval:

mix new lisp_eval

Then we create a file called lexical_analysis.ex

current path: lisp_eval/lib/lisp_eval/lexical_analysis.ex

And then we declare our module LexicalAnalysis

defmodule LexicalAnalysis do
end

And after it, we add two basic sets to help us in the labeling phase

@operators will be our set of operators allowed in our evaluator and will contain the four main arithmetic operators: ["/", "+", "-", "*"]

@numbers will contain the digits from 0 to 9, and we are going to use this set to label any valid integer value

Notice that in @numbers we create the range 0..9 and turn each digit into a string, because our program arrives as a string, so we'll have to work in the context of strings.

defmodule LexicalAnalysis do
  @operators ["/", "+", "-", "*"]
  @numbers 0..9 |> Enum.map(&Integer.to_string/1)
end
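If you want to check what @numbers evaluates to, the same pipeline can be run directly in iex:

```elixir
# The range 0..9 is mapped to its string form, because the
# source program is processed character by character as strings.
numbers = 0..9 |> Enum.map(&Integer.to_string/1)

IO.inspect(numbers)
# ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
```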

Now let's add our lexical analysis function to label our values:

defmodule LexicalAnalysis do
  @operators ["/", "+", "-", "*"]
  @numbers 0..9 |> Enum.map(&Integer.to_string/1)

  def lexical_analysis(["(" | tail]), do: lexical_analysis(tail) ++ [{:open_paren, "("}]
  def lexical_analysis([")" | tail]), do: lexical_analysis(tail) ++ [{:close_paren, ")"}]

  def lexical_analysis([head | tail]) when head in @operators do
    lexical_analysis(tail) ++ [{:operator, head}]
  end

  # digits are skipped for now; a second DFA will handle them
  def lexical_analysis([head | tail]) when head in @numbers do
    lexical_analysis(tail)
  end

  def lexical_analysis([" " | tail]), do: lexical_analysis(tail)
  def lexical_analysis([]), do: []
  def lexical_analysis([s | _]), do: raise "Symbol(#{s}) not expected"
end

Notice that we created a recursive function that uses pattern matching and guards to label the values: we check our program character by character and label each one, e.g. "(" -> {:open_paren, "("}.

Also notice that we are ignoring the number values for now, because to label numbers properly we are going to need another DFA.
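You can see the consequence of skipping digits by running this first version on our sample expression. Here's a standalone copy of the module above (renamed FirstPass only so it can live next to the final version):

```elixir
# A standalone copy of the first version of the lexer, to show
# that at this stage digits are silently dropped from the output.
defmodule FirstPass do
  @operators ["/", "+", "-", "*"]
  @numbers 0..9 |> Enum.map(&Integer.to_string/1)

  def lexical_analysis(["(" | tail]), do: lexical_analysis(tail) ++ [{:open_paren, "("}]
  def lexical_analysis([")" | tail]), do: lexical_analysis(tail) ++ [{:close_paren, ")"}]
  def lexical_analysis([head | tail]) when head in @operators, do: lexical_analysis(tail) ++ [{:operator, head}]
  def lexical_analysis([head | tail]) when head in @numbers, do: lexical_analysis(tail)
  def lexical_analysis([" " | tail]), do: lexical_analysis(tail)
  def lexical_analysis([]), do: []
end

"(+ 5 5)"
|> String.split("", trim: true)
|> FirstPass.lexical_analysis()
|> IO.inspect()
# [{:close_paren, ")"}, {:operator, "+"}, {:open_paren, "("}]
# notice the 5s are gone
```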

Now we need a finite automaton to label our integer values:

defmodule LexicalAnalysis do
  @operators ["/", "+", "-", "*"]
  @numbers 0..9 |> Enum.map(&Integer.to_string/1)

  def lexical_analysis(["(" | tail]), do: lexical_analysis(tail) ++ [{:open_paren, "("}]
  def lexical_analysis([")" | tail]), do: lexical_analysis(tail) ++ [{:close_paren, ")"}]

  def lexical_analysis([head | tail]) when head in @operators do
    lexical_analysis(tail) ++ [{:operator, head}]
  end

  # hand digits over to the number DFA
  def lexical_analysis([head | tail]) when head in @numbers do
    number_automaton(tail, head)
  end

  def lexical_analysis([" " | tail]), do: lexical_analysis(tail)
  def lexical_analysis([]), do: []
  def lexical_analysis([s | _]), do: raise "Symbol(#{s}) not expected"

  # accumulate consecutive digits into a single number lexeme
  def number_automaton([head | tail], number) when head in @numbers do
    number_automaton(tail, number <> head)
  end

  def number_automaton(ls, number), do: lexical_analysis(ls) ++ [{:number, number}]
end

So here we have our number DFA: it concatenates every consecutive character found in the @numbers set, and as soon as a character is not a digit, it emits the number token and hands control back to the lexical analysis step.
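The payoff is multi-digit numbers: consecutive digits get grouped into a single token. Here's a standalone sketch of the finished module (renamed Lexer only so it can coexist with the version above), run on an expression I made up, (+ 42 7):

```elixir
# A standalone copy of the finished lexer, showing that
# consecutive digits become one number token ("42").
defmodule Lexer do
  @operators ["/", "+", "-", "*"]
  @numbers 0..9 |> Enum.map(&Integer.to_string/1)

  def lexical_analysis(["(" | tail]), do: lexical_analysis(tail) ++ [{:open_paren, "("}]
  def lexical_analysis([")" | tail]), do: lexical_analysis(tail) ++ [{:close_paren, ")"}]
  def lexical_analysis([head | tail]) when head in @operators, do: lexical_analysis(tail) ++ [{:operator, head}]
  def lexical_analysis([head | tail]) when head in @numbers, do: number_automaton(tail, head)
  def lexical_analysis([" " | tail]), do: lexical_analysis(tail)
  def lexical_analysis([]), do: []
  def lexical_analysis([s | _]), do: raise "Symbol(#{s}) not expected"

  def number_automaton([head | tail], number) when head in @numbers, do: number_automaton(tail, number <> head)
  def number_automaton(ls, number), do: lexical_analysis(ls) ++ [{:number, number}]
end

"(+ 42 7)"
|> String.split("", trim: true)
|> Lexer.lexical_analysis()
|> IO.inspect()
# [{:close_paren, ")"}, {:number, "7"}, {:number, "42"},
#  {:operator, "+"}, {:open_paren, "("}]
```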

Our lexical analyzer is done, so let's test our program:

iex "./lib/lisp_eval/lexical_analysis.ex"

In iex, we will create our sample program:

iex(1)> program = "(+ 5 5)" |> String.split("", trim: true)

After the split, our program is a list of characters, and with this list we simply call our lexical analyzer:

iex(2)> output = LexicalAnalysis.lexical_analysis(program)

And if we type:

IO.inspect output

We should get a result like this:

[{:close_paren, ")"}, {:number, "5"}, {:number, "5"}, {:operator, "+"}, {:open_paren, "("}]

A labeled list of our simple (+ 5 5) expression.
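If the right-to-left order surprises you: it happens because each clause appends its token after the recursive call returns. A quick Enum.reverse on the output restores source order; a small sketch using the output list above:

```elixir
# Each clause appends its token AFTER recursing on the rest of the
# input, so the token list comes out right-to-left; reversing it
# restores source order.
output = [
  {:close_paren, ")"}, {:number, "5"}, {:number, "5"},
  {:operator, "+"}, {:open_paren, "("}
]

output |> Enum.reverse() |> IO.inspect()
# [{:open_paren, "("}, {:operator, "+"}, {:number, "5"},
#  {:number, "5"}, {:close_paren, ")"}]
```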

I really hope you made it all the way here, that I covered the basics of lexical analysis in a didactic way, and that this brought you a bit closer to compiler theory.

See you at the next one!


Kevin Da Silva

I'm a back-end developer, functional programming lover, fascinated by computer science and languages. From the south part of Brazil to the world