ButterSus's Personal Website Help

Lexical Analysis

In this topic, we will discover the theory behind lexical analysis, and the most common algorithms used to implement it.

In my opinion, you can skip this topic if you are only interested in the practical side of compilers. However, it's still a good idea to read it, as in universities, it's usually taught before the practical side.

Overview

We will discover the next topics in this order:

  • Foundations

  • Implementation

  • Implementation Details

  • Practical Problems

    Foundations

    Lexical analysis is the first phase of a front-end.

    Compiler Structure

      It takes the program as a sequence of characters, and produces a sequence of symbols called tokens.

      Goals of lexical analysis:

      • Functionality: Produce tokens from the input program. It should be able to handle all valid programs.

      • Programmer Support: Provide useful error messages, correct common mistakes, and provide useful information to the programmer.

      • Remove Redundancy: Remove redundant information from the input program, such as comments and whitespace.

      • Identify Keywords: Identify keywords in the input program and separate them from identifiers. It's important as parser uses token types to determine the structure of the program.

      • Construct Symbol Table: Construct a symbol table for the input program. It's typically a map with all strings in the program, and instead of storing the string itself, it stores a key to the string in the symbol table. (Strings appear multiple times)

        Notations

        Before we start, it's nice to have a common understanding of regular expressions and their practical use.

        Math behind Lexical Analysis

        We are going to use a lot of mathematical notations in this topic, most of them are related to set theory.

        Alphabets

        The first thing we need to define is the alphabet of the language.

          You can think of an alphabet as a set of characters. Characters are the smallest unit of a language, and they are used to construct words. The alphabet of a language is finite.

          Now let's take a look at operations on alphabets.

            They are pretty straightforward, these operations can remind you of regular expressions. Basically, now we are dealing with sets of strings.

            Also, we can use default operations on sets, such as union, intersection, etc.

              Examples

                Words and Letters

                In the previous section, we defined alphabets using characters and strings.

                I prefer to use greek letters for words, and latin letters for characters.

                  But it's more correct to call them letters and words.

                  More formally, word is a concatenation of letters.

                  It's convenient to use the second notation, as it's more compact.

                    Concatenation of words is also a word.

                      Besides concatenation, there are also few other operations on words.

                        These are not words anymore, it's languages.

                        Examples

                          Languages

                          We call a set of words a language. L is a common notation for languages.

                            What is funny, it's already enough to define simple constructions, such as numbers.

                              Operations on languages produce new languages.

                                Examples

                                  Regular Languages

                                  Regular languages are languages that can be described using regular expressions.

                                  Regular Language

                                    More formally, regular languages more likely can find practical use, as they are easier to implement using finite state machines.

                                    Regular Expressions

                                    Each regular language can be described using regular expressions.

                                    Expression

                                    Language

                                    Example words

                                    a | b

                                    {a, b}

                                    a, b

                                    ab*a

                                    {a}{b}*{a}

                                    aa, aba, abba

                                    (ab)*

                                    {ab}*

                                    ε, ab, abab, ...

                                    a?

                                    {ε, a}

                                    ε, a

                                    We guess that you are already familiar with regular expressions, so we won't go into details.

                                    To show that a language is regular, we can use next notation:

                                      Automata

                                      Finite state machines also known as automata are often used to describe regular languages. However, usage of automata is not limited to regular languages.

                                      In the case of lexers, we are going to think of automata as program that implements regular expressions. This is the main part of the lexer.

                                      Facts about automata:

                                      • It makes transitions from one configuration to another. (They are called states)

                                      • Each configuration consists of (the rest of) the input and some memory. (We use this information to determine the next state)

                                      • The type of memory determines its ability to recognize a class of languages.

                                        A Non-Deterministic Finite-State Machine (NFSM)

                                        One of the simplest automata is a non-deterministic finite-state machine.

                                          Other parts will be clear after we see an example.

                                          Example of NFSM

                                          We use directed graph to represent NFSM.

                                          NFSM Example

                                          Let's take a look at fragments of NFSM:

                                          • NFSM - initial state (first state when we start the program)

                                          • NFSM - alphabet

                                          • NFSM and NFSM - final states (with double stroke)

                                          • NFSM - transition relation (table)

                                          • NFSM and NFSM - transition requirements as part of the transition relation

                                          • states are represented as circles

                                          In this example, we have NFSM that recognizes fractional numbers with scientific notation. 2.14e3 as example.

                                            The only problem with NFSM is that its time complexity is exponential, as we need to check all possible paths.

                                            FSM to Lexical Analyzer

                                            Of course, we can't use FSM (NFSM) as a lexical analyzer without any modifications.

                                            Here's my diagram of how FSM is used in the lexical analyzer.

                                            NFSM to Lexer

                                            As you can see, each time we end new token, we need to reset the FSM to the initial state with the rest of the input.

                                              Maximal Munch Strategy

                                              Note, our FSM uses greedy strategy; it tries to match as many characters as possible.

                                              It's shown in the diagram as saving the last end token position (if it's a final state).

                                              Example

                                              Let's assume we have the next input: 2.14e+s:

                                              1. We save the last end token position a few times, the last one is 2.14.

                                              2. We can save 'e+' as part of the number if it ends with a digit like so: 2.14e+3.

                                              3. After it gets to the s, we can't find any legal transition, so we end the token.

                                              4. It's important that in this case we are using non-deterministic automata, so we can try to find another token with e+s.

                                              5. We will discover other approaches to handle this case later when we talk about deterministic automata.

                                                The Language Accepted by the FSM

                                                Firstly, let's define what configurations are.

                                                  Configuration is a pair of state and the rest of the input. It's used to desribe steps in formal way. Like so:

                                                    That operation is called transitive closure. To show more than one step, we can use reflexive transitive closure.

                                                      We can also define language using FSM.

                                                        Theorem 1

                                                        Finally, we can give a formal definition of regular languages.

                                                        Regular Language

                                                          Regular Expression

                                                            Constructive Proof 1

                                                            Firstly, we need to construct automata with one final state and one transition: FSM

                                                            Then we can decompose r and develop the NFSM according to the following rules: (until the whole expression is decomposed)

                                                            FSM

                                                            RegEx

                                                            Meaning

                                                            FSM

                                                            Union of two languages

                                                            FSM

                                                            Concatenation of two languages

                                                            FSM

                                                            Kleene star of a language (We are using epsilon transition to avoid collision with other transitions of `q` & `p`)

                                                            Deterministic Finite-State Machine (DFSM)

                                                            In modern compilers, NFSM is not used anymore, since its time complexity won't allow us to implement big languages.

                                                            As you can guess, we are going to use deterministic finite-state machine:

                                                            Theorem 2

                                                            The good news is that DFSM can recognize the same languages as NFSM.

                                                              Constructive Proof 2

                                                              We can just think of each state as a set of states of NFSM. The DFSM simulates all possible transition paths under an input word in parallel.

                                                              Let's determine a set of new states:

                                                                Before we look at the formal algorithm, let's take a look at the example.

                                                                Example

                                                                Let's construct NFSM for the next regular expression: a(a|b)*.

                                                                Example

                                                                State

                                                                a

                                                                b

                                                                ε

                                                                0

                                                                1

                                                                1

                                                                2, 8

                                                                2

                                                                3, 5

                                                                3

                                                                4

                                                                4

                                                                7

                                                                5

                                                                6

                                                                6

                                                                7

                                                                7

                                                                2, 8

                                                                It's convenient to use a table to represent the transition relation. By the way, diagram was generated by this online tool.

                                                                To construct set of states, we need to start from the initial state 0 and find all possible transitions, including epsilon transitions.

                                                                State

                                                                a

                                                                b

                                                                0

                                                                {1, 2, 3, 5, 8}

                                                                {1, 2, 3, 5, 8}

                                                                {2, 3, 4, 5, 7, 8}

                                                                {2, 3, 5, 6, 7, 8}

                                                                {2, 3, 4, 5, 7, 8}

                                                                {2, 3, 4, 5, 7, 8}

                                                                {2, 3, 5, 6, 7, 8}

                                                                {2, 3, 5, 6, 7, 8}

                                                                {2, 3, 4, 5, 7, 8}

                                                                {2, 3, 5, 6, 7, 8}

                                                                We can apply one obvious optimization: remove duplicates.

                                                                State

                                                                a

                                                                b

                                                                0

                                                                {1, 2, 3, 5, 8}

                                                                {{1, 2, 3, 5, 8}, ...}

                                                                {2, 3, 4, 5, 7, 8}

                                                                {2, 3, 5, 6, 7, 8}

                                                                Now we have only two states, and we can rename them to q_0 and q_1. Unoptimized and optimized versions are shown below.

                                                                Example Example

                                                                Go play with this online tool to get more experience.

                                                                  Formally, we can define the algorithm as follows:

                                                                    DFSM Minimization

                                                                    The last thing we need to do is to minimize DFSM. It means that we need to remove redundant states.

                                                                    Last modified: 12 January 2024