Parsing Binary Operations
We'll start by parsing expressions that contain a single operator, such as +
, that has two arguments. For example, we'll parse 4 + 5
or x * 3
, but not 4 + x * 3
.
This already introduces quite a bit of complexity. In the two examples above we have:
- literal numbers such as
3
and5
; - a variable,
x
; and - the binary operators
+
and*
.
We'll need to define parsers for each of these components and then compose them together to produce our overall parser.
Let's start with the atomic elements: numbers and variables. We can use the following definitions:
- a literal number is 1 or more digits; and
- a variable is 1 or more alphabetic characters.
This naturally requires us to define digits and alphabetic characters:
- a digit is the character
0
,1
,2
,3
,4
,5
,6
,7
,8
, or9
; and - an alphabetic character is one of the lowercase characters between
a
andz
.
This is sufficient for us to start writing parsers. I'll get you started.
val alphabetic: Parser[String] =
List('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z')
.foldLeft(Parser.fail[String]){ (parser, char) =>
parser.orElse(Parser.string(char.toString))
}
val variable = alphabetic.and(alphabetic.repeat)
Now you go ahead and implement parsers for digits and literals.
Abstract Syntax Trees
Now we can parse very simple expressions, we should create a data structure that stores the result of our parse. In the compiler literature these data structures are known as abstract syntax trees (ASTs). They represent the syntax of the input we parsed, and are usually organized as a tree. In Scala we usually represent abstract syntax trees as algebraic data types (ADTs).
Here's an AST, which is also an ADT, for representing the expressions we described above:
sealed trait Expression {
import Expression._
def +(that: Expression): Expression =
Add(this, that)
def *(that: Expression): Expression =
Multiply(this, that)
}
object Expression {
def literal(value: Int): Expression = Literal(value)
def variable(value: String): Expression = Variable(value)
final case class Literal(value: Int) extends Expression
final case class Variable(value: String) extends Expression
final case class Add(left: Expression, right: Expression) extends Expression
final case class Multiply(left: Expression, right: Expression) extends Expression
}
Notice that I've made a bit of effort to make this more pleasant to use. The methods +
and *
make it easy to construct compound expressions. The "smart constructors" on the companion object return objects of type Expression
, which avoids some issues with type inference that you probably won't even realize we've side-stepped.
Also note that this AST is more expressive than the expressions we're trying to parse. We're only parsing expressions that contain a single operator, but our AST can represent arbitrarily nested operators.
Update your parsers so they produce values of type Expression
on a successful parse. We don't yet have parsers for compound expressions (addition and multiplication) but that's our next step.
Binary Operations
Our next step is to parse the binary operations +
and *
. You might already see how to do this. If so, try it yourself before reading on. If not, I'll show how to implement a parser for addition. Multiplication is a straightforward modification of it.
An addition is a literal or variable, followed by the addition sign, followed by another literal or variable. The actually not entirely correct. There can be whitespace separating any of these elements, and our parser needs to account for that.
Here's my implementation.
def charIn(char: Char, chars: Char*): Parser[String] =
chars.foldLeft(Parser.string(char.toString)){ (parser, char) =>
parser.orElse(Parser.string(char.toString))
}
val whitespace = charIn(' ', '\t', '\n').repeat
val factor: Parser[Expression] = literal.orElse(variable)
val addition = (factor, (whitespace *> charIn('+') <* whitespace), factor)
.mapN((l, _, r) => l + r)
A few points:
- I defined a parser
factor
to refer to a literal or a variable. - I created a utility method,
charIn
, to make creating parsers easier. - I've also used
mapN
,*>
, and<*
, which are methods defined by Cats for all applicatives, to make the code a bit more compact. You should be familiar withmapN
, but*>
and<*
may be new to you. The way to read them is as binary operations that only keep the value pointed to by the arrow. So inwhitespace *> charIn('+') <* whitespace
I only keep the result of parsing the+
and discard the result of parsing any whitespace. These methods are on the edge for me: they are not so well known that I feel confident using them in all codebases. In this context, though, it's an opportunity for you learn so I've used them.
Lessons on API Design
Our work so far as exposed some flaws in our current design. Here are the issues I noted:
-
We have to use a
Parser[String]
just to look for a singleChar
. AString
contains 0 or more characters, while aChar
is exactly one character. It would be better if could use the more precise typeChar
when we want exactly oneChar
. -
charIn
is the kind of method that should be provided by our parser library, as it's a case that frequently occurs. There are at least two variants we can envisage: one where we specify the characters we're looking for, and one where we specify a predicate that istrue
if the character is one we're looking for. -
The construction
parser.and(parser.repeat)
is an inconvenient way to express one or more repetitions of a parser, and is certainly a bit opaque to someone new to the library. This operation is sometimes known as the Kleene plus and is very common. We should consider directly supporting it.
Designing using type classes gets us some of the way to a good API, but not all of the way. We still need domain specific methods like the Kleene star. Working on examples is a good way to discover missing methods, as we've done here.
I added the following constructors:
Parser.char(value: Char): Parser[Char]
Parser.charIn(char: Char, chars: Char*): Parser[Char]
Parser.stringIn(string: String, strings: String*): Parser[String]
To keep the interpreter small I haven't directly reified charIn
or stringIn
, instead implementing them in terms of simpler methods.
def charIn(char: Char, chars: Char*): Parser[Char] =
chars.foldLeft(Parser.char(char)) { (parser, char) =>
parser.orElse(Parser.char(char))
}
def stringIn(string: String, strings: String*): Parser[String] =
strings.foldLeft(Parser.string(string)) { (parser, string) =>
parser.orElse(Parser.string(string))
}
I then reworked the repeat
combinator:
- I added
repeatAtLeast(minimum: Int)
, which allows the caller to specify the minimum number of repeats for a succesful parse; - I added
zeroOrMore
andoneOrMore
as special cases ofrepeatAtLeast
; and - I added
repeatBetween(min: Int, max: Int)
to allow specifying a maximum number of repeats.
As with charIn
and stringIn
, I implemented repeatAtLeast
in terms of the existing ParserRepeat
reificiation and and
.
def repeatAtLeast(minimum: Int)(implicit m: Monoid[A]): Parser[A] = {
def loop(count: Int): Parser[A] =
if (count == 0) ParserRepeat(this, m)
else ParserAnd(this, loop(count - 1), m)
loop(minimum)
}
As we add a richer API to our library we'll find there is a tension between keeping the internals simpler by implementing new methods in terms of existing ones, and improving performance by taking advantage of the additional information we have when implementing a new method directly. We'll see more of this in the chapter on optimization.