“Parse, don’t validate” has been one of my favourite programming articles for some time. The main gist of the article is that, when writing in a type-driven fashion, your snappy slogan should be:
Parse, don’t validate.
The core difference between parsing and validating can be explained by looking at two very similar functions:
parseInt :: String -> Maybe Int
parseInt str = Text.Read.readMaybe str
validateInt :: String -> Bool
validateInt str = Text.Read.readMaybe str /= Nothing
As you can see, they look very similar. The main difference is that
parseInt
returns a useful value, the Int
that we wanted to parse, while
validateInt
takes that useful value and throws it away. This is also
mentioned in the wonderful Haskell Mini-Patterns
Handbook
as the “Evidence” pattern.
The key issue here is that by calling a function that returns Bool you lose the information about earlier performed validation. Instead, you can keep this information by explicitly pattern-matching on the validation or result.
In this post, I would like to go through a practical example that shows the power of bringing this concept to its limits. Which brings us to…
Advent of Code 2020, Day 4
We are tasked with parsing a batch of passports composed of these fields:
- byr (Birth Year)
- iyr (Issue Year)
- eyr (Expiration Year)
- hgt (Height)
- hcl (Hair Color)
- ecl (Eye Color)
- pid (Passport ID)
- cid (Country ID)
All the fields are required except for the cid
field, which is optional.
Note that the fields can be written in any order, this will be important
later. Our batch is composed of multiple passports separated by empty lines
(the input.txt
):
ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm
iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929
hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm
hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in
- The first passport is valid - all required fields are present.
- The second passport is invalid - it is missing
hgt
. - The third passport is interesting: the only missing field is the optional
cid
, which makes it valid. - The fourth passport is missing two fields,
cid
andbyr
. Missingcid
is fine, but missingbyr
is not, so this passport is invalid.
The challenge is to count how many passports are valid in the given batch.
Let’s write some code to open the file and parse each group of passport fields:
module Main where
import qualified Data.List.Split as S
main :: IO ()
main = do
contents <- readFile "input.txt"
let entries = map parseEntry (S.splitOn "\n\n" contents)
print entries
data PassportEntry = PassportEntry
deriving (Show)
parseEntry :: String -> PassportEntry
parseEntry text = undefined
Nothing too fancy here, we’re using Data.List.Split
from the split
package to do the heavy lifting. And the implementation of parseEntry
has
been conveniently stubbed so that the code compiles.
Now how should our PassportEntry
data structure look like? I’d love to
eventually represent passports as:
data Passport = Passport
{ birthYear :: Int,
issueYear :: Int,
expirationYear :: Int,
height :: String,
hairColor :: String,
eyeColor :: String,
passportId :: String,
countryId :: Maybe Int
}
If we imagine parsing each field sequentially, we can see that we won’t be
able to construct this data structure in a single operation. We’ll have to
accumulate the data up until we’re ready to create a proper Passport
.
One way to store the fields is to insert them into a hash. First of all,
we’re going to use a custom data type to represent the keys of the hash.
Why is that? We really don’t want to be making typos later when comparing
raw strings like "ecl"
and "elc"
. We’ll use a HashMap
from the
Data.HashMap.Strict
module:
import qualified Data.HashMap.Strict as HM
data PassportField
= BirthYear
| IssueYear
| ExpirationYear
| Height
| HairColor
| EyeColor
| PassportId
| CountryId
deriving (Eq, Show)
type PassportEntry = HM.HashMap PassportField String
Of course things can’t be that easy. We also need to make our type
implement the Hashable
typeclass:
{-# LANGUAGE DeriveGeneric #-}
import qualified Data.HashMap.Strict as HM
import Data.Hashable
import GHC.Generics (Generic)
data PassportField
= BirthYear
| IssueYear
| ExpirationYear
| Height
| HairColor
| EyeColor
| PassportId
| CountryId
deriving (Eq, Show, Generic)
instance Hashable PassportField
type PassportEntry = HM.HashMap PassportField String
Don’t worry about what we’ve added. Just take them as God-given truths. 👼
Okay, now we can implement our parseEntry
function:
import Data.Maybe (mapMaybe)
import qualified Data.Char as Char
parseEntry :: String -> PassportEntry
parseEntry line =
HM.fromList $
mapMaybe parseTag $
S.splitWhen Char.isSpace line
parseTag :: String -> Maybe (PassportField, String)
parseTag value =
case S.splitOn ":" value of
["byr", byr] ->
Just (BirthYear, byr)
["iyr", iyr] ->
Just (IssueYear, iyr)
["eyr", eyr] ->
Just (ExpirationYear, eyr)
["hgt", height] ->
Just (Height, height)
["hcl", color] ->
Just (HairColor, color)
["ecl", color] ->
Just (EyeColor, color)
["pid", pid] ->
Just (PassportId, pid)
["cid", cid] ->
Just (CountryId, cid)
_ ->
Nothing
We try to parse each field (such as byr:2002
) into a PassportField
type, then end up building a hash using HM.fromList
. We can take this for
a spin:
Prelude> :l Main.hs
*Main> main
[ fromList
[ (CountryId, "147"),
(BirthYear, "1937"),
(IssueYear, "2017"),
(HairColor, "#fffffd"),
(ExpirationYear, "2020"),
(EyeColor, "gry"),
(Height, "183cm"),
(PassportId, "860033327")
],
fromList
[ (CountryId, "350"),
(BirthYear, "1929"),
(IssueYear, "2013"),
(HairColor, "#cfa07d"),
(ExpirationYear, "2023"),
(EyeColor, "amb"),
(PassportId, "028048884")
],
fromList
[ (BirthYear, "1931"),
(IssueYear, "2013"),
(HairColor, "#ae17e1"),
(ExpirationYear, "2024"),
(EyeColor, "brn"),
(Height, "179cm"),
(PassportId, "760753108")
],
fromList
[ (IssueYear, "2011"),
(HairColor, "#cfa07d"),
(ExpirationYear, "2025"),
(EyeColor, "brn"),
(Height, "59in"),
(PassportId, "166559648")
]
]
Nice and tidy! 🦾
Now our goal is to verify which one of these groups is valid. First of all, we should define a list of required fields:
requiredFields :: [PassportField]
requiredFields =
[ BirthYear,
IssueYear,
ExpirationYear,
Height,
HairColor,
EyeColor,
PassportId
]
We can then define a validation function:
isEntryValid :: PassportEntry -> Bool
isEntryValid entry =
all (`HM.member` entry) requiredFields
And change our main function to use that:
main :: IO ()
main = do
contents <- readFile "input.txt"
let entries = map parseEntry (S.splitOn "\n\n" contents)
print $ length $ filter isEntryValid entries
Running this yields 2
, which is the correct answer!
Here is
all the code we have written so far, if you’re feeling like you need a
refresher.
Advent of Code 2020, Day 4, Part II
In the second part of the challenge, these new rules are added:
- byr (Birth Year) - four digits; between 1920 and 2002.
- iyr (Issue Year) - four digits; between 2010 and 2020.
- eyr (Expiration Year) - four digits; between 2020 and 2030.
- hgt (Height) - a number followed by either cm or in:
- If cm, the number must be between 150 and 193.
- If in, the number must be between 59 and 76.
- hcl (Hair Color) - a '#' followed by six chars 0-9 or a-f.
- ecl (Eye Color) - one of: amb blu brn gry grn hzl oth.
- pid (Passport ID) - a nine-digit number.
- cid (Country ID) - ignored, missing or not.
Here is a new batch for us to peruse:
pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980
hcl:#623a2f
eyr:1972 cid:100
hcl:#18171d ecl:amb hgt:170 pid:186cm iyr:2018 byr:1926
- the first passport is valid
- the second passport is invalid (look at the
eyr
field)
These new requirements are a bit annoying. Our simple approach of checking
if all required fields are present won’t work any longer. We can instead
implement a isFieldValid
function to check if all fields are valid.
isFieldValid :: (PassportField, String) -> Bool
isFieldValid (field, value) =
case field of
BirthYear ->
let v = toInt value
in length value == 4 && v >= 1920 && v <= 2002
IssueYear ->
let v = toInt value
in length value == 4 && v >= 2010 && v <= 2020
ExpirationYear ->
let v = toInt value
in length value == 4 && v >= 2020 && v <= 2030
Height ->
case span Char.isDigit value of
(num, "cm") ->
let n = toInt num
in n >= 150 && n <= 193
(num, "in") ->
let n = toInt num
in n >= 59 && n <= 76
_ ->
False
HairColor ->
case (length value, value) of
(7, '#' : rest) ->
all (`elem` allowedHexChars) rest
_ ->
False
EyeColor ->
value `elem` validEyeColors
PassportId ->
length value == 9 && all Char.isDigit value
CountryId ->
all Char.isDigit value
toInt :: String -> Int
toInt = read
validEyeColors :: [String]
validEyeColors =
["amb", "blu", "brn", "gry", "grn", "hzl", "oth"]
allowedHexChars :: [Char]
allowedHexChars =
['0' .. '9'] <> ['a' .. 'f']
Then we can change our isEntryValid
to use this function:
isEntryValid :: PassportEntry -> Bool
isEntryValid entry =
requiredFieldsPresent && allFieldsValid
where
requiredFieldsPresent =
all (`HM.member` entry) requiredFields
allFieldsValid =
all isFieldValid (HM.toList entry)
Running this program on our second data sample yields 1
, and it will be
good enough to solve the Advent of Code challenge and get us those sweet
sweet stars.
🎉 🎉 🎉
A moment of reflection
If we look back at the current state of our code, we can see that we are doing a lot of validations.
We do a lot of work to verify if something is valid, then throw it all out
of the window to return a meagre Bool
. German folks from the sixteenth
century
would have told us snarkily:
das Kind mit dem Bade ausschütten
Yes, literally throwing the baby out with the bathwater.
This would be even more true if we were given a new challenge:
Now find the unique set of eye colors in all valid passports
With our current code, we know which passport is valid, but we have no way
of extracting the eye color of a valid passport. This is why earlier we
were mentioning this sort of Passport
representation:
data Passport = Passport
{ birthYear :: Int,
issueYear :: Int,
expirationYear :: Int,
height :: String,
hairColor :: String,
eyeColor :: String,
passportId :: String,
countryId :: Maybe Int
}
If we had a function like parsePassport
that went from String
to Maybe
Passport
we could then write some code like this:
Set.fromList $
map eyeColor $
mapMaybe parsePassport (S.splitOn "\n\n" contents)
But let’s not get too ahead of ourselves. Let’s try to refactor our current code to do something similar. First we can try to write a function like this one:
entryToPassport :: PassportEntry -> Maybe Passport
This function takes the intermediate representation of a collection of
passport fields and returns a ‘validated’ passport. We can also reuse our
isFieldValid
function by using this trick:
parseField ::
(PassportField, String) -> Maybe (PassportField, String)
parseField tuple =
if isFieldValid tuple
then Just tuple
else Nothing
We are still reusing the validating logic, but we end up returning something useful instead. Remember, we are slowly migrating our code from validating data to parsing data.
Using our new helper we can finally implement the entryToPassport
function. We’ll do that in two separate steps. First we’ll get all the
values of the required fields:
getAllRequiredFields :: PassportEntry -> Maybe [String]
getAllRequiredFields e =
traverse
( \field -> do
v <- HM.lookup field e
(_field, text) <- parseField (field, v)
return text
)
requiredFields
The traverse
magic ensures that we either get all the values we’re
looking for wrapped in a Just
, or Nothing
if any of those fields were
invalid. Ok, we’re ready to roll now!
entryToPassport :: PassportEntry -> Maybe Passport
entryToPassport entry = do
case getAllRequiredFields entry of
Just [byr, iyr, eyr, hgt, hcl, ecl, pid] ->
Just $
Passport
{ birthYear = toInt byr,
issueYear = toInt iyr,
expirationYear = toInt eyr,
height = hgt,
hairColor = hcl,
eyeColor = ecl,
passportId = pid,
countryId = toInt <$> HM.lookup CountryId entry
}
_ ->
Nothing
We end up having to pass String
values around, which need to be parsed
again into the exact types that we desire. Also we need to pass these
values into a list and hope not to mess up the ordering of the fields. So
it’s far from perfect, but we’re getting somewhere.
In order to use this function in our main, we replace the last line of the main fuction from:
print $ length $ filter isEntryValid entries
to:
print $ length $ mapMaybe entryToPassport entries
Running
this on
our test batch still returns 1
, which is a good sign we haven’t broken
anything.
Still, there is one thing that I’m particularly unhappy about in this code:
we use an intermediate representation of the passport that has no real
domain value. Nobody cares about PassportField
and PassportEntry
, but
we need to have these types in order to build our Passport
.
Not only that, but having these intermediate types means that there are bugs waiting to happen when we transform them to our desired data type:
- getting the order of the fields wrong
- parsing strings in inconsistent ways
- forgetting to validate the presence of required fields
This is also known as shotgun parsing:
Shotgun parsing is a programming antipattern whereby parsing and input-validating code is mixed with and spread across processing code—throwing a cloud of checks at the input, and hoping, without any systematic justification, that one or another would catch all the “bad” cases.
In “Parse, don’t validate”, Alexis King goes on to describe how this is specifically related to parsing and validating:
It may not be immediately apparent what shotgun parsing has to do with validation—after all, if you do all your validation up front, you mitigate the risk of shotgun parsing. The problem is that validation-based approaches make it extremely difficult or impossible to determine if everything was actually validated up front or if some of those so-called “impossible” cases might actually happen. The entire program must assume that raising an exception anywhere is not only possible, it’s regularly necessary.
How can we avoid doing this? Let’s try something new!
A new perspective
We’re going to write the same program using parsec, a monadic parser combinator library. I’ve recently bumped into this excellent walkthrough to parser combinators, which I thouroughly recommend reading.
In short a parser is a type like this one:
type Parser a = Parser {
runParser :: String -> (String, Either ParseError a)
}
It works by consuming input characters from the input string and returning a tuple with two values:
- The first value is what’s left of the input string, so that other parsers can keep parsing the rest of the input.
- The second value contains either a parse error or a properly parsed value of type
a
.
We’ll need to install parsec
and add some imports:
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)
import Control.Monad (guard)
Now let’s implement parsers for some fields:
-- byr (Birth Year) - four digits; between 1920 and 2002.
byrParser :: Parser Int
byrParser = do
P.string "byr"
P.char ':'
value <- P.count 4 P.digit
P.spaces
let int = read value
guard (int >= 1920 && int <= 2002)
return int
-- iyr (Issue Year) - four digits; between 2010 and 2020.
iyrParser :: Parser Int
iyrParser = do
P.string "iyr"
P.char ':'
value <- P.count 4 P.digit
P.spaces
let int = read value
guard (int >= 2010 && int <= 2020)
return int
-- eyr (Expiration Year) - four digits; between 2020 and 2030.
eyrParser :: Parser Int
eyrParser = do
P.string "eyr"
P.char ':'
value <- P.count 4 P.digit
P.spaces
let int = read value
guard (int >= 2020 && int <= 2030)
return int
Here we use the guard
function to introduce an assertion that will make
the parser fail when the condition is not met. In general I feel that this
code is quite readable, but we might want to extract a reusable helper to
parse years:
yearParser :: String -> (Int, Int) -> Parser Int
yearParser value (rangeStart, rangeEnd) = do
P.string value
P.char ':'
value <- P.count 4 P.digit
P.spaces
let int = read value
guard (int >= rangeStart && int <= rangeEnd)
return int
byrParser :: Parser Int
byrParser = do
yearParser "byr" (1920, 2002)
iyrParser :: Parser Int
iyrParser = do
yearParser "iyr" (2010, 2020)
eyrParser :: Parser Int
eyrParser = do
yearParser "eyr" (2020, 2030)
🎉
This is the beauty of writing parser combinators. They are extremely easy to reuse and combine.
Now we want to write a parser for the height field. It might be nice to use a more specialized data type to represent that:
data Height
= InCms Int
| InInches Int
deriving (Eq, Show)
Here we go!
-- hgt (Height) - a number followed by either cm or in:
-- If cm, the number must be between 150 and 193.
-- If in, the number must be between 59 and 76.
heightParser :: Parser Height
heightParser = do
P.string "hgt"
P.char ':'
digits <- P.many1 P.digit
let value = read digits
result <- unitParser value
case result of
InCms _ ->
guard (value >= 150 && value <= 193)
InInches _ ->
guard (value >= 59 && value <= 76)
P.spaces
return result
What’s that unitParser
you ask?
unitParser :: Int -> Parser Height
unitParser value =
let cmParser = do
P.string "cm"
return (InCms value)
inParser = do
P.string "in"
return (InInches value)
in P.choice [cmParser, inParser]
The rest of the parsers are basically a step by step translation of the requirements in English:
-- hcl (Hair Color) - a '#' followed by six chars 0-9 or a-f.
hairColorParser :: Parser String
hairColorParser = do
P.string "hcl"
P.char ':'
P.char '#'
v <- P.count 6 (P.oneOf "0123456789abcdef")
P.spaces
return v
-- pid (Passport ID) - a nine-digit number.
passportIdParser :: Parser String
passportIdParser = do
P.string "pid"
P.char ':'
v <- P.count 9 P.digit
P.spaces
return v
-- cid (Country ID) - ignored, missing or not.
countryIdParser :: Parser Int
countryIdParser = do
P.string "cid"
P.char ':'
value <- P.many1 P.digit
P.spaces
return $ read value
-- ecl (Eye Color) - one of: amb blu brn gry grn hzl oth.
eyeColorParser :: Parser String
eyeColorParser = do
P.string "ecl"
P.char ':'
v <-
P.choice $
map
(P.try . P.string)
[ "amb",
"blu",
"brn",
"gry",
"grn",
"hzl",
"oth"
]
P.spaces
return v
You will notice we had to use this mysterious P.try
function in the last
snippet. This is very useful when we need to look ahead in the input
string. Consider the example of blu
and brn
: after consuming an initial
b
character we land in the blu
branch. If at that point we encounter a
r
character, we realize we need to go back and choose the brn
branch
instead. But by default the parsing would stop because we have already
consumed the first character. P.try
will make it so our parser pretends
it hasn’t consumed any input so that we can keep trying other
alternatives.
We have now written parsers for each individual field. So now it’s time to combine them all together…
Another problem?!
Remember that the fields can be written in any order? Uh oh.
Since our parser tries to consume input one character at a time, how the heck can we write one that has to deal with randomly ordered input?
It is impossible, right?
No, it’s possible!
The parsec
library includes a wonderful Text.Parsec.Perm
module:
This module implements permutation parsers. A permutation phrase is a sequence of elements (possibly of different types) in which each element occurs exactly once and the order is irrelevant. Some of the permutable elements may be optional.
Let’s import it:
import Text.Parsec.Perm (permute, (<$$>), (<|?>), (<||>))
Woah, calm down, that’s lots of operators there mate.
permute
is the last call that will wrap everything up and return a parser of something.<$$>
is used to assign all the fields that we parsed to something. In our case it will be aPassport
.<||>
is used to describe a required field<|?>
is used to describe an optional field
Ready for the big reveal? 🥁🥁🥁
passportParser :: Parser Passport
passportParser =
permute $
Passport <$$> byrParser
<||> iyrParser
<||> P.try eyrParser
<||> P.try heightParser
<||> P.try hairColorParser
<||> P.try eyeColorParser
<||> passportIdParser
<|?> (Nothing, Just <$> countryIdParser)
This parser can parse passports which fields are written in any random order, as long as each required field is present once. Pretty slick, eh?
We just need to wire it up in our main:
import Data.Either (rights)
main :: IO ()
main = do
contents <- readFile "input.txt"
let passports =
rights $
map
(P.parse passportParser "")
(S.splitOn "\n\n" contents)
print $ length passports
That’s all we need! We can now get rid of the intermediate PassportField
and PassportEntry
types and all that validation code. Welcome to the
wonderful world of parsing!
If you glance over the final listing you’d be surprised to see how tidy it is. We have a description of the input that we’d like to parse and nothing more. No validations, no transformations, no other massaging of the types. That’s the beauty of parsing and expressing real world problems in terms of parsing.
I hope you enjoyed this deep dive into parsing, thanks for reading!