Just a Theory

By David E. Wheeler

Posts about Mark Jason Dominus

HOP Markdown Followup

Just to follow up on my post from last week, in which I was banging my head on parsing, I had an insight over the weekend that helped me to solve the problem. All I can say is, thank god I’ve written a lot of huge regular expressions in the past, as that experience really helped me to thinking more like a parser. I wanted to share that insight here.

The issue I had came down to an inability to parse both standalone emphasis characters — that is, _ or * with whitespace on both sides — and doubled-up emphasis characters, such as ***, ___, **_, or _**, among a few others. It seemed that I could handle one or the other issue, but not both.

The insight was that I could add more tokens to match doubled emphasis characters before single emphasis characters. Then I could just process them separately. So whereas I previously had

    [ EMLOP => qr/(?<=[^\s_])[_]{1,2}|(?<=[^\s*])[*]{1,2}/ ],
    [ EMROP => qr/[_]{1,2}(?=[^\s_])|[*]{1,2}(?=[^\s*])/ ],

I now have a regular expression to match doubled-up emphasis operators:

my $stem_re = qr{
        (?:[*]{2}|[_]{2})[*_]
    | [*_](?:[*]{2}|[_]{2})
}x;

And I use it like so:

    [ STEMLOP => qr/$stem_re(?=[^\s_*])/ ],
    [ STEMROP => qr/(?<=[^\s_*])$stem_re/ ],

    [ EMLOP => qr/[_]{1,2}(?=[^\s*_])|[*]{1,2}(?=[^\s*_])/ ],
    [ EMROP => qr/(?<=[^\s*_])[_]{1,2}|(?<=[^\s*_])[*]{1,2}/ ],

Yes, I’m now also matching “middle” emphasis operators — that is, those with non-whitespace on both sides. But now I’m able to get pretty much exactly what I need with a rule like this:

my $lstar = match EMLOP => '*';
my $rstar = match EMROP => '*';
my $lline = match EMLOP => '_';
my $rline = match EMROP => '_';
my $not_em;
my $Not_em = parser { $not_em->(@_) };

my $emphasis = T(
    alternate(
        concatenate( $lstar,  $Not_em, $rstar  ),
        concatenate( $lline,  $Not_em, $rline  ),
    ),
    sub { "<em>$_[1]</em>" }
);

Pretty much the same as before. But now I also have this rule, to deal with the combined strong and emphasis tokens:

my $lstem = match 'STEMLOP';
my $rstem = match 'STEMROP';
my $mstem = match 'STEMMOP';
my $not_stem;
my $Not_stem = parser { $not_stem->(@_) };

my $stem = T(
    concatenate(
        alternate($lstem, $mstem),
        $Not_stem,
        alternate($rstem, $mstem)
    ),
    sub {
        my @c = split //, shift;
        return $c[0] eq $c[1]
            ? "<strong><em>$_[1]</em></strong>"
            : "<em><strong>$_[1]</strong></em>";
    },
);

In truth, I ended up with something much more complicated than this, as it needed to make sure that the operators were balanced (e.g., you can’t do it ***like this___), but the overall idea is the same. The emphasis, strong, and strong/em parsers also handle mid-word markers, such as in un*frigging*believable, that I’m not showing here. But the overall approach is fundamentally the same: I was having a problem with two patterns getting in the way of my parser, so I simply wrote a separate parser to handle one of those patterns. Ultimately, all I had to do was break the problem down into smaller parts and solve each part individually. It works pretty well.

As a side note, at one point I had the a lexer that used this code ref:

my $stem_split = sub {
    my $l = shift;
    my @c = split //, shift;
    my $pos = substr($l, 4);
    return $c[0] eq $c[1]
        ? ( [ $l => "$c[0]$c[1]"], [ $l => $c[2]]         )
        : ( [ $l => $c[0] ],       [ $l => "$c[1]$c[2]" ] );
};

Those STEMOP rules then looked like so:

    [ EMMOP => qr/(?<=[^\s_*])$stem_re(?=[^\s_*])/, $stem_split ],
    [ EMROP => qr/(?<=[^\s_*])$stem_re/, $stem_split ],

The cool thing about this was that I didn’t need a separate strong/emphasis parser; these lexer rules were just returning the appropriate emphasis tokens, and then I was able to just let the emphasis and strong parsers do their things.

But I ran into issues when I realized that it the left and right versions were coming out the same, so for ***foo***, rather than getting <strong><em>foo</em></strong>, I was getting <strong><em>foo</strong></em>. Oops. I could solve this by using different splitters for left and right, but once I added the middle token, wherein there are no whitespace characters on either side, it’s impossible to tell which token to return first. So I went back to the separate strong/emphasis operators.

Another path I started down was separate tokens for strong and emphasis markers. It looked something like this:

    [ STLOP => qr/[_]{2}(?=[_]?\S)|[*]{2}(?=[*]?\S)/ ],
    [ STROP => qr/(?<=\S)[_]{2}|(?<=\S)[*]{2}/ ],
    [ EMLOP => qr/[_*](?=\S)/ ],
    [ EMROP => qr/(?<=\S)[_*]/ ],

With this approach, I thought I could match the strong and emphasis operators separately in the lexer. But as I described on the HOP-discuss mail list, this runs up to limitations of the lexer. For example, for these rules, the string _*foo*__ yields the proper tokens:

[['STLOP','_'],['EMLOP','*'],'foo,['EMROP','*'],['STROP','_']];

But *__foo__* does not:

['*',['STLOP','_'],'foo,['STLOP','_'],'*'];

The problem is that the lexer splits the string up into tokens and leftovers after each rule is parsed. So after STLOP and STROP are parsed, the stream looks like this:

['*',['STLOP','_'],'foo,['STLOP','_'],'*'];

In other words, exactly the same. When EMLOP and EMROP are applied to ‘*’, there is no non-space character before or after it, so it’s left alone. But clearly this is an invalid lexing.

So I had to once again go back to the more complicated solution of a separate parser for combined span operators. For now. I’m hoping someone can come up with a solution to make this lex better, because I’m going to be adding more span operators, such as for code, deleted text, and added text, and it will very quickly get complicated parsing for those combinations of characters. It’d be much better if the lexer could see something like

+**_`test *markdown*`_**+

And correctly give me:

[
    [ ADDLOP => '+'                 ],
    [ STLOP  => '**'                ],
    [ EMLOP  => '_'                 ],
    [ CODE   => '`test *markdown*`' ],
    [ EMROP  => '_'                 ],
    [ STROP  => '**'                ],
    [ ADDROP => '+'                 ],
]

And then the parsing would be quite straight-forward. Because otherwise my parser is just going to get more complex and harder to maintain.

Looking for the comments? Try the old layout.

Issues Parsing Markdown with HOP::Parser

Since I had some ideas for features to add on to Markdown, and since I have been wanting to learn more about parsing, I picked up my copy of Higher-Order Perl with the aim of writing a proper parser for Markdown. I’ve made a decent start, with support for simple paragraphs, code spans, escapes, and a few other things. Then I took on emphasis spans and ran smack into the limits of the current implementation of HOP::Parser.

It started out simply enough. I added this tokens to my lexer:

        [ EMOP => qr/[_]{1,2}|[*]{1,2}/ ],

The Markdown syntax calls for emphasized text to be bracketed one star or underscore, and strong text to be bracketed by two stars or underscores. With this simple “emphasis operator” token, I was able to write an emphasis parser like this:

my $joiner  = sub { join '', @_ };
my $sstar   = absorb match EMOP => '*';
my $suscore = absorb match EMOP => '_';
my $not_em;
my $Not_em = parser { $not_em->(@_) };

my $emphasis = T(
    alternate(
        concatenate( $sstar,   $Not_em, $sstar   ),
        concatenate( $suscore, $Not_em, $suscore ),
    ),
    sub { "<em>$_[0]</em>" }
);

# omitted: definition of $strong;

$not_em = T(plus( T( alternate(
    $text, $code, $strong
), $joiner, ) ), $joiner);

(The plus() parser is in my fork of HOP::Parser on GitHub.) The parser for <strong> is similar. And it works reasonably well for simple examples such as:

  • *this*
  • _this_
  • un*frigging*believable
  • un_frigging_believable
  • *this* and *that*
  • *this* and _that_

It even works when strong and emphasis are mixed:

  • “***this***” yields <strong><em>this</em></strong>
  • “___this___” yields <strong><em>this</em></strong>
  • “*this **and** that*” yields <em>this <strong>and</strong> that</em>
  • “*this __and__ that*” yields <em>this <strong>and</strong> that</em>

But then came the need to support properly parsing non-emphasizing instances of the emphasis characters. For example, each of these should yield no emphasis:

  • * not em *
  • _ not em _

Instead, they should be parsed as literal stars and underscores. This is because opening emphasis operators must be followed by a non-space character, and closing ones must be preceded by a non-space character. So my first thought was to use lookahead and lookbehind in the parser to find left and right emphasis operators, like so:

        [ EMLOP => qr/(?<=[^\s_])[_]{1,2}|(?<=[^\s*])[*]{1,2}/ ],
        [ EMROP => qr/[_]{1,2}(?=[^\s_])|[*]{1,2}(?=[^\s*])/ ],

And then I changed the parser to this:

my $lstar  = absorb match EMLOP => '*';
my $rstar  = absorb match EMROP => '*';
my $lscore = absorb match EMLOP => '_';
my $rscore = absorb match EMROP => '_';

my $emphasis = T(
    alternate(
        concatenate( $lstar,  $Not_em, $rstar  ),
        concatenate( $lscore, $Not_em, $rscore ),
    ),
    sub { "<em>$_[0]</em>" }
);

Again, this works with the simple examples, but now I’m getting different issues. For example, whereas “__*word*__” should be lexed as

[
    ['EMLOP,   '__'  ],
    ['EMLOP',  '*'   ],
    ['STRING', 'this'],
    ['EMROP',  '*'   ],
    ['EMROP,   '__'  ],
]

But instead comes out as:

[
    ['STRING', '__'  ],
    ['EMROP',  '*'   ],
    ['STRING', 'this'],
    ['EMROP',  '*'   ],
    ['STRING', '__'  ],
]

Note that it’s not finding any left operators there! There are a number of examples where the lexed tokens are just inadequate, leading to parse failures.

The whole problem with identifying the left and right emphasis operators is where they are relative to whitespace or line boundaries. Even trickier, however, is the mid-word emphasis, such as in “un*frigging*believable,” where, to know whether an operator is left or right, you have to actually be tracking whether or not a left one has been found. For example, if you find a star, it’s a rightop if a previously-found leftop star was found; otherwise it’s a leftop. So the issue is state, which of course the Lexer cannot track (once you capture a token, it’s gone; you can’t do a lookbehind).

So I thought that the solution would be to start generating whitespace tokens. It’s likely I’d have to do this anyway, to deal with code blocks and lists, though I had hoped to avoid it even then, since it means a lot more tokens and a lot more work for the lexer. But I decided to give it a try. I changed the relevant bits of the lexer to:

        [ SPACE => qr/[\t ]+/ ],
        [ EMOP  => qr/[_]{1,2}|[*]{1,2}/ ],

(I’m ignoring newlines because they’re already handled elsewhere in the lexer.) This at least makes the lexing much simpler, and there are no unexpected tokens. With that, I went about trying to coerce the parser to properly deal with those tokens:

my $space     = match 'SPACE';
my $neg_space = neg_lookahead $space;
my $sstar     = absorb match EMOP => '*';
my $suscore   = absorb match EMOP => '_';
my $not_em;
my $Not_em = parser { $not_em->(@_) };

my $emphasis = T(
    alternate(
        concatenate( $sstar,   $neg_space, $Not_em, $sstar   ),
        concatenate( $suscore, $neg_space, $Not_em, $suscore ),
    ),
    sub { "<em>$_[0]</em>" }
);

That neg_lookahead() parser-builder was my attempt to implement a negative lookahead assertion. This is so that the left emphasis operator is only identified as such if it is not followed by a space. It looks like this:

sub neg_lookahead {
    my $p = ref $_[0] eq 'CODE' ? shift : lookfor @_;
    parser {
        my $input = shift or return;
        my @ret = eval { $p->($input) };
        return @ret ? () : (undef, $input);
    },
}

I also had to change my $text parser, which is included in $not_em, to recognize EMOP tokens and stringify them, since they can now sometimes be ignored by the emphasis parser. Still, this doesn’t quite get me where I want to be. For example, “*this*” is not interpreted as having any emphasis at all. The reason? because the $not_em parser now recognizes EMOP tokens, and it’s a greedy parser. So while the parser does fine the first star and the intervening text, it fails to find the second star, because the $not_em parser has already eaten it!

At this point, I’m pretty frustrated. I really want to get this right, but I’m at the limits of both what I understand about parsing and, I think, the current implementation of HOP::Parser (which has no way to turn off greediness, as far as I know). At first I thought that adding backtracking as described in section 8.8 of Chapter 8 might help, but I couldn’t figure out how to add it to HOP::Parser without fundamentally changing how the parser works. But then I realized that it wouldn’t be able to backtrack to find the last token eaten by the $text parser anyway, so the issue is moot.

What I really need is some help better understanding how to think about parsing stuff like this. It took me years to really understand regular expressions, so I don’t doubt that it will take me a long time to train my mind to think like a parser, but hints and suggestions would be greatly appreciated!

If you’re curious enough, the code, in progress, is here.

Looking for the comments? Try the old layout.