Creating a pygments lexer

Pygments is one of the most popular code highlighters available. It has support for over 200 file formats.

Pygments has good documentation for much of the process, so read through their site before you read mine. I wrote these notes while developing a simple mscgen lexer, which was accepted upstream, and I assume you have already read the Pygments documentation listed below.

Writing your own lexer
What to include in the lexer
APIs
List of formatters. On Debian, it is in /usr/share/pyshared/pygments/formatters
Adding a lexer plugin (optional)
Example using a plugin (optional). A comment on the pygments site pointed to this email.
Available lexers. You are better off checking the hg repository for the file pygments/lexers/_mapping.py
Quickstart for running

Create a test lexer

First, use the simple diff lexer from the documentation. Add a __main__ check so you can run it directly without worrying about setuptools or editing their code. You just want to see how to test a lexer after you develop yours.

Use the following as an example of a main function. This allows you to run it on all of the test cases that you have.

# Based on pygments documentation
from __future__ import print_function
from pygments.lexer import RegexLexer
from pygments.token import *

class MyDiffLexer(RegexLexer):
    name = 'Diff'
    aliases = ['diff']
    filenames = ['*.diff']

    tokens = {
        'root': [
            (r' .*\n', Text),
            (r'\+.*\n', Generic.Inserted),
            (r'-.*\n', Generic.Deleted),
            (r'@.*\n', Generic.Subheading),
            (r'Index.*\n', Generic.Heading),
            (r'=.*\n', Generic.Heading),
            (r'.*\n', Text),
        ]
    }

if __name__ == '__main__':
    import io
    from sys import argv
    from pygments import highlight
    from pygments.formatters import HtmlFormatter, Terminal256Formatter
    # RawTokenFormatter is not used by default; it is handy for debugging.
    from pygments.formatters import RawTokenFormatter

    if len(argv) > 1:
        for arg in argv[1:]:
            # Read raw bytes so the lexer's encoding='chardet' option can
            # guess the encoding (this needs the chardet package installed).
            with io.open(arg, 'rb') as infile:
                code = infile.read()
            print("Highlighting " + arg)
            # No encoding is set on the formatter, so highlight() returns
            # text that print() can write directly.
            #print(highlight(code, MyDiffLexer(encoding='chardet'),
            #      HtmlFormatter()))
            print(highlight(code, MyDiffLexer(encoding='chardet'),
                  Terminal256Formatter()))

    else:
        code = """--- 1	2012-06-10 03:53:11.018755408 -0400
+++ 2	2012-06-10 03:53:15.906701085 -0400
@@ -2,6 +2,6 @@
 
 here too
 
-my my
+ny my
 
 wow
"""
        # print(highlight(code, MyDiffLexer(), HtmlFormatter()))
        print(highlight(code, MyDiffLexer(), Terminal256Formatter()))

If something doesn't look right, use the RawTokenFormatter so you can see what's happening. Assuming you saved the above as diff.py, you can try out the lexer:

$ python diff.py
$ python diff.py *.diff
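
As a sketch of the raw-token approach mentioned above, you can also skip the formatter entirely and look at the token stream yourself with the lexer's get_tokens method. TinyDiffLexer and dump_tokens are names I made up for illustration; they are not part of Pygments, and the lexer is a cut-down copy of the one above so that the snippet runs on its own.

```python
from pygments.lexer import RegexLexer
from pygments.token import Generic, Text


class TinyDiffLexer(RegexLexer):
    """Cut-down diff lexer, just enough to demonstrate token dumps."""
    name = 'TinyDiff'
    aliases = ['tinydiff']  # hypothetical alias, not registered anywhere

    tokens = {
        'root': [
            (r'\+.*\n', Generic.Inserted),
            (r'-.*\n', Generic.Deleted),
            (r'@.*\n', Generic.Subheading),
            (r'.*\n', Text),
        ]
    }


def dump_tokens(lexer, code):
    """Return one 'TokenType<TAB>repr(text)' line per token."""
    return ['%s\t%r' % (ttype, value)
            for ttype, value in lexer.get_tokens(code)]


if __name__ == '__main__':
    sample = "@@ -1,2 +1,2 @@\n-my my\n+ny my\n wow\n"
    for line in dump_tokens(TinyDiffLexer(), sample):
        print(line)
```

Saving this output before and after a change to your rules gives you two plain-text files that are easy to diff.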

Pygments lexer tips

When you are ready to start developing your own lexer, here are some tips:

Pygments has support for over 200 formats. Make sure you aren't duplicating what is already available! There are three places to check:
Check the latest source in mercurial for pygments/lexers/_mapping.py
Check the pull requests on bitbucket
Check the mailing list to see if anyone mentioned it
I recommend using these formatters while you are developing the lexer:
Terminal256Formatter is easy to view in xterm while you are developing.
This quickly shows you what is being recognized and whether it looks okay. Don't worry too much about the actual colors. You can always write or update a style class to handle them.
RawTokenFormatter makes it easy to diff your lexer's output while you are developing it. The other formatters add a lot of markup that you don't care about when you are trying to find the differences.
HtmlFormatter is good once you want to double-check that the colors look fine.
The lexer rules are scanned in order, and the first rule that matches wins. If you have a more specific regex, it should go near the top of the state. If your output is getting highlighted but with the wrong names, this is probably the reason.
This can also lead to situations where part of what you meant to highlight is not highlighted. This means a more general rule is above your specific rule (or you didn't write it properly).
Make sure you add \b after a regex with a keyword when nothing else follows it. This will prevent you from accidentally matching boxing when your word is box.
If your lexer appears to be in an infinite loop, you probably have a stray | or .* somewhere.
While you are developing, add a (r'.+', Error) at the bottom of your lexer so anything you don't catch will be flagged as an error.
Be sure to remove this before you submit it, because upstream wants the lexer to fail gracefully.
Combined with the argv reader above, you can check many test cases and then grep for err.
Make sure you follow the guidelines of pygments.
They do not want you to hack a parser into the lexer or to validate the file.
Although there is an Error token, avoid using it unless you really need it.
Use bygroups, scopes, or custom code beyond extending one of the existing lexers only when you have to.
You only need to specify lexical tokens and their format names.
If you need to add bygroups or scopes to be able to assign better names, go for it. But don't try to add validation to the lexer.
Try to use one of the standard formatting names. You can use anything from the token.py file (On Debian: /usr/share/pyshared/pygments/token.py)
Look for STANDARD_TYPES in that file.
It is possible to create your own tokens, but then you have to either edit or write your own style class.
Member variables with a leading underscore are useful to avoid repetition in rules.
Adding separate states can be useful to avoid repetition. For instance, you can create a comments state that is reused.
Pygments has example input files, but they are not included in the Debian package. They are in the mercurial tree. If you have suitably licensed examples, you should add those when you commit. Based on the existing files in mercurial, I would upload one complex file instead of a number of smaller inputs.
Test your lexer with both python 2.x and 3.x to make sure you didn't introduce any incompatibilities.
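
Several of the tips above (rule order, \b after keywords, and a temporary Error catch-all) can be seen together in a small sketch. The BoxLexer class, its box keyword, and the error_values helper are made up for illustration; they are not part of Pygments or of my mscgen lexer.

```python
from pygments.lexer import RegexLexer
from pygments.token import Error, Keyword, Name, Text


class BoxLexer(RegexLexer):
    """Hypothetical toy lexer illustrating rule order and word boundaries."""
    name = 'BoxDemo'
    aliases = ['boxdemo']  # illustrative alias only

    tokens = {
        'root': [
            (r'\s+', Text),
            # Specific rule first; \b stops "box" matching inside "boxing".
            (r'box\b', Keyword),
            # More general rule goes below the specific one.
            (r'[a-z]+', Name),
            # Development-only catch-all: remove before submitting upstream.
            (r'.+', Error),
        ]
    }


def error_values(code):
    """Return the text of every token the lexer flagged as Error."""
    return [value for ttype, value in BoxLexer().get_tokens(code)
            if ttype is Error]


if __name__ == '__main__':
    for ttype, value in BoxLexer().get_tokens('box boxing 42\n'):
        print(ttype, repr(value))
    print('unmatched:', error_values('box boxing 42\n'))
```

Calling a helper like error_values on each test file is one way to do the "grep for err" check without reading the colored output.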

Prepare for upstream

Once you are confident in your lexer, it's time to start working it into pygments so upstream can consider including it in a new release.

Check out the latest code from mercurial. Make sure your changes are clean diffs and don't introduce unnecessary changes.

After installing mercurial and setting up your config, run this command:

$ hg clone http://bitbucket.org/birkenfeld/pygments-main pygments

There are a couple of files that you need to edit. The easiest way to find out what to do is to search for a lexer that isn't extended by others. In my case, I searched for DelphiLexer and it currently has 2 meaningful hits:

pygments/lexers/_mapping.py - This is an autogenerated file. Do not edit it manually.
pygments/lexers/<file>.py - Add the lexer to the __all__ list and place your class definition in this file.

There are two additional locations where you should make changes:

tests/examplefiles/ to add an example file. Based on the current examples, favor one larger, complex file over a large number of small files.
AUTHORS with your name

To add the example file in hg, run hg add <your file>

You'll have to read through the lexer files to find out which file is appropriate for your lexer. Upstream combines similar lexers into a single file so you shouldn't add a new file unless you have to.

Once you have edited the appropriate pygments/lexers/<file>.py file to add your lexer, you should regenerate the _mapping.py file:

$ cd pygments/lexers
$ python _mapping.py

Run hg status to make sure you only changed two files plus added your test file. Then run hg diff to see what you changed. It's okay if _mapping.py has changes other than your lexer. It probably means that either the file definition changed or someone updated it by mistake.

In my case, it looks like this:

user$ hg status
M pygments/lexers/_mapping.py
M pygments/lexers/other.py
A tests/examplefiles/testinput2.msc

Now you can test to see if pygments recognizes your new lexer. Be sure to use the ./ prefix to avoid using your system installed version (if any).

$ ./pygmentize -L lexers | grep <your lexer name>

To test it, run it against the input file that you added. In my case:

$ ./pygmentize -f terminal256 -l mscgen tests/examplefiles/testinput2.msc
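
The same check can be scripted from Python, which is handy if you want to fold it into a test run. This is a rough sketch: is_registered is a helper name I made up, while get_lexer_by_name and ClassNotFound are the lookup function and exception that Pygments provides. Run it from the top of your checkout so that the local pygments package is imported rather than the system one.

```python
from pygments.lexers import get_lexer_by_name
from pygments.util import ClassNotFound


def is_registered(alias):
    """Return True if Pygments can find a lexer for the given alias."""
    try:
        get_lexer_by_name(alias)
    except ClassNotFound:
        return False
    return True


if __name__ == '__main__':
    # 'diff' ships with Pygments; replace it with your lexer's alias.
    for alias in ('diff', 'no-such-lexer'):
        print(alias, is_registered(alias))
```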

Once you are satisfied that it works, it's time to fork the project so you can issue a pull request to the pygments team. You may want to browse a few pull requests that were approved with new lexers.

The main project is located at Bitbucket. Go sign up for a Bitbucket account if you don't already have one. Once you log in, click on the 'fork' button to create a fork of the project.

Now you are ready to commit it back to your repository. Ideally, you should create a branch for your lexer. Once it is available on bitbucket, you can issue a pull request.

To get set up with a Bitbucket account and get familiar with mercurial, you can use Atlassian's tutorial.