Pygments is one of the most popular code highlighters available. It has support for over 200 file formats.

Pygments has good documentation for much of the process, so read through their site before you read my notes. I wrote this while developing a simple mscgen lexer, which was accepted upstream. From here on, I assume you have already read the Pygments documentation listed below.

Create a test lexer

First, use the simple diff lexer from the documentation. Add a __main__ check so you can run it directly without worrying about setuptools or editing their code. You just want to see how to test a lexer after you develop yours.

Use the following as an example of a main function. This allows you to run it on all of the test cases that you have.

# Based on pygments documentation
from __future__ import print_function
from pygments.lexer import RegexLexer
from pygments.token import *

class MyDiffLexer(RegexLexer):
    name = 'Diff'
    aliases = ['diff']
    filenames = ['*.diff']

    tokens = {
        'root': [
            (r' .*\n', Text),
            (r'\+.*\n', Generic.Inserted),
            (r'-.*\n', Generic.Deleted),
            (r'@.*\n', Generic.Subheading),
            (r'Index.*\n', Generic.Heading),
            (r'=.*\n', Generic.Heading),
            (r'.*\n', Text),
        ]
    }

if __name__ == '__main__':
    from pygments import highlight
    from pygments.formatters import HtmlFormatter, Terminal256Formatter
    from pygments.formatters import RawTokenFormatter
    from sys import argv

    if len(argv) > 1:
        import io

        for arg in argv[1:]:
            # io.open() in text mode already returns decoded text, so the
            # lexer needs no encoding option. Leaving the formatter encoding
            # unset makes highlight() return text instead of bytes, which
            # prints cleanly on both Python 2 and 3.
            with io.open(arg, 'r') as infile:
                code = infile.read()
            print("Highlighting " + arg)
            #print(highlight(code, MyDiffLexer(), HtmlFormatter()))
            print(highlight(code, MyDiffLexer(), Terminal256Formatter()))

    else:
        code = """--- 1	2012-06-10 03:53:11.018755408 -0400
+++ 2	2012-06-10 03:53:15.906701085 -0400
@@ -2,6 +2,6 @@
 
 here too
 
-my my
+ny my
 
 wow
"""
        # print(highlight(code, MyDiffLexer(), HtmlFormatter()))
        print(highlight(code, MyDiffLexer(), Terminal256Formatter()))

If something doesn't look right, use the RawTokenFormatter so you can see what's happening. Assuming you saved the above as diff.py, you can try out the lexer:

$ python diff.py
$ python diff.py *.diff
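Here is a minimal sketch of dumping raw tokens, which is what I mean by using the RawTokenFormatter to see what's happening. The helper name raw_tokens and the SAMPLE text are my own for illustration, and the built-in DiffLexer is only a stand-in; swap in your own lexer class. RawTokenFormatter produces bytes, so the sketch decodes them before printing.

```python
from pygments import highlight
from pygments.formatters import RawTokenFormatter
from pygments.lexers import DiffLexer  # stand-in; use your own lexer class here

# Hypothetical sample input, made up for this sketch.
SAMPLE = "@@ -1,2 +1,2 @@\n-my my\n+ny my\n wow\n"


def raw_tokens(code, lexer):
    """Return one 'TokenType<TAB>repr(text)' line per token."""
    raw = highlight(code, lexer, RawTokenFormatter())
    # RawTokenFormatter emits bytes; decode so the lines print cleanly.
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', 'replace')
    return raw.splitlines()


if __name__ == '__main__':
    for line in raw_tokens(SAMPLE, DiffLexer()):
        print(line)
```

Redirect the output to a file before and after you change a rule, then diff the two files to see exactly which tokens changed.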

Pygments lexer tips

When you are ready to start developing your own lexer, here are some tips:

  • Pygments has support for over 200 formats. Make sure you aren't duplicating what is already available! There are three places to check:
    • Check the latest source in mercurial for pygments/lexers/_mapping.py
    • Check the pull requests on bitbucket
    • Check the mailing list to see if anyone mentioned it
  • I recommend using these formatters while you are developing the lexer:
    • Terminal256Formatter is easy to view in xterm while you are developing.
      • This quickly shows you what is being recognized and whether it looks okay. Don't worry too much about the actual colors. You can always write or update a style class to handle them.
    • RawTokenFormatter makes it easy to diff your lexer's output while you are developing it. The other formatters add a lot of markup that you don't care about when you are trying to find the differences.
    • HtmlFormatter is good once you want to double-check that the colors look fine.
  • The lexer rules are scanned in order. If you have a more specific regex, it should go near the top of the state, above the more general rules. If your output is getting highlighted but with the wrong names, this is probably the reason.
    • This can also lead to situations where part of what you meant to highlight is not highlighted. This means a more general rule is above your specific rule (or you didn't write it properly).
  • Make sure you add \b after a regex with a keyword when nothing else follows it. This will prevent you from accidentally matching boxing when your word is box.
  • If your lexer appears to be in an infinite loop, you probably have a stray | or .* somewhere. Both can match the empty string, so the lexer never advances.
  • While you are developing, add a (r'.+', Error) at the bottom of your lexer so anything you don't catch will be flagged as an error.
    • Be sure to remove this before you submit it because upstream wants the lexer to fail gracefully.
    • Combined with the argv reader above, you can check many test cases and then grep for err.
  • Make sure you follow the guidelines of pygments.
    • They do not want you to try to hack a parser with the lexer. While there is an error token, you should avoid it in general. They don't want you to try to validate the file.
    • As per above, try to avoid using the Error token unless you need it.
    • Use bygroups, scopes or code beyond extending one of the existing lexers only when you have to.
    • You only need to specify lexical tokens and their format names.
    • If you need to add bygroups or scopes to be able to assign better names, go for it. But don't try to add validation to the lexer.
  • Try to use one of the standard formatting names. You can use anything from the token.py file (On Debian: /usr/share/pyshared/pygments/token.py)
    • Look for STANDARD_TYPES in that file.
    • It is possible to create your own tokens, but then you have to either edit an existing style class or write your own.
  • Member variables with a leading underscore are useful to avoid repetition in rules.
  • Adding separate states can be useful to avoid repetition. For instance, you can create a comments state that is reused.
  • Pygments has example input files but they are not included in the Debian package. They are in the mercurial tree. If you have suitably licensed examples, you should add those when you commit. Based on the existing files in mercurial, I would upload one complex file instead of a number of smaller inputs.
  • Test your lexer with both python 2.x and 3.x to make sure you didn't introduce any incompatibilities.
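
Several of the tips above can be sketched in one toy lexer. Everything here is hypothetical: the "box" language, the BoxLexer class, the _ident pattern and the token_list helper are made up for illustration and are not part of Pygments. It shows a specific keyword rule placed above the general identifier rule, \b after the keyword, a shared comments state pulled in with include, bygroups for finer names, and the development-only Error catch-all.

```python
from pygments.lexer import RegexLexer, bygroups, include
from pygments.token import Comment, Error, Keyword, Name, Operator, Text


class BoxLexer(RegexLexer):
    """Hypothetical toy lexer; the 'box' language is invented for this sketch."""
    name = 'BoxDemo'
    aliases = ['boxdemo']
    filenames = []

    # Leading-underscore member reused in several rules to avoid repetition.
    _ident = r'[A-Za-z_]\w*'

    tokens = {
        # Separate state that other states can reuse via include().
        'comments': [
            (r'#.*?$', Comment.Single),
        ],
        'root': [
            include('comments'),
            (r'\s+', Text),
            # Specific rule first; \b keeps 'box' from matching inside 'boxing'.
            (r'box\b', Keyword),
            # bygroups assigns one token type per regex group.
            (r'(' + _ident + r')(\s*)(=)',
             bygroups(Name.Variable, Text, Operator)),
            (_ident, Name),
            # Development-only catch-all; remove before submitting upstream.
            (r'.+', Error),
        ],
    }


def token_list(code):
    """Return the (token type, text) pairs BoxLexer produces for code."""
    return list(BoxLexer().get_tokens(code))


if __name__ == '__main__':
    for ttype, value in token_list("box boxing  # note\nwidth = ?\n"):
        print(ttype, repr(value))
```

If you move the identifier rule above the keyword rule, box comes out as Name instead of Keyword, which is the wrong-name symptom described above.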

Prepare for upstream

Once you are confident in your lexer, it's time to start working it into pygments so upstream can consider including it in a new release.

Check out the latest code from mercurial. Make sure your changes are clean diffs and don't introduce unnecessary changes.

After installing mercurial and setting up your config, run this command:

$ hg clone http://bitbucket.org/birkenfeld/pygments-main pygments

There are a couple of files that you need to edit. The easiest way to find out what to do is to search for a lexer that isn't extended by others. In my case, I searched for DelphiLexer and it currently has 2 meaningful hits:

  • pygments/lexers/_mapping.py - This is an autogenerated file. Do not edit it manually.
  • pygments/lexers/<file>.py - Add the lexer to the __all__ list and place your class definition in this file.

There are two additional locations where you should make changes:

  • tests/examplefiles/ to add an example file. Based on the current examples, favor larger complex files rather than a large number of small files.
  • AUTHORS with your name

To add the example file in hg, run hg add <your file>

You'll have to read through the lexer files to find out which file is appropriate for your lexer. Upstream combines similar lexers into a single file so you shouldn't add a new file unless you have to.

Once you have edited the appropriate pygments/lexers/<file>.py file to add your lexer, you should regenerate the _mapping.py file:

$ cd pygments/lexers
$ python _mapping.py

Run hg status to make sure you only changed two files plus added your test file. Then run hg diff to see what you changed. It's okay if _mapping.py has changes other than your lexer. It probably means that either the file definition changed or someone updated it by mistake.

In my case, it looks like this:

user$ hg status
M pygments/lexers/_mapping.py
M pygments/lexers/other.py
A tests/examplefiles/testinput2.msc

Now you can test to see if pygments recognizes your new lexer. Be sure to use the ./ prefix to avoid using your system installed version (if any).

$ ./pygmentize -L lexers | grep <your lexer name>

To test it, run it against the input file that you added. In my case:

$ ./pygmentize -f terminal256 -l mscgen tests/examplefiles/testinput2.msc
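
The same check can be sketched from Python with the lookup functions Pygments provides. The lookup helper is my own name for illustration, and the built-in diff lexer stands in for yours; substitute your own alias and an example file name. Python imports whichever pygments comes first on its path, so run this from the top of your checkout if you want to test the working tree rather than a system-installed version.

```python
from pygments.lexers import get_lexer_by_name, get_lexer_for_filename
from pygments.util import ClassNotFound


def lookup(alias=None, filename=None):
    """Return the name of the lexer Pygments resolves, or None if unknown."""
    try:
        if alias is not None:
            return get_lexer_by_name(alias).name
        # Only the file name pattern is consulted; the file is not opened.
        return get_lexer_for_filename(filename).name
    except ClassNotFound:
        return None


if __name__ == '__main__':
    # 'diff' and 'changes.diff' are stand-ins for your own alias and file.
    print(lookup(alias='diff'))
    print(lookup(filename='changes.diff'))
```

A None result for your alias usually means _mapping.py was not regenerated or the class is missing from __all__.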

Once you are satisfied that it works, it's time to fork the project so you can issue a pull request to the pygments team. You may want to browse a few pull requests that were approved with new lexers.

The main project is located at Bitbucket. Sign up for a Bitbucket account if you don't already have one. Once you log in, click on the 'fork' button to create a fork of the project.

Now you are ready to commit your changes to your own repository. Ideally, you should create a branch for your lexer. Once it is available on Bitbucket, you can issue a pull request.

To get set up with a Bitbucket account and get familiar with mercurial, you can use Atlassian's tutorial.

After working with GnuCash for a while, I decided that I wanted something different. I don't want a GUI to get in my way and I want to keep the file in an easy-to-view format. After looking through many different open source accounting packages, I decided on ledger.

LWN has an article about their shift away from QuickBooks for various reasons. One of the comments in that article mentioned that the Software Freedom Law Center uses ledger or hledger for its accounting. Here's the post describing SFLC's usage of ledger.

I was looking to track personal finances with a command line tool and ledger fits the bill. It's a double-entry accounting package that uses a plain text file which I track in git. It allows me to easily use the accrual method with minimal fuss.

I currently use Wiegley's C++ version, ledger. I would like to use the Haskell-inspired version, hledger, but it is missing some of the features from the C++ version such as budgeting and forecasting.

For years I used a secondhand Dell computer as my router/firewall. It's a big, loud beige box running a PII 450, and it's much louder than my other, more capable machines. I got sick of it and picked up a Soekris net5501 about six months ago. I'm relatively happy with the Soekris device. I could have purchased cheaper alternatives like ALIX, but I knew I wanted to run OpenBSD and the Soekris devices seem to be favored by the OpenBSD community.

I bought a net5501-60, which has a 433 MHz CPU, 256 MB of RAM and four 10/100 Ethernet ports. There's a newer model, the net6501, which has gigabit Ethernet but is a lot more expensive. One nice feature of the net6501 is the ability to boot off of a USB device. The installation on a net5501 is a bit of a pain since you have to use PXE booting over TFTP. Since you only need it for the first OpenBSD installation, it's not that bad. After that, you can use bsd.rd to either upgrade or reinstall instead of using PXE/TFTP. I upgraded from OpenBSD 4.9 to 5.0 this way.

The console is curiously slow since they picked a rate of 19200 baud. You can feel how slow it is when you use it. I didn't change this since I so seldom interact with the device through the serial port. After you install, you can use ssh to administer everything.

I have to connect to the net5501 from a different computer since it only has a serial port. My laptop doesn't have a serial port so I had to buy a USB to serial cable. There's a large distance between the laptop and firewall so I'm using a null modem extension cable. I have used this setup to connect from Debian (minicom) and FreeBSD (cu, tip, minicom) without any problems. As long as you use 19200, it should work.

The Soekris box is nice but overpriced. If I were to do it over again, I might buy something cheaper and spend some time getting OpenBSD to work on it. However, I have a lot of projects I'm working on and I didn't particularly want to pick up this one. I will be very excited once the Raspberry Pi devices come out. If they made a RPi model B with 3-4 Ethernet ports, I wouldn't have a need for this Soekris machine. It would be a fraction of the cost.

I'm using an 8 GB SanDisk CF card for storage. I was using an old 160 GB SATA laptop hard drive but that failed at some point. The net5501 gets really hot and it lacks active cooling, so I think that may have contributed to the problem. My laptop has a 500 GB hard drive so I didn't really care if the 160 GB drive failed. I have been using the CF card alone ever since.

I'm now running OpenBSD 5.0 on the device. I have been lucky with the 4.9 release. This is the first time there haven't been any errata for a release. I haven't had to rebuild the system to patch a security or reliability fix.