Tuesday, April 17, 2012

The Curious Case of the Chessbase MegaDatabase

The German company ChessBase maintains a database of chess games. Once a year, they publish their up-to-date database on DVD; it comes in two versions.

The Big Database contains about 5.1 million chess games and costs about 60 Euros (incl. VAT). The Mega Database contains the same games, but about 66 thousand of them are annotated. There are also other bonuses like player profiles and online updates. These extras come at quite a cost: the Mega Database costs about 2.5 times as much as the Big Database.

Whichever version you choose, the database comes in a proprietary ChessBase format. However, you can use either ChessBase Light (as included on the DVD), the full version of ChessBase, or a copy of the Fritz chess program to export the games to PGN format, which is the lingua franca of chess-playing computer programs.

Exporting the complete set of games results in a PGN file of a hefty 4.3 gigabytes. In fact, it is necessary to export the data in parts and concatenate them afterwards, as ChessBase cannot export files that are larger than 2 gigabytes, even on an NTFS filesystem that easily handles much bigger files.
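Joining the exported parts is simple enough. Here is a minimal sketch in Python, assuming the parts were exported as separate files (the function name and the file names are hypothetical) and that consecutive games should be separated by a blank line:

```python
def concatenate_pgn(parts, target, chunk_size=1 << 20):
    """Append each exported part to `target`, in the order given.

    Files are copied in binary mode, so whatever text encoding the
    exporter used is left untouched. Each non-empty part is followed by
    a blank line, so the last game of one part cannot run into the tag
    section of the next.
    """
    with open(target, "wb") as out:
        for part in parts:
            last = b""
            with open(part, "rb") as src:
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    last = chunk[-1:]      # remember the final byte
            if last:
                if last != b"\n":
                    out.write(b"\n")       # terminate the last line
                out.write(b"\n")           # blank line between parts


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    concatenate_pgn(["mega-part1.pgn", "mega-part2.pgn"], "mega-all.pgn")
```

Copying in fixed-size chunks keeps memory use flat, which matters when the result is more than 4 gigabytes.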

Analysing a database of over five million chess games is a tantalising prospect, especially for someone like me whose daytime job often involves analysis of large volumes of data. As a first step, I set out to ingest all these games using my home-grown chess library. This turned out to be more work than I had hoped.

For starters, it is not quite trivial to properly parse fully correct PGN; I will write about that at a later time. As it turns out, however, the PGN that is exported by ChessBase tools is not completely standards-compliant, which makes the task even more difficult. Some of the problems I encountered:
  • ChessBase does not properly escape quoted strings.
  • In proper PGN, curly brackets ('{' and '}') are used to delimit comments. In ChessBase-exported PGN, the curly closing bracket sometimes occurs within comments, which deeply confuses the parser.
  • The ChessBase PGN exporter performs move disambiguation, but not according to the rules specified by the PGN standard. Move disambiguation is the process where one specifies the originating rank or file of a piece in addition to its destination, because more than one instance of the piece type can move to the specified target square.
  • Some ChessBase games contain null moves, denoted as '--', where the player having the move 'passes'. This happens, for example, in certain old games where a strong player gives the advantage to a weaker player, who may make two moves at the start of the game, e.g. "1. e4 -- 2. d4". Null moves are also sometimes used to correct positions where actual illegal moves are made, e.g. a player that castles for the second time.
  • The MegaDatabase contains a handful of games of Chess960, formerly known as "Fischer Random Chess". This is troublesome because castling in these games works differently than in a regular chess game.
The first two points are simple errors, and I e-mailed ChessBase a detailed list of about ten games where manual editing of the game's notation is needed to correct the problem. (Unfortunately, I received no reply to that.) The third point is a ChessBase programming bug, but it is easy to work around.
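For reference, the disambiguation order prescribed by the PGN standard is: originating file first, then originating rank, and the full square only if neither suffices. It can be sketched as follows. The function name and its inputs are illustrative assumptions, not part of my chess library; in particular, the caller is assumed to have already worked out which other pieces of the same type can legally reach the destination square.

```python
def disambiguation(origin, rivals):
    """Return the text to insert after the piece letter in a SAN move.

    origin -- square of the moving piece, e.g. "g1"
    rivals -- squares of the other pieces of the same type and colour
              that can also legally move to the destination square
    """
    if not rivals:
        return ""                 # the move is unambiguous as it stands
    file_letter, rank_digit = origin[0], origin[1]
    # Step 1: the originating file is enough if no rival shares it.
    if all(r[0] != file_letter for r in rivals):
        return file_letter
    # Step 2: otherwise the originating rank, if no rival shares it.
    if all(r[1] != rank_digit for r in rivals):
        return rank_digit
    # Step 3: otherwise the full originating square is required.
    return origin


# Knights on g1 and d2 can both reach f3.
print("N" + disambiguation("g1", ["d2"]) + "f3")
# Rooks on a1 and a8 can both reach a4.
print("R" + disambiguation("a1", ["a8"]) + "a4")
```

A parser that accepts more disambiguation than strictly necessary (for example "Ng1f3" where "Ngf3" would do) sidesteps this kind of exporter bug.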

I actually like the fact that ChessBase supports null moves and games where the starting position is not the default chess starting position - the latter is, in fact, a supported feature of the PGN standard. These two features make it possible to include some interesting games of historic interest, by such illustrious chess giants as Philidor, Morphy, Steinitz, and Lasker.

I am not sure what to think about the inclusion of Chess960 games. I wonder if these games were included on purpose; they are, strictly speaking, not chess games, and there are only eight of them in the database.

Some statistics

The MegaDatabase 2012 contains a grand total of 5155359 items. 701 of these items are classified as "Text"; these are mostly short descriptions of certain tournaments. Excluding these 701 items, 5154658 items remain. These are precisely the 5154658 games that are written when exporting the entire database to PGN.

Eight of these are the aforementioned Chess960 games which my chess software cannot process, so I leave them out. This leaves 5154650 chess games that I can read and reproduce using my chess software.

However, to do that, I must take special precautions to handle the 1052 games that have a non-standard starting position, the 337 games that contain null moves, and the 30 games that actually have both a non-default start position and one or more null moves. I will omit these 1419 non-standard games from further discussion, leaving 5153231 regular chess-games to consider.
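Sorting games into these categories can be sketched in Python as follows. This is not the code from my chess library: it is a simplified illustration that assumes a non-default start position is signalled by a FEN tag (as the PGN standard prescribes), that null moves appear as '--' in the move text, and that brace comments are well-formed, which is not always true for ChessBase exports.

```python
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$', re.MULTILINE)
COMMENT_RE = re.compile(r"\{[^}]*\}")           # assumes well-formed comments
NULL_MOVE_RE = re.compile(r"(?<![-\w])--(?![-\w])")


def classify(pgn_game):
    """Return 'regular', 'custom start', 'null moves' or 'both'."""
    tags = dict(TAG_RE.findall(pgn_game))
    movetext = TAG_RE.sub("", pgn_game)          # drop the tag section
    movetext = COMMENT_RE.sub(" ", movetext)     # drop brace comments
    custom_start = "FEN" in tags
    null_moves = NULL_MOVE_RE.search(movetext) is not None
    if custom_start and null_moves:
        return "both"
    if custom_start:
        return "custom start"
    if null_moves:
        return "null moves"
    return "regular"


odds_game = '[Event "?"]\n[Result "*"]\n\n1. e4 -- 2. d4 *\n'
print(classify(odds_game))
```

Comments are stripped before searching for '--' so that a dash sequence inside an annotation is not mistaken for a null move.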

It is interesting to note that these 5153231 regular chess games comprise only 5066888 distinct games; 7665 of those distinct games occur more than once in the database. For example, the game that has no moves at all was "played" 58890 times, and the 20 possible games where only White performs a single move together account for 8540 database games. In most of these games, at least one of the players simply did not show up, and they were included in the database only to give a full tournament record.
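Counting duplicates boils down to tallying identical move sequences. A small Python sketch on toy data, assuming that two games count as "the same" when their move lists are identical (players, event and result are ignored here; that is an assumption of this sketch):

```python
from collections import Counter


def duplicate_summary(games):
    """Return (total games, distinct games, distinct games seen > once).

    games -- iterable of move lists, e.g. [["e4", "e5"], [], ...]
    """
    counts = Counter(tuple(moves) for moves in games)
    repeated = sum(1 for n in counts.values() if n > 1)
    return sum(counts.values()), len(counts), repeated


toy_games = [
    [],                 # forfeited game without moves
    [],                 # another one
    ["e4"],
    ["e4", "e5"],
    ["d4"],
    ["e4", "e5"],
]
print(duplicate_summary(toy_games))
```

In the toy data, six games collapse into four distinct ones, two of which (the empty game and "e4 e5") occur more than once.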

The most often occurring non-trivial game is this:

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3
O-O 9. h3 Bb7 10. d4 Re8 11. Ng5 Rf8 12. Nf3 Re8 13. Ng5 1/2-1/2


Apparently, the resulting position is so yawn-inciting that 190 games end then and there - in a draw.

I won't bore you with the most often-played game that leads to a mate; it is a 9-move variation of the Scholar's Mate theme that occurs 52 times. The most-often occurring game ending in stalemate is rather more interesting:

1. e3 a5 2. Qh5 Ra6 3. Qxa5 h5 4. h4 Rah6 5. Qxc7 f6 6. Qxd7+ Kf7 7. Qxb7 Qd3
8. Qxb8 Qh7 9. Qxc8 Kg6 10. Qe6 1/2-1/2


It ends in the earliest known possible stalemate position:

Black to move - but he cannot!

This peculiar game was first described by the American puzzle and chess composer Sam Loyd as the 'earliest possible stalemate'. The players of these games apparently agreed to play that very game before starting; the MegaDatabase has this game no fewer than 5 times.

Out of the 5153231 regular games, 221806 end in checkmate (4.3%), and a mere 3861 end in a stalemate; a surprisingly low number - at least to me.

Most games by far end in resignation (3405949, 66.1%) or a draw that is not a stalemate (1517378, 29.4%). Note that the latter category includes both agreed draws and draws due to e.g. insufficient material on the board (more formally, Article 9.6 of the Laws of Chess).

Of the 5153231 games, 2005196 are won by White (38.9%), and 1622519 are won by Black (31.5%); the result is a draw in 1521273 cases (29.5%). 4243 games have an indeterminate result. It is clear from these numbers that playing White is indeed a rather significant advantage!
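The shares by result are easy to recompute from the raw counts. A quick Python check, using only the counts quoted in this post (the dictionary labels are mine):

```python
# Game counts by reported result, as quoted in the text.
counts = {
    "White wins": 2005196,
    "Black wins": 1622519,
    "Draw": 1521273,
    "Indeterminate": 4243,
}

total = sum(counts.values())

# Percentage of all regular games, rounded to one decimal.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label:14s} {share:5.1f}%")
```

The four counts add up to exactly the number of regular games, which is a useful sanity check on the bookkeeping.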

Finally, we note that the average number of moves in a single game is about 37.6, where a single 'move' entails both White's and Black's turn. Adding up all the positions in all the regular chess games, we arrive at a grand total of 395,362,767 positions in the MegaDatabase! Note that these are not necessarily distinct; for example, the default chess starting position alone accounts for about 1.3% of that number, because it occurs in all games.
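The share of the starting position follows directly from the two totals, since every regular game contributes it exactly once. A quick check in Python, using only figures quoted in this post (the variable names are mine):

```python
regular_games = 5153231        # every one of these starts from the default position
total_positions = 395362767    # all positions in all regular games

# Fraction of all positions that is the default starting position.
start_share = regular_games / total_positions

# Average number of positions per game, counting the starting position.
positions_per_game = total_positions / regular_games

print(f"starting position share: {100 * start_share:.2f}%")
print(f"positions per game:      {positions_per_game:.1f}")
```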

In conclusion

If you made it this far: congratulations! This has been a rather long and boring essay, detailing some of the MegaDatabase statistics obtained while performing some basic analysis. I promise that the next post about the database will be more exciting, and will have graphs.

As in most big datasets, there are some issues regarding data quality. For example, the results as given with the games do not always correspond to the final board state when replaying the moves, which is suspicious! There are 33 games listed as "0-1" where the final position shows that White actually checkmated Black; and 27 games where it is the other way around. In a similar vein, there are 35 games listed as a draw that actually appear to be won for White (6) or Black (29) when looking at the final board positions.

These are probably just administrative or data-entry mistakes, and to have fewer than 100 such mistakes in a 5 million game database is perhaps to be expected. However, we can only detect these issues in games that actually ended in checkmate or stalemate, and there are only 225667 of those; this suggests that the error rate of the reported result could be as high as 0.05%. Still, that is quite good.
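The check itself, and the resulting error-rate estimate, can be sketched in Python as follows. The state names and the function are hypothetical stand-ins for what a chess library would report after replaying a game; only the counts are taken from this post.

```python
# Result implied by a decisive final board state (hypothetical state names).
EXPECTED = {
    "white_mates": "1-0",
    "black_mates": "0-1",
    "stalemate": "1/2-1/2",
}


def result_mismatch(reported, final_state):
    """True if the reported result contradicts the final position.

    Games that end any other way (resignation, agreed draw, ...) cannot
    be checked and are never flagged; neither are games whose reported
    result is the indeterminate "*".
    """
    expected = EXPECTED.get(final_state)
    if expected is None or reported == "*":
        return False
    return reported != expected


# Mismatches quoted above: mates listed as a win for the wrong side,
# plus mates listed as draws.
mismatches = 33 + 27 + 35
checkable = 221806 + 3861      # games ending in checkmate or stalemate
error_rate = mismatches / checkable

print(result_mismatch("0-1", "white_mates"))
print(f"{100 * error_rate:.3f}%")
```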

Things like these remind us that the database is not perfect. However, I feel that the ChessBase database is a tremendous asset for learning about chess, both by examining the many interesting games it contains and by looking at chess 'in the large', i.e., gathering statistics from a large number of human-vs-human games. In some of the upcoming blog posts, I will do just that.

7 comments:

  1. I think you made a small typo when talking about games that occur more than once. It can not be correct to have '7665 games that occur more than once' when this is due to 'games that have no moves at all (58890)'

  2. I see why this is confusing. There are actually 7665 unique chess games that have multiple occurrences in the database; one of them is the "empty" chess game that occurs 58890 times.

    I'll try to reword, thanks for pointing it out.

  3. Hi Sidney!
    Very good article. I wonder if you have any .si4 database(the scid database format) ready with the optimization you made.

  4. Your "idea" to convert the Chessbase proprietary format .cbv into .pgn is the equivalent of trying to turn a Mercedes into a Volkswagen, and paying money to do it. It doesn't make a shred of sense at all. You're going on and on about all the problems and issues involved in performing the conversion, yet you never mention a possible reason why anyone would want to do it. .cbv indexes faster, takes up less space, and has many other advantages over .pgn.

    Replies
    1. Hi Anonymous,

      In contrast to Chessbase's proprietary format, PGN is an open standard. This is important, since I wanted to ingest the data into my own chess engine.

      I noted the problems to demonstrate that ChessBase's built-in export-to-PGN functionality is (a bit) buggy.

      > yet you never mention a possible reason why anyone would want to do it.

      Hmmm. To write a blog post about the data contained in the database, perhaps?

    2. I would want to convert .CBH to .PGN so that I don't have to use/buy ChessBase and use a different program (or to calculate interesting statistics like the author).

      I never really liked CB, way too buggy, and they still can't figure out how to design user interfaces. There are places where CB doesn't calculate statistics properly. Also, the search feature of CB leaves much to be desired.

    3. You may be interested in the free site I've recently found: http://www.chessbites.com - it is a searchable database site and many other goodies for all levels of players.
