Tuesday, April 17, 2012

The Curious Case of the Chessbase MegaDatabase

The German company ChessBase maintains a database of chess games. Once a year, they publish their up-to-date database on DVD; it comes in two versions.

The Big Database contains about 5.1 million chess games and costs about 60 Euros (incl. VAT). The Mega Database contains the same games, but now about 66 thousand of them are annotated. There are also other bonuses like player profiles and online updates. These extras come at quite a cost: the Mega Database costs about 2,5x as much as the Big Database.

Whichever version you choose, the database comes in a proprietary ChessBase format. However, you can use either ChessBase Light (as included on the DVD), the full version of ChessBase, or a copy of the Fritz chess program to export the games to PGN format, which is the Lingua Franca of chess-playing computer programs.

Exporting the complete set of games results in a PGN file of a hefty 4.3 gigabytes. In fact, it is necessary to export the data in parts and concatenate them afterwards, as ChessBase cannot export files are larger than 2 gigabytes, even on an NTFS filesystem that easily handles much bigger files.

Analysing a database of over five million chess games is a tantalising prospect, especially for someone like me whose daytime job often involves analysis of large volumes of data. As a first step, I set out to ingest all these games using my home-grown chess library. This turned out to be more work than I had hoped.

For starters, it is not quite trivial to properly parse fully correct PGN; I will write about that at a later time. As it turns out, however, the PGN that is exported by ChessBase tools is not completely standards-compliant, which makes the task even more difficult. Some of the problems I encountered:
  • ChessBase does not properly escape quoted strings.
  • In proper PGN, curly brackets ('{' and '}') are used to delimit comments. In ChessBase-exported PGN, the curly closing bracket sometimes occurs within comments, which deeply confuses the parser.
  • The ChessBase PGN exporter performs move disambiguation, but not according to the rules specified by the PGN standard. Move disambiguation is the process where one specifies the originating rank or file of a piece in addition to its destination, because more than one instance of the piece type can move to the specified target square.
  • Some ChessBase games contain null moves, denoted as '--', where the player having the move 'passes'. This happens, for example, in certain old games where a strong player gives the advantage to a weaker player, who may make two moves at the start of the game, e.g. "1. e4 -- 2. d4". Null moves are also sometimes used to correct positions where actual illegal moves are made, e.g. a player that castles for the second time.
  • The MegaDatabase contains a handful of games of Chess960, formerly known as "Fisher Random Chess". This is troublesome because castling in these games works differently than in a regular chess game.
The first two points are simple errors, and I e-mailed to ChessBase with a detailed list of about ten games where manual editing of the game's notation is needed to correct the problem. (Unfortunately, I received no reply to that.) The third point is a ChessBase programming bug, but it is easy to work around.

I actually like the fact that ChessBase supports null moves and games where the starting position is not the default chess starting position - the latter is, in fact, a supported feature of the PGN standard. These two features make it possible to include some interesting games of historic interest, by such illustrious chess giants as Philidor, Morphy, Steinitz, and Lasker.

I am not sure what to think about the inclusion of Chess960 games. I wonder if these games were included on purpose; they are, strictly speaking, not chess games, and there are only eight of them in the database.

Some statistics

The MegaDatabase 2012 contains a grand total of 5155359 items. 701 of these items are classified as "Text"; these are mostly short descriptions of certain tournaments. Excluding these 701 items, 5154658 items remain. These are precisely the 5154658 games that are written when exporting the entire database to PGN.

Eight of these are the aforementioned Chess960 games which my chess software cannot process, so I leave them out. This leaves 5154650 chess games that I can read and reproduce using my chess software.

However, to do that, I must take special precautions to handle the 1052 positions that have a non-standard starting position, the 337 games that contain null moves, and the 30 games that actually have both a non-default start position and one or more null moves. I will omit these 1419 non-standard games from further discussion, leaving 5153231 regular chess-games to consider.

It is interesting to note that out of these 5153231 regular chess games, only 5066888 are actually unique; there are 7665 unique chess-games that have been played more than one time. For example, the unique game that has no moves at all was "played" 58890 times, and the 20 possible games where only White performs a single move together account for 8540 database games. In most of these games, at least one of the players simply did not show up, and they were included in the database only to give a full tournament record.

The most often occurring non-trivial game is this:

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3
O-O 9. h3 Bb7 10. d4 Re8 11. Ng5 Rf8 12. Nf3 Re8 13. Ng5 1/2-1/2


Apparently, the resulting position is so yawn-inciting that 190 games end then and there - in a draw.

I won't bore you with the most often-played game that leads to a mate; it is a 9-move variation of the Scholar's Mate theme, that occurs 52 times. The most-often occurring game ending in stalemate is rather more interesting:

1. e3 a5 2. Qh5 Ra6 3. Qxa5 h5 4. h4 Rah6 5. Qxc7 f6 6. Qxd7+ Kf7 7. Qxb7 Qd3
8. Qxb8 Qh7 9. Qxc8 Kg6 10. Qe6 1/2-1/2


It ends in the earliest known possible stalemate position:

Black to move - but he cannot!

This peculiar game was first described by the American puzzle and chess composer Sam Loyd as the 'earliest possible stalemate'. The players of these games apparently agreed to play that very game before starting; the MegaDatabase has this game no less than 5 times.

Out of the 5153231 regular games, 221806 end in checkmate (4.3%), and a mere 3861 end in a stalemate; a surprisingly low number - at least to me.

Most games by far end in resignation (3405949, 66.1%) or a draw that is not a stalemate (1517378, 29.4%). Note that the latter category includes both agreed draws and draws due to e.g. insufficient material on the board (more formally, Article 9.6 of the Laws of Chess).

Of the 5153231 games, 2005196 are won by White (39.9%), and 1622519 are won by Black (31.5%); the result is a draw in 1521273 cases (29.5%). 4243 games have an indeterminate result. It is clear from these numbers that playing White is indeed a rather significant advantage!

Finally, we note that the average number of moves in a single game is about 37.6, where a single 'move' entails both White's and Black's turn. Adding up all the positions in all the regular chess games, we arrive at a grand total of 395,362,767 positions in the MegaDatabase! Note that these are not necessarily distinct; for example, the default chess starting position alone accounts for almost 3% of that number, because it occurs in all games.

In conclusion

If you made it this far: congratulations! This has been a rather long and boring essay, detailing some of the MegaDatabase statistics obtained while performing some basic analysis. I promise that the next post about the database will be more exciting, and will have graphs.

As in most big datasets, there are some issues regarding data quality. For example, the results as given with the games do not always correspond to the final board state when replaying the moves, which is suspicious! There are 33 games listed as "0-1" where the final position shows that White actually checkmated Black; and 27 games where it is the other way around. In a similar vein, there are 35 games listed as a draw that actually appear to be won for White (6) or Black (29) when looking at the final board positions.

These are probably just administrative or data-entry mistakes, and to have less that 100 such mistakes in a 5 million game database is perhaps to be expected. However, we can only detect these issues in games that actually ended in checkmate or stalemate, and there are only 225667 of those; this suggests that the error rate of the reported result could be as high as 0.05%. Still, that is quite good.

Things like these remind us that the database is not perfect. However, I feel that the ChessBase database is a tremendous asset for learning about chess - both by examining the many interesting games it contains, as well as by looking at chess 'in the large', i.e., gathering statistics from a large number of human-vs-human games. In some of the upcoming blog posts, I will do just that.

Monday, April 16, 2012

Chess is an infinite game

Chess is an infinite game. To be precise: there is no bound on the number of moves that a chess game can have.

Surprisingly, many chess players are not aware of that fact. This is mostly due to an incomplete understanding of the two rules in chess that are devised to prevent infinite games: the threefold repetition rule and the 50-move rule. They are covered by Articles 9.2 and 9.3 of the FIDE Laws of Chess.

The thing that many chess players do not realize is that both these rules specify the need for a claim by one of the players. If both players decide not to claim, the game goes on.

In cases where the 50-move rule is relevant, the chess position is often asymmetric; one player will be trying to mate while the other player will try to avoid mate, and meticulously keeps count of the moves by his opponent. This situation typically occurs in non-trivial pawnless endgames, such as king+bishop+knight vs. king, or king+queen vs. king+rook. These are classic endgames that are winnable in less than 50 moves; but still, many club-level players do not known how to pull it off.

In case of the threefold repetition rule, both players tend to realise full well that they are repeating positions when it happens, and by repeating they will usually indicate their willingness to accept a draw. However, I maintain that making the draw claim in such a case is the wrong thing to do, from a theoretical, fully rational viewpoint - at least, if time is no constraint. Consider, for example, the position below:


White and Black's Kings are both imprisoned; the only thing they can do is helplessly alternate between the a- and b-files.

Both players do, however, have a second option: they can advance their respective pawns on the right of the board. Of course, that would be very unwise: if either Black or White advanced their right-side pawn, it would immediately be captured by the opponent's pawn, and that pawn now has a free walk towards promotion and winning the game.

So, what to do? After moving around with the king for a few moves, should you just claim the draw? If you value your time, or if you are running out of time, then yes - you should. But if there is even the slightest chance that your opponent will be stupid enough to advance his pawn, don't! You can still claim a draw at any point in the future if both players make their optimal move, but you may actually win if your opponent blacks out and makes a fatal mistake.

Of course, this should be weighed against the risk that you will make a fatal mistake. This is not entirely out of the question; if you have alternated your king five thousand times between a1 and b1, and then your opponent deviously and suddenly advances his pawn, it will take quite a bit of vigilance to not touch the king out of habit.

One slightly more realistic scenario where this could happen is when two chess computers meet; they don't tire, after all. It would be quite epic to see two computers reaching a position that is completely repetitive, and where both may claim the draw - but then they don't, instead opting to play on at a million moves a second, until one of them has only milliseconds left on the clock.

Introducing the Chess Musings blog

This blog will discuss the game of chess. In particular, I will focus on the rules of chess and their sometimes strange consequences, computer chess, weird chess positions, and exploring the statistics that can be extracted from large chess databases, as well as the chess gems hidden therein.

A large inspiration for me is Tim Krabbé's Chess Curiosities site - I love his articles and I recommend them to any chess enthusiast. Mr. Krabbé is a much more able chess player than I am, but still I hope to bring to light some new things that he may have missed.

I am a software engineer by heart and by profession, and I wrote my first piece of chess software about 10 years ago. It started out as a humble move generator, i.e., a subroutine that enumerates all legal chess moves given a certain position. It was surprisingly hard to get this 100% right, and the process of tracking down bugs got me thinking about the chess rules, and the subtle ways you can go wrong while implementing them. Some of the lessons learned I will share with you on these pages.

On top of the move generator core I built some actual programs, such as a chess problem solver, and a program to count all possible games that can be played from a given position up to a certain depth. I also implemented a chess-playing program, but I never brought it to a level where it posed a serious challenge even for me. It would usually protect its queen against capture, and find mate-in-2 when the opportunity was there, but that was about it.

At some point I stopped playing around with computer chess and moved on to other things. But recently I had a bit of free time, so I decided to revive my little chess project.

I originally wrote my chess code in the programming language C, and although at the time I thought I was a pretty decent programmer, it turned out that there was quite a bit of room for improvement. Furthermore, I have since come to appreciate the limitations of C, so I decided to convert all of it to C++.

One of the things I originally did was to write a program to read databases of chess games given in Portable Game Notation (PGN). I re-implemented that program as well, and I set it loose on the well-known MegaDatabase that is published by German chess-software company ChessBase. This database contains over five million games of chess, most of which are games between very decent players. Analysing this large body of chess data yields some very surprising insights; more on that later.

So much for the introduction. I hope I piqued your interest, and I will present my first article with actual chess content later today.