Monday, May 05, 2014

Recovering from ElasticSearch Recoveries

We recently hit a problem with ElasticSearch's snapshots where a shard (a directory in the snapshot) was failing because its metadata file and data files were missing.

This leads to a couple of criticisms of the snapshot directory format.  The main one is that it takes files with meaningful names and extensions, generally Lucene files, stores them under opaque names like "__1", and records the mapping from "__1" back to "_def.fdt" in the metadata file.  For example:

{
  "name" : "es-trk_allindices_2014-01-01_0000est",
  "index-version" : 78683,
  "files" : [ {
    "name" : "__0",
    "physical_name" : "_abc_0.pay",
    "length" : 2012,
    "checksum" : "13m617n",
    "part_size" : 104857600
  }, {
    "name" : "__1",
    "physical_name" : "_def.fdt",
    "length" : 97744833,
    "checksum" : "239wze",
    "part_size" : 104857600
  }
...
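
When the metadata file does survive, that mapping is at least easy to reverse mechanically.  Below is a minimal sketch, assuming the JSON above has been saved as snapshot-meta.json in the shard directory (the real file name depends on the snapshot name), and ignoring files big enough to be split into multiple parts via part_size:

import json
import os
import shutil

SHARD_DIR = "."  # placeholder: path to the snapshot shard directory

# The metadata file records the opaque-to-physical name mapping.
with open(os.path.join(SHARD_DIR, "snapshot-meta.json")) as f:
    meta = json.load(f)

for entry in meta["files"]:
    opaque = os.path.join(SHARD_DIR, entry["name"])             # e.g. "__1"
    physical = os.path.join(SHARD_DIR, entry["physical_name"])  # e.g. "_def.fdt"
    if os.path.exists(opaque):
        # Copy rather than rename so the snapshot itself is left intact.
        shutil.copyfile(opaque, physical)
        print("restored %s -> %s" % (entry["name"], entry["physical_name"]))
    else:
        print("missing %s (wanted %s)" % (entry["name"], entry["physical_name"]))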

The files for a segment aren't even listed together in the metadata file.  In Lucene, a segment is a group of files in a single directory sharing a prefix, say "_def", with extensions like fdt, fdx, tip, tim, del, nvm, and nvd.  Losing the metadata file therefore means losing not only the helpful filenames but also the groupings used by Lucene, which in a healthy index fall straight out of the filenames, as sketched below.
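For comparison, here's a small sketch that recovers those groupings from an intact Lucene directory just by bucketing file names on their segment prefix (the directory path is a placeholder):

import os
from collections import defaultdict

index_dir = "/path/to/shard/index"  # placeholder

segments = defaultdict(list)
for name in sorted(os.listdir(index_dir)):
    if name.startswith("_"):
        # "_def.fdt" -> segment "_def"
        segments[name.split(".")[0]].append(name)

for prefix, files in sorted(segments.items()):
    print(prefix, files)
# e.g. _def ['_def.fdt', '_def.fdx', '_def.tim', '_def.tip', ...]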

Luckily, the FDT files (Lucene's stored fields data, which ElasticSearch uses) contain just enough information - the unique identifier and the payload - to turn them into a CSV or other file that can be reimported into ElasticSearch.  If you have the same problem you will have to either force shard allocation, or create an empty shard in a new cluster, delete the failed shard, and copy the empty one over it.
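Forcing allocation goes through the cluster reroute API.  Here's a minimal sketch, assuming a 1.x cluster on localhost:9200 and placeholder index, shard, and node names; allow_primary makes ElasticSearch bring up an empty primary, so only use it when you plan to reimport the data anyway:

import json
import requests

# Placeholders: adjust to your cluster, index, shard number, and target node.
command = {
    "commands": [{
        "allocate": {
            "index": "my-index",
            "shard": 0,
            "node": "node-1",
            "allow_primary": True  # accepts starting from an empty shard
        }
    }]
}
resp = requests.post("http://localhost:9200/_cluster/reroute",
                     data=json.dumps(command))
print(resp.status_code, resp.text)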

The utility, es_fdr, which reads FDT files and outputs them one field per line, is available on the OtherLevels Github page.  I've also updated a related Lucene ticket.
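For the reimport itself, ElasticSearch's bulk API works well.  Here's a sketch, assuming the FDT dump has been massaged into a two-column CSV of document id and JSON source (hypothetical file, index, and type names):

import csv
import json
import requests

# Hypothetical layout: column one is the document id,
# column two is the original JSON source payload.
lines = []
with open("dump.csv") as f:
    for doc_id, source in csv.reader(f):
        lines.append(json.dumps(
            {"index": {"_index": "my-index", "_type": "doc", "_id": doc_id}}))
        lines.append(source)

# The bulk body is newline-delimited JSON and must end with a newline.
resp = requests.post("http://localhost:9200/_bulk",
                     data="\n".join(lines) + "\n")
print(resp.status_code)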
