How Elasticsearch Snapshots Work
Elasticsearch is a powerful and dynamic distributed data system, and such things can be hard to backup, especially as they scale into the terabytes and beyond. Fortunately, the folks at Elastic have a spiffy system that may be the best large-scale backup system in existence. It’s even simple to use.
Snapshots are the back-up method, and this blog covers how they work, especially how they do incremental backups in a novel-but-reliable way, making it easy to manage very large-scale data. We cover a bit of the tech details, and like always, there’s more in the code itself.
Backups are darned important, so make sure they work, work well, and work right. Trust AND verify.
There is also a good, though older and more technical file-level discussion of snapshots in this Elastic Blog.
Elasticsearch’s goals are, of course, to get reliable backups, and by extension, reliable recoveries, for data stores of any size, while they are rapidly ingesting data under load. Plus, make it easy to setup, configure, and manage, as any complicated backup system is prone to failure.
Another goal is supporting optimal RPOs and RTOs. The RPO, or Recovery Point Objective, is about how much data can be lost, i.e. up to what point can we restore data, and closer to NOW (the crash/loss point) is better.
The RTO, or Recovery Time Objective, is how long it takes to do the recovery, e.g. an hour, a day or whatever, and faster is better, especially when you are loading terabytes of important data from remote storage.
Typically, a good RPO requires frequent incremental backups, at least daily, often hourly, and sometimes more frequently than that. Fortunately, Elasticsearch supports this well, though as we’ll see, we need to prune the backups lest the snapshot count get out of hand and slow things down.
Good RTO is all about speed, and often bandwidth limitations over long-distance links, such as from AWS S3, or even over local NFS links. Running in parallel can greatly speed this up, but both physics and the dynamics of TCP/IP, IO subsystems, etc., are often bottlenecks. How Elasticsearch does this is a bit unclear, but presumably it’s mostly in parallel on a per-shard basis, about as fast as it can be.
Before we dive into how Elasticsearch does all this, we should review some important concepts and terminology. You probably know what a cluster is, i.e. each Elasticsearch system’s big unit, usually built from many nodes running instances of Elasticsearch. Snapshots are a cluster-wide process, managed at the cluster level, etc. but executed in parallel by the nodes. You can have only one active snapshot operation at a time per cluster.
Clusters, of course, have indexes, and each index is made of shards, usually both primary and replicas spread over the nodes in the cluster. Very importantly, each shard is made of segments, the lowest level storage object in Elasticsearch, and these are what the system is actually snapshotting and pushing to your backup repository. They are nearly invisible, but vitally important.
Segments are created during indexing or ingest of new data on a rolling basis, usually as an index ‘refreshes’ which is pretty often, usually every 1–60 seconds. Elasticsearch transparently merges these segments behind the scenes to keep things sane. Segments are immutable, i.e. they never change after they are written, which is critical to how Elasticsearch manages incremental snapshots, since once a segment is written to backup, it never has to be updated or written again. This is a GREAT thing, and key to how it all works.
Setting up Snapshots
Elasticsearch Snapshots are easy to setup, basically just adding a repository which is just a writable file or blob system (even on AWS S3, etc.) Each repository is a remote area used for MULTIPLE overlapping snapshots for a cluster. Thus, keeping track of all the moving parts is a critical part of the process, especially as the system grows into many TB of data and snapshots span months or years.
This setup is mostly done via the Register Process, which tells the cluster where and how to store things, plus provides access credentials (via the keys store). For shared file and some other repositories, there are also some settings in each nodes’ yml files, too, and all master and data nodes need to be able to write to the snapshot repository. Nodes test their repository access when you register, which is nice, as it makes sure it will work when it comes time to snapshot.
Very importantly, note the
40MB/sec per node default read/write rate - this may or may not be a lot for you, but be aware it's there in case you are on faster or slower networks (that's about 400Mbps of bandwidth, per node). You may need to adjust this, especially if you are in a datacenter with lots of nodes pushing to S3, for example; it's easy to eat up precious and expensive primary bandwidth.
Also, note several clusters can share a single repository, but only ONE can be allowed to write (all others must be set read-only) or very bad things can happen. It’s best you don’t do this except for restores, or perhaps for dev/test clusters that continually restore from production. Be careful.
Snapshots basically copy each indices’ shard’s segments to the remote storage repository, keeping track of which index, shard, and segment is part of which set of snapshots. Snapshots can include the whole cluster, i.e. all indexes and cluster metadata, or just some indexes.
Snapshots are instantaneous snaps of the state when the snapshot STARTS, so any data indexed after the snapshot start will NOT be included (get it next time). This is because the first step on each node is building the segment list, and by definition, any ‘new’ data after that point will be in new segments that aren’t included in the list.
This is even true for long-running snapshots that may take hours, in which case optimum RPO might mean starting a new snapshot almost immediately after a previous one completes, because the cluster may contain many hours of ‘new’ un-snapshotted data. Since a new snapshot can’t be started until the last one finishes, this may take some tuning or monitoring on your part; fortunately you can loop & monitor snapshot status.
When you start a snapshot, the Master node builds a list of primary shards for each index and the nodes they are on, so it’s clear which node is responsible for snapping which shards of which indices. Only primary shards are snapshotted, so if they are missing, the backup tracks that (if Initializing, the backup will wait for them to finish). Any primary shard that moves after the snapshot starts will fail in the backup and not be backed up; ideally you’ll get it next time. Replicas are ignored.
Once the master builds a list of indexes, primary shards, and nodes, each node figures out which index shard segments it has to write to the snapshot repository. Since segments are immutable, any of them already in the repo don’t need to be written again.
So the node builds a list of each shard’s segments on the node and also reads the list of this shard’s segments already in the snapshot repository. By comparing these two lists, it know which segments have to be written to the snapshot repository. That’s an elegant incremental backup process.
Each node then starts copying its segment files to the snapshot repository, updating the master with the status as it completes each shard. This is done by the SnapshotShardsService and each shard can be Success, Failed, or Missing.
Once all the shards on all the nodes have completed (or failed), the master will write the final metadata for the snapshot to the repository, including what indexes and segments are included in which snapshots. This is critical to correctly update and delete things later.
Note the nodes are writing to the snapshot repository simultaneously, as each node executes its snapshot related tasks and reports them to the master. Each node writes into directories for the shards they are responsible for, and only the master writes cluster-level snapshot metadata. This UUID-based path separation keeps everything organized.
Snapshots run on all nodes with primary shards for the indexes being snapshotted. Further, there seems to be some parallelism to use multiple threads, CPUs, and write processes, though each shard is processed by a single thread. The amount of parallelism is not clear, though each node has a scalable snapshot thread pool for this.
The docs indicate only a single snapshot can run in a cluster at any one time (though the code comments imply multiple snapshots can be in progress at once). So for long-running snapshots and deletes, you need to schedule and/or monitor them for completion and status. You should also make sure your snapshots actually complete successfully and investigate failures.
Running snapshots can be aborted, which ends the backup process, but still finalizes the metadata, and the already backed-up is NOT deleted. This protects the cluster’s data as best it can. You can also delete running snapshots which both aborts and deletes them.
Fortunately, there are sophisticated protections for master failover during and around snapshots so they don’t overwrite good backups with bad or corrupt data. This is part of the snapshot generational system which you’ll see if you look at (but do not touch) the actual data files in the repository.
You can change some index settings at restore time, which is nice, and useful if you are, say, restoring to a Dev/Test cluster and don’t want replicas. The master node simply uses these parameters when it recreates the restored index, though only some settings are eligible for this.
Remember you can add metadata to the snapshot, such as who took the snapshot, why, and any other useful info you may need later. Always useful.
Source Only Snapshots
Elasticsearch supports Source Only snapshots (actually source only repositories), which, as the name implies, only backup the source field (and _routing) of each document.
This can save a lot of space, at the expense of needing to reindex each index after you restore it. This can be useful, especially if you might also use it as an ETL method to push data into some data warehouse that will reindex or aggregate the data anyway.
It’s not clear how this source-only snapshot is done, in part because the relevant source code is almost completely undocumented. It’s an X-Pack feature and is actually an ‘intermediate’ virtual repository, which needs a real remote repository as its final storage place.
Presumably, the nodes write their segments to this virtual x-pack repository, which seems to query the underlying indexes and loop on their non-deleted docs, maybe to build a temp index and/or also accessing the real index segments, and then writes source only files or segments to the real repository. This would seem a more complex and slower method, but tests would tell.
As noted above, ALL Elasticsearch snapshots are incremental. Basically the cluster (actually each node) looks at what segments it has to snapshot vs. what segments are already in the repository, and writes the missing ones. Plus a bunch of references, states, and other metadata. That’s it.
It’s a very elegant solution that critically depends on immutable segments as the core writable object, and which are carefully tracked across many TB of data, hundreds of snapshots, and many, many thousands of files.
Deletes are run entirely by the master node, and works by first reading the repository to build a list of indices, shards, and segments in the to-be-deleted snapshot. The master then re-reads all the other snapshots’ metadata and importantly, the indexes and segments they contain. It compares the list from the to-be-deleted-snapshot to the used-by-other-snapshots list.
The ones not referenced by any remaining snapshot are marked for deletion, basically deleting segments no longer needed while keeping all those that are. It also deletes any indexes in the repo that are not referenced by the remaining snapshots. This elegant solution is very flexible and robust.
The challenge with this process is that it can be slow on remote repos like S3 and with lots of snapshots, as it has to read and build maps for every one of them to know what to delete — for thousands and indexes and hundreds or thousands of snapshots, this can take a long time (e.g. many hours).
The best way to avoid such long processing times is to continually purge both old ‘expired’ snapshots (such as older than 90 days) AND any unneeded intermediates, e.g. if you snap each hour, after a week, you can remove 23 of those in a day to cut down on the snapshot size. The goal is to keep the snapshot count under control.
Historically, you could only delete one snapshot at a time, and given it might take many minutes to hours each time; this made deleting dozens or hundreds very slow and tedious (and interferes with new snapshots since you can’t run a delete and create simultaneously).
Fortunately, in version 7.8.0, it became possible to delete multiple snapshots at once with wildcards and some performance improvements were made. You should still aggressively prune your snapshots, but at least you can remove whole blocks at once, such as a full day’s hourly snapshots.
Recoveries are straightforward, though less information is available on how they work (the core code is here and could use more comments). Basically, the master reads the snapshot metadata from the repository, builds a list of indexes, shards, and segments to restore, and makes that happen. The master also restores the cluster metadata if needed.
It’s not clear where the work is done, though the master does the coordination and certainly creates each restored index. However, it’s not clear if the master or routed data node for each shard does the loading — this is no obvious parallelism in the code, so it’s not clear if the shards load in parallel as the core index loop waits for each index and shards’ completion. Presumably it’s done via global state and the thread pools / queues on each node that will hold a restored primary shard.
Assuming that’s true, once the restored index is created by the master with its settings, and the shard routing determined so it knows what node will hold what shard, the shard restore is queued on the relevant nodes.
Then each node restores the shards in much the same way they are replicated from other shards in a cluster, the only apparent difference being their source, i.e. they are restoring segments from the repository instead of a ‘peer’ shard in the cluster.
The snapshot data in the repository has a friendly easy-to-understand format (See reference), the key parts being the following, HOWEVER, you should not touch these files. And there is no way to prune or clean up this storage outside of Elasticsearch as it’s all self-referential.
- index-N — List of all Snapshot IDs and the Indexes they contain. N is the file generation number. This file maps snapshots to indexes.
- index.latest — Contains a number of the latest Index-N file (used in repos that don’t allow list).
- incompatible-snapshots — List of no-longer compatible snapshots for this cluster version.
- snap-YYYYMMDD.dat — Snapshot metadata for this snapshot (not always date format name)
- meta-YYYYMMDD.dat — Global metadata if included in this snapshot (not always date format name)
- indices/ — Directory with all the index data, by shard
- 0McTFz3XRFSSMUIotGKKog — Repo-assigned Index Directory, one per Index
- meta-YYYYMMDD.dat — Metadata for this index, for a particular snapshot
- 0/ — Directory for shard 0 — this can have thousands of segment subdirectories
- snap-YYMMDD.dat — Maps segment files to the names in repo
- __VPO5oDMVT5y4Akv8T_AO_A — Segment files, see snap-* to map to actual segment files
That’s it. Elasticsearch has very powerful easy-to-mange backup and recovery systems that support a variety of target repositories. You should be using this system, now and forever.
From the ELKman.io blog — ELKman is the only commercial Elastic® Stack management tool. Free Trials available.