My Elasticsearch Index is Red / Yellow: Why?

Troubleshooting & Fixing Index Status

Elasticsearch is a great and powerful system, especially for creating an extremely scalable distributed data store that automatically tracks, manages, and routes all the data in your indexes.

But sometimes things go wrong, and indexes get into trouble, big and small. When that happens, the affected indexes report a status of red or yellow. The cluster follows, since its status is simply the worst status of any index: if one index is red, the cluster is red.
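The "worst status wins" rule can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Elasticsearch's own code; the function name cluster_status and the sample index names are made up for the example.

```python
# Rank the three health colors from best to worst.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def cluster_status(index_statuses):
    """Return the worst status among the indexes (green if there are none)."""
    return max(index_statuses.values(), key=SEVERITY.__getitem__, default="green")


# Hypothetical index names and statuses, for illustration only.
statuses = {"logs-a": "green", "logs-b": "yellow", "logs-c": "red"}
print(cluster_status(statuses))  # red
```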

So what do you do if your cluster and some indexes are red or yellow? You need to find out why. You can use our ELKman tool, which was built for this, but you can also do it yourself in Kibana’s Dev Tools tab or with cURL.

What Does Red or Yellow Mean?

First, a word on what the colors mean, as they can seem complex, but in the end are simple:

  • Yellow: One or more indexes have missing (“unassigned”) replica shards. The index is still working and can fully index, search, and serve data, just not as fast or as reliably as we’d like.
      • The missing shards may be truly missing, damaged, or have other problems; or the cluster may just be in the middle of moving or rebuilding them.
      • Our job is to manually or automatically recreate these missing replicas to get to green.
  • Red: One or more indexes have missing primary shards and are not fully functional; the data in those shards cannot be indexed, searched, or served.
      • Note this is on a per-shard basis, so even with 50 shards it only takes one dead primary to turn the index and the cluster red.
      • Our job is to manually find or fix these missing primaries, if we can; otherwise the index is lost and must be recreated from snapshots or original source data.
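The color rules above can be written down as a small Python sketch. It is a simplification under stated assumptions: each shard is represented as a (prirep, state) pair using the values the _cat/shards API reports, and a shard is treated as present when its state is STARTED or RELOCATING. The function name index_health is hypothetical.

```python
# Shard states that count as "present and serving data" in this sketch.
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def index_health(shards):
    """Derive an index color from a list of (prirep, state) pairs.

    prirep is "p" for a primary and "r" for a replica.
    """
    missing = {prirep for prirep, state in shards if state not in ACTIVE_STATES}
    if "p" in missing:
        return "red"      # at least one primary is missing
    if "r" in missing:
        return "yellow"   # every primary is fine, but a replica is missing
    return "green"


# 50 primaries, of which a single one is unassigned: the index is red.
fifty_shards = [("p", "STARTED")] * 49 + [("p", "UNASSIGNED")]
print(index_health(fifty_shards))  # red
```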

Finding Red & Yellow Indexes

1) The first step is to identify major issues you know about, such as a dead node, disk space issues, etc. that are likely to create problems. This helps inform what we look for and how we fix it later.

Sometimes you just need to be patient, as often the system will fix itself by moving data around, such as promoting replicas to primaries and then recreating new replicas, but this takes time, from minutes to much longer, depending on shard count and sizes, cluster loads, disk speeds, etc.

But you can’t count on this unless it’s clear the system is fixing itself. Sometimes things really are broken, which is why it’s good to know the history, since rebooting a node will certainly drive some indexes to yellow, but then green again in a few minutes.
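One rough way to judge whether the system is fixing itself is to compare two cluster health readings taken a few minutes apart. The sketch below assumes each reading is the parsed JSON from GET /_cluster/health, which includes the unassigned_shards, initializing_shards, and relocating_shards counters. The function name is_recovering and the sample numbers are invented for illustration.

```python
def is_recovering(previous, current):
    """Compare two _cluster/health readings taken a little while apart.

    Returns True when shards are actively moving or being rebuilt, or when
    the number of unassigned shards has dropped since the earlier reading.
    """
    moving = current["initializing_shards"] + current["relocating_shards"]
    shrinking = current["unassigned_shards"] < previous["unassigned_shards"]
    return moving > 0 or shrinking


# Hypothetical readings: nothing has changed, so waiting longer may not help.
earlier = {"unassigned_shards": 12, "initializing_shards": 0, "relocating_shards": 0}
later = {"unassigned_shards": 12, "initializing_shards": 0, "relocating_shards": 0}
print(is_recovering(earlier, later))  # False
```

If this keeps returning False after a reasonable wait, that is a hint the problem needs the manual investigation described in the following steps.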

2) The second step is to determine which, and how many, indexes are having trouble. The _cat API can tell us this, by status:

GET /_cat/indices?v&health=red

GET /_cat/indices?v&health=yellow

From that we get a sense of how many problems we have, which is likely related to any recent events discussed above. We also need this list so we can dig deeper into each index.
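Outside Kibana, the same two requests can be scripted. The following is a minimal Python sketch using only the standard library. It assumes an unsecured cluster reachable at http://localhost:9200 (adjust the URL, and add authentication and TLS as your cluster requires) and uses the _cat format=json option so the rows are easy to parse. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def cat_indices_url(base_url, health):
    """Build the _cat/indices URL for one health color, asking for JSON."""
    query = urlencode({"health": health, "format": "json"})
    return f"{base_url}/_cat/indices?{query}"


def troubled_indexes(base_url):
    """Fetch the red and yellow index names; needs a reachable cluster."""
    found = {}
    for health in ("red", "yellow"):
        with urlopen(cat_indices_url(base_url, health)) as response:
            rows = json.load(response)
        found[health] = [row["index"] for row in rows]
    return found


# Assumed address of a local, unsecured test cluster.
print(cat_indices_url("http://localhost:9200", "red"))
# http://localhost:9200/_cat/indices?health=red&format=json
```

Calling troubled_indexes("http://localhost:9200") against a live cluster would return a dictionary with one list of index names per color.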

3) The third step is to see which shards are having trouble and why. The index list only tells you which indexes have issues; now we need a per-shard list of problems.

We use the _cat interface for this, ideally with a sort and some extra columns. The request below lists every shard, sorted by store size and then index name, along with the basic reason why any shard is unassigned. Look for the UNASSIGNED state:

GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details&s=sto,index

That may be enough to know what is going on, thanks to the unassigned.details column, and from that we can work to solve the problem. But sometimes we need more detail, especially if we have node routing or other more complex issues.
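With many shards, the per-shard list is easier to read once it is grouped by reason. Here is a small Python sketch of that grouping. It assumes the rows come from the _cat/shards request with format=json added, so that each row is a dictionary of strings keyed by column name. The sample rows and index names are invented, and the function name is hypothetical.

```python
from collections import defaultdict


def unassigned_by_reason(rows):
    """Group unassigned shards by their unassigned.reason value."""
    groups = defaultdict(list)
    for row in rows:
        if row["state"] == "UNASSIGNED":
            reason = row.get("unassigned.reason") or "UNKNOWN"
            groups[reason].append((row["index"], int(row["shard"]), row["prirep"]))
    return dict(groups)


# Invented sample rows in the shape of _cat/shards JSON output.
sample_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED",
     "unassigned.reason": None},
    {"index": "logs-a", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-b", "shard": "1", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
]
print(unassigned_by_reason(sample_rows))
# {'NODE_LEFT': [('logs-a', 0, 'r'), ('logs-b', 1, 'r')]}
```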

When more detail is needed, we can ask the cluster to explain the current allocation situation and logic for a given shard. This is a bit messy, as we need both the shard number (starting with 0) and whether we want the primary or a replica, both taken from the list above.

The API call is this, where you need to set the index name, shard number, and primary true/false:

GET /_cluster/allocation/explain
{"index": "great-index-2020.07.10", "shard": 0, "primary": true}

This will give you a much more detailed look at the situation, and what you do next depends on the reasons you find there.
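Since the three values in the request body come straight from the per-shard list, building the body can be automated. The Python sketch below assumes a row shaped like the _cat/shards JSON output, where prirep is "p" for a primary and "r" for a replica and the shard number arrives as a string. The function name explain_body is hypothetical.

```python
import json


def explain_body(row):
    """Turn one _cat/shards row into an allocation explain request body."""
    return {
        "index": row["index"],
        "shard": int(row["shard"]),         # _cat returns numbers as strings
        "primary": row["prirep"] == "p",    # "p" = primary, "r" = replica
    }


row = {"index": "great-index-2020.07.10", "shard": "0", "prirep": "p"}
print(json.dumps(explain_body(row)))
# {"index": "great-index-2020.07.10", "shard": 0, "primary": true}
```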

Some common issues include:

  • Low Disk Space: No room to allocate the shard.
  • Shard Count Limits: Too many shards per node, common when new indexes are created or some nodes are removed and the system can’t find a place for the shards.
  • JVM or Heap Limits: Some versions can limit allocations when they are low on RAM.
  • Routing or Allocation Rules: Common in HA cloud or large, complex systems.
  • Corruption or Serious Problems: There are many more issues that can arise, each needing special attention or solutions, or, in many cases, just removing the old shards and adding new replicas or primaries.
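To see which of these issues is actually blocking a shard, it can help to count the deciders that answered NO in the explain output. The Python sketch below assumes the response contains a node_allocation_decisions list whose entries each carry a deciders list with decider and decision fields. The trimmed sample response is invented, and the decider names you see will depend on your Elasticsearch version and configuration.

```python
from collections import Counter


def blocking_deciders(explain_response):
    """Count, across all nodes, the deciders that refused the allocation."""
    counts = Counter()
    for node in explain_response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                counts[decider["decider"]] += 1
    return counts


# Invented, heavily trimmed explain response for illustration.
sample_response = {
    "node_allocation_decisions": [
        {"node_name": "node-1",
         "deciders": [{"decider": "disk_threshold", "decision": "NO"}]},
        {"node_name": "node-2",
         "deciders": [{"decider": "disk_threshold", "decision": "NO"},
                      {"decider": "same_shard", "decision": "NO"}]},
    ]
}
print(blocking_deciders(sample_response).most_common())
# [('disk_threshold', 2), ('same_shard', 1)]
```

In this made-up case, disk space is the blocker on both nodes, which points at the first issue in the list above.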

Fixing Red & Yellow Indexes

4) The fourth step is to fix the problem. Fixes fall into a few categories:

  • Wait and let Elasticsearch fix it: For temporary conditions such as a node rebooting.
  • Manually Allocate the Shard: Sometimes needed when Elasticsearch will not place the shard on its own; this is done through the cluster reroute API.
  • Check Routing / Allocation Rules: Many HA or complex systems use routing or allocation rules to control placement, and as things change, this can create shards that can’t allocate. The explain output should make this clear.
  • Remove all Replicas by setting the number to 0: Maybe you can’t fix the replica or manually move or assign it. In that case, as long as you have a primary (index is yellow, not red), you can always just set the replica count to 0, wait a minute, then set it back to 1 or whatever you want, by sending this body to the index settings API (PUT /<index>/_settings): { "index" : { "number_of_replicas" : 0 } }
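The replica reset in the last item amounts to two settings updates sent a short while apart. The Python sketch below only builds the two requests, so it can be reviewed before anything is sent to the cluster. The function name is hypothetical, and the index name is simply the example used earlier in this article.

```python
import json


def replica_reset_requests(index, replicas=1):
    """Return the two settings updates: drop replicas, then restore them."""
    path = f"/{index}/_settings"
    drop = {"index": {"number_of_replicas": 0}}
    restore = {"index": {"number_of_replicas": replicas}}
    return [("PUT", path, drop), ("PUT", path, restore)]


# Print the requests in a form that can be pasted into Kibana's Dev Tools.
for method, path, body in replica_reset_requests("great-index-2020.07.10"):
    print(method, path)
    print(json.dumps(body))
```

Send the first request, wait for the old replicas to be removed, then send the second. Only do this on a yellow index, since it relies on every primary being healthy.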

We’ll add more details for status and solutions as they arise, but this is a complex area and, like all systems, fixes vary based on the exact details and history of the problem.

ELKman is a professional Elastic® Stack DBA & Admin tool — www.ELKman.io