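I'm trying to understand what shards and replicas are in Elasticsearch, but I haven't managed to. If I download Elasticsearch and run the script, then as far as I know I have started a cluster with a single node. Now this node (my PC) has 5 shards (?) and some replicas (?).

What are they? Do I have 5 duplicates of the index? If so, why? I could use some explanation.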


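An index is broken into shards so that it can be distributed and scaled.

Replicas are copies of the shards and provide reliability if a node is lost. This number often causes confusion: a replica count of 1 means each shard has one primary copy and one replica copy, and both must be available for the cluster to be in the green state.

For replicas to be allocated, you must have at least 2 nodes in your cluster.

You may find the definitions here easier to understand: http://www.elasticsearch.org/guide/reference/glossary/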


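I'll try to explain with a real example, since the answers and replies you got don't seem to have helped you.

When you download elasticsearch and start it up, you create an elasticsearch node which tries to join an existing cluster if one is available, or creates a new one. Let's say you created your own new cluster with a single node, the one you just started up. We have no data, so we need to create an index.

When you create an index (an index is also created automatically when you index the first document) you can define how many shards it will be composed of. If you don't specify a number it will have the default number of shards: 5 primaries (this was the default before Elasticsearch 7.0; from 7.0 on the default is 1). What does that mean?

It means that elasticsearch will create 5 primary shards that will contain your data: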

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

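Every time you index a document, elasticsearch decides which primary shard is supposed to hold that document and indexes it there. Primary shards are not a copy of the data, they are the data! Having multiple shards does help take advantage of parallel processing on a single machine, but the whole point is that if we start another elasticsearch instance on the same cluster, the shards will be distributed evenly over the cluster.

Node 1 will then hold, for example, only three shards: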

 ____    ____    ____ 
| 1  |  | 2  |  | 3  |
|____|  |____|  |____|

since the remaining two shards have been moved to the newly started node:

 ____    ____
| 4  |  | 5  |
|____|  |____|

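Why does this happen? Because elasticsearch is a distributed search engine, and this way you can make use of multiple nodes/machines to manage large amounts of data.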
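As a rough illustration of the even distribution described above, here is a minimal Python sketch. It is a toy model, not Elasticsearch's real allocator: the function name `distribute_primaries` and the contiguous-block strategy are assumptions chosen only so that the result lines up with the diagrams.

```python
def distribute_primaries(num_shards, nodes):
    """Toy allocator (an assumption for illustration, not Elasticsearch's algorithm).

    Hands out shards 1..num_shards to the given nodes in contiguous blocks
    whose sizes differ by at most one.
    """
    allocation = {}
    base, extra = divmod(num_shards, len(nodes))
    next_shard = 1
    for position, node in enumerate(nodes):
        # The first `extra` nodes each take one additional shard.
        count = base + (1 if position < extra else 0)
        allocation[node] = list(range(next_shard, next_shard + count))
        next_shard += count
    return allocation


# One node holds everything; a second node takes over part of the shards.
print(distribute_primaries(5, ["node1"]))
print(distribute_primaries(5, ["node1", "node2"]))
```

With two nodes the toy allocator gives three shards to the first node and two to the second, which is the split shown in the diagrams. Which shard lands on which node in a real cluster is up to Elasticsearch.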

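Every elasticsearch index is composed of at least one primary shard, since that's where the data is stored. Every shard comes at a cost, though, so if you have a single node and no foreseeable growth, just stick with a single primary shard.

Another type of shard is a replica. The default is 1, meaning that every primary shard will be copied to another shard that contains the same data. Replicas are used to increase search performance and for fail-over. A replica shard is never allocated on the same node as its primary (that would be pretty much like putting a backup on the same disk as the original data).

Back to our example: with 1 replica we'll have the whole index on each node, since 2 replica shards will be allocated on the first node and they will contain exactly the same data as the primary shards on the second node: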

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4R |  | 5R |
|____|  |____|  |____|  |____|  |____|

The same goes for the second node, which will contain a copy of the primary shards on the first node:

 ____    ____    ____    ____    ____
| 1R |  | 2R |  | 3R |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

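With a setup like this, if a node goes down, you still have the whole index. The replica shards will automatically become primaries and the cluster will keep working properly despite the node failure, as follows: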

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

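Since you have "number_of_replicas":1, the replicas cannot be assigned anymore, because a replica is never allocated on the same node as its primary. That's why you'll have 5 unassigned shards (the replicas), and the cluster status will be YELLOW instead of GREEN. No data is lost, but it could be better, since some shards cannot be assigned.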
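The failure scenario above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how Elasticsearch tracks cluster state: the dictionary layout and the helper names `build_cluster`, `fail_node` and `health` are made up for illustration.

```python
def build_cluster():
    """Two-node layout from the example: shards 1-3 primary on node1, 4-5 on node2."""
    state = {}
    for shard_id in (1, 2, 3):
        state[shard_id] = {"primary": "node1", "replicas": ["node2"]}
    for shard_id in (4, 5):
        state[shard_id] = {"primary": "node2", "replicas": ["node1"]}
    return state


def fail_node(state, dead_node):
    """Return a new state after dead_node disappears (hypothetical helper)."""
    new_state = {}
    for shard_id, copies in state.items():
        primary = copies["primary"]
        replicas = [node for node in copies["replicas"] if node != dead_node]
        if primary == dead_node:
            # Promote a surviving replica if there is one; otherwise the shard is lost.
            primary = replicas.pop(0) if replicas else None
        new_state[shard_id] = {"primary": primary, "replicas": replicas}
    return new_state


def health(state, num_replicas):
    """red: a primary is missing; yellow: only replicas are missing; green: nothing missing."""
    if any(copies["primary"] is None for copies in state.values()):
        return "red"
    missing = sum(max(0, num_replicas - len(copies["replicas"])) for copies in state.values())
    return "yellow" if missing else "green"


before = build_cluster()
after = fail_node(before, "node2")
print(health(before, 1), health(after, 1), health(after, 0))
```

In this model, losing node2 leaves all five primaries on node1 with no replicas, so the status is yellow with "number_of_replicas":1 and green if zero replicas are requested.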

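As soon as the node that left is back up, it will join the cluster again and the replicas will be assigned again. The existing shards on the second node can be loaded, but they need to be synchronized with the other shards, as write operations most likely happened while the node was down. At the end of this operation, the cluster status will become GREEN.

Hope this clarifies things for you.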


If you really don't like to see it yellow, you can set the number of replicas to zero:

curl -XPUT 'localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}
'

Note that you should only do this on your local development box.


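In ElasticSearch, at the top level we index documents into indices. Each index has a number of shards, which distribute the data internally, and inside the shards are the Lucene segments, which are the core storage of the data. So if an index has 5 shards, the data is spread across those shards and each shard holds different data.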
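To make "each shard holds different data" concrete, here is a small Python sketch of hash-based routing. It only illustrates the general idea of "hash the document id, take it modulo the number of primary shards"; `zlib.crc32` is a stand-in chosen for this example and is not the hash function Elasticsearch uses, and `pick_shard` is a hypothetical helper.

```python
import zlib
from collections import Counter


def pick_shard(doc_id, num_primary_shards):
    """Map a document id to a shard number in [0, num_primary_shards)."""
    # crc32 is a stand-in hash for illustration only.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(100)]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

# Every document lives in exactly one shard, so the per-shard counts add up to 100.
docs_per_shard = Counter(placement.values())
print(sorted(docs_per_shard.items()))
```

Because the result depends on the number of primary shards, changing that number would map existing ids to different shards, which is one way to see why the primary count is treated as fixed once an index exists.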

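Have a look at this video, which explains the core of ES: https://www.youtube.com/watch?v=PpX7J-G2PEo

An article on multiple indices versus multiple shards: "Elastic search, multiple indexes vs one index and types for different data sets?"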


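An index is broken into shards so that it can be distributed and scaled.

Replicas are copies of the shards.

A node is a running instance of Elasticsearch that belongs to a cluster.

A cluster consists of one or more nodes that share the same cluster name. Each cluster has a single master node, which is chosen automatically by the cluster and which can be replaced if the current master node fails.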


Shard:

Being a distributed search server, ElasticSearch uses a concept called a shard to distribute index documents across all nodes. An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone. To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Documents are stored in shards, and shards are allocated to nodes in your cluster. As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced. A shard can be either a primary shard or a replica shard. Each document in your index belongs to a single primary shard, so the number of primary shards that you have determines the maximum amount of data that your index can hold. A replica shard is just a copy of a primary shard.

Replica:

A replica shard is a copy of a primary shard, kept to prevent data loss in case of hardware failure. Elasticsearch allows you to make one or more copies of your index's shards into what are called replica shards, or replicas for short. An index can also be replicated zero (meaning no replicas) or more times. The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of shards after the fact. By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica (the default before version 7.0; from 7.0 on it is 1 primary shard and 1 replica), which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.


I will explain this using a real-world scenario. Imagine you are running an ecommerce website. As you become more popular, more sellers and products are added to your website. You will realize that the number of products you need to index has grown too large to fit on the hard disk of one node. Even if it fits on the disk, searching through all the documents on one machine is extremely slow. One index on one node will not take advantage of the distributed cluster configuration that elasticsearch is built on.

So elasticsearch splits the documents in the index across multiple nodes in the cluster. Each such slice of the index is called a shard. Each node carrying a shard holds only a subset of the documents. Suppose you have 100 products and 5 shards: each shard will hold roughly 20 products. This sharding of data is what makes low-latency search possible in elasticsearch. The search is conducted in parallel on multiple nodes, and the results are aggregated and returned. However, shards alone do not provide fault tolerance: if a node holding a shard goes down and no copy of that shard exists elsewhere, the cluster health becomes red, meaning some of the data is not available.

To increase fault tolerance, replicas come into the picture. By default elasticsearch creates a single replica of each shard. A replica is always created on a node other than the one where its primary shard resides. So to make the system fault tolerant, you might have to increase the number of nodes in your cluster, and how many depends on the number of shards and replicas of your index. The formula "number of nodes = number of shards * (number of replicas + 1)" gives the number of nodes you would need to put every shard copy on a node of its own; the minimum for all replicas to be assigned is "number of replicas + 1" nodes. The standard practice is to have at least one replica for fault tolerance.
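The arithmetic behind that formula is easy to check. The sketch below is plain counting, with hypothetical helper names; it assumes only the rule stated above that a replica never shares a node with its primary (and, by extension, that copies of the same shard sit on different nodes).

```python
def shard_copies(num_primaries, num_replicas):
    """Total number of shard copies: every primary plus each of its replicas."""
    return num_primaries * (num_replicas + 1)


def min_nodes_for_all_replicas(num_replicas):
    """Fewest nodes on which every copy of a shard can sit on a different node."""
    return num_replicas + 1


# 5 primaries with 1 replica each: 10 copies in total, assignable on 2 nodes.
print(shard_copies(5, 1), min_nodes_for_all_replicas(1))
```

The product is therefore best read as an upper bound on how many nodes can each hold a shard copy of the index, while "replicas + 1" is the lower bound for getting every replica assigned.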

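Setting the number of shards is a static operation, meaning you have to specify it when you create the index. Any change after that requires a complete reindexing of the data and takes time. Setting the number of replicas, on the other hand, is a dynamic operation and can be done at any time after index creation.

You can set the number of shards and replicas for your index with the command below.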

curl -XPUT 'localhost:9200/sampleindex?pretty' -H 'Content-Type: application/json' -d '
{
  "settings":{
    "number_of_shards":2,
    "number_of_replicas":1
  }
}'
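For comparison, here is a hedged Python sketch that builds the same kind of request with the standard library only. The host `localhost:9200` and the index name `sampleindex` are taken from the curl example; the function names are made up for illustration, and nothing is sent unless you pass the request to `urllib.request.urlopen` yourself against a running node.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node, as in the curl example


def _put_json(url, body):
    """Build (but do not send) a PUT request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_index_request(index, shards, replicas):
    """Static setting: the shard count is given when the index is created."""
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    return _put_json(f"{BASE_URL}/{index}", body)


def update_replicas_request(index, replicas):
    """Dynamic setting: the replica count can be changed on an existing index."""
    body = {"index": {"number_of_replicas": replicas}}
    return _put_json(f"{BASE_URL}/{index}/_settings", body)


request = update_replicas_request("sampleindex", 0)
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before anything touches a cluster.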

Elasticsearch is superbly scalable, with all the credit going to its distributed architecture, which is made possible by sharding. Before moving further into it, let us consider a simple and very common use case. Suppose you have an index which contains a huge number of documents and, for the sake of simplicity, that the size of that index is 1 TB (i.e., the sum of the sizes of all documents in that index is 1 TB). Also assume that you have two nodes, each with 512 GB of space available for storing data. Clearly, the entire index cannot be stored on either of the two nodes, so we need to distribute the index among these nodes.

In cases like this, where the size of an index exceeds the hardware limits of a single node, sharding comes to the rescue. Sharding solves the problem by dividing the index into smaller pieces, and these pieces are called shards.


Not an answer, but another reference for the core concepts of ElasticSearch, which I think complements @javanna's answer very clearly.

Shards

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone. To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster. Sharding is important for two primary reasons: (1) it allows you to horizontally split/scale your content volume; (2) it allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput.

Replicas

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index's shards into what are called replica shards, or replicas for short. Replication is important for two primary reasons: (1) it provides high availability in case a shard/node fails, and for this reason it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from; (2) it allows you to scale out your search volume/throughput, since searches can be executed on all replicas in parallel.


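In the simplest terms, a shard is nothing but a part of an index stored on disk in a separate folder:

This screenshot shows the entire Elasticsearch directory.

As you can see, all the data goes into the data directory.

By inspecting the index C-mAfLltQzuas72iMiIXNw we see that it has five shards (folders 0 to 4).

On the other hand, the JH_A8PgCRj-GK0GeQ0limw index has only one shard (folder 0).

The pri column shows the number of primary shards.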
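If you want to check this on disk yourself, a small Python sketch like the following lists the numbered shard folders inside one index directory. It assumes only what is described above, namely that each shard is a folder named with a number; the helper name `shard_folders` is made up, and the demo builds a throwaway directory rather than reading a real Elasticsearch data path, whose exact location varies between installations.

```python
import tempfile
from pathlib import Path


def shard_folders(index_dir):
    """Return the shard numbers found as numeric sub-folders of index_dir."""
    return sorted(
        int(entry.name)
        for entry in Path(index_dir).iterdir()
        if entry.is_dir() and entry.name.isdigit()
    )


# Demo on a fake index directory: five shard folders plus one non-shard folder.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "C-mAfLltQzuas72iMiIXNw"
    for name in ["0", "1", "2", "3", "4", "_state"]:
        (index_dir / name).mkdir(parents=True)
    print(shard_folders(index_dir))
```

Filtering on numeric names keeps any other folder that may sit next to the shards, such as the `_state` folder used in the demo, out of the count.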