Vector似乎在Scala集合派对上迟到了,所有有影响力的博客帖子都已经离开了。
在Java中,ArrayList是默认的集合-我可能会使用LinkedList,但只有当我考虑过算法并足够关心优化时。在Scala中,我应该使用向量作为我的默认Seq,还是试图找出列表实际上更合适?
Vector似乎在Scala集合派对上迟到了,所有有影响力的博客帖子都已经离开了。
在Java中,ArrayList是默认的集合-我可能会使用LinkedList,但只有当我考虑过算法并足够关心优化时。在Scala中,我应该使用向量作为我的默认Seq,还是试图找出列表实际上更合适?
当前回答
Some of the statements here are confusing or even wrong, especially the idea that immutable.Vector in Scala is anything like an ArrayList. List and Vector are both immutable, persistent (i.e. "cheap to get a modified copy") data structures. There is no reasonable default choice as their might be for mutable data structures, but it rather depends on what your algorithm is doing. List is a singly linked list, while Vector is a base-32 integer trie, i.e. it is a kind of search tree with nodes of degree 32. Using this structure, Vector can provide most common operations reasonably fast, i.e. in O(log_32(n)). That works for prepend, append, update, random access, decomposition in head/tail. Iteration in sequential order is linear. List on the other hand just provides linear iteration and constant time prepend, decomposition in head/tail. Everything else takes in general linear time.
This might look like as if Vector was a good replacement for List in almost all cases, but prepend, decomposition and iteration are often the crucial operations on sequences in a functional program, and the constants of these operations are (much) higher for vector due to its more complicated structure. I made a few measurements, so iteration is about twice as fast for list, prepend is about 100 times faster on lists, decomposition in head/tail is about 10 times faster on lists and generation from a traversable is about 2 times faster for vectors. (This is probably, because Vector can allocate arrays of 32 elements at once when you build it up using a builder instead of prepending or appending elements one by one). Of course all operations that take linear time on lists but effectively constant time on vectors (as random access or append) will be prohibitively slow on large lists.
那么我们应该使用哪种数据结构呢? 基本上有四种常见的情况:
We only need to transform sequences by operations like map, filter, fold etc: basically it does not matter, we should program our algorithm generically and might even benefit from accepting parallel sequences. For sequential operations List is probably a bit faster. But you should benchmark it if you have to optimize. We need a lot of random access and different updates, so we should use vector, list will be prohibitively slow. We operate on lists in a classical functional way, building them by prepending and iterating by recursive decomposition: use list, vector will be slower by a factor 10-100 or more. We have an performance critical algorithm that is basically imperative and does a lot of random access on a list, something like in place quick-sort: use an imperative data structure, e.g. ArrayBuffer, locally and copy your data from and to it.
其他回答
如果你的编程是不可变的,并且需要随机访问,那么Seq是最好的选择(除非你想要一个Set,实际上你经常这样做)。除此之外,List工作得很好,只是它的操作不能并行化。
如果你不需要不可变的数据结构,坚持使用ArrayBuffer,因为它在Scala中相当于ArrayList。
Some of the statements here are confusing or even wrong, especially the idea that immutable.Vector in Scala is anything like an ArrayList. List and Vector are both immutable, persistent (i.e. "cheap to get a modified copy") data structures. There is no reasonable default choice as their might be for mutable data structures, but it rather depends on what your algorithm is doing. List is a singly linked list, while Vector is a base-32 integer trie, i.e. it is a kind of search tree with nodes of degree 32. Using this structure, Vector can provide most common operations reasonably fast, i.e. in O(log_32(n)). That works for prepend, append, update, random access, decomposition in head/tail. Iteration in sequential order is linear. List on the other hand just provides linear iteration and constant time prepend, decomposition in head/tail. Everything else takes in general linear time.
This might look like as if Vector was a good replacement for List in almost all cases, but prepend, decomposition and iteration are often the crucial operations on sequences in a functional program, and the constants of these operations are (much) higher for vector due to its more complicated structure. I made a few measurements, so iteration is about twice as fast for list, prepend is about 100 times faster on lists, decomposition in head/tail is about 10 times faster on lists and generation from a traversable is about 2 times faster for vectors. (This is probably, because Vector can allocate arrays of 32 elements at once when you build it up using a builder instead of prepending or appending elements one by one). Of course all operations that take linear time on lists but effectively constant time on vectors (as random access or append) will be prohibitively slow on large lists.
那么我们应该使用哪种数据结构呢? 基本上有四种常见的情况:
We only need to transform sequences by operations like map, filter, fold etc: basically it does not matter, we should program our algorithm generically and might even benefit from accepting parallel sequences. For sequential operations List is probably a bit faster. But you should benchmark it if you have to optimize. We need a lot of random access and different updates, so we should use vector, list will be prohibitively slow. We operate on lists in a classical functional way, building them by prepending and iterating by recursive decomposition: use list, vector will be slower by a factor 10-100 or more. We have an performance critical algorithm that is basically imperative and does a lot of random access on a list, something like in place quick-sort: use an imperative data structure, e.g. ArrayBuffer, locally and copy your data from and to it.
对于不可变的集合,如果你想要一个序列,你的主要决定是使用IndexedSeq还是LinearSeq,这对性能有不同的保证。IndexedSeq提供了元素的快速随机访问和快速长度操作。线性seq只提供了通过头部对第一个元素的快速访问,但也有一个快速的尾部操作。(摘自Seq文档。)
对于IndexedSeq,您通常会选择Vector。Ranges和WrappedStrings也是IndexedSeqs。
对于LinearSeq,您通常会选择一个列表或其惰性等效流。其他的例子是队列和堆栈。
所以在Java术语中,ArrayList类似于Scala的Vector, LinkedList类似于Scala的List。但在Scala中,我更倾向于使用List而不是Vector,因为Scala对包含序列遍历的函数有更好的支持,比如映射、折叠、迭代等。您将倾向于使用这些函数将列表作为一个整体来操作,而不是随机访问单个元素。
好吧,一个List可以非常快,如果算法可以单独用::,头部和尾部来实现。最近,我有了一个实例,当我通过生成一个列表而不是数组来击败Java的分裂时,其他任何东西都无法击败它。
然而,List有一个基本的问题:它不能与并行算法一起工作。我无法以有效的方式将List分割为多个段,或将其连接回去。
还有其他类型的集合可以更好地处理并行性——Vector就是其中之一。Vector也有很好的局部性——List没有——这对某些算法来说是一个真正的优点。
因此,考虑到所有因素,Vector是最佳选择,除非您有特定的考虑,使其他集合更可取——例如,如果您想要延迟计算和缓存(Iterator更快,但不缓存),则可以选择Stream,如果算法自然地使用我提到的操作实现,则可以选择List。
顺便说一下,最好使用Seq或IndexedSeq,除非你想要一个特定的API(比如List的::),或者如果你的算法可以并行运行,甚至使用GenSeq或GenIndexedSeq。
在涉及大量随机访问和随机突变的情况下,Vector(或者-正如文档所说- Seq)似乎是一个很好的折衷方案。这也是性能特征所暗示的。
此外,Vector类在没有太多数据复制的分布式环境中似乎表现得很好,因为不需要对完整的对象进行写时复制。(参见:http://akka.io/docs/akka/1.1.3/scala/stm.html # persistent-datastructures)