鉴于字符串在.NET中是不可变的,我想知道为什么它们被设计成string.Substring()需要O(substring.Length)时间,而不是O(1)?

也就是说,如果有的话,折衷是什么?


正是因为字符串是不可变的,. substring必须至少复制原始字符串的一部分。复制n个字节需要O(n)个时间。

你认为如何在常数时间内复制一堆字节?


编辑:Mehrdad建议完全不要复制字符串,而是保留对其中一部分的引用。

考虑在。net中,一个数兆字节的字符串,有人对它调用. substring (n, n+3)(对于字符串中间的任何n)。

现在,整个字符串不能被垃圾收集只是因为一个引用持有4个字符? 这似乎是对空间的荒谬浪费。

此外,跟踪对子字符串(甚至可能在子字符串内部)的引用,并试图在最佳时间复制以避免击败GC(如上所述),使这个概念成为噩梦。复制到. substring上并维护直接的不可变模型要简单得多,也更可靠。


编辑:这里有一篇关于在大字符串中保留子字符串引用的危险的小文章。


Java(相对于。net)提供了两种执行Substring()的方法,您可以考虑是只保留一个引用还是将整个子字符串复制到新的内存位置。

简单的.substring(…)与原始String对象共享内部使用的char数组,然后使用new String(…)可以在需要时将其复制到新数组(以避免阻碍原始数组的垃圾收集)。

我认为这种灵活性是开发人员的最佳选择。


更新:我非常喜欢这个问题,我刚刚把它写在博客上。参见字符串、不可变性和持久性


简单的回答是:如果n不变大,O(n)就是O(1)。大多数人从微小的字符串中提取微小的子字符串,因此复杂性如何渐近增长是完全无关的。

长一点的答案是:

在实例上的操作允许重用原始数据的内存,只需要少量的复制或分配(通常是O(1)或O(lgn)),这种不可变数据结构被称为“持久的”不可变数据结构。.NET中的字符串是不可变的;你的问题本质上是“为什么它们不是持久的”?

因为当你观察。net程序中通常对字符串执行的操作时,在各个相关方面,仅仅创建一个全新的字符串几乎不会更糟糕。构建复杂持久数据结构的成本和难度并不足以收回成本。

People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. The line will be maybe a couple hundred characters long, the name will be a couple dozen. String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; "fast enough" is by definition fast enough.

提取的子字符串通常大小较小,生命周期短;垃圾收集器很快就会回收它们,而且它们一开始在堆上并没有占用太多的空间。因此,使用鼓励重用大部分内存的持久化策略也不是一种胜利;您所做的一切只是让垃圾收集器变得更慢,因为现在它必须担心处理内部指针。

If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; it would be wasteful and foolish not to. But most line-of-business programmers do not do anything even vaguely like those sorts of things. .NET is not a platform that is tailored for the needs of the Human Genome Project; DNA analysis programmers have to solve problems with those string usage characteristics every day; odds are good that you do not. The few who do build their own persistent data structures that closely match their usage scenarios.

For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. It would be unrealistic to expect the built-in string type to solve this problem for us.


Java曾经引用更大的字符串,但是:

Java也将其行为更改为复制,以避免内存泄漏。

我觉得它还可以改进:为什么不只是有条件地复制呢?

如果子字符串至少是父字符串大小的一半,则可以引用父字符串。否则,你可以复制一份。这避免了大量内存泄漏,同时仍然提供了显著的好处。


这里没有一个答案解决了“括号问题”,也就是说。net中的字符串被表示为BStr(存储在指针“之前”内存中的长度)和CStr(字符串以“\0”结尾)的组合。

字符串“Hello there”因此表示为

0B 00 00 00 48 00 65 00 6C 00 6F 00 20 00 74 00 68 00 65 00 72 00 65 00 00 00

(如果在固定语句中赋值给char*,指针将指向0x48。)

这种结构允许快速查找字符串的长度(在许多上下文中很有用),并允许指针在P/Invoke中传递给Win32(或其他)api,这些api需要一个以空结束的字符串。

当你执行Substring(0,5)时,“哦,但我保证在最后一个字符之后会有一个空字符”规则说你需要创建一个副本。即使你在末尾得到了子字符串,也没有地方可以在不破坏其他变量的情况下放置长度。


有时候,你真的想要讨论“字符串的中间”,你不一定关心P/Invoke行为。最近添加的ReadOnlySpan<T>结构体可用于获取一个无复制子字符串:

string s = "Hello there";
ReadOnlySpan<char> hello = s.AsSpan(0, 5);
ReadOnlySpan<char> ell = hello.Slice(1, 3);

ReadOnlySpan<char> "substring"独立地存储长度,它不保证值的末尾有一个'\0'。它可以在许多方面“像字符串一样”使用,但它不是“字符串”,因为它既没有BStr特征,也没有CStr特征(更不用说两者都有)。如果你从来不(直接)P/Invoke,那么没有太大的区别(除非你想调用的API没有ReadOnlySpan<char>重载)。

ReadOnlySpan<char>不能用作引用类型的字段,因此还有ReadOnlyMemory<char> (s.AsMemory(0,5)),这是ReadOnlySpan<char>的间接方式,因此存在相同的差异-from-string。

Some of the answers/comments on previous answers talked about it being wasteful to have the garbage collector have to keep a million-character string around while you continue to talk about 5 characters. That is precisely the behavior you can get with the ReadOnlySpan<char> approach. If you're just doing short computations, the ReadOnlySpan approach is probably better. If you need to persist it for a while and you're going to keep only a small percentage of the original string, doing a proper substring (to trim off the excess data) is probably better. There's a transition point somewhere in the middle, but it depends on your specific usage.