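What's the rationale for null terminated strings?

As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings: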

- Length prefixed (i.e. Pascal) strings existed before C.
- Length prefixed strings make several algorithms faster by allowing constant time length lookup.
- Length prefixed strings make it more difficult to cause buffer overrun errors.
- Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory, making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
- Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc.) uses length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
- C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
- Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

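Several of these things have come to light more recently than C, so it would make sense for C not to have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?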

EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:

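- Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).
- Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
- Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.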
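To make the complexity claims concrete, here is a minimal sketch in C. The `struct lstr` type and the `lstr_*` / `cstr_append` names are hypothetical, made up for illustration; they are not from any standard library. Note the O(m) append only holds when the buffer already has room, which is presumably why the claim above says "often".

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string; names are illustrative only. */
struct lstr {
    size_t len;   /* bytes currently in use */
    size_t cap;   /* bytes allocated for data */
    char  *data;  /* not necessarily NUL-terminated */
};

/* O(1): the length is stored, not computed. */
size_t lstr_length(const struct lstr *s)
{
    return s->len;
}

/* O(m) when capacity suffices: only the m appended bytes are copied.
   If the buffer must grow, realloc may copy the old contents too. */
int lstr_append(struct lstr *dst, const char *src, size_t m)
{
    if (dst->len + m > dst->cap) {
        size_t ncap = (dst->len + m) * 2;
        char *p = realloc(dst->data, ncap);
        if (p == NULL)
            return -1;            /* dst is left unchanged on failure */
        dst->data = p;
        dst->cap = ncap;
    }
    memcpy(dst->data + dst->len, src, m);
    dst->len += m;
    return 0;
}

/* O(n + m): strcat first walks dst to find its terminator (n steps),
   then copies src (m steps). The caller must guarantee dst has room. */
char *cstr_append(char *dst, const char *src)
{
    return strcat(dst, src);
}
```

Starting from a zeroed `struct lstr`, two calls to `lstr_append` with "foo" and "bar" leave `len` at 6 without ever scanning the existing contents.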

From the answers below, these are some cases where null terminated strings are more efficient:

- When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
- In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).
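The register point is easier to see side by side. This is only a sketch with made-up function names; whether a register is actually saved depends on the compiler and target.

```c
#include <stddef.h>

/* NUL-terminated: the moving pointer is the only loop state. */
size_t count_char_cstr(const char *s, char c)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (*s == c)
            ++n;
    return n;
}

/* Length-carrying: the loop also has to keep an end pointer
   (or a remaining count) alive next to the moving pointer. */
size_t count_char_len(const char *s, size_t len, char c)
{
    size_t n = 0;
    for (const char *end = s + len; s != end; ++s)
        if (*s == c)
            ++n;
    return n;
}
```

The second form keeps counting past an embedded NUL, which the first cannot do.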

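None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

- You need to cut off the end of a string

but this one is incorrect: it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be; length prefixers just subtract from the prefix.)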
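A quick sketch of both truncations, to show that each is a single store. `struct lview` and both function names are hypothetical, for illustration only.

```c
#include <stddef.h>

/* NUL-terminated: write a terminator at the new end.
   Assumes s is writable and new_len <= strlen(s). */
void cstr_truncate(char *s, size_t new_len)
{
    s[new_len] = '\0';
}

/* Hypothetical length-carrying string. */
struct lview {
    char  *data;
    size_t len;
};

/* Length-carrying: shrink the stored length; the bytes are untouched. */
void lview_truncate(struct lview *v, size_t new_len)
{
    if (new_len < v->len)
        v->len = new_len;
}
```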

Current answer

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).
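For what it's worth, here is roughly what those two operations look like with null terminated strings. The function names come from the paragraph above, but the bodies are my own sketch and assume '/' as the path separator. Both return a pointer into the original buffer, so nothing is allocated and nothing can fail.

```c
#include <string.h>

/* Returns a pointer into `path` just past the last '/',
   or `path` itself if there is no '/'. */
const char *path_to_filename(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Returns a pointer to the last '.' in `filename`, or to its
   terminating NUL (an empty string) if there is no '.'. */
const char *filename_to_extension(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    return dot ? dot : filename + strlen(filename);
}
```

The results are ordinary strings because the tail of a null terminated string is itself a null terminated string.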

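C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.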

2010-12-11 22:10:58

Other answers

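In many ways, C was primitive. And I loved it.

It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.

The null terminator is simple and requires no special support from the language.

Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.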

2010-12-11 23:02:16

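One point not yet mentioned: when C was designed, there were many machines where a 'char' was not eight bits (even today there are DSP platforms where it isn't). If one decides that strings are to be length-prefixed, how many 'char's worth of length prefix should one use? Using two would impose an artificial limit on string length for machines with 8-bit char and a 32-bit address space, while wasting space on machines with 16-bit char and a 16-bit address space.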

If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8 bits, one could, for some expense in speed and code size, define a scheme where a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc., and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.

2012-01-25 16:12:57

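I think the better question is why you think C owes you anything. C was designed to give you what you need, nothing more. You need to lose the mentality that the language must provide you with everything. Or just continue to use your higher-level languages that will give you the luxury of String, Calendar, Containers; and in the case of Java you get one thing in tons of variety. Multiple types of String, multiple types of unordered_map(s).

Too bad for you, this was not the purpose of C. C was not designed to be a bloated language that offers everything from a pin to an anchor. Instead you must rely on third-party libraries or your own. And there is nothing easier than creating a simple struct that will contain a string and its size.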

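#include <stddef.h>  /* size_t */

struct String
{
    const char *s;  /* pointer to the character data */
    size_t len;     /* number of characters in the string */
};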
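A minimal sketch of how such a struct might be used. The helper names `string_from_cstr` and `string_skip` are made up for this example. Because the length sits next to the pointer rather than in front of the data, dropping the start of the string stays a constant-time operation with no copy.

```c
#include <stddef.h>
#include <string.h>

struct String
{
    const char *s;
    size_t len;
};

/* Wrap a NUL-terminated string; the O(n) strlen is paid once here. */
struct String string_from_cstr(const char *cstr)
{
    struct String str = { cstr, strlen(cstr) };
    return str;
}

/* View of `str` with its first n characters dropped; nothing is copied.
   n is clamped to the string length. */
struct String string_skip(struct String str, size_t n)
{
    if (n > str.len)
        n = str.len;
    str.s += n;
    str.len -= n;
    return str;
}
```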

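You know what the problem is, though. It is not standard. Another language might decide to put the len before the string. Another language might decide to use a pointer to the end instead. Another might decide to use six pointers to make the String more efficient. However, a null terminated string is the most standard format for a string, one you can use to interface with any language. Even Java JNI uses null terminated strings.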

Lastly, as the common saying goes: the right data structure for the task. If you find that you need to know the size of a string more than anything else, then use a string structure that allows you to do that optimally. But don't claim that that operation is used more than anything else by everybody. Why, for instance, is knowing the size of a string more important than reading its contents? I find that reading the contents of a string is what I mostly do, so I use null terminated strings instead of std::string, which saves me 5 pointers on a GCC compiler. If I can even save 2 pointers, that is good.

2021-12-28 03:41:57

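Assume for a moment that C implemented strings the Pascal way, by prefixing them with their length: is a 7-char string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock to make it thread safe? The C approach sidestepped all these issues, like it or not.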

2010-12-12 04:26:17

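One advantage of NUL-termination over length-prefixing, which I have not seen anyone mention, is the simplicity of string comparison. Consider the standard comparison, which returns a signed result for less-than, equal, or greater-than. For length-prefixing, the algorithm has to be something along the following lines: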

1. Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3).
2. Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length.
3. When the smaller length is reached, one string is a prefix of the other. Return a negative or positive value according to which is shorter, or zero if of equal length.
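The steps above could be sketched in C roughly like this. `lp_compare` is a made-up name, and the lengths are passed separately rather than read from an actual prefix, to keep the sketch short.

```c
#include <stddef.h>

/* Three-way comparison of two length-carrying strings.
   Returns <0, 0 or >0, like strcmp. */
int lp_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    /* Step 1: record the smaller of the two lengths. */
    size_t n = alen < blen ? alen : blen;

    /* Step 2: scan up to that length, stopping at the first difference. */
    for (size_t i = 0; i < n; ++i) {
        int diff = (int)(unsigned char)a[i] - (int)(unsigned char)b[i];
        if (diff != 0)
            return diff;
    }

    /* Step 3: one string is a prefix of the other; the shorter sorts first. */
    return (alen > blen) - (alen < blen);
}
```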

Contrast this with the NUL-termination algorithm:

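1. Scan the two character sequences, subtracting characters at matching indices (note that this is better handled with moving pointers). Stop when the difference is nonzero, returning the difference. NOTE: If one string is a PROPER prefix of the other, one of the characters in the subtraction will be NUL, i.e. zero, and the comparison will naturally stop there.
2. If the difference is zero, and only then, check if either character is NUL. If so, return zero; otherwise continue to the next character.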
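As a sketch, the NUL-terminated version comes out as a single loop over two moving pointers. `nul_compare` is a made-up name; in real code you would simply call strcmp.

```c
/* Three-way comparison of two NUL-terminated strings.
   Returns <0, 0 or >0, like strcmp. */
int nul_compare(const char *a, const char *b)
{
    for (;; ++a, ++b) {
        /* Step 1: subtract the current characters. If one string is a
           proper prefix of the other, one side is NUL here and the
           difference is nonzero, so no separate length check is needed. */
        int diff = (int)(unsigned char)*a - (int)(unsigned char)*b;
        if (diff != 0)
            return diff;

        /* Step 2: the characters are equal; if they are NUL, both
           strings ended together. */
        if (*a == '\0')
            return 0;
    }
}
```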

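The NUL-terminated case is simpler, and very easy to implement efficiently with a dual pointer scan. The length-prefixed case does at least as much work, and nearly always more. If your algorithm has to do a lot of string comparisons (e.g. a compiler!), the NUL-terminated case wins out. Nowadays that might not be as significant, but back in the day, heck yeah.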

2021-09-14 14:57:16
