As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

- Length prefixed (i.e. Pascal) strings existed before C.
- Length prefixed strings make several algorithms faster by allowing constant time length lookup.
- Length prefixed strings make it more difficult to cause buffer overrun errors.
- Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory, making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
- Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc.) uses length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
- C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
- Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

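Several of these things have come to light more recently than C, so it would make sense for C not to have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?

EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things: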

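- Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).
- Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
- Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.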
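To make the complexity claim concrete, here is a minimal sketch in C. The `struct lstr` type and the functions `sz_append` and `lstr_append` are hypothetical names invented for this illustration, not part of any standard library, and the sketch assumes the caller owns a fixed-capacity destination buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string: `len` bytes used out of `cap`. */
struct lstr {
    size_t len;
    size_t cap;
    char  *data;
};

/* NUL-terminated append: must first walk dst to find its end, O(n),
 * then copy src, O(m). Returns 0 on success, -1 if it would not fit. */
int sz_append(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(dst);      /* O(n) scan for the terminator */
    size_t m = strlen(src);      /* O(m) scan */
    if (n + m + 1 > cap)
        return -1;
    memcpy(dst + n, src, m + 1); /* O(m) copy, including the NUL */
    return 0;
}

/* Length-known append: the end of dst is at data + len, so only the
 * O(m) copy remains. The caller supplies the source length m. */
int lstr_append(struct lstr *dst, const char *src, size_t m)
{
    assert(dst != NULL && dst->len <= dst->cap);
    if (m > dst->cap - dst->len)
        return -1;
    memcpy(dst->data + dst->len, src, m);
    dst->len += m;
    return 0;
}
```

The only difference between the two is the `strlen(dst)` call: with a stored length, the scan over the destination disappears.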

From the answers below, these are some cases where null terminated strings are more efficient:

- When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
- In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).

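None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

- You need to cut off the end of the string

but this one is incorrect: it takes the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be; length prefixed strings just subtract from the prefix.)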
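The claim that truncation costs the same either way can be sketched as follows. This is only an illustration: `struct lview`, `sz_truncate` and `lview_truncate` are hypothetical names, and the stored length is kept in a separate struct here rather than in a true in-memory prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying view of a character buffer. */
struct lview {
    size_t len;
    char  *data;
};

/* NUL-terminated: write a terminator at the new end. O(1) given the index. */
void sz_truncate(char *s, size_t new_len)
{
    assert(new_len <= strlen(s)); /* sanity check only; this scan is O(n) */
    s[new_len] = '\0';
}

/* Length-carrying: just shrink the stored length. O(1). */
void lview_truncate(struct lview *s, size_t new_len)
{
    assert(new_len <= s->len);
    s->len = new_len;
}
```

In both cases the work is a single store once the new end position is known.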


Current answer

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).
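As a rough illustration of treating the tail of a string as a string, here is a sketch of the two operations named above. The function names come from the answer, but the bodies are my own assumption of what they might look like; the sketch assumes '/' is the only path separator and uses only the standard `strrchr` and `strlen`.

```c
#include <assert.h>
#include <string.h>

/* Returns a pointer to the final path component, without copying.
 * Works because any suffix of a NUL-terminated string is itself a
 * NUL-terminated string. Assumes '/' is the only separator. */
const char *path_to_filename(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Returns a pointer to the extension (after the last '.'), or to the
 * terminating NUL (an empty string) if there is no extension. */
const char *filename_to_extension(const char *name)
{
    assert(name != NULL);
    const char *dot = strrchr(name, '.');
    return dot ? dot + 1 : name + strlen(name);
}
```

Neither function allocates or can fail; the result simply points into the caller's original buffer and is valid for as long as that buffer is.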

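C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.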

Other answers

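Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.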

void add_element_to_next(arr, offset)
  char arr[];
  int offset;
{
  arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

versus

void add_element_to_next(p)
  char *p;
{
  p[0]+=p[1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(array+i);
}

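the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.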

While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.

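P.S.--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]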

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;

  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length; // Room left in destination
    if (s.si_length < copy_length)
      copy_length = s.si_length; // Source fits entirely; copy all of it
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

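A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.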

/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

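Note that the lifetime of the string returned by temp_substring would be limited by those of inf and src, whichever was shorter (which is why the method requires inf to be passed in--if it were local to temp_substring, it would die when that method returned).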

In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).

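Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.

One advantage of NUL-termination over length-prefixing, which I have not seen anyone mention, is the simplicity of string comparison. Consider the comparison standard which returns a signed result for less-than, equal, or greater-than. For length-prefixing the algorithm has to be something along the following lines: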

1. Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3).
2. Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length.
3. When the smaller length is reached, one string is a prefix of the other. Return negative or positive value according to which is shorter, or zero if of equal length.

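Contrast this with the NUL-termination algorithm:

1. Scan the two character sequences, subtracting characters at matching indices (note that this is better handled with moving pointers). Stop when the difference is nonzero, returning the difference. NOTE: If one string is a PROPER prefix of the other, one of the characters in the subtraction will be NUL, i.e. zero, and the comparison will naturally stop there.
2. If the difference is zero, -only then- check if either character is NUL. If so, return zero, otherwise continue to the next character.

The NUL-terminated case is simpler and very easy to implement efficiently with a dual pointer scan. The length-prefixed case does at least as much work, nearly always more. If your algorithm has to do a lot of string comparisons [e.g. a compiler!], the NUL-terminated case wins out. Nowadays that might not be as significant, but back in the day, heck yeah.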
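The two algorithms described above might be sketched in C like this. The names `lp_compare` and `sz_compare` are hypothetical, and the length-prefixed variant takes the lengths as explicit arguments rather than reading them from an in-memory prefix, which is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Length-prefixed style: lengths are supplied alongside the bytes. */
int lp_compare(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;   /* step 1: smaller length */
    for (size_t i = 0; i < n; i++) {        /* step 2: bounded scan */
        int d = (int)a[i] - (int)b[i];
        if (d != 0)
            return d;
    }
    /* step 3: one string is a prefix of the other */
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}

/* NUL-terminated style: dual pointer scan with no counter. */
int sz_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    for (;; p++, q++) {
        int d = (int)*p - (int)*q;
        if (d != 0)
            return d;      /* also covers the proper-prefix case */
        if (*p == 0)
            return 0;      /* both hit NUL together: equal */
    }
}
```

The NUL-terminated loop carries two pointers and nothing else, while the length-based loop also tracks an index and needs the extra tie-break after the scan.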

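C doesn't have strings as part of the language. A "string" in C is just a pointer to char. So maybe you're asking the wrong question.

"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object-oriented language and only has basic value types. A string is a higher-level concept that has to be implemented by combining values of other types in some way. C is at a lower level of abstraction.

In light of the raging squall below:

I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of today's computers? Probably not. But hindsight is always 20/20 and all that.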

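Laziness, register frugality and portability, considering the assembly guts of any language, especially C, which is one step above assembly (thus inheriting a lot of assembly legacy code). You would agree that a null char was otherwise useless in those ASCII days, so it was probably as good a terminator as an EOF control char.

Let's see it in pseudocode. Case 1: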

function readString(string) // 1 parameter: 1 register or 1 stack entry
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

Total: 1 register used

Case 2

 function readString(length,string) // 2 parameters: 2 registers used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

Total: 2 registers used
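For readers who prefer C to pseudocode, the two cases might look roughly like this. The functions `sum_sz` and `sum_counted` are hypothetical, and summing the bytes merely stands in for the pseudocode's `read`. How many registers a compiler actually uses depends on the target and the optimizer, so treat the comments as the argument's intent rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Case 1: sentinel-terminated. Only the pointer drives the loop. */
unsigned sum_sz(const char *s)
{
    assert(s != NULL);
    unsigned total = 0;
    while (*s != '\0')
        total += (unsigned char)*s++;
    return total;
}

/* Case 2: counted. The pointer and the remaining length both stay live. */
unsigned sum_counted(const char *s, size_t length)
{
    unsigned total = 0;
    while (length > 0) {
        total += (unsigned char)*s++;
        length--;
    }
    return total;
}
```

Note that the counted version can walk over embedded zero bytes, which the sentinel version cannot.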

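This may seem shortsighted now, but consider the frugality in code and registers, which were at a PREMIUM at that time (the time when, you know, they used punch cards). Being faster as well (when processor speed could be counted in kHz), this "hack" was pretty darn good and easily portable to register-less processors.

For argument's sake I will implement 2 common string operations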

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

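Complexity O(n), whereas in most cases a PASCAL string is O(1) because the length of the string is prepended to the string structure (which also means that the counting has to be carried out at an earlier stage).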

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(length1+length2+1) // +1 for the terminating CONTROL_CHAR
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer2
     string3[pointer3]=CONTROL_CHAR // terminate the result
     return string3

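Complexity O(n), and prepending the string length wouldn't change the complexity of the operation, though I admit it would take about 3 times less time.

On the other hand, if you use PASCAL strings you would have to redesign your API to take into account register length and bit-endianness. PASCAL strings have the well-known limitation of 255 chars (0xFF) because the length was stored in 1 byte (8 bits), and if you wanted a longer string (16 bits -> anything) you would have to take the architecture into account in one layer of your code, which would in most cases mean incompatible string APIs.

Example: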

- One file was written with your prepended-length string API on an 8-bit computer and then has to be read on, say, a 32-bit computer. What would the lazy program do? It would assume that the first 4 bytes are the length of the string, allocate that much memory, and then attempt to read that many bytes.
- Another case would be a 32-bit length written on a PPC (big endian) and read on an x86 (little endian). Of course, if you don't know that one was written by the other, there would be trouble: a length of 1 (0x00000001) would become 16777216 (0x01000000), that is 16 MB for reading a 1 byte string. Of course you would say that people should agree on one standard, but even 16-bit Unicode got little and big endianness.

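Of course C would have its issues too, but it would be very little affected by the issues raised here.

The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but mostly exposes benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider drawbacks of LPS and advantages of SZ.

As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings?".

Advantages of Zero Terminated Strings (as I see them):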

- Very simple: no need to introduce new concepts in the language; char arrays/char pointers can do.
- The core language just includes minimal syntactic sugar to convert something between double quotes to a bunch of chars (really a bunch of bytes). In some cases it can be used to initialize things completely unrelated to text. For instance, the XPM image file format is a valid C source that contains image data encoded as a string.
- By the way, you can put a zero in a string literal; the compiler will just also add another one at the end of the literal: "this\0is\0valid\0C". Is it a string? Or four strings? Or a bunch of bytes...
- Flat implementation: no hidden indirection, no hidden integer.
- No hidden memory allocation involved (well, some infamous non-standard functions like strdup perform allocation, but that's mostly a source of problems).
- No specific issue for small or large hardware (imagine the burden of managing a 32-bit length prefix on 8-bit microcontrollers, or the restriction of limiting string size to less than 256 bytes, a problem I actually had with Turbo Pascal eons ago).
- Implementation of string manipulation is just a handful of very simple library functions.
- Efficient for the main use of strings: constant text read sequentially from a known start (mostly messages to the user).
- The terminating zero is not even mandatory; all necessary tools to manipulate chars like a bunch of bytes are available. When performing array initialisation in C, you can even avoid the NUL terminator. Just set the right size. char a[3] = "foo"; is valid C (not C++) and won't put a final zero in a.
- Coherent with the Unix point of view "everything is a file", including "files" that have no intrinsic length like stdin and stdout. You should remember that the open, read and write primitives are implemented at a very low level. They are not library calls, but system calls. And the same API is used for binary or text files. File reading primitives get a buffer address and a size and return the new size. And you can use strings as the buffer to write. Using another kind of string representation would imply you can't easily use a literal string as the buffer to output, or you would have to make it behave very strangely when cast to char*: namely, not returning the address of the string, but instead returning the actual data.
- Very easy to manipulate text data read from a file in place, without useless copying of the buffer: just insert zeroes at the right places (well, not really with modern C, as double-quoted strings are const char arrays nowadays, usually kept in a non-modifiable data segment).
- Prepending some int value of whatever size would imply alignment issues. The initial length should be aligned, but there is no reason to do that for the character data (and again, forcing alignment of strings would imply problems when treating them as a bunch of bytes).
- Length is known at compile time for constant literal strings (sizeof). So why would anyone want to store it in memory, prepending it to the actual data?
- In a way C is doing as (nearly) everyone else: strings are viewed as arrays of char. As array length is not managed by C, it is logical that length is not managed for strings either. The only surprising thing is the 0 item added at the end, but that's just at core language level when typing a string between double quotes. Users can perfectly well call string manipulation functions passing a length, or even use plain memcpy instead. SZ are just a facility. In most other languages array length is managed, so it's logical that the same holds for strings.
- In modern times anyway, 1-byte character sets are not enough and you often have to deal with encoded Unicode strings where the number of characters is very different from the number of bytes. It implies that users will probably want more than "just the size", but also other information. Keeping the length gives us nothing (particularly no natural place to store them) regarding these other useful pieces of information.
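Two of the points above, the literal with embedded zeroes and the array initialised without a terminator, can be checked with a short sketch. The function names here are hypothetical wrappers added only so the values can be inspected; some compilers may warn about the unterminated initialisation even though it is valid C.

```c
#include <assert.h>
#include <string.h>

/* Size in bytes of a literal with embedded NULs: the 15 characters
 * written plus the terminator the compiler appends. */
size_t embedded_literal_size(void)
{
    return sizeof "this\0is\0valid\0C";
}

/* strlen stops at the first embedded NUL, so it sees only "this". */
size_t embedded_literal_strlen(void)
{
    return strlen("this\0is\0valid\0C");
}

/* Copies a 3-byte array initialised without a terminator into out. */
void unterminated_array(char out[3])
{
    char a[3] = "foo";   /* valid C, not C++: no room for a NUL, none stored */
    assert(sizeof a == 3);
    memcpy(out, a, sizeof a);
}
```

Because `a` has no terminator, it must only be handled with length-taking functions such as `memcpy` or `memcmp`, never `strlen` or `printf("%s")`.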

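That said, there is no need to complain in the rare cases where standard C strings are indeed inefficient. Libraries are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem as there are libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring? Or even C++ strings?

EDIT: I recently had a look at D strings. It is interesting to see that the solution chosen is neither a size prefix nor zero termination. As in C, literal strings enclosed in double quotes are just shorthand for immutable char arrays, and the language also has a string keyword meaning exactly that (immutable char array).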

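But D arrays are much richer than C arrays. In the case of static arrays the length is known at compile time, so there is no need to store it at run time; the compiler already has it. In the case of dynamic arrays, the length is available but the D documentation does not state where it is kept. For all we know, the compiler could choose to keep it in some register, or in some variable stored far away from the character data.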

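On normal char arrays or non-literal strings there is no final zero, hence the programmer has to add it himself if he wants to call some C function from D. In the particular case of literal strings, however, the D compiler still puts a zero at the end of each string (to allow easy casting to C strings and make calling C functions easier?), but this zero is not part of the string (D does not count it in the string size).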

The only thing that disappointed me somewhat is that strings are supposed to be UTF-8, but length apparently still returns a number of bytes (at least it's true on my compiler, gdc) even when using multi-byte chars. It is unclear to me whether it's a compiler bug or on purpose. (OK, I have probably found out what happened. To tell the D compiler that your source uses UTF-8 you have to put some stupid byte order mark at the beginning. I write stupid because I know of no editor doing that, especially for UTF-8, which is supposed to be ASCII compatible.)