尽管我很喜欢C和c++,但我还是忍不住对空结尾字符串的选择抓耳挠脑:

Length prefixed (i.e. Pascal) strings existed before C Length prefixed strings make several algorithms faster by allowing constant time length lookup. Length prefixed strings make it more difficult to cause buffer overrun errors. Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here. Pretty much every other language (i.e. Perl, Pascal, Python, Java, C#, etc) use length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings. C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation. Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

其中一些东西比C语言出现得更晚,所以C语言不知道它们是有道理的。然而,在C语言出现之前,有些语言就已经很简单了。为什么会选择空终止字符串,而不是明显更好的长度前缀?

编辑:因为有些人问了关于我上面提到的效率点的事实(他们不喜欢我已经提供的事实),他们源于以下几点:

使用空结尾字符串的Concat需要O(n + m)时间复杂度。长度前缀通常只需要O(m)。 使用空结尾字符串的长度需要O(n)时间复杂度。长度前缀为O(1)。 Length和concat是迄今为止最常见的字符串操作。在一些情况下,以空结尾的字符串可能更有效,但这种情况发生的频率要低得多。

从下面的答案,这些是一些情况下,空终止字符串更有效:

When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules. In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).

上面这些词都没有length和concat常见。

下面的答案中还有一个断言:

你需要把绳子的一端剪掉

但这个是不正确的——对于以null结尾的字符串和有长度前缀的字符串,它的时间是相同的。(以Null结尾的字符串只是在你想要的新结尾的地方插入一个Null,长度前缀只是从前缀中减去。)


当前回答

围绕C语言的许多设计决策都源于这样一个事实:在最初实现C语言时,参数传递的代价有些昂贵。如果在两者之间作选择。

void add_element_to_next(arr, offset)
  char[] arr;
  int offset;
{
  arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

void add_element_to_next(ptr)
  char *p;
{
  p[0]+=p[1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(arr+i);
}

后者会稍微便宜一点(因此是首选),因为它只需要传递一个参数而不是两个。如果被调用的方法不需要知道数组的基址,也不需要知道其中的索引,那么将这两个值组合在一起传递一个指针比分别传递值要便宜。

While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.

PS——为了在某些操作上稍微降低速度,以及在较长的字符串上稍微增加一点额外开销,可以让处理字符串的方法直接接受指向字符串的指针、经过边界检查的字符串缓冲区或标识另一个字符串的子字符串的数据结构。像“strcat”这样的函数看起来像[现代语法]

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;

  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length;
    if (s.src_length < copy_length)
      copy_length = s.src_length;
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

比K&R strcat方法大一点,但它支持边界检查,而K&R方法不支持。此外,与当前的方法不同,它可以轻松地连接任意子字符串,例如。

/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO *inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

注意,由temp_substring返回的字符串的生命周期将受到s和src的生命周期的限制,后者更短(这就是为什么该方法需要传入inf——如果它是本地的,它将在方法返回时死亡)。

In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).

当然,K&R并没有实现任何这样的东西,但这很可能是因为他们不想在字符串处理上花费太多精力——即使在今天,许多语言在这方面似乎都相当薄弱。

其他回答

不知怎的,我把这个问题理解为C中没有编译器支持以长度为前缀的字符串。下面的例子显示,至少你可以开始你自己的C字符串库,其中字符串长度在编译时计算,使用这样的构造:

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

然而,这不会带来任何问题,因为你需要小心什么时候特别释放字符串指针,什么时候它是静态分配的(字面字符数组)。

编辑:作为对这个问题更直接的回答,我的观点是,这是C既可以支持可用的字符串长度(作为编译时间常数)的方式,如果你需要它,但如果你只想使用指针和零终止,仍然没有内存开销。

当然,使用以零结尾的字符串似乎是推荐的做法,因为标准库一般不接受字符串长度作为参数,而且提取长度的代码不像char * s = "abc"那样简单,正如我的示例所示。

我认为,这是有历史原因的,我在维基百科上找到了这个:

At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.

还有一点没有提到:当C语言被设计出来的时候,有很多机器的“char”不是8位的(即使是今天的DSP平台也不是8位的)。如果一个人决定字符串是长度前缀,应该使用多少'char'的长度前缀?使用two会人为地限制具有8位字符和32位寻址空间的机器的字符串长度,而在具有16位字符和16位寻址空间的机器上浪费空间。

If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8-bits, one could--for some expense in speed and code size--define a scheme were a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc. and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.

即使在32位机器上,如果允许字符串的大小与可用内存相同,带前缀的长度字符串也只比以空结尾的字符串宽3个字节。

首先,对于短字符串来说,额外的3个字节可能是相当大的开销。具体来说,零长度字符串现在占用的内存是原来的4倍。我们中的一些人正在使用64位机器,因此我们要么需要8个字节来存储零长度的字符串,要么字符串格式无法处理平台支持的最长字符串。

可能还需要处理对齐问题。假设我有一个包含7个字符串的内存块,比如“solo\0second\0\0four\0five\0\0seventh”。第二个字符串从偏移量5开始。硬件可能要求32位整数以4的倍数的地址对齐,因此您必须添加填充,从而进一步增加开销。相比之下,C表示非常节省内存。(内存效率很好;例如,它有助于缓存性能。)

我不相信“C没有字符串”的答案。没错,C语言不支持内置的高级类型,但你仍然可以用C语言表示数据结构,这就是字符串。在C语言中,字符串只是一个指针,但这并不意味着前N个字节不能作为长度具有特殊意义。

Windows/COM开发人员将非常熟悉BSTR类型,它就像这样——一个有长度前缀的C字符串,其中实际的字符数据不是从字节0开始的。

因此,使用空终止符的决定似乎只是人们喜欢的,而不是语言的必要。