
The Most Expensive One-Byte Mistake

Soulskill posted more than 2 years ago | from the catchy-but-totally-misleading-internet-headline dept.

Programming 594

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"


The Road Not Taken (5, Insightful)

symbolset (646467) | more than 2 years ago | (#36968460)

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.

- Robert Frost, 1920

This was modded offtopic (2, Insightful)

symbolset (646467) | more than 2 years ago | (#36968564)

Slashdot is lost.

Re:This was modded offtopic (1)

BenJCarter (902199) | more than 2 years ago | (#36968670)

Amen.

Re:This was modded offtopic (0)

Crimson Wing (980223) | more than 2 years ago | (#36968754)

Fixed. It's since been modded +5 Insightful. :)

Maybe there's hope yet.

Re:This was modded offtopic (-1, Offtopic)

symbolset (646467) | more than 2 years ago | (#36968908)

I hope that lost soul loses his modpoints for a good long time. It's not a trivial thing to search 100 years of art and find a text so relevant and poignant to put at the top of the comment tree in the time provided. If I have to praise myself, that was well done.

Re:The Road Not Taken (3, Interesting)

IICV (652597) | more than 2 years ago | (#36968880)

Everyone misunderstands that poem.

Robert Frost had a fairly depressing outlook on life, and the point of the poem is that it doesn't matter what road you take.

I mean, just pay attention to the narrative tense in the last stanza, the one people take to be so life-affirming and "do something different!". The narrator isn't saying "I did this, and I know it was important"; he's saying "I did this, and I think that in the future I'm going to tell people it was important".

The narrator is a vain, shallow individual who frets about insignificant decisions like this, thinking that they will have some gigantic impact on his life, and then later on blows those choices up to be of earthshattering proportions. This is all despite the fact that half the poem is about how the roads are effectively identical; and in the end, he doesn't even tell us what was important about the path he took, just that it was the "one less traveled by" (which makes no sense! They were "just as fair", they had been "worn ... really about the same", they "both that morning equally lay".)

Basically, if we apply this poem to the current situation, what it's saying is that in alternate 2011 we'd have an article about how null-terminated strings would have been better than Pascal strings. It doesn't matter what path you take, if you're the right kind of person you'll always blow up the significance of it in your mind later.

Re:The Road Not Taken (1)

jhoegl (638955) | more than 2 years ago | (#36968946)

Perhaps it means that regardless of which path you take, the one you do take (the decision you make) should be analyzed and reflected upon to verify it is in line with what you wish to accomplish.
Of course, I could continue and counter your "vain and shallow" remarks, and note how they reflect upon a person taking a literal interpretation of a poem and scrutinizing it without the author being able to respond.
But I digress.

Re:The Road Not Taken (2, Informative)

j. andrew rogers (774820) | more than 2 years ago | (#36968914)

As a nitpick, this poem is not from 1920. I have an original copy that was inscribed by the owner in 1919.

According to Wikipedia, the poem was originally published in 1916; the 1920 version was a second edition.

Why do I need a subject? (-1)

Anonymous Coward | more than 2 years ago | (#36968478)

Slashdot has become reddit on a 12 hour delay.

Missed the point (5, Informative)

mgiuca (1040724) | more than 2 years ago | (#36968492)

Interesting, but I think this article largely misses the point.

Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)

Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.
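
The gets problem in miniature (a minimal sketch; fgets is the standard fix, and the point holds under any string representation):

#include <stdio.h>

int main(void)
{
    char buf[16];

    /* gets(buf) cannot be told how big buf is: any input line longer
     * than 15 characters overflows the buffer, length field or NUL
     * terminator notwithstanding. It was removed entirely in C11. */
    /* gets(buf); */

    /* fgets takes the buffer size and truncates instead of overflowing. */
    if (fgets(buf, sizeof buf, stdin))
        printf("read: %s", buf);
    return 0;
}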

What the article missed was the real security problems caused by the NUL terminator. The obvious one: if you forget to NUL-terminate a string, anything that traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky, subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL terminator problem" with "buffer overflow".
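
To make that failure mode concrete, here is a minimal sketch of the bug class (the function and the check are hypothetical; the actual library code isn't public):

#include <stdio.h>
#include <string.h>

/* Hypothetical certificate check: the CN field is attacker-controlled
 * binary data, but strcmp() stops at the first NUL byte it sees. */
static int hostname_matches(const char *visited, const char *cert_cn)
{
    return strcmp(visited, cert_cn) == 0;  /* ignores bytes after '\0' */
}

int main(void)
{
    /* A CN of "google.com\0.malicioushacker.com": 31 bytes of data. */
    const char cn[] = "google.com\0.malicioushacker.com";
    size_t cn_len = sizeof cn - 1;         /* 31; strlen(cn) is only 10 */

    printf("strcmp match: %d\n", hostname_matches("google.com", cn)); /* 1 */

    /* A length-aware comparison sees all 31 bytes and refuses. */
    int safe = cn_len == strlen("google.com") &&
               memcmp(cn, "google.com", cn_len) == 0;
    printf("length-aware match: %d\n", safe);                         /* 0 */
    return 0;
}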

Re:Missed the point (5, Informative)

Anonymous Coward | more than 2 years ago | (#36968578)

Re:Missed the point (1)

mgiuca (1040724) | more than 2 years ago | (#36968584)

Thanks! +1

Re:Missed the point (1)

MrEricSir (398214) | more than 2 years ago | (#36968590)

"...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

Compatible with what? Seems to me they could have just used a continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

Re:Missed the point (2)

snowgirl (978879) | more than 2 years ago | (#36968614)

"...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

Compatible with what? Seems to me they could have just used a continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

This would still make the strings incompatible, because you would only have a 127-byte string length before the "continuation bit" comes into play and you need to switch to a 15-bit string length. All the previous code written with longer-than-127-byte strings would be incompatible.

Re:Missed the point (2)

MrEricSir (398214) | more than 2 years ago | (#36968902)

If we were to switch now, is that the compatibility you're referring to? Well sure.

But nobody's talking about switching now, the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

Re:Missed the point (2)

mgiuca (1040724) | more than 2 years ago | (#36968642)

They could have but they didn't (e.g., in Pascal, where strings actually are limited to 255 bytes). So, history has made some worse string representations than C.

Re:Missed the point (5, Informative)

dbc (135354) | more than 2 years ago | (#36968692)

Oh, Lordy, if you had ever programmed in a language with a 255-character limit for strings you would praise $DEITY every time you use a C string. Dealing with length-limited strings is the biggest PITA of any senseless and time-wasting programming task.

Suppose C had originally had a length for strings? The only thing that makes sense is for the string length count to be the same size as a pointer, so that it could effectively be all of memory. A long is, by C language definition, large enough to hold a pointer that has been cast into it. So string length computations all become longs. Not such a big deal for most of life... until.... 64 bit addressing. Then all sorts of string breakage occurs.

The bottom line is that in an application programming language strings need to be atomic, as they are in Python. You just should not care how strings are implemented, and you should never worry about a length limit. The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation. If you chose to use a systems programming language for application programming, well, then it sucks to be you. So why did we do that for so long? Because all the other alternatives were worse.

Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

Re:Missed the point (3, Interesting)

snowgirl (978879) | more than 2 years ago | (#36968596)

Not to mention the argument for "because space was at a premium" is specious: either you had an 8-bit length prepended to the string, or you had an 8-bit special value appended to the end of the string. Both ways result in the same space usage.

From what I read in the summary, (didn't read TFA) this whole thing sounds like a propaganda piece supporting the idea that we should use length+string, by presenting it as "this should have been a no-brainer but the idiots making C screwed up."

As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

Re:Missed the point (4, Informative)

snowgirl (978879) | more than 2 years ago | (#36968630)

I'm correcting myself here... apparently they weren't considering going with a 255-byte limit, but a 65535-byte limit, which would have increased the size overhead by one.

Re:Missed the point (3, Informative)

arth1 (260657) | more than 2 years ago | (#36968804)

That's still an arbitrary limit.

The advantages that I see for counted length are:
- It makes copying easier - you know beforehand how much space to allocate, and how much to copy.
- It makes certain cases of strcmp() faster - if the lengths don't match, you can assume the strings are different (sketched below).
- It makes reverse searches faster.
- You can put binary data in a string.
But that must be weighed against the disadvantages, like not being able to take advantage of the CPU's zero-test conditions, instead having to maintain a counter that eats up a valuable register. Or having to convert text blocks to print them. Or not being well suited to piped text or multiple threads: with NUL termination you can just spew the text into an already-nulled area and it will be valid as it comes in, whereas with a count you have to update a length counter for every byte you make available. And... and...
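
For the strcmp() case, a minimal sketch of that fast path (the counted-string type here is hypothetical):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A hypothetical counted string: length up front, no terminator. */
struct cstring {
    size_t      len;
    const char *data;
};

/* Equality: differing lengths short-circuit without touching a single
 * byte; only equal-length strings pay for the memcmp(). */
static bool cstring_equal(struct cstring a, struct cstring b)
{
    if (a.len != b.len)
        return false;          /* the "free" fast path */
    return memcmp(a.data, b.data, a.len) == 0;
}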

Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option. Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

All in all, having worked with both systems, I find more advantages with null-termination.

There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text. Some text processors have used this as a native string format, due to the specific advantages.

I'd still take NULL-terminated for most purposes.

Re:Missed the point (2)

mgiuca (1040724) | more than 2 years ago | (#36968664)

Good point.

As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

Yes, but perhaps the simplicity was partly why it caught on. The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been. Even if C had made a decision on all of those, how many implementations would have gotten it wrong?

Not just implementations, but individual programs. Assume a hypothetical universe in which C doesn't use NUL-terminated strings but is still a low-level, unsafe language in general: how would this have been any different? Unlike in C++ or Java, in C programs manually construct strings. So we wouldn't have people forgetting to NUL-terminate strings; we would instead have people forgetting to set the length field, or setting the wrong length, or being given a 257-byte string and writing a "1" in the length field due to wraparound (granted, that wouldn't often be a security risk, just a bad result). If they had decided to use a variable-length length field, people would have found some way to screw that up. I'm sure hackers would have found a way to inject a long length into a short string and thus read past the end.

At the end of the day, the problem is that C lets programmers do whatever they want with memory, not the NUL terminator. And you can't really say "they should have designed it better," because it is rather the point of C that it lets you do this.

Re:Missed the point (0)

Anonymous Coward | more than 2 years ago | (#36968832)

The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been.

What if they had chosen 255 as a string terminator?

Re:Missed the point (1)

mini me (132455) | more than 2 years ago | (#36968930)

The C string has its place, but what I never understood is why the C standard library hasn't also included a string type. Something like the following with all of the accompanying bound checking functions to go along with it.


struct string
{
    size_t length;
    char *buffer;
};

There are several third party libraries that do just that, but it seems like something worthy of being there out of the box.
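
For illustration, a minimal sketch of one such bounds-checked function, assuming the struct string above plus a caller-supplied buffer capacity (the name and contract are hypothetical):

#include <string.h>

/* Append src onto dst, but only if it fits within the dst_cap bytes
 * allocated for dst->buffer. Returns 0 on success, -1 on would-be
 * overflow. Assumes the invariant dst->length <= dst_cap holds. */
int string_append(struct string *dst, size_t dst_cap,
                  const struct string *src)
{
    if (src->length > dst_cap - dst->length)
        return -1;
    memcpy(dst->buffer + dst->length, src->buffer, src->length);
    dst->length += src->length;
    return 0;
}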

Re:Missed the point (2)

mgiuca (1040724) | more than 2 years ago | (#36968974)

Well C++ includes a class that is pretty much exactly what you ask for. It wouldn't make sense for C to include that, as the whole point is that C gives you the ability to manipulate data however you want. If C included that, it would be criticised for having two incompatible string types. If it only included that, it would be criticised for not being low-level enough (the programmer is forced to call all these inefficient string manipulation functions that do bounds checking).

You might ask why C doesn't include closures and list comprehensions: if you want high-level language features, then C isn't the language for you.

Re:Missed the point (0)

Anonymous Coward | more than 2 years ago | (#36968600)

Null termination sounds lovely when you're a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class. You shouldn't need to traverse a string to determine its length.

I'd agree that C's elegance stems partially from pointers, meaning address+length strings must be implemented higher up, meaning C++. Oy!

As an assembly language programmer I resent that (1)

perpenso (1613749) | more than 2 years ago | (#36968792)

Null termination sounds lovely when you're a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class.

I spent my formative programming years primarily writing code in assembly and I resent that statement. :-) Runtime is always in one's mind and optimizing for speed is the desired goal. Optimizing for size is something that is merely forced upon us by circumstances beyond our control. No true assembly programmer, nor any true Scotsman, would prioritize size over speed if avoidable.

Re:Missed the point (2)

e9th (652576) | more than 2 years ago | (#36968618)

My personal fave is strncpy(), which will silently not terminate the string if the buffer is too small, but if you give it a huge buffer it punishes you by NUL padding the string all the way to the end of the buffer.
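
Both behaviors are easy to demonstrate (a minimal illustration; the buffer sizes are arbitrary):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Too small: strncpy() copies 4 bytes and never writes a NUL,
     * so small is {'h','e','l','l'} and not a valid C string. */
    char small[4];
    strncpy(small, "hello", sizeof small);

    /* Too big: strncpy() copies 6 bytes, then NUL-pads the other 58. */
    char big[64];
    strncpy(big, "hello", sizeof big);

    /* The usual idiom papers over the first case by force-terminating,
     * trading the missing NUL for silent truncation. */
    small[sizeof small - 1] = '\0';
    printf("%s %s\n", small, big);   /* "hel hello" */
    return 0;
}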

Re:Missed the point (1)

dirtyhippie (259852) | more than 2 years ago | (#36968638)

What is so undesirable about making a string larger than a pointer?

Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve. Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

And no, the article didn't miss the "real security problems" caused by null termination. Where did you stop reading?

Re:Missed the point (1)

Rakishi (759894) | more than 2 years ago | (#36968734)

Fail, just fail.

Also, have a look at how mysql deals with varchars. There is no 255 byte limit

Before Mysql 5.0.3 the limit was 255 and 65535 afterward.

when length exceeds that value, you just go to 2 bytes of length, etc.

It does this because each column defines the maximum length for the varchar, and the number of bytes used for the length is fixed per column. This, however, is also overhead: the size of the length field needs to be stored for each variable. In C this means each variable now carries even more overhead (the actual amount depending on how you encode such information).

Re:Missed the point (1)

mgiuca (1040724) | more than 2 years ago | (#36968742)

Note that my post was not necessarily saying that NUL was the right decision. Just that it isn't a no-brainer -- going the other route has a lot of complications.

What is so undesirable about making a string larger than a pointer?

It would mean that the C library would need to declare a "string" struct instead of using char*. Now rather than passing a char* as an argument, you would have to decide whether it's worth passing the two word "string" struct, or a string* pointer (allowing it to fit into a register). It makes things more complicated.

Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve.

No, it isn't too hard to solve. But it is non-trivial. Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion. Remember that in C, programmers fabricate their own strings (there is a minimal string library, but often you will see people just allocating memory for strings, populating them, and storing a '\0' on the end). If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong. The same applies to NUL.

Here's a question: How much memory do you allocate for a string of N bytes? The NUL-termination answer: N + 1. The answer for your mysql variable-length length scheme: N + (N < 128 ? 1 : N < 16384 ? 2 : N < 2097152 ? 3 : .....) -- yes there is a correct answer, but it is much more complicated for the everyday programmer to deal with.
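
That ternary chain as a helper, for concreteness (a sketch of the hypothetical variable-length length scheme; nothing like it exists in libc):

#include <stddef.h>

/* Bytes needed to store n in a UTF-8-style length field carrying
 * 7 payload bits plus a continuation bit per byte. */
static size_t length_field_size(size_t n)
{
    size_t bytes = 1;
    while (n >= 128) {   /* another continuation byte is needed */
        n >>= 7;
        bytes++;
    }
    return bytes;
}

/* Allocation size for an n-byte string under each scheme. */
#define NUL_ALLOC(n)    ((n) + 1)
#define VARINT_ALLOC(n) ((n) + length_field_size(n))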

Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

I think the state of programming is so bad now that people wouldn't test it. A major security bug [openwall.com] in Blowfish was found just last month, caused precisely by a signed/unsigned char mismatch.

Where did you stop reading?

The only security issues mentioned were buffer overruns, with gets taking most of the blame. As I said above, only some NUL errors are buffer overruns and only some buffer overruns are NUL errors, and gets errors have nothing to do with NUL.

Re:Missed the point (1)

dirtyhippie (259852) | more than 2 years ago | (#36968808)

Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion.

Is it, or are you just used to dealing with NUL-terminated strings?

If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong.

That's what libraries are for :-)

Here's a question: How much memory do you allocate for a string of N bytes? The NUL-termination answer: N + 1. The answer for your mysql variable-length length scheme: N + (N < 128 ? 1 : N < 16384 ? 2 : N < 2097152 ? 3 : .....) -- yes there is a correct answer, but it is much more complicated for the everyday programmer to deal with.

Again I say: libc.

I think the state of programming is so bad now that people wouldn't test it. A major security bug [openwall.com] in Blowfish was found just last month, caused precisely by a signed/unsigned char mismatch.

Heh, a fair point. But if string handling is done in a library by the developer of the OS, and they don't get it right, nobody's going to buy their OS. "Joe average" programmer doesn't have to do it at all; they just call the moral equivalent of strlen(), strdup(), strchr(), strpbrk(), etc.

Re:Missed the point (2)

mgiuca (1040724) | more than 2 years ago | (#36968916)

Is it, or are you just used to dealing with NUL-terminated strings?

Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.

That's what libraries are for :-)

Well, let's assume a hypothetical universe in which C is still exactly the same C, only with length-delimited strings (still the same level of safety, still malloc and free, still pretty much the same library, only the string functions are implemented differently, etc). Could you write a library that abstracts over the string representation without ever requiring the user to manually read or write the string? I think if you did that (and certainly, C++ does that), you would have a much higher-level library. That isn't what C is good for. C is for when you need low-level access to the underlying representation.

The beauty of using C (and there aren't many) is that you can write your own efficient string manipulation code. For example, if you know you are going to concatenate three strings, you can allocate enough space for all three, then manually copy the bytes over and seal it with a NUL. In C++, you would probably have a stringstream and push each of the strings onto the end, but it would mean the library is internally adjusting lengths and so on -- the programmer can't make the code do exactly what he asks; there is a layer of abstraction. So you could change C's string representation and then provide a high-level API for manipulating it, but someone is going to get pissed off that the library doesn't do exactly what he wants, and dive down and do it himself. It would be very un-C-like to provide that API.
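
That concatenation, sketched (the helper name is mine):

#include <stdlib.h>
#include <string.h>

/* Concatenate three strings with exactly one allocation and one pass of
 * copying. Caller frees the result; returns NULL on allocation failure. */
char *concat3(const char *a, const char *b, const char *c)
{
    size_t la = strlen(a), lb = strlen(b), lc = strlen(c);
    char *out = malloc(la + lb + lc + 1);   /* +1 for the final NUL */
    if (out == NULL)
        return NULL;
    memcpy(out,           a, la);
    memcpy(out + la,      b, lb);
    memcpy(out + la + lb, c, lc);
    out[la + lb + lc] = '\0';               /* seal it with a NUL */
    return out;
}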

To put it another way, if you were going to provide a high-level string API for C and tell programmers "never ever manipulate strings on your own; use this library," then you might as well use NUL-terminated strings anyway, since the library will handle it, and programmers will never make a mistake. But again, that would be very un-C-like.

So once again, it comes down to this: NUL-terminated strings aren't the problem with C. C is the problem with C: the fact that it gives programmers a lot of power. You might argue that we should stop using C to write programs that don't need that speed or power. But there's no point arguing that C should have been a higher-level language, because then it wouldn't be C.

Re:Missed the point (0)

Anonymous Coward | more than 2 years ago | (#36968866)

Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

I think the state of programming is so bad now that people wouldn't test it. A major security bug [openwall.com] in Blowfish was found just last month, caused precisely by a signed/unsigned char mismatch.

Another problem with C: it does not natively define an 8-bit datatype, so people use the char type when they need an 8-bit value.

Re:Missed the point (1)

phantomfive (622387) | more than 2 years ago | (#36968750)

And no, the article didn't miss the "real security problems" caused by null termination. Where did you stop reading?

The point was: the security problems the article mentions (buffer overflows/underflows) aren't actually caused by NULL terminated strings, they are caused by buffers that are allocated too small. If the buffer is too small, it won't matter if the string is measured at the beginning or terminated at the end. (It can be fixed by measuring the size of the buffer, but that is a different topic).

However there is a real security problem, as the GP described, although it really was a problem of mixing two standards, instead of a problem with NULL terminations.

The whole issue of which is better is a lot like big-endian or little-endian byte order: there are arguments both ways, but really it doesn't matter all that much.

Re:Missed the point (1)

0123456 (636235) | more than 2 years ago | (#36968874)

There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc.

So you have a 255 byte string. You append one byte to it. What do you do now?

Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

Sure, you can force everyone to use library calls for all their string operations, but C was intended to be cheap, dirty and fast, which is why there is so much direct string access in C code. If you told them they'd have to use library calls they'd just write their own code instead for better performance, and get it wrong anyway.

Re:Missed the point (1)

alta (1263) | more than 2 years ago | (#36968732)

It was all good to the end there, then you started sending me these coded ssl certs, and I think you just hacked my computer. damn you smart people and your buffer overflows.

Re:Missed the point (0)

Anonymous Coward | more than 2 years ago | (#36968796)

it would have had the insane limit of 255-byte strings

I understood the following to mean 2 bytes for the length - 1 byte saved for no magic marker = 1 byte extra. So, 32K or 64K bytes total

Using an address + length format would cost one more byte of overhead than an address + magic_marker format

Re:Missed the point (0)

msobkow (48369) | more than 2 years ago | (#36968828)

The C language is just a meta-assembler for the PDP instruction set that hung around a lot longer than the machine. It's not an abstract language, as anyone who coded for a PDP can tell you.

Poor guy. I guess sooner or later he's going to have to learn how to manage his memory and understand how the underlying physical hardware works. That must be a real toughie for anyone who learned to "program" in the Java/C# world.

I think the bigger point that's missed is that if a size field were used, you'd still have the same buffer overflow problem if someone simply specified a size that didn't match the allocated memory, same as strncpy will happily try to keep writing to a buffer if you give it bad size information. What you really want to do is use a higher level language like C++ with StringBuffer and MemoryBuffer objects that keep track of not only the in-use size, but the allocated size of a buffer.

Oh yeah, those objects do exist. Doh!

Maybe he should RTFM.

Re:Missed the point (2)

mgiuca (1040724) | more than 2 years ago | (#36968950)

I think the bigger point that's missed is that if a size field were used, you'd still have the same buffer overflow problem if someone simply specified a size that didn't match the allocated memory, same as strncpy will happily try to keep writing to a buffer if you give it bad size information.

Exactly. The real problem* is that C lets programmers fabricate data however they want.

*I say "problem" but it really is the whole point of C. It is a dangerous and powerful tool. To make it less dangerous would make it less powerful, and if you wanted such a language, there are plenty available.

Why not both? (0)

Anonymous Coward | more than 2 years ago | (#36968502)

When you look at std::string it uses both, and is better for it; many uses are much easier and faster when we know the length and for others few things beat a null-terminated string.

Re:Why not both? (1)

MrEricSir (398214) | more than 2 years ago | (#36968628)

The wonderfully-named GString in GLib works the same way [gnome.org].

The downside to this approach however is it requires some extra steps when retrieving a string from a C-based API. And of course if the external C-based library has a string handling bug, you're back to square one.

Re:Why not both? (2)

c0lo (1497653) | more than 2 years ago | (#36968886)

I'll argue that it's the correct decision at a level as low as C's.

1. With NULL-terminated strings, there's no distinction (other than in string.h and related libraries) between a char * and any other_type *. Inventing a "string" type in C (not C++) would have made the compiler more complex (see footnote **).
2. Because char * is no different from other_type *, I can pass an address in the middle of a char * string for processing. Not so much for a std::string. How does it matter? Take parsing, for example (the most trivial strtok): not only would one need an extra string-length prefix, you'd also need to keep a separate "curr_pos".

If you have NULL-terminated char * strings, one can invent/use a std::string (or GString, or NSString, or Pascal string). The reverse is not true: with a compiler accepting only Pascal strings, it's not possible to start using the NULL-terminated convention.
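
Point 2 in concrete terms (a trivial sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char path[] = "/usr/local/bin";

    /* With NUL termination, a pointer into the middle of a string is
     * itself a complete, valid string -- no extra bookkeeping. */
    const char *suffix = strchr(path, 'l');  /* points at "local/bin" */
    printf("%s\n", suffix);

    /* A counted (length, data) pair can't do this by moving one
     * pointer: the length would have to be recomputed and carried. */
    return 0;
}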

many uses are much easier and faster when we know the length and for others few things beat a null-terminated string.

While in other cases (when you pass a std::string by-value and invoke the copy constructor, which tends to happen a lot), you have a hefty performance penalty.

Footnote ** - Dennis M. Ritchie on the C history.

C treats strings as arrays of characters conventionally terminated by a marker. [bell-labs.com] Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type.

Hehe, ACM mentions Slashdot (1)

Compaqt (1758360) | more than 2 years ago | (#36968506)

That's the way it happens in Soviet Russia, too.

Seriously, though, it's hard to know what language you as a system administrator should use, other than C, for something like a data logger that has to run continuously (or from cron every minute or so); but then there's the security problem that some user will come up with some weird filename hack to subvert the system.

Re:Hehe, ACM mentions Slashdot (1)

phantomfive (622387) | more than 2 years ago | (#36968756)

Isn't the filename hack a problem for any language? My simple method for avoiding it is putting all filenames in single-quotes, and filtering out any single-quotes from user input.

not just a memory issue (0)

Anonymous Coward | more than 2 years ago | (#36968524)

Doesn't the magic marker method give you string lengths limited only by available memory and not by the size of the piece of memory devoted to length?


Mistake? (0)

Anonymous Coward | more than 2 years ago | (#36968536)

I wouldn't call this a mistake. The paradigm of programming in C is largely based on nuances like this. It makes you write code in a certain way that, in my opinion, is better suited for certain situations. The alternative mentioned in the summary would have made it a bit closer to OO programming as far as strings go, which one can argue would have been better, but I prefer to have differences like this in lower-level languages.

Maybe a better candidate (5, Interesting)

phantomfive (622387) | more than 2 years ago | (#36968546)

C. A. R. Hoare, the inventor of Quicksort, also invented the NULL pointer, something he apologized for [wikipedia.org]:

I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

Re:Maybe a better candidate (0)

AK Marc (707885) | more than 2 years ago | (#36968816)

A project is never finished.

Sadly, I bet you make a mint as a project manager. I've never worked with a competent project manager. Quips like that are directly opposite of all project management best practices, but believed by all project managers I've worked with "in the wild." A project must end or it's a program, not a project. A program never ends. A project must have a defined end before it starts or it is not a project. But being wrong about everything must make one a good project manager, since I've never seen a project manager (usually making three times the next highest paid on their team) that was ever right on anything, ever.

Re:Maybe a better candidate (1)

phantomfive (622387) | more than 2 years ago | (#36968858)

Sadly, I bet you make a mint as a project manager.

lol I hope someday I can find out. I am but a programmer, and I grabbed a quote from fortune because I was tired of my other sig.

since I've never seen a project manager (usually making three times the next highest paid on their team) that was ever right on anything, ever.

Maybe the problem is you suck as a programmer. Just a thought.

Re:Maybe a better candidate (1)

GrandTeddyBearOfDoom (1483117) | more than 2 years ago | (#36968824)

The problem is people programming in low level languages who lack the mental discipline to do so. You may give excuses of time, lack of anyone better, etc. but fundamentally low level programming requires a disciplined and trained mind, and we never gave training such minds the priority it deserved: we just produced programmers the quick and easy way.

Hoare has nothing to apologise for. If NULL references weren't there, we'd be forced to jump backward somersaults through random hoops in order to achieve what they manage, which is to temporarily divorce reference from meaning. This is crucial to human thought and it was correct to have it in C, it just gives undisciplined programmers sufficient rope to very artistically hang themselves. What should have been written was a stdref library that abstracts all functionality of C references besides the NULL pointer and programmers taught that by default. Trying to take NULL out of C would be like trying to take 0 out of mathematics. Have a go, but don't expect things to be anywhere near as elegant.

Re:Maybe a better candidate (1)

RamblinWreck33 (2425478) | more than 2 years ago | (#36968970)

Without null, I can't imagine how so many of today's programming constructs would have been implemented. Most languages have some type of isempty(), which can be seen as a continuation of null, and I for one wouldn't want to implement any sort of list without it. Programming in assembly, you don't get NULLs (at least not in MIPS), and that's one of the difficulties (among many).

"typical and rational IT or CS decision" (0)

NapalmV (1934294) | more than 2 years ago | (#36968550)

They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?

Re:"typical and rational IT or CS decision" (2, Insightful)

perpenso (1613749) | more than 2 years ago | (#36968736)

They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?

Programmer for 20+ years here, BS and MS in CS. I used to share such opinions. Then I went to business school. I really enjoyed business school in part because I was constantly amused by how ignorant and wrong I had been regarding such opinions. May I be bold enough to suggest that the portrayal of MBAs in popular and nerd cultures is about as accurate as the portrayal of programmers in popular and non-nerd cultures.

None of the above should be interpreted to mean that business school makes one appreciate Dilbert any less. Dilbert is actually pretty popular with MBA types and their professors as well.

Re:"typical and rational IT or CS decision" (1)

RyuuzakiTetsuya (195424) | more than 2 years ago | (#36968910)

kind of agree with you and the parent.

I mean, I think what the parent's saying about management is largely true. They're out to save their own asses, but you're right that there's something else going on behind the scenes. They really do need to look after our own interests too. It wouldn't add up if it turned out somehow you had a rockstar manager but completely incompetent boobs for subordinates.

My old boss was friggin' awesome on this point. She made it a point to highlight the accomplishments of her team and not just present it as if she was some sort of managerial genius.

Which is worse? (0)

Anonymous Coward | more than 2 years ago | (#36968570)

Which is worse? Having it be O(N) to get a string length and having inexperienced programmers get confused and make mistakes? Or capping your maximum string length at 0xffff?

I'll take the former, please. I do a lot of string manipulation in C and when you're used to it, it's actually not that bad to get right and still be efficient. And it provides a useful shibboleth to detect people who are no good at C. :-) Just think of how much harder it would be to interview a C programmer if you couldn't give them a crazy string manipulation problem.

Re:Which is worse? (1)

The Dawn Of Time (2115350) | more than 2 years ago | (#36968612)

And of course, in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, your attitude is approximately as useful as a purse to a fish.

Re:Which is worse? (1)

c0lo (1497653) | more than 2 years ago | (#36968904)

And of course, in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, your attitude is approximately as useful as a purse to a fish.

Hmmm... let's not stop mid-way.

in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, you are approximately as useful as a purse to a fish.

FTFY: in a world that doesn't give a damn about professionalism, being a professional is useless.

Re:Which is worse? (1)

snowgirl (978879) | more than 2 years ago | (#36968616)

Which is worse? Having it be O(N) to get a string length and having inexperienced programmers get confused and make mistakes? Or capping your maximum string length at 0xffff?

I'll take the former, please. I do a lot of string manipulation in C and when you're used to it, it's actually not that bad to get right and still be efficient. And it provides a useful shibboleth to detect people who are no good at C. :-) Just think of how much harder it would be to interview a C programmer if you couldn't give them a crazy string manipulation problem.

I don't know why you have to bring sibolezes into this...

Re:Which is worse? (0)

Anonymous Coward | more than 2 years ago | (#36968726)

Just think of how unnecessary it would be to interview C programmers if they didn't have to solve crazy string manipulation problems.

FTFY.

Re:Which is worse? (1)

Dadoo (899435) | more than 2 years ago | (#36968782)

I guess my question would be: what if I want a string that contains NULs? (Yes, I've had this situation, before.)

Re:Which is worse? (1)

AuMatar (183847) | more than 2 years ago | (#36968890)

Since NUL is an unprintable character with no meaning, there's no reason to do that. Now you may have needed a byte pointer with NULs in it, but that's not the same as needing a string with it.

Re:Which is worse? (1)

c0lo (1497653) | more than 2 years ago | (#36968920)

I guess my question would be: what if I want a string that contains NULs? (Yes, I've had this situation, before.)

Then you want a char array, not a string. How do you solve it when you need an int array?

Re:Which is worse? (1)

smellotron (1039250) | more than 2 years ago | (#36968922)

I guess my question would be: what if I want a string that contains NULs? (Yes, I've had this situation, before.)

Pass around a size_t with your pointer and use the mem*() family of functions instead of str*().
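
A minimal sketch of that approach:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Data with an embedded NUL: track the length explicitly. */
    const char data[] = "foo\0bar";
    size_t len = sizeof data - 1;           /* 7 bytes; strlen() says 3 */

    char copy[sizeof data];
    memcpy(copy, data, len);                /* copies all 7 bytes */

    const char *p = memchr(data, 'b', len); /* finds 'b' past the NUL */
    printf("%s\n", p);                      /* prints "bar" */
    return 0;
}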

Whatever (4, Funny)

Old Wolf (56093) | more than 2 years ago | (#36968572)

Come on , this is complete rubbish___8^)_#;3,2,.3root>^$)(^(943hellomax0984)_))1..l2l2_}[[}{

Re:Whatever (0)

Anonymous Coward | more than 2 years ago | (#36968598)

he-freaking-larious!

Re:Whatever (0)

Anonymous Coward | more than 2 years ago | (#36968892)

This is the most interesting article to appear on Slashdot in years....

Actually tradeoff may not have been rational (1)

perpenso (1613749) | more than 2 years ago | (#36968580)

this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings. Perhaps this is actually an example of premature optimization, optimizing things before proper profiling and analysis has shown the problem exists and the proposed solution is beneficial.

Re:Actually tradeoff may not have been rational (2)

c0lo (1497653) | more than 2 years ago | (#36968942)

this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

Actually the tradeoff may not have been rational.

Actually, the choice was rational [bell-labs.com] (or at least deliberate) - you see, it's not about a single byte, it's about whether to add a new data type.

C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user.

Error! (1)

larry bagina (561269) | more than 2 years ago | (#36968588)

If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM (virtual memory) page and the next VM page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

On all modern computers the page size is a power of 2, cleanly divisible by 32, 64, 128, 256, etc. Modern computers also impose a terrible penalty (sometimes including SIGBUS) for memory accesses that aren't aligned on the native word size. Throw those two facts together and an aligned read can't accidentally cross a VM page boundary.
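
A sketch of the kind of word-at-a-time scan under discussion (an assumption-laden illustration, not any particular library's code; 8-byte words assumed):

#include <stdint.h>
#include <stddef.h>

/* Because every wide read below is 8-byte aligned, and page sizes are
 * multiples of the word size, no read can straddle a page boundary --
 * which is the point above. Note that reading bytes past the NUL is
 * outside what strict C guarantees; real implementations do this in
 * assembly or with compiler-specific blessing. */
size_t strlen_words(const char *s)
{
    const char *p = s;

    /* Walk byte-by-byte up to the first 8-byte boundary. */
    while ((uintptr_t)p % 8 != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Scan one aligned word at a time; the classic bit trick below is
     * nonzero exactly when some byte in v is zero. */
    const uint64_t *w = (const uint64_t *)p;
    for (;;) {
        uint64_t v = *w;
        if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
            break;
        w++;
    }

    /* Locate the exact NUL within that final word. */
    p = (const char *)w;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}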


Worst mistake: Therac-25 (0)

Anonymous Coward | more than 2 years ago | (#36968608)

Unchecked boundary conditions (in the case of the Therac-25, an overflow of a one-byte counter) are a fatal flaw in poorly written software. In older 8-bit apps, this could wind up causing random unexplained crashes. Well, in this case it caused people to be exposed to high doses of radiation over large areas of their bodies, and it cost people their lives. (Learning about this was when I decided I was much happier working on web/e-commerce stuff than on embedded systems programming.)

The cost of a byte - or was that the value? (2)

Teunis (678244) | more than 2 years ago | (#36968626)

hmm. marker character, or a length.

Marker: same type as the string, so no need to worry about bit size, start/stop bits, or other extraneous concerns. A string can be any size, restricted only by available memory. (Given the ability to swap darn near unlimited pages on current hardware, and the ability to virtualize across computers, this means strings have a potentially infinite limit.)

Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

IMO, C and the null terminated string -saved- more than it cost. It's entirely (theoretically anyway) possible - given the kind of code I've seen in browsers and server code - that the web couldn't have existed without some of these assumptions. The "streaming" so core to unix depends on this... how else does one know when one hits the end of a file or a buffer?

When you mark cost, know what you pay. Not all costs are negative.

Re:The cost of a byte - or was that the value? (1)

Sloppy (14984) | more than 2 years ago | (#36968806)

Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

These aren't hard questions, IMHO. Just say the length is an int (or an unsigned int); then, assuming you didn't freak out when someone asked all those very same questions about ints, you should be fairly happy with the result.

Re:The cost of a byte - or was that the value? (1)

Arlet (29997) | more than 2 years ago | (#36968884)

Instead of an int, shouldn't that be a size_t ?

Re:The cost of a byte - or was that the value? (1)

EvanED (569694) | more than 2 years ago | (#36968936)

A string can be any size, restricted only by available memory.

It's not like you can't get that with counted strings. If you're in the "infinite" limit case, then you're already doing something very different than just treating a block of memory as a string, and so you can either use a terminated string in that (very unusual) case or allow for a variable-sized count field.

What's the size? What byte order? What bit size? How will this affect communications between platforms?

The size of the counter I'll grant you -- IMO this may be the biggest reason that I'm glad for historical reasons that C didn't go with count fields. (I'm worried that we'd still be using 2-byte fields or something nowadays.) But I think you're overstating the problems with it... you already have to worry about all of those problems.

It's entirely (theoretically anyway) possible - given the kind of code I've seen in browsers and server code - that the web couldn't have existed without some of these assumptions. The "streaming" so core to unix depends on this... how else does one know when one hits the end of a file or a buffer?

I don't buy that one iota.

So first, "how do you know when you hit the end of a file"? That's not signaled by null in the first place, so the same way you do now. End of a buffer? Because you reached the count.

Second, it's not like if there was a situation where you'd frequently not know the size of the data a priori you wouldn't be able to change the protocol and include a terminator in that instance. (You could use this to still provide something like find's -print0 and xargs' -0 if you didn't want lengths to show up on standard out.)

Third, think about what your assertion basically boils down to: that you can't do web programming in languages that give you counted strings. And of course that's crazy.

Personally, I think there's something you don't see much of in this debate. There are actually three pieces of information that matter: the string data, the length of the string, and the size of the buffer. It's always necessary to track the first, but any time you want to extend the length of the string you have to track the third. (And that's a fair bit.) In my ideal world, C's "standard" string representation (supported by the language-provided APIs) would have been like that. (Windows has it right [microsoft.com].)
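
A sketch of a representation tracking all three (loosely modeled on the idea behind the linked Windows type; the struct here is hypothetical):

#include <stddef.h>

/* The three pieces of information: data, in-use length, and allocated
 * size. Growth checks compare against cap; no strlen(), no realloc
 * guesswork. */
struct buf_string {
    size_t len;    /* bytes currently in use   */
    size_t cap;    /* bytes allocated for data */
    char  *data;
};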

Re:The cost of a byte - or was that the value? (1)

c0lo (1497653) | more than 2 years ago | (#36968992)

Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

Adding: how do you pass along the tail substring? (Like: I've parsed to here, take over from here on. Oh yes, on top of the length, now deal with another offset.)

register starvation (1)

tabrisnet (722816) | more than 2 years ago | (#36968634)

The real problem with the addr+len approach is that now every string becomes a struct, or a structptr.

This means that when passing a string to a function, either the string takes up two register/stack slots, or you're passing around a const-ptr (but the contents of the struct are not const), which means one more memory access due to pointer indirection.

x86 and the PDP-11 are register-starved: the x86 has 8 registers, with 4 or 5 available as general-purpose registers. The PDP-11 was similar, with 8 registers total as well.

Re:register starvation (1)

perpenso (1613749) | more than 2 years ago | (#36968962)

My assembly class in college was on a PDP-11, and I've done quite a bit of x86 assembly over the years. I'm confused as to why you think a Pascal-style string structure pointer requires any more registers or stack than a C character pointer. In assembly, if I want the length I reference a size_t at the pointer address, and if I want the text I reference a char at pointer+offset, where offset is sizeof(size_t).

Re:register starvation (1)

EvanED (569694) | more than 2 years ago | (#36968964)

x86's register situation, while not nearly as good as it should be (even x64 isn't all that good), is not nearly as bad as it seems. First, register renaming does a bit to help, but my understanding is that x86 chips pull a special trick: they are able to specially detect most reads and writes to the top several stack slots and redirect those accesses to a register as well. (It's been a while since I've read that, and I forget where.)

(BTW, your "4 or 5" is a little low: it's really 6 or 7 registers that are generally available. You definitely get eax through edx, esi, and edi. That's 6. If you turn on frame pointer optimization, you've also got ebp.)

Re:register starvation (0)

Anonymous Coward | more than 2 years ago | (#36968982)

Please... Please... Mod parent +1 Informative. Please... it's actually brilliant.

Slashdot Sensation Prevention Section (4, Informative)

gmhowell (26755) | more than 2 years ago | (#36968644)

FTA:

We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

In other words, Ken, Dennis, and Brian did the right thing.

Slashdot Sensation Prevention Section (1)

Target Practice (79470) | more than 2 years ago | (#36968698)

Wow, this place has come a long way from a simple news for nerds site. Now, the authors are placing disclaimers specifically addressed to us :)

Fair and balanced (0)

lucm (889690) | more than 2 years ago | (#36968706)

From the article:
> Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed [...]

This is the kind of factual, objective and unbiased content that gives credibility to an article.

Got it wrong (3, Insightful)

Spazmania (174582) | more than 2 years ago | (#36968716)

It probably wasn't about the bytes. The factors are:

1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.

2. Computational efficiency. Many if not most operations on strings don't need to know how long they are, so why suffer the overhead of keeping track? That makes string operations on null-terminated strings on average faster than operations on strings bounded by an integer (see the sketch below).

3. Bytes. It's only one extra byte if you use a magic type or a cleverly packed struct, and in both cases only under the assumption that the maximum length the standard string functions will handle is 64KB. If you're talking about a more mundane struct, then you're talking about an int plus a pointer to a separately malloc'd block of memory, with all the allocator overhead that entails. That's a lot of extra bytes, not just one.

For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null-terminated strings was both obvious and correct.
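
A concrete illustration of point 2 (a minimal sketch; the function name is made up): scanning to a delimiter needs no length bookkeeping at all, because the data itself says when to stop.

#include <stddef.h>

const char *find_colon(const char *s)
{
    while (*s != '\0' && *s != ':')
        s++;
    return (*s == ':') ? s : NULL;   /* NULL if no colon before the end */
}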

Re:Got it wrong (2)

PCM2 (4486) | more than 2 years ago | (#36968758)

Beyond those points:

It is interesting to compare C's approach with that of two nearly contemporaneous languages, Algol 68 and Pascal [Jensen 74]. Arrays in Algol 68 either have fixed bounds, or are `flexible:' considerable mechanism is required both in the language definition, and in compilers, to accommodate flexible arrays (and not all compilers fully implement them.) Original Pascal had only fixed-sized arrays and strings, and this proved confining [Kernighan 81]. Later, this was partially fixed, though the resulting language is not yet universally available.

C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well.

And that's coming from Dennis Ritchie [bell-labs.com], who was there.

Re:Got it wrong (1)

Homburg (213427) | more than 2 years ago | (#36968794)

To keep the string length, you'd have to employ a struct.

No, strings with a listed length would also be pointers to a series of integers - it's just that, instead of giving a value special semantics (0 as end of string), you give a position in the series special semantics (store the length in the first two bytes). In both cases, you need your string-handling functions to be aware of whatever the convention is.

Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

I don't know that that's true. Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent. What are the common cases you are thinking of where C-style strings are faster?

Re:Got it wrong (1)

Arlet (29997) | more than 2 years ago | (#36968848)

(store the length in the first two bytes)

So 65536-byte strings should be enough for anybody?

Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent. What are the common cases you are thinking of where C-style strings are faster?

C-style strings are simpler. That's the biggest advantage. For the few cases where performance matters, you can always define your own string type.

PHK wide of the mark (5, Insightful)

epine (68316) | more than 2 years ago | (#36968774)

Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.

C makes a big deal about the equivalence of pointers and arrays. Plus, in C, a string also represents every one of its suffix strings.

char string[] = { 't', 'e', 's', 't', '\0' };
char* cdr_string = string + 1;

Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length field up front? It can be interesting in C to sort all the suffixes of a string while holding only one copy of the string itself. Try that with length-prefixed strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)
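
A minimal sketch of that suffix trick (using the trivial qsort approach, which as noted above is not ideal for degenerate inputs): one copy of the string, one pointer per suffix, and strcmp works unmodified because every suffix is itself a NUL-terminated string.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

int main(void)
{
    char s[] = "banana";
    const char *suffix[sizeof s];
    size_t n = strlen(s);

    for (size_t i = 0; i < n; i++)
        suffix[i] = s + i;                /* no text is copied */

    qsort(suffix, n, sizeof suffix[0], cmp_suffix);

    for (size_t i = 0; i < n; i++)
        puts(suffix[i]);                  /* a, ana, anana, banana, ... */
    return 0;
}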

Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would we all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

No one who ever read Dijkstra and took him seriously ever made a tiny fraction of the stupid mistakes blamed on hapless zero.

If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hangover and no copy of Dijkstra within a 100-mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"
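
A sketch of a copy that tests its precondition instead of assuming it, in the spirit of BSD's strlcpy (this is not strlcpy itself: strlcpy returns strlen(src), while this sketch returns the bytes actually copied):

#include <stddef.h>

size_t copy_bounded(char *dst, const char *src, size_t dstsize)
{
    size_t i = 0;

    if (dstsize == 0)
        return 0;                 /* nothing safe to do */
    while (i + 1 < dstsize && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';                /* always NUL-terminate */
    return i;
}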

C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.

Where does PHK lay the blame for the algebraist who made it possible to divide both sides of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?

PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.

For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.

And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler, in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance-felicitous sequencing of legal alternatives).

The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.

BSD is dead! (0)

Anonymous Coward | more than 2 years ago | (#36968780)

Man, it's a sad day on Slashdot when PHK says something and no one says that BSD is dead! You wingnuts are losing your edge.

The trouble is arrays, not strings. (3, Interesting)

Animats (122034) | more than 2 years ago | (#36968814)

The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.

Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.

This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.

Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.

The price of this was a language with a built-in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.

That's how we got into this mess.

As I point out occasionally, the right answer would have been array syntax like

int read(int fd, char[n]& buf, size_t n);

That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,

int read(int fd, char* buf, size_t n);

is a lie - you're not passing a pointer by value, you're passing an array by reference.

C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.

Re:The trouble is arrays, not strings. (0)

Anonymous Coward | more than 2 years ago | (#36968984)

int read(int fd, char[n]& buf, size_t n);

That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on.

How do you figure there's no overhead? There's now an extra argument that the caller must pass so it can be evaluated at runtime. And what is the compiler to do with the type information, other than to verify naive constant over-indexing of the array?

I agree it would be a cool concept but the cost is not zero and the compile-time error checking it would provide would be negligible. Having the array size available for evaluation at runtime would be cool, but it's just syntactic sugar added to the standard:

int read(int fd, char *buf, size_t buf_max, size_t n);

First: READ TFA (1)

Hymer (856453) | more than 2 years ago | (#36968834)

PHK's articles are worth reading... always.

Second: there is a /. Sensation Prevention Section where he explains that NUL-terminated strings were the correct choice at the time; they just caused some unforeseen consequences.

String Nulls in SQL (1)

Tablizer (95088) | more than 2 years ago | (#36968840)

Die and rot in nullhell! The verbosity and work-arounds they force...

Were nul-terminated strings essential? (1)

davidgay (569650) | more than 2 years ago | (#36968844)

The real question nobody has addressed here: if C had gone with length+characters for its strings, would it have succeeded?

David Gay, scarred by Pascal "strings"
PS: I've often wondered the same about that other decried C feature, the preprocessor.

Well I differ in my view. (3, Informative)

hamster_nz (656572) | more than 2 years ago | (#36968860)

After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.

Terminating with a null is only a convention - the C language itself has no concept of strings. As others point out, a string is either an array of bytes or a pointer to bytes.

It isn't forced onto you - you don't have to follow it.

Missing the point (1)

Casandro (751346) | more than 2 years ago | (#36968928)

A more urgent problem to solve would have been finding out where an allocated block of RAM ends.

Or, just like integers and floats, strings could have been their very own basic type. Essentially, leave the implementation to the compiler so it can do range checks. Most C programmers seem to believe this is done already.

BTW, range checks on integers don't cost anything anymore. I've benchmarked some real-life code using large arrays (doing statistics on them), and range checks didn't cause any slowdown. Essentially, the compare operation can be done in parallel with the memory read... that is, when your language supports that at all.

Faster loops (4, Insightful)

Sloppy (14984) | more than 2 years ago | (#36968932)

TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.

If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, now you have to store this offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e. subtract) it to the length, which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!), and the code is bigger and slower.
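
A rough sketch of the two loop shapes. On a PDP-11 the load itself sets the condition codes, so the sentinel loop's test really is free; the counted loop carries an extra counter and compare:

#include <ctype.h>
#include <stddef.h>

void upcase_sentinel(char *s)
{
    for (; *s != '\0'; s++)               /* test falls out of the load */
        *s = (char)toupper((unsigned char)*s);
}

void upcase_counted(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)      /* extra register for i, plus len */
        s[i] = (char)toupper((unsigned char)s[i]);
}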

It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.

BTW, I have a hunch K&R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a madman to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R, I guess.)
