C++ and copy-on-write data structures

I generally like C++ and the way C++ feels, but there is one specific quirk which is caused by STL designers’ choice of data structures which cannot be implemented with a copy-on-write (aka COW) semantic. Curious about which quirk? Keep on reading! COW might look like an implementation detail, but it can really change the way you design your APIs.

What is a COW?

An object with COW semantic (sometimes also called “implicit sharing”) has an interesting property: whenever you copy it, it does not really do a “deep copy” but simply shares the underlying data between the original object and its copy, and increments a reference counter to track this. Only at the moment that any of the copies is further modified (written to), the deep copy is really performed.

Now, it is important to realize that the overhead imposed by the COW semantic is usually very tiny: the object just needs to bookkeep a counter at every copy, and check it before any modification. The latter is even lighter than it sounds because the compiler can strip off many checks thanks to constant propagations.

But what are the advantages of using COW? The first and foremost advantage is the possibility to have a clean API with respect to return values.

Returning a COW

If you have programmed C++ long enough, you know that all C++ functions that should return a complex data structure (anything which is larger than a small structure, like a container, a memory buffer, and so on) will instead accept a reference or a pointer to an object that the caller must provide and that will be “filled in” as return value. This is absolutely awful. If you don’t see it as awful, it’s about time you take a break from C++ and switch to other languages for a while. For instance:

std::vector nodes;
cur_node.children(nodes);
printf("Number of children: %u\n", nodes.size());
The Node::children() function, in any other language, would have returned a vector of Nodes. But idiomatic C++ says that you should not impose the overhead of doing an additional copy of the vector onto your user, and thus ask him to pass you the container to fill. This is really unreadable because it breaks the common syntax of functions by intermixing return values to arguments. Moreover, it imposes additional troubles to the API; eg: what happens if you callNode::children() with a non-empty vector? Will the function just append its data to the vector? Or would it clear it at the beginning?

Now, think of a world where std::vector is a COW data structure. All these problems suddenly disappear because returning the vector would not cause a copy anymore. Bam, problem solved! And when you have a clear API, where return values really are return values, you get all the benefits from it:

printf("Number of children: %u\n", cur_node.children().size());
Three lines merged into one. No more temporary named variable. And if Node happens to internally store its children as a vector, this code is even faster than the non-COW version, because there is no copy performed at all: you get the same speed as if children() returned a reference to the internal vector, even though it does not.

COWs to the rescue

Other advantages of COW data structures:

  1. You can nest COW data structures without overhead. Think of vector<vector<int>>. Without COW, whenever the outer data structure resizes, the inner data structures will all be reallocated and copied. With COW, there is no copy at all: the internal data structures are basically moved over in constant time.
  2. You can stop abusing const references everywhere. In C++, most functions will receive arguments by const references instead of simple values. Again, this is to save the overhead of a copy, and again this is almost useless with COW data structures. Const references are also nasty because they represent a first leak of const-correctness in your code (I also hate the whole const-correctness in C++, but this is for another day; if you like it, you can still make your functions accept const values if you feel too).

Norwegian COWs

Nokia’s Qt has designed all objects and data structures with COW. They even made them reentrant (for multithreading purposes) by using atomic reference counting through native CPU instructions. Finally, they expose the guts of COW through a simple QSharedData template that you can use to reimplement your own COW objects.

I could not agree with Qt’s designers more.

This is another example of how COW can positively influence API design:

QByteArray fileHash(QString fn)
{
   QFile f(fn);
   QCryptographicHash h(QCryptographicHash::Sha1);
   while (!f.atEnd())
        h.addData(f.read(16*1024));
 
   return h.result();
}
See how I can simply pass the buffer returned by QFile::read() to addData() without having to worry about memory copying.

Counting COWs

So, the next time you design an API which asks for a return value among the arguments, think of a C++ world where COW is everywhere. And dream on.