Julia variable gotchas

By: Terence Copestake

Re-posted from: http://thenewphalls.wordpress.com/2014/04/07/julia-variable-gotchas/

As is typical for many languages, assigning one variable to another in Julia does not create a copy of the variable data, but rather a reference to the existing data. However, I learned the hard way whilst working on the CGI module* that Julia does not currently support a copy-on-write mechanism for collections.

Take the example code below:

n = [ 1, 2, 3 ]

m = n

As expected, m becomes a reference to the collection referenced by n. Coming from a language with copy-on-write arrays (PHP, for example), one might expect a copy of the data referenced by n to be made as soon as either n or m is modified, for example:

n = [ 1, 2, 3 ]

m = n

push!(n, 4)

# With copy-on-write, one would expect n = [ 1, 2, 3, 4 ] and m = [ 1, 2, 3 ]

This is not the case for Julia. When the array pointed to by n is modified, m maintains its reference to that same array, giving both a value of [ 1, 2, 3, 4 ].
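
The shared reference can be checked directly. The following is a minimal sketch rather than anything taken from the CGI module; it relies only on push! and the built-in === operator, which compares object identity:

```julia
n = [1, 2, 3]
m = n               # m is a second name for the same array, not a copy

push!(n, 4)         # mutate the array through n

println(m)          # the element pushed via n is visible through m
println(m === n)    # true: both names refer to one and the same object
```

If m === n is true, any in-place change made through one name will be seen through the other.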

Problems in the wild

I encountered this quirk when working with binary data and UTF-8 strings.

n = Uint8[ 0x32, 0x33, 0x34, 0x61 ]

m = utf8(n)

empty!(n)

Having created a string using the utf8 function, I wanted to empty the original byte array to free those resources. After a few minutes of trying to figure out how a bounds error had crept into my app, I narrowed it down to this emptying of the byte array.

Digging deeper into the Julia source, the utf8 function is just an alias for a conversion function.

utf8(x) = convert(UTF8String, x)
...
convert(::Type{UTF8String}, a::Array{Uint8,1}) = is_valid_utf8(a) ? UTF8String(a) : ...

You can see here that passing an array of Uint8 bytes to utf8() creates an instance of UTF8String with the Uint8 array as its data. The type definition for UTF8String is:

immutable UTF8String <: String
    data::Array{Uint8,1}
end

As covered above, the UTF8String’s data field is only a reference to the collection passed to the utf8 function. If that collection is modified at any point during the program’s runtime, the returned string changes with it.
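
The same effect can be reproduced with any type that stores an array in a field. The sketch below uses a hypothetical ByteWrapper type as a stand-in for UTF8String; it is not the Base definition. It is written for a current Julia release, where immutable is spelled struct and Uint8 is spelled UInt8:

```julia
# Hypothetical stand-in for UTF8String: an immutable wrapper around a byte array.
struct ByteWrapper
    data::Vector{UInt8}
end

bytes = UInt8[0x32, 0x33, 0x34, 0x61]

shared   = ByteWrapper(bytes)        # the field refers to the caller's array
isolated = ByteWrapper(copy(bytes))  # the field refers to a private copy

empty!(bytes)                        # the caller "frees" the original array

println(shared.data === bytes)       # true: the wrapper never owned its data
println(length(shared.data))         # 0: the wrapper's contents are gone too
println(length(isolated.data))       # 4: the copy is unaffected
```

Marking the type as immutable only prevents the field from being rebound to a different array; it does not protect the contents of the array the field refers to.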

In closing

For now, the solution seems to be to call the copy or deepcopy functions explicitly wherever the program logic requires an independent copy of the data.
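
As a rough illustration of the difference between the two functions (the variable names here are made up for the example): copy creates a new outer collection that still shares any nested mutable elements, whereas deepcopy duplicates those as well.

```julia
n = [1, 2, 3]
m = copy(n)                 # new array holding the same elements
push!(n, 4)                 # no longer affects m

nested  = [[1, 2], [3, 4]]
shallow = copy(nested)      # new outer array, same inner arrays
deep    = deepcopy(nested)  # inner arrays are duplicated too

push!(nested[1], 99)        # mutate an inner array in place

println(m)                  # still three elements
println(shallow[1])         # includes 99, since the inner array is shared
println(deep[1])            # unchanged
```

For a flat array of numbers or bytes, copy should be enough; deepcopy matters once the collection holds other mutable objects.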

The issue is explored in this Google Groups thread. If I’ve understood correctly, the gist of it is that Julia makes this sacrifice for the sake of performance. As this is a feature wanted by many, there’s a possibility of it being implemented in a later version of the language.

* Write-up to follow at a later date