Inheriting type behavior in Julia

By: Christian Groll

Re-posted from: http://grollchristian.wordpress.com/2014/01/22/julia-inheriting-behavior/

In object-oriented programming languages, classes can inherit from classes higher up in the class hierarchy. This way, methods of the superclass also apply to the subclass, provided that they are not explicitly re-defined for the subclass. In many regards, super- and subclasses hence behave similarly, allowing the same methods to be applied and providing similar access behavior. Together, they form a coherent user interface.

In Julia, such a coherent interface for multiple types requires a bit of extra work, since Julia does not allow composite types to be subtyped. Nevertheless, Julia’s flexibility generally allows composite types to be constructed such that they emulate the behavior of some already existing type. This requires some extra coding, but it can be implemented efficiently through metaprogramming.

1 Justification of inheritance

Such an emulation or inheritance of behavior is desired in basically two situations.

1.1 Implementing constrained subsets

The first situation where behavior inheritance could be useful is when interest lies only in a subset of all possible instances of an already existing type. For example, let’s assume that we want to create a new composite type Simplex, which shall contain all points on the n-dimensional simplex \Delta^{n}. The simplex is defined as the set of n-dimensional points whose entries are positive and sum up to one:

\displaystyle \Delta^{n}=\{x\in \mathbb{R}^{n}| \sum_{i=1}^{n}x_{i}=1, x_{i}>0\}

Clearly, \Delta^{n} is a subset of \mathbb{R}^{n}. And for points of \mathbb{R}^{n} we already have a quite simple and suitable type in Julia: Array{Float64, 1}. However, to represent the simplex in Julia, we would like to have a type that corresponds to only a subset of all instances of Array{Float64, 1}. This way, we can achieve more robust code, since users will not be able to manually break the summation constraint \sum_{i=1}^{n}x_{i}=1 too easily. Also, the creation of a new composite type allows overloading function definitions. We hence could create a method plot(x::Simplex) which suits the specific characteristics of this subset of \mathbb{R}^{n}. On the other hand, any instance of Simplex still represents an element of \mathbb{R}^{n}, and hence should ideally behave as such in Julia as well: any method that generally works on Array{Float64, 1} should also work for instances of Simplex. However, in contrast to some object-oriented languages, Julia allows subtyping only for abstract types, not for composite types. Hence, even though any mathematical operation on arbitrary points in \mathbb{R}^{n} also holds for points of \Delta^{n}, methods for Array{Float64, 1} will not automatically apply to instances of type Simplex.

A naive way to fix this would be through the definition of an appropriate conversion method, which allows instances of Simplex to be re-interpreted as instances of Array{Float64, 1}. This way, an already existing method for Array{Float64, 1} can be extended to type Simplex:

function f(x::Simplex)
    xAsArray = convert(Array{Float64, 1}, x)
    return f(xAsArray)
end
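
The conversion method itself is not defined above. Assuming that Simplex stores its coordinates in a single field points of type Array{Float64, 1} (as in the definition further below), a minimal sketch could look like this:

import Base.convert   # Base functions need to be imported before they can be extended

## hypothetical conversion method, assuming a field points::Array{Float64, 1}
convert(::Type{Array{Float64, 1}}, x::Simplex) = x.points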

Even more efficiently, one could incorporate a tight relation to Array{Float64, 1} into type Simplex right from the start, defining Simplex such that it contains only one field, which is of type Array{Float64, 1}.

type Simplex
    points::Array{Float64, 1}
end

This way, the conversion step becomes unnecessary, as methods only need to be delegated to the respective field of Simplex:

f(x::Simplex) = f(x.points)
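
As argued above, the main benefit of a dedicated Simplex type is robustness: the simplex constraints can be enforced right at construction time via an inner constructor. The following is only a minimal sketch of such a definition (it would replace the bare definition above; the tolerance value is an arbitrary choice for illustration):

type Simplex
    points::Array{Float64, 1}

    function Simplex(points::Array{Float64, 1})
        ## reject points violating the simplex constraints
        if any(points .<= 0)
            error("all entries must be positive")
        end
        if abs(sum(points) - 1) > 1e-12
            error("entries must sum up to one")
        end
        return new(points)
    end
end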

Hence, we now already have a quite good starting point for inheritance of behavior. Instances of type Simplex, when plugged into function f, return the same value as if the n-dimensional simplex point were simply represented as Array{Float64, 1}. Nevertheless, as we will see further on, a lot of intricacies are yet to come. But before we get there, let’s first take a look at a second situation where inheritance of behavior might be desired.

1.2 Extending existing types

In the first case we were implementing a true subset relation, with \Delta^{n}\subset \mathbb{R}^{n}. Here, we will take a given type and try to extend it with some new attribute. For example, let’s assume that we had to deal with numeric time series data that does not have any missing values. Of course, the data could be suitably stored and processed as an instance of the very general type DataFrame. Time information would simply be stored in the first column, with the values of the observations in the subsequent columns. However, the data might be better stored as an instance of a type that is more tailored to its specific characteristics. For example, as the data values themselves are numeric (except for the time column), they permit statistical operations such as deriving mean, minimum and maximum values. By separating numeric observations from dates, code becomes more robust, as we are able to better distinguish between functions applying to the numerical part of the data and functions applying to the dates information.

For illustration, let’s look at two distinct ways of translating time series data into Julia data types. In the first case, we simply treat the data as a true subset of type DataFrame, although wrapped into a newly created type TimeSeriesDf. This way, we can still overload functions and create methods that take into account the specific characteristics of the data. For example, a method plot(ts::TimeSeriesDf) would show date labels on the x-axis. Hence, we define:

type TimeSeriesDf
    vals::DataFrame
end

The problem is that it becomes more tedious to separate dates and numeric data. For example, a method mean(ts::TimeSeriesDf) might now be implemented as

mean(ts::TimeSeriesDf) = mean(ts.vals[:, 2:end])

In contrast, we could also define a type TimeSeries that has two fields, where the first field deals with dates, and the second field contains numeric observations only.

type TimeSeries
    dates::Array{Date{ISOCalendar}, 1}
    vals::DataFrame
end

The distinction between dates and numeric data hence becomes hard-coded into the data type. Here, a mean method requires no additional indexing:

mean(ts::TimeSeries) = mean(ts.vals)

In a way, instances of type TimeSeries no longer represent a true subset of type DataFrame, as they are characterized by an additional dates field. The type extends a simple DataFrame with an additional array of dates. Ideally, such a type should inherit behavior from both individual field types, making its application a natural extension of already existing types. For example, adding a scalar value should only affect field vals:

+(ts::TimeSeries, x::Number) = +(ts.vals, x)

This way, one can achieve a very nice and intuitive way of handling time series data. For more information, simply take a look at package TimeData, where you can see the benefits of type inheritance and type extension in a more elaborate use case.

In my opinion, these are the two most common cases where one would like to have a new composite type behaving like some already existent type. In both cases, the key lies in the delegation of functions to certain fields of the composite type.

2 Intricacies

Let’s now take a more detailed look at the intricacies that we have to expect in the course of the implementation. We want to look at the most general case, where we simultaneously create multiple new types. Each type shall borrow from the same existing type and implement multiple functions, possibly with multiple methods per function. To keep things concrete, we will create types that inherit from DataFrame. The first type is called NumDf, and it represents DataFrames that consist of either numeric values or NAs only. The second type is called ArrayDf, and it excludes NAs as well. Hence, for ArrayDf, the values themselves could also be stored as Array{Float64, 2}. However, relating it to DataFrames still allows us to make use of column names. Furthermore, for reasons that will become clear later, we also make both types subtypes of an abstract type AbstractDf. For both types, the inner constructor first needs to check whether the constraints on the data are fulfilled.

abstract AbstractDf

type NumDf <: AbstractDf
    vals::DataFrame

    function NumDf(df::DataFrame)
        chkForNumericValues(df)
        return new(df)
    end
end

type ArrayDf <: AbstractDf
    vals::DataFrame

    function ArrayDf(df::DataFrame)
        chkForNumericValues(df)        
        chkNoNAs(df)
        return new(df)
    end
end
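
The validation functions chkForNumericValues and chkNoNAs used in the inner constructors are not spelled out here. Just as an assumption about their behavior, a minimal sketch based on the DataFrames API of that time (eltype on columns, isna for missing values) could look as follows:

## hypothetical sketches of the validation helpers used above
function chkForNumericValues(df::DataFrame)
    for col in names(df)
        if !(eltype(df[col]) <: Number)
            error("all columns must contain numeric values only")
        end
    end
end

function chkNoNAs(df::DataFrame)
    for col in names(df)
        if any(isna(df[col]))
            error("columns must not contain any NAs")
        end
    end
end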

In the example, the newly created types consist of only one field, which already is of the same type as the type to be emulated. However, this restriction is only for reasons of simplicity, and all subsequent propositions could easily be transferred to more complex structures, too. Again, if you are interested in this, simply take a look at package TimeData. It implements very similar types that also borrow functionality from DataFrames, but with an additional field for dates:

type Timenum <: AbstractTimenum
    vals::DataFrame
    dates::DataArray

    function Timenum(vals::DataFrame, dates::DataArray)
        chkDates(dates)
        chkNumDf(vals)
        if(size(vals, 1) != length(dates))
            if (length(dates) == 0) | (size(vals, 1) == 0)
                return new(DataFrame([]), DataArray([]))
            end
            error("number of dates must equal number of columns of data")
        end
        return new(vals, dates)
    end
end

2.1 Type preservation

We have already seen examples of functions that are easily delegated to the respective field of the new type. For functions like size() this makes perfect sense, and the method definition becomes:

size(nd::NumDf) = size(nd.vals)

However, this kind of pure delegation is not always how we want methods to behave. If we simply delegated all methods this way, we would end up permanently escaping our own data type. For example, simply delegating method exp(nd::NumDf) would indeed evaluate the exponential function on each individual entry. However, it would return these values as an instance of type DataFrame, just like method exp(df::DataFrame) does:

exp(nd::NumDf) = exp(nd.vals)
nd = NumDf(DataFrame(rand(3, 2)));
nd2 = exp(nd);
typeof(nd)
typeof(nd2)
NumDf (constructor with 1 method)
DataFrame (constructor with 22 methods)

Given that the returned values still meet the required constraints, we would like the method to return the resulting values as type NumDf again. Otherwise, there would be no persistence in our data representation, as we would end up with type DataFrame sooner or later anyway.

Hence, we need to write some methods such that the original type of our data remains unaffected. Methods still get delegated to methods of the respective emulated type, but the returned value needs to be converted back into the inheriting type. Therefore, we simply need to hand over the result to the constructor at the end:

function exp(nd::NumDf)
    valsDf = exp(nd.vals)
    return NumDf(valsDf)
end
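
Assuming this definition replaces the purely delegating one from above, the result now stays within our own type:

nd2 = exp(NumDf(DataFrame(rand(3, 2))))
typeof(nd2)   # now NumDf instead of DataFrame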

We hence need to state more precisely what “inheriting behavior” should mean for any given method: should it return exactly the same output as the emulated type, or should it wrap up the resulting values at the end, in order to return an instance of the same type as the caller type? Both alternatives require slightly different code. Also, this distinction will become relevant in the course of meta-programming for more general cases comprising multiple functions and multiple new types.

2.2 Multiple method signatures

The next complication arises for functions with multiple method signatures. Due to multiple dispatch, functions may well have multiple methods defined that need to be delegated. For example, you can examine the size of DataFrames with two different methods of size():

size(df::AbstractDataFrame)
size(df::AbstractDataFrame, i::Integer)

In principle, each of the two methods could simply be delegated on its own.

size(tn::NumDf) = size(tn.vals)
size(tn::NumDf, i::Integer) = size(tn.vals, i)

This, of course, is somewhat redundant, as it is quite cumbersome to delegate each method individually. Hence, one might be tempted to tackle both methods in one assignment, making use of variable function arguments:

size(tn::NumDf, args...) = size(tn.vals, args...)

Although this seems quite reasonable, one still needs to be very careful with this approach, since it could easily lead to method ambiguities. For example, let’s take a look at function .== for DataFrames. Amongst others, there exist the following methods:

.==(a::DataFrame,b::NAtype)
.==(a::DataFrame,b::DataFrame)
.==(a::DataFrame,b::Union(Number,String))

Using variable function arguments, the delegation would be implemented as:

.==(x::NumDf, args...) = .==(x.vals, args...)

This, however, will lead to the following warning:

Warning: New definition 
    .==(NumDf,Any) at none:1
is ambiguous with: 
    .==(Any,AbstractArray{T,N}) at bitarray.jl:1450.
To fix, define 
    .==(NumDf,AbstractArray{T,N})
before the new definition.

Hence, there are two methods that could possibly be called for (NumDf,AbstractArray{T,N}), and Julia will more or less arbitrarily pick one of them.

Although in this case both .== methods would most likely lead to the same result when called with (NumDf,AbstractArray{T,N}) anyway, it is considered poor style to simply ignore these warnings.

Furthermore, the ambiguity is not merely a side effect of the variable function arguments; it also appears with the following method definition:

.==(x::NumDf, y::Any) = .==(x.vals, y)

In my opinion, it is better to refrain from extensive method delegations with args... or Any whenever possible. Even if your own code works, excessive use of such delegations could still cause problems for other people who want to build on your code.

Nevertheless, this does not mean that you have to write every individual method definition by hand! For example, a large share of the methods for DataFrames comes with the exact same method signatures:

f(b::NAtype,a::DataFrame)
f(a::DataFrame,b::NAtype)
f(a::DataFrame,b::DataFrame)
f(a::DataFrame,b::Union(String,Number))
f(b::Union(String,Number),a::DataFrame)

Using macros and metaprogramming, one can easily define all functions with equal method signatures simultaneously in one go.

3 Implementation

Given these intricacies, let’s look at the actual implementation of inheritance for the most general case of multiple inheriting types, multiple functions and possibly multiple methods per function.

In a first step, all functions need to be classified with respect to two criteria:

  • the method signatures of the function
  • whether the function is type preserving or not

For example, for the case of type NumDf, the following table classifies some exemplary functions with respect to three different groups of method signatures.

method signatures                      | type preserving    | non-preserving
---------------------------------------+--------------------+-------------------
f(nd::NumDf)                           | :abs, :exp, :log   | :length, :isempty
---------------------------------------+--------------------+-------------------
f(nd::NumDf)                           | :round, :floor     | :size
f(nd::NumDf, x::Int)                   |                    |
---------------------------------------+--------------------+-------------------
f(b::NAtype, a::NumDf)                 | :+, :-, :*, :.^    | :.<, :.>, :.==
f(a::NumDf, b::NAtype)                 |                    |
f(a::NumDf, b::NumDf)                  |                    |
f(a::NumDf, b::Union(String,Number))   |                    |
f(b::Union(String,Number), a::NumDf)   |                    |

Once all desired functions are classified, functions within the same combination of method signatures and preservation behavior can be defined simultaneously in a loop. One simply needs to iterate over all functions within a block and interpolate the function names into a quoted expression that then gets evaluated.

Additionally, when multiple new types need to be defined simultaneously, one could also iterate over all types. However, for non-preserving functions, methods equivalently could be expressed with reference to an abstract supertype, so that the additional loop over all new types becomes unnecessary.

Let’s make this clearer through an example implementation for the case of method signatures given by

f(nd::NumDf)
f(nd::NumDf, x::Int)

First, let’s look at the implementation for non-preserving functions. (In this case, the only function within this block is size. Nevertheless, we implement it as a loop over all functions, since this provides more flexibility for later extensions.)

import Base.size   # Base functions need to be imported before they can be extended

single_or_two_args_non_preserving = [:size]

for f in single_or_two_args_non_preserving
    eval(quote

        # delegate to the wrapped DataFrame in field vals
        function $(f)(nd::AbstractDf)
            return $(f)(nd.vals)
        end

        function $(f)(nd::AbstractDf, i::Integer)
            return $(f)(nd.vals, i)
        end

    end)
end
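
Since these methods are defined for the abstract supertype AbstractDf, both NumDf and ArrayDf pick them up. Hypothetical usage:

nd = NumDf(DataFrame(rand(4, 2)))
size(nd)      # same result as for the wrapped DataFrame: (4,2)
size(nd, 1)   # 4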

For the case of type preservation, one additionally needs to iterate over all new types.

import Base.round   # again: import before extending
import Base.floor

single_or_two_args_type_preserving = [:round, :floor]

for t in (:NumDf, :ArrayDf)
    for f in single_or_two_args_type_preserving
        eval(quote

            # delegate to field vals, but wrap the result into the original type again
            function $(f)(nd::$(t))
                valuesDf = $(f)(nd.vals)
                return $(t)(valuesDf)
            end

            function $(f)(nd::$(t), i::Integer)
                valuesDf = $(f)(nd.vals, i)
                return $(t)(valuesDf)
            end

        end)
    end
end
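
A quick hypothetical check that the type is now preserved (assuming round is defined for DataFrames, as presumed in the classification table above):

typeof(round(NumDf(DataFrame(rand(4, 2)))))   # NumDf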

3.1 Special cases

Using the instructions above, you’re now able to emulate the behavior of a different type in almost all cases. Still, there are some special cases where you will most likely need to find your own solution.

3.1.1 Outer constructors

Outer constructors are different from general functions in that their name will naturally differ for each individual type. Hence, the function name (which is the name of the constructor) needs to be interpolated for each type as well.

for t in (:NumDf, :ArrayDf)
    eval(quote
        function $(t)(vals::Array{Float64, 2})
            $(t)(DataFrame(vals))
        end
    end)
end
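
With these outer constructors in place, instances can be created directly from a numeric array (hypothetical usage):

nd = NumDf(rand(4, 2))      # wraps the array into a DataFrame first
ad = ArrayDf(rand(4, 2))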

3.1.2 Partially type preserving functions

In some cases, you might either want to have partially type preserving functions yourself, or you might need to inherit from such functions. For example, take a look at getindex for DataFrames, which returns different types depending on the input:

using DataFrames
df = DataFrame(rand(4, 3));
typeof(df[2:4, 1:2])
typeof(df[:, 1])
typeof(df[1, 1])
DataFrame (constructor with 22 methods)
DataArray{Float64,1} (constructor with 1 method)
Float64

If you want to build on such a function, and you do not simply want to return the same variety of types, you most likely need to adapt every single method manually. Sadly, such functions can be quite cumbersome to implement.
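
One possible pattern, which is not used in the original code, is to delegate the call and wrap the result only when it comes back as a DataFrame. A rough sketch for a single method (in real code, the index arguments should be typed explicitly in order to avoid the ambiguity issues discussed above):

import Base.getindex

## partially type preserving: wrap only DataFrame results
function getindex(nd::NumDf, rowinds, colinds)
    res = getindex(nd.vals, rowinds, colinds)
    return isa(res, DataFrame) ? NumDf(res) : res
end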

4 Conclusions

Although Julia does not come with a straightforward way to inherit behavior out of the box, its extensive metaprogramming capabilities still provide the flexibility to achieve this with a minimal amount of effort. However, metaprogramming might represent quite a hurdle for beginners, and without it, inheritance of behavior is rather cumbersome to implement.
