
Creating a New Programming Language (Keno Fischer Quora Session)

Keno Fischer, Julia Computing Co-Founder and Chief Technology Officer (Tools) participated in a Quora Session March 18-23. One of Keno’s responses was also featured in Forbes.

How do programming languages get created, and what goes into design decisions?

The first thing to think about in answering this question is: What is a programming language? If you ask Wikipedia that question, you will find that a programming language “is a formal language, which comprises a set of instructions that produce various kinds of output,” which is of course true, but in true encyclopedia form also mostly unhelpful. It does give the right idea though. Just write down some instructions and some rules for what they do, and voilà, you’ve created a programming language. If you write down these rules using slightly fancy language, you would call that the specification of your language and have a very good claim to have created a programming language.

Of course, in most instances, programming languages don’t start as exercises in specification writing. Instead, one starts with a program that actually does something with the programming language. Generally, this will either be a program that reads in some code written in the programming language and just does what the code says to do as it goes along (an “interpreter” – think following a recipe step by step) or one that translates the source code to the sequence of bits that the actual hardware understands (though this string of ones and zeros could also be considered a programming language that the hardware then interprets). There are a couple of more exotic kinds of programs one could write to implement a programming language (e.g. type checkers, which just check that the source code is well-formed, i.e. allowed by the rules of the language, but don’t otherwise execute it) and various variations on compilers and interpreters (hybrid systems, compilers to “virtual hardware”, i.e. low level languages that are designed to be easy to map to actual hardware, compilers from one high level programming language to another, aka “transpilers”), but the key thing is that these programs “understand” the language in some way. The specification usually comes later, if ever.
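To make the interpreter idea concrete, here is a minimal sketch of a toy interpreter in Julia for a made-up two-instruction stack language (the language and its instruction names are invented purely for illustration):

    # A toy interpreter for a made-up stack language with two instructions:
    #   ("push", n) pushes the number n onto the stack
    #   ("add",)    pops the top two numbers and pushes their sum
    # The "rules" of the language are simply whatever this loop does with each instruction.
    function interpret(program)
        stack = Int[]
        for instr in program
            if instr[1] == "push"
                push!(stack, instr[2])
            elseif instr[1] == "add"
                b = pop!(stack)
                a = pop!(stack)
                push!(stack, a + b)
            else
                error("unknown instruction: ", instr[1])
            end
        end
        return stack
    end

    interpret([("push", 1), ("push", 2), ("add",)])   # returns [3]

A compiler for the same toy language would instead translate each instruction into something a lower layer understands, rather than executing it on the spot.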

Now, assuming you’ve started your own programming language, how does one decide what the language should be – what the available instructions are, what the rules and grammar of the language are, what the semantics of various things are, etc.? There are a lot of things to consider when making these decisions: How does it work with the rest of the system? Is it self-consistent? Does it make sense to the user? Will the users be able to guess what’s going on, just by looking at the code? Are we able to efficiently have the hardware do what the language says it should do? Is there precedent somewhere, e.g. in mathematics or in other programming languages, that sets users’ expectations for how things should work? If so, and we are deviating from those expectations, are there good reasons to do so [1]? If we are doing something different or unexpected, should we provide both behaviors, or should we at least add something to make sure that users expecting the legacy behavior will easily find out what the legacy behavior is, etc.? In the end, in every decision you make, you need to consider two things: 1) the computer that has to run it and 2) the human that has to read it.

Both are extremely important, but there is of course a trade-off between them, and languages differ in where they fall on this spectrum. In Julia, we try very hard to make a program well understood by both (this was actually one of the original motivations for Julia). This isn’t easy and there are hard trade-offs to be made sometimes (e.g. it’d be nice to check for overflow in all arithmetic operations, but doing this by default is too slow on current generation machines), but we try to make sure that a) we make reasonable choices by default and b) whenever we make a trade-off in either direction, there are ways to let users make the opposite choice while still being able to use the rest of the system without trouble. Julia’s multiple dispatch system is essential to making this work (though the details of that are a whole separate topic).
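To make that overflow trade-off concrete, here is a minimal sketch in Julia: default integer arithmetic wraps around for speed, while checked operations and arbitrary-precision integers let users opt into the other side of the trade-off.

    # Default Int arithmetic wraps around silently on overflow (fast, no per-operation check):
    typemax(Int) + 1                            # gives typemin(Int) on a 64-bit machine

    # Opting into checking where it matters:
    Base.Checked.checked_add(typemax(Int), 1)   # throws OverflowError

    # Or switch number types entirely; BigInt never overflows, but is much slower:
    big(typemax(Int)) + 1                       # 9223372036854775808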

[1] E.g. we have a policy of generally spelling out names rather than using short abbreviations, so you might consider “sine” and “cosine” as more consistent names than “sin” and “cos”, but you’d be fighting against 100 years of mathematical notation. As an example on the other side, a lot of languages like to use “+” to concatenate strings. However, we considered that a serious mistake, since + is facially commutative and string concatenation is not, which is why we use “*” as our string concatenation operator.
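In Julia, that convention looks like this:

    # String concatenation uses *, consistent with * not being assumed commutative:
    "Hello, " * "world"    # "Hello, world"

    # and repetition uses ^, consistent with x^3 meaning x * x * x:
    "ab"^3                 # "ababab"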

If you were to design a new general-purpose programming language in the spirit of Julia today, what would you do differently?

Well, this one’s easy. The nice thing about having your own language is that anything you think is bad about it, you can just change, without having to go through the effort of designing a whole new language from scratch. A more interesting question is what kind of language I would design if I were to do something completely different from what Julia does. Julia’s design optimizes for usability and performance. I think it would be fun to work on a language that aggressively optimizes for correctness. This probably means a language with a built-in proof assistant and formal verification capabilities (or something that statically enforces coverage of all corner cases). I haven’t thought too deeply about it, but I have used various proof assistants before and they have always felt like writing assembly to me in their level of tedium. Perhaps that’s inherent to the domain, but I’d be interested in exploring what a clean-room proof assistant with a focus on usability would look like. Then again, there’s nothing really preventing us from bolting that kind of technology onto Julia to supplement or replace test coverage for particularly critical code. Perhaps language designers are just doomed to keep working on the same language until legacy considerations make it impossible to change anything.
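For readers who have never seen one, here is a tiny taste of what proof-assistant code looks like, written in Lean 4 purely as an illustration (the answer above does not single out any particular tool):

    -- A minimal Lean 4 proof: addition on the natural numbers is commutative.
    -- Rather than testing a few inputs, the statement is established for all a and b,
    -- here by appealing to a lemma from the standard library.
    theorem add_comm_example (a b : Nat) : a + b = b + a := by
      exact Nat.add_comm a b

Real verification work quickly becomes far more intricate than this, which is part of why it can feel as tedious as described above.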

Open Source (Keno Fischer Quora Session)


Why do you support open source software and open collaboration?

I consider open source software a public and social good, similar to scientific research papers, and I think access to both should be available as widely as possible to spur innovation. However, I also think there are a lot of practical reasons to prefer open source, even if you couldn’t care less about the social good argument:

  • Complexity. Modern software stacks are unfathomably complex and it is fairly common to have to hunt bugs through four or five layers of abstraction. If each of these layers is a separate, closed source project, you’ll spend years of your life exchanging emails with your vendor’s support engineers trying to get things debugged. I think, from a practical perspective, this is probably among the top reasons I refuse (except in very, very rare circumstances) to work with closed source software that hasn’t been reverse engineered. The lost productivity and frustration are just not worth it.

  • Collective Action. Say I sell you some hardware widget. It needs some software to make it work, but I have barely any budget left to get it working at all, let alone get all the niceties in there like “security”, “customizability” or even the ability to do maintenance on the software (it’s usually just a source tree on some developer’s computer, one folder per device version that got shipped – with things like CI, version control, static analysis being but a pipe dream). This is essentially the situation we have with a lot of embedded devices at the moment. One of the scariest instances of these is baseboard management controllers (BMCs). If you’re not familiar with them, they are small chips (usually ARM cores these days) that come with enterprise motherboards and are basically the most privileged chip in the system (more privileged than the host CPU even). Depending on the setup, they can generally read main memory, access the PCIe bus, read network traffic, flash your BIOS or just take over the machine entirely. And yet, the software that comes on these things tends to be decade-old versions of Linux, hacked together with SDKs for proprietary drivers that are full of problems, Web servers that are full of security holes (did I mention that by default, if you plug a network cable into the primary ethernet port, the BMC will become available on the network with a default password?), and software that is so buggy you’re going to waste hours just trying to use the thing to reinstall the operating system on your server, because it keeps losing the remote mount. This is the kind of thing that keeps me up at night. It’s understandable how this kind of thing happens though. The motherboard vendors don’t have a lot of incentive to make this stuff actually work super well. They’ve sold you the board, they’ve ticked the BMC box in the marketing materials, and some of the features even kind of work. If you’re lucky you may even get a security fix every other year when somebody rediscovers the existence of these chips. Open source can help fix this problem by providing a venue for people to come together and solve this as a collective problem. People who buy these boards do care about security and usability and fixing bugs, but very few organizations have enough servers that it would make sense for them to build their entire own BMC stack (you’d have to be at the scale of Facebook or Google). But if it’s open source, you can easily imagine somebody submitting a patch for some corner case bug they have, or an organization that still has a few thousand previous generation servers lying around taking on the (relatively small) incremental maintenance burden of putting out images with updated security fixes for those machines. It’s also good for vendors, because they can take advantage of shared infrastructure. Instead of hacking together something for every new device iteration, they can just submit the new device model upstream and use the standard process to get binaries out. No more bespoke, per-chip directories. In the BMC world, this has finally been happening in the last few years with the OpenBMC project (openbmc/openbmc), but this kind of thing is a pattern that happens over and over. Open source is a way to coordinate solutions to collective action problems.

  • Security. I touched upon it a little bit in the previous answer, but let me just reiterate the point. If the code is open, anybody can look at it and, if they spot a security bug, submit a patch and quickly get it fixed. Now, this isn’t perfect. Spotting security bugs is hard and not a lot of people are good at it, but at least it is possible. There is something that’s been happening in the last few years, however, that strengthens this argument quite a bit. We’re seeing more and more powerful tools for automatic detection of security and other problems (fuzzing, static analysis, etc.). If the source code is available, it is quite easy for third parties (e.g. academics working on new such techniques) to rapidly apply these techniques to a large corpus of important real world code.

  • Community. Open source communities are a place for people with overlapping problems to meet that wouldn’t otherwise talk to each other. We’ve seen this happen a lot in the Julia community. Since everybody is talking in terms of the same concepts (Julia code and abstractions), we get very high quality discussions between people from very different fields (e.g. scientific application developers and computer scientists working close to the hardware). Magic can happen when you get these different perspectives into the same room together and it’s not an infrequent occurrence in the Julia community that entire new research directions get discovered by people chatting in the issue tracker or the discussion forum.

  • Education. Reading the source code of important open source projects (compilers, operating systems, Web browsers) can be an extremely enlightening way to understand how things really work under the hood. These code bases are testaments to the complexity of the real world and include all the details that are often skipped over in textbooks because they would distract from the main point (but turn out to be fairly important in practice). In many code bases (or in the commit log) you’ll often also be able to find out why something was done the way it is, which can give a lot of insight that is hard to convey any other way. Now, why should companies care about this? Well, imagine having to hire somebody to work on software you use a lot. If the software is open source, chances are you’ll be able to find people who have at least read parts of the code base and will be able to hit the ground running immediately (disclaimer: this works better for clean and well documented code).

Overall, I don’t think any of this is too controversial. The tech industry is basically built on top of open source; it is impossible to get anything meaningful done without touching open source software. There are problems that go along with it: it can sometimes be hard to get support for open source software, there isn’t always a clear vision for where a project should go, the different people involved may have different goals, and often the maintenance burden is not adequately carried by those who benefit most. I think there is some room for innovation here, but closed source is not the answer.

Supercomputing with Julia (Keno Fischer Quora Session)


What was it like to run code on some of the world’s largest supercomputers?

(This is retold from memory, some of the details might be slightly off).

The date is April 14th, 2017. For most of the past month, a distributed team of engineers from UC Berkeley, Lawrence Berkeley National Labs, Intel, MIT and Julia Computing has been working day and night on the codebase of the Celeste project. The goal of the Celeste application is to crunch 55,000 GB (i.e. 55 TB) of astronomical images and produce the most up-to-date catalog of astronomical objects (stars and galaxies at that point), along with precise estimates (and, uniquely, well-defined uncertainty estimates) of their positions, shapes and colors. Our goal was ambitious. We’d apply variational inference to a problem four orders of magnitude larger than previously attempted. At peak, there’d be 1.5 million separate threads running on over 9,000 separate machines. And we’d attempt to do all of this not in a traditional HPC programming language like C++ or Fortran, but in Julia, a high-level dynamic programming language with all of its dynamic features, including garbage collection.

A year prior, Celeste had required several minutes to process even a single light source. Through improvements and some basic optimization, we’d gotten this down to a few seconds. That was already almost good enough for the current data set we had (at this performance level, a few hundred machines would be able to crunch through the data set in a few days), but not to demonstrate that we could handle the data coming from the next generation telescope (LSST). For that we wanted to demonstrate that we could scale to the entire machine and, more importantly, make good use of it. To demonstrate this capability, we’d set ourselves a goal: scale the application to the entirety of the Cori supercomputer at LBNL (at the time the world’s fifth largest computer) and perform at least 10^15 double precision floating point operations per second (one petaflop per second, or about 1/10th of the absolute best performance ever achieved on this machine by code that just multiplies some giant matrices). If we managed to achieve this, we would submit our result as an entry for consideration for the annual Gordon Bell Prize. The deadline for this entry was the evening of April 15th (a Saturday).

Against this background, I found myself coming to the office (at the time a rented desk across the street from Harvard Business School in Boston) on April 14th. My collaborators were doing the same. In consideration of the upcoming Gordon Bell deadline, the system administrators at LBNL had set aside the entire day to allow different groups to run at the scale of the full machine (generally the machine is shared among many much smaller jobs – it is very rare to get the whole machine to yourself). Each group was allocated an hour and we’d be up second. We took a final look at the plan for the day: which configurations we’d run (we’d start with the full scale run for 15-20 minutes to prove scalability and then kick off several smaller runs in parallel to measure how our system scaled with the number of nodes), who’d be responsible for submitting the jobs, looking at the output, collecting the results, etc. We weren’t sure what would happen. At the beginning of the week we had been able to do a smaller scale run on a thousand nodes. At the time our extrapolated performance was a few hundred teraflops. Not enough. After a week of very little sleep, I’d made some additional enhancements: a lot of general improvements to the compiler, enabling the compiler to call into a vendor-provided runtime library for computing the exponential function in a vectorized fashion, and a complete change to the way we were vectorizing the frequently executed parts of the code (which consisted of a few thousand lines at the source level). I had tested these changes on a few hundred nodes and things looked promising, but we had never tested them on anything close to the scale we were attempting to run at.
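To give a flavor of what that vectorization work involves (a simplified sketch, not the actual Celeste code): in Julia, annotating a hot loop so the compiler knows the iterations are independent lets it emit SIMD instructions, and a vectorized math library can then handle calls such as exp in batches.

    # A simplified sketch of vectorizing an exp-heavy inner loop.
    # @inbounds removes bounds checks and @simd asserts the iterations are
    # independent, so the compiler is free to emit SIMD instructions
    # (and, with a suitable vector math library, batched calls to exp).
    function vexp!(out::Vector{Float64}, x::Vector{Float64})
        @inbounds @simd for i in eachindex(x, out)
            out[i] = exp(x[i])
        end
        return out
    end

    x = rand(10_000)
    y = similar(x)
    vexp!(y, x)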

A few minutes before noon (which was our designated starting time) we got word that the previous group was about to wrap up. We gathered on a video call anxiously watching a chat window for the signal that the machine was released to us and just like that, it was “Go. Go. Go.” time. With the stroke of a button, the script was launched and 9,000 machines in the lower levels of the NERSC building in the Berkeley hills roared into action. More than 3MW of power, the equivalent power consumption of a small village, kicked into action to keep these machines supplied.

After an anxious few minutes, someone called out on the video call: “Something’s wrong. We’re not getting any results.” We were seeing threads check in that they were ready to start processing data, but no data processing ever happened. “Alright, kill it and try again. We’ll look through the logs and figure out what happened.” After a few minutes, while the second run was starting up, it had become clear that three of the 1.5 million threads had crashed at startup. Worse, the log entries looked something like:

signal (11): Segmentation fault: 11
unknown function (ip: 0x12096294f)
unknown function (ip: 0xffffffffffffffff)

Not only had it crashed, but the crash had corrupted the stack and there was no indication what had caused the problem. “If this also crashes, what do we do?” “We can go back to the version from Monday, but we know it’s not fast enough.” “There’s no way we’ll be able to debug this before our time slot runs out. And even if we could, we just can’t have the machine sitting idle while we do.” “Go see if the next group is ready to run. If so, we’ll try to reclaim our remaining time at the end of the day”. And so, after 20 minutes on the machine we released it to the next group and set to debugging. If we couldn’t figure it out we’d run the Monday version.

We had about five hours to figure out what the problem was. About an hour and a half in, through some guesswork and careful examination of the logs, we were able to correlate the hex numbers in the stack trace to assembly instructions in the binary. None of the three crashes were in the same place, but a pattern emerged: two of the locations pointed right after calls to the new vendor library we had enabled. A hastily sent email to the vendor’s engineering team was met with disbelief: the library was frequently used in applications with up to tens or hundreds of thousands of threads. On the first call from each thread, the library would dynamically adjust itself depending on the CPU it was running on.

Looking at the disassembly, this code was clearly designed to be thread safe, and we couldn’t see any obvious errors. Nevertheless, the pattern fit. The library did its own stack manipulation (since it was designed not to use the stack in most cases, which would explain the unwinder’s inability to give us a good backtrace), and the crash happened very early in the program before any data was processed (which would be consistent with being in the initialization routine). Through some very careful binary surgery, we patched out the initialization routine, hardcoded the correct implementation, and did a small scale test run on a testbed system (after all, the main system was currently serving other groups). Nothing crashed, but that didn’t mean much – the scale of the testbed didn’t even reach the scale of our test runs.

We got the machine back at the end of the day as scheduled. A binary had been built with our hack and we were ready for one last Hail Mary attempt. Thirty minutes on the clock until the machine was returned to regular use. Once again, the 9,000 nodes roared into action, but this time – it worked. Within a minute our performance metric showed that not only was it working, but we had indeed hit our goal – we clocked in at 1.54 petaflops. There was audible relief all around on the call. With these results in hand, we were able to negotiate an additional, smaller scale slot the next day to run our scaling experiments, and we still had to finish writing the paper afterwards, but the hard part was over.

I hope this account gives some insight into the experience. The Celeste project was one of the most ambitious projects I’ve ever been a part of, trying to push the envelope in so many different ways: statistical methods, parallel computing, programming languages, astronomy. Many improvements to Julia over the two years following this run were the direct result of experience gathered trying to make this happen – it’d be much easier this time around after all these improvements. A few weeks later I got an email that there had indeed been a bug in the initialization routine, which had now been fixed. The fact that nobody else had ever run into it was probably merely a question of scale. At 1.5 million running threads, even one-in-a-million can happen every single time. In the end running on a supercomputer is just like running on a regular computer, except a million times larger and a million times more stressful.