By: Tom Breloff
Re-posted from: http://www.breloff.com/Efficiency-is-key/
The human brain is intensely complicated. Memories, motor sequences, emotions, language, and more are all maintained and enacted solely through the temporary and fleeting transfer of energy between neurons: the slow release of neurotransmitters across synapses, dendritic integration, and finally the somatic spike. A single spike (somatic action potential) will last a small fraction of a second, and yet somehow we are able to swing a baseball bat, compose a symphony, and apply memories from decades in the past. How can our brain be based on signals of such short duration, and yet work on such abstract concepts stretching vast time scales?
In this blog, I hope to lay out some core foundations and research in computational neuroscience and machine learning which I feel will comprise the core components of an eventual artificially intelligent system. I’ll argue that rate-based artificial neural networks (ANNs) have limited power, partially due to the removal of the important fourth dimension: time. I also hope to highlight some important areas of research which could help bridge the gap from “useful tool” to “intelligent machine”. I will not give a complete list of citations, as that would probably take me longer to compile than writing this blog, but I will occasionally mention references which I feel are important contributions or convey a concept well.
These views are my own opinion, formed after studying many related areas in computational neuroscience, deep learning, reservoir computing, neuronal dynamics, computer vision, and more. This personal study is heavily complemented with my background in statistics and optimization, and 25 years of experience with computer programming, design, and algorithms. Recently I have contributed MIT-licensed software for the awesome Julia programming language. For those in New York City, we hope to see you at the next meetup!
See my bio for more information.
What is intelligence?
What does it mean to be intelligent? Are dogs intelligent? Mice? What about a population of ants, working together toward a common goal? I won’t give a definitive answer, and this is a topic which easily creates heated disagreement. However, I will roughly assume that intelligence involves robust predictive extrapolation/generalization into new environments and patterns using historical context. As an example, an intelligent agent would predict that they would sink through mud slowly, having only experienced the properties of dirt and water independently, while a Weak AI system would likely say “I don’t know… I’ve never seen that substance” or worse: “It’s brown, so I expect it will be the same as dirt”.
Intelligence need not be human-like, though that is the easiest kind to understand. I foresee intelligent agents sensing traffic patterns throughout a city and controlling stoplight timings, or financial regulatory agents monitoring transaction flow across markets and continents to pinpoint criminal activity. In my eyes, these are sufficiently similar to a human brain which senses visual and auditory inputs and acts on the world through body mechanics, learning from the environment and experience as necessary. While the sensorimotor components are obviously very different between these examples and humans, the core underlying (intelligent) algorithms may be surprisingly similar.
Some background: Neural Network Generations
I assume you have some basic understanding about artificial neural networks going forward, though a quick Google search will give you plenty of background for most of the research areas mentioned.
There is no clear consensus on the generational classification of neural networks. Here I will take the following views:
- First generation networks employ a thresholding (binary output) activation function. An example is the classic Perceptron.
- Second generation networks employ (mostly) continuously differentiable activation functions (sigmoid, tanh, or alternatives such as the Rectified Linear Unit (ReLU) and variants, which have discontinuities in the derivative at zero), following an inner product of weights and inputs to a neuron. Some have transformational processing steps like the convolutions and pooling layers of CNNs. Most ANNs today are second generation, including impressively large and powerful Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Learning (DL), Neural Turing Machines (NTM), and much, much more.
- Third generation networks add a fourth dimension (time), and are likely built using an “energy propagation” mechanism which converts and accumulates inputs over time, propagating a memory of inputs through the network. Examples include Spiking Neural Networks (SNN), Liquid State Machines (LSM), and Hierarchical Temporal Memory (HTM).
The first two generations of networks are static, in the sense that there is no explicit time component. Of course, they can be made to represent time through additional structure (such as in RNNs) or transformed inputs (such as appending lagged inputs as in ARIMA-type models). Network dynamics can be changed through learning, but that structure must be explicitly represented by the network designer.
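To make the distinction concrete, here is a minimal sketch of one unit from each generation. The names (`perceptron_unit`, `sigmoid_unit`, `LeakyIntegrateAndFire`) and the leak and threshold values are illustrative assumptions, and the leaky integrate-and-fire unit is only one simple example of a neuron that accumulates input over time.

```python
import math

def perceptron_unit(inputs, weights, threshold=0.0):
    """First generation: hard threshold on the weighted sum (binary output)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def sigmoid_unit(inputs, weights):
    """Second generation: smooth activation of the same inner product."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

class LeakyIntegrateAndFire:
    """Third generation (one simple flavour): state persists between steps."""

    def __init__(self, leak=0.9, threshold=1.0):
        self.leak = leak            # fraction of potential kept each step (assumed)
        self.threshold = threshold  # potential needed to emit a spike (assumed)
        self.potential = 0.0

    def step(self, input_current):
        # Decay the stored potential, then add the new input.
        self.potential = self.leak * self.potential + input_current
        if self.potential >= self.threshold:
            self.potential = 0.0    # reset after spiking
            return 1
        return 0

neuron = LeakyIntegrateAndFire()
spikes = [neuron.step(0.4) for _ in range(6)]
print(spikes)  # [0, 0, 1, 0, 0, 1]
```

The first two units give the same answer every time they see the same input. The third turns a constant sub-threshold input into a spike train, because its output depends on what it has accumulated so far.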
In the last few years, there have been incredible advances in the expressive power of second generation networks. Networks have been built which can approach or surpass human ability in object recognition, language translation, pattern recognition, and even creativity. While this is impressive, most second generation networks have problems of fragility and scalability. A network with hundreds of millions of parameters (or more) requires tons of computing power and labeled training samples to effectively learn its goal (such as this awesome network from OpenAI’s Andrej Karpathy). This means that massive labeled data sets and massive compute power are both required to create useful networks (which is why Google, Facebook, Apple, etc. are the companies currently winning this game).
I should note that none of the network types that I’ve listed are “brain-like”. Most have only abstract similarities to a real network of cortical neurons. First and second generation networks roughly approximate a “rate-based” model of neural activity, which means the instantaneous mean firing rate of a neuron is the only output, and the specific timings of neuronal firings are ignored. Research areas like Deep Reinforcement Learning are worthwhile extensions to ANNs, as they get closer to the required brain functionality of an agent which learns through sensorimotor interaction with an environment. However, the current attempts do not come close to the varied dynamics found in real brains.
SNN and LSM networks incorporate specific spike timing as a core piece of their models. However, they still lack the functional expressiveness of the brain: dendritic computation, chemical energy propagation, synaptic delays, and more (which I hope to cover in more detail another time). In addition, the added complexity makes interpretation of the network dynamics difficult. HTM networks get closer to “brain-like” dynamics than many other models, but the choice of binary integration and binary outputs is a questionable trade-off for many real world tasks, and it’s easy to wonder if they will beat finely tuned continuously differentiable networks in practical tasks.
More background: Methods of Learning
There are two core types of learning: supervised and unsupervised. In supervised learning, there is a “teacher” or “critic”, which gives you input, compares your output to some “correct answer”, and gives you a numerical quantity representing your error. The classic method of learning in second generation networks is to use the method of backpropagation to project that error backwards through your network, updating individual network parameters based on the contribution of that parameter to the resulting error. The requirement of a (mostly) continuously differentiable error function and network topology is critical for backpropagation, as it uses a simple calculus trick known as the Chain Rule to update network weights. This method works amazingly well when you have an accurate teacher with lots of noise-free examples.
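As a rough illustration of the Chain Rule at work, the sketch below trains the smallest network that still has something to propagate backwards: one hidden sigmoid unit feeding one output sigmoid unit, with a squared error. The function names, starting weights, learning rate, and target are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # network output
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    """Return (dE/dw1, dE/dw2) by applying the chain rule from output to input."""
    h, y = forward(w1, w2, x)
    dE_dy = y - target
    dE_dz2 = dE_dy * y * (1.0 - y)   # back through the output sigmoid
    dE_dw2 = dE_dz2 * h
    dE_dh = dE_dz2 * w2              # error projected back onto the hidden unit
    dE_dz1 = dE_dh * h * (1.0 - h)   # back through the hidden sigmoid
    dE_dw1 = dE_dz1 * x
    return dE_dw1, dE_dw2

def train(x, target, w1=0.5, w2=0.5, learning_rate=0.5, steps=500):
    for _ in range(steps):
        g1, g2 = backprop(w1, w2, x, target)
        w1 -= learning_rate * g1     # each weight moves against its own
        w2 -= learning_rate * g2     # contribution to the error
    return w1, w2

w1, w2 = train(x=1.0, target=0.9)
print(loss(0.5, 0.5, 1.0, 0.9), "->", loss(w1, w2, 1.0, 0.9))
```

Every line in `backprop` multiplies by one more local derivative, which is why each piece of the network needs to be (mostly) differentiable for this to work.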
However, with sufficient noise in your data or error, or inadequate training samples, ANNs are prone to overfitting (or worse). Techniques such as Early Stopping or Dropout go a long way to avoid overfitting, but they may also restrict the expressive power of neural nets in the process. Much research has gone into improving gradient-based learning rules, and advancements like AdaGrad, RMSProp, AdaDelta, Adam, and (my personal favorite) AdaMax have helped considerably in speeding the learning process. Finally, a relatively recent movement of Batch Normalization has improved the ability to train very deep networks.
With too few (or zero) “correct answers” to compare to, how does one learn? How does a network know that a picture of a cat and its mirror image represent the same cat? In unsupervised learning, we ask our network to compress the input stream to a reduced (and ideally invariant) representation, so as to reduce the dimensionality. Thus the mirror image of a cat could be represented as “cat + mirror” so as not to duplicate the representation (and without also throwing away important information). In addition, the transformed input data will likely require a much smaller model to fit properly, as correlated or causal inputs can be reduced to smaller dimensions. Thus, the reduced dimensionality may require fewer training examples to train an effective model.
For linear models, statisticians and machine learning practitioners will frequently employ Principal Component Analysis (PCA) as a data preprocessing step, in an attempt to reduce the model complexity and available degrees of freedom. This is an example of simple and naive unsupervised learning, where relationships within the input data are revealed and exploited in order to extract a dataset which is easier to model. In more advanced models, unsupervised learning may take the form of Restricted Boltzmann Machines or Sparse Autoencoders. Convolution and pooling layers in CNNs could be seen as a type of unsupervised learning, as they strive to create partially translation-invariant representations of the input data. Concepts like Spatial Transformer Networks, Ripple Pond Networks, and Geoff Hinton’s “Capsules” are similar transformative models which promise to be interesting areas of further research.
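For a feel of what PCA does as a preprocessing step, here is a minimal sketch that finds only the first principal component. It uses power iteration on the covariance matrix, which is a stand-in chosen to keep the example dependency-free (a real project would more likely call a linear algebra library). The data set and function names are hypothetical.

```python
import math

def first_principal_component(rows, iterations=50):
    """Return (column means, unit vector along the direction of greatest variance)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue, provided
    # the starting vector is not orthogonal to it.
    vector = [1.0] * dims
    for _ in range(iterations):
        vector = [sum(cov[i][j] * vector[j] for j in range(dims))
                  for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    return means, vector

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(means)))
            for row in rows]

# Hypothetical data: two inputs that are almost perfectly correlated (y is roughly 2x).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, component = first_principal_component(data)
reduced = project(data, means, component)  # one number per row instead of two
print(component, reduced)
```

Because the two inputs move together, a single coordinate captures almost all of the variation, and a downstream model has one input to fit instead of two.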
After transforming the inputs, typically a smaller and simpler model can be used to fit the data (possibly a linear or logistic regression). It has become common practice to combine these steps, for example by using Partial Least Squares (PLS) as an alternative to PCA + Regression. In ANNs, weight initialization using sparse autoencoders has helped to speed learning and avoid local minima. In reservoir computing, inputs are accumulated and aggregated over time in the reservoir, which allows for relatively simple readout models on complex time-varying data streams.
Back to the (efficient) Future
With some background out of the way, we can continue to explore why the third generation of neural networks holds more promise than current state of the art: efficiency. Algorithms of the future will not have the pleasure of sitting in a server farm and crunching calculations through a static image dataset, sorting cats from dogs. They will be expected to be worn on your wrist in remote jungles monitoring for poisonous species, or guiding an autonomous space probe through a distant asteroid field, or swimming through your blood stream taking vital measurements and administering precise amounts of localized medicine to maintain homeostasis through illness.
Algorithms of the future must perform brain-like feats, extrapolating and generalizing from experience, while consuming minimal power, sporting a minimal hardware footprint, and making complex decisions continuously in real time. Compared to the general computing power which is “the brain”, current state of the art methods fall far short in generalized performance, and require much more space, time, and energy. Advances in data and hardware will only improve the situation slightly. Incremental improvements in algorithms can have a great impact on performance, but we’re unlikely to see the gains in efficiency we need without drastic alterations to our methods.
The human brain is super-efficient for a few reasons:
- Information transfer is sparse, and carries high information content. Neurons spike only when needed: to maintain a motor command in working memory, transfer identification of a visual cue, or identify an anomalous sound in your auditory pathway.
- Information is distributed and robustly represented. Memories can be stored and recalled amid neuron deaths and massive noise. (I recommend the recent works by Jeff Hawkins and Subutai Ahmad to understand the value of a Sparse Distributed Representation)
- Information transfer is allowed to be slow. Efficient chemical energy is used whenever possible. The release of neurotransmitters across synaptic channels and the opening of ion channels for membrane potentiation is a much slower but efficient method of energy buildup and transfer. Inefficient (but fast) electrical spikes are only used when information content is sufficiently high and if the signal must travel quickly across a longer distance.
- Components are reused within many contexts. A single neuron may be part of hundreds of memories and invariant concepts. Dendritic branches may identify many different types of patterns. Whole sections of the neocortex can be re-purposed when unused (for example, in the brain of a blind person, auditory and other cognitive processing may utilize the cortical sections normally dedicated to vision, thus heightening their other senses and allowing for deeper analysis of non-visual inputs)
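The robustness of sparse, distributed storage in the list above can be sketched in a few lines. This is a toy illustration and not the HTM implementation: the vector size (2048), the number of active bits (40), and the amount of corruption are assumptions picked for the example.

```python
import random

def random_sdr(rng, size=2048, active=40):
    """A sparse pattern, stored as the set of indices of its active bits."""
    return frozenset(rng.sample(range(size), active))

def corrupt(rng, sdr, size=2048, dropped=12, added=12):
    """Simulate neuron death (dropped bits) plus noise (random extra bits)."""
    survivors = rng.sample(sorted(sdr), len(sdr) - dropped)
    noise = rng.sample(range(size), added)
    return frozenset(survivors) | frozenset(noise)

def best_match(query, memory):
    """Index of the stored pattern sharing the most active bits with the query."""
    return max(range(len(memory)), key=lambda i: len(query & memory[i]))

rng = random.Random(0)
memory = [random_sdr(rng) for _ in range(100)]
noisy = corrupt(rng, memory[7])
recalled = best_match(noisy, memory)
print(recalled, len(noisy & memory[7]))
```

Two unrelated random patterns of this sparsity share less than one active bit on average, so a pattern that has lost 30% of its bits and gained as many noisy ones still overlaps its original far more than anything else in memory.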
Side note: I highly recommend the book “Principles of Neural Design” by Peter Sterling and Simon Laughlin, as they reverse-engineer neural design, while keeping the topics easily digestible.
The efficiency of time
Morse Code was developed in the 1800s as a means of communicating over distance using only a single bit of data. At a given moment, the bit is either ON (1) or OFF (0). However, when the fourth dimension (time) is added to the equation, that single bit can be used to express arbitrary language and mathematics. Theoretically, that bit could represent anything in existence (with infinitesimally small intervals and/or an infinite amount of time).
The classic phrase “SOS”, the international code for emergency, is represented in Morse Code by a sequence of “3 dots, 3 dashes, 3 dots” which could be compactly represented as a binary sequence:
10101 00 11011011 00 10101
Here we see that we can represent “SOS” with 1 bit over 22 time-steps, or equivalently as a static binary sequence with 22 bits. By moving storage and computation from space into time, we drastically change the scope of our problem. Single-bit technologies (for example, a ship’s smoke stack, or a flash light) can now produce complex language when viewed through time. For a given length of time T, and time interval dT, a single bit can represent N (= T / dT) bits when viewed as a sequence through time.
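The encoding above can be reproduced with a short sketch. It follows the simplified scheme used in the sequence shown here (a dot is one ON step, a dash is two, one OFF step between marks, two between letters) rather than standard Morse timing, where a dash lasts three units. The `MORSE` table and `encode` function are hypothetical names and cover only the two letters needed.

```python
MORSE = {"S": "...", "O": "---"}  # only the letters needed for this example

def encode(message, table=MORSE):
    """Turn a message into the ON/OFF values of a single bit over time."""
    letters = []
    for letter in message:
        # dot -> one ON step, dash -> two ON steps
        pulses = ["1" if mark == "." else "11" for mark in table[letter]]
        letters.append("0".join(pulses))   # one OFF step between marks
    return "00".join(letters)              # two OFF steps between letters

signal = encode("SOS")
print(signal)       # 1010100110110110010101
print(len(signal))  # 22 time-steps, i.e. N = T / dT with T = 22 and dT = 1
```

Read left to right, the string is what a single lamp or smoke stack does over 22 time-steps. Stored all at once, the same string is a 22-bit static value.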
Moving storage and representation from space into time will allow for equivalent representations of data at a fraction of the resources.
Brain vs Computer
Computers (the Von Neumann type) can, in theory, compute anything that is computable, since they are Turing-complete. However, they have major flaws of inefficiency:
- System memory is moved too coarsely. Memory works in bytes, words, and pages, often requiring excessive storage or memory transfer (for example, when you only need to read 1 bit from memory)
- Core data is optimized for worst case requirements. Numeric operations are generally optimized using 64-bit floating point, which is immensely wasteful, especially when many times the value may be zero. (See Unums for an example effort in reducing this particular inefficiency)
- All processing is super-fast, even when a result is not needed immediately. Moving fast requires more energy than moving slow. There is only one speed for a CPU, thus most calculations use more energy than required.
- Memory can be used for only one value at a time. Once something “fills a slot” in memory, it is unavailable for the rest of the system. This is a problem for maintaining a vast long-term memory store, as generally this requires a copy to (or retrieval from) an external archive (hard disk, cloud storage, etc) which is slow to access and limited in capacity.
- Distributed/parallel processing is hard. There are significant bottlenecks in the memory and processing pipelines, and many current algorithms are only modestly parallelizable.
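The first two points in the list above can be made tangible with a small sketch that stores 1024 on/off values twice: once as 64-bit floats, and once packed one bit per value. The helper names are hypothetical, and bit-packing is used here only to show the size of the gap, not as a proposal for how brain-like hardware would store data.

```python
from array import array

def dense_float64_bytes(values):
    """Bytes used when every value is stored as a 64-bit float."""
    return len(values) * array("d").itemsize

def pack_bits(values):
    """Store one bit per value, eight values to a byte."""
    packed = bytearray((len(values) + 7) // 8)
    for index, value in enumerate(values):
        if value:
            packed[index // 8] |= 1 << (index % 8)
    return bytes(packed)

def read_bit(packed, index):
    # Even to read a single bit, a whole byte has to be fetched first.
    return (packed[index // 8] >> (index % 8)) & 1

activity = [1 if i % 50 == 0 else 0 for i in range(1024)]  # mostly zeros
packed = pack_bits(activity)
print(dense_float64_bytes(activity), "bytes as float64")   # 8192
print(len(packed), "bytes as packed bits")                  # 128
```

The packed form is 64 times smaller, yet `read_bit` still has to pull in a full byte to answer a one-bit question, which is the coarseness the first bullet describes.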
It quickly becomes clear that we need specialized hardware to approach brain-like efficiency (and thus specialized algorithms that can take advantage of this hardware). This hardware must be incredibly flexible, allowing for minimal bit usage, variable processing speed, and highly parallel computations. Memory should be holistically interleaved through the core processing elements, eliminating the need for external memory stores (as well as the bottleneck that is the memory bus). In short, the computers we need look nothing like the computers we have.
Sounds impossible… I give up
Not so fast! There is a long way to go until true AGI is developed, but technological progress tends to be exponential in ways impossible to predict. I feel that we’re close to identifying the core algorithms that drive generalized intelligence. Once identified, alternatives can be developed which make better use of our (inadequate) hardware. Once intelligent (but still inefficient) algorithms can be demonstrated to outperform the current swath of “weak AI” tools in a robust and generalized way, specialized hardware will follow. We live in a Renaissance for AI, where I expect exponential improvements and ground-breaking discoveries in the years to come.
There are several important areas of research which I feel will contribute to identifying those core algorithms comprising intelligence. I’ll highlight the importance of:
- time in networks of neural components. See research from Eugene Izhikevich et al, who is currently applying simulations of cortical networks towards robotics. His research into Polychronous Neuronal Groups (PNG) and applications to memory should open a more rigorous mathematical framework for studying spiking networks. It also supports the importance of delays in synaptic energy transfer as a core piece of knowledge and memory.
- non-linear integration within dendritic trees. There is much evidence that the location of a synapse within the dendritic tree of a neuron changes the final impact on somatic membrane potential (which in turn determines if/when a neuron will spike). Synapses closer to the soma contribute directly, while distal synapses may exhibit supralinear (coincidence detection) or sublinear (global agreement) integration. I believe that understanding the host of algorithms provided by the many types of dendritic structures will expand the generalizing ability of neural networks. These algorithms are likely the basis of prediction and pattern recognition with historical context. (See Jeff Hawkins’s On Intelligence for a high-level view of prediction in context, or Jakob Hohwy’s The Predictive Mind for a more statistical take.)
- diversity and specialization of components. The neocortex is surprisingly uniform given its vast range of abilities, yet it is composed of many different types of neurons, chemicals, and structures, all connected through a complex, recurrent, and highly collaborative network of hierarchical layers. The brain has many different components so that each can specialize to fulfill a specific requirement with the proper efficiency. A single type of neuron/dendrite/synapse in isolation cannot match the expressivity (with optimal efficiency) of a network with highly diverse and specialized components.
- local unsupervised learning. For the brain to make sense out of the massive amounts of sensory input data, it must be able to compress that data locally in a smart way without the guide of a teacher. Local learning likely takes the form of adding and removing synaptic connections and some sort of Hebbian learning (such as Spike-Timing Dependent Plasticity). However, in an artificial network, we have the ability to shape the network in ways which are more difficult for biology. Imagine neuronal migration which shifts the synaptic transmission delays between neurons, or re-forming of the dendritic tree structure. These are things which might happen in mammals over generations from evolutionary forces, but which may be powerful learning paradigms for us in a computer on more useful time scales.
- global reinforcement learning. Clearly at some point a teacher can help us learn. In humans we produce neurotransmitters, such as dopamine, which adjust the local learning rate on a global scale. This is why Pavlov had success in his experiments: the reward signal (food) had an impact on the whole dog, thus strengthening the recognition of all patterns which consistently preceded the reward. The consistent ringing of a bell allowed the neuronal connections between the sensory pattern “heard a bell” and the motor command “salivate” to strengthen. In this experiment, you may be thinking that this is not a trait we should copy: “The dumb dog gets tricked by the bell!” However, if the pattern is consistent, one is likely able to identify impending reward (or punishment) both more quickly and more efficiently by integrating alternative predictive patterns. If a bell always rings, how does one really know for sure what caused the reward? If the bell is then consistently rung without the presentation of food, the “bell to salivate” connections should subsequently be weakened. Learning is continuous and, in some sense, probabilistic. One global reward signal is an efficient way to adjust the learning, and true causal relationships should win out.
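The last two items in the list above, local Hebbian learning and a global reward signal, can be combined in one small sketch. It uses a pair-based Spike-Timing Dependent Plasticity rule with an exponential timing window, and multiplies the weight change by a single `reward` factor. The multiplication is an assumed, simplified stand-in for how a dopamine-like signal might scale local learning, and every constant here (`a_plus`, `a_minus`, `tau`, the weight bounds) is an illustrative choice.

```python
import math

def stdp_update(weight, t_pre, t_post, reward=1.0,
                a_plus=0.05, a_minus=0.055, tau=20.0,
                w_min=0.0, w_max=1.0):
    """Return the new synaptic weight after one pre/post spike pair.

    Times are in milliseconds. `reward` is a global factor shared by all
    synapses (an assumption made for this sketch).
    """
    dt = t_post - t_pre
    if dt > 0:
        # Pre fired shortly before post: the input helped predict the spike.
        change = a_plus * math.exp(-dt / tau)
    elif dt < 0:
        # Pre fired after post: the input could not have caused the spike.
        change = -a_minus * math.exp(dt / tau)
    else:
        change = 0.0
    # Keep the weight inside its allowed range.
    return min(w_max, max(w_min, weight + reward * change))

# "Bell" input spikes 5 ms before the "salivate" neuron on 20 rewarded trials.
weight = 0.2
for _ in range(20):
    weight = stdp_update(weight, t_pre=0.0, t_post=5.0, reward=1.0)
print(weight)
```

With `reward=0.0` the same spike pairs leave the weight unchanged, so in this sketch the timing rule decides which synapses are eligible for change and the global signal decides whether any change happens at all.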
The human brain is complex and powerful, and the neural networks of today are too inefficient to be the model of future intelligent systems. We must focus energy on new algorithms and network structures, incorporating efficiency through time in novel ways. I plan to expand on some of these topics in future posts, and begin to discuss the components and connectivity that could compose the next generation of neural networks.