A New Deep Learning Theory: Compressing Noisy Data While Keeping Relevant Data


Naftali Tishby, a computer scientist and neuroscientist at the Hebrew University of Jerusalem, recently presented a theory of how neural networks learn, and it sounds intuitively right.

The theory builds on a concept described in 1999, the “information bottleneck,” which holds that a network rids noisy input data of extraneous details as if by squeezing the information through a bottleneck, retaining only the features most relevant to general concepts.

The notion of relevant information seems at odds with Shannon’s information theory, in which all information can be coded in 0s and 1s with no semantic meaning attached.

Neural networks, though, identify and learn using this very concept of information relevance. In a 1999 paper, Tishby and co-authors Fernando Pereira, now at Google, and William Bialek, now at Princeton University, formulated it as a mathematical optimization problem: if X is a complex data set, like the pixels of a car photo, and Y is a simpler variable represented by those data, like the word “car,” you can capture all the “relevant” information in X about Y by compressing X as much as you can without losing the ability to predict Y.
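
In symbols, this optimization problem is usually written as a trade-off between two mutual-information terms. The line below is a sketch in a common notation, not a quotation from the paper: T stands for the compressed representation of X, I(·;·) for mutual information, and β for a positive trade-off parameter. The symbol names T and β are notational assumptions made here.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

A small I(X;T) means the input has been compressed heavily; a large I(T;Y) means the label can still be predicted from what remains. A larger β favors prediction over compression.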

As described by Wired: “The basic algorithm used in the majority of deep-learning procedures to tweak neural connections in response to data is called ‘stochastic gradient descent’: Each time the training data are fed into the network, a cascade of firing activity sweeps upward through the layers of artificial neurons.”

So when the signal reaches the top layer, the final firing pattern can be compared to the correct label for the image—1 or 0, “car” or “no car.” Any differences between this firing pattern and the correct pattern are “back-propagated” down the layers, meaning that, like a teacher correcting an exam, the algorithm strengthens or weakens each connection to make the network layer better at producing the correct output signal. Over the course of training, common patterns in the training data become reflected in the strengths of the connections, and the network becomes expert at correctly labeling the data, such as by recognizing a car, a word, or a 1.
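
To make the update rule concrete, here is a minimal sketch of stochastic gradient descent on a single sigmoid neuron. It is a simplification, not a deep network: with only one layer there is nothing to back-propagate through, so the error adjusts the connection strengths directly. The toy data, learning rate, epoch count, and function names are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(samples, epochs=200, lr=0.5, seed=0):
    """Train one sigmoid neuron with stochastic gradient descent."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # "stochastic": visit the examples in random order
        for features, target in data:
            output = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
            # Gradient of the cross-entropy loss with respect to the neuron's
            # pre-activation: the difference between output and correct label.
            error = output - target
            # Strengthen or weaken each connection in proportion to its input.
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 ("car") or 0 ("no car") for one example."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical two-feature examples: (features, label), label 1 = "car".
toy_data = [((0.9, 0.1), 1), ((0.8, 0.3), 1), ((0.2, 0.7), 0), ((0.1, 0.9), 0)]
weights, bias = train_sgd(toy_data)
predictions = [predict(weights, bias, features) for features, _ in toy_data]
```

In a deep network the same error signal is passed backward through every layer by the chain rule, which is the back-propagation step the passage describes.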

Deep Learning: Theory, Algorithms, and Applications. Berlin, June 2017


In their experiments, Tishby and his student Ravid Shwartz-Ziv tracked how much information each layer of a deep neural network retained about the input data and how much information each one retained about the output label.
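
One way to picture this kind of measurement is to sort a layer's activations into bins and then estimate mutual information from counts. The sketch below assumes that approach and is not presented as the researchers' exact procedure. The recorded activations, the bin count, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def discretize(activation, n_bins=4, low=-1.0, high=1.0):
    """Map each activation value to a bin index so the layer state is discrete."""
    width = (high - low) / n_bins
    return tuple(min(n_bins - 1, max(0, int((a - low) / width))) for a in activation)


def empirical_mi(pairs):
    """Plug-in estimate of I(A;B) in bits from a list of (a, b) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical recordings: (input id, activations of one hidden layer, label).
records = [
    (0, (0.9, -0.8), 1),
    (1, (0.7, -0.6), 1),
    (2, (-0.8, 0.9), 0),
    (3, (-0.6, 0.7), 0),
]

states = [discretize(activation) for _, activation, _ in records]
info_about_input = empirical_mi([(r[0], t) for r, t in zip(records, states)])
info_about_label = empirical_mi([(t, r[2]) for r, t in zip(records, states)])
```

In this toy case the four distinct inputs carry 2 bits, while the binned layer keeps 1 bit about the input and the full 1 bit about the label: it has discarded detail that does not help predict the label. Estimates of this kind depend on the bin width and on how many samples are available.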

“The scientists found that, layer by layer, the networks converged to the information bottleneck theoretical bound: a theoretical limit derived in Tishby, Pereira and Bialek’s original paper that represents the absolute best the system can do at extracting relevant information. At the bound, the network has compressed the input as much as possible without sacrificing the ability to accurately predict its label.

“Tishby and Shwartz-Ziv also made the intriguing discovery that deep learning proceeds in two phases: a short ‘fitting’ phase, during which the network learns to label its training data, and a much longer ‘compression’ phase, during which it becomes good at generalization, as measured by its performance at labeling new test data.”

Source: ACM Tech News, Wired

Photo Credits: Gene Kogan – t-SNE flowers* / Flickr CCM

*“A gridded 2d visualization of flowers, clustered by similarity using t-SNE. each of the images are first encoded into 4096-bit feature vector derived from the activations of the last fully-connected layer in a convolutional neural network. Then the features are reduced to 2d using t-SNE dimensionality reduction technique, and then gridded using RasterFairy.”