Deep Learning & Natural Language: The Crime Scene Investigation Case


Researchers from the School of Informatics at the University of Edinburgh taught neural networks to analyse episodes of the TV show CSI and identify the perpetrator in each case. The research was aimed at enabling neural networks to solve a problem by assimilating information from images, audio, transcribed dialogue and scene descriptions.

The CSI dataset was built with an annotation system: three graduate students, proficient in English and none of them regular CSI viewers, watched each episode, which was paused every three minutes so they could select an answer to a question from the choices displayed underneath. The idea was to capture the process by which humans determine the perpetrator. Annotations from 39 episodes (comprising 59 cases) were used to build the dataset.

Annotation interface (first pass): After watching three minutes of the episode, the annotator indicates whether she believes the perpetrator has been mentioned. (1)

The core of the neural network model is a unidirectional long short-term memory (LSTM) network.

LSTMs are recurrent neural networks which can be used as building blocks (of hidden layers) for a larger recurrent neural network (2).

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition (2).
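The directed cycle described above can be made concrete with a few lines of code. The sketch below (not from the paper; dimensions and weights are illustrative placeholders) shows a vanilla RNN step: the new hidden state depends on the current input and on the previous hidden state, which is how the network carries internal memory across an arbitrary-length sequence.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)

# Parameters of a vanilla RNN cell (untrained placeholders).
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden: the directed cycle
b_h = np.zeros(hidden_dim)

def rnn_step(x, h_prev):
    """One recurrence step: the new state mixes the input with the old state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Process a sequence of any length, carrying the internal memory forward.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h)
```

Because the same `rnn_step` is applied at every position, the loop works for sequences of any length, which is what makes RNNs suitable for tasks like speech or handwriting recognition.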

As stated by the authors of the research paper: “LSTMs provide ways to selectively store and forget aspects of previously seen inputs, and as a consequence can memorize information over longer time periods. Through input, output and forget gates, they can flexibly regulate the extent to which inputs are stored, used, and forgotten.” (1)
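The input, output and forget gates the authors mention can be sketched in a standard LSTM cell. The code below is a minimal illustration (weights and sizes are assumptions, not the paper's model): each gate is a sigmoid that regulates how much is stored into, kept in, and read out of the memory cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions for illustration.
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(1)

# One weight matrix and bias per gate, each acting on [x_t; h_{t-1}].
def make_params():
    W = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
    return W, np.zeros(hidden_dim)

(W_i, b_i), (W_f, b_f), (W_o, b_o), (W_c, b_c) = [make_params() for _ in range(4)]

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new input to store
    f = sigmoid(W_f @ z + b_f)        # forget gate: how much old memory to keep
    o = sigmoid(W_o @ z + b_o)        # output gate: how much memory to expose
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate memory content
    c = f * c_prev + i * c_tilde      # selectively forget and store
    h = o * np.tanh(c)                # regulated output
    return h, c

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c)
```

The additive update `c = f * c_prev + i * c_tilde` is what lets the cell memorize information over longer time periods than a vanilla RNN.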

Like humans watching an episode, the LSTM model is presented with a sequence of (possibly multi-modal) inputs, each corresponding to a sentence in the script, and assigns a label l indicating whether the perpetrator is mentioned in the sentence (l = 1) or not (l = 0). The model is fully incremental: each labeling decision is based solely on information derived from previously seen inputs. (1)

Overview of the perpetrator prediction task. (1)

The model receives input in the form of text, images, and audio. Each modality is mapped to a feature representation. Feature representations are fused and passed to an LSTM which predicts whether a perpetrator is mentioned (label l= 1) or not (l= 0).
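The pipeline described above can be sketched end to end. The snippet below is a simplified stand-in, not the authors' implementation: feature sizes are invented, the modalities are fused by plain concatenation, a vanilla recurrent cell stands in for the LSTM, and the weights are untrained placeholders. It shows the incremental structure: one fused input per sentence, one label per step, using only past information.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature sizes for the three modalities (illustrative only).
text_dim, image_dim, audio_dim, hidden_dim = 6, 4, 3, 8
fused_dim = text_dim + image_dim + audio_dim
rng = np.random.default_rng(2)

# Untrained placeholder weights: a simple recurrent cell plus a binary classifier.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, fused_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
w_out = rng.normal(scale=0.1, size=hidden_dim)

def fuse(text_feat, image_feat, audio_feat):
    """Fuse the per-modality feature representations by concatenation."""
    return np.concatenate([text_feat, image_feat, audio_feat])

def predict_labels(sentences):
    """Incrementally label each sentence: l=1 if the perpetrator is mentioned."""
    h = np.zeros(hidden_dim)
    labels = []
    for text_feat, image_feat, audio_feat in sentences:
        x = fuse(text_feat, image_feat, audio_feat)
        h = np.tanh(W_xh @ x + W_hh @ h)  # state depends only on inputs seen so far
        p = sigmoid(w_out @ h)            # probability the perpetrator is mentioned
        labels.append(int(p > 0.5))
    return labels

# A toy "episode" of 5 sentences, each with random modality features.
episode = [(rng.normal(size=text_dim),
            rng.normal(size=image_dim),
            rng.normal(size=audio_dim)) for _ in range(5)]
labels = predict_labels(episode)
```

Because the hidden state is updated sentence by sentence, every decision is made with only previously seen inputs, mirroring a viewer forming a guess as the episode unfolds.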

According to the authors, the LSTM model achieves an average precision of 60%, versus 85% for humans, which is encouraging.

Photo credits: Craig Sunter, "I will be an intellect!", Flickr – CC license.


  1. Lea Frermann, Shay B. Cohen and Mirella Lapata. "Whodunnit? Crime Drama as a Case for Natural Language Understanding." Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, 31 Oct 2017.
  2. Wikipedia, articles on recurrent neural networks and long short-term memory.