Siri is synonymous to that familiar voice in our iPhones today but until recently I did not how how Siri and other interactive voice applications/programs work. Determined to rectify the situation, I went hunting for answers.
A very high level approximation for the process can be described in the process flow below:
For Step 1 different speech recognition algorithms (of which Dynamic Time Warping is one) can be used. Step 2 requires natural language processing. In this post, I focus only on the first step, and more specifically on the DTW algorithm.
The basic premise is that in order to recognize an input from a user/speaker, you need to compare the input with several “templates”. A template is a “feature sequence” of syllables/letters that represents a word and is compared with the input sequence the speaker provides. The DTW algorithm is used to get a good input-template alignment. For example, a good template – input alignment would look like this:
Technically speaking, we have two time sequences:
C = c1,c2,…,cj,…cm
If Q represents the template time sequence and C represents the input time sequence from a user, we could construct a m-by-n matrix with C (input) on the x-axis and Q (template) on the y-axis.
The (ith , jth ) element of the matrix corresponds to the squared distance, (qi – cj)2 . Hence, to find the best match/ a good alignment, the problem is defined as finding a path through the matrix that minimizes the cumulative distance of the path.
The cumulative path distance can also be called the Warping Cost and the problem to minimize the warping cost is represented as:
where wk is the matrix element (i,j)k that also belongs to kthelement of a warping path W. The problem can be solved using dynamic programming, an algorithm type that solves a problem by iteratively breaking the problem into similar smaller problems.
Going back to finding the optimal path, in DTW there are only 3 routes that that the path can take:
Given these three allowed routes from a frame, we can list all the possible paths in the matrix. And the path minimizing the distance between the input sequence and the template sequence is the optimal path. The complexity for finding the optimal path is O(m*n*3).
The final step is to stack up several templates against the input sequence. The template whose optimal path has the lowest distance (Warping Cost) is the optimal template and the word corresponding to the template is used for language processing by Siri.