
\[
\lambda_i = \left\{\, p_j,\ \underline{\mu}_j,\ \Sigma_j \,\right\},
\qquad j = 1, \ldots, M \qquad (2.1)
\]
where

$\underline{\mu}_j$: Mean vector of the j$^{th}$ Gaussian,

$\Sigma_j$: Covariance matrix of the j$^{th}$ Gaussian,

$p_j$: Probability of the j$^{th}$ Gaussian.

The conditional probability of observing the test vector \textit{\uline{x}} given the i$^{th}$ speaker's parameter set is calculated as given below.

\[
p(\underline{x} \mid \lambda_i) = \sum_{j=1}^{M} p_j\, b_j(\underline{x})
\qquad (2.2)
\]

\[
b_j(\underline{x}) =
\frac{1}{(2\pi)^{D/2}\left|\Sigma_j\right|^{1/2}}
\exp\!\left\{ -\frac{1}{2}\,
(\underline{x}-\underline{\mu}_j)^{T}\,\Sigma_j^{-1}\,
(\underline{x}-\underline{\mu}_j) \right\}
\qquad (2.3)
\]
where $D$ is the dimension of the feature vector. The EM algorithm can be formulated as follows.

\[
p(j \mid \underline{x}_t, \lambda_i) =
\frac{p_j\, b_j(\underline{x}_t)}{\sum_{k=1}^{M} p_k\, b_k(\underline{x}_t)}
\qquad (2.4)
\]

\[
\bar{p}_j = \frac{1}{T} \sum_{t=1}^{T} p(j \mid \underline{x}_t, \lambda_i)
\qquad (2.5)
\]

\[
\bar{\underline{\mu}}_j =
\frac{\sum_{t=1}^{T} p(j \mid \underline{x}_t, \lambda_i)\, \underline{x}_t}
{\sum_{t=1}^{T} p(j \mid \underline{x}_t, \lambda_i)}
\qquad (2.6)
\]

\[
\bar{\Sigma}_j =
\frac{\sum_{t=1}^{T} p(j \mid \underline{x}_t, \lambda_i)\,
(\underline{x}_t - \bar{\underline{\mu}}_j)(\underline{x}_t - \bar{\underline{\mu}}_j)^{T}}
{\sum_{t=1}^{T} p(j \mid \underline{x}_t, \lambda_i)}
\qquad (2.7)
\]

In these formulas, $\underline{x}_t$ represents the $t^{th}$ training feature vector of the $i^{th}$ speaker, and $T$ is the number of training vectors. The optimization procedure is terminated when the calculated likelihood value does not increase by more than a predefined threshold between consecutive iterations.

The identification test of any speaker in the set consists of two phases. In the first phase, the likelihood value of the subject speaker's test set is calculated for each candidate speaker. In the second phase, the speaker with the highest likelihood is assigned as the subject speaker's identity. Supposing that $H$ represents the assigned speaker and $X_S$ represents the whole set of test vectors of the subject speaker, we can formulate this decision process as

\[
H = \arg\max_{1 \le i \le S} \; p(\lambda_i \mid X_S)
\qquad (2.8)
\]
where $S$ is the number of candidate speakers. Using Bayes' rule, we can rewrite $p(\lambda_i \mid X_S)$ as in (2.9).

\[
p(\lambda_i \mid X_S) =
\frac{p(X_S \mid \lambda_i)\, p(\lambda_i)}{p(X_S)}
\qquad (2.9)
\]

Assuming that the probability of each speaker is equal and that the value of $p(X_S)$ is the same for each speaker, we can simplify (2.9) as in (2.10).

\[
H = \arg\max_{1 \le i \le S} \; p(X_S \mid \lambda_i)
\qquad (2.10)
\]

\textbf{3 Speaker Identification Performance Analysis}

The speaker identification system requires both training and test vector sets. In order to test identification performance on a discrete frequency band, the training and test sets are generated using only the filtered power spectrum values in the analysis frequency range. It is also observed that these frequency bands must not be shorter than 500 Hz. In the experiments, we use the TIMIT speech corpus [10], which covers eight dialect regions of American English. TIMIT already marks voice-active regions in the utterances, so in this work we do not need a voice activity detection mechanism. The speaker sets we use are restricted to the records of speakers in the fifth dialect region; this restriction cancels the effect of dialect differences on speaker identification performance. Moreover, we work on three speaker sets: the first includes only male speakers, the second only female speakers, and the third both male and female speakers. Each set contains twenty-four speakers. Performance analysis within the same gender also eliminates the information carried by gender difference, which is valuable for speaker identification.
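As a concrete illustration of the procedure of Section 2, the per-speaker EM training and the maximum-likelihood decision rule (2.10) can be sketched as follows. This is a minimal sketch using scikit-learn's EM-based GaussianMixture, not the authors' implementation; the mixture order and covariance type are assumptions, not the settings used in the experiments.

```python
# Minimal sketch of GMM speaker identification (Section 2),
# using scikit-learn's EM-based GaussianMixture rather than the authors' own code.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_sets, n_components=8, seed=0):
    """Fit one GMM (the parameter set lambda_i) per speaker with EM, eqs. (2.4)-(2.7)."""
    models = []
    for X in train_sets:  # X: array of shape (T_i, D), one row per feature vector
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(X)
        models.append(gmm)
    return models

def identify(models, X_test):
    """Decision rule (2.10): the speaker whose model maximizes the test-set log-likelihood."""
    scores = [gmm.score_samples(X_test).sum() for gmm in models]
    return int(np.argmax(scores))
```

Summing per-vector log-likelihoods over the whole test set $X_S$ is equivalent to the product of likelihoods in (2.10) under the usual independence assumption.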
We generate the training set using the unique utterances from all speakers' records (the files with the ``sa'' prefix), and the files with the ``si'' prefix are used in the test set. Using these unique utterances also cancels the phonetic dominance problem in training.

In the experiments, speech records are segmented into 20 ms frames, and the duration between adjacent frames is kept at 10 ms. Each frame is weighted with a Hamming window and transformed to the frequency domain using the DFT; the power spectrum of the frame is then calculated from these coefficients. The power spectrum coefficients are passed through a filter bank composed of uniform triangular filters. Training and test files for each frequency band are generated using the filtered power spectrum. The training phase is the same as given in Section 2. Speaker identification performance is measured according to two criteria: vector ranking and speaker ranking.

In \textit{vector ranking}, we compare the statistical likelihood values of each test vector in terms of the candidate speakers, and then assign a rational number between 0 and 1 to the identification performance of the correct speaker. The mean of all speakers' performance values is taken as the final speaker identification performance measure for the frequency interval. In \textit{speaker ranking}, we compare the statistical likelihood values of each speaker's whole test set in terms of the candidate speakers, and then make the same numerical assignment as in the previous method.
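The framing, windowing, and filter-bank steps described above can be sketched as follows. This is a minimal illustration, not the authors' code; the sampling rate, number of filters, and band edges are assumptions chosen only for the example.

```python
# Sketch of the feature pipeline: 20 ms frames, 10 ms hop, Hamming window,
# DFT power spectrum, and a bank of uniformly spaced triangular filters.
import numpy as np

def uniform_triangular_fbank(n_filters, n_fft, fs, f_lo, f_hi):
    """Triangular filters with uniformly spaced centre frequencies in [f_lo, f_hi] Hz."""
    edges = np.linspace(f_lo, f_hi, n_filters + 2)          # uniform band edges, Hz
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)   # edges mapped to DFT bins
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def band_features(signal, fs, n_filters=20, f_lo=0.0, f_hi=4000.0):
    """Frame, window, and filter the power spectrum; returns one row per 20 ms frame."""
    frame, hop = int(0.020 * fs), int(0.010 * fs)           # 20 ms frame, 10 ms hop
    n_fft = frame
    window = np.hamming(frame)
    fbank = uniform_triangular_fbank(n_filters, n_fft, fs, f_lo, f_hi)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.fft.rfft(window * signal[start:start + frame], n_fft)
        power = np.abs(spectrum) ** 2                       # power spectrum of the frame
        feats.append(fbank @ power)                         # filtered power spectrum
    return np.array(feats)
```

Uniformly spaced centre frequencies are used here to match the uniform triangular filters described above; restricting `f_lo` and `f_hi` corresponds to analyzing a single frequency band.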
The final speaker ranking performance measure for a frequency interval is likewise obtained by averaging all speakers' performance values. After the performance on each frequency band is calculated, we can visualize how speaker identification performance varies along the whole frequency axis. These results are also examined in comparison with the F-ratio [1] values calculated for each frequency band. The F-ratio in this case is the ratio of the inter-speaker variance to the intra-speaker variance in the frequency band, and it is interesting to note that there is a correlation between the calculated F-ratio values and the vector ranking results.

\textbf{4 Conclusion}

The observations in this work give a new perspective on the importance of different frequency bands in speaker identification systems. Although the mel scale is generally used in speaker identification systems, it is possible to define a new scale using the results of this work. We have already developed a new filter bank based on these results, called the ``speaker sensitive frequency scale filter bank'' (SSFSF). In a speaker identification test including 462 speakers of the TIMIT corpus, the system with the SSFSF gives better identification results than the system with a mel-scale filter bank.
As future work, we plan a subjective test to compare our observations with human auditory system responses.

\textbf{References}

\begin{enumerate}
\item Atal, B.S., ``Automatic recognition of speakers from their voices'', Proc. IEEE, Vol. 64, pp. 460-474, 1976.
\item Davis, S.B. and Mermelstein, P., ``Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences'', IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-28, pp. 357-366, 1980.
\item Rosenberg, A.E. and Soong, F.K., ``Evaluation of a vector quantization talker recognition system in text independent and text dependent modes'', Computer Speech and Language, Vol. 22, pp. 143-157, 1987.
\item Reynolds, D.A. and Rose, R.C., ``Robust text-independent speaker identification using Gaussian mixture speaker models'', IEEE Trans. Speech and Audio Processing, Vol. 3, pp. 72-83, 1995.
\item Tishby, N.Z., ``On the application of mixture AR hidden Markov models to text independent speaker recognition'', IEEE Trans. Signal Processing, Vol. 39, pp. 563-570, 1991.
\item Oglesby, J. and Mason, J., ``Radial basis function networks for speaker recognition'', in Proc. ICASSP, May 1991, pp. 393-396.
\item Orman, Ö.D. and Arslan, L., ``A comparative study on closed set speaker identification using RBF network and modular networks'', accepted for presentation at TAINN'2000.
\item Sambur, M.R., ``Selection of acoustic features for speaker identification'', IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-23, pp. 176-182, 1975.
\item O'Shaughnessy, D., ``Speaker recognition'', IEEE ASSP Magazine, pp. 4-17, October 1986.
\item ``Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database'', National Institute of Standards and Technology (NIST), Gaithersburg, MD (prototype as of Dec. 1988).
\end{enumerate}
\end{document}