Last October, a team of researchers and engineers at Microsoft's Artificial Intelligence and Research group reported that their speech recognition system had reached a word error rate (WER) of 5.9 percent, on par with or even below that of professional transcriptionists (see the Synced article 《重磅 | 微软语音识别实现历史性突破：语音转录达到专业速录员水平（附论文）》). Today, however, IBM stated on its official blog that human performance is actually 5.1 percent, and that IBM's own system has surpassed Microsoft's previously reported best, reaching a word error rate of 5.5 percent. IBM calls this a brand-new breakthrough, but the accompanying research paper does not yet appear to have been published (we were unable to find it). Synced will continue to follow this work and hopes to share the technical details of the result with readers as soon as possible.
March 7, 2017
Reaching new records in speech recognition
Depending on whom you ask, humans miss one to two words out of every 20 they hear. In a five-minute conversation, that could be as many as 80 words. But for most of us, it isn't a problem. Imagine, though, how difficult this is for a computer.
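The back-of-the-envelope arithmetic behind that figure, assuming a typical conversational rate of roughly 160 words per minute (a rate the post itself does not state):

```python
# Rough arithmetic behind the "as many as 80 words" figure.
# Assumption (not from the post): conversational speech runs ~160 words/minute.
words_per_minute = 160
minutes = 5
total_words = words_per_minute * minutes          # 800 words heard

miss_rate_low, miss_rate_high = 1 / 20, 2 / 20    # one to two words out of every 20
missed_low = total_words * miss_rate_low          # 40 words missed
missed_high = total_words * miss_rate_high        # 80 words missed
print(missed_low, missed_high)
```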
Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, we have continued to push the boundaries of speech recognition, and today we’ve reached a new industry record of 5.5 percent.
This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like “buying a car.” This recorded corpus, known as the “SWITCHBOARD” corpus, has been used for over two decades to benchmark speech recognition systems.
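Word error rate is the standard metric on this benchmark: the minimum number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference, divided by the number of reference words. A minimal sketch of the computation (illustrative only, not IBM's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words:
print(word_error_rate("i want to buy a car", "i want to buy car"))
```

A system at 5.5 percent WER makes roughly one such error per 18 reference words.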
To reach this 5.5 percent breakthrough, IBM researchers focused on extending our application of deep learning technologies. We combined LSTM (Long Short-Term Memory) and WaveNet language models with three strong acoustic models. Of the acoustic models used, the first two were six-layer bidirectional LSTMs: one has multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The unique thing about the third model is that it not only learns from positive examples but also takes advantage of negative examples – so it gets smarter as it goes and performs better where similar speech patterns are repeated.
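The post does not describe how the models' scores are fused, but a common way to combine several language models is linear interpolation of their word probabilities. A hypothetical sketch under that assumption (the weights and probabilities are made up):

```python
import math

def interpolate_log_probs(model_scores, weights):
    """Combine per-word log-probabilities from several models.

    model_scores: one log-probability per model for the same candidate word.
    weights: interpolation weights, assumed to sum to 1.
    Returns the log of the linearly interpolated probability.
    """
    # Linear interpolation in probability space: p = sum(w_i * p_i)
    p = sum(w * math.exp(s) for w, s in zip(weights, model_scores))
    return math.log(p)

# Hypothetical scores for one candidate word from an LSTM LM and a WaveNet-style LM
scores = [math.log(0.30), math.log(0.10)]
combined = interpolate_log_probs(scores, [0.6, 0.4])
print(round(math.exp(combined), 2))  # 0.6*0.30 + 0.4*0.10 = 0.22
```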
Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside us, and some have recently claimed reaching 5.9 percent as equivalent to human parity…but we’re not popping the champagne yet. As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent.
To determine this number, we worked to reproduce human-level results with the help of our partner Appen, which provides speech and search technology services. And while our breakthrough of 5.5 percent is a big one, this discovery of human parity at 5.1 percent proved to us that we have a way to go before we can claim technology is on par with humans.
As part of our research efforts, we connected with different industry experts to get their input on this matter too. Yoshua Bengio, leader of the University of Montreal's MILA (Montreal Institute for Learning Algorithms) Lab, agrees we still have more work to do to reach human parity:
“In spite of impressive advances in recent years, reaching human-level performance in AI tasks such as speech recognition or object recognition remains a scientific challenge. Indeed, standard benchmarks do not always reveal the variations and complexities of real data. For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition,” says Bengio. “IBM continues to make significant strides in advancing speech recognition by applying neural networks and deep learning into acoustic and language models.”
We also realized that finding a standard measurement for human parity across the industry is more complex than it seems. Beyond SWITCHBOARD, another industry corpus, known as "CallHome," offers a different set of linguistic data to test against, drawn from more colloquial conversations between family members on topics that are not pre-fixed. Conversations from CallHome data are more challenging for machines to transcribe than those from SWITCHBOARD, making breakthroughs harder to achieve. (On this corpus we achieved a 10.3 percent word error rate – another industry record – but again, with Appen's help, we measured human performance in the same situation to be 6.8 percent.)
In addition, with SWITCHBOARD, some of the same human voices in the test data also appear in the data used to train the acoustic and language models. Since CallHome has no such overlap, the speech recognition models have never been exposed to the test speakers' voices. With no such repetition to exploit, the gap between human and machine performance on CallHome is larger. As we continue to pursue human parity, advancements in our deep learning technologies that can pick up on such repetitions are ever more important to finally overcoming these challenges.
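The train/test speaker overlap described above can be checked mechanically once utterances are keyed by speaker ID. A hypothetical sketch (the IDs below are invented for illustration):

```python
# Hypothetical speaker IDs; real corpora key utterances by speaker/conversation IDs.
train_speakers = {"sw_1001", "sw_1002", "sw_1003"}
switchboard_test = {"sw_1002", "sw_1004"}   # shares a speaker with training
callhome_test = {"ch_2001", "ch_2002"}      # disjoint from training

def overlap(train, test):
    """Speakers that appear in both the training and the test set."""
    return train & test

print(overlap(train_speakers, switchboard_test))  # one shared speaker
print(overlap(train_speakers, callhome_test))     # empty: unseen speakers
```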
Julia Hirschberg, a professor and Chair at the Department of Computer Science at Columbia University, also commented on the ongoing complex challenge of speech recognition:
“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated,” she shared. “IBM’s recent achievements on the SWITCHBOARD and on the CallHome data are thus quite impressive. But I’m also impressed with the way IBM has been working to better understand human ability to understand these two, much-cited corpora. This scientific achievement is in its way as impressive as the performance of their current ASR technology, and shows that we still have a way to go for machines to match human speech understanding.”
Today’s achievement adds to recent advancements we’ve made in speech technology – for example, in December we added diarization to our Watson Speech to Text service, marking a step forward in distinguishing individual speakers in a conversation. These speech developments build on decades of research, and achieving speech recognition comparable to that of humans is a complex task. We will continue to work towards creating the technology that will one day match the complexity of how the human ear, voice and brain interact. While we are energized by our progress, our work is dependent on further research – and most importantly, staying accountable to the highest standards of accuracy possible.
To check out the white paper on this automatic speech recognition milestone, see https://arxiv.org/abs/1703.02136