
IBM says the human word error rate for speech recognition is actually 5.1%, and its own system has now reached 5.5%


Source: IBM

Author: George Saon

Compiled by Synced (机器之心)

Contributors: 吴攀, 黄小天

Last October, a team of researchers and engineers from Microsoft's Artificial Intelligence and Research group reported that their speech recognition system had reached a word error rate (WER) of 5.9%, matching or even beating professional transcriptionists (see the Synced article 《重磅 | 微软语音识别实现历史性突破:语音转录达到专业速录员水平(附论文)》). Today, however, a post on IBM's official blog argues that the human level is actually 5.1%, and at the same time reports that IBM's system has surpassed Microsoft's previously reported best with a WER of 5.5%. IBM calls this a new breakthrough, although the accompanying research paper does not yet appear to have been published (we were unable to find it). Synced will keep following this work and hopes to share the technical details with readers as soon as they are available.


Depending on whom you ask, humans miss one to two words out of every 20 they hear. In a five-minute conversation, that could be as many as 80 words. But for most of us, it isn't a problem. Imagine, though, how difficult it is for a computer.


Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, we have continued to push the boundaries of speech recognition, and today we have reached a new industry record of 5.5 percent.


This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like "buying a car." This recorded corpus, known as the SWITCHBOARD corpus, has been used for over two decades to benchmark speech recognition systems.
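
To make the metric concrete: WER counts the word substitutions, deletions and insertions needed to turn the reference transcript into the system's hypothesis, divided by the number of reference words. Below is a minimal Python sketch of that computation; it is illustrative only, not IBM's scoring pipeline (benchmark results like these are typically produced with standard tools such as NIST's sclite).

```python
# Minimal word error rate (WER) computation:
# WER = (substitutions + deletions + insertions) / number of reference words,
# found via a word-level edit-distance (Levenshtein) alignment.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a 20-word reference gives 1/20 = 5% WER,
# roughly the "one to two words out of every 20" that humans miss.
ref = "the quick brown fox jumps over the lazy dog and runs back home to rest before the long night falls"
hyp = "the quick brown fox jumps over the lazy dog and runs back home to rest before the wrong night falls"
print(word_error_rate(ref, hyp))  # 0.05
```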


To reach this 5.5 percent breakthrough, IBM researchers focused on extending our application of deep learning technologies. We combined LSTM (Long Short-Term Memory) and WaveNet language models with three strong acoustic models. Of the acoustic models used, the first two were six-layer bidirectional LSTMs: one takes multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The distinguishing trait of the third model is that it learns not only from positive examples but also from negative ones, so it gets smarter as it goes and performs better where similar speech patterns are repeated.
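
The post does not spell out the exact network configurations, but as a rough illustration, here is a minimal PyTorch sketch of the general shape of one such acoustic model: a six-layer bidirectional LSTM mapping per-frame acoustic features to per-frame output-unit posteriors. The feature dimension, hidden size and output inventory below are assumptions for the example, not IBM's settings.

```python
# Illustrative sketch only: a generic six-layer biLSTM acoustic model,
# not IBM's actual architecture or hyperparameters.
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 512, n_targets: int = 9000):
        super().__init__()
        self.encoder = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden,
            num_layers=6,          # "six-layer bidirectional LSTMs"
            bidirectional=True,
            batch_first=True,
        )
        # Project concatenated forward/backward states to output units
        # (e.g., context-dependent phone states).
        self.classifier = nn.Linear(2 * hidden, n_targets)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) acoustic features, e.g. log-mel.
        encoded, _ = self.encoder(feats)  # (batch, frames, 2 * hidden)
        return self.classifier(encoded)   # per-frame logits

model = BiLSTMAcousticModel()
logits = model(torch.randn(2, 300, 40))  # two 3-second utterances at 10 ms frames
print(logits.shape)  # torch.Size([2, 300, 9000])
```

The speaker-adversarial variant mentioned above would, in the usual formulation, add a speaker-classification head whose gradient is reversed on the way back into the encoder, pushing the learned features to be speaker-invariant.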


Reaching human parity, meaning an error rate on par with that of two humans speaking, has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside us, and some have recently claimed 5.9 percent as equivalent to human parity, but we're not popping the champagne yet. As part of our process in reaching today's milestone, we determined that human parity is actually lower than anything yet achieved: 5.1 percent.


To determine this number, we worked to reproduce human-level results with the help of our partner Appen, which provides speech and search technology services. And while our breakthrough of 5.5 percent is a big one, this finding of human parity at 5.1 percent proved to us that we still have a way to go before we can claim the technology is on par with humans.


As part of our research efforts, we also connected with different industry experts to get their input on this matter. Yoshua Bengio, leader of the University of Montreal's MILA (Montreal Institute for Learning Algorithms) lab, agrees that we still have more work to do to reach human parity:


"In spite of impressive advances in recent years, reaching human-level performance in AI tasks such as speech recognition or object recognition remains a scientific challenge. Indeed, standard benchmarks do not always reveal the variations and complexities of real data. For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition," says Bengio. "IBM continues to make significant strides in advancing speech recognition by applying neural networks and deep learning to acoustic and language models."


We also realized that finding a standard measurement of human parity across the industry is more complex than it seems. Beyond SWITCHBOARD, another industry corpus, known as CallHome, offers a different set of linguistic data to test against; it was created from more colloquial conversations between family members on topics that are not fixed in advance. Conversations from CallHome are more challenging for machines to transcribe than those from SWITCHBOARD, which makes breakthroughs harder to achieve. (On this corpus we achieved a 10.3 percent word error rate, another industry record; but again, with Appen's help, we measured human performance in the same setting to be 6.8 percent.)


In addition, with SWITCHBOARD, some of the same human voices that appear in the test speakers' data are also included in the data used to train the acoustic and language models. CallHome has no such overlap, so the speech recognition models were never exposed to the test speakers' voices. Because of this there is no repetition to exploit, which leads to a larger gap between human and machine performance. As we continue to pursue human parity, advances in deep learning techniques that can pick up on such repetitions become ever more important to finally overcoming these challenges.
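
To make the overlap point concrete, here is a small sketch of a speaker-disjoint split, the situation CallHome approximates, in which no test speaker ever appears in training. The utterance records are hypothetical stand-ins, not actual corpus metadata.

```python
# Sketch of a speaker-disjoint train/test split: hold out whole speakers
# so the model never hears a test speaker's voice during training.
import random

def split_by_speaker(utterances, test_fraction=0.1, seed=0):
    """Partition utterances so train and test share no speakers."""
    speakers = sorted({u["speaker"] for u in utterances})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u["speaker"] not in test_speakers]
    test = [u for u in utterances if u["speaker"] in test_speakers]
    return train, test

# Hypothetical records: 200 utterances from 20 speakers.
utterances = [
    {"speaker": f"spk{i % 20}", "audio": f"utt_{i}.wav"} for i in range(200)
]
train, test = split_by_speaker(utterances)
# No speaker appears on both sides of the split.
assert not {u["speaker"] for u in train} & {u["speaker"] for u in test}
```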


Julia Hirschberg, professor and chair of the Department of Computer Science at Columbia University, also commented on the ongoing, complex challenge of speech recognition:


"The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It's also difficult to define human performance, since humans vary in their ability to understand the speech of others. When we compare automatic recognition to human performance, it's extremely important to take both of these things into account: the performance of the recognizer and the way human performance on the same speech is estimated," she shared. "IBM's recent achievements on the SWITCHBOARD and CallHome data are thus quite impressive. But I'm also impressed with the way IBM has been working to better understand human ability to understand these two much-cited corpora. This scientific achievement is in its way as impressive as the performance of their current ASR technology, and shows that we still have a way to go for machines to match human speech understanding."


Today's achievement adds to recent advancements we have made in speech technology. In December, for example, we added diarization to our Watson Speech to Text service, a step forward in distinguishing individual speakers in a conversation. These speech developments build on decades of research, and achieving speech recognition comparable to that of humans is a complex task. We will continue working toward technology that can one day match the complexity of how the human ear, voice and brain interact. While we are energized by our progress, our work depends on further research and, most importantly, on staying accountable to the highest possible standard of accuracy.
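
As a usage illustration, diarization is exposed in the Watson Speech to Text HTTP API through its speaker_labels option. The sketch below is hedged: the service URL and API key are placeholders that vary per IBM Cloud instance, and the audio file name is assumed.

```python
# Hedged sketch of requesting diarization from Watson Speech to Text
# via its speaker_labels option. URL, key and file name are placeholders.
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/v1/recognize"
APIKEY = "your-apikey-here"  # placeholder credential

with open("conversation.wav", "rb") as audio:
    resp = requests.post(
        URL,
        params={"speaker_labels": "true"},  # ask the service to tag speakers
        headers={"Content-Type": "audio/wav"},
        auth=("apikey", APIKEY),
        data=audio,
    )
resp.raise_for_status()

# Each entry gives a word's start/end time and its speaker tag.
for word in resp.json().get("speaker_labels", []):
    print(word["from"], word["to"], "speaker", word["speaker"])
```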


Original post: "Reaching new records in speech recognition," IBM Watson blog, March 7, 2017.


To check out the white paper on this automatic speech recognition milestone, please see this link https://arxiv.org/abs/1703.02136

Original URL: https://www.ibm.com/blogs/watson/2017/03/reaching-new-records-in-speech-recognition/