Using Speech Recognition to Automatically Transcribe Interviews, Meetings, and Speeches

I’ve been looking for a way to use speech recognition to automate the transcription of interviews, meetings, speeches, conference presentations, and so on.

I spend a lot of time on the phone interviewing experts for the articles and reports I write. Normally I conduct the interview with a headset and do my best to type a transcript of what is said. I’m slow and a terrible typist, so my transcript misses a lot and comes out with many misspellings and garbled words. For an hour-long interview, it usually takes me another hour to go back through the transcript, fixing mistakes, filling in gaps, and guessing at uninterpretable words.

I would greatly benefit from a speech recognition solution that could create a fairly accurate transcript from audio, for example, live over the phone or from an mp3 file.

This need was emphasized to me even more this week, when I attended a conference and spent two days trying to take notes and capture useful quotes from speakers. I have a digital voice recorder and have all of the presentations in mp3 format, but it’s going to be quite a challenge to comb through all of that audio to find relevant quotes for the articles I will be writing about the conference. How much easier it would be if I had a software application that could convert all of those mp3s into fairly accurate text transcripts!

Unfortunately, it appears that voice recognition software is not yet ready to handle meetings and other situations where multiple voices are involved. These systems have to be trained to recognize the voice of a single user.

I’m using this blog post to mark and share some possible solutions I have encountered. I plan to add to this list as time goes on — if and when the technology continues to improve.

+ Dragon NaturallySpeaking by Nuance is supposed to be the best reasonably priced speech recognition software for professional use. Nuance says Dragon is not able to transcribe multiple voices, but I’m tempted to shell out the $200 just to see what kind of results I might get with it. Suppose it were 50 percent accurate transcribing unfamiliar voices? That might be good enough for me.

+ Windows has its own built-in speech recognition capability. I plan to test this out to see whether I can make it work somehow. However, it’s hard to believe that Microsoft could come up with a better solution than a specialist company like Nuance.

+ One suggestion I’ve run into a lot is to transcribe a meeting or lecture by “parroting” or “re-speaking.” In other words, using speech rec software like Dragon, you listen to the recording of the meeting on headphones and repeat what you hear into your computer mic. Because Dragon is trained to your voice, it can create an automatic transcript. Sounds laborious, but it would probably be better than having to type it all out myself.

+ I also heard about a company called Koemei that has a cloud-based solution for converting video and audio assets into text. Looks as if this might work pretty well; however, their entry-level service is $149 per month. That sounds like a lot, but maybe someday…. For $20 per month I would definitely try it.

+ Another idea I have thought of is to call my Google Voice number and play the audio recording into my voicemail. Google Voice automatically transcribes my voicemails into text and often does an acceptable job — good enough so I could paste the results into a word processor and make quick corrections. I’m not sure yet if Google Voice can handle long audio streams, though. I’m thinking about testing this solution to see if I can make it work somehow.

+ Here’s an interesting video by Chaelaz showing how to use YouTube’s closed-captioning transcription service to convert audio to text. Looks as if you would have to create a video first and upload it to YouTube, but that’s an interesting possible work-around for what I’m trying to do.
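As a back-of-the-envelope check on the Google Voice idea above: if voicemails are capped at a few minutes per message (I’m assuming a three-minute limit here for illustration; I haven’t confirmed the actual cap), an hour-long interview would have to be played in as a series of short chunks. A quick Python sketch of the arithmetic:

```python
# Sketch: how many voicemail-sized chunks would a long recording need?
# The 180-second limit is my assumption, not a documented Google Voice figure.

def chunk_bounds(total_seconds, chunk_seconds=180):
    """Return (start, end) second offsets covering the whole recording."""
    bounds = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# A one-hour interview would need 20 three-minute chunks:
chunks = chunk_bounds(60 * 60)
print(len(chunks))            # 20
print(chunks[0], chunks[-1])  # (0, 180) (3420, 3600)
```

Twenty separate voicemails per interview — tedious, but possibly still faster than typing it all out.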

ARB — 21 June 2013


How the Internet Reinforces Confirmation Bias

Recently I wrote about confirmation bias in connection with the climate change controversy — see my article at ThomasNet, “All This Wrangling Over Climate Change – What’s Up With That?” The Skeptic’s Dictionary refers to confirmation bias as “a type of selective thinking whereby one tends to notice and to look for what confirms one’s beliefs, and to ignore, not look for, or undervalue the relevance of what contradicts one’s beliefs.”

Today I ran across an interesting TED Talk (TED hosts and posts video talks on innovative topics) by political activist Eli Pariser, who has some interesting things to say about how the algorithms used on web sites such as Facebook and Google tend to reinforce our current thinking and filter out new ideas. See his talk, “Beware Online ‘Filter Bubbles’” — well worth watching, and only nine minutes long.

Pariser explains what he means by a filter bubble:

Your filter bubble is kind of your own personal, unique universe of information that you live in online … the thing is, you don’t decide what gets in, and more importantly, you don’t actually see what gets edited out.

If you and I both search for the same thing at the same time on Google, for example, we get different results. The danger of the filter bubble, says Pariser, is that

this moves us very quickly toward a world in which the Internet is showing us what it thinks we want to see, but not necessarily what we need to see.

He suggests that a personalization algorithm deciding what to show us needs to look not just at what it thinks is “relevant,” but at other factors too, such as those he shows in a slide from his presentation.

This seems like a great insight. Anyway, I highly recommend this short video to get you thinking outside the box.
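Pariser’s point is easy to demonstrate with a toy personalization loop. Everything below (the item list, the scoring rule) is my own simplification for illustration; real ranking algorithms at Google or Facebook are vastly more complex and not public:

```python
# Toy illustration of a filter bubble: a ranker that scores items only by
# similarity to the user's past clicks, so familiar topics crowd out the rest.
from collections import Counter

ITEMS = [
    ("Tax cut op-ed", "politics"),
    ("New battery breakthrough", "science"),
    ("Election polling analysis", "politics"),
    ("Coral reef study", "science"),
]

def rank(items, click_history):
    """Sort items so topics the user clicked before come first."""
    topic_counts = Counter(topic for _, topic in click_history)
    return sorted(items, key=lambda item: topic_counts[item[1]], reverse=True)

# After the user clicks two politics stories, science sinks to the bottom:
history = [ITEMS[0], ITEMS[2]]
for title, topic in rank(ITEMS, history):
    print(topic, "-", title)
```

Each click makes the next ranking more lopsided, which is exactly the feedback loop Pariser warns about: relevance alone, with no weight for novelty or challenge, narrows what you see.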

AB — 24 August 2011

SEO Angst: The Secret of Search Engine Optimization

Many who manage web sites invest great effort and expense in search engine optimization (SEO), the practice of optimizing the content and format of a site and its pages so as to attract the most search engine traffic.

SEO is important to online businesses, because qualified web traffic can translate into eyeballs (if a site sells advertising) or sales (if it’s an e-commerce site) or potential clients (if the site is run by, say, a consulting firm).

I’ve been around the practice of SEO for about 15 years (before it was even called SEO), and I’ve come to believe in a central truth about it:

If you want search engine traffic, the first thing you have to do is deserve it.

This means providing honest, substantive content.

This also means offering well-executed services and a customer experience that serves the visitor well.

This concept is approximately equivalent to customer-centeredness in marketing or user-centered design in software development. A business has to make a profit, try to grow, strive for market share — but business success in the long term is hard to come by without a strong customer focus, or user focus in the case of web traffic.

By all means, optimize your site for search engine traffic, but be aware that few businesses make it for very long by tricking Google.

Do what you can to direct web traffic to your site, but make sure you deserve it.

AB — 5 May 2011

Undo: One of the Greatest Innovations in Computing

The Undo function — a life-saver.

From “Behavioral issues in the use of interactive systems,” Lance A. Miller and John C. Thomas, International Journal of Man-Machine Studies, Sept. 1977:

A more complex situation, however, occurs … when a user wishes to “undo” the effects of some number of prior commands — as, for example, when a user inadvertently deletes all personal files. Recovery from such situations is handled by most systems by providing “back-up” copies of (all) users’ files, from which a user can get restored the personal files as they were some days previous. While this is perhaps acceptable for catastrophic errors, it would be quite useful to permit users to “take back” at least the immediately preceding command (by issuing some special “undo” command).
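The mechanism Miller and Thomas call for is simple to sketch today: keep a stack of prior states and pop one to “take back” the last command. A minimal illustration in Python (my own toy example, not their design):

```python
# Minimal undo stack: save the document state before each change,
# and restore the most recent saved state on undo.
class Editor:
    def __init__(self):
        self.text = ""
        self.history = []  # stack of previous states

    def type(self, s):
        self.history.append(self.text)  # snapshot before the change
        self.text += s

    def undo(self):
        if self.history:
            self.text = self.history.pop()

ed = Editor()
ed.type("Hello")
ed.type(", world")
ed.undo()
print(ed.text)  # Hello
```

Real editors store deltas rather than full snapshots, but the principle is the same one the 1977 paper proposed.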

Now if they would only invent an Undo button for one’s personal life.

AB — 15 April 2011

Where the Big Green Copier Button Came From

Big green copier button

Recently I’ve been studying the use of ethnography in large companies for product design and market strategy, which relates to some of the work I’ve done in usability and user experience.

In the process of that research, I ran across an interesting anecdote about how the “big green button” on copiers came about. I think it illustrates the value of video ethnography in product design, but, on an even more basic level, the value of simply watching how people live and work and use your product.

In a 1999 presentation for WPT Fest, Xerox PARC anthropologist Lucy Suchman described how she helped Xerox engineers understand how hard copiers were to use:

Around this time [1979] a project began at PARC to develop an intelligent, interactive expert system that would provide instructions to users in the operation of a particular photocopier, just put on the market and reported by its intended users to be “too complicated.” With Austin Henderson, I initiated a series of studies aimed first at understanding what made the existing machine difficult to use, and later at seeing just what happened when people engaged in “interactions” with my colleagues’ prototype expert advisor.

Scientists struggling with copier

In order to explore these questions in detail we got a machine ourselves and installed it in our workplace. I then invited others of my co-workers, including some extremely eminent computer scientists, to try using the machine to copy their own papers for colleagues, with the understanding that a video camera would be rolling while they did so. This resulted among other things in what has become something of a cult video that I produced for John Seely Brown for a keynote address to CHI in 1983, titled “When User Hits Machine.” This image, taken from a 3/4″ reel-to-reel video recording made in 1982, shows two of my colleagues using the machine to make two-sided copies of a research paper. The CHI audience would recognize Allen Newell, one of the founding fathers of AI. His PARC colleague is a brilliant computational linguist named Ron Kaplan.

Video ethnographer Susan Faulkner of Intel relates one of the interesting results of Suchman’s video:

The film was shown to researchers and engineers at Xerox, and it led to significant changes in interface design, including the addition of the now ubiquitous large green button that allows users to quickly and easily make a copy.

AB — 2 June 2010

Coming Soon: Tom Cruise’s Computer Interface From ‘Minority Report’

My favorite computer interface has to be the fictional one used by Tom Cruise in the 2002 Steven Spielberg movie Minority Report (based on a 1956 short story by Philip K. Dick). In the movie, Cruise plays a cop in the “Precrime” unit, a team that prevents murders by predicting them in advance and arresting the future perpetrators.

What has always fascinated me about the movie is the computer interface the cops use to do their investigations — it’s a huge holographic screen that hangs in the air in front of the user, who interacts with it using virtual-reality gloves. Here’s a screen shot from the movie that will give you an idea:

Computer interface from Minority Report

The exciting news for me comes from a TED Talk video from February 2010 showing a lecture by John Underkoffler of the MIT Tangible Media Group (“John Underkoffler points to the future of UI”), who was science advisor for Minority Report. He and colleagues designed the interfaces that appeared in the film.

Underkoffler has some fascinating things to say about how interfaces are evolving. He also tells how the design work for Minority Report was carried out: the computer interfaces were developed as a real R&D project.

But most exciting is that Underkoffler and colleagues are actually developing the real thing — the “spatial operating environment” as he calls it — and he was able to demonstrate it during his talk. Here’s a still of his demo from the video:

John Underkoffler demonstrates UI

During his talk he says:

Much of what we want computers to help us with in the first place is inherently spatial, and the part that isn’t spatial can often be ‘spatialized’ to allow our wetware to make better sense of it.

A spatialized interaction model, he believes, improves our computing experience, as it aligns better with the way our brains work.

During the talk, Underkoffler demonstrates a logistics application his team is developing that combines structured data with 3D geographical mapping. He also shows how a spatial operating environment might be used for media manipulation and editing.

Very soon, Underkoffler says, “this stuff will be built into the bezel of every display, it’ll be built into architecture.”

At the end of the presentation, the host asks the big question: “When? … In your mind, five years’ time, someone can buy this as part of a standard computer interface?”

Underkoffler replies, “I think in five years’ time, when you buy a computer, you’ll get this.”

The first “killer app” for the spatial operating environment? “At the moment, our early adopter customers — and these systems are deployed out in the real world — do all the big data-heavy, data-intensive problems with it. So, if it’s logistics in supply chain management, or natural gas and resource extraction, financial services, pharmaceuticals, bioinformatics — those are the topics right now. But that’s not the killer app!”

He leaves us hanging at that point, recognizing perhaps that the most interesting applications are impossible to foresee.

Here’s the video in its entirety, with lots of fascinating demonstration footage:

AB — 1 June 2010

New Wristwatch Uses a Linear Rather Than Circular Clock Face

Just yesterday I read on The Watchismo Times (a blog dedicated to unusual timepieces) about a new mechanical wristwatch designed with a linear time display rather than the traditional circular clock face. (See “Urwerk King Cobra CC1 Reinterpretation of 1958 Patek Philippe Cobra Prototype – Cylindrical Retrograde Linear Jumping Hour Display.”)

This design is thought-provoking: We normally conceive of time as a line, and yet for centuries the standard timepiece interface has been a circle. The author of the Watchismo site explains why this is:

Why do we think of time as travelling in a straight line yet display it rotating around a circle? The answer is straightforward: mechanisms that continually rotate are much simpler to produce than those that trace a straight line then return to zero. In fact, the latter is so difficult that, until now, nobody has ever managed to develop a production wristwatch with true retrograde linear displays.
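The distinction in that quote is easy to make concrete with a little arithmetic. A retrograde linear hand advances steadily along its track, then snaps back to zero at the end of the cycle. Here is a sketch of that mapping in Python (the 0–100 scale is arbitrary, not Urwerk’s actual calibration):

```python
# Position of a retrograde linear minute hand on a 0-100 track:
# it advances steadily for 60 minutes, then jumps back to zero.
def linear_minute_position(minutes_past_hour, scale=100):
    return (minutes_past_hour % 60) * scale / 60

print(linear_minute_position(30))  # 50.0  (halfway along the track)
print(linear_minute_position(60))  # 0.0   (the hand has snapped back)
```

A rotating hand never has that discontinuity, which is exactly why circular mechanisms are so much simpler to build: the jump back to zero is the hard part.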

It makes me think about how I conceive time personally. In the big picture, I think I do see time as a straight line going infinitely to the left and right.

In spite of the more linear design of the calendars I use, I believe I conceive of the calendar as a circle, as if the year were superimposed on a standard clock face. However, in my mind, the calendar runs counterclockwise with January at approximately the 11:00 position. I think my circular conception of the calendar comes from the periodic nature of the solar year. Why the year goes counterclockwise in my mind I don’t know.

When it comes to my conception of days, though, I see some ambiguities. I do conceive of them on some level as a circle of 24 hours, but on reflection I think that conception is at least partly based on the circular clock faces we use to keep time, as well as on the collective 24-hour standard we use to keep our society synchronized.

Certainly the new Urwerk King Cobra CC1 provides food for thought about how we think about time and about the user interfaces of the devices we use to keep track of it. Below is a link to Watchismo’s picture of the watch. Watchismo also provides many fascinating details about how the watch is designed and constructed.

Urwerk linear wristwatch

AB — 10 July 2009

Google Noticeboard: Net-Based Communications for “Have-Nots”?

In a recent article on his Content Nation site, John Blossom of Shore Communications discussed the possibilities for the new Google Noticeboard application as an Internet and computing tool for the world’s 5 billion people who are too poor to have Internet access.

Blossom is a respected expert in the content industry, and his new book, Content Nation: Surviving and Thriving as Social Media Changes Our Work, Our Lives and Our Future, explores the future of society in light of social media.

In the recent article, “The Other Five Billion: Google Focuses on Truly Universal Publishing for Content Nation,” I learned of Blossom’s interest in the Hole in the Wall project, in which, Blossom writes:

… in the back alleys of New Delhi poor children with no previous exposure to computers were given access to the Web via a PC embedded in the wall of a building. Almost immediately they became what an adult would consider “computer literate” and started teaching one another how to publish and how to collaborate on content.

The Hole in the Wall has also attracted my attention for its lessons on human-computer interaction. For more on the Hole in the Wall, see my blog entry “The Hole in the Wall: Computing for India’s Impoverished.”

The Google Noticeboard application Blossom discusses allows people to use publicly-shared computers to send text or voice messages through public Noticeboards. The application is designed such that it can be used by people with no computer experience, or even people who are illiterate.

The following series of images gives an idea of the interaction design:

AB — 1 April 2009