Building Software to Fix Thousands of Errors on the NY Times
The Problem
Back in the 90s, The New York Times started a project to digitize and transcribe its old printed articles, called the TimesMachine. The TimesMachine not only lets users view the original printed articles, but also read transcriptions of them in the same format as a typical article on NYTimes.com. This transcription feature is available for all articles after around 1960; articles from before then are only available to download as PDFs.
Last year, while working on a project for a history class, I noticed a recurring error in these transcribed articles: an extra space in the middle of a word, often appearing three or four times in a single paragraph.
Here's an example from the first paragraph of a pretty notable article, the first in the release of the famous Pentagon Papers, titled Vietnam Archive: Pentagon Study Traces 3 Decades of Growing U. S. Involvement:
This error appeared over 300 times in just the six transcribed articles I selected, and it likely occurs thousands, if not tens of thousands, of times across transcribed New York Times articles.
If you take a look at the original articles in print, you'll see why this error occurs: the transcription system interprets end-of-line hyphens as spaces, so words that were hyphenated across two lines end up split in two by a space.
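As a simplified illustration of what's likely happening (this is my own sketch of the failure mode, not the Times' actual transcription pipeline), joining printed lines and treating the end-of-line hyphen as a space leaves a gap inside the word:

```javascript
// hypothetical sketch: lines of printed text are joined with a space,
// and end-of-line hyphens are naively replaced rather than removed
const printedLines = [
  'demonstrates that four administrations pro-',
  'gressively developed a sense of commitment',
];
const naiveTranscription = printedLines.join(' ').replace(/-\s/g, ' ');
// => 'demonstrates that four administrations pro gressively developed a sense of commitment'
```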
An Automated Solution
Seeing this issue, I developed a solution that takes advantage of Google's Ngram dataset, which contains data on hundreds of thousands of words and names and how common each one is. Here's what that looks like for the example article mentioned previously:
View a Demo
Here's a quick demo site that showcases the program, with example errors and fixes from real articles, as well as a small try-it-yourself page: nytimesfixer.vercel.app.
The General Algorithm
If the program identifies a word that appears to be uncommon, but becomes much more common when combined with an adjacent word, it treats this as a strong sign that it has caught a transcription error, and it combines the two words to apply the fix. A few additional algorithms and datasets are used alongside this to limit false positives, eliminate a couple of subtle bugs, and improve performance.
A simplified version of the algorithm looks something like:
- Split the article into words
- For each word, check if combining it with the next word would create a word that is more common than keeping the two words separate
- If so, combine the two words and add to the final list of words
- If not, make no changes and add the unchanged current word to the final list of words
- Combine the final list of words into a complete article and return
Here's some basic code for this:
```javascript
/* assume getCommonality returns a numerical value that is higher for more
   common words and a very low number for words not present in the dictionary */
const { getCommonality } = require('./dictionaryData');

const fixArticle = (articleContent) => {
  const words = articleContent.split(' ');
  const fixedWords = [];

  for (let i = 0; i < words.length; i++) {
    const curWord = words[i];
    const nextWord = words[i + 1];

    // the last word has no next word to potentially combine with
    if (nextWord === undefined) {
      fixedWords.push(curWord);
      break;
    }

    const sumOfCommonalities = getCommonality(curWord) + getCommonality(nextWord);
    const commonalityOfCombined = getCommonality(curWord + nextWord);

    if (commonalityOfCombined > sumOfCommonalities) {
      // combined version of the two words is better than the current arrangement
      fixedWords.push(curWord + nextWord);
      // skip next word since it has been combined with the current word
      i += 1;
    } else {
      // nothing to fix, so add the word as normal
      fixedWords.push(curWord);
    }
  }

  return fixedWords.join(' ');
};
```
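The getCommonality function itself isn't shown above. Here's a minimal sketch of one way it could work, assuming the dictionary data is a single word list ordered from most to least common (the words.json file and the scoring scheme here are hypothetical, not the repository's actual layout):

```javascript
// dictionaryData.js: hypothetical module backing getCommonality
// assumes words.json is an array of words ordered from most to least common
const words = require('./words.json');

const getCommonality = (word) => {
  // linear scan of the frequency-ordered list
  const index = words.indexOf(word.toLowerCase());
  // earlier in the list means more common, so score by distance from the end;
  // words missing from the dictionary get a very low score
  return index === -1 ? -Infinity : words.length - index;
};

module.exports = { getCommonality };
```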
For the final program, I added support for multiple paragraphs, word delimiters beyond just spaces, handling of punctuation and apostrophes within words, and a couple of other features that users of the program may find helpful.
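As one illustration of the punctuation handling (the helper below is my own sketch, not code from the repository), a token can be split into its leading punctuation, core word, and trailing punctuation, so that only the core word is checked against the dictionary:

```javascript
// hypothetical helper: separate a token into leading punctuation, the core
// word (letters and apostrophes), and trailing punctuation
const splitPunctuation = (token) => {
  const match = token.match(/^([^A-Za-z']*)([A-Za-z']*)([^A-Za-z']*)$/);
  return match
    ? { leading: match[1], word: match[2], trailing: match[3] }
    : { leading: '', word: token, trailing: '' };
};

splitPunctuation('mitment,'); // { leading: '', word: 'mitment', trailing: ',' }
splitPunctuation('"com');     // { leading: '"', word: 'com', trailing: '' }
```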
Here's a link to the code, written with Node.js, for more information: github.com/xtrp/nytimesfixer. The code I've published is pretty simple to download and run, and can be called and tested like so:
```javascript
const fixArticle = require('./path/to/nytimes-fixer/src/fixArticle/');

// article content can be any string, with paragraphs separated by line breaks
const articleContent =
  'A massive study of how the United States went to war in Indochina, con ducted by the Pentagon three years ago, demonstrates that four administrations progrestively developed a sense of com mitment to a non‐Communist Vietnam, a readiness to fight the North to pro tect the South, and an ultimate frustra tion with this effort—to a much greater extent than their public statements ac knowledged at the time.';

const fixedArticleContent = fixArticle(articleContent);
/* fixedArticleContent will be:
"A massive study of how the United States went to war in Indochina, conducted by the Pentagon three years ago, demonstrates that four administrations progrestively developed a sense of commitment to a non‐Communist Vietnam, a readiness to fight the North to protect the South, and an ultimate frustration with this effort—to a much greater extent than their public statements acknowledged at the time."
*/
```
This algorithm could undoubtedly be optimized in a number of key ways, primarily by searching the dictionary list more efficiently. The dictionary list is currently ordered from most to least common word, but since a dictionary search typically needs to look up a given word's commonality, pre-sorting a copy of the array alphabetically and searching it with something like binary search would be much more effective. This is just one optimization that immediately comes to mind.
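As a rough sketch of that optimization, building on the hypothetical frequency-ordered words list from the earlier sketch (which may not match the repository's actual data layout), the list can be paired with scores and sorted alphabetically once at startup, making each lookup O(log n) instead of a linear scan:

```javascript
// one-time setup: pair each word with its commonality score (higher for more
// common words), then sort a copy alphabetically for binary searching
const sortedEntries = words
  .map((word, index) => ({ word, commonality: words.length - index }))
  .sort((a, b) => (a.word < b.word ? -1 : a.word > b.word ? 1 : 0));

// binary search by word: O(log n) per lookup instead of a linear scan
const getCommonality = (target) => {
  let low = 0;
  let high = sortedEntries.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    const { word, commonality } = sortedEntries[mid];
    if (word === target) return commonality;
    if (word < target) low = mid + 1;
    else high = mid - 1;
  }
  return -Infinity; // word not found in the dictionary
};
```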
Fetching Article Content With Readability
Per their Terms of Service, The New York Times does not allow users to scrape their site, although I obtained the contents of a couple of articles under fair use, using Mozilla's Readability package, which is published on npm. The Readability package parses the HTML content of a page containing an article into the raw text of that article. It was originally built to power Reader View in Firefox. Here's what the code looks like for getting the raw text from one New York Times article, for example:
```javascript
const fetch = require('node-fetch');
const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

const getArticleContent = async (url) => {
  const pageHTML = await (await fetch(url)).text();
  const pageDOM = new JSDOM(pageHTML, { url, runScripts: 'outside-only' });

  // parse with Mozilla Readability
  const reader = new Readability(pageDOM.window.document);
  return reader.parse().textContent;
};

(async () => {
  // test with example article
  const url = 'https://www.nytimes.com/2021/03/24/business/biden-economy-infrastructure.html';
  const content = await getArticleContent(url);
  console.log(`${url} article content: ${content}`);
})();
```
Fixing Additional Errors
Of course, this extra-space error makes up only a portion of the many errors on The New York Times website and other news sites alike.

However, it's one of the only recurring types of errors I've seen; other errors, such as spelling and grammar mistakes, seem to be of a more journalistic editing variety.
In terms of detecting and fixing other errors, Mozilla Readability provides an effective foundation for extracting content from news sites, and there exist other APIs for fetching article content as well, such as NewsAPI.
While there are existing autocorrect and spell-checker packages, using Google's Ngram dataset along with a string-distance metric like Levenshtein distance could be helpful if you're interested in building your own.
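As a starting point for such a project, here's the textbook dynamic-programming implementation of Levenshtein distance (this is the standard algorithm, not code from any particular spell-checker package). Combined with Ngram frequencies, it would let a checker suggest common words that are only a small edit away from a misspelling:

```javascript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn string a into string b
const levenshtein = (a, b) => {
  // dp[i][j] holds the distance between the first i chars of a
  // and the first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // delete a character from a
        dp[i][j - 1] + 1, // insert a character into a
        dp[i - 1][j - 1] + substitutionCost // substitute (or match)
      );
    }
  }

  return dp[a.length][b.length];
};

levenshtein('progrestively', 'progressively'); // => 1
```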
While I'm currently not completely proficient with natural language processing, I've been getting more involved in machine learning recently and there are undoubtedly many applications for NLP in this general area of work.
Conclusion
I hope you found this article and the program interesting, and that they offered a useful look at parsing online articles and fixing natural-language errors.
Here's a link to a site that showcases the program, with example errors and fixes from real articles, as well as a try-it-yourself page: nytimesfixer.vercel.app.
And here's a link to the GitHub repository, which includes the code for the program: github.com/xtrp/nytimesfixer. The code for the algorithm is located at src/fixArticle/; everything else is just the code for the site. I'm considering publishing this as an npm package, so if you think that would be useful, feel free to contact me: gabriel@gabrielromualdo.com.
Last fall I reached out to an archivist at The New York Times about this issue. I was told that the transcription was done in the late 1990s using a system that translated article text from microfilm, which created a couple of errors, including the extra-space problem. I was told that I would likely be put in touch with a developer at The New York Times to discuss implementing a fix like this. In the meantime, I'll likely reach out to other members of the archiving and development teams to possibly move this project forward in that manner.
The algorithm has fixed hundreds of errors on The New York Times website and has the potential to fix thousands more. I really enjoyed working on this project overall.
Thanks for scrolling.
— Gabriel Romualdo, July 8, 2021