Natural Language Generation & Robojournalism: How TecnoNews automatically publishes news that is more informative than its sources
If you visit EntretenimientoBit, a news blog (or augmented newsroom) that publishes 200 quality posts per day drawn from 300 sources, you can see TecnoNews’ powerful rewriting algorithm in action. The articles it publishes are created by merging and enriching the coverage of various media outlets. How does it do it? This blog post gives you an overall idea of how it works.
The pipeline starts with a news text, which we’ll call T0. When it enters the system, it is categorized, tagged and geolocated. With this information, T0 is checked against the news articles already stored in the database to see whether any of them covers the same event. If T0 doesn’t appear to be related to any stored article, it is labeled as trivial (although this can change if a new article covering the same event enters the pipeline and the process is triggered again). We use several criteria to determine whether two articles refer to the same event, including the title, the content (syntactic cues, synonyms, etc.), the most relevant named entities and the date of publication.
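To make the matching step more concrete, here is a minimal sketch of how several criteria could be combined into one "same event" score. It is not TecnoNews’ actual implementation: the `Article` class, the weights, the three-day window and the 0.35 threshold are all assumptions chosen for illustration, and plain word overlap stands in for the richer syntactic and synonym cues mentioned above.

```python
import re
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    """Hypothetical article record; the field names are illustrative."""
    title: str
    body: str
    published: date
    entities: set = field(default_factory=set)


def tokens(text):
    """Lowercased set of word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_event_score(a, b, max_days=3):
    """Weighted mix of title, body, entity and date similarity (assumed weights)."""
    day_gap = abs((a.published - b.published).days)
    date_score = max(0.0, 1.0 - day_gap / max_days)
    return (0.3 * jaccard(tokens(a.title), tokens(b.title))
            + 0.3 * jaccard(tokens(a.body), tokens(b.body))
            + 0.3 * jaccard(a.entities, b.entities)
            + 0.1 * date_score)


def find_related(t0, stored, threshold=0.35):
    """Stored articles that appear to cover the same event as t0.

    An empty result corresponds to T0 being labeled as trivial.
    """
    return [art for art in stored if same_event_score(t0, art) >= threshold]
```

In a sketch like this, an empty list from `find_related` is what would lead to T0 being labeled as trivial, and re-running it when a new article arrives is what could change that label.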
Once it has done this, TecnoNews creates a NewsTree:
This is a convenient way of showing how T0 is related to previously published articles from a wide range of media outlets (we will call these articles T1-T3). After this, the idea is to choose which information will become the backbone of the new, rewritten article, T0'. TecnoNews extracts facts from T1-T3 that were not present in the original news article.
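One simple way to picture the extraction of facts that T0 lacks is to treat each sentence as a candidate fact and keep only those whose words are not already mostly covered. The sketch below does exactly that; it is an illustrative simplification, not the method TecnoNews uses, and the function names and the 0.6 coverage threshold are assumptions.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def is_covered(sentence, known, threshold=0.6):
    """True if some known sentence contains most of this sentence's words."""
    words = tokens(sentence)
    if not words:
        return True
    return any(len(words & tokens(k)) / len(words) >= threshold for k in known)


def novel_facts(t0, related, threshold=0.6):
    """Sentences from the related articles (T1-T3) not already covered by T0."""
    known = split_sentences(t0)
    extras = []
    for text in related:
        for sentence in split_sentences(text):
            if not is_covered(sentence, known, threshold):
                extras.append(sentence)
                known.append(sentence)  # avoid adding the same fact twice
    return extras
```

A production system would work on extracted facts rather than raw sentences, but the idea of filtering T1-T3 against what T0 already says is the same.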
Based on information about the topics, the algorithm checks whether it can enrich the story with facts and an appropriate image extracted from Wikipedia. A short racconto from T4 might also be added, that is, a brief retelling of a related fact that happened some time before the events discussed in T0.
The algorithm then applies rephrasing strategies to make sure the new content is not flagged as plagiarism.
A tool that can be used to compare the original article with the one rewritten by TecnoNews is Copyscape.
If we choose an article generated by TecnoNews (T0') and compare it with its source (T0), Copyscape will generate two percentages, one for each article, indicating how much of one text it finds in the other. Obviously, the algorithm’s goal is not to make something 100% original, since that would mean we are no longer speaking about the same facts. This is why there is always some degree of overlap between T0 and T0'.
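To see why such a comparison yields two percentages rather than one, consider the toy measure below. It counts shared three-word sequences (shingles) and reports them as a share of each text in turn. This is only an illustration of a two-way overlap measure; it makes no claim about how Copyscape computes its figures, and the function names and shingle size are assumptions.

```python
import re


def shingles(text, n=3):
    """Set of all n-word sequences in a text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_percentages(t0, t0_prime, n=3):
    """Return (% of t0 found in t0_prime, % of t0_prime found in t0)."""
    a, b = shingles(t0, n), shingles(t0_prime, n)
    shared = a & b
    pct_a = 100.0 * len(shared) / len(a) if a else 0.0
    pct_b = 100.0 * len(shared) / len(b) if b else 0.0
    return pct_a, pct_b


print(overlap_percentages("the cat sat on the mat",
                          "the cat sat on the sofa and slept"))
# prints (75.0, 50.0)
```

The two numbers differ because the texts differ in length: an enriched T0' that adds facts from other sources will typically contain a larger share of T0 than T0 contains of T0'.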
Once this process is finished, TecnoNews will enrich T0' with its infovis features, such as knowledge graphs. This new article can serve as a first draft that needs minimal revision, but it’s pretty much ready to be published.
If you are intrigued by this process and by how content generation might improve the organic traffic of your website or concisely summarize thousands of news articles, please visit our webpage or contact us and we will be more than happy to help.