Benchmarking: How does SentiLecto NLU API perform in Spanish with respect to other APIs? (part 2)
In this post, we’ll make a qualitative analysis of the output of four Natural Language Understanding APIs: IBM Watson, Google, MeaningCloud and SentiLecto. We will make comparisons across different tasks: complex syntactic phenomena, named-entity recognition and sentiment analysis. Our results show how SentiLecto, NaturalTech’s Natural Language Understanding engine, consistently produces higher-quality text analytics output than the other APIs.
No matter what industry you are in, written text is a crucial component of how you operate. To learn more about how you can leverage SentiLecto and gain an edge over your competitors, contact us.
If you want to know how SentiLecto performed in basic syntax operations, read our previous post. Those metrics are important because the resolution of more complex operations depends on the resolution of simpler phenomena.
1. Advanced Syntactic Phenomena
We’ll begin with syntactic phenomena that, for different reasons, pose a challenge to most NLU systems. We’ll try examples that contain the phenomenon, show you how the APIs handled the input and comment on why the output was desirable (or not).
i. Phrasal-Verb Complement
In Spanish, there are certain verbs that require special types of complements, which are called phrasal-verb complements. These complements are prepositional phrases (they usually have the form preposition + noun phrase), hence the name. For example, take the following sentence:
María cuenta con Juan (trans. María counts on Juan)
If you think about the verb of the sentence, contar con (not to be confused with the verb contar, which means to enumerate or count), it always calls for a complement that starts with the preposition con, followed by a noun phrase (in this case, Juan). Now that we know what a phrasal-verb complement is, let’s test the APIs.
(a) El terrorista arremetió contra tres turistas en su camioneta.
Trans. The terrorist plowed a van into three tourists.
As you can see, SentiLecto correctly recognizes tres turistas as the phrasal-verb complement and su camioneta as the instrument used to perform the action.
One of the most conspicuous characteristics of human language is that its constituents are organized in a hierarchical fashion. If you think about our example, the constituent that is demanded by the verb, contra tres turistas, is more important than the instrumental complement con su camioneta. If we removed the former, the sentence would feel incomplete, whereas removing the latter would still yield a grammatical, complete sentence. Not being able to model this means valuable information that could be leveraged further along the pipeline is lost.
Google and MeaningCloud have this problem. They mark contra tres turistas as a simple complement, overlooking the fact that it has a different, more prominent syntactic role in the sentence than the other complement, con su camioneta.
In the case of IBM Watson (remember that it doesn’t support syntactic role analysis, so we use its semantic role API), it simply doesn’t assign any syntactic role to the phrasal-verb complement, not even a simple ‘object’ label.
We tested other sentences with phrasal-verb complements such as the following:
(b) El hombre renunció a Microsoft. (Trans. The man quit Microsoft)
SentiLecto correctly recognizes the phrasal-verb complement (check it yourself):
MeaningCloud recognizes it as a Direct Object, probably because of the presence of the preposition a:
Google labels it as a simple prepositional phrase:
And Watson identifies it as an object, which is a sound response considering how close the role of phrasal-verb complement is to that of the direct object.
Now, let’s go to sentence c:
(c) Con cada cuatrimestre yo debo lidiar con los alumnos con mucha paciencia. (Trans. Every term, I must deal with the students with a lot of patience)
This input triggered a behavior analogous to what we saw in sentence (a): SentiLecto recognized the phrasal-verb and other complements correctly; MeaningCloud and Google simply recognized three complements, and Watson recognized con los alumnos as the object (check it yourself).
In conclusion, SentiLecto was the only one that could satisfactorily recognize the phrasal-verb complements (and other specific syntactic roles such as instrumental or time complements).
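To make the idea concrete, here is a minimal sketch of how a verb-to-preposition lexicon could flag phrasal-verb complements. The lexicon entries and the naive whitespace tokenization are illustrative assumptions for this post, not SentiLecto’s actual implementation:

```python
# Toy sketch: flagging phrasal-verb complements with a verb-to-preposition
# lexicon (Spanish "régimen preposicional"). Illustrative only.

# Spanish verbs that govern a specific preposition.
PHRASAL_VERBS = {
    "cuenta": "con",        # contar con -> to count on
    "arremetió": "contra",  # arremeter contra -> to charge at
    "renunció": "a",        # renunciar a -> to quit / resign from
    "lidiar": "con",        # lidiar con -> to deal with
}

def find_phrasal_complement(tokens):
    """Return (verb, complement tokens) when the verb's governed
    preposition introduces the phrase immediately after it."""
    for i, tok in enumerate(tokens):
        prep = PHRASAL_VERBS.get(tok.lower())
        if prep and i + 1 < len(tokens) and tokens[i + 1].lower() == prep:
            return tok, tokens[i + 1:]
    return None

print(find_phrasal_complement("María cuenta con Juan".split()))
# -> ('cuenta', ['con', 'Juan'])
```

A real engine, of course, needs a full dependency parse rather than adjacency: in sentence (c), three complements start with con, and only the lexicon plus syntactic structure tells the phrasal-verb complement apart from the adjuncts.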
ii. Anaphora resolution
Imagine the following text:
Sounds odd, right? As speakers, there are many strategies available to us to make texts less redundant and more effective. The above snippet could be rewritten as:
As speakers, we usually introduce the entities we want to talk about by their names first, but once they’ve entered the stage, so to speak, and the addressee knows what we are talking about, we simply use pronouns. When you read the fragment, I’m sure you didn’t doubt who ‘he’, ‘him’, ‘she’, or ‘her’ were referring to. This relationship between a pronoun and its referent is called anaphora in grammar, and identifying the referent of a pronoun is a task known in NLP as anaphora resolution.
This is a non-trivial task: an NLU engine must determine what each pronoun refers to, and since this strategy is ubiquitous in text and speech, we tested the APIs’ ability to resolve anaphoras.
(d) El hermano de mi padre decidió comprarme un auto, pero yo lo rompí. (Trans. My father’s brother decided to buy me a car, but I broke it.)
As you can see, SentiLecto is capable of recognizing that there are two clauses in the sentence, delimited by the word pero, which is why it yields two outputs. What is interesting is that in the second clause, it correctly detects lo as the direct object. Moreover, because lo is a pronoun, the system knows that it must have an antecedent and correctly points to un auto. This is challenging because in Spanish, it could also have pointed to the subject of the first clause, el hermano de mi padre, but SentiLecto isn’t confused by this.
What happens with MeaningCloud’s output? It mistakes El hermano de mi padre for the subject of lo rompí. Google doesn’t assign an antecedent to lo, as you can see in the output below, although it correctly recognizes it as the object of the verb rompí.
As you can see, Watson doesn’t even take into account the second clause of the sentence, much less analyze the role of lo or its antecedent:
Before moving on to the next section, let’s quickly go through another example of anaphora resolution:
(e) Según explica el manual de historia, el Gral. José de San Martín, quien es el prócer nacional de Argentina, liberó a Chile en 1818.
(Trans. As the history manual explains, General José de San Martín, who is Argentina’s national hero, freed Chile in 1818)
Why is this challenging? Because to understand the meaning of the sentence, the APIs must resolve the antecedent of the relative pronoun quien (who), which in this case is el Gral. José de San Martín.
MeaningCloud outputs manual de historia (history manual) as the subject of es el prócer nacional de Argentina (is Argentina’s national hero) and of liberó a Chile en 1818 (freed Chile in 1818), when the person who did those things is actually General San Martín.
Manual de historia is also recognized as the object of explica (explains), with José de San Martín as the subject. This would mean that José de San Martín explains the history manual, which is not the meaning of the sentence.
Google correctly recognizes the subject of es and liberó (José de San Martín). However, it doesn’t recognize manual de historia as the subject of explica.
Watson, surprisingly, recognizes the subject of es and liberó, although it makes the same mistake as MeaningCloud: it outputs José de San Martín as the subject of explica and manual de historia as the object.
As for SentiLecto, it can correctly handle this input. Check it yourself.
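To illustrate why sentence (d) is hard, here is a toy sketch of recency-based anaphora resolution: the pronoun is linked to the most recent preceding mention whose gender and number agree with it. The tiny mention list and feature tags are hand-made assumptions for this example, not how any of the tested engines actually works:

```python
# Toy sketch of recency-based anaphora resolution. Illustrative only.

PRONOUN_FEATURES = {
    "lo": ("masc", "sing"),   # it / him (masculine singular)
    "la": ("fem", "sing"),
    "los": ("masc", "plur"),
}

def resolve(pronoun, mentions):
    """mentions: list of (phrase, gender, number) in textual order.
    Returns the closest earlier mention agreeing with the pronoun."""
    gender, number = PRONOUN_FEATURES[pronoun]
    for phrase, g, n in reversed(mentions):
        if g == gender and n == number:
            return phrase
    return None

# (d) El hermano de mi padre decidió comprarme un auto, pero yo lo rompí.
mentions = [
    ("el hermano de mi padre", "masc", "sing"),
    ("un auto", "masc", "sing"),
]
print(resolve("lo", mentions))  # -> 'un auto'
```

Note that both candidates agree with lo in gender and number, so the recency heuristic only happens to pick the right one here. A robust resolver also needs semantic cues, for instance that romper (to break) usually takes an inanimate object, which is exactly the kind of knowledge that separates the engines in this test.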
2. Named Entity Recognition
SentiLecto’s concept of an entity is that which is opinable, so anything in the text we could assign an opinion to counts as an entity. It also distinguishes between these plain entities and named entities, which refer to precise, single entities that exist in the world, such as a person or an organization. Again, we are going to try a couple of examples and comment on the performance of the APIs.
(g) A dos días de que Bruselas apruebe su estrategia digital, el fundador de Facebook, Mark Zuckerberg, se ha reunido este lunes en Bruselas con altos cargos comunitarios. (source: https://elpais.com/economia/2020/02/17/actualidad/1581968051_809001.htm)
(Trans. Two days before Brussels approves its digital strategy, Facebook founder, Mark Zuckerberg, has met on Monday in Brussels with senior community officials).
When given this input, SentiLecto recognizes all the entities, whether they are of the plain, more general kind like su estrategia digital (its digital strategy) or named entities (Bruselas, Mark Zuckerberg).
It’s also very interesting to note that SentiLecto can tell that in the first appearance of Bruselas, the writer isn’t talking about the city as a physical space. Instead, they are using figurative speech, so Bruselas really means the European Commission that is about to approve a digital strategy, which is why it is classified as an organization. Lastly, SentiLecto can recognize that Mark Zuckerberg refers to the same entity as el fundador de Facebook.
With Google, you can see how it doesn’t recognize Mark Zuckerberg and el fundador de Facebook as the same entity. The meaning nuance of Bruselas is also lost:
MeaningCloud only recognizes named entities and doesn’t map Mark Zuckerberg and el fundador de Facebook to the same entity:
The output of Watson is very similar:
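The plain-vs-named distinction and the alias merging can be sketched in a few lines. The capitalization heuristic and the hand-made alias table below are illustrative assumptions, standing in for the coreference component a real engine would use:

```python
# Toy sketch of the plain-vs-named entity distinction. A mention whose
# words are all capitalized is treated as a named entity; anything else
# as a plain (opinable) entity. Illustrative only.

# Hand-made alias table standing in for real coreference resolution.
ALIASES = {"el fundador de Facebook": "Mark Zuckerberg"}

def classify(mention):
    canonical = ALIASES.get(mention, mention)
    words = canonical.split()
    if all(w[0].isupper() for w in words if w[0].isalpha()):
        return canonical, "named"
    return canonical, "plain"

for m in ["Bruselas", "Mark Zuckerberg", "su estrategia digital",
          "el fundador de Facebook"]:
    print(m, "->", classify(m))
```

Even this toy version shows what the weaker outputs miss: el fundador de Facebook should resolve to the same canonical entity as Mark Zuckerberg, while su estrategia digital is still a perfectly good opinable entity despite not being a name.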
3. Sentiment Analysis
Sentiment Analysis is a complex task for any algorithm. It consists in extracting subjective information from source materials like news, social media, etc. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic. SentiLecto’s deep syntactic knowledge allows it to build an accurate fact representation and to perform entity-based sentiment analysis.
If you want to know more about how it does it, click here.
Let’s see an example:
(h) La sociedad colombiana está transformándose tanto por la llegada de venezolanos como porque el proceso de paz abrió un espacio para que otras demandas diferentes al conflicto, económicas o culturales, sean del interés nacional. (source: https://www.bbc.com/mundo/noticias-america-latina-51469238)
(Trans. Colombian society is being transformed both by the arrival of Venezuelans and because the peace process opened a space where other demands different from the armed conflict, like economic or cultural demands, can become of national interest)
Both Google and SentiLecto recognized that the sentiment of this sentence is positive, as it talks about a transformation that responds to new demands and leaves the Colombian armed conflict behind.
Watson and MeaningCloud, on the other hand, couldn’t detect any polarity at all.
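A minimal lexicon-based polarity scorer gives a feel for the baseline these engines must beat. The tiny Spanish lexicon and its scores are illustrative assumptions; entity-based sentiment as performed by SentiLecto also relies on the syntactic analysis described above, which this toy version skips entirely:

```python
# Minimal lexicon-based polarity sketch: sum per-word scores and report
# the sign of the total. Illustrative only.

LEXICON = {
    "paz": 1.0,              # peace
    "interés": 0.5,          # interest
    "conflicto": -1.0,       # conflict
    "transformándose": 0.5,  # being transformed
    "abrió": 0.5,            # opened (a space for...)
}

def polarity(text):
    score = sum(LEXICON.get(w.strip(".,").lower(), 0.0)
                for w in text.split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

sentence = ("La sociedad colombiana está transformándose tanto por la llegada "
            "de venezolanos como porque el proceso de paz abrió un espacio "
            "para que otras demandas diferentes al conflicto, económicas o "
            "culturales, sean del interés nacional.")
print(polarity(sentence))  # -> 'positive'
```

Sentence (h) comes out positive here only because the positive words outnumber conflicto; a bag-of-words scorer like this cannot tell that the conflict is precisely what the sentence says is being left behind, which is where syntax-aware, entity-based analysis earns its keep.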
As you can see, SentiLecto has a consistent performance throughout all the tasks that we have tested. This, combined with its understanding of Spanish syntax, makes it your go-to choice when looking for NLU solutions. If you have any questions, please contact us at: email@example.com