Benchmarking: How does the SentiLecto NLU API perform in Portuguese compared with other APIs? (part 1)
In this new post, we are going to talk about how SentiLecto compares to two well-known NLU engines for text in Portuguese, Google Natural Language and spaCy, across three basic syntax tasks: subject extraction, object extraction and role extraction in passive voice sentences. We chose these two APIs because they are among the very few products that offer adequate full parsing for Portuguese.
We created a battery of sentences for each task, taken from the BBC News web portal in Portuguese. We hand-checked every single output and awarded marks even when it differed from the expected output, as long as the answer was plausible. We used the same approach in previous posts for Spanish, and again SentiLecto outperforms the others!
Averaged over the three tasks, SentiLecto scores higher than the other two APIs in our tests.
If you want to know more about our results, the methodology and other details, continue reading our blog post. To see the test data and the outputs, visit our repo. If you are interested in leveraging SentiLecto’s capabilities or have any questions or comments, please write to firstname.lastname@example.org and we will get back to you as soon as possible.
SentiLecto is the natural language understanding technology of NaturalTech that processes language the way native speakers do.
First of all, if you have doubts about the basic syntactic concepts we are about to discuss, you can go to the first benchmarking post in this series, which tested Spanish input. There, you’ll find a detailed explanation alongside many useful linguistic details that also apply to this blog post.
The methodology is very similar to the one we used in the first post. We devised a battery of three tests with a total of 45 inputs. In the following sections, we will break down the results of all three tests.
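To make the scoring concrete, here is a minimal Python sketch of how per-task scores and an overall average could be computed from hand-assigned marks. The function names, the 0-to-1 mark scale, the unweighted average and the toy marks are all assumptions made for illustration; they are not our real results or necessarily the exact procedure we followed.

```python
from statistics import mean


def task_score(marks):
    """Average mark for one task; each mark is between 0 (wrong) and 1 (correct or plausible)."""
    return sum(marks) / len(marks)


def overall_score(marks_by_task):
    """Unweighted average of the per-task scores (assumed aggregation)."""
    return mean(task_score(marks) for marks in marks_by_task.values())


# Made-up marks for a hypothetical API, purely illustrative (not benchmark data).
toy_marks = {
    "subject_extraction": [1, 1, 0, 1],
    "object_extraction": [1, 1, 1, 1],
    "passive_voice_roles": [1, 0, 1, 1],
}

for task, marks in toy_marks.items():
    print(f"{task}: {task_score(marks):.2f}")
print(f"overall: {overall_score(toy_marks):.2f}")
```

Averaging the per-task scores, rather than pooling every input together, keeps each task equally weighted even if the tasks have different numbers of inputs.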
In our results, all three APIs perform strongly when it comes to object extraction. Let’s see an example to get an idea of what this task is about:
Coronavírus: por que não houve casos confirmados na América Latina?
Trans: Coronavirus: why haven’t there been confirmed cases in Latin America?
Why is the sentence challenging?
Because it opens with a constituent, ‘Coronavírus:’, that is only loosely related to the rest of the sentence.
As you can see, SentiLecto doesn’t get distracted and seamlessly extracts the direct object. It also separates the sentence into two clauses: ‘Coronavirus:’ and ‘why haven’t there been confirmed cases in Latin America?’. Google also does a good job:
spaCy’s output for the same sentence is not what we expected:
As for subject extraction, SentiLecto performed better than the rest of the APIs. Let’s illustrate this with an example:
Por isso, equipe de investigadores dos Estados Unidos descobriu outro animal que pode fazer a mesma tarefa.
Trans: This is why a team of researchers from the United States discovered another animal that could do the same task.
SentiLecto correctly extracts not only the subject but also the other roles in the clause:
Again, Google correctly performs the task:
There are many caveats when it comes to the syntax of the passive voice. We advise you to read our previous benchmarking blog post to see the differences between the narrow and broad conceptions of roles. In this test, the idea is to score the object, subject and verb extraction for each input. Once again, SentiLecto solves all the inputs we passed to it:
Let’s see an example:
Coronavírus: estudante de Cingapura é agredido em ‘ataque racista’ em Londres
Trans: Coronavirus: Singaporean student was attacked in ‘racist attack’ in London
Notice how SentiLecto correctly recognizes the verb and the direct object. It can also tell that another person performed the attack, even though that isn’t explicit in the sentence. For this reason, as a subject, it says ‘(implicit) someone’. One of the advantages of SentiLecto is that it can deal with implicit and tacit subjects, which is something the other APIs don’t take into consideration.
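The role mapping described here can be illustrated with a small Python sketch. It is not SentiLecto’s implementation: it assumes a hand-written, simplified dependency parse using Universal Dependencies-style labels (`nsubj:pass`, `obl:agent`), and the helper names `subtree` and `passive_roles` are hypothetical. The idea is that the grammatical subject of a passive clause is reported as the object of the action, and when no agent phrase is present the subject is reported as implicit.

```python
def subtree(tokens, root_id):
    """Return the words of root_id and all of its descendants, in sentence order."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for tok in tokens:
            if tok["head"] in ids and tok["id"] not in ids:
                ids.add(tok["id"])
                changed = True
    return " ".join(tok["form"] for tok in tokens if tok["id"] in ids)


def passive_roles(tokens):
    """Map a passive clause to verb, object (the patient) and subject (the agent)."""
    verb = next(tok for tok in tokens if tok["deprel"] == "root")
    children = [tok for tok in tokens if tok["head"] == verb["id"]]
    patient = next((t for t in children if t["deprel"] == "nsubj:pass"), None)
    agent = next((t for t in children if t["deprel"] == "obl:agent"), None)
    return {
        "verb": verb["form"],
        "object": subtree(tokens, patient["id"]) if patient else None,
        # No explicit agent phrase: report an implicit performer of the action.
        "subject": subtree(tokens, agent["id"]) if agent else "(implicit) someone",
    }


# Hand-written parse of a shortened version of the example (an assumption,
# not the output of any of the three APIs).
toy_parse = [
    {"id": 1, "form": "estudante", "head": 5, "deprel": "nsubj:pass"},
    {"id": 2, "form": "de", "head": 3, "deprel": "case"},
    {"id": 3, "form": "Cingapura", "head": 1, "deprel": "nmod"},
    {"id": 4, "form": "é", "head": 5, "deprel": "aux:pass"},
    {"id": 5, "form": "agredido", "head": 0, "deprel": "root"},
    {"id": 6, "form": "em", "head": 7, "deprel": "case"},
    {"id": 7, "form": "Londres", "head": 5, "deprel": "obl"},
]

print(passive_roles(toy_parse))
```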
Google, yet again, performs well. spaCy, on the other hand, mistakes ‘Coronavírus’ for the subject and treats ‘estudante de Cingapura’ as an apposition referring to that subject.
We tested our API against two others. The three of them have very solid performances in Portuguese, which is in itself an achievement, as not many products can solve the syntactic complexities that characterize this language. SentiLecto outperformed the other two APIs.