gomtch — find text even if it doesn’t want to be found!

Nicolas Augusto Sassi
Dec 30, 2020

A little bit of context

With the ongoing discussions about censorship and freedom of expression on social media, it is easy to lose sight of the technical side behind any decision a platform can make. However advanced the algorithms and AI that analyze user-published text may be, there is still little a machine can do when a clever user wants to hide the information being transmitted.

A good example is Leet, which was created to circumvent text filters on platforms and let users transmit unauthorized information.

Leet (or “1337”), also known as eleet or leetspeak, is a system of modified spellings used primarily on the Internet. It often uses character replacements in ways that play on the similarity of their glyphs via reflection or other resemblance. Additionally, it modifies certain words based on a system of suffixes and alternate meanings. There are many dialects or linguistic varieties in different online communities.¹

It is clear that the transmission of written information is more complex than simply “writing correctly”. As long as the interlocutor is able to understand the message, communication can be considered to be successfully established.

In visual word recognition a reader is able to recognize single letters at an abstract level independently of the physical properties of these letters. Readers perform fine-grained discrimination processes to correctly differentiate between very similar strings (e.g., foe and toe), while at the same time they overlook small differences to achieve the corresponding abstract letter representation from different allographs of the same lexical item (e.g., foot, FoOt, FOOT).²

This creates a challenge for the platforms. Even if they make an effort to punish users who violate the community’s terms of use (usually by spreading hate speech), how do they find the offending users?

Of course, one could hire employees just to analyze posts (as Facebook does at the time of writing) and use the results to build AI models that catch the “prohibited terms” before they even go live on the platform. However, it is important to bear in mind that AI models are laborious to develop, costly to run, and require a large (if not huge) set of training examples.

gomtch presents itself as a low-cost solution to this problem.

What is gomtch

gomtch is a tool written entirely in Go for document comparison. It ships with a set of built-in features for text normalization and case generalization, which the user configures per document through the API. At the end of the process, ideally, the user gets a map of matches, with the keys being the documents found and the values being the forms in which they were found in the text (e.g., the document is “friend” and the form is “fr!end”).

Solutions with the same purpose already exist in droves; gomtch is just another one with its own implementation, with a focus on giving the API user flexibility and on supporting languages other than English.

Unlike other tools with the same purpose, gomtch does not return a similarity percentage when comparing documents; instead, it directly returns which documents were found.

Some of the text standardization features already implemented are:

  • HTML parsing (removes all HTML tags and keeps only the text)
  • repeated-character removal (reaaal → real)
  • uppercase and lowercase normalization
  • Unicode normalization (canção → cancao)
  • character/term replacement through regex
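To make these features concrete, here is a minimal sketch of such a normalization chain using only the Go standard library. It is not gomtch's implementation: the names normalize, collapseRepeats, tagPattern, diacritics and leetRules are hypothetical, the tag stripping is a naive regex rather than a real HTML parser, and the diacritic table is a tiny hand-written stand-in for proper Unicode normalization.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagPattern naively matches anything that looks like an HTML tag
// (assumption: a real tool would use a proper HTML parser).
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// diacritics is a tiny hand-written folding table, standing in for
// full Unicode normalization.
var diacritics = strings.NewReplacer(
	"á", "a", "à", "a", "â", "a", "ã", "a",
	"é", "e", "ê", "e", "í", "i",
	"ó", "o", "ô", "o", "õ", "o",
	"ú", "u", "ç", "c",
)

// leetRules are hypothetical regex replacements for look-alike characters.
var leetRules = []struct {
	re *regexp.Regexp
	to string
}{
	{regexp.MustCompile(`[!1]`), "i"},
	{regexp.MustCompile(`3`), "e"},
	{regexp.MustCompile(`0`), "o"},
}

// collapseRepeats reduces every run of the same rune to one rune.
// Note that this also shortens legitimate doubles ("cool" becomes "col").
func collapseRepeats(s string) string {
	var b strings.Builder
	var prev rune
	for i, r := range s {
		if i > 0 && r == prev {
			continue
		}
		b.WriteRune(r)
		prev = r
	}
	return b.String()
}

// normalize applies the steps in a fixed order: strip tags, lowercase,
// fold diacritics, apply regex replacements, collapse repeated runes.
func normalize(s string) string {
	s = tagPattern.ReplaceAllString(s, " ")
	s = strings.ToLower(s)
	s = diacritics.Replace(s)
	for _, rule := range leetRules {
		s = rule.re.ReplaceAllString(s, rule.to)
	}
	return strings.TrimSpace(collapseRepeats(s))
}

func main() {
	fmt.Println(normalize("<b>Canção</b> reaaal")) // cancao real
	fmt.Println(normalize("FR!END"))               // friend
}
```

Go's regexp package has no backreferences, which is why the repeated-character step is a plain loop instead of a pattern like `(.)\1+`.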

Pipeline

gomtch does not differentiate between reference documents and comparison documents, so they are interchangeable.
When creating a new document from any text corpus, gomtch automatically normalizes and tokenizes it according to the settings provided by the user. Every standardization step, as well as the form of tokenization, is optional.
You can compare one document (the reference document) with n other documents (comparison documents). The more comparison documents there are, the longer the execution time. The size of the reference document also affects the runtime.

Finally, the solution returns a map whose keys are the indexes of the documents found and whose values are the forms in which those documents were found.
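The pipeline can be sketched roughly as follows. This is an illustration under assumptions, not gomtch's API: Document, NewDocument, Equal, Compare and looseEqual are hypothetical names, tokenization is plain whitespace splitting, and a comparison document counts as found when its tokens match a contiguous run of tokens in the reference.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for a tokenized text.
type Document struct {
	Tokens []string
}

// NewDocument lowercases the text and splits it on whitespace.
func NewDocument(text string) Document {
	return Document{Tokens: strings.Fields(strings.ToLower(text))}
}

// Equal decides whether two tokens count as the same.
type Equal func(a, b string) bool

// Compare slides each comparison document over the reference tokens and
// returns a map from the index of each document found to the forms in
// which it appeared in the reference.
func Compare(ref Document, docs []Document, eq Equal) map[int][]string {
	found := make(map[int][]string)
	for i, d := range docs {
		n := len(d.Tokens)
		if n == 0 {
			continue
		}
		for start := 0; start+n <= len(ref.Tokens); start++ {
			window := ref.Tokens[start : start+n]
			ok := true
			for j := range window {
				if !eq(window[j], d.Tokens[j]) {
					ok = false
					break
				}
			}
			if ok {
				found[i] = append(found[i], strings.Join(window, " "))
			}
		}
	}
	return found
}

// looseEqual is a toy rule that treats '!' as 'i'.
func looseEqual(a, b string) bool {
	r := strings.NewReplacer("!", "i")
	return r.Replace(a) == r.Replace(b)
}

func main() {
	ref := NewDocument("You are my fr!end")
	docs := []Document{NewDocument("friend"), NewDocument("enemy")}
	fmt.Println(Compare(ref, docs, looseEqual)) // map[0:[fr!end]]
}
```

The nested loops show why the runtime grows with both the number of comparison documents and the size of the reference document.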

Match scores and comparison heuristics

Each comparison is driven by a match-score chosen by the API user. The match-score can be as simple as a number representing the percentage by which a token must equal the token it is compared with (minimum match-score), or as complex as a user-defined function (custom match-score).

Match-scores are very important because they tell the tool how to handle special characters and numbers present in the documents. If the match-score is 100%, for example, gomtch will treat any alteration to a token as a difference (e.g., match-score 100%: friend != fri3nd). With softer match-scores or match-score functions, gomtch can treat different tokens as equal (e.g., match-score 90%: friend == fri3nd).
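As an illustration of the minimum match-score idea, here is a small sketch that scores two tokens by the share of positions holding the same character and compares that score with a threshold. The functions similarity and matches are hypothetical, and the exact formula gomtch uses is not described in this article, so the thresholds in this sketch are not directly comparable with the percentages quoted above.

```go
package main

import "fmt"

// similarity returns the percentage of positions at which a and b hold
// the same rune. Tokens of different length score 0 in this sketch.
func similarity(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) || len(ra) == 0 {
		return 0
	}
	same := 0
	for i := range ra {
		if ra[i] == rb[i] {
			same++
		}
	}
	return 100 * float64(same) / float64(len(ra))
}

// matches reports whether the two tokens reach the minimum score.
func matches(a, b string, minScore float64) bool {
	return similarity(a, b) >= minScore
}

func main() {
	fmt.Printf("%.1f\n", similarity("friend", "fri3nd")) // 83.3
	fmt.Println(matches("friend", "fri3nd", 100))        // false
	fmt.Println(matches("friend", "fri3nd", 80))         // true
}
```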

It is important to keep in mind that, by default, gomtch compares numbers only with numbers and requires complete equality (e.g., 1 == 1). Likewise, gomtch compares letters only with letters (e.g., A == A). Finally, comparisons of numbers with letters and of letters with special characters are governed by the match-score defined by the user.
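These default rules can be sketched as a character-level comparison in which same-class pairs need exact equality and every other pair is delegated to a user-supplied rule. MixedRule, runesEqual, tokensEqual, leetRule and the lookalikes table are hypothetical names chosen for this example, not part of gomtch; sending pairs of two special characters to the user rule is also an assumption, since the article does not cover that case.

```go
package main

import (
	"fmt"
	"unicode"
)

// MixedRule decides whether two runes of different classes are equal.
type MixedRule func(a, b rune) bool

// runesEqual requires exact equality for digit/digit and letter/letter
// pairs and defers every other pair to the mixed rule.
func runesEqual(a, b rune, mixed MixedRule) bool {
	switch {
	case unicode.IsDigit(a) && unicode.IsDigit(b):
		return a == b
	case unicode.IsLetter(a) && unicode.IsLetter(b):
		return a == b
	default:
		return mixed(a, b)
	}
}

// tokensEqual compares two tokens of the same length rune by rune.
func tokensEqual(a, b string, mixed MixedRule) bool {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return false
	}
	for i := range ra {
		if !runesEqual(ra[i], rb[i], mixed) {
			return false
		}
	}
	return true
}

// lookalikes maps a few substitute characters to the letter they imitate.
var lookalikes = map[rune]rune{'3': 'e', '!': 'i', '0': 'o', '4': 'a'}

// leetRule accepts identical runes and known look-alike pairs.
func leetRule(a, b rune) bool {
	return a == b || lookalikes[a] == b || lookalikes[b] == a
}

// strictRule accepts only identical runes.
func strictRule(a, b rune) bool { return a == b }

func main() {
	fmt.Println(tokensEqual("friend", "fri3nd", leetRule))   // true
	fmt.Println(tokensEqual("friend", "fri3nd", strictRule)) // false
	fmt.Println(tokensEqual("r2d2", "r3d2", leetRule))       // false
}
```

The last call shows that two digits are never handed to the custom rule: 2 and 3 are both numbers, so only complete equality counts.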

References

¹ https://pt.wikipedia.org/wiki/Leet

² Nicola Molinaro, Jon Andoni Duñabeitia, Alejandro Marín-Gutiérrez, Manuel Carreiras. From numbers to letters: Feedback regularization in visual word recognition.

Does this project seem worthwhile, and do you want to contribute somehow?

  • Use it
  • Give it some stars
  • Build something with gomtch
  • Tell people about the project
  • Raise issues and open pull requests to make it better
