
Estimated Time: 3 minutes

Multi-word synonyms in Solr

The medical domain has many ways of saying the same thing, so it is handy to use and maintain multi-word synonym thesauri.

Solr recently released a built-in solution for handling multi-word synonyms, and I have tested it successfully:

Some questions remain: 1) How many synonyms can Solr handle and still perform well? 2) How can synonyms be added or modified without restarting? 3) Which query parser works best with multi-word synonyms?

The simple query parser

Given this field: text:"say me bye or say hello world", and the following configuration:

<example>

Note: The simple query parser has some limitations compared to the complexphrase parser:

However, it has some advantages:

How to handle negation

Solr can search for the absence of a word or phrase, but it has no notion of a negated word or phrase. One solution I found is to exploit multi-word synonyms by mapping all the various ways of expressing negation to a single token, neg. It is then possible to ask for the absence of a negation.

never had, never had any, has no, not, no anymore => neg

This can be used in this way:
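As a rough illustration outside Solr, the Python sketch below mimics the mechanism: negation phrases are collapsed into the single token neg, and a term counts as affirmed only when no neg token appears shortly before it. The function names and the two-token window are assumptions made for this sketch; in Solr the collapsing is done by the synonym filter at analysis time, not by application code.

```python
# Mirrors the synonym rule above; hypothetical stand-in for Solr's synonym filter.
NEGATION_PHRASES = ["never had any", "never had", "has no", "no anymore", "not"]


def collapse_negations(text):
    """Replace each negation phrase with the token 'neg', longest phrase first."""
    tokens = text.lower().split()
    phrases = sorted((p.split() for p in NEGATION_PHRASES), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tokens[i:i + len(phrase)] == phrase:
                out.append("neg")
                i += len(phrase)
                break
        else:
            # No negation phrase starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def mentions_without_negation(text, term, window=2):
    """True if `term` occurs and no 'neg' token sits within `window` tokens before it."""
    tokens = collapse_negations(text)
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return False
    return all("neg" not in tokens[max(0, i - window):i] for i in positions)
```

With this sketch, "The patient has fever" counts as a mention of fever, while "The patient has no fever" does not.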

Generalisation of negation

For a given simple, low-level full-text filter such as a word or a phrase:

Simplifying the interface for the user

The most user-friendly interface, to my mind, would allow people to write groups of words and combine them with logical operators such as AND/OR/NOT. You don't want them to have to manage phrases and words as well. So when a user looks for:

They might require the words to be ordered and adjacent, but by default we tolerate some distance:

To exclude negated patterns, we apply the method above:

People might add other groups and combine them:

The above example would match a text containing see you soon and not containing an other example. Any multi-word synonym is translated along the way. This also means the user is not able to use wildcard or fuzzy search. However, wildcards can be replaced by stemming, which has the advantage of not degrading performance in the case of a very narrow wildcard query, and of being transparent to the end user. There is also a possible replacement for the fuzzy feature that is transparent to the user.
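The translation from user-level groups to a query string can be sketched as follows. This is only an illustration: it emits standard Lucene query syntax (a quoted phrase with ~N slop, combined with AND/OR/NOT), which is an assumption and not necessarily the syntax used in the original setup. The field name text, the default slop of 3 and all function names are hypothetical.

```python
ALLOWED_OPERATORS = {"AND", "OR", "AND NOT"}


def phrase_clause(words, field="text", slop=3):
    """Turn a group of words into a sloppy phrase clause on `field`."""
    cleaned = " ".join(words.replace('"', " ").split())
    return f'{field}:"{cleaned}"~{slop}'


def group_clause(words, field="text", slop=3):
    """Match the group, but exclude documents where it follows a 'neg' token."""
    positive = phrase_clause(words, field, slop)
    negated = phrase_clause("neg " + words, field, slop)
    return f"({positive} AND NOT {negated})"


def build_query(first, *rest, field="text", slop=3):
    """Combine groups left to right; each extra item is (operator, words)."""
    query = group_clause(first, field, slop)
    for operator, words in rest:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        if operator == "AND NOT":
            # An excluded group is matched as a plain phrase.
            clause = phrase_clause(words, field, slop)
        else:
            clause = group_clause(words, field, slop)
        # Parenthesise at every step so precedence stays explicit.
        query = f"({query} {operator} {clause})"
    return query
```

For instance, build_query("see you soon", ("AND NOT", "an other example")) yields a query that requires a non-negated see you soon and rejects documents containing an other example. The user only ever types words and operators; quoting, slop and the neg handling stay hidden.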

Fuzzy search with synonyms

Word embeddings offer the opportunity to produce typo synonyms for free. I have successfully tested a fairly large list of them (~200k entries), and performance was not impacted. So the most common errors are handled transparently for end users.
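How the list was built is not detailed here. One plausible approach, assumed for this sketch, is to take the nearest neighbours of each known word in the embedding space, keep those that are out of vocabulary but within a small edit distance, and write them as one-way synonym rules. The neighbours mapping below is a hypothetical stand-in for the output of an embedding model, and the distance threshold is arbitrary.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def typo_rules(neighbours, vocabulary, max_distance=2):
    """Build 'typo => word' rules from embedding neighbours.

    `neighbours` maps a correctly spelled word to its nearest neighbours
    (hypothetical input); neighbours that are themselves in `vocabulary`
    are treated as real words and skipped.
    """
    rules = []
    for word in sorted(neighbours):
        if word not in vocabulary:
            continue
        for candidate in neighbours[word]:
            if candidate in vocabulary:
                continue
            if 0 < edit_distance(candidate, word) <= max_distance:
                rules.append(f"{candidate} => {word}")
    return rules
```

Each rule uses the same explicit-mapping form as the negation line above, so the resulting list can be kept in the same kind of synonym file.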

Dates, delays and other structured information

The simple query parser lets you add structured information within the free text to be queried. For example, the enriched text below allows:

It is also possible to add coded information within the text, so that it can be queried as structured data with full-text capabilities. This will be very useful for exploiting the results of NLP pipelines.

Dealing with multivalued fields (MV)

While MVs look tempting for storing multiple occurrences of the same concept, they are not a good choice for full-text queries. Indeed, the multiple values are not treated as independent values but as a whole, with some defined distance between values.

So how can multiple occurrences of the same concept be modeled in Solr? One alternative is to use multiple single-valued fields for textual data. For example, if the encounter has two physician notes, both with an adverse event section, the resulting encounter document will have two fields, "pn.adverse1" and "pn.adverse2". This implies that when the user asks for something present in this section, the resulting query must be rewritten accordingly.
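A minimal sketch of such a rewrite, assuming the user-facing query names the section once (for example pn.adverse) and that the maximum number of occurrences is known. The function name and the regular expression are illustrative only; they handle bare terms and quoted phrases with optional slop, not the full query grammar.

```python
import re


def expand_section(query, section, max_occurrences):
    """Rewrite `section:clause` into an OR over section1 .. sectionN."""
    pattern = re.compile(
        rf'(?<![\w.]){re.escape(section)}:("[^"]*"(?:~\d+)?|[^\s()]+)'
    )

    def replace(match):
        clause = match.group(1)
        alternatives = [
            f"{section}{n}:{clause}" for n in range(1, max_occurrences + 1)
        ]
        return "(" + " OR ".join(alternatives) + ")"

    return pattern.sub(replace, query)
```

For example, expand_section('pn.adverse:"skin rash"~2', "pn.adverse", 2) gives (pn.adverse1:"skin rash"~2 OR pn.adverse2:"skin rash"~2), so each note is searched on its own and a phrase can never match across two notes.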

This also makes it possible to mix full-text queries with structured queries based on the same field. There are some drawbacks, however. The replacement mechanism makes it mandatory to know in advance the maximum number of such fields per document. Another drawback arises when the user wants to search every section of the same document. One solution is to use copyField.

Conclusion

The final method offers:

Still some aspects are missing:

This suggests that a simple but powerful interface is within reach. Let's now see how to transform a medical relational database and populate Solr.
