Ticket #1014 (closed Enhancement: worksforme)

Opened 8 years ago

Last modified 8 years ago

searchTopics() does not find topic with value "word" when querying for "ord"

Reported by: Malte
Owned by:
Priority: Major
Milestone: Release 4.8.4
Component: DeepaMehta Standard Distribution
Version: 4.8.3
Keywords:
Cc: jri
Complexity: 3
Area:
Module: deepamehta-core

Description

My topic values are indexed with the modes Fulltext and Fulltext-Key.

However, when I search topics with the Lucene query phrase "gesamt", a topic containing the string "[...] Bezirksgesamt .." is not matched by my Lucene query fired via dm4.searchTopics().

If I modify the Lucene query phrase to "*gesamt", though, the topic containing the value "Bezirksgesamt" is returned.
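
The behaviour described above can be reproduced outside of DeepaMehta with a small sketch. It is not Lucene and not the dm4 API. It assumes a simplified analyzer (lowercase, split on non-word characters) and uses Python's glob-style fnmatchcase() as a stand-in for wildcard terms. The helper names tokenize() and matches() are made up for illustration.

```python
import re
from fnmatch import fnmatchcase

def tokenize(value):
    # Assumption: a simplified analyzer that lowercases the value and
    # splits it on non-word characters. Real analyzers may differ.
    return [t for t in re.split(r"\W+", value.lower()) if t]

def matches(value, query):
    # A query term is compared against whole tokens, not substrings.
    # "*" and "?" act as wildcards (glob-style stand-in).
    pattern = query.lower()
    return any(fnmatchcase(token, pattern) for token in tokenize(value))

value = "[...] Bezirksgesamt .."
print(matches(value, "gesamt"))    # plain term: compared to the whole token
print(matches(value, "*gesamt"))   # leading wildcard
print(matches(value, "Bezirks*"))  # trailing wildcard
```

Under these assumptions the plain term "gesamt" is compared against the complete token "bezirksgesamt" and does not match, while "*gesamt" does.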

What confuses me now is the Lucene query syntax documentation, especially the paragraph on wildcard searches:

Note: You cannot use a * or ? symbol as the first character of a search.

Well, it seems that with dm4.searchTopics() I can: no error is returned and the results don't get weird. It does the trick for me, but is it safe to use?

It would be nice if someone could explain this oddity or help me towards a practical solution.

Thanks for your support.

Change History

comment:1 Changed 8 years ago by jri

Yes, by default Lucene doesn't support leading wildcards in search terms, but they can be enabled.
http://wiki.apache.org/lucene-java/LuceneFAQ#What_wildcard_search_support_is_available_from_Lucene.3F

Apparently Neo4j's Lucene-index module enables leading wildcards, which is why they work in DM.

The reason Lucene doesn't enable leading wildcards by default is that such queries can incur a performance penalty, as the entire index must be scanned. (They are begin-of-word indexes, not end-of-word indexes, I guess.) However, the performance penalty might be negligible if a) your index is not too big, and/or b) you have enough RAM.
http://stackoverflow.com/questions/11766351/understanding-lucene-leading-wildcard-performance
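
To illustrate the point about begin-of-word indexes, here is a minimal sketch that models the index as a plain sorted list of terms. This is an assumption made for illustration, not how Lucene is actually implemented, and the function names are hypothetical. A trailing-wildcard query can jump to the first candidate by binary search and stop early, whereas a leading-wildcard query has to look at every term.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Trailing wildcard such as 'bezirks*': seek, then stop at the first miss."""
    start = bisect_left(sorted_terms, prefix)  # binary search for first candidate
    found, examined = [], 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        found.append(term)
    return found, examined

def suffix_matches(sorted_terms, suffix):
    """Leading wildcard such as '*gesamt': sort order does not help, scan all."""
    found = [term for term in sorted_terms if term.endswith(suffix)]
    return found, len(sorted_terms)

# Made-up sample terms, kept in sorted order like a term dictionary.
terms = sorted(["amt", "bezirk", "bezirksamt", "bezirksgesamt",
                "gesamt", "stadt", "zentrum"])

print(prefix_matches(terms, "bezirks"))
print(suffix_matches(terms, "gesamt"))
```

With only seven terms the difference is irrelevant, but the number of terms examined by the suffix scan grows with the size of the index, while the prefix lookup only touches the matching range plus one term.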

Consider running performance tests for "Bezirks*" vs. "*gesamt" on both your development machine and your production machine.
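
Such a comparison could be sketched roughly as follows. The search argument is a placeholder for whatever function actually issues the query in your setup (for example a small wrapper around a request to your DM instance). Nothing here is part of the DeepaMehta or Lucene API, and the helper names are hypothetical.

```python
import time

def time_query(search, query, repeats=5):
    """Call search(query) several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search(query)
        best = min(best, time.perf_counter() - start)
    return best

def compare(search, queries=("Bezirks*", "*gesamt"), repeats=5):
    """Time each query with the same search function and return {query: seconds}."""
    return {query: time_query(search, query, repeats) for query in queries}

if __name__ == "__main__":
    # Stand-in search function that does nothing; replace it with a real call.
    def dummy_search(query):
        return []

    for query, seconds in compare(dummy_search).items():
        print(f"{query}: {seconds:.6f} s")
```

Taking the best of several runs reduces noise from caching and other processes. Running the same script on the development and the production machine gives comparable numbers.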

comment:2 Changed 8 years ago by Malte

I will observe and report back on whether performance is an issue with my current setup. Thanks for the explanation and the web resources, very insightful. The wildcard character in a Lucene search phrase seems to have a similar meaning to the one in a regular expression.

Thanks for your help.

comment:3 Changed 8 years ago by Malte

  • Status changed from new to closed
  • Resolution set to worksforme

So, in practice, right now, this works very well for me. And sorry, I don't have time to dive into building performance tests for the moment.

BTW: Would you consider exposing some of the "Lucene query phrase" syntax to end users of DM 4 through UI elements (as outlined here: https://my.deepamehta.de/topicmap/54135/topic/76900)?

Anyway, thanks for the clarification and your support!
