Proposal: Substring Filtering #555
Closed
myronmarston started this conversation in Ideas
Replies: 1 comment
Context
Current ElasticGraph API
ElasticGraph currently offers two filter inputs for `String` fields: `TextFilterInput` is used for `String` fields defined with `f.mapping type: "text"`; `StringFilterInput` is used for the rest.

At Block, we have some internal ElasticGraph clients that need case-insensitive substring filtering. The full text search predicates (`matchesQuery` and `matchesPhrase`) can provide substring filtering in some cases, but they don't provide general-purpose substring filtering. For example, if you filter on "script" it'll match a text value like `"A bash script"` but it won't match a string value like `"subscription"`, at least with default field settings. Furthermore, we'd like to use this on fields that have not been indexed as `text`.

Therefore, we're interested in expanding the ElasticGraph filtering API to offer substring filtering.
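To illustrate the gap described above, here is a minimal Python sketch contrasting token-based matching with true substring matching. The `tokenize` function is only a rough stand-in for a full-text analyzer (an assumption made for illustration, not how Elasticsearch/OpenSearch analyzers are actually implemented), and all function names are hypothetical.

```python
import re


def tokenize(text: str) -> list:
    # Rough stand-in for a full-text analyzer (assumption): lowercase the
    # text and split it on anything that is not a letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def matches_term(text: str, term: str) -> bool:
    # Full-text style match: the term must equal a whole token.
    return term.lower() in tokenize(text)


def contains_substring(text: str, substring: str) -> bool:
    # Case-insensitive substring match: the term may appear anywhere.
    return substring.lower() in text.lower()


values = ["A bash script", "subscription"]
token_hits = [v for v in values if matches_term(v, "script")]
substring_hits = [v for v in values if contains_substring(v, "script")]
```

In this simplified model, `token_hits` contains only `"A bash script"`, while `substring_hits` contains both values, which is the behavior the internal clients are asking for.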
Elasticsearch/OpenSearch Offerings
Elasticsearch and OpenSearch have a few features that could be used here:
For our purposes, I believe the wildcard query will work best, for a few reasons:
- […] `StringFilterInput` on all `String` fields, so this isn't suitable for our general implementation. However, it can remain an option for a potential future optimization for specific fields.
- […] `"*script*"`.
- […] `".*script.*"`. However, Regexp is strictly more flexible/powerful than wildcard, and we consequently expect wildcard queries to perform equivalently or better.

While we plan to use a wildcard query to implement this, we do not plan to expose direct access to the wildcard query operator, instead offering a more limited substring filter that internally translates into a wildcard query. This is in keeping with ElasticGraph's current level of abstraction and retains some nice flexibility: since substring filtering can be implemented in multiple ways, we can change that implementation in the future if there is need! ElasticGraph is designed to expose a curated subset of Elasticsearch/OpenSearch functionality to clients while offering excellent performance. This approach aligns with that.
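As a rough sketch of the "limited substring filter that internally translates into a wildcard query" idea, the Python below builds a wildcard query body from a plain substring. The function names are hypothetical and this is not ElasticGraph's implementation. It assumes the datastore's wildcard query accepts `value` and `case_insensitive` parameters and treats `\` as the escape character, so that `*` and `?` supplied by a client are matched literally rather than acting as wildcard operators.

```python
def escape_wildcard(text: str) -> str:
    # Backslash-escape the characters that are special in a wildcard
    # pattern so that client input is always matched literally.
    return "".join("\\" + ch if ch in "\\*?" else ch for ch in text)


def substring_to_wildcard_query(field: str, substring: str,
                                ignore_case: bool = False) -> dict:
    # Surround the escaped substring with `*` so it may match anywhere.
    clause = {"value": "*" + escape_wildcard(substring) + "*"}
    if ignore_case:
        # Assumed parameter name; only emitted when requested.
        clause["case_insensitive"] = True
    return {"wildcard": {field: clause}}


query = substring_to_wildcard_query("name", "script", ignore_case=True)
```

Because clients only ever supply the substring, the wildcard syntax stays an implementation detail that could later be swapped for another mechanism.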
Prefix Filtering
Elasticsearch and OpenSearch also offer a dedicated prefix query. Prefix queries should be more performant than substring queries, and the Elasticsearch/OpenSearch Wildcard docs highlight the fact that `*` should be avoided at the beginning of a wildcard query.

Given that, I plan to offer a prefix filtering predicate alongside the substring filtering predicate, allowing clients who want the stricter semantics of a prefix search to opt in to a more performant query.
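For comparison with the substring case, a prefix predicate could map onto a prefix query roughly as sketched below. The helper name is hypothetical, and the `case_insensitive` parameter is assumed to be supported by the datastore version in use.

```python
def prefix_to_query(field: str, prefix: str, ignore_case: bool = False) -> dict:
    # A prefix query takes the prefix as a literal value, so no leading
    # `*` (and no wildcard escaping) is needed.
    clause = {"value": prefix}
    if ignore_case:
        # Assumed parameter name; only emitted when requested.
        clause["case_insensitive"] = True
    return {"prefix": {field: clause}}


prefix_query = prefix_to_query("name", "sub", ignore_case=True)
```

The resulting query never begins with a wildcard, which is what makes it the cheaper option for clients whose semantics allow it.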
Future Plans: Unify `StringFilterInput` and `TextFilterInput`

While `StringFilterInput` and `TextFilterInput` are currently distinct, in the future I'd like to unify them using the multi-fields feature. This feature would allow a single field to be indexed in multiple ways (as both `text` and `keyword`, for example), enabling a single filter input containing all of the string/text filtering predicates.

While the design of a multi-fields feature is out of scope for this proposal, it's worth bearing in mind because the API changes proposed here should ideally be compatible with this future direction.
API Options
Option 1: `contains: String` / `startsWith: String`

The simplest option is just to add `contains` and `startsWith` predicates:

However, it feels inconsistent that `equalToAnyOf` accepts a list of values while `contains` and `startsWith` only accept a single value. If a client wants to express a filter like `name contains "foo" OR name contains "bar"` then they'd have to wrap it with an `anyOf`:

That's a bit annoying to work with.
Option 2: `containsAnyOf: [String!]` / `startsWithAnyOf: [String!]`

With this option, `StringFilterInput` would be updated like so:

This is simpler to use for the case discussed above:
However, it doesn't provide a clear place to specify whether it should be case sensitive or not. There are a couple ways we can offer that functionality.
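To make the ergonomic difference between Option 1 and Option 2 concrete, the sketch below models the two filter shapes as Python dictionaries and checks that they select the same records. The exact nesting of `anyOf` and the tiny `matches` evaluator are assumptions for illustration only; they are not the ElasticGraph schema or query engine.

```python
def matches(record: dict, filter_: dict) -> bool:
    """Evaluate a tiny, hypothetical subset of the filter shapes above."""
    for key, value in filter_.items():
        if key == "anyOf":
            # OR across the listed sub-filters.
            if not any(matches(record, sub) for sub in value):
                return False
            continue
        # Otherwise `key` is a field name and `value` maps predicates to arguments.
        field_value = record[key]
        for predicate, arg in value.items():
            if predicate == "contains":
                ok = arg in field_value
            elif predicate == "containsAnyOf":
                ok = any(s in field_value for s in arg)
            else:
                raise ValueError(f"unsupported predicate: {predicate}")
            if not ok:
                return False
    return True


# Option 1: single-value `contains`, wrapped in `anyOf` to express OR.
option_1 = {"anyOf": [{"name": {"contains": "foo"}}, {"name": {"contains": "bar"}}]}
# Option 2: list-valued `containsAnyOf` expresses the same OR directly.
option_2 = {"name": {"containsAnyOf": ["foo", "bar"]}}

records = [{"name": "foobar"}, {"name": "bar none"}, {"name": "baz"}]
hits_1 = [r["name"] for r in records if matches(r, option_1)]
hits_2 = [r["name"] for r in records if matches(r, option_2)]
```

Both filters select `"foobar"` and `"bar none"`; the second simply needs less nesting to say so.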
Option 2a: `ignoreCase` argument

For this option, we'd add an `ignoreCase` argument alongside the other predicates:

By default, a query would be case-sensitive, but a client could choose to ignore case:

However, this API allows `ignoreCase: true` to be specified along with any of the `StringFilterInput` subfields. It's not clear what these expressions would mean:

In addition, `equalToAnyOf` is always case sensitive. And `matchesQuery`/`matchesPhrase` are case-insensitive, which could matter when we unify `StringFilterInput` and `TextFilterInput`. As far as I know, there's no way to make `matchesQuery`/`matchesPhrase` case-sensitive.

We don't want to offer an input argument that is easy to misuse/misunderstand, so this doesn't seem like an ideal option.
Option 2b: separate `caseSensitivelyContainsAnyOf` / `caseInsensitivelyContainsAnyOf` predicates

Here's how this option would look:

This avoids the issues presented by `ignoreCase` with the above option. However, `caseInsensitivelyContainsAnyOf` is quite the mouthful. This doesn't feel like a great API to me.

Option 3: `contains: StringContainsFilterInput` / `startsWith: StringStartsWithFilterInput`

Here's the schema for this option:

And here are some examples of how it would be used:
This allows `ignoreCase` to be specified just under `contains: {...}` and `startsWith: {...}`, scoping it appropriately. In addition, it opens the door to offering both `contains: {allSubstringsOf: ...}` and `contains: {anySubstringOf: ...}`, which is a nice bit of added flexibility.

Note: We don't offer an `all` field under `startsWith` because I don't believe it's useful. A string can easily contain multiple substrings, but it can only start with one prefix.

Proposal
I propose we implement Option 3: `contains: {anySubstringOf: [...], allSubstringsOf: [...]}` and `startsWith: {anyPrefixOf: [...]}`. This option provides maximal flexibility and (in my opinion, at least) reads nicely.

`contains` will translate to a `wildcard` query, and `startsWith` will translate to a `prefix` query. In the future, when we offer a first-class multi-fields feature, these predicates can be enhanced to use a `wildcard` field if the field has a `wildcard` variant in the index mapping.

Consequences

- […] `equalToAnyOf`, `matchesQuery`, or `matchesPhrase`. However, a number of optimization avenues are open to us, including the use of `wildcard` fields.
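As a rough illustration of the translation described in the proposal, the Python sketch below turns a `contains` or `startsWith` filter value into a query body. The function names are hypothetical and this is not ElasticGraph's implementation. It assumes that `ignoreCase` sits inside the `contains` / `startsWith` object, that "any" semantics map to a `bool` query with `should` clauses, that "all" semantics map to `filter` clauses, and that the datastore supports the `case_insensitive` parameter.

```python
def escape_wildcard(text: str) -> str:
    # Escape `\`, `*`, and `?` so client input is matched literally.
    return "".join("\\" + ch if ch in "\\*?" else ch for ch in text)


def contains_to_query(field: str, contains: dict) -> dict:
    ignore_case = contains.get("ignoreCase", False)

    def wildcard(substring: str) -> dict:
        clause = {"value": "*" + escape_wildcard(substring) + "*"}
        if ignore_case:
            clause["case_insensitive"] = True
        return {"wildcard": {field: clause}}

    filters = []
    # Every substring in `allSubstringsOf` must match (AND).
    filters.extend(wildcard(s) for s in contains.get("allSubstringsOf", []))
    # At least one substring in `anySubstringOf` must match (OR).
    if "anySubstringOf" in contains:
        any_clauses = [wildcard(s) for s in contains["anySubstringOf"]]
        filters.append({"bool": {"minimum_should_match": 1, "should": any_clauses}})
    return {"bool": {"filter": filters}}


def starts_with_to_query(field: str, starts_with: dict) -> dict:
    ignore_case = starts_with.get("ignoreCase", False)

    def prefix(value: str) -> dict:
        clause = {"value": value}
        if ignore_case:
            clause["case_insensitive"] = True
        return {"prefix": {field: clause}}

    # At least one prefix in `anyPrefixOf` must match (OR).
    any_clauses = [prefix(p) for p in starts_with.get("anyPrefixOf", [])]
    return {"bool": {"minimum_should_match": 1, "should": any_clauses}}


contains_query = contains_to_query(
    "name", {"anySubstringOf": ["foo", "bar"], "ignoreCase": True}
)
starts_with_query = starts_with_to_query("name", {"anyPrefixOf": ["sub"]})
```

Keeping the translation behind two small functions like these is one way the underlying query could later be redirected to a `wildcard` field without changing what clients send.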