[fix](search) Use FE-provided analyzer key for multi-index columns in search() #60798
… multi-index columns
When a column has multiple inverted indexes with different analyzers (e.g., one default untokenized and one with English parser), search() in Lucene/scalar mode returned empty results because BE always passed an empty analyzer key to select_best_reader(), causing it to pick the wrong (untokenized) index for tokenized queries.
The fix:
1. Extract analyzer_key from FE-provided index_properties before calling select_best_reader() and pass it through.
2. Remove the is_variant_sub restriction on the EQUAL_QUERY to MATCH_ANY_QUERY upgrade, so regular columns with multiple indexes also get the correct FULLTEXT reader.
Fixes DORIS-24542
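The selection behavior described in this commit can be illustrated with a minimal sketch. Everything below is a simplified stand-in, not the actual Doris BE code: the `IndexReader` struct, the `"analyzer"` property name, and the fallback rule are assumptions made for illustration. Only the names `select_best_reader`, `analyzer_key`, and `index_properties` come from the commit message.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the BE types; not the real Doris definitions.
enum class ReaderType { STRING_TYPE, FULLTEXT };

struct IndexReader {
    ReaderType type;
    std::string analyzer_key;  // "" models the default untokenized index
};

// Assumption: the analyzer key is stored under a property named "analyzer".
std::string extract_analyzer_key(const std::map<std::string, std::string>& index_properties) {
    auto it = index_properties.find("analyzer");
    return it == index_properties.end() ? std::string() : it->second;
}

// Simplified selection: a single reader is returned directly; otherwise the
// reader whose analyzer key matches wins, with the first reader as fallback.
const IndexReader* select_best_reader(const std::vector<IndexReader>& readers,
                                      const std::string& analyzer_key) {
    if (readers.empty()) return nullptr;
    if (readers.size() == 1) return &readers.front();
    for (const auto& reader : readers) {
        if (reader.analyzer_key == analyzer_key) return &reader;
    }
    return &readers.front();
}

// Usage: one column carrying an untokenized and an English-analyzed index.
inline void example_usage() {
    std::vector<IndexReader> readers = {{ReaderType::STRING_TYPE, ""},
                                        {ReaderType::FULLTEXT, "english"}};
    // An empty key (the old behavior) lands on the untokenized reader.
    assert(select_best_reader(readers, "")->type == ReaderType::STRING_TYPE);
    // Passing the FE-provided key through selects the tokenized reader.
    std::map<std::string, std::string> props = {{"analyzer", "english"}};
    assert(select_best_reader(readers, extract_analyzer_key(props))->type ==
           ReaderType::FULLTEXT);
}
```

In this model the bug corresponds to always calling `select_best_reader(readers, "")`, which can never reach the English-analyzed reader when an untokenized index is also present.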
…x reader selection
The previous fix unconditionally upgraded all EQUAL_QUERY to MATCH_ANY_QUERY in resolve(), which broke EXACT queries (they also map to EQUAL_QUERY but need the untokenized STRING_TYPE reader).
Move the fix to build_leaf_query() where clause_type is known:
- TERM → override to MATCH_ANY_QUERY (selects FULLTEXT/tokenized reader)
- EXACT → keep EQUAL_QUERY (selects STRING_TYPE/untokenized reader)
For variant subcolumns, resolve() still uses FE-provided analyzer_key. For regular columns with multiple indexes, query_type alone drives the reader type preference in select_best_reader's select_for_text().
Fixes: DORIS-24542
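The clause-type dispatch described above might look roughly like the following sketch. The enums and the two helper functions are hypothetical simplifications; the real `build_leaf_query()` and `select_for_text()` signatures are not shown in this PR page, so only the TERM/EXACT mapping itself is taken from the commit message.

```cpp
#include <cassert>

// Hypothetical simplified enums; the real BE definitions have more members.
enum class ClauseType { TERM, EXACT };
enum class QueryType { EQUAL_QUERY, MATCH_ANY_QUERY };
enum class ReaderType { STRING_TYPE, FULLTEXT };

// Both TERM and EXACT arrive as EQUAL_QUERY; only TERM is overridden so that
// it prefers the tokenized reader. EXACT keeps EQUAL_QUERY.
QueryType effective_query_type(ClauseType clause, QueryType resolved) {
    if (clause == ClauseType::TERM && resolved == QueryType::EQUAL_QUERY) {
        return QueryType::MATCH_ANY_QUERY;
    }
    return resolved;
}

// Assumed reader preference for a multi-index column, driven by query type.
ReaderType preferred_reader(QueryType query_type) {
    return query_type == QueryType::MATCH_ANY_QUERY ? ReaderType::FULLTEXT
                                                    : ReaderType::STRING_TYPE;
}

// Usage: the same resolved query type leads to different readers per clause.
inline void example_usage() {
    QueryType term = effective_query_type(ClauseType::TERM, QueryType::EQUAL_QUERY);
    QueryType exact = effective_query_type(ClauseType::EXACT, QueryType::EQUAL_QUERY);
    assert(preferred_reader(term) == ReaderType::FULLTEXT);
    assert(preferred_reader(exact) == ReaderType::STRING_TYPE);
}
```

Deciding at the leaf, where the clause type is still known, is what lets EXACT queries keep the untokenized reader while TERM queries move to the tokenized one.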
…multi-index reader selection
When a column has both tokenized and untokenized indexes, WILDCARD/PREFIX/REGEXP
queries selected the untokenized reader, causing patterns like "h*llo" to match
against full strings ("hello world") instead of individual tokens ("hello").
Extend the MATCH_ANY_QUERY override (already applied to TERM) to also cover
WILDCARD, PREFIX, and REGEXP clause types. Safe for single-index columns due to
select_best_reader()'s single-reader fast path.
Add untokenized-only index regression tests to verify no behavior change.
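To see why the reader choice matters for pattern queries, here is a small self-contained sketch that matches a wildcard pattern against a whole stored string versus against individual tokens. It is an illustration only: the whitespace split stands in for a real analyzer (which would also lowercase, stem, and so on), and `wildcard_match` and `matches_any_token` are hypothetical helpers, not Doris functions.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Whole-string glob match: '*' matches any run of characters, '?' matches one.
bool wildcard_match(const std::string& pattern, const std::string& text) {
    const std::size_t npos = std::string::npos;
    std::size_t p = 0, t = 0, star = npos, mark = 0;
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;  // remember the star and try matching zero characters
            mark = t;
        } else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (star != npos) {
            p = star + 1;  // backtrack: let the last star absorb one more character
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

// Models a tokenized index: the pattern only has to match one token.
// Assumption: whitespace splitting approximates the analyzer's tokenization.
bool matches_any_token(const std::string& pattern, const std::string& text) {
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) {
        if (wildcard_match(pattern, token)) return true;
    }
    return false;
}

// Usage: the example from the commit message.
inline void example_usage() {
    // Untokenized reader: the pattern is compared with the full string.
    assert(!wildcard_match("h*llo", "hello world"));
    // Tokenized reader: the pattern is compared with each token.
    assert(matches_any_token("h*llo", "hello world"));
}
```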
… search() (apache#60798)

### What problem does this PR solve?

Issue Number: close #DORIS-24542

Problem Summary:

When a column has multiple inverted indexes with different analyzers (e.g., one default untokenized index and one with English parser), `search()` in Lucene/scalar mode returns empty results.

**Root cause:** In `FieldReaderResolver::resolve()`, `select_best_reader()` was always called with an empty analyzer key `""`, causing it to pick the wrong (untokenized) index for tokenized queries. Additionally, the EQUAL_QUERY → MATCH_ANY_QUERY upgrade was restricted to variant subcolumns only.

**Fix:**

1. Extract `analyzer_key` from FE-provided `index_properties` before calling `select_best_reader()` and pass it through
2. Remove the `is_variant_sub` restriction on the query type upgrade so regular columns with multiple indexes also get the correct FULLTEXT reader
…o branch-4.0

Squashed backport of the following master PRs:

- apache#59747 [fix](search) Make AND/OR/NOT operators case-sensitive in search DSL
- apache#60654 [refactor](search) Refactor SearchDslParser to single-phase ANTLR parsing and fix ES compatibility issues
- apache#60782 [fix](search) Upgrade query type for variant subcolumns with analyzer-based indexes
- apache#60784 [fix](search) Fix MATCH_ALL_DOCS query failing in multi-field search mode
- apache#60786 [feat](search) Support field-grouped query syntax field:(term1 OR term2)
- apache#60790 [fix](search) Add searcher cache reuse and DSL result cache for search() function
- apache#60793 [fix](search) Fix wildcard query on variant subcolumns returning empty results
- apache#60798 [fix](search) Use FE-provided analyzer key for multi-index columns in search()
- apache#60814 [fix](search) Fix implicit conjunction incorrectly modifying preceding term in lucene mode
- apache#60834 [test](search) Add regression test for wildcard query on variant subcolumns with multi-index
- apache#60873 [fix](search) fix MATCH_ALL_DOCS losing occur attribute in multi-field expansion
- apache#60891 [fix](search) inject MATCH_ALL_DOCS for multi-MUST_NOT queries in lucene mode
… bug fixes (#61028)

### What problem does this PR solve?

Squashed backport of all search() function improvements and bug fixes from master to branch-4.0.

This PR combines the following master PRs into a single backport:

| Master PR | Type | Description |
|-----------|------|-------------|
| #59747 | fix | Make AND/OR/NOT operators case-sensitive in search DSL |
| #60654 | refactor | Refactor SearchDslParser to single-phase ANTLR parsing and fix ES compatibility issues |
| #60782 | fix | Upgrade query type for variant subcolumns with analyzer-based indexes |
| #60784 | fix | Fix MATCH_ALL_DOCS query failing in multi-field search mode |
| #60786 | feat | Support field-grouped query syntax field:(term1 OR term2) |
| #60790 | fix | Add searcher cache reuse and DSL result cache for search() function |
| #60793 | fix | Fix wildcard query on variant subcolumns returning empty results |
| #60798 | fix | Use FE-provided analyzer key for multi-index columns in search() |
| #60814 | fix | Fix implicit conjunction incorrectly modifying preceding term in lucene mode |
| #60834 | test | Add regression test for wildcard query on variant subcolumns with multi-index |
| #60873 | fix | fix MATCH_ALL_DOCS losing occur attribute in multi-field expansion |
| #60891 | fix | inject MATCH_ALL_DOCS for multi-MUST_NOT queries in lucene mode |

### Release note

Backport search() function improvements including DSL parser refactoring, multi-field search fixes, variant subcolumn support, query caching, and field-grouped query syntax.

### Check List (For Author)

- Test
    - [x] Regression test
    - [x] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [x] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [ ] No.
    - [x] Yes. New search() function features and bug fixes backported from master.
- Does this need documentation?
    - [x] No.
    - [ ] Yes.

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
### What problem does this PR solve?

Issue Number: close #DORIS-24542

Problem Summary:

When a column has multiple inverted indexes with different analyzers (e.g., one default untokenized index and one with English parser), `search()` in Lucene/scalar mode returns empty results.

**Root cause:** In `FieldReaderResolver::resolve()`, `select_best_reader()` was always called with an empty analyzer key `""`, causing it to pick the wrong (untokenized) index for tokenized queries. Additionally, the EQUAL_QUERY → MATCH_ANY_QUERY upgrade was restricted to variant subcolumns only.

**Fix:**

1. Extract `analyzer_key` from FE-provided `index_properties` before calling `select_best_reader()` and pass it through
2. Remove the `is_variant_sub` restriction on the query type upgrade so regular columns with multiple indexes also get the correct FULLTEXT reader

### Release note

Fix search() returning empty results when a column has multiple inverted indexes with different analyzers.