These are often punctuation, prepositions, or articles, which add
more noise than useful matches.
Bug: T358862
Change-Id: I364ef629049471410cc1b3dd9b5df5de1515097b
Updated tests to reflect external changes.
NB: These tests and the code should be rewritten to use mocked data, so
that they verify only the algorithm and do not need to be updated when
the real data changes.
Change-Id: I537df34405eea23569621ad0c5a31dc9d336c1b0
I noticed some language names are not searchable. I made it so
that autonyms from language-data are added to the search index.
Without this, languages not present in Names.php or in the CLDR
extension are not searchable via the API except by language code.
Change-Id: I51a9e2eb15fb40963e6edbf1db76133d84de7291
* Store prefixes and infixes separately in the data
* First match language code, then prefixes, then infixes
* Try to use suggestion either in user language or autonym first
* Use formatversion=2 to avoid escaping Unicode
Using Language::fetchLanguageName may have a small
performance impact. On the other hand, there is now a check
to skip languages we already found, avoiding some fuzzy
matching.
This is in preparation for a change in jquery.uls to use
the search API more, while trying to reduce the amount of
weird autocompletion suggestions we show to the user.
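A rough sketch of the lookup order described above, written in Python for
brevity rather than in the extension's PHP. The data layout and all names
(search, LANGUAGE_CODES, PREFIXES, INFIXES) are hypothetical, and the fuzzy
matching step is left out:

```python
# Hypothetical index data. Prefixes are full names; infixes are later
# words of multi-word names. The two are stored separately.
LANGUAGE_CODES = {"en", "ang", "fi", "fr"}
PREFIXES = {"english": "en", "old english": "ang", "finnish": "fi", "french": "fr"}
INFIXES = {"english": "ang"}


def search(term, limit=10):
    """Return language codes in priority order: code, prefixes, infixes."""
    term = term.lower()
    results = []

    # 1. An exact language code match comes first.
    if term in LANGUAGE_CODES:
        results.append(term)

    # 2. Then names that start with the search term,
    # 3. then names where a later word starts with it.
    for index in (PREFIXES, INFIXES):
        for name, code in index.items():
            if len(results) >= limit:
                return results
            # Skip languages that an earlier step already found.
            if code not in results and name.startswith(term):
                results.append(code)
    return results
```

With this toy data, searching for "en" lists the exact code match before
the infix match for "Old English", and no language appears twice.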
Bug: T73891
Change-Id: Id94c5352d9a591969bf90144d1d2d5e758d08301
This adds several custom languages.
The addition of Punjabi addresses Bug T178070.
The addition of Chinese addresses Bug T73891.
Georgian and Catalan (Valencian) variant spellings
are added because these are the most frequent languages
that are not found in the ULS search box.
Bug: T73891
Bug: T178070
Change-Id: Ifbb08b560e454643d246379c19f725bde61917e9
To keep the average and maximum bucket size low, I made the buckets for
code points < 4000 more granular and those for code points >= 4000 less
granular. This could certainly be tweaked further to reach more evenly
sized buckets.
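A minimal sketch of this kind of bucketing, in Python for brevity. It
assumes the bucket is chosen from the first character of the name, that
4000 is a decimal code point, and that the coarse ranges are 1000 code
points wide. The function name and the width are illustrative, not taken
from the actual code:

```python
def bucket_key(name, threshold=4000, coarse_width=1000):
    """Map a non-empty name to a bucket key based on its first code point."""
    codepoint = ord(name[0])
    if codepoint < threshold:
        # Low code points: one bucket per code point (more granular).
        return codepoint
    # High code points: merge each range of `coarse_width` code points
    # into a single bucket (less granular).
    return codepoint - codepoint % coarse_width
```

Under these assumptions, Latin names are split by initial letter, while
names starting with nearby CJK characters share a bucket.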
Bucket stats before:
- 773 buckets
- smallest has 1 entry
- largest has 1804 entries
- median size is 66 entries
- average size is 45.394566623545 entries
Bucket stats after:
- 698 buckets
- smallest has 1 entry
- largest has 1792 entries
- median size is 16 entries
- average size is 50.272206303725 entries
Change-Id: Id62d93658117564b05294c2fe36ca7c182784859
Serialized format is no longer in style for data. PHP files can
take advantage of AutoLoader and caching so they can even be faster
than serialized files. As a side bonus, we get readable diffs
for updates.
The only downside is that the file generation takes about ten lines of
ugly string manipulation.
Change-Id: If09704d1172daa13c72a308814534cac1fe9899f