LanguageNameSearch: do not mix different scripts in same buckets

To keep the average and maximum bucket size low, I made code points
below 4000 more granular (one bucket per code point) and code points
at or above 4000 less granular (one bucket per 1000 code points). This
could certainly be tweaked further to get more evenly sized buckets.

Bucket stats before:
 - 773 buckets
 - smallest has 1 entry
 - largest has 1804 entries
 - median size is 66 entries
 - average size is 45.394566623545 entries

Bucket stats after:
 - 698 buckets
 - smallest has 1 entry
 - largest has 1792 entries
 - median size is 16 entries
 - average size is 50.272206303725 entries

Change-Id: Id62d93658117564b05294c2fe36ca7c182784859
Niklas Laxström
2016-06-15 11:19:01 +02:00
parent f73f9a8b5d
commit 55b68c329d
2 changed files with 13691 additions and 13829 deletions


@@ -46,7 +46,15 @@ class LanguageNameSearch {
 	}
 	public static function getIndex( $name ) {
-		return self::getCodepoint( $name ) % 1000;
+		$codepoint = self::getCodepoint( $name );
+		if ( $codepoint < 4000 ) {
+			// For latin etc. we need smaller buckets for speed
+			return $codepoint;
+		} else {
+			// Try to group names of same script together
+			return $codepoint - ( $codepoint % 1000 );
+		}
 	}
 	/**

File diff suppressed because it is too large
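
The effect of the new bucketing can be illustrated with a small standalone sketch. It is not part of the commit: `sketchCodepoint()`, `oldIndex()` and `newIndex()` are hypothetical names, and `sketchCodepoint()` only assumes that `getCodepoint()` returns the Unicode code point of the first character of a non-empty name (the real method may work differently). It requires the mbstring extension.

```php
<?php
// Hypothetical stand-in for LanguageNameSearch::getCodepoint(): the Unicode
// code point of the first character of a non-empty UTF-8 string.
function sketchCodepoint( string $name ): int {
	return mb_ord( mb_substr( $name, 0, 1, 'UTF-8' ), 'UTF-8' );
}

// Bucket index as computed before this commit.
function oldIndex( string $name ): int {
	return sketchCodepoint( $name ) % 1000;
}

// Bucket index as computed after this commit.
function newIndex( string $name ): int {
	$codepoint = sketchCodepoint( $name );
	if ( $codepoint < 4000 ) {
		// One bucket per code point
		return $codepoint;
	}
	// One bucket per block of 1000 code points
	return $codepoint - ( $codepoint % 1000 );
}

foreach ( [ 'g', 'я', '中' ] as $char ) {
	printf( "%s: old=%d new=%d\n", $char, oldIndex( $char ), newIndex( $char ) );
}
// g: old=103 new=103
// я: old=103 new=1103
// 中: old=13 new=20000
```

With the old modulo, Cyrillic "я" (code point 1103) shared bucket 103 with Latin "g" (code point 103), and "中" (code point 20013) fell into bucket 13 alongside low code points. With the new scheme each of the three lands in a bucket of its own range.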