LanguageNameSearch: do not mix different scripts in same buckets
To keep the average and maximum bucket size low, I made code points < 4000 more granular (one bucket per code point) and code points >= 4000 less granular (one bucket per block of 1000 code points). This could certainly be tweaked further to get more evenly sized buckets.

Bucket stats before:
- 773 buckets
- smallest has 1 entry
- largest has 1804 entries
- median size is 66 entries
- average size is 45.394566623545 entries

Bucket stats after:
- 698 buckets
- smallest has 1 entry
- largest has 1792 entries
- median size is 16 entries
- average size is 50.272206303725 entries

Change-Id: Id62d93658117564b05294c2fe36ca7c182784859
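The effect of the new bucketing can be illustrated with a small standalone sketch. It is not the class itself: `firstCodepoint()`, `oldIndex()` and `newIndex()` are hypothetical names, and `firstCodepoint()` only assumes that `getCodepoint()` returns the code point of the first character of the name (its body is not part of this diff). It needs the mbstring extension.

```php
<?php
// Hypothetical stand-in for LanguageNameSearch::getCodepoint().
// Assumption: the real method returns the code point of the first character.
function firstCodepoint( string $name ): int {
	return mb_ord( mb_substr( $name, 0, 1, 'UTF-8' ), 'UTF-8' );
}

// Old behaviour: only the last three decimal digits of the code point matter,
// so unrelated scripts can end up in the same bucket.
function oldIndex( string $name ): int {
	return firstCodepoint( $name ) % 1000;
}

// New behaviour, mirroring the changed getIndex().
function newIndex( string $name ): int {
	$codepoint = firstCodepoint( $name );
	if ( $codepoint < 4000 ) {
		// One bucket per code point
		return $codepoint;
	}
	// One bucket per block of 1000 code points
	return $codepoint - ( $codepoint % 1000 );
}

// Latin "a" (U+0061 = 97) and Cyrillic "щ" (U+0449 = 1097)
echo oldIndex( "a" ), " ", oldIndex( "\u{0449}" ), "\n"; // 97 97
echo newIndex( "a" ), " ", newIndex( "\u{0449}" ), "\n"; // 97 1097

// CJK "日" (U+65E5 = 26085) and "本" (U+672C = 26412)
echo oldIndex( "\u{65E5}" ), " ", oldIndex( "\u{672C}" ), "\n"; // 85 412
echo newIndex( "\u{65E5}" ), " ", newIndex( "\u{672C}" ), "\n"; // 26000 26000
```

Under the old scheme the Latin and Cyrillic names share bucket 97, while the two CJK names are split across buckets; under the new scheme it is the other way round.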
@@ -46,7 +46,15 @@ class LanguageNameSearch {
 	}
 
 	public static function getIndex( $name ) {
-		return self::getCodepoint( $name ) % 1000;
+		$codepoint = self::getCodepoint( $name );
+
+		if ( $codepoint < 4000 ) {
+			// For latin etc. we need smaller buckets for speed
+			return $codepoint;
+		} else {
+			// Try to group names of same script together
+			return $codepoint - ( $codepoint % 1000 );
+		}
 	}
 
 	/**
File diff suppressed because it is too large