Building Blocks for Accessing Multilingual Data: CLDR
Steven R. Loomis, IBM GFTT
Access available handouts at ala.15.ala.org/sessions/handouts.
About Me
• Senior Software Engineer, IBM Global Foundations Technology Team
• IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode
• Member of CLDR-TC, lead of ULI-TC
Agenda
• About CLDR
• Focus Areas:
• Language Identification
• Transliteration
• Searching and Sorting
• Keyboards/Entry
• Q&A
What is CLDR?
• Common Locale Data Repository
• Language and region-specific data
• Covers hundreds of language/region pairs
• Open data (like Unicode itself), XML/JSON format
• Community input, carefully curated
Who is CLDR?
• CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium
• Active participation by industry, academia, open source projects, national standards bodies, and individuals
Who uses CLDR?
• Apple, Google, IBM, Microsoft…
• Wikimedia Foundation, jQuery, …
• Java, Node.js, PHP, …
• Many users via the ICU C/C++/Java libraries
Locale Data
• Data required for respecting the linguistic, cultural, and geopolitical requirements of specific users
• Example: "What day is it?"
XML / JSON
• XML: “es-US”
• <month type="6">Junio</month>
• JSON: “es-US”
• { … "6": "Junio", … }
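The two fragments above carry the same fact in two encodings. A minimal sketch of reading that one entry with the Python standard library follows. It assumes fragments cut down to exactly what the slide shows (real CLDR files nest these entries inside more structure), and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional

# Reduced to the single entry shown on the slide (an assumption of this
# sketch); the surrounding CLDR structure is deliberately left out.
XML_FRAGMENT = '<month type="6">Junio</month>'
JSON_FRAGMENT = '{"6": "Junio"}'


def month_from_xml(fragment: str, number: int) -> Optional[str]:
    """Return the text of the <month> element whose type matches number."""
    root = ET.fromstring(fragment)
    # iter() also visits the root element itself, so a bare <month> works.
    for month in root.iter("month"):
        if month.get("type") == str(number):
            return month.text
    return None


def month_from_json(fragment: str, number: int) -> Optional[str]:
    """Return the value stored under the month number, if present."""
    return json.loads(fragment).get(str(number))
```

Both functions take the month number as an integer and return `None` when the fragment has no entry for it.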
CLDR Coverage
• Coverage vs. number of languages
CLDR site and SurveyTool (DEMO)
• DEMO:
• http://unicode.org/cldr
• http://st.unicode.org/cldr-apps
Locale Identifiers — BCP47
• Example: sr-Latn-RS
• sr : ISO 639 "Serbian"
• Latn : ISO 15924 "Latin Script" (vs Cyrillic)
• RS : ISO 3166 / UN M.49 "Serbia"
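The breakdown of sr-Latn-RS can be sketched as a small parser. This is a simplified illustration, not a full BCP 47 implementation: it assumes only the language, optional script, and optional region subtags, and rejects variants, extensions, and private-use subtags. The `LocaleId` type and `parse_locale_id` function are hypothetical names.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_locale_id(tag: str) -> LocaleId:
    """Split a simple language[-Script][-REGION] tag into its subtags."""
    subtags = tag.split("-")
    language = subtags[0]
    if not (language.isascii() and language.isalpha() and 2 <= len(language) <= 3):
        raise ValueError(f"unsupported language subtag: {language!r}")
    script = None
    region = None
    for subtag in subtags[1:]:
        is_ascii_alpha = subtag.isascii() and subtag.isalpha()
        # Script: four letters, and it must come before any region.
        if len(subtag) == 4 and is_ascii_alpha and script is None and region is None:
            script = subtag.title()
        # Region: two letters (ISO 3166) or three digits (UN M.49).
        elif region is None and (
            (len(subtag) == 2 and is_ascii_alpha)
            or (len(subtag) == 3 and subtag.isascii() and subtag.isdigit())
        ):
            region = subtag.upper()
        else:
            raise ValueError(f"unsupported subtag: {subtag!r}")
    return LocaleId(language.lower(), script, region)
```

Calling `parse_locale_id("sr-Latn-RS")` yields the three parts named on the slide; a tag without a script, such as es-US, leaves `script` as `None`.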
Language/Territory/Script info
Facts:
• “The Cyrillic Script can be used to write
Mongolian, Russian, Serbian…”
• “Italian is spoken in Italy, San Marino,
Switzerland…”
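Facts like these are naturally stored as lookup tables that can be read in either direction. The sketch below holds only the entries named on the slide (the elided remainder is not filled in) and builds the reverse index. The table names and the `invert` helper are hypothetical and do not reflect how CLDR itself lays out this data.

```python
from typing import Dict, List

# Only the entries quoted on the slide; the real data lists many more.
LANGUAGES_BY_SCRIPT: Dict[str, List[str]] = {
    "Cyrl": ["mn", "ru", "sr"],  # Mongolian, Russian, Serbian
}
TERRITORIES_BY_LANGUAGE: Dict[str, List[str]] = {
    "it": ["IT", "SM", "CH"],  # Italy, San Marino, Switzerland
}


def invert(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn key -> values into value -> keys."""
    inverted: Dict[str, List[str]] = {}
    for key, values in mapping.items():
        for value in values:
            inverted.setdefault(value, []).append(key)
    return inverted


SCRIPTS_BY_LANGUAGE = invert(LANGUAGES_BY_SCRIPT)
LANGUAGES_BY_TERRITORY = invert(TERRITORIES_BY_LANGUAGE)
```

With the reverse index, "which scripts can write Serbian?" and "which languages are spoken in Switzerland?" become single dictionary lookups, limited here to what the sample tables contain.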
Language Identification: Exemplars
English (Latin): a b c d e f g h i j k l m n o p q r s t u v w x y z
Serbian (Latin): a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž
Serbian (Cyrillic): а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш
Russian (Cyrillic): а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
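One way exemplar sets can narrow down a language: keep only the candidates whose exemplars cover every letter in the text. The sketch below uses the four sets from this slide. The set-containment test and the `candidate_languages` name are assumptions for illustration; a production identifier would weigh far more evidence than this.

```python
import unicodedata
from typing import List

EXEMPLARS = {
    "English (Latin)": "a b c d e f g h i j k l m n o p q r s t u v w x y z",
    "Serbian (Latin)": "a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž",
    "Serbian (Cyrillic)": "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш",
    "Russian (Cyrillic)": "а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я",
}

# Flatten each list to single characters. Digraphs such as "lj" are built
# from letters already in the set, so they add nothing in this simple model.
EXEMPLAR_CHARS = {
    name: set(unicodedata.normalize("NFC", chars).replace(" ", ""))
    for name, chars in EXEMPLARS.items()
}


def candidate_languages(text: str) -> List[str]:
    """Return the languages whose exemplar set covers every letter in text."""
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = {ch for ch in normalized if ch.isalpha()}
    if not letters:
        return []
    return [name for name, chars in EXEMPLAR_CHARS.items() if letters <= chars]
```

A word containing щ can only be Russian among these four, and one containing đ only Serbian (Latin). Text that uses letters shared by both Cyrillic sets remains ambiguous, and both candidates are returned.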
Transliteration
• Existing data for rule sets.
• ALA-LC format could be included.
• Rule-based engine.
Transliteration Rule Example: Greek
• <tRule>Σ ↔ S ;</tRule>
• <tRule>τ ↔ t ;</tRule>
• <tRule>Τ ↔ T ;</tRule>
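The ↔ in each rule means the mapping applies in both directions. A minimal sketch of that idea, using only the three rules shown above, builds a forward and a reverse table from one rule list. It is a character-for-character substitution written for illustration and is not the ICU rule engine; real rule sets also include context-sensitive and multi-character rules, which this does not handle.

```python
from typing import Dict

# The three rules from the slide, as (Greek, Latin) pairs.
RULES = [
    ("\u03a3", "S"),  # Σ ↔ S (Greek capital sigma)
    ("\u03c4", "t"),  # τ ↔ t (Greek small tau)
    ("\u03a4", "T"),  # Τ ↔ T (Greek capital tau)
]

# One rule list yields both directions because every rule is bidirectional.
GREEK_TO_LATIN: Dict[str, str] = {greek: latin for greek, latin in RULES}
LATIN_TO_GREEK: Dict[str, str] = {latin: greek for greek, latin in RULES}


def transliterate(text: str, table: Dict[str, str]) -> str:
    """Replace each character that has a rule; pass everything else through."""
    return "".join(table.get(ch, ch) for ch in text)
```

Applying `GREEK_TO_LATIN` and then `LATIN_TO_GREEK` returns the original text as long as it uses only characters covered by the rules.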
Demo: ICU transliterator demo
• http://demo.icu-project.org/icu-bin/translit
Searching and Sorting
• The Unicode Collation Algorithm (UCA) provides the base ordering
• CLDR “tailors” it per language: English vs. Danish vs. French
• German: Mueller = Müller = MUELLER
• Multiple stages and options:
• blackbird vs black-bird vs BlackBird
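The equivalences on this slide can be illustrated with a loose comparison key that ignores case, accents, and punctuation. The sketch below is a rough stand-in written for illustration, not the UCA and not ICU's collator. The `primary_key` function and the `GERMAN_EXPANSIONS` table are hypothetical, with the table playing the role of a language-specific tailoring.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical tailoring for this sketch: treat German umlauts as vowel + e.
GERMAN_EXPANSIONS: Dict[str, str] = {"ä": "ae", "ö": "oe", "ü": "ue"}


def primary_key(text: str, expansions: Optional[Dict[str, str]] = None) -> str:
    """Build a key that ignores case, accents, and punctuation."""
    expansions = expansions or {}
    out = []
    for ch in unicodedata.normalize("NFC", text).casefold():
        # Apply the tailoring first, then strip whatever accents remain.
        replacement = expansions.get(ch, ch)
        for base in unicodedata.normalize("NFD", replacement):
            if unicodedata.combining(base):
                continue  # drop accents
            if base.isalnum():
                out.append(base)  # drop hyphens, spaces, other punctuation
    return "".join(out)
```

With `GERMAN_EXPANSIONS`, Mueller, Müller, and MUELLER all produce the same key; without it, Müller falls back to a plain "u". blackbird, black-bird, and BlackBird share one key regardless of tailoring, which is where further comparison levels would be needed to tell them apart. Passing `primary_key` as the `key` argument to `sorted` gives a case- and accent-insensitive ordering.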
Demo: Collator
• http://demo.icu-project.org/icu-bin/collation.html
Keyboards / Entry
• Standardized identifier for keyboard tables
• Allows comparison between keyboard providers
Demo: MARC processor
• Uses: CLDR, ICU4J, MARC4J
• CLDR data shown for the sample record:
• Script: Armn (Armenian)
• Exemplar text matches hy “Armenian”
• Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i”
• Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus
Thank You / Q&A
• srloomis@us.ibm.com
• @srl295 ( Twitter, GitHub, Freenode )
• ibm.biz/srloomis