Isolate all translatable messages into resource files so they can be efficiently localized. [Required]
All translatable text messages in the interface should be isolated in separate modules so that they can be efficiently localized. Only one language should be stored in each resource file because different languages often have conflicting code pages (with the exception of Unicode) and localization for different languages often takes place at different times. Comments should be added to source English message files to indicate strings that should not be translated and provide any translation instructions for specific messages. The best way to identify hard-coded source English strings is through regular testing with a pseudo-translated interface.
Whenever possible translatable messages should be stored in industry standard resource file formats. Localization vendors are well-acquainted with the standard formats and commercial translation tools can be used to efficiently process them. For Windows C++ applications messages should be stored in Windows resource files (.rc) and for .NET applications messages should be stored in XML-based .resx files.
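For a Windows C++ application, loading text through the string table keeps messages out of the source code. A minimal sketch, assuming the string identifiers (such as the IDS_FILE_NOT_FOUND used in the usage comment) are hypothetical entries defined in the .rc file and resource header:

#include <windows.h>
#include <string>

// Load a user interface string from the module's string table.
std::wstring LoadUiString(HINSTANCE hInst, UINT id)
{
    wchar_t buffer[512];                        // leave room for text expansion
    int len = LoadStringW(hInst, id, buffer, 512);
    return std::wstring(buffer, len > 0 ? len : 0);
}

// Usage (IDS_FILE_NOT_FOUND is a hypothetical identifier from resource.h):
// std::wstring msg = LoadUiString(hInstance, IDS_FILE_NOT_FOUND);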
Allow for expansion of translated text. [Required]
Strings inevitably get longer during translation. In general, the shorter the source English message, the more likely it is to grow in other languages. On the other hand, a source English message that is very long will probably not increase in length as much when it is translated. German tends to have long words and provides a good stress test for user interface layout. Asian fonts generally require a little more space in the vertical direction.
Internal text buffers should be long enough to accommodate expansion of translated text. The user interface layout should also allow for expansion. When possible extra white space should be added to the English user interface to provide some room for expansion. In some cases labels can be put above the corresponding field rather than to the left of it so that the labels have plenty of room to expand during localization.
Dialog boxes should be designed based on the lowest supported monitor resolution. English dialogs should not be too large because localized versions could become even bigger and occupy too much screen space (or even not fit on the screen).
Even if the English user interface has extra room to allow for some expansion it still might be necessary to resize dialog boxes during localization. The translation kit should provide the ability for localization vendors to independently resize dialog boxes as necessary. This means that the resource file format needs to contain not only translatable messages but also geometric information about layout.
There are commercial translation tools which can be used to resize dialog boxes for Windows resources (.rc) and .NET resources (.resx). For web files the HTML layout should be designed to allow extra room and graceful wrapping for translated messages.
An aspect ratio macro should be defined for each MDL dialog box in order to allow MDL dialog boxes to be resized.
Aspect ratio macros should have a one-to-one relationship with MDL dialog boxes: no macro should control multiple dialog boxes, and no dialog box should be controlled by multiple macros. Aspect ratio macros should be commented with the title of the dialog box they control and the key-in that opens the dialog. Every MDL dialog box should have an associated key-in to facilitate test automation.
Use consistent English terminology and style. [Required]
A glossary of key terms should be created for each product to maintain consistency in the base English product user interface and documentation. Slang expressions and cultural references in messages should be avoided. At the start of a localization project the English terminology should be translated to establish a glossary of the proper terminology in the target language.
Guidelines should be established for format, punctuation, capitalization, and style of text messages in labels, menus, prompts, error messages, and tool tips. A technical writer should review all user interface messages. If terminology and style are consistent in the base English product it will help in producing a consistent translated product. It will also make the base English interface clearer for non-native English speakers.
Translatable strings in resource files should be organized in the order in which they appear in the user interface so that translators have context. Appending new strings at the end of the file is not helpful: because most existing translations are leveraged automatically, the new strings are often the only ones a translator works on, and placed at the end they carry no context.
Use a single meaning for messages. [Required]
Each user interface message should have a single meaning. Sometimes the same message in English can be used in two different ways but in another language the message would need to be translated differently in each case. A unique message should be used for each context so that each can be translated differently if necessary.
Do not construct messages. [Required]
Word order and grammar are different in each language so user interface messages should not be constructed. If a program constructs an English sentence by concatenating phrases then it might not be possible to put the phrases together correctly in other languages. Concatenated messages can be found by inspecting source code or through testing of the pseudo-translated interface.
Similarly, user interface messages should not be constructed by combining English phrases with user interface elements (like option menus or text fields) to form a sentence. As above the pieces might not fit together appropriately in other languages. A "Label: <Field>" approach should be used instead of forming phrases with interface elements.
Allow parameters in messages to be re-ordered during localization. [Required]
Messages often contain parameters which allow numeric or text values to be substituted into the message at run time. When there is more than one parameter in a message it is sometimes necessary to change the order of the parameters during localization to account for grammatical differences in the target language. There should be a method for changing the order of parameters in a translated message and the procedure should be specified in the translation kit documentation.
Messages with parameters should provide as much context as possible. For example, the string "File %s not found" is better than "%s not found" because the word "File" helps the translator choose the correct gender and grammatical form in the target language.
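On Windows, FormatMessage uses numbered insertion points (%1, %2, ...), which lets a translator reorder parameters without any code change. A minimal C++ sketch, assuming the message pattern would normally be loaded from a resource file (it is shown inline here only for illustration):

#include <windows.h>

// Format and display a message whose parameters can be reordered during localization.
void ShowNotFound(const wchar_t* pattern, const wchar_t* file, const wchar_t* folder)
{
    DWORD_PTR args[] = { reinterpret_cast<DWORD_PTR>(file),
                         reinterpret_cast<DWORD_PTR>(folder) };
    wchar_t msg[512];
    FormatMessageW(FORMAT_MESSAGE_FROM_STRING | FORMAT_MESSAGE_ARGUMENT_ARRAY,
                   pattern,          // e.g. L"File %1 not found in %2"
                   0, 0, msg, 512,
                   (va_list*)args);
    MessageBoxW(nullptr, msg, L"Error", MB_OK);
}

A translated pattern could legitimately swap the insertion points (for example, a pattern equivalent to "In %2 the file %1 was not found") without touching the code.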
Do not hard-code font names. [Required]
Whenever possible fonts should be selected which can properly display character data for each supported locale. Font names should not be hard-coded because a given font might not support glyphs required in other languages. For Windows applications font names should be included in a resource file and for web applications font face names should be defined in a style sheet so they can be adjusted during localization if necessary.
The Tahoma font is the system font for Windows 2000 and XP and exists on all localized versions of those operating systems, so it can be a good choice. Microsoft also provides a font face name called MS Shell Dlg, which is mapped to a shell font capable of displaying characters for the current locale. The best way to test fonts is to use a locale setting and pseudo-translated interface that contain character data outside the Latin 1 character set (Windows code page 1252).
Support locale-sensitive formats for date, time, numbers, currency, and other cultural formats. [Required]
Calendar, Date, and Time
The Gregorian calendar is widely used throughout the world but in some regions there are other calendars including the Japanese era name calendar, ROC (Taiwan) calendar, Buddhist Era (B.E.) calendar, Hijri calendar, and Hebrew calendar. Windows provides support for all of these calendar types.
There are three main sequences used in dates: year-month-day, day-month-year, and month-day-year. There are both long and short forms of the date format. In the short date form the number of digits used to represent year, month, and day varies in each locale and the separator between units can be a slash, dash, or period.
In some countries the first day of the week is considered to be Sunday while in others it is Monday. Depending on the country, weekends can be (Saturday, Sunday), (Thursday, Friday), or (Friday, Saturday).
Depending on the locale, times are represented using AM/PM or the 24-hour clock. There are 24 time zones in the world but the rules for computing the time in a given locale are complicated by regional differences in daylight saving time. Since client-server software often spans multiple time zones it is important to consider the time zone when processing date/time information. Date/time information could always be presented in the server time zone or could be adjusted to the user's local time zone.
There are a number of time zone naming issues. One time zone may have many names, or it may have no name at all. The name of a time zone may change when daylight saving time is in effect. Time zone name abbreviations are not unique.
There is an international standard, ISO 8601, for date and time data. It is intended for interchange of date and time data rather than presentation. The standard relies on the Gregorian calendar and Coordinated Universal Time (UTC). A number of formats are supported by the standard but the general form is YYYY-MM-DDThh:mm:ss. ISO 8601 should be used to exchange date and time data in a locale-independent manner.
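A minimal C/C++ sketch of the distinction: the same moment is formatted once according to the user's locale for display and once in ISO 8601 for exchange.

#include <clocale>
#include <cstdio>
#include <ctime>

int main()
{
    std::setlocale(LC_TIME, "");              // pick up the user's locale settings
    std::time_t now = std::time(nullptr);
    std::tm local = *std::localtime(&now);

    char display[128], exchange[64];
    std::strftime(display, sizeof display, "%x %X", &local);              // locale-sensitive date and time
    std::strftime(exchange, sizeof exchange, "%Y-%m-%dT%H:%M:%S", &local); // ISO 8601 for data exchange

    std::printf("display:  %s\nexchange: %s\n", display, exchange);
    return 0;
}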
Numbers and Currency
Decimal numbers should be formatted in a way that is appropriate for the user's locale. The symbols used for decimal point and grouping differ in each locale. Currency should also be formatted in a locale-sensitive manner. The currency symbol, the position of the symbol, and the number format are different in each locale.
There is an international standard ISO 4217 which defines currency codes. ISO 4217 codes should be used when currency data needs to be stored or exchanged in a locale-independent way.
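A minimal C++ sketch, assuming the standard library's locale support is sufficient for the application: imbuing a stream with the user's locale supplies the decimal separator, digit grouping, and currency symbol from locale data instead of hard-coded characters.

#include <iomanip>
#include <iostream>
#include <locale>

int main()
{
    try {
        std::locale user("");                 // the user's default locale
        std::cout.imbue(user);
        std::cout << std::fixed << std::setprecision(2)
                  << 1234567.89 << '\n';      // locale decimal mark and digit grouping
        std::cout << std::showbase
                  << std::put_money(123456) << '\n'; // e.g. $1,234.56 in a U.S. English locale
    } catch (const std::runtime_error&) {
        // Locale "" can fail on minimal systems; fall back to the classic "C" locale.
    }
    return 0;
}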
Names, Addresses, and Phone Numbers
In the U.S. names are of the form <title> <given name> <family name> <suffix>. In East Asia the family name precedes the given name and the equivalent of the title occurs at the end. Current system locale data does not assist with locale-sensitive formatting of names.
Different name formats can be supported by using parameterized messages which could be customized during localization.
The local address format is different in each country. Current system locale data does not assist with locale-sensitive formatting of postal addresses. Different address formats can be supported by using parameterized messages which could be customized during localization.
Be sure to include the country code before any phone numbers. Note that toll-free numbers might not work from outside the country of origin.
Measurement Units
Applications should be flexible enough to support both U.S. and metric measurement units. Most regions in the world use the metric system.
Support A4 paper size. [Required]
Outside North America A4 paper size is used instead of Letter size. A4 paper is 210mm by 297mm and is a little narrower and longer than Letter size. Any files or user interface screens that are formatted for printing should be designed to fit on either A4 or Letter size paper. Documentation should be designed so that it can be efficiently printed on either Letter or A4 size paper.
Use culturally appropriate icons and avoid text in icons. [Required]
Icons should be culturally appropriate, maintain simplicity, and make use of international symbols when possible. Potentially offensive symbols, cultural references, religious symbols, gestures, and flags should be avoided. Text in icons should be avoided because it is very expensive to localize. In order to localize text in icons it is necessary to use graphics editors rather than standard computer assisted translation tools.
It is expensive to localize icons so the goal is to have international icons which can be used in all locales. Unfortunately it is often difficult to design meaningful icons which are universal. In case an alternate icon needs to be used for a given locale you should make sure that it is possible to substitute icons. Icons should be defined in a modular way and should not be bound inside executable code. Icons should be included in the localization kit in case they need to be customized for a given region. Icons should be periodically reviewed for international suitability.
Shortcut-key combinations should be accessible on international keyboards. [Required]
Use international English whenever possible. [Best practice]
International English should be used whenever possible. For example, it would be best to use the generic term "Postal code" rather than "ZIP code", which is U.S. specific. It is important to have a generic base English product which could be marketed in other countries for cases where the cost of localization into a specific language cannot be justified.
Use culturally appropriate colors and sounds. [Best practice]
Exercise caution when using colors and sounds in the user interface. Colors and sounds have different connotations in each country. For example, red means "stop" or "error" in some regions but in other countries it means "happy".
Determine the user's desired language and country. [Required]
Locales define preferences for a cultural or geographic region. Locale names usually consist of a language or (language, country) pair. For example, some French-speaking locales are "French", "French (France)", and "French (Canada)". ISO two letter language codes and country codes are used to construct locales in .NET. RFC 3066 language tags are often used for web applications.
When an application is initialized it should determine the user's preferred locale. A Windows desktop application can determine the user's locale by checking the Regional Options settings in the Control Panel. A web application can determine the user's preferred locale by the web browser language preference, cookies, or a user preference in the application.
Set the locale in the application. [Required]
Once the user's preferred locale is determined the default locale for the application should be set appropriately. Operations which are affected by the locale setting (like time/date format, numeric format, currency format, sort order) should be performed in a locale-sensitive manner.
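A minimal C/C++ sketch of setting both the C runtime locale and the C++ library locale from the user's preference at startup; the empty locale name selects the user's regional settings on Windows or the LANG/LC_* environment on Unix.

#include <clocale>
#include <locale>

void InitLocale()
{
    std::setlocale(LC_ALL, "");                   // C runtime locale
    try {
        std::locale::global(std::locale(""));     // C++ library locale
    } catch (const std::runtime_error&) {
        std::locale::global(std::locale::classic()); // fall back to the "C" locale
    }
}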
Avoid assumptions about the size of characters or character sets. [Required]
When designing algorithms or data structures it is important to avoid incorrect assumptions about the size of characters or character sets. Single-byte characters are always one byte, but multibyte characters used for East Asian languages can be one or two bytes. Wide characters are of uniform width but the width can vary depending on the platform. In the Unicode encoding UTF-16, code points are typically represented with two bytes, but characters outside the Basic Multilingual Plane are accessed through surrogate pairs and require four bytes. In the Unicode encoding UTF-8, characters can be one, two, three, or four bytes.
The basic ASCII character set has 128 characters. The Latin 1 character set for Western European languages has 256 characters. National character sets for Chinese, Japanese, and Korean have thousands of characters. The Unicode Standard, version 4.0 provides codes for more than 96,000 characters from the writing systems of the world.
Support input, processing, and display of accented characters for European languages and multibyte characters for Chinese, Japanese, and Korean. [Required]
There are three basic types of characters: single-byte characters, multibyte characters, and wide characters. At a minimum existing applications should support multibyte characters so they can handle East Asian languages. Ideally applications should use wide characters encoded in Unicode so they can support character data from all key languages at the same time. As applications evolve they might include a mixture of multibyte characters and wide characters. New applications should be implemented with wide characters encoded in Unicode.
In C/C++ multibyte characters are represented by the char data type and string processing has to be done in a multibyte-aware manner. In C/C++ wide characters are represented with the wchar_t data type and strings are processed with wide character routines. In .NET the character data type is a wide character encoded in Unicode (UTF-16).
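The difference between bytes, code units, and characters can be made concrete with a small C++ sketch; the code-point counter below is illustrative only and assumes well-formed UTF-8 input.

#include <cstdio>
#include <cstring>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx).
std::size_t Utf8CodePoints(const char* s)
{
    std::size_t count = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    return count;
}

int main()
{
    const char* word = "Gr\xC3\xB6\xC3\x9F" "e";   // "Größe": 5 characters, 7 bytes in UTF-8
    std::printf("bytes: %zu, code points: %zu, sizeof(wchar_t): %zu\n",
                std::strlen(word), Utf8CodePoints(word), sizeof(wchar_t));
    return 0;
}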
If necessary applications should perform transcoding or conversion between different character encodings upon input and output. [Required]
Applications often need to operate in a heterogeneous environment, interface with third-party products, and support legacy data. When necessary a program needs to be able to convert between different character encodings.
There are three encodings of Unicode: UTF-8, UTF-16, and UTF-32. An application might use UTF-16 for internal processing but store data or generate web pages in UTF-8.
A simple and efficient algorithm can be used to convert among the three Unicode encodings.
It is also necessary to be able to convert between Unicode and legacy multibyte character encodings. For example, an internationalized software application might process character data internally in Unicode but still need to handle multibyte character encodings at system interfaces. Unicode contains all the characters from key multibyte character encodings and it is possible to do roundtrip conversions between Unicode and key multibyte character encodings. The arrangement of kanji characters is different in Unicode than in the Chinese, Japanese, and Korean character sets so a transformation table is used to perform the conversions.
It might even be necessary for an application to convert between different multibyte character encodings. For example, there are three common Japanese multibyte character encodings: Shift-JIS, EUC, and ISO-2022-JP. Shift-JIS is widely used on PCs, EUC is typically used in a Unix environment, and ISO-2022-JP is used to exchange data as in email messages which are sent over the Internet. When moving character data between a Unix system and a personal computer it might be necessary to convert between EUC and Shift-JIS and when sending an email over the internet it is necessary to transcode to ISO-2022-JP. There are standard algorithms to convert among Shift-JIS, EUC, and JIS character encodings.
It is common to convert between two different character encodings by using Unicode as a pivot point. The data is converted from the first character encoding to Unicode and then from Unicode to the second character encoding. Key programming languages and operating systems provide functions to convert between a wide variety of character encodings.
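On Windows the system conversion functions can be used for this. A minimal C++ sketch converting UTF-8 input to UTF-16 for internal processing; the reverse direction uses WideCharToMultiByte, and a legacy code page such as 932 (Shift-JIS) can be passed in place of CP_UTF8.

#include <windows.h>
#include <string>

std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    // First call: ask for the required buffer size in UTF-16 code units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(len, L'\0');
    // Second call: perform the conversion into the allocated buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}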
Characters should be classified (alphabetic, numeric, printable, etc.) either using a locale-sensitive function or according to Unicode character properties. [Required]
The traditional approach is to use a locale-sensitive function to perform character classification based on the locale data for a given language and country. The more modern approach is to use functions which perform character classification based on Unicode character properties.
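A minimal C/C++ sketch of the traditional, locale-sensitive approach using the wide-character classification functions; it assumes the locale has been set first, as in the earlier example.

#include <clocale>
#include <cstdio>
#include <cwctype>

int main()
{
    std::setlocale(LC_ALL, "");                // classification follows the user's locale
    wchar_t ch = L'\u00E9';                    // 'é'
    std::printf("alphabetic: %d, digit: %d\n",
                std::iswalpha(static_cast<wint_t>(ch)) != 0,
                std::iswdigit(static_cast<wint_t>(ch)) != 0);
    return 0;
}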
Case conversion should be performed in a locale-sensitive manner or according to Unicode character properties. [Required]
There is a distinction between upper and lower case letters in the Latin alphabet, but Chinese ideographs, Korean Hangul, and Japanese kana symbols do not have a distinction between upper and lower case. The traditional approach is to use locale-sensitive functions to perform case conversions based on the locale data for a given language and country. The more modern approach is to use functions which perform conversions based on Unicode character properties.
Case conversions should be done in a way that supports multibyte characters; otherwise, the second byte of a double-byte Japanese character could be mistakenly converted, which would corrupt the character.
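A minimal C++ sketch of a case conversion that operates on wide characters, so no multibyte sequence can be split; it assumes the locale has already been set at startup.

#include <algorithm>
#include <cwctype>
#include <string>

// Convert a wide-character string to upper case using the current locale's rules.
// A byte-oriented toupper over Shift-JIS text could corrupt the trail byte of a character.
std::wstring ToUpperLocale(std::wstring s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](wchar_t c) {
                       return static_cast<wchar_t>(std::towupper(static_cast<wint_t>(c)));
                   });
    return s;
}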
Text strings should be collated or sorted in a culturally appropriate manner. [Required]
Sorting character data based on the binary value of character set codes rarely produces the expected culturally appropriate order. For example, in a straight binary sort of ASCII data the letters "a" and "A" do not sort next to each other while in a linguistic sort for English the lower case and upper case letters would sort next to each other. In order to sort character data in a culturally appropriate manner it is necessary to use locale-sensitive linguistic sorting rather than binary sorting. A culturally appropriate collation algorithm uses multiple levels to properly handle characters from different scripts, upper/lower case characters, diacritics, full-width/half-width characters, and special symbols. C, C++, C#, and Java provide locale-sensitive sorting functions which produce culturally appropriate results.
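A minimal C++ sketch of locale-sensitive sorting: std::locale can be used directly as the comparison predicate, which routes the comparison through the collate facet of the user's locale rather than comparing raw code values.

#include <algorithm>
#include <locale>
#include <string>
#include <vector>

void SortForDisplay(std::vector<std::wstring>& items)
{
    std::locale user("");                          // throws on systems without locale support
    std::sort(items.begin(), items.end(), user);   // culturally appropriate comparison
}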
Applications should support on-the-spot input of Chinese, Japanese, and Korean characters. [Best practice]
Chinese, Japanese, and Korean character sets contain thousands of characters. A utility called an Input Method Editor (IME) or Front End Processor (FEP) is typically used to assist the user in entering East Asian character data. The application should support the IME so that East Asian character data can be entered naturally into any text field.
Applications should support bidirectional Arabic and Hebrew text. [Best practice]
Compared to European and East Asian languages, Arabic and Hebrew represent a small share of the software market, but support for bidirectional issues should be considered during the product development cycle. Arabic and Hebrew are primarily written right-to-left although some embedded foreign words are written left-to-right. For Arabic and Hebrew the user interface should basically be the mirror image of the English user interface. Menus, dialogs, and labels should all flow from right-to-left rather than left-to-right.
There is a distinction between the logical form of text (the order in which it is spoken) and the presentation form (how it is displayed). Text should be stored in the logical form. In Arabic each character has up to four forms depending on its position in a word (isolated, initial, medial, final). Contextual analysis must be performed to determine the proper glyph for each character.
Text processing for bidirectional writing is complex and whenever possible the support provided by the underlying programming language and operating system should be used.
Applications should use Unicode for internal text processing. [Best practice]
Unicode is a universal character set which includes all the characters used in the major writing systems of the world. It allows text data to be exchanged across programs, platforms, languages, and countries. Unicode is supported by the major operating systems, modern web browsers, and many other applications.
There are three character encoding schemes that are used to represent Unicode text: UTF-8, UTF-16, and UTF-32. The three encodings are all capable of representing the full repertoire of Unicode characters but they encode the data differently in code units of 8, 16, or 32 bits. UTF-8 is often used for exchange and storage of Unicode text data. UTF-8 is a multibyte encoding of Unicode character set in which each character is one to four bytes in length. An ASCII character has the same value in UTF-8, Western European accented characters are typically two bytes, and Asian characters are generally three bytes in UTF-8. In UTF-16 each basic character is two bytes (with the exception of surrogates used to access rare characters which require four bytes).
In Windows 2000/XP C++ wide characters are encoded in UTF-16 and C# characters are also encoded in UTF-16.
Applications should support Unicode surrogate pairs. [Best practice]
In the Unicode Standard version 4.0 there are thousands of characters beyond the Basic Multilingual Plane, so two bytes are not sufficient to access every character. As noted above, C++ wide characters on Windows 2000/XP and C# characters are encoded in UTF-16. In UTF-16 the basic code point is two bytes, but some characters must be represented with surrogate pairs. When traversing character strings, functions which support surrogate pairs should be used.
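A minimal C++ sketch of surrogate-aware traversal, assuming a platform such as Windows where wchar_t holds UTF-16 code units.

#include <cstddef>
#include <string>

// Count characters (code points) in a UTF-16 string, treating a high surrogate
// (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) as one character.
// A naive "one wchar_t = one character" loop would count such a pair twice.
std::size_t Utf16CodePoints(const std::wstring& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        ++count;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;                                   // skip the trailing surrogate
    }
    return count;
}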
Third-party software that is included in a product should also conform to the internationalization guidelines. [Required]
The third-party software should be evaluated through product specifications and hands-on testing to determine the level of support for internationalization and localization. In particular it is important to check if the third-party products support multibyte characters or Unicode and to confirm if the third-party products contain any internal messages that need to be localized.
Product installer should be internationalized and localized. [Required]
The product installer should be internationalized and localized. Installers are typically the last thing that is developed but they are the first thing that a customer sees. For a server product the installer internationalization/localization is probably not as important as for a desktop product because the product installation will only be performed by a small number of administrators rather than a large number of end-users.
The installer should be integrated into the translation kit. [Required]
The installer should be integrated into the translation kit using an approach that is similar to that used by MicroStation. This will allow the localization teams to localize the installers and will also provide the ability for localization vendors to build and test install sets.
Any custom installer messages should be included in the translation kit so they can be translated. Similarly any installer billboards with marketing messages should also be included in the translation kit.
External localization vendors should have the ability to build localized install sets. The major vendors should have access to the same installer technology if necessary for the build process. In the current MicroStation translation kit there is a simple security scheme which uses a password to distinguish between internal builds and external builds by localization vendors, mainly to prevent unreleased localized products from getting into the field.
Web pages should be published in UTF-8 character encoding or the appropriate local character encoding. [Required]
Localized web pages need to use a character encoding that can properly represent all of the character data in the pages. Pages can either be encoded in the appropriate local code page or in Unicode UTF-8 encoding. UTF-8 has the advantage that it supports all major writing systems and can be used for multilingual web pages. In older web browsers UTF-8 is not consistently supported but the latest browser versions support UTF-8.
HTML pages should be tagged with the proper character encoding and language. [Required (for character encoding)]
HTML documents transmitted by HTTP should specify the proper character encoding in the charset parameter of the Content-Type header. For a stand-alone HTML file the charset should be specified at the top of the file using a meta tag. The value for the charset can be any character encoding included in the IANA Character Set Registry, but browsers only recognize a subset of the registered character encodings. For the UTF-8 character encoding the charset value is "utf-8". Marking the charset will allow the web browser to correctly interpret the character data and select the proper font to render the text.
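For a page encoded in UTF-8 the meta tag would look like this.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">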
Ideally the language of the HTML page should also be specified with the lang attribute. The language value is a code from RFC 3066 Tags for the Identification of Languages which is made up of an ISO 639 language code and an optional ISO 3166 country code. For English it would look like this.
<html lang="en">
XML documents should be tagged with the proper character encoding and language. [Required (for character encoding)]
XML documents should be encoded in UTF-8 and should be labeled accordingly.
<?xml version="1.0" encoding="UTF-8"?>
The language tag for English XML content would look like this.
<Info xml:lang="en">
Database should store data in a locale-independent manner. [Required]
In general, data including dates, times, numbers, percentages, and currencies should be stored and transferred in a locale-independent way. The user interface application should format the information appropriately for the desired locale.
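A minimal C++ sketch of the storage side, using the invariant "C" locale so the serialized form is identical regardless of the user's settings; the display side would use the locale-sensitive formatting shown earlier.

#include <locale>
#include <sstream>
#include <string>

// Serialize a numeric value for storage or transfer: always '.' as the decimal
// point and no digit grouping, independent of the user's locale.
std::string ToStorageForm(double value)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << value;
    return out.str();
}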
Database text fields should be large enough to handle character data in European and Asian languages. [Required]
The size of text fields in the database should be checked to see if they need to be increased for foreign language text data. A field size of n means n characters in Microsoft Access 2000 but n bytes in SQL Server 2000 and Oracle 8. Depending on the underlying database character encoding a character could be one to four bytes. A text field that is 32 bytes long will hold 32 ASCII characters but only 16 Japanese double-byte characters or 10 Japanese characters encoded in UTF-8.
Database schema should include any required international information. [Required]
The database schema should be reviewed to make sure that it contains sufficient fields to represent international data.
Japanese names are typically sorted in phonetic order. In order to sort a list of Japanese names it is necessary to associate corresponding phonetic fields for first name and last name. If you are planning to sell a product in the Japanese market then you should consider adding phonetic name fields to support ordering of Japanese names.
Database operations should use national language support functions. [Required]
Microsoft Access 2000, SQL Server 2000, and Oracle 8/9 databases provide extensive national language support for foreign language character data and for locale-sensitive processing and formatting. If locale-sensitive operations like sorting or case conversion are performed in the database they should make use of the database national language support.
Database character encoding should be Unicode. [Best practice]
Unicode should be used for the database character encoding so that it is possible to store character data from any language in the same database. Microsoft Access 2000 stores Text fields in Unicode. SQL Server 2000 has both Unicode and non-Unicode character data types. Oracle supports a wide variety of database character sets including multibyte encodings and Unicode. The Oracle database character set is defined when the database is created.
Translatable text messages should not be stored in a database. [Best practice]
It is usually easier to manage localization when user interface messages are stored in files under source code control than when they are stored in a database. If translatable messages are stored in a database then special tools will be needed to manage the messages. Tools will be required to extract the source language messages to files so they can be sent out for translation, to import translated strings for multiple languages back into the database, and to make sure that localized messages stay in sync when source language messages are added or modified. In a hosted environment with multiple staging servers it might also be necessary to develop tools to "push" updated English and localized messages from one database to another.
Create a software translation kit for each product. [Required]
A translation kit or localization kit is a self-contained collection of files that allows the development team to outsource translation to an external localization vendor without revealing all of the proprietary source code. A separate software translation kit should be created for each product. Product organizations are responsible for developing and certifying translation kits for their products.
The translation kit should provide the ability for external localization vendors to translate, build, resize, and test localized resources. [Required]
The translation kit should include all the translatable files in a product as well as any tools needed to build and test a localized version of the product. The installer should be integrated into the translation kit so that the localization vendor can translate custom installer messages, build a localized install set, and test it. The localization process is more efficient when localization vendors can independently complete the translate-build-test cycle without sending files back and forth for intermediate builds.
The translation kit should be documented for localization vendors. [Required]
The directory structure for the localization kit should be as self-documenting as possible. Translatable files should be included in a separate directory so it is easy for localization vendors to determine which files should be translated and which should not. Files that need to be customized rather than translated (like web style sheets) should be stored in a separate subdirectory. Non-translatable files and tools should also be stored in a separate directory.
The translation kit should include documentation which concisely describes the types of files in the kit and explains how to build and test localized resources. The documentation should describe message file syntax for non-standard formats (what is translatable and what is code), character encoding issues, procedure for re-ordering message parameters, and any specific translation instructions.
The translation kit should be created with each base English product build. [Required]
The translation kit should be automatically generated and compiled as part of every base English product build. When Development or Product Release builds the base English product they should actually create the translation kit and build the English product from the English files in the translation kit. By creating a translation kit with every build the delay in creating translation kits will be eliminated.
Where both a binary and text source is available (e.g. DLL and RC files), the text format should be used in the translation kit because it may better facilitate change tracking in source code version systems.
The build process for base English products and localized products should be the same. [Required]
By making the base English builds use the same process as the foreign language builds we will ensure that the translation kit compiles correctly and contains all the necessary resource files. English should be treated like "just another language".
The translated resource binary files should be modular and should be separate from the core executable for the application. [Best practice]
Not only should translatable messages be isolated into separate resource source files but the resources should also be modular at the binary level. By separating the user interface resources from the core executable there will be more confidence that the basic functionality of the product is the same regardless of the user interface language.
Large engineering projects often span multiple countries and languages so it is desirable to develop multilingual products which can support users in multiple locales. Modular resources allow one or more user interface languages to be delivered in the same product. The localized resources should be delivered in separate language packs.
For Windows C++ applications resource-only DLLs can be used. In .NET the default procedure is to build the base English resources into the main assembly and use satellite assemblies for localized user interface resources.
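A minimal C++ sketch of the resource-only DLL approach on Windows; any per-language module name such as "resources.fr.dll" is hypothetical and chosen by the product.

#include <windows.h>
#include <string>

// Map a language-specific, resource-only DLL for resource access only;
// no code in the module is executed.
HMODULE LoadLanguageModule(const wchar_t* path)
{
    return LoadLibraryExW(path, nullptr, LOAD_LIBRARY_AS_DATAFILE);
}

// Pull a user interface string from the loaded language module, keeping
// translated resources separate from the core executable.
std::wstring LoadUiStringFrom(HMODULE module, UINT id)
{
    wchar_t buffer[512];
    int len = LoadStringW(module, id, buffer, 512);
    return std::wstring(buffer, len > 0 ? len : 0);
}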