What is locale en_us?

Asked by OliviaJohnston in Java, asked on Oct 12, 2022

I am writing an API (using Java) that takes locale as a parameter. We want the clients to be able to specify "en-US" or "en_US" as they both seem to be widely used across all languages.

Java documentation (source 3 from above) states "Well-formed variant values have the form SUBTAG (('_'|'-') SUBTAG)* where SUBTAG = [0-9][0-9a-zA-Z]{3} | [0-9a-zA-Z]{5,8}. (Note: BCP 47 only uses hyphen ('-') as a delimiter, this is more lenient)."

My understanding was that both "_" and "-" are supported, but the code accepts only "-". See my sample unit test below, which fails with "en_US" but passes if I use "en-US".

@Test
public void testLocale() {
    // Fails with "en_US"; passes when the tag is "en-US"
    Locale locale = Locale.forLanguageTag("en_US");
    assertThat(locale.getLanguage(), equalTo("en"));
}

Are there ways to parse a locale from both forms of string, "en_US" and "en-US"? What is the recommended approach here?


Answered by Ono Imai

Rather than just replacing characters without understanding why, I'd like to explain why there appear to be two forms, one underscored and one hyphenated, and why one should care.

tl;dr: it is not as simple as a single-character replacement.

1. specifications

There are several specifications at play here:

ICU: the International Components for Unicode, which has a section on how to encode and represent locales (language, country, script, variant, etc.). This project is mostly used from native languages like C, and its locale identifiers should not be confused with the locales of a POSIX system.

Unicode CLDR: it offers a simple page on the equivalence with language tags. This representation allows both hyphens and underscores, while preferring the hyphen. It is also different from ICU.

BCP 47 (RFC 4646, since replaced by RFC 5646): it defines the language tag used in HTTP headers (RFC 3282), especially in Content-Language and in Accept-Language (language ranges).

So the standard on the web is to use language tags in certain headers. That means:

If the server application is relying on standard HTTP headers it should handle language tags (language range more specifically for the Accept-Language).

If the server application is using a custom header like Company-MyApp-Other-Language then no problem, it is custom to this ecosystem.

If the client applications are using standard HTTP headers but using a bad format, then indeed the server-app should try to handle those.
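For the standard-header case, a minimal sketch of Accept-Language handling could look like the following. It assumes Java 8 or later, where java.util.Locale.LanguageRange and Locale.lookup are available. The class name, the SUPPORTED list, and the negotiate helper are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public final class AcceptLanguageDemo {
    // Hypothetical set of locales this server can answer in.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("en-US"),
            Locale.forLanguageTag("fr-FR"),
            Locale.forLanguageTag("sr-Latn-RS"));

    /** Returns the best supported locale for an Accept-Language value, or the fallback. */
    static Locale negotiate(String acceptLanguage, Locale fallback) {
        if (acceptLanguage == null) {
            return fallback;
        }
        try {
            // Parses weighted language ranges such as "fr-FR,fr;q=0.8,en;q=0.5".
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            // RFC 4647 lookup; returns null when nothing matches.
            Locale match = Locale.lookup(ranges, SUPPORTED);
            return match != null ? match : fallback;
        } catch (IllegalArgumentException e) {
            // Ill-formed ranges (for example with underscores) are rejected by the parser.
            return fallback;
        }
    }
}
```

Falling back on ill-formed input is only one possible policy; a server could equally reject the request or try a more lenient parse first.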

2. some implementation details

The nice thing about PHP's Locale is that it handles ICU format and Language-tags.

However, for Java the story is a bit more complicated. The Locale class is pretty old (JDK 1.1, circa 1997) and could not parse or format (toString) any of the above-mentioned standards; it used the underscore as a separator, but it is not ICU compliant. In JDK 1.7 (circa 2011), forLanguageTag and toLanguageTag were added to support the BCP 47 / RFC 4646 standard; the old methods kept their legacy behaviour for backward compatibility. As of this writing (March 2018), the ICU / CLDR locale format is not supported by the JDK's Locale.

Indeed, the naive approach, which works in most situations (99.999%), is to replace underscores with hyphens. Some rare cases can appear when the string only contains an extension.

However, this does not work for every ICU representation, especially when there are ICU keywords like @currency=...; those cannot be converted via a single-character replacement. For example:
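The difference between the two API families can be seen with a small, self-contained check. The class name is made up for this illustration; the Locale calls are the standard JDK ones.

```java
import java.util.Locale;

public final class LocaleApiDemo {
    public static void main(String[] args) {
        Locale us = Locale.forLanguageTag("en-US");

        // BCP 47 form, added in JDK 1.7
        System.out.println(us.toLanguageTag()); // en-US

        // Legacy form with underscores
        System.out.println(us.toString()); // en_US

        // An underscore makes the whole tag ill-formed for forLanguageTag,
        // so the language comes back empty instead of "en".
        Locale bad = Locale.forLanguageTag("en_US");
        System.out.println(bad.getLanguage().isEmpty()); // true
    }
}
```

This is why the test in the question fails: forLanguageTag does not throw on "en_US", it silently returns a locale without a language.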

sr_Latn_RS_REVISED@currency=USD

en_IE@currency=IEP
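One way to accept both "en_US" and "en-US" while tolerating identifiers like the two above is sketched below. The class and method names are hypothetical. Note the assumption that it is acceptable to discard the ICU keyword part entirely; preserving it would need a library that understands ICU identifiers.

```java
import java.util.Locale;

public final class LenientLocaleParser {
    /**
     * Hypothetical helper: accepts "en-US", "en_US" and ICU-style identifiers.
     * ASSUMPTION: anything after '@' (ICU keywords) is dropped, not converted.
     */
    static Locale parse(String value) {
        int at = value.indexOf('@');
        String base = at >= 0 ? value.substring(0, at) : value;
        // The naive step: underscores become hyphens, then parse as BCP 47.
        return Locale.forLanguageTag(base.replace('_', '-'));
    }

    public static void main(String[] args) {
        System.out.println(parse("en_US").toLanguageTag());              // en-US
        System.out.println(parse("en_IE@currency=IEP").toLanguageTag()); // en-IE
        Locale sr = parse("sr_Latn_RS_REVISED@currency=USD");
        System.out.println(sr.getScript() + " " + sr.getCountry());      // Latn RS
    }
}
```

The currency information is lost in both keyword examples, which is exactly the limitation of treating the problem as a character replacement.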

This also won't work for Java clients that use the toString of the Java Locale. That is more likely to cause problems for countries where the script is an important part of the locale, e.g. for Serbia:

Locale.forLanguageTag("sr-Latn-RS").toString() ⇒ sr_RS_#Latn

This legacy toString cannot be interpreted as a valid language tag; in fact, the parser dismisses the script completely.

Just reading the toString javadoc should raise an eyebrow:

Returns: A string representation of the Locale, for debugging.

(Emphasis is mine.)
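The loss of the script can be reproduced with a short round trip. This is only an illustration; naiveParse is a made-up name for the replace-then-parse approach discussed above.

```java
import java.util.Locale;

public final class ToStringRoundTrip {
    /** The naive conversion: swap underscores for hyphens, then parse as a language tag. */
    static Locale naiveParse(String s) {
        return Locale.forLanguageTag(s.replace('_', '-'));
    }

    public static void main(String[] args) {
        Locale serbianLatin = Locale.forLanguageTag("sr-Latn-RS");

        // Legacy representation, as a Java client might send it
        String legacy = serbianLatin.toString();
        System.out.println(legacy); // sr_RS_#Latn

        // "#Latn" is not a valid subtag, so parsing stops after the region.
        Locale roundTripped = naiveParse(legacy);
        System.out.println(roundTripped.toLanguageTag());       // sr-RS
        System.out.println(roundTripped.getScript().isEmpty()); // true
    }
}
```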

So handling every possible bad behavior in a standard header is tricky. The best option is to respect the language tag RFC, which browsers respect. If that is not possible, it is best to identify which application is responsible for the misbehavior and which format it uses. This avoids having to handle every possible format.
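If the choice is to accept only well-formed language tags, a strict check can be sketched with Locale.Builder, which, unlike forLanguageTag, throws on ill-formed input. The class and method names below are hypothetical.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;
import java.util.Optional;

public final class StrictLanguageTag {
    /** Returns the locale for a well-formed BCP 47 tag, or empty if the tag is ill-formed. */
    static Optional<Locale> parse(String tag) {
        try {
            return Optional.of(new Locale.Builder().setLanguageTag(tag).build());
        } catch (IllformedLocaleException e) {
            // Rejecting here makes misbehaving clients visible instead of guessing their format.
            return Optional.empty();
        }
    }
}
```

A caller could log the rejected value to find out which client sends the non-standard format, then decide whether a targeted conversion is worth adding.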

3. post scriptum

Besides, the javadoc passage quoted in the question only applies to the variant part of a locale, not to the language and country:

Well-formed variant values have the form SUBTAG (('_'|'-') SUBTAG)* where SUBTAG = [0-9][0-9a-zA-Z]{3} | [0-9a-zA-Z]{5,8}. (Note: BCP 47 only uses hyphen ('-') as a delimiter, this is more lenient).


