How does syntax="IA5String" work?

"ab😀c" when encoded in UTF-8 is [97, 98, 240, 159, 152, 128, 99].


tokenId = "97, 98, 240, 159, 152, 128, 99, ..." (using decimal instead of hex for readability)

<ts:attribute-type id="street" syntax=""> <!-- DirectoryString -->
    <ts:string xml:lang="en">Street</ts:string>
    <ts:token-id as="utf8" bitmask="FFFFFFFFFFFFFF00000000000000000000000000000000000000000000000000"/>

We expect the value of street to be "ab😀c".
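A minimal sketch of that extraction, assuming the bitmask simply selects bytes of the 256-bit tokenId and the result is decoded per `as="utf8"` (the helper name `extract_attribute` is hypothetical, not part of TokenScript):

```python
# Hypothetical sketch: apply the bitmask byte-by-byte to the tokenId,
# drop the zeroed-out tail, and decode the remainder as UTF-8.
def extract_attribute(token_id: bytes, bitmask: bytes) -> str:
    masked = bytes(t & m for t, m in zip(token_id, bitmask))
    return masked.rstrip(b"\x00").decode("utf-8")

# tokenId from the example above: "ab😀c" in UTF-8, zero-padded to 32 bytes.
token_id = bytes([97, 98, 240, 159, 152, 128, 99] + [0] * 25)
# The bitmask from the snippet: seven FF bytes followed by 25 zero bytes.
bitmask = bytes.fromhex("FFFFFFFFFFFFFF" + "00" * 25)
print(extract_attribute(token_id, bitmask))  # ab😀c
```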

What should the value of street be if the syntax is IA5String?

<ts:attribute-type id="street" syntax=""> <!-- IA5String -->
    <ts:string xml:lang="en">Street</ts:string>
    <ts:token-id as="utf8" bitmask="FFFFFFFFFFFFFF00000000000000000000000000000000000000000000000000"/>
  1. ""
  2. "ab😀c"
  3. "ab"
  4. "ab😀c" (or something like that)

Good question. There are really only two ways to deal with the issue: either invalidate the entire origin and move on to the next origin (if there is one), or go with your option 3†.

Suppose we have:

  1. "correct input leads to consistent correct output; incorrect input leads to various unpredictable output";
  2. "correct input leads to consistent correct output; incorrect input leads to consistent error"

The first seems to have a better chance of survival in the competition among technologies. (Think of how HTML won over XHTML.)

† Why option 3 is the only choice, barring invalidating the origin

Let's look at option 3. It is more forgiving to the next stage in the data-processing pipeline. That stage would not expect non-ASCII codes in the string and would choke badly on the smiley if we did not drop it.
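Option 3 amounts to keeping the longest ASCII prefix, truncating at the first character outside IA5 (which is essentially 7-bit ASCII). A sketch:

```python
# Option 3 sketch: truncate at the first non-ASCII (non-IA5) character.
def truncate_at_non_ascii(s: str) -> str:
    for i, ch in enumerate(s):
        if ord(ch) > 127:
            return s[:i]
    return s

print(truncate_at_non_ascii("ab\U0001F600c"))  # ab
```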

There is a new option 3':

3'. "abc"

It is also what a data-indexing engine would do. Suppose Google indexes a web page and finds an invalid codepoint; it would still index the rest of the page. TokenScript is designed with data-indexing engines in mind. It is also what HTML engines do with broken HTML segments. The downside of this approach is that you might get "Cancin" out of "Canción".
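Option 3' can be sketched as a filter that drops every non-ASCII character and keeps the rest:

```python
# Option 3' sketch: keep only ASCII characters, dropping the rest.
def drop_non_ascii(s: str) -> str:
    return "".join(ch for ch in s if ord(ch) <= 127)

print(drop_non_ascii("ab\U0001F600c"))  # abc
print(drop_non_ascii("Canci\u00f3n"))   # Cancin -- the downside mentioned above
```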

Banning IA5String altogether?

History shows that most applications of IA5String don't last. It used to be a requirement for email addresses, but we soon got addresses like someone@α

Some say car plates are uniformly IA5String, but wait until you see "京C-888888".

The general wisdom is that anything with real-world meaning can't stay IA5String, so we are left with data identifiers, like the keys used in <mapping>.

One reason to ban IA5String is that it doesn't provide any advantage. TelephoneNumber, for example, helps index phone numbers by country code and region; NumericString provides range checks for queries and filters. What does IA5String provide?

In that case, how do we handle an attribute with origin as="utf8" and syntax="" (NumericString)?

where the value from the origin is:

A) "abc 123"
B) "123 abc"
C) "12abc3"

C) is often nil/invalid: doing the equivalent of Int("12abc3") fails in many languages. If we apply 3', then it sounds like the attribute, after applying the syntax, should be 123.
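For comparison, a Python sketch of the two behaviours (Python's `int()` plays the role of the hypothetical `Int()` above):

```python
# Strict parsing: any non-digit contamination makes the whole value invalid.
def to_number(raw: str):
    try:
        return int(raw)
    except ValueError:
        return None  # the nil/invalid case

for raw in ("abc 123", "123 abc", "12abc3"):
    print(raw, "->", to_number(raw))  # all three -> None

# Option 3' applied to a NumericString would instead keep only the digits:
digits = "".join(ch for ch in "12abc3" if ch in "0123456789")
print(int(digits))  # 123
```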

It feels like the outputs for (A), (B), and (C) should all be nil or 0. That is a bit inconsistent in how syntax is applied, but more natural. What do you think?

You know my default answer to such questions :slight_smile:

I agree that it should be nil (in the context of TokenScript, a nil value leads to the next candidate being taken, so the scripter can supply a constant value for that case, or choose to call another smart-contract function).

When a non-digit is mixed into a NumericString, it's always an error. But when an ideograph is mixed into an IA5String, programmers might think it is not an error and that the data engine should cope with it?

You know my default answer to such questions :slight_smile:

Drop the support for IA5String? My only concern is hardware wallets. Do you think it's okay to force hardware wallets to handle Unicode? (I am inclined to think so as well.)

When confirming a signed message, the wallet renders it as hex if it doesn't know the underlying syntax, which means the user can't tell what they are signing. But if the underlying syntax is IA5String and the value contains Unicode, the hardware wallet will either have to revert to hex or throw up on the floor. By removing IA5String (it is also used to generate signed messages!) we force hardware wallets to support Unicode.

I guess we might have to add IA5String back then :slightly_smiling_face:. That sounds like a valid reason for it. I doubt we'll be able to dictate which encodings hardware wallets support. In any case, even if we can, we still want to support what is already in the market. And if they can only do ASCII, then we'll have to send only that for signing.