API response encoding

freelancer
Posts: 17
Joined: Fri Jun 29, 2018 4:53 pm
Location: Sweden

API response encoding

Post by freelancer »

I've noticed that the overview field in the API response for some games contains characters that are not printable. The same games return correct, printable characters when using the old/legacy API. A few examples:

Halo 2 (game ID 9)
New API: In Halo 2, the saga continues as Master Chief\u0097a genetically enhanced super-soldier\u0097is the only thing standing between the relentless Covenant and the destruction of all humankind.
Old API: In Halo 2, the saga continues as Master Chief—a genetically enhanced super-soldier—is the only thing standing between the relentless Covenant and the destruction of all humankind.

Left 4 Dead (game ID 22)
New API: L4D's survival co-op mode lets you blast a path through the infected in four unique \u0093movies,\u0094
Old API: L4D's survival co-op mode lets you blast a path through the infected in four unique “movies,”

The Legend of Zelda (game ID 113)
New API: \u0095 Explore the vast Overworld terrain of the land of Hyrule and discover hidden treasures
Old API: • Explore the vast Overworld terrain of the land of Hyrule and discover hidden treasures

Tom Clancy's Rainbow Six 3: Raven Shield (game ID 4441)
New API: Usta\u009ae regime
Old API: Ustaše regime

The old API responses were properly UTF-8 encoded. Is this not the case with the new API? If not, is that by choice or by accident?

GamersDatabase
Posts: 20
Joined: Wed Jul 04, 2018 2:48 pm

Re: API response encoding

Post by GamersDatabase »

It's intended. See this post about encoding:

viewtopic.php?f=5&t=115


Re: API response encoding

Post by freelancer »

GamersDatabase wrote:
Thu Aug 09, 2018 9:27 pm
It's intended. See this post about encoding:

viewtopic.php?f=5&t=115
That post talks about what I assume is HTML entity encoding, which is not the same thing as the character encoding of a document. Entity encoding turns characters not allowed in HTML into HTML entities (for example, it turns < into &lt;).

What I'm seeing in the API responses is that the same character is returned as different code points in the old vs new API. For example, an em dash is returned as \u2014 in the old API, which is the correct unicode character for an em dash. In the new API however, it's returned as \u0097, which is an unprintable control character in unicode. That leads me to believe that the new API responses are not using a unicode character encoding. I need to know if that's unintentional or intentional (and if so, which character encoding I should interpret it as).
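To make the difference concrete, here is a minimal Python sketch that decodes the two variants of the Halo 2 fragment and inspects the character after "Master Chief". The old API returned XML, so writing its value as a JSON escape here is only an assumption made for comparison; `describe` is a hypothetical helper, not part of either API.

```python
import json
import unicodedata

# Fragments based on the Halo 2 example in this thread.
# The old-API value is written as a JSON escape purely for comparison.
new_api = json.loads('"Master Chief\\u0097a genetically enhanced super-soldier"')
old_api = json.loads('"Master Chief\\u2014a genetically enhanced super-soldier"')


def describe(ch):
    """Return (code point, Unicode general category, name or placeholder)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unnamed>"),
    )


# "Master Chief" is 12 characters long, so index 12 is the dash position.
new_dash = describe(new_api[12])
old_dash = describe(old_api[12])

print(new_dash)
print(old_dash)
```

Both strings are valid Unicode once decoded; the difference is that the new API yields a control character (category `Cc`) where the old one yields a dash punctuation character (category `Pd`).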

The old API was explicitly returning UTF-8 (it had <?xml version="1.0" encoding="UTF-8" ?> at the top of the document), but the new API doesn't seem to be returning 100% valid UTF-8 data.

Or maybe I'm misunderstanding what the linked post is saying?

Zer0xFF
Posts: 330
Joined: Fri Apr 20, 2018 9:18 am

Re: API response encoding

Post by Zer0xFF »

[mention]freelancer[/mention] thanks for pointing that out.

I've quickly looked into the issue, and I believe it was caused by the migration from latin1 to utf8. To confirm that, I found this table, https://www.i18nqa.com/debug/table-iso8 ... -1252.html, which states that latin1 \u0097 corresponds to \u2014 in utf8.
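The mapping from that table can be reproduced with Python's standard codecs. This is only a sketch of the likely cause, assuming the original text was stored as Windows-1252 bytes but read as ISO-8859-1 during the migration; `SAMPLES` and `decode_both` are names made up for illustration.

```python
# Byte values behind the characters quoted in this thread
# (curly quotes, bullet, dash, s with caron).
SAMPLES = [0x93, 0x94, 0x95, 0x97, 0x9A]


def decode_both(byte_value):
    """Decode one byte as ISO-8859-1 and as Windows-1252."""
    raw = bytes([byte_value])
    return raw.decode("latin-1"), raw.decode("cp1252")


for b in SAMPLES:
    as_latin1, as_cp1252 = decode_both(b)
    print(
        f"0x{b:02X}: latin-1 -> U+{ord(as_latin1):04X}, "
        f"cp1252 -> U+{ord(as_cp1252):04X}"
    )
```

Read as ISO-8859-1, each byte lands on the C1 control code point with the same number; read as Windows-1252, it lands on the printable character the old API returned.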

ummm.... fingers crossed just getting the data from the old server could fix this, else it might require manual review :/

[mention]GamersDatabase[/mention] the json encoding is intended, however the invalid character set is not, and it's a side effect from migrating data and would need to be fixed.
Regards
Zer0xFF

edirol
Posts: 745
Joined: Thu Jun 28, 2018 1:08 am

Re: API response encoding

Post by edirol »

I've found incorrect chars in some game names and already fixed them.
[mention]Zer0xFF[/mention] According to my data there are about 2.7k games with incorrect chars in overview. Fixing them manually would be difficult.


Re: API response encoding

Post by Zer0xFF »

[mention]edirol[/mention] the issue is that any game that was edited can't be restored from the old database, as we might remove any new additions.
Regards
Zer0xFF


Re: API response encoding

Post by edirol »

[mention]Zer0xFF[/mention] Is it possible to just replace the incorrect chars with the correct ones, in accordance with the table you gave above?


Re: API response encoding

Post by freelancer »

[mention]Zer0xFF[/mention] Thanks for confirming. And yeah, I'd concur with the estimate of about 2700 games being affected. The best course of action would probably be to write a small script to go through your database and replace the invalid characters, at least for games you can't fetch from the old database.

I've fixed the obvious ones in my own database and stopped updates for now, to make sure my users don't get bad data.


Re: API response encoding

Post by Zer0xFF »

edirol wrote:
Fri Aug 10, 2018 9:26 am
@Zer0xFF Is it possible to just replace incorrect chars with correct in accordance with the table you gave above?
Only if I can identify them as invalid. The issue is that those are valid utf8 "control" characters, whatever that means.

https://www.utf8-chartable.de/unicode-u ... l?utf8=dec

If anyone has a good idea how I'd go about it, please do let me know.
I've tried casting the value to binary in MySQL (to avoid encoding) and then converting it to latin1, after which it gets converted to the connection default, aka utf8. This somewhat worked: the missing characters showed up, but it introduced a second invalid character before them.
Regards
Zer0xFF
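One possible explanation for the extra character, sketched in Python rather than MySQL. It assumes the stored value holds the code point U+0097, and that MySQL's latin1 behaves like Windows-1252 (as far as I know it does, but treat that as an assumption). Casting to binary exposes the two UTF-8 bytes of U+0097, and each byte then becomes its own character.

```python
# Stored value as the new API returns it: U+0097 instead of the dash.
stored = "Chief\u0097a"

# Casting to binary exposes the UTF-8 bytes; U+0097 is two bytes, C2 97.
utf8_bytes = stored.encode("utf-8")

# Reading those bytes one by one as cp1252 turns C2 into U+00C2 (A with
# circumflex) and 97 into U+2014, so the dash shows up with an extra
# character in front of it.
reinterpreted = utf8_bytes.decode("cp1252")

# Going through the single byte 0x97 instead avoids the stray character.
repaired = stored.encode("latin-1").decode("cp1252")

print(utf8_bytes)
print([f"U+{ord(c):04X}" for c in reinterpreted])
print([f"U+{ord(c):04X}" for c in repaired])
```

If this matches what happened, the conversion has to start from the single-byte form of the character, not from its UTF-8 bytes.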


Re: API response encoding

Post by freelancer »

[mention]Zer0xFF[/mention] As far as I can tell the problematic characters are all in the C1 control range (0080-009F). I did some googling which seems to support this, as this range is sometimes used for printable characters in ISO-8859 derivatives (even though that's technically not allowed, but why would that stop anyone...) and can cause some confusion when the text is converted to unicode. C1 control characters are rarely (if ever) used in plain text, and I'm 99% sure they shouldn't exist in the text fields of your database.

Based on this, I would say it's safe to replace all characters in that range (0080-009F) with the correct unicode codepoints. Write a small script that goes through every game in the database, does a string replace on every text field replacing those characters, and saves the modified data back to the database.

Relevant wiki page: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
Also found this which might be helpful: https://gist.github.com/epheatt/1697194
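A rough sketch of the replacement step in Python, assuming the C1 characters are really Windows-1252 bytes that were misread as ISO-8859-1. `repair_c1` and `C1_RANGE` are hypothetical names, and reading from and writing back to the database is left out. Five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) have no mapping in Python's cp1252 codec, so the sketch leaves those characters untouched for manual review.

```python
import re

# Every character in the C1 control range, U+0080 to U+009F.
C1_RANGE = re.compile(r"[\u0080-\u009f]")


def _fix(match):
    """Map one C1 character to its Windows-1252 counterpart, if any."""
    ch = match.group(0)
    try:
        # Back to the single byte, then decode it as Windows-1252.
        return ch.encode("latin-1").decode("cp1252")
    except UnicodeDecodeError:
        # No cp1252 mapping for this byte: keep it for manual review.
        return ch


def repair_c1(text):
    """Replace C1 control characters with their cp1252 equivalents."""
    return C1_RANGE.sub(_fix, text)


if __name__ == "__main__":
    samples = [
        "Master Chief\u0097a genetically enhanced super-soldier",
        "four unique \u0093movies,\u0094",
        "\u0095 Explore the vast Overworld terrain",
        "Usta\u009ae regime",
    ]
    for sample in samples:
        print(repair_c1(sample))
```

Only characters in the C1 range are touched, so text that is already correct, including other non-ASCII characters, passes through unchanged.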
