[FIXED] Recognition of Russian texts in plain files #1423

Closed
opened 6 years ago by sergeevabc · 25 comments
sergeevabc commented 6 years ago (Migrated from github.com)

Windows 7 x64, Pale Moon 27.5.1

  1. Download test.txt (SHA1: 80470ac9d191fbf76eadd89b5efb0c47d159cdab)
  2. Open file:///C:/test.txt in the browser
  3. Result: garbled text (Pale Moon believes the encoding is Cyrillic ISO), expected: readable text (Unicode)

Text editors such as Notepad++ and Sublime Text recognize the encoding of that file correctly, if that matters.

Edit: added the hash to verify the file is unaltered.

test.txt: https://www.upload.ee/files/7587066/test.txt.html
garbled screenshot: https://images.vfl.ru/ii/1508685144/e714d6c9/19101083.png
readable screenshot: https://images.vfl.ru/ii/1508685145/df65c17a/19101084.png
janekptacijarabaci commented 6 years ago (Migrated from github.com)

I suggest trying:
Type: text/plain; charset=utf-8

Notepad++, etc., see also: https://stackoverflow.com/questions/7256049/how-do-i-convert-an-ansi-encoded-file-to-utf-8-with-notepad

Martii commented 6 years ago (Migrated from github.com)

Interesting (Linux):

  • [screenshot: browser tests]

Seems something is different between WebKit and Gecko/Goanna.

Martii commented 6 years ago (Migrated from github.com)

Notepad++ detects the text file (SHA1 confirmed) as UTF-8 by default here; when the encoding is set to Windows-1251, the text turns into gibberish that differs from what the browsers show.

So I'm guessing that text/plain opened via file drop or the file: scheme may need to default to the UTF-8 character set? The equivalent of "sniffing" for the MIME type could need improvement too.

wolfbeast commented 6 years ago (Migrated from github.com)

If there is no signature in the text file indicating the code page, and it's opened locally without an available content-type, there is no way for the browser to know what code page or encoding to use. Plain text is assumed to be in one of the available ISO code pages rather than UTF-8, for general usage reasons.

This issue is therefore the question what to use as default for locally-opened files or files that do not have any indication of encoding. I'm assuming Webkit defaults to UTF-8 and doesn't do any sniffing at all.

Of note, you can change the displayed encoding on the fly from the Web Developer menu, if it is displayed incorrectly.
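
To illustrate the "signature" point: the most common in-file signature is a byte order mark (BOM) at the very start of the file. The following is a minimal sketch of BOM sniffing, assuming Python's standard `codecs` constants; the function name `sniff_bom` is made up for illustration, and this is not Pale Moon's detection code.

```python
import codecs

# Known byte order marks. UTF-32 LE is listed before UTF-16 LE because
# its BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# A file that starts directly with text carries no signature at all.
print(sniff_bom(b"Subject: \xd0\x91"))        # no BOM present
print(sniff_bom(b"\xef\xbb\xbfSubject: "))    # UTF-8 BOM present
```

A file without a BOM returns `None` here, which is the situation described above: the reader has to fall back on some default or heuristic.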

sergeevabc commented 6 years ago (Migrated from github.com)

@wolfbeast, why do you say the file has no indication of encoding?

$ file test.txt
test.txt: UTF-8 Unicode text, with CRLF line terminators
Martii commented 6 years ago (Migrated from github.com)

@sergeevabc
I did a hex dump of the file and there's nothing out of the ordinary at the beginning and end... so I assume $ file presumes UTF-8 or whatever the environment defaults to... in my case my Linux is UTF-8 as far as I know... and it's a Russian-based distro.

EDIT:

$ file test.txt
test.txt: UTF-8 Unicode text, with CRLF line terminators
$ iconv -f UTF-8 -t WINDOWS-1251 test.txt > test-windows-1251.txt 
$ file test-windows-1251.txt 
test-windows-1251.txt: Non-ISO extended-ASCII text, with CRLF line terminators

Guess this doesn't make a huge bit of difference on the hex side... just not detected as UTF-8.
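
The same experiment can be sketched in Python. This is only an illustration under assumptions: the sample string is made up (it is not the content of test.txt), and `is_valid_utf8` is a hypothetical helper, not what `file` does internally. It shows why the re-encoded copy is no longer reported as UTF-8: Windows-1251 bytes for Cyrillic text generally do not form valid UTF-8 sequences.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Made-up sample text with CRLF line terminators.
sample = "Привет, мир\r\n"

as_utf8 = sample.encode("utf-8")
# Same conversion as: iconv -f UTF-8 -t WINDOWS-1251
as_cp1251 = sample.encode("cp1251")

print(is_valid_utf8(as_utf8))     # the original bytes pass the check
print(is_valid_utf8(as_cp1251))   # the re-encoded bytes do not
```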

@wolfbeast

> This issue is therefore the question what to use as default for locally-opened files or files that do not have any indication of encoding.

Well there's always the looming UTF-16 floating around too... so not entirely sure.

> I'm assuming Webkit defaults to UTF-8 and doesn't do any sniffing at all.

Former is probably accurate. Latter would depend on BOM checking and how detailed their charset detector routine is.

> Of note, you can change the displayed encoding on the fly...

Always forget that's there. :) Thanks.

wolfbeast commented 6 years ago (Migrated from github.com)

UTF-16 never got much following and is hardly ever used. UTF-8 provides more than enough character space for any localization, and it's transparently compatible with low-bit ISO characters (UTF-16 is not). It's safe to assume that UTF-8 is the future.
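
The compatibility claim above can be checked with a short sketch, assuming "low-bit ISO characters" means the 7-bit ASCII range. The variable names are only illustrative.

```python
ascii_text = "Subject: "

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
utf16_bytes = ascii_text.encode("utf-16-le")

# UTF-8 encodes 7-bit characters as the same single bytes as ASCII.
print(utf8_bytes == ascii_bytes)

# UTF-16 uses two bytes per character here, so an ASCII-only reader
# sees interleaved NUL bytes instead of the plain text.
print(utf16_bytes == ascii_bytes)
print(len(ascii_bytes), len(utf16_bytes))
```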

I also don't think your *nix system locale matters. The problem seems to be that Mozilla in their attempt to "protect users from themselves" have precluded UTF-8 as a fallback user-override, or you would have been able to set this default.

I'll be taking this issue. It's a simple change; the question remains though if we should override the fallback to UTF-8 by default or not.

Martii commented 6 years ago (Migrated from github.com)

> It's a simple change; the question remains though if we should override the fallback to UTF-8 by default or not.

We coerce to UTF-8 on output from OUJS in our Buffers regardless of input type, although we store in the DB whatever character set it is (i.e. unscathed)... if that helps. :)

wolfbeast commented 6 years ago (Migrated from github.com)

I have no idea what you just said. XD
I'll keep "auto" the default.

Also, if you want your browser locale to match your OS locale by default, you need to enable that; otherwise your OS system locale has 0 influence on the browser locale (this is by design)

sergeevabc commented 6 years ago (Migrated from github.com)

@Martii, there is no ef bb bf BOM signature here, but go further and check for high-bit characters.

$ hexdump -Cv -n 32 test.txt
00000000  53 75 62 6a 65 63 74 3a  20 d0 91 d0 b5 d0 b7 d0  |Subject: БезР|
00000010  be d0 bf d0 b0 d1 81 d0  bd d0 be d0 b5 20 d0 b7  |ѕРїР°СЃРЅРѕРµ Р·|
BOM signature: http://www.garykessler.net/library/file_sigs.html
high-bit characters: http://chardet.readthedocs.io/en/latest/how-it-works.html#multi-byte-encodings
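
As a sketch of what those 32 bytes mean, the snippet below decodes the exact bytes from the hexdump above in two ways: as UTF-8, and as ISO-8859-5 (assumed here to be what the report calls "Cyrillic ISO"). Only standard Python codecs are used; the variable names are illustrative.

```python
# First 32 bytes of test.txt, copied from the hexdump above.
raw = bytes.fromhex(
    "53 75 62 6a 65 63 74 3a 20 d0 91 d0 b5 d0 b7 d0"
    " be d0 bf d0 b0 d1 81 d0 bd d0 be d0 b5 20 d0 b7"
)

# Strict UTF-8 decoding succeeds and yields readable Russian text.
as_utf8 = raw.decode("utf-8")

# ISO-8859-5 maps every byte to one character, so decoding never fails,
# but each two-byte UTF-8 sequence turns into two unrelated characters.
as_iso = raw.decode("iso8859_5")

print(as_utf8)
print(repr(as_iso))
```

The UTF-8 reading is `Subject: Безопасное з`, which matches the "expected: readable text" in the original report.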
Martii commented 6 years ago (Migrated from github.com)

@wolfbeast
LOL if someone uploads a character set that isn't UTF-8 we store that as-is and then when we output the source it's always UTF-8. ;)

Anyhow... you know your project best... this issue ticket caught my interest in passing. Sorry if I'm roaming around the issue.

> I'll keep "auto" the default.

Western is what was picked here. And as I mentioned, I forgot that option was even there, so it's probably the default here.


> ... UTF-16 ...

As a side note, Node, our web-serving platform, which is V8-based (WebKit in other words), manipulates strings (regular expressions, etc.) in UTF-16, which is why we always have to serve UTF-8, besides userscript engines needing it that way.

wolfbeast commented 6 years ago (Migrated from github.com)

@Martii I'm pretty sure I don't know who your "we" is here. This has nothing to do with database storage or retrieval or your particular individual workflow around it.

@sergeevabc checking for high-bit characters isn't going to help much, since other code pages and even Western use high-bit characters as valid characters.

What I'll do is add "UTF-8" as a default fallback override option in Options -> Content -> Fonts and colors -> Advanced. I wouldn't want to override to UTF-8 by default since then you may end up with issues on the other side of the spectrum where UTF-8 is incorrect.
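
On the high-bit point above, a small sketch (with made-up sample strings and a hypothetical helper name) shows why the mere presence of bytes at or above 0x80 does not identify an encoding: text in UTF-8, Windows-1251, and a Western code page all contain such bytes.

```python
def has_high_bit(data: bytes) -> bool:
    """Return True if any byte has its top bit set (value >= 0x80)."""
    return any(b >= 0x80 for b in data)


# Made-up samples in three different encodings.
samples = {
    "utf-8": "Без".encode("utf-8"),
    "cp1251": "Без".encode("cp1251"),
    "latin-1": "café".encode("latin-1"),
}

for name, data in samples.items():
    # All three report high-bit bytes, so this test alone cannot
    # tell the encodings apart.
    print(name, has_high_bit(data))

# Pure ASCII is the only case the check rules out.
print(has_high_bit(b"Subject: "))
```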

wolfbeast commented 6 years ago (Migrated from github.com)

A better solution has been found:

  • Add UTF-8 as a selectable fallback override. 0792b6d75c
  • Use UTF-8 as a last-effort fallback when other encoding hints aren't available. This will still respect browser locale and overrides the user may have set. f137cc4b82
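
The idea of a last-effort fallback can be sketched as a simple priority chain. This is an assumption-laden illustration, not the code from the commits above: the function name, the parameter names, and the exact order of the hints are all hypothetical.

```python
from typing import Optional


def pick_encoding(
    bom: Optional[str],
    declared: Optional[str],
    user_override: Optional[str],
    locale_fallback: Optional[str],
) -> str:
    """Return the first available encoding hint, else UTF-8.

    The hint order is an assumption made for this sketch only.
    """
    for candidate in (bom, declared, user_override, locale_fallback):
        if candidate:
            return candidate
    # Nothing else is known about the file: use UTF-8 as the last resort.
    return "utf-8"


# A local text file with no BOM, no content-type, and no user settings.
print(pick_encoding(None, None, None, None))

# A user-set fallback override still wins over the last-resort default.
print(pick_encoding(None, None, "windows-1251", None))
```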
Martii commented 6 years ago (Migrated from github.com)

@wolfbeast
"We" usually refers to the collaborators and owners of a particular project or type of project, i.e. not just myself.

> This has nothing to do with database storage or retrieval or your particular individual workflow around it.

It theoretically could if a host output UTF-16; however, you picked up on my exact hints on how to fix it, given in plain English rather than code. Falling back to UTF-8 and allowing that default is what I would have done and currently already do. Our projects could have supported UTF-16 but we don't. :)

I.e. the extra pair of fingers typing here helped you figure out a good solution to your issue, whether consciously or unconsciously. Glad you understood the issue presented by @sergeevabc. :)


Quote for today:
"There is nothing noble in being superior to your fellow man; true nobility is being superior to your former self."

wolfbeast commented 6 years ago (Migrated from github.com)

@Martii 🤣 I was merely saying I don't know who your "we" is representing; clearly it's not our "we" for Pale Moon.

sergeevabc commented 6 years ago (Migrated from github.com)

@wolfbeast, I can confirm that it works as expected on 27.6 with a fresh profile, but the output in my current profile is still garbled. The update was done via the zip package. Should I change something in about:config?

Martii commented 6 years ago (Migrated from github.com)

@sergeevabc

Click Advanced button then:

[screenshot: UTF-8 Fallback Character Encoding]
Martii commented 6 years ago (Migrated from github.com)

Hmmm that is still garbled when selecting UTF-8 in my existing profile too. :\

Martii commented 6 years ago (Migrated from github.com)

Ahh pick Fonts for: Cyrillic as well... but still garbled in existing profile.

Martii commented 6 years ago (Migrated from github.com)

@sergeevabc
about:config?filter=intl.charset.fallback.override ... present in both dirty and clean profile.
about:config?filter=font.language.group ... missing in existing profile but present in clean. Scratch that... bug in about:config with this... locked up... is present.

--safe-mode ... no difference in dirty profile, i.e. still garbled. Weird.

Martii commented 6 years ago (Migrated from github.com)

@sergeevabc

Sorry for all these posts.

This worked for me: resetting about:config?filter=intl.charset.detector 👍


For completeness reset about:config?filter=intl.fallbackCharsetList.ISO-8859-1 as well... but doesn't seem to affect it.

wolfbeast commented 6 years ago (Migrated from github.com)

Thanks for verifying the fix.

FTR: running in safe mode does not magically reset all your preferences or ignore user-set ones.

Martii commented 6 years ago (Migrated from github.com)

Testers always try everything. ;) :)

sergeevabc commented 6 years ago (Migrated from github.com)

@wolfbeast, I saw you added this fix to Basilisk codebase, would you mind adding it to Fossamail as well?

wolfbeast commented 6 years ago (Migrated from github.com)

FossaMail's unofficial 38.6.0a1 includes the fallback to UTF-8 from the platform-pickup. The only thing not included is the manual override choice for UTF-8 in FossaMail's front-end.

Reference: MoonchildProductions/Pale-Moon#1423