[FIXED] Recognition of Russian texts in plain files
#1423
Closed
opened 6 years ago by sergeevabc
25 comments
Reference: MoonchildProductions/Pale-Moon#1423
7x64, 27.5.1
Text editors such as Notepad++ and Sublime Text recognize the encoding of that file correctly, if that matters.
Edit: added the hash to verify the file is unaltered.
I suggest to try:
Type: text/plain; charset=utf-8
Notepad++, etc., see also: https://stackoverflow.com/questions/7256049/how-do-i-convert-an-ansi-encoded-file-to-utf-8-with-notepad
Interesting (Linux):
Seems something is different between WebKit and Gecko/Goanna.
Notepad++ claims the text file (sha1 confirmed) is UTF-8 encoded by default here, and when the encoding is set to Windows-1251 it turns into gibberish different from what the browsers show.
So I'm guessing that text/plain on the file drop/open scheme may need the UTF-8 character set? The equivalent of "sniffing" for the MIME type could need improvement too.
If there is no signature in the text file indicating the code page, and it's opened locally without an available content-type, there is no way for the browser to know what code page or encoding to use. Plain text is assumed to be in one of the available ISO code pages and does not default to UTF-8, for general usage reasons.
This issue is therefore a question of what to use as the default for locally-opened files or files that do not have any indication of encoding. I'm assuming WebKit defaults to UTF-8 and doesn't do any sniffing at all.
Of note, you can change the displayed encoding on the fly from the Web Developer menu, if it is displayed incorrectly.
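As a rough illustration of the point above (no signature and no content-type means the decoder has to fall back on a default), here is a minimal Python sketch. It is not the Pale Moon code; the helper names `sniff_bom` and `decode_local_text` and the `cp1252` default are assumptions made for the example, and only the UTF-8 and UTF-16 signatures are checked.

```python
import codecs

# Hypothetical helpers for illustration only; not taken from the browser source.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if the data starts with a known BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_local_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if present, otherwise the configured fallback."""
    name, skip = sniff_bom(data)
    if name is None:
        # No signature and no content-type: all that is left is a default.
        name = fallback
    return data[skip:].decode(name, errors="replace")


sample = "Привет".encode("utf-8")                   # UTF-8 without a BOM
print(decode_local_text(sample))                    # garbled under the Western default
print(decode_local_text(sample, fallback="utf-8"))  # readable with a UTF-8 fallback
```

With a Western fallback the BOM-less file comes out garbled, which matches the report; switching the fallback to UTF-8 makes the same bytes readable.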
@wolfbeast, why do you say the file has no indication of encoding?
@sergeevabc
I did a hex dump of the file and there's nothing out of the ordinary at begin and end... so I assume $ file presumes UTF-8 or whatever the environment defaults to... in my case my Linux is UTF-8 as far as I know... and it's a Russian based distro.
EDIT:
Guess this doesn't make a huge bit of difference on the hex side... just not detected as UTF-8.
@wolfbeast
Well there's always the looming UTF-16 floating around too... so not entirely sure.
Former is probably accurate. Latter would depend on BOM checking and how detailed their charset detector routine is.
Always forget that's there. :) Thanks.
UTF-16 never got much following and is hardly ever used. UTF-8 provides more than enough character space for any localization, and it's transparently compatible with low-bit ISO characters (UTF-16 is not). It's safe to assume that UTF-8 is the future.
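The compatibility claim can be checked in a few lines of Python. This is only a sketch of the byte-level difference, using an arbitrary sample string: ASCII text is byte-for-byte identical in UTF-8, while UTF-16 pads every such character to two bytes.

```python
text = "plain ASCII text"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # explicit byte order, so no BOM is added

# UTF-8 leaves low-bit (ASCII) characters untouched.
print(utf8_bytes == ascii_bytes)

# UTF-16 uses two bytes per character here, so ASCII-only tools misread it.
print(len(ascii_bytes), len(utf16_bytes))
print(utf16_bytes[:6])
```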
I also don't think your *nix system locale matters. The problem seems to be that Mozilla in their attempt to "protect users from themselves" have precluded UTF-8 as a fallback user-override, or you would have been able to set this default.
I'll be taking this issue. It's a simple change; the question remains though if we should override the fallback to UTF-8 by default or not.
We coerce to UTF-8 on output from OUJS in our Buffers regardless of input type although we store in the DB whatever character set it is (e.g. unscathed)... if that helps. :)
I have no idea what you just said. XD
I'll keep "auto" the default.
Also, if you want your browser locale to match your OS locale by default, you need to enable that; otherwise your OS system locale has 0 influence on the browser locale (this is by design).
@Martii, there is no ef bb bf BOM signature here, but go further and check for high-bit characters.
@wolfbeast
LOL if someone uploads a character set that isn't UTF-8 we store that as-is and then when we output the source it's always UTF-8. ;)
Anyhow... you know your project best... this issue ticket caught my interest in passing. Sorry if I'm roaming around the issue.
Western is what was picked here. And as I mentioned I forgot that option was even there so it's probably the default here.
As a side note, Node, our web serving platform, which is V8 based (WebKit in other words), manipulates string tests (regular expression, etc.) in UTF-16, which is why we always have to serve UTF-8, besides Userscript engines needing it that way.
@Martii I'm pretty sure I don't know who your "we" is here. This has nothing to do with database storage or retrieval or your particular individual workflow around it.
@sergeevabc checking for high-bit characters isn't going to help much, since other code pages and even Western use high-bit characters as valid characters.
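To make the high-bit argument concrete, here is a small Python sketch (the helper names `has_high_bit` and `is_valid_utf8` are made up for the example). The mere presence of high-bit bytes is the same for UTF-8 and Windows-1251 text, and a legacy decoder accepts UTF-8 bytes without complaint. Strict UTF-8 validation is a different, stronger signal, shown here only for contrast and not as the fix that was eventually committed.

```python
def has_high_bit(data: bytes) -> bool:
    """True if any byte is outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


word = "Привет"
utf8_bytes = word.encode("utf-8")
cp1251_bytes = word.encode("cp1251")

# Both encodings contain high-bit bytes, so that check alone decides nothing.
print(has_high_bit(utf8_bytes), has_high_bit(cp1251_bytes))

# Legacy code pages happily decode the UTF-8 bytes, just into the wrong text.
as_cp1251 = utf8_bytes.decode("cp1251")
as_latin1 = utf8_bytes.decode("latin-1")
print(as_cp1251)

# Well-formedness does tell the two apart for this sample.
print(is_valid_utf8(utf8_bytes), is_valid_utf8(cp1251_bytes))
```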
What I'll do is add "UTF-8" as a default fallback override option in Options -> Content -> Fonts and colors -> Advanced. I wouldn't want to override to UTF-8 by default since then you may end up with issues on the other side of the spectrum where UTF-8 is incorrect.
A better solution has been found:
0792b6d75c
f137cc4b82
@wolfbeast
We is usually a term of collaborators and owners of a particular project or type of project e.g. not just myself.
It theoretically could if a host outputted UTF-16; however, you picked up on my exact notices on how to fix it in plain English rather than code. Falling back to UTF-8 and allowing that default is what I would have done and currently already do. Our projects could have supported UTF-16 but we don't. :)
e.g. the extra pair of fingers typing here helped you figure out a good solution to your issue, whether consciously or unconsciously. Glad you understood the issue presented by @sergeevabc. :)
Quote for today:
"There is nothing noble in being superior to your fellow man; true nobility is being superior to your former self."
@Martii 🤣 I was merely saying I don't know who your "we" is representing; clearly it's not our "we" for Pale Moon.
@wolfbeast, I can confirm that it works as expected on 27.6 with a fresh profile, but the output of current profile is still garbled. Update was done via zip package. Should I change something in about:config?
@sergeevabc
Click Advanced button then:
Hmmm that is still garbled when selecting UTF-8 in my existing profile too. :\
Ahh pick Fonts for: Cyrillic as well... but still garbled in existing profile.
@sergeevabc
about:config?filter=intl.charset.fallback.override ... present in both dirty and clean profile.
about:config?filter=font.language.group ... missing in existing profile but present in clean.
Scratch that... bug in about:config with this... locked up... is present.
--safe-mode ... no difference in dirty profile. e.g. still garbled. Weird.
@sergeevabc
Sorry for all these posts.
This worked for me resetting about:config?filter=intl.charset.detector 👍
For completeness reset about:config?filter=intl.fallbackCharsetList.ISO-8859-1 as well... but doesn't seem to affect it.
Thanks for verifying the fix.
FTR: running in safe mode does not magically reset all your preferences or ignore user-set ones.
Testers always try everything. ;) :)
@wolfbeast, I saw you added this fix to Basilisk codebase, would you mind adding it to Fossamail as well?
FossaMail's unofficial 38.6.0a1 includes the fallback to UTF-8 from the platform-pickup. The only thing not included is the manual override choice for UTF-8 in FossaMail's front-end.