#1695 80-100% constant CPU load after waking up from sleep

Closed
opened 2 months ago by hiiamboris · 47 comments

See steps to reproduce

Although this involves a certain addon, according to the guys who researched it, this is a bug in UXP. And they have developed a patch. Would be nice if it got merged (;

See [steps to reproduce](https://github.com/JustOff/github-wc-polyfill/issues/10?notification_referrer_id=MDE4Ok5vdGlmaWNhdGlvblRocmVhZDEzNjM5MDExMjM6MjUzMDcwNDk%3D&notifications_query=is%3Aunread#issuecomment-738881168) Although this involves a certain addon, according to the guys who researched it, this is a bug in UXP. And they have developed a patch. Would be nice if it got merged (;
Moonchild commented 2 months ago
Owner

If research was already done and analysis made of the code, then it would be great if that information could be included in this issue, otherwise we can’t possibly evaluate what the problem is or any specific patch that might apply.

Which “certain add-on” is involved?
Who are the “guys who researched it”?
What is the “bug in UXP”, exactly?
What is the patch, and how does it address the bug?

If research was already done and analysis made of the code, then it would be great if that information could be included in this issue, otherwise we can't possibly evaluate what the problem is or any specific patch that might apply. Which "certain add-on" is involved? Who are the "guys who researched it"? What is the "bug in UXP", exactly? What is the patch, and how does it address the bug?
g4jc commented 2 months ago

Took some time to dig into this...

Which “certain add-on” is involved?

JustOff’s github-wc-polyfill.

Who are the “guys who researched it”?

Per thread, JustOff and various users.

What is the “bug in UXP”, exactly?
What is the patch, and how does it address the bug?

Proposed work around patch in that thread:


fixes CPU usage going to 100% whenever PC is resumed from standby on certain
sites like github.com.
the underlying cause is that some (probably XHR) connection is interrupted
by the wake-up but not registered as such in the browser code, and
HasPendingEvent always returns true for that specific fd (which is a pipe).

--- platform/netwerk/base/nsSocketTransportService2.cpp
+++ platform/netwerk/base/nsSocketTransportService2.cpp
@@ -441,7 +441,7 @@
         mPollList[0].out_flags = 0;
         pollList = mPollList;
         pollCount = mActiveCount + 1;
-        pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout();
+        pollTimeout = PR_MillisecondsToInterval(100);
     }
     else {
         // no pollable event, so busy wait...
Took some time to dig into this... > Which “certain add-on” is involved? JustOff's github-wc-polyfill. > Who are the “guys who researched it”? Per thread, JustOff and various users. > What is the “bug in UXP”, exactly? > What is the patch, and how does it address the bug? Proposed work around patch in that thread: ``` fixes CPU usage going to 100% whenever PC is resumed from standby on certain sites like github.com. the underlying cause is that some (probably XHR) connection is interrupted by the wake-up but not registered as such in the browser code, and HasPendingEvent always returns true for that specific fd (which is a pipe). --- platform/netwerk/base/nsSocketTransportService2.cpp +++ platform/netwerk/base/nsSocketTransportService2.cpp @@ -441,7 +441,7 @@ mPollList[0].out_flags = 0; pollList = mPollList; pollCount = mActiveCount + 1; - pollTimeout = pendingEvents ? PR_INTERVAL_NO_WAIT : PollTimeout(); + pollTimeout = PR_MillisecondsToInterval(100); } else { // no pollable event, so busy wait... ```
Moonchild commented 2 months ago
Owner

So it doesn’t address anything but just works around it by unconditionally throttling everything. Probably needs a better solution.

Still doesn’t answer what the bug is in UXP, exactly. A lot of guessing going on there. May as well be a bug in the extension’s polyfill not accounting for interruptions.

So it doesn't address anything but just works around it by unconditionally throttling everything. Probably needs a better solution. Still doesn't answer what the bug is in UXP, exactly. A lot of guessing going on there. May as well be a bug in the extension's polyfill not accounting for interruptions.
hiiamboris commented 2 months ago
Poster

I’m speaking from no knowledge of Mozilla code at all, but just from general logic here.
If there is a mechanism (periodic socket polling or whatever) that can do and often does (looks like Mozilla has a long history of similar problems) a stupid thing (polling that socket so often that it makes no sense), then I imagine that regardless of the specific conditions that happen to break this mechanism (and which are hard(?) to track), we could at the very least improve the mechanism itself, e.g.:

  • yield thread’s execution time after a poll if that poll succeeded in less than 1ms or so (thus, even if it still does the stupid thing, it causes no harm)
  • attach a metric to this mechanism that would watch for some predefined absurdity conditions, one of which may be “received no input in a few seconds and ate a huge amount of CPU”, and intervene when those conditions are met (by killing the script or whatever makes sense)
I'm speaking from no knowledge of Mozilla code at all, but just from general logic here. If there is a mechanism (periodic socket polling or whatever) that *can* do and *often does* ([looks like Mozilla has a long history of similar problems](https://bugzilla.mozilla.org/show_bug.cgi?id=213637)) a stupid thing (polling that socket so often that it makes no sense), then I imagine that regardless of the specific conditions that happen to break this mechanism (and which are hard(?) to track), we could at the very least improve the mechanism itself, e.g.: - yield thread's execution time after a poll if that poll succeeded in less than 1ms or so (thus, even if it still does the stupid thing, it causes no harm) - attach a metric to this mechanism that would watch for some predefined absurdity conditions, one of which may be "received no input in a few seconds and ate a huge amount of CPU", and intervene when those conditions are met (by killing the script or whatever makes sense)
Moonchild commented 2 months ago
Owner

You’re absolutely right, but hardcoding it to just be 100 ms regardless of circumstance is just a dirty hack around a symptom, hence me saying “needs a better solution”.

You're absolutely right, but hardcoding it to just be 100 ms regardless of circumstance is just a dirty hack around a symptom, hence me saying "needs a better solution".
Moonchild added the
C: Networking
label 2 months ago
Moonchild commented 2 months ago
Owner

The whole poll timeout logic is rather convoluted and seems to be a little broken.
I’ll see if I can properly fix it. Taking this bug.

The whole poll timeout logic is rather convoluted and seems to be a little broken. I'll see if I can properly fix it. Taking this bug.
Moonchild commented 2 months ago
Owner

The committed patch should fix the weird logic to something more sane. Please test the result?

The committed patch _should_ fix the weird logic to something more sane. Please test the result?
hiiamboris commented 2 months ago
Poster

Sure. Can you build the patched Win32 version and dropbox it? There’s absolutely no way I’m installing MSVC lol, and can’t reproduce this bug on the Ubuntu build.

Sure. Can you build the patched Win32 version and dropbox it? There's absolutely no way I'm installing MSVC lol, and can't reproduce this bug on the Ubuntu build.
Moonchild commented 2 months ago
Owner

Sure thing. I’ll make a 32-bit Windows unstable from the patched state and dropbox it for you.

Sure thing. I'll make a 32-bit Windows unstable from the patched state and dropbox it for you.
Moonchild commented 2 months ago
Owner
https://www.dropbox.com/s/k68ubavxj16mo6l/palemoon-29.0.0a6.win32.7z?dl=0
hiiamboris commented 2 months ago
Poster

I think the archive was only 76% uploaded.

I think the archive was only 76% uploaded.
Moonchild commented 2 months ago
Owner

Hm.. try again, I made a fresh copy in dropbox.

Hm.. try again, I made a fresh copy in dropbox.
hiiamboris commented 2 months ago
Poster

That didn’t work. Still jumps to 80%+ core load.

That didn't work. Still jumps to 80%+ core load.
Moonchild commented 2 months ago
Owner

I’ll see what else I can do.

I'll see what else I can do.
Moonchild commented 2 months ago
Owner

I’ve implemented something specific to deal with the no-wait timeout situation but I’m not sure if this will work for this, as it’s unclear whether the socket handler has an inherent timeout of 0 or not. Please let me know if this try build works any better. It will still allow a number of instant-polls so there will be a spike on resume, but it should go down afterwards.

I've implemented something specific to deal with the no-wait timeout situation but I'm not sure if this will work for this, as it's unclear whether the socket handler has an inherent timeout of 0 or not. Please let me know if [this try build](https://www.dropbox.com/s/uosxsunanrqdplm/palemoon-29.0.0a6.win32-poll-try2.7z?dl=0) works any better. It will still allow a number of instant-polls so there will be a spike on resume, but it should go down afterwards.
hiiamboris commented 2 months ago
Poster

Nope, still there and doesn’t go down :(

Nope, still there and doesn't go down :(
Moonchild commented 2 months ago
Owner

I wish there was a way for me to reproduce it myself, it would be a lot faster not having to work blind :P
Anyway, try number 3 should exclude any inherent 0-length timeouts on often-polled sockets.
Please let me know if this try build avoids the CPU load.

I wish there was a way for me to reproduce it myself, it would be a lot faster not having to work blind :P Anyway, try number 3 should exclude any inherent 0-length timeouts on often-polled sockets. Please let me know if [this try build](https://www.dropbox.com/s/knnygthtzfspsnr/palemoon-29.0.0a6.win32-poll-try3.7z?dl=0) avoids the CPU load.
hiiamboris commented 2 months ago
Poster

No, didn’t solve.. dammit :D

Did you try those steps to reproduce I mentioned, or you don’t have access to a windows machine?

No, didn't solve.. dammit :D Did you try those steps to reproduce I mentioned, or you don't have access to a windows machine?
Moonchild commented 2 months ago
Owner

I tried the steps to reproduce and I can’t reproduce it, period.
The published version of 28.16.0-x86 doesn’t display the issue for me. The patched version with fixed logic is fine, and the try builds are also fine.
This is running on Win 10 2004 (64-bit).

What I tried was adding a counter to each socket and increasing that counter whenever a poll occurred; if > 100 it would ignore the socket for “no-wait” scenarios (ignoring the pendingEvents flag) and (as of try 3) also exclude it from the calculation to find the minimum timeout value to use for polling. That should make it equal to not using PR_INTERVAL_NO_WAIT at all in the line you patched, and ignoring a problematic socket for calculating the desired timeout value...

I tried the steps to reproduce and I can't reproduce it, period. The published version of 28.16.0-x86 doesn't display the issue for me. The patched version with fixed logic is fine, and the try builds are also fine. This is running on Win 10 2004 (64-bit). What I tried was adding a counter to each socket and increasing that counter whenever a poll occurred; if > 100 it would ignore the socket for "no-wait" scenarios (ignoring the `pendingEvents` flag) and (as of try 3) also exclude it from the calculation to find the minimum timeout value to use for polling. That _should_ make it equal to not using `PR_INTERVAL_NO_WAIT` at all in the line you patched, and ignoring a problematic socket for calculating the desired timeout value...
hiiamboris commented 2 months ago
Poster

Interesting. Same Windows version. I wonder then if a wifi driver may play a role in all this, e.g. in how it handles sleep/wakeup events.. For the record it’s “Intel(R) Wi-Fi 6 AX201 160MHz” adapter, driver 21.90.2.1, on my side.

Thanks for your efforts nonetheless. I appreciate it. Let me know if/when you’ll have any other fix ideas ;) GH is so useless without this addon, lol, but Palemoon.. Palemoon rocks! I’m not changing it for any chromium crap.

Interesting. Same Windows version. I wonder then if a wifi driver may play a role in all this, e.g. in how it handles sleep/wakeup events.. For the record it's "Intel(R) Wi-Fi 6 AX201 160MHz" adapter, driver 21.90.2.1, on my side. Thanks for your efforts nonetheless. I appreciate it. Let me know if/when you'll have any other fix ideas ;) GH is so useless without this addon, lol, but Palemoon.. Palemoon rocks! I'm not changing it for any chromium crap.
Moonchild commented 2 months ago
Owner

I have one more try build; moved a few things around and changed where the counter is set to its initial value, in case the weird out of bounds polling is constantly stopping and starting the timeout.
Please let me know if that works.

If not I’m just gonna toss my patch as it’s clearly not working and the added complexity would be unneeded.

I have [one more try build](https://www.dropbox.com/s/al2mvexecm8v06t/palemoon-29.0.0a6.win32-poll-try4.7z?dl=0); moved a few things around and changed where the counter is set to its initial value, in case the weird out of bounds polling is constantly stopping and starting the timeout. Please let me know if that works. If not I'm just gonna toss my patch as it's clearly not working and the added complexity would be unneeded.
hiiamboris commented 2 months ago
Poster

No, that didn’t fix it either 🤷

No, that didn't fix it either 🤷
Moonchild commented 2 months ago
Owner

Right, in the bin it goes then. :(
I’ll see if I can make a workaround with a preference instead so if you’re suffering from this you can clamp the socket polling to a minimal interval duration.

Right, in the bin it goes then. :( I'll see if I can make a workaround with a preference instead so if you're suffering from this you can clamp the socket polling to a minimal interval duration.
Moonchild commented 2 months ago
Owner

Right, try this one.
It has a new preference network.websocket.timeout.clamped which, if set to true, should avoid the problem at the expense of not having accurate socket timing.

Right, try [this one](https://www.dropbox.com/s/qcjfsqtoobqn4ra/palemoon-29.0.0a6.win32-poll-try5.7z?dl=0). It has a new preference `network.websocket.timeout.clamped` which, if set to true, should avoid the problem at the expense of not having accurate socket timing.
hiiamboris commented 2 months ago
Poster

Thanks man! That totally solved it. CPU load is about 0.5%

Thanks man! That totally solved it. CPU load is about 0.5%
Moonchild commented 2 months ago
Owner

Thanks for verifying!

Thanks for verifying!
Moonchild closed this issue 2 months ago
Moonchild added the
Fixed
label 2 months ago
Moonchild added the
Verified
label 2 months ago
Moonchild commented 2 months ago
Owner

Reopening this; as a result of investigating a past commit for something else, I may have found what caused this issue -- meaning that it may be possible to properly solve it instead of the workaround that was put in place.

@hiiamboris are you up for checking one or two more builds? ;-)

Reopening this; as a result of investigating a past commit for something else, I may have found what caused this issue -- meaning that it may be possible to properly solve it instead of the workaround that was put in place. @hiiamboris are you up for checking one or two more builds? ;-)
Moonchild reopened this issue 2 months ago
Moonchild commented 2 months ago
Owner

This build has sleep mode logic restored that seems to have been removed in error in the past. Could you try that one? Remember it still has the pref so you need to disable that workaround when testing.

[This build](https://www.dropbox.com/s/4cgtj8scb49e567/palemoon-28.17.0.win32-sleepmode-try.7z?dl=0) has sleep mode logic restored that seems to have been removed in error in the past. Could you try that one? Remember it still has the pref so you need to disable that workaround when testing.
hiiamboris commented 2 months ago
Poster

Confirmed. This older version survives the sleep, with clamping off. 👍

Confirmed. This older version survives the sleep, with clamping off. 👍
Moonchild commented 2 months ago
Owner

Great! Thanks. I’ll remove the workaround then.

Great! Thanks. I'll remove the workaround then.
Moonchild closed this issue 2 months ago
hiiamboris commented 2 months ago
Poster

Have this sleep logic not been pushed into the 29.0.0a6 version? It certainly has CPU load problems.

Have this sleep logic not been pushed into the 29.0.0a6 version? It certainly has CPU load problems.
Moonchild commented 2 months ago
Owner

Yes it has, this is in the unstable of the 21st.

Yes it has, this is in the unstable of the 21st.
hiiamboris commented 2 months ago
Poster

Aaaalright. I see now. While this sleep logic fix does not have a problem with the GH Polyfill addon, it still has a problem with NoScript :D

Please add the clamped property to the next release, let it be there just in case, as this bug’s killswitch.

Aaaalright. I see now. While this sleep logic fix does not have a problem with the GH Polyfill addon, it still has a problem with NoScript :D Please add the clamped property to the next release, let it be there just in case, as this bug's killswitch.
hiiamboris commented 2 months ago
Poster

Now that’s funny. The poll-try5 version (with the “clamped” property) doesn’t work with NoScript anymore either, though it worked for almost 2 weeks! So unless I’m missing smth obvious, I must presume that someone at GH is actively exploiting this situation.

Another explanation would be that update to 29a6 changed some thing in config or somewhere and me just copying the files provided by you over the portable installation doesn’t roll that thing back.

Now that's funny. The poll-try5 version (with the "clamped" property) doesn't work with NoScript anymore either, though it worked for almost 2 weeks! So unless I'm missing smth obvious, I must presume that someone at GH is actively exploiting this situation. Another explanation would be that update to 29a6 changed some thing in config or somewhere and me just copying the files provided by you over the portable installation doesn't roll that thing back.
Moonchild commented 2 months ago
Owner

Sorry but NoScript is no consideration for us. It’s known to cause problems and excessive CPU usage is one of the milder symptoms of its use. Please stop using it and use any of the viable alternatives instead like eMatrix. Ask on the forum if you want assistance or advice migrating off of NoScript.

And no, there is no difference in terms of the code added to the workers on either branch (release or unstable).

Sorry but NoScript is no consideration for us. It's known to cause problems and excessive CPU usage is one of the *milder* symptoms of its use. Please stop using it and use any of the viable alternatives instead like eMatrix. Ask on the [forum](https://forum.palemoon.org/viewforum.php?f=46) if you want assistance or advice migrating off of NoScript. And no, there is no difference in terms of the code added to the workers on either branch (release or unstable).
hiiamboris commented 2 months ago
Poster

Thanks! I didn’t know of an alternative. Hopefully it will prove just as useful.

Thanks! I didn't know of an alternative. Hopefully it will prove just as useful.
hiiamboris commented 2 months ago
Poster

So, that didn’t help :/
In fact, my testing routine was a bit flawed.

Took me some time to nail it down, but this is what I’m experiencing:

  • take any working palemoon-32 or 64 (try5 or 28.17 with sleep logic or 29a6 with sleep logic)
  • grab a fresh portable installation (28.17)
  • copy working palemoon into bin/ of the portable installation
  • run Palemoon-Portable.exe
  • install wc-polyfill
  • go to github.com/notifications & pin this tab
  • close Palemoon & run Palemoon-portable.exe again
  • enter sleep, wake up
  • 80+% CPU usage

Doesn’t matter what clamped is set to - still eats CPU cycles.
When I run palemoon.exe (not from the portable launcher!) - both clamped and sleep logic disarm the bug.

Any ideas? ;)
Is there another way to launch it in portable mode, without this third-party launcher?

So, that didn't help :/ In fact, my testing routine was a bit flawed. Took me some time to nail it down, but this is what I'm experiencing: - take any working palemoon-32 or 64 (try5 or 28.17 with sleep logic or 29a6 with sleep logic) - grab a fresh portable installation (28.17) - copy working palemoon into bin/ of the portable installation - run Palemoon-Portable.exe - install wc-polyfill - go to github.com/notifications & pin this tab - close Palemoon & run Palemoon-portable.exe again - enter sleep, wake up - 80+% CPU usage Doesn't matter what `clamped` is set to - still eats CPU cycles. When I run `palemoon.exe` (not from the portable launcher!) - both `clamped` and sleep logic disarm the bug. Any ideas? ;) Is there another way to launch it in portable mode, without this third-party launcher?
Moonchild commented 2 months ago
Owner

No ideas since I still can’t reproduce it myself.
The launcher isn’t third-party, by the way.

But, you can launch it the same way by launching palemoon.exe with a given profile path:

palemoon.exe --no-remote -profile "C:\path\to\profile"

No ideas since I still can't reproduce it myself. The launcher isn't third-party, by the way. But, you can launch it the same way by launching palemoon.exe with a given profile path: `palemoon.exe --no-remote -profile "C:\path\to\profile"`
hiiamboris commented 2 months ago
Poster

Yeah I mean it’s ‘winPenPack’ something.
Anyway, that didn’t help, so launcher is ruled out.
This is soo weeeird...

Yeah I mean it's 'winPenPack' something. Anyway, that didn't help, so launcher is ruled out. This is soo weeeird...
Moonchild commented 2 months ago
Owner

Yeah it’s based on the winpenpack launcher. I didn’t write it from scratch :P
Anyway semantics, I guess.

I didn’t expect the launcher to be a problem, though.

Yeah it's based on the winpenpack launcher. I didn't write it from scratch :P Anyway semantics, I guess. I didn't expect the launcher to be a problem, though.
hiiamboris commented 2 months ago
Poster

I’ve found the problem :D
The (working with patches) profile in %appdata% dir was using wc-polyfill version 1.16, while on (never working) profiles I was installing newer 1.17.
Downgraded and everything’s OK.
Thanks for your help again!

I've found the problem :D The (working with patches) profile in %appdata% dir was using wc-polyfill version 1.16, while on (never working) profiles I was installing newer 1.17. Downgraded and everything's OK. Thanks for your help again!
mattatobin commented 2 months ago
Collaborator

I am not sure it is as simple as that. While extensions can cause some issues it would be nice to know why and to fix it if possible in the platform.

Always remember, workarounds AREN’T solutions.

I am not sure it is as simple as that. While extensions can cause some issues it would be nice to know why and to fix it if possible in the platform. Always remember, workarounds AREN'T solutions.
Moonchild commented 2 months ago
Owner

So what exactly is the difference between those two versions?

So what exactly is the difference between those two versions?
mattatobin commented 2 months ago
Collaborator

You’d have to ask @justoff

You'd have to ask @justoff
hiiamboris commented 2 months ago
Poster

Line 159 is the only difference for PaleMoon 0ba2373fb6 (diff-fddf6277c3cfc353553f23e248adc299caae322fd1e29848baf53b37a9d7fa41L159)

how does it deadlock is a mystery to me though ;)

Line 159 is the only difference for PaleMoon https://github.com/JustOff/github-wc-polyfill/commit/0ba2373fb6b65a2aa0fa119f3a60679f9ea39449#diff-fddf6277c3cfc353553f23e248adc299caae322fd1e29848baf53b37a9d7fa41L159 how does it deadlock is a mystery to me though ;)
JustOff commented 2 months ago

You need to use the latest version of github-wc-polyfill so that Pale Moon can load GitHub socket library. Otherwise, the background tasks that trigger this bug (notifications, live comment updates, etc.) are simply not executed.

You need to use the latest version of `github-wc-polyfill` so that Pale Moon can load GitHub socket library. Otherwise, the background tasks that trigger this bug (notifications, live comment updates, etc.) are simply not executed.
Moonchild commented 2 months ago
Owner

Well that doesn’t help.
FTR I’m still unable to reproduce this bug so i have no way forward without getting more information. I’ve verified that our logic is now the same for workers on suspend/resume as what Firefox uses, and the previous obvious bugs in its logic are fixed.
If anyone figures out the specific issue for workers with resuming from standby in our code then please make a targeted new issue for it.

Well that doesn't help. FTR I'm still unable to reproduce this bug so i have no way forward without getting more information. I've verified that our logic is now the same for workers on suspend/resume as what Firefox uses, and the previous obvious bugs in its logic are fixed. If anyone figures out the specific issue for workers with resuming from standby in our code then please make a targeted new issue for it.
Moonchild locked as Resolved and limited conversation to collaborators 2 months ago
Sign in to join this conversation.
No Milestone
No Assignees
5 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.