Feature Prefs Corruption Part II: timein vs saveToFile

fronobulax

Developer
Staff member
Mod note: This is a second and different approach to a specific instance of the problem in a thread about automatic backups of user prefs. That thread is here: https://kolmafia.us/threads/prefs-file-backup-restore-draft-pr.28887/


This is implemented for user prefs. Should it also be done for GLOBAL_prefs.txt? With the ping data there, it now seems like the chance of simultaneous writes is higher than it was before (even if it went from "minuscule" to "minuscule plus one").
 

MCroft

Developer
Staff member
That's not a bad extension of this.

I'm pretty sure we haven't solved the whole problem (and we can't here; it's just a backup). But we can definitely extend the design to also back up the global prefs.

Considering the number of recent anecdotal reports involving scripts and wake-from-sleep, I'm wondering about the timein function and whether it plays a role in the wipes.
 

Veracity

Developer
Staff member
Timing in saves the current preferences to disk and reloads them.
LoginManager.java:

Code:
  public static void timein(final String username) {
    // Save the current user settings to disk
    Preferences.reset(null);

    // Reload the current user's preferences
    Preferences.reset(username);

    // Close existing session log and reopen it
    RequestLogger.closeSessionLog();
    RequestLogger.openSessionLog();
...
That is called in only one place: from LoginRequest.processLoginRequest, which is called after you have succeeded in logging in and KoL redirects you to game.php, say. Or perhaps afterlife.php, if you are in Valhalla. I haven't actually seen that in operation.
 

MCroft

Developer
Staff member
So, my hypothesis is that a write is happening on another thread (possibly a script thread or a relay thread?) while the timein is running, so there's a race condition. I know that at times, I've had scripts wake from sleep and be just fine, but that's the thing about race conditions.

If Preferences.saveToFile() runs between the Preferences.reset(null) and Preferences.reset(username) (or at least before it finishes enough to have saved the data), then it could write a bunch of nothing to the file.

You know this code a lot better than I do, but I don't see the timein invalidating the file lock or stopping the script, and I'm not sure if it should. Another approach might be to have saveToFile() check for LoginManager.isTimingIn, and wait for that to complete before writing.

This could be completely wrong (and might even be obviously so), but I'm not sure what other theories would explain the observations.
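The window in question can be made concrete with a toy model (class and method names invented for illustration; this is not KoLmafia's actual code). reset(null) saves and clears the in-memory map, reset(username) reloads it, and anything that snapshots the map between the two sees nothing:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the suspected race window (invented names, not KoLmafia code).
// resetNull() stands in for reset(null): save settings, empty the map.
// resetUser() stands in for reset(username): reload the user's prefs.
public class TimeinRaceSketch {
  static Map<String, String> prefs = new HashMap<>();

  static void resetNull() {
    prefs.clear(); // settings saved, in-memory map now empty
  }

  static void resetUser() {
    prefs.put("coalmine", "canary"); // reload from disk
  }

  // What a concurrent saveToFile() would serialize at this instant
  static String snapshot() {
    return prefs.toString();
  }

  public static void main(String[] args) {
    prefs.put("coalmine", "canary");
    resetNull();
    String written = snapshot(); // a save landing here writes "a bunch of nothing"
    resetUser();
    System.out.println(written); // {}
  }
}
```

Here the interleaving is forced sequentially so it reproduces every time; in the real code it only happens when another thread's save lands in that gap.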
 

MCroft

Developer
Staff member
This may be something we could test, but I'll have to see if I can write a failing test for it first.
 

Veracity

Developer
Staff member
If you decide to synchronize reading/writing of the internal Preferences data structure and reading/writing of the file that stores it on disk, somehow, obviously this is one of the places that you'll have to think about. :)
 

Veracity

Developer
Staff member
If Preferences.saveToFile() runs between the Preferences.reset(null) and Preferences.reset(username) (or at least before it finishes enough to have saved the data), then it could write a bunch of nothing to the file.
Those two calls could/should be made atomic.
They essentially are a “save and reload user prefs” operation, and the prefs should not be accessed in any way from another thread while that is happening.

You know this code a lot better than I do, but I don't see the timein invalidating the file lock or stopping the script, and I'm not sure if it should.
What file lock? Stop what script? It could be “the script” which triggered the time in.

Another approach might be to have saveToFile() check for LoginManager.isTimingIn, and wait for that to complete before writing.

saveToFile triggered by what? A preference write?

Seems like reads/writes to the internal preferences data structure should be synchronized, somehow, since multiple threads can leave things inconsistent.

(Two scripts that each do a read/modify/write of the same property can leave data inconsistent, but that’s a computer science problem with multi threading, not solvable solely within Preferences.)

But blocking reads/writes while we are loading the file from disk, and blocking writes while we are saving the file to disk - and making the save/load sequence atomically block both, as mentioned above - seem like they’d keep the memory and disk copies consistent, at least.
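One way to realize that scheme (a sketch with invented names, not a patch to the real Preferences class) is a single ReentrantReadWriteLock: loading or resetting mutates memory, so it takes the write lock; saving to disk only reads memory, so it takes the read lock, which blocks writers while still allowing concurrent preference reads; and the save-and-reload pair holds the write lock across the whole sequence, making it atomic:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only -- one ReadWriteLock guarding the in-memory prefs map.
public class PrefsLockSketch {
  private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
  private final Map<String, String> map = new HashMap<>();

  String get(String key) {
    rw.readLock().lock();
    try {
      return map.get(key);
    } finally {
      rw.readLock().unlock();
    }
  }

  void set(String key, String value) {
    rw.writeLock().lock();
    try {
      map.put(key, value);
    } finally {
      rw.writeLock().unlock();
    }
  }

  // Returns what would be written to disk; the read lock blocks setters
  // for the duration but still admits concurrent readers.
  String saveToDisk() {
    rw.readLock().lock();
    try {
      return map.toString();
    } finally {
      rw.readLock().unlock();
    }
  }

  // The atomic "save and reload user prefs" operation: nothing can read
  // or write the map while it is momentarily empty.
  void saveAndReload(Map<String, String> fromDisk) {
    rw.writeLock().lock();
    try {
      map.clear();          // analogous to reset(null)
      map.putAll(fromDisk); // analogous to reset(username)
    } finally {
      rw.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    PrefsLockSketch prefs = new PrefsLockSketch();
    prefs.set("coalmine", "canary");
    System.out.println(prefs.saveToDisk());
    prefs.saveAndReload(Map.of("coalmine", "canary"));
    System.out.println(prefs.get("coalmine"));
  }
}
```

The deadlock caveat applies: anything already holding the read lock must not try to upgrade to the write lock, since ReentrantReadWriteLock doesn't support upgrading.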

This could be completely wrong (and might even be obviously so), but I'm not sure what other theories would explain the observations.
Plausible.
 

fronobulax

Developer
Staff member
The file I/O has some synchronization but I never convinced myself that I really understood it. But I think it dates back to hola which means I expect it to be robust.

If I am writing a preference to memory then I want to block all preference reads until I finish.

If I am reading from disk then that is a preference to memory write and reads need to be blocked.

If I am writing to disk then I need to block all writes to memory until the disk write is finished.

So I think there would be two different locks and some operations would only need one, others both.

In a world where users want every preference change written to disk and preferences change with every adventure I can see the above having an impact on performance. So this is easy to kick down the road. And whatever we do for user preferences won't be bulletproof for global preferences since we allow simultaneous logins of different characters.
 

Veracity

Developer
Staff member
if I am writing a preference to memory then I want to block all preference reads until I finish.

If I am reading from disk then that is a preference to memory write and reads need to be blocked.

If I am writing to disk then I need to block all writes to memory until the disk write is finished.
I think that is literally what I suggested. :)

In a world where users want every preference change written to disk and preferences change with every adventure I can see the above having an impact on performance. So this is easy to kick down the road.
The first "rule" of optimization:

You can always make your code as fast as you want if you don't care about correctness.
 

fronobulax

Developer
Staff member
I think that is literally what I suggested. :)

Sometimes repeating what I thought I just heard is a learning experience :)

The first "rule" of optimization:

You can always make your code as fast as you want if you don't care about correctness.

I learned it from Jon Bentley paraphrased as:

It is easier to make working code fast than to make fast code work.

Probably could have learned more from him :)
 

MCroft

Developer
Staff member
I learned the Kent Beck version: "Make It Work, Make It Right, Make It Fast"

The Preferences.saveToFile() method synchronizes on a monitor object named lock, which is not used in any of the other methods of Preferences.

Can we reuse the lock across methods, or is it just for one method, which is why every synchronized call has different monitor objects? Can we use a lock in a different Object, like synchronized(Preferences.lock) in timein?

Maybe LoginManager.timein() should call Preferences.timein(), which synchronizes the null/user resets. I'm not sure if that works the way I'm imagining it...
 

MCroft

Developer
Staff member
Did more research into synchronization, which is a new area for me. What I learned:
Anything synchronized on the same monitor object is mutually exclusive with anything else synchronized on that object, regardless of which code block it appears in.

Synchronized instance methods are effectively synchronized (this) blocks around the whole method; static synchronized methods similarly lock on the Class object.
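A quick illustration of the "same monitor object" rule (not KoLmafia code): eight threads bumping a counter inside synchronized (lock) blocks never lose an update, precisely because they all contend on the one monitor.

```java
// Eight threads incrementing a shared counter under one monitor object.
// Every block contends on the same lock, so no increments are lost;
// remove the synchronized and the final count usually comes up short.
public class MonitorDemo {
  private static final Object lock = new Object();
  private static int counter = 0;

  static void bump() {
    synchronized (lock) {
      counter++;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < 10_000; j++) bump();
      });
      threads[i].start();
    }
    for (Thread t : threads) t.join();
    System.out.println(counter); // 80000
  }
}
```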

So, here's the thing. I can't figure out how to build a failing test case, so I want to run it for a while before I declare it "no worse". And I'll certainly take offers of review and suggestions for how to test this.

lock is the same monitor object used for a synchronized block in saveToFile(), with the hope that it will prevent race conditions when updating and resetting...

The first two lines of LoginManager.timein(username) are refactored to call the Prefs method with the lock.

Java:
private static final Object lock = new Object(); // used to synch io
...
public static void saveAndReloadUser(String username) {
  synchronized (lock) {
    // Save the current user settings to disk
    reset(null);

    // Reload the current user's preferences
    reset(username);
  }
}
 

heeheehee

Developer
Staff member
blocking writes while we are saving the file to disk
Blocking writes but not reads is somewhat atypical, especially in the context of ReadWriteLocks (where you generally can have many simultaneous readers but only one writer).

It seems reasonable in this context, but sounds like it would need to be handled carefully (regarding performance, deadlock avoidance, etc).

Can we reuse the lock across methods, or is it just for one method, which is why every synchronized call has different monitor objects? Can we use a lock in a different Object, like synchronized(Preferences.lock) in timein?
I believe the answer to the first and third questions is "yes", as you've determined.

I can't figure out how to build a failing test case
Can you spawn one thread that calls timein in a loop, and a second that attempts to increment a preference, say, 10 times, persisting to disk after each update?
 

MCroft

Developer
Staff member
Blocking writes but not reads is somewhat atypical, especially in the context of ReadWriteLocks (where you generally can have many simultaneous readers but only one writer).

It seems reasonable in this context, but sounds like it would need to be handled carefully (regarding performance, deadlock avoidance, etc).

Does the fact that we have the prefs in memory (until we clear them) make the case atypical? I don't think we read from disk unless we're timing in.
I believe the answer to the first and third questions is "yes", as you've determined.


Can you spawn one thread that calls timein in a loop, and a second that attempts to increment a preference, say, 10 times, persisting to disk after each update?
That seems like a promising approach. I'd have to read from disk after I wrote to make sure it was persisted and nothing was lost.
 

Irrat

Member
A bit of an offshoot from the current discussion, but in the context of corrupted preference files that are still readable we could add another insurance policy.

A pref that exists only to record how many properties were written, so we can compare it against how many we read.

Code:
if (properties.size() != properties.getInt("savedPreferencesCount", -1)) {
  restoreBackup();
}

Aka, if the property count is not equal to the saved property count, restore a backup. If the saved property count cannot be found, it defaults to -1.
Although, it's probably not a good solution as written, since it'd obviously have problems with the property not existing until users update.
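One way to sidestep that upgrade problem is a grandfathering default: a missing count means "no check possible" rather than "corrupt". A sketch (method name invented; java.util.Properties stands in for the real prefs map):

```java
import java.util.Properties;

// Sketch of the count-check idea with a grandfathering default
// (looksIntact is an invented name, not a real KoLmafia method).
public class PrefCountCheck {
  static boolean looksIntact(Properties p) {
    String saved = p.getProperty("savedPreferencesCount");
    if (saved == null) {
      return true; // file predates the check: accept rather than restore
    }
    // the counter is itself one of the properties, so compare directly
    return p.size() == Integer.parseInt(saved);
  }

  public static void main(String[] args) {
    Properties p = new Properties();
    p.setProperty("coalmine", "canary");
    p.setProperty("savedPreferencesCount", "2");
    System.out.println(looksIntact(p)); // true

    p.remove("coalmine"); // simulate a lost line
    System.out.println(looksIntact(p)); // false
  }
}
```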
 

MCroft

Developer
Staff member
Well, it's very interesting, and I hope I found what I was looking to find. The bad news is my sync fix above didn't solve it. I made a draft PR of just the test in case anyone wants to dig into it.

method timeinDoesNotCauseRaceCondition
  1. sets an unrelated pref
  2. starts a timeinThread
  3. starts up to 40 threads incrementing an int pref in a loop.
  4. Prints the state of the pref in question before starting the next thread.
    I am printing the output to System.out so that I can see what's happening; otherwise, it exits on the assertion when it fails.
About 10 increment threads in, it zeros out both the int pref and the string pref (pretty sure prefs are clobbered by that point). If I remove the timeinThread, the loop performs as expected. Note the pref is the same for iterations 9 and 10, then 0 for iterations 11-15, and then starts incrementing again. Sometimes it fails at iteration 5, sometimes it makes it all the way through.

Rich (BB code):
iteration number: 0 incremented Pref: 100 coalmine: canary
iteration number: 1 incremented Pref: 101 coalmine: canary
iteration number: 2 incremented Pref: 102 coalmine: canary
iteration number: 3 incremented Pref: 103 coalmine: canary
iteration number: 4 incremented Pref: 104 coalmine: canary
iteration number: 5 incremented Pref: 105 coalmine: canary
iteration number: 6 incremented Pref: 106 coalmine: canary
iteration number: 7 incremented Pref: 107 coalmine: canary
iteration number: 8 incremented Pref: 108 coalmine: canary
iteration number: 9 incremented Pref: 109 coalmine: canary
iteration number: 10 incremented Pref: 109 coalmine: canary
iteration number: 11 incremented Pref: 0 coalmine:
iteration number: 12 incremented Pref: 0 coalmine:
iteration number: 13 incremented Pref: 0 coalmine:
iteration number: 14 incremented Pref: 0 coalmine:
iteration number: 15 incremented Pref: 0 coalmine:
iteration number: 16 incremented Pref: 1 coalmine:
iteration number: 17 incremented Pref: 2 coalmine:
iteration number: 18 incremented Pref: 3 coalmine:
iteration number: 19 incremented Pref: 4 coalmine:
iteration number: 20 incremented Pref: 5 coalmine:
etc...
 

MCroft

Developer
Staff member
I shouldn't be surprised that GitHub's runners have limited resources and fail these tests when my laptop passes them. I need to see what I can do about that.

Changing the monitor object for reset to match the one for saveToFile, plus another few fixes seems to have gotten rid of the wipeout problem. Results look good, canary doesn't get nuked at all.

Still tracking down a reset of the incremented pref (or seeing if the fix I put in works on GitHub as well as my computer).
 

MCroft

Developer
Staff member
Tests pass.

I'm not convinced I've killed every possible bad interaction, but I think it's better and no worse and that I killed one problem at least.
 

MCroft

Developer
Staff member
Well, it's exhibiting a race condition again, so I may split "the part that works" from "the part I'm still working on" in the PR. IDK; I'll try to fix it before I refactor my changes. I definitely want the "don't erase it all" part in the codebase.

I think the new behavior is that it's failing to write while it's timing in.

I thought the lock would take care of that, but maybe it's not expansive enough.
 