Feature - Implemented GZip Support for Session Logs

Saklad5

Member
Given enough time, session logs from KoLmafia can take up quite a lot of storage space. Thankfully, the nature of these logs means that they are extremely compressible using tools like Gzip.

Many scripts benefit from accessing these logs through the session_logs ASH function, though, and compressing them would break that functionality. I’d like to change that function so that it automatically decompresses and reads gzipped logs along with uncompressed logs, as if they were normal text. Java has a built-in class for that, so it should be fairly easy to implement.
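For reference, a minimal sketch of the idea using Java's built-in GZIPInputStream (the class alluded to above). The helper name openLog and the sample log line are illustrative, not KoLmafia code:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SessionLogReader {
    // Hypothetical helper: open a log as text, decompressing on the fly
    // when the file name ends in ".gz".
    static BufferedReader openLog(File file) throws IOException {
        InputStream in = new FileInputStream(file);
        if (file.getName().endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Round trip: write a gzipped "log", then read it back as plain text.
        File tmp = File.createTempFile("session", ".txt.gz");
        tmp.deleteOnExit();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(tmp)),
                StandardCharsets.UTF_8)) {
            w.write("[1] ate 1 fortune cookie\n");
        }
        try (BufferedReader r = openLog(tmp)) {
            System.out.println(r.readLine());
        }
    }
}
```

Since a gzip stream is just a wrapped InputStream, everything downstream of the reader is unchanged whether the log was compressed or not.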

Compressing the files in the first place could be left to the user. Gzip isn’t computationally intensive (it is already used for web requests in KoLmafia, after all), but adding a setting for it would probably confuse a lot of people.
 

zarqon

Well-known member
I support this idea!

Sometime in the beginning of each year I make a zip file of the previous year's logs and move it to another folder (because I sync my KoLmafia directory with Dropbox and have relatively limited space). This feature would increase my time between purgings.
 

fronobulax

Developer
Staff member
In contrast, on the first day of every month I move all my logs to a zip file. I have no use case for using KoLmafia to access logs older than a month and am quite comfortable with manually unzipping if one developed. I am also quite content to parse logs using tools other than ash scripts so I have no benefit from compressing individual files since my tools read text and act on all (matching) files in a directory.

That said it does seem as if adding a new choice of input stream wouldn't be too hard.

But...

Should this support a collection of individually compressed files in a subdirectory, a compressed archive of files or both? Since KoLmafia is cross platform which compression and which archive formats should be supported? Should it be expected to merge results from an archive and directory?

Given my available storage, the ease of adding files to a zip file on Windows, and the different process required to create a collection of zip files each containing one file, I am most interested in this if it supports archives, and then (since the use case is session_logs) if it handles duplicates (the same file name in both the archive and the directory) or renamed files.
 

Saklad5

Member
I think gzip is ideally suited for this type of compression, and it is already supported by basically everything. As I said, KoLmafia actually already uses gzip to speed up web requests. ZIP files are less consistent across systems, lose certain UNIX metadata, take longer to compress and decompress, and produce a larger file. However, more formats can be added if you feel that is necessary (xz, for example, boasts a much better compression ratio than gzip, though it is more computationally intensive and not built into Java itself), but that seems like a bit of a slippery slope.

As for multi-file archives, I think it would simplify things tremendously to only support individual files in the existing directory with unchanged names (extension aside). That rules out all but one potential conflict: the presence of both an uncompressed file and a compressed one. In that case, the most recently-modified file should be used.
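That conflict rule could be sketched like this (a hypothetical helper, not part of any patch): compare modification times and prefer the newer file.

```java
import java.io.File;
import java.io.IOException;

public class LogPicker {
    // Sketch of the proposed conflict rule: when both a plain and a gzipped
    // copy of a log exist, use whichever was modified most recently.
    static File pickLog(File plain, File gzipped) {
        if (!plain.exists()) return gzipped;
        if (!gzipped.exists()) return plain;
        return plain.lastModified() >= gzipped.lastModified() ? plain : gzipped;
    }

    public static void main(String[] args) throws IOException {
        File txt = File.createTempFile("log", ".txt");
        File gz = File.createTempFile("log", ".txt.gz");
        txt.deleteOnExit();
        gz.deleteOnExit();
        gz.setLastModified(txt.lastModified() + 10_000); // make the .gz newer
        System.out.println(pickLog(txt, gz).getName().endsWith(".gz"));
    }
}
```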

By the way, gzip is not quite the same as ZIP. Rather than compressing groups of files into a single archive, gzip takes arbitrary data (file, stream, whatever) and spits it back out. You are supposed to use a different tool (tar, almost always) to bundle multiple files without compressing them, then feed that into gzip. This is why you’ll often see multiple extensions on gzipped archives (tar.gz). It decompresses into a single tar file (.tar), and that contains multiple files/folders. In contrast, ZIP basically includes both of these functions. We can skip tar (.gz) if we only support individual files for now, though.
 
Last edited:

Saklad5

Member
because I sync my KoLmafia directory with Dropbox and have relatively limited space

I highly recommend Resilio Sync (formerly known as BitTorrent Sync) for this. Rather than using cloud storage, it uses the torrent protocol to synchronize data directly between your devices, no server involved. It’s extremely fast and efficient, and improves if you add more devices. The only downside is that at least one of the devices with the most recent version of the folder has to be online at the same time as the devices without it for changes to propagate.
 

Saklad5

Member
To give an idea of what impact this could have, I just compressed my 153MB sessions folder with tar and xz. The resulting file is 4.5MB.

xz is probably not a good candidate for this, unless we want to support multiple formats, for reasons I gave previously: more intensive, not built into Java, etcetera. Gzip is nearly as good without any of those drawbacks, which is why it is baked into the Web.

7zip actually uses the same algorithm as xz (LZMA2 to be precise), so the same applies to it.
 
Last edited:

fronobulax

Developer
Staff member
I think gzip is ideally suited for this type of compression, and it is already supported by basically everything. As I said, KoLmafia actually already uses gzip to speed up web requests. ZIP files are less consistent across systems, lose certain UNIX metadata, take longer to compress and decompress, and produce a larger file. However, more formats can be added if you feel that is necessary (xz, for example, boasts a much better compression ratio than gzip, though it is more computationally intensive and not built into Java itself), but that seems like a bit of a slippery slope.

As for multi-file archives, I think it would simplify things tremendously to only support individual files in the existing directory with unchanged names (extension aside). That rules out all but one potential conflict: the presence of both an uncompressed file and a compressed one. In that case, the most recently-modified file should be used.

By the way, gzip is not quite the same as ZIP. Rather than compressing groups of files into a single archive, gzip takes arbitrary data (file, stream, whatever) and spits it back out. You are supposed to use a different tool (tar, almost always) to bundle multiple files without compressing them, then feed that into gzip. This is why you’ll often see multiple extensions on gzipped archives (tar.gz). It decompresses into a single tar file, and that contains multiple files/folders. In contrast, ZIP basically includes both of these functions. We can skip tar if we only support individual files for now, though.

You may proclaim the technical superiority as much as you want but your task is to either submit a patch for review or convince someone else to do the work. I am probably the least prolific of the current devs but I am willing and able to work on things that benefit me personally or that I think are really, really important.

My use case is that all logs being actively "mined" are in text format in a single directory. Compressed files are in a zip archive because that is what is easiest to create on Windows. If a file in an archive needs to be processed it is manually extracted.

If this new feature is to be of any use to me, it needs to be able to treat a (zip) archive as augmenting the contents of a directory. So my comments were an attempt to bridge the gap between your *nix centered requirements and my Windows centric way of doing things.

By the way, gzip is not quite the same as ZIP.

Please do not make assumptions about my technical knowledge and talk down to me.
 

fronobulax

Developer
Staff member
I highly recommend Resilio Sync (formerly known as BitTorrent Sync) for this. Rather than using cloud storage, it uses the torrent protocol to synchronize data directly between your devices, no server involved. It’s extremely fast and efficient, and improves if you add more devices. The only downside is that at least one of the devices with the most recent version of the folder has to be online at the same time as the devices without it for changes to propagate.

My personal use case for Dropbox or any other synch service is that computers do not have to be online at the same time. If it's too sensitive for someone else's cloud I just don't synch it. YMMV, obviously.
 

heeheehee

Developer
Staff member
In my limited testing (over my ~50MB of session files from the past month), it seems like gzip provides roughly 15x compression ratios, while bzip2 and xz provide roughly 30x. Surprisingly, gzipping files individually didn't seem to have a significant negative impact on achieved ratios (maybe 10% larger compared to gzipping the entire tarball; bzip2 yielded ~25% improvement over individually-compressed files; xz was even more drastic with a 40% improvement over individually-compressed files).

Also, maybe I'm still in the stone age with my rsync usage... I have no idea what this resilio sync is, but it looks like it doesn't send deltas, or compress files in transit?
 

fronobulax

Developer
Staff member
As long as we are off on a tangent - at one point in Ye Olden Days compressing files did not save as much disk space as people hoped because of the disk sector size. If the sector size was 2048 bytes compressing a 2040 byte file down to 200 bytes did not free up 1840 bytes for another file. This was used as an argument for putting files in an archive and (typically) compressing the archive. The savings are trivial on a modern PC but there are still a few embedded devices in use in the real world where bytes matter.
 

Rinn

Developer
I do this with a post-run bash script. Here are some stats:

Code:
Size: 1 820 349 739
Packed Size: 53 461 743
Folders: 0
Files: 3 213
CRC: 80C172DA
----------------------------
Path: Dropbox\kolmafia\sessions\Epicgamer.7z
Type: 7z
Physical Size: 53 501 753
Headers Size: 40 010
Method: LZMA2:26
Solid: +
Blocks: 1

This is the solid archive for my main character; it has session logs from September 2008 until the end of 2017. It has a 2.91% compression ratio.

Code:
Size: 271 421 367
Packed Size: 29 471 797
Folders: 0
Files: 222
CRC: B9FAF451
----------------------------
Path: Dropbox\kolmafia\sessions\Epicgamer.zip
Type: zip
Physical Size: 29 499 535

This is the ongoing zip that I update on a daily basis with my script that runs after mafia is closed. This has a 10.85% compression ratio, which seems in line with what everyone else is getting. I re-compress the solid archive once a year and purge those logs from the daily zip.

Here is my script if you're curious:

Code:
#!/bin/bash

CHAR=$1

# Directory containing this script
DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )

pushd "${DIR}/../sessions"

# Make the logs readable/writable (rw-rw-r--)
chmod 664 "${CHAR}"_*.txt

# Update (u) the character's archive at maximum compression (-mx9)
7z u -mx9 "${CHAR}.zip" "${CHAR}"_*.txt

# Keep the seven newest logs (sorted by the date in the file name);
# anything older is already archived and can be deleted
oldest=$(ls -td "${CHAR}"_*.txt | sort -r -n -t _ -k 2 | awk 'NR>7')
if [ -n "$oldest" ]; then
    echo "Removing old session files..."
    rm ${oldest}  # unquoted on purpose: may expand to several file names
fi

popd
 
Last edited:

Saklad5

Member
That’s almost exactly what I’d be doing if some scripts didn’t need to access the logs. I’d do it with a cron job at rollover.

If everyone agrees this is a useful feature, I might code Gzip support myself and submit a patch. We can add support for archives with multiple files and different formats later, if anyone else wants to code it. For now, I want to keep this feature request simple.
 
Last edited:

Saklad5

Member
I wrote and extensively tested a patch to implement support for individually compressed session logs. It’ll work the same for uncompressed files, while properly reading gzipped logs if they have the “.gz” extension instead of “.txt”. If both a compressed and uncompressed file exist, it will use the uncompressed file for now.

View attachment Gzip Support.patch

I feel it is ready to be implemented in the daily build.
 

Crowther

Active member
It’ll work the same for uncompressed files, while properly reading gzipped logs if they have the “.gz” extension instead of “.txt”.
The normal convention in the UNIXish world is to have both extensions: file.txt.gz
 

fronobulax

Developer
Staff member
Since you actually made the effort to compile and had success, I gave it a look. You did a few things that offended my sense of compartmentalization, primarily introducing DataUtilities, but after a couple of tries to get around that I realized we either call DataUtilities or edit DataUtilities and FileUtilities. Given that choice, your code didn't bother me as much. There were a couple of things I did change, so I would also like to test it. I'll do that ASAP, but I might not be able to until the weekend. If it works for me, it will be in eventually.
 

Saklad5

Member
Thanks for helping out, especially since it won’t help you personally.

It is extremely trivial to add ZIP support for individual files (it works identically to gzip in Java), but handling archives is less straightforward. You’d need to add support for nested folders, for a start. Ideally, everything would be broken into specific folders that may or may not be archives, so KoLmafia doesn’t have to look through everything to determine something isn’t there.

For instance, there could be three valid locations for fronobulax_20180731.txt (or .txt.gz, etc):

  • sessions/
  • sessions/2018/
  • sessions/2018/07/
Any level of folder could be an archive, and KoLmafia would decompress them as needed. If it doesn’t find something, it won’t decompress a decade of log files before concluding it is missing. It will never check sessions/2017, for instance.

Actually moving or writing things to those subfolders would be a breaking change, but looking in them if they exist would not be. And it sounds like many people are using this style of organization system already.
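The lookup order could be sketched as a list of candidate paths, checked cheapest-first. Everything here, directory layout included, is the proposal above, not existing KoLmafia behavior:

```java
import java.io.File;
import java.util.List;

public class LogLocator {
    // Sketch of the proposed sessions/YYYY/MM layout: the three places a
    // dated log could live, in the order KoLmafia would check them.
    static List<File> candidates(File sessions, String player, String date) {
        String name = player + "_" + date + ".txt";
        String year = date.substring(0, 4);
        String month = date.substring(4, 6);
        return List.of(
                new File(sessions, name),
                new File(new File(sessions, year), name),
                new File(new File(new File(sessions, year), month), name));
    }

    public static void main(String[] args) {
        for (File f : candidates(new File("sessions"), "fronobulax", "20180731")) {
            System.out.println(f.getPath());
        }
    }
}
```

Because the year and month are derived from the requested date, nothing outside those three paths (such as sessions/2017) ever needs to be opened.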

Since DataUtilities and FileUtilities appear to be third-party libraries, I thought it would be a very bad idea to edit them. If we could edit those, it might be cleaner to actually change getReader() to look for compressed versions of any file it can’t find. Or add another function there to do as such. Session logs are hardly the only thing that could benefit from compression, especially if we eventually decide to compress things when writing.
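A hedged sketch of that getReader() idea, written as a standalone helper rather than an actual edit to FileUtilities: if the requested file is missing, fall back to a ".gz" sibling.

```java
import java.io.*;
import java.nio.file.Files;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ReaderFallback {
    // Hypothetical variant of the getReader() idea: if the requested file
    // is missing, look for a ".gz" sibling and decompress that instead.
    static BufferedReader getReader(File file) throws IOException {
        if (file.exists()) {
            return new BufferedReader(new FileReader(file));
        }
        File gz = new File(file.getPath() + ".gz");
        if (gz.exists()) {
            return new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(gz))));
        }
        throw new FileNotFoundException(file.getPath());
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("sessions").toFile();
        File gz = new File(dir, "log_20180731.txt.gz");
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(gz)))) {
            w.write("[5] Visiting the Council of Loathing\n");
        }
        // Ask for the plain .txt; the fallback finds and decompresses the .gz.
        try (BufferedReader r = getReader(new File(dir, "log_20180731.txt"))) {
            System.out.println(r.readLine());
        }
        gz.delete();
        dir.delete();
    }
}
```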

I could tell DataUtilities wasn’t really meant to be imported directly like that, but as you said, there wasn’t really a better option. Besides duplicating that functionality wholesale, I suppose.

Any other advice to keep in mind when contributing to KoLmafia in the future? I could certainly use the help.
 
Last edited:

Darzil

Developer
Presumably tested on a range of OSes?

Personal opinion at present is that whilst it's useful to be able to unpack archived files (of one specific variant), it'd be more useful if the option to zip files automatically were also present (maybe zip session logs from before the current game day when Mafia runs its new day code?)
 

Saklad5

Member
fronobulax uses Windows, and I am running KoLmafia on both Ubuntu and macOS. So that’s fairly comprehensive.

I don’t see why the ability to compress files has to be added right now. It can be added later, though the variety of options and settings to choose from makes it harder to find the right balance for every system. And if we do want to add compression to KoLmafia, it would probably be better to do it in real-time, using GZIPOutputStream and such.
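A minimal sketch of that real-time approach: write through GZIPOutputStream so the log is compressed as it is produced, then read it back through GZIPInputStream to confirm the round trip. File names and log lines are illustrative:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedLogWriter {
    public static void main(String[] args) throws IOException {
        // Write the session log compressed as it is produced, rather than
        // compressing it after the fact.
        File log = File.createTempFile("player_20180731", ".txt.gz");
        log.deleteOnExit();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(log)),
                StandardCharsets.UTF_8))) {
            out.println("[10] cast 1 ode to booze");
            out.println("[10] drank 1 gnomish sangria");
        }
        // Read it back to confirm the round trip.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(log)),
                StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```

One caveat with streaming compression: the file is only a valid gzip archive once the stream is closed, so a crash mid-session could leave a truncated log.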

Besides, we can implement any number of formats for reading, but we can only implement one for writing. Unless you want to add that many settings, that is. I think it would be best to let users automate that themselves for now, and just make sure KoLmafia can work with it. We can add ZIP next, and possibly import libraries for xz, 7zip, bzip2, and so on.

If we add archive formats like ZIP and 7zip, we might need to add subfolder (and tar) support as well. It would be very confusing to only allow single files in those. The rest only support one file in the first place, so there isn’t any potential for confusion there.
 
Last edited: