I think I have found a way to negate my dread to a certain extent.
Earlier this week, I tried out a Linux distro and discovered that squashfs compression was a thing. I soon discovered how bloody efficient it was, and I wondered if there was a way to achieve a similar level of compression on Windows.
Thus, I have spent my entire time since then researching compression formats. I made an important discovery that blew my mind.
As it turns out, by default 7zip and Winrar do not account for duplicate files in an archive (regardless of archive format). For instance, if you had an archive with two binary-identical 100 mb files, you would get an archive weighing 200 mb.
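To see the effect described above without installing anything, here is a minimal sketch using Python's built-in zipfile module rather than 7zip or Winrar. The file names and the random bytes standing in for an already-compressed game asset are made up for illustration. ZIP compresses every entry on its own, so a second identical copy costs about as much as the first:

```python
import io
import os
import zipfile


def zip_size(files: dict[str, bytes]) -> int:
    """Return the size in bytes of an in-memory ZIP archive holding `files`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return len(buffer.getvalue())


# Hypothetical asset: random bytes are incompressible, like most packed art assets.
asset = os.urandom(1_000_000)

one_copy = zip_size({"v1/asset.bin": asset})
two_copies = zip_size({"v1/asset.bin": asset, "v2/asset.bin": asset})

print(f"one copy:   {one_copy} bytes")
print(f"two copies: {two_copies} bytes")
```

The second number should come out at roughly double the first, which is the same behaviour as the 100 mb example above.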
Before doing my research I kinda assumed that removing duplicates was the whole point of compressing files, but apparently compression does something different.
After messing around with 7zip, I have found the most efficient compression format, behold:
As it turns out, the best way to compress files is by making a .7z archive with the "qs" parameter (-mqs=on on the command line, which sorts files by type inside a solid archive). This accounts for any duplicate files and thus reduces the data to its simplest expression without any form of data loss.
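For anyone who prefers scripting this over clicking through the GUI, here is a minimal sketch that builds and runs the matching command line. It assumes the 7-Zip command-line executable is reachable as `7z` (on Windows it may need the full path to 7z.exe), and the archive and folder names are hypothetical. The only switches used are `a` (add), `-t7z` (archive type) and `-mqs=on` (the "qs" parameter):

```python
import shutil
import subprocess


def build_7z_command(archive_path: str, source_dir: str, exe: str = "7z") -> list[str]:
    """Build the argument list for creating a .7z archive with the "qs" parameter."""
    return [exe, "a", "-t7z", "-mqs=on", archive_path, source_dir]


def create_archive(archive_path: str, source_dir: str, exe: str = "7z") -> None:
    """Run 7-Zip; raises if the executable is missing or 7-Zip reports an error."""
    command = build_7z_command(archive_path, source_dir, exe)
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe!r} was not found on PATH")
    subprocess.run(command, check=True)


# Only print the command here; call create_archive(...) to actually run it.
print(" ".join(build_7z_command("game_versions.7z", "game_versions")))
```

Calling `create_archive("game_versions.7z", "game_versions")` would then produce the archive, provided 7-Zip is installed.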
This means that, this entire time, all the compressed files I have ever made were bloody inefficient.
This can be EXTREMELY useful. For instance, on my computer I had a backup of multiple versions of a given game. As all the versions only contained bug fixes, most assets were repeated in multiple folders. Using this compression, the archive went from 3gb to 1gb.
This changes everything for me, as I have a lot of duplicate files that I wish to keep (this essentially allows me to do something similar to version control, but with raw assets). The fact that this setting is not the default baffles me.
The only downside to this method is that adding files to an existing archive will not take into account whether the files are duplicates. That being said, the nice thing about this method is that it's directly readable (unlike something like tar.gz). It is the closest to squashfs compression you can get on Windows (being extremely close in terms of efficiency). There is also the added bonus of it being OS-agnostic.
---------
Here are some results I obtained during this research.
This research was conducted on two versions of a game, with the executable being the only difference between the two folders (both folders having the full art assets, which were hex-identical):
-Individual zipped archives are 575 mb (original folders)
-Combined zipped archives are 575 mb
-Combined tar file is 586 mb
-Combined 7zip file is 566 mb (directly readable)
-Combined rar file is 573 mb
-Combined tar.xz file is 567 mb (directly readable) (normal settings)
-Combined tar.gz file is 574 mb (not directly readable)
-Combined tar.xz file is 567 mb (directly readable) (ultra settings)
-Combined rar5 file is 573 mb (directly readable)
-Combined tar.xz file is 567 mb (directly readable) (128 mb dictionary)
-Combined .squashfs is 402 mb (directly readable)
-Combined .iso file is 587 mb (indirectly readable)
-Combined .tar.gz file is 574 mb (with shared file detection)
-Combined .tar file is 586 mb (with shared file compression)
-Combined 7zip file is 405 mb (directly readable) (with "qs" parameter)
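To put the list above into percentages, here is a small sketch that compares a few of the measured sizes against the 575 mb zip result. The sizes are copied from the list; choosing the zip as the baseline is my own assumption:

```python
# Sizes in mb, taken from the results listed above.
results_mb = {
    "zip": 575,
    "7zip (default)": 566,
    "tar.gz": 574,
    "squashfs": 402,
    '7zip with "qs"': 405,
}


def saving_percent(size_mb: float, baseline_mb: float) -> float:
    """Percentage saved relative to the baseline size."""
    return 100 * (baseline_mb - size_mb) / baseline_mb


baseline = results_mb["zip"]
for name, size in results_mb.items():
    print(f"{name}: {size} mb, {saving_percent(size, baseline):.1f}% smaller than zip")
```

Both squashfs and 7zip with "qs" land at roughly 30% below the zip, while the others stay within a couple of percent of it.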
As you can see, the difference between 7zip with qs and squashfs versus everything else is like night and day. Using this method, art assets are only stored once, meaning no file has any duplicates.
This method essentially compiles "instructions" that refer to the original files. This means I can essentially command 7zip to recreate any version of the game I want using a minimal amount of data.
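The "store once, refer back to it" idea can be sketched in a few lines. To be clear, this is a conceptual illustration and not how 7zip works internally (as far as I understand, its gain comes from the compressor spotting repeated data inside one solid stream). The function names, version names and file contents below are all made up:

```python
import hashlib


def build_store(versions: dict[str, dict[str, bytes]]):
    """Store each unique file content once, plus one manifest per version.

    `versions` maps a version name to {relative path: file bytes}.
    """
    blobs: dict[str, bytes] = {}               # content hash -> file bytes
    manifests: dict[str, dict[str, str]] = {}  # version -> {path: content hash}
    for version, files in versions.items():
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            blobs.setdefault(digest, data)  # identical content is kept only once
            manifest[path] = digest
        manifests[version] = manifest
    return blobs, manifests


def restore(version: str, blobs, manifests) -> dict[str, bytes]:
    """Rebuild every file of one version from its manifest."""
    return {path: blobs[digest] for path, digest in manifests[version].items()}


# Hypothetical game versions: only the executable differs.
versions = {
    "v1.0": {"game.exe": b"exe build 1", "art/sprite.png": b"same pixels"},
    "v1.1": {"game.exe": b"exe build 2", "art/sprite.png": b"same pixels"},
}
blobs, manifests = build_store(versions)
print(f"{len(blobs)} unique files stored for {len(versions)} versions")
```

Each version can still be rebuilt in full with `restore("v1.1", blobs, manifests)`, even though the shared art asset is only stored once.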
I hope my explanation was clear. I might showcase a concrete example tomorrow because this is an extremely good thing, especially for archivists.