ImageJ/Fiji should support opening .bz2 and other compressed files

performance
compression
scifio

#1

Currently, a .zip file containing a TIFF stack can be opened via File > Open.

Our institute, and I suspect others, compresses archived image stacks in .bz2 format. To open them, one has to copy them locally, decompress them, and then open them. Ideally, one would open them directly from the mounted fileshare. There are Java libraries for reading a number of compressed file formats, so HandleExtraFileTypes could run twice: first to detect the compression, then to assign the appropriate plugin to open the decompressed image file as a stack.

This would save headaches and speed up operations everywhere.
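
To illustrate the "detect compression first" pass suggested above, here is a minimal Java sketch that guesses the compression format from the leading magic bytes of a stream. The class and method names (`CompressionSniffer`, `detect`, `sniff`) are hypothetical and are not part of ImageJ, HandleExtraFileTypes, or SCIFIO; the sketch only recognizes the bzip2, gzip, and ZIP local-file-header signatures.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not an ImageJ or SCIFIO class): guesses compression from magic bytes. */
final class CompressionSniffer {

	/** Classifies the first n valid bytes of head as "bzip2", "gzip", "zip", or "none". */
	static String detect(byte[] head, int n) {
		// bzip2 streams start with the ASCII bytes 'B', 'Z', 'h'.
		if (n >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') return "bzip2";
		// gzip streams start with 0x1F 0x8B.
		if (n >= 2 && head[0] == (byte) 0x1F && head[1] == (byte) 0x8B) return "gzip";
		// ZIP archives with at least one entry start with 'P', 'K', 0x03, 0x04.
		if (n >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) return "zip";
		return "none";
	}

	/** Peeks at up to four bytes of a mark-supporting stream, then rewinds it. */
	static String sniff(InputStream in) throws IOException {
		if (!in.markSupported()) throw new IllegalArgumentException("stream must support mark/reset");
		byte[] head = new byte[4];
		in.mark(head.length);
		int n = 0;
		while (n < head.length) {
			int r = in.read(head, n, head.length - n);
			if (r < 0) break; // EOF before four bytes were available
			n += r;
		}
		in.reset(); // leave the stream untouched for the second pass
		return detect(head, n);
	}

	public static void main(String[] args) throws IOException {
		InputStream in = new ByteArrayInputStream(new byte[] {'B', 'Z', 'h', '9', 0x31, 0x41});
		System.out.println(sniff(in)); // prints "bzip2"
	}
}
```

A file stream would need to be wrapped in a `BufferedInputStream` first, since `sniff` relies on mark/reset so that the second pass (choosing the image reader) sees the stream from its first byte.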


#2

Bio-Formats and SCIFIO do currently support reading BZip2-compressed files. Unfortunately, that support is very slow:

  • I took the mitosis.tif sample image (~33M), compressed it using bzip2 (~22M afterward), then benchmarked reading it using Bio-Formats via its showinf CLI tool. The operation took ~4.5 minutes.
  • I also tried reading the file using SCIFIO, via the scifio CLI tool’s show command, and the operation took ~30-60 seconds. I saw a similar result with the File > Import > Image… command that uses SCIFIO.

To explore this further, and as a potential short-term workaround for you, I wrote a Groovy script that uses the CBZip2InputStream class directly to decompress in memory and then feeds the resulting decompressed stream to SCIFIO. Here is the script:

#@ File file
#@ LogService log
#@ LocationService ls
#@ DatasetIOService ds
#@output Dataset dataset

start = System.currentTimeMillis()
def msg(msg) {
	t = System.currentTimeMillis()
	println("[" + (t - start) + "] " + msg)
}

import java.io.FileInputStream
import io.scif.io.ByteArrayHandle
import io.scif.io.CBZip2InputStream
import org.scijava.io.ByteArrayByteBank
import org.scijava.util.FileUtils

msg("Decompressing " + file.getAbsolutePath())

is = new FileInputStream(file)
is.read(); is.read() // skip the BZ header
bz2is = new CBZip2InputStream(is, log)
buf = new byte[512 * 1024]
bank = new ByteArrayByteBank()
while (true) {
	r = bz2is.read(buf)
	if (r < 0) break; // EOF
	bank.appendBytes(buf, r)
}
bz2is.close()

msg("Finished decompressing")
// NB: Assume file is named foo.bar.bz2, where foo.bar is the inner file.
id = file.getName().substring(0, file.getName().length() - FileUtils.getExtension(file).length() - 1)
msg("Opening decompressed bytes of " + id)

// NB: We are in the process of rewriting the SCIFIO I/O support as
// part of SciJava Common's org.scijava.io package. It is much better,
// but we are still not done migrating SCIFIO to the updated framework.
// So for now, we need to do this nasty LocationService.mapFile thing,
// and also unfortunately make one more copy of the bytes.
handle = new ByteArrayHandle(bank.toByteArray())
ls.mapFile(id, handle)
dataset = ds.open(id)

msg("Finished opening")

(Note that the use of ByteArrayByteBank above requires a newer version of SciJava Common than what ImageJ2 currently ships; I tested with version 2.66.1.)

When opening the bzip2-compressed mitosis sample in this way, my system took ~19s for the decompression followed by ~100ms for the display. Faster, but still pretty slow, unfortunately.

By comparison: the bunzip2 CLI tool requires ~2.6s on my system to decompress the compressed mitosis sample. So there will always be some overhead to working with bzip2 streams. For now, you can use a modified version of the above script that invokes bzcat as a subprocess:

#@ File file
#@ LogService log
#@ LocationService ls
#@ DatasetIOService ds
#@output Dataset dataset

start = System.currentTimeMillis()
def msg(msg) {
	t = System.currentTimeMillis()
	println("[" + (t - start) + "] " + msg)
}

import java.io.FileInputStream
import io.scif.io.ByteArrayHandle
import org.scijava.io.ByteArrayByteBank
import org.scijava.util.FileUtils

msg("Decompressing " + file.getAbsolutePath())

p = new ProcessBuilder("bzcat", file.getAbsolutePath()).start()
is = p.getInputStream()
buf = new byte[512 * 1024]
bank = new ByteArrayByteBank()
while (true) {
	r = is.read(buf)
	if (r < 0) break; // EOF
	bank.appendBytes(buf, r)
}
is.close()

msg("Finished decompressing")
// NB: Assume file is named foo.bar.bz2, where foo.bar is the inner file.
id = file.getName().substring(0, file.getName().length() - FileUtils.getExtension(file).length() - 1)
msg("Opening decompressed bytes of " + id)

// NB: We are in the process of rewriting the SCIFIO I/O support as
// part of SciJava Common's org.scijava.io package. It is much better,
// but we are still not done migrating SCIFIO to the updated framework.
// So for now, we need to do this nasty LocationService.mapFile thing,
// and also unfortunately make one more copy of the bytes.
handle = new ByteArrayHandle(bank.toByteArray())
ls.mapFile(id, handle)
dataset = ds.open(id)

msg("Finished opening")

The above script opens and displays the bzip2-compressed mitosis sample in ~2s on my system.
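
One caveat with the subprocess approach: the script above never checks whether bzcat succeeded, so a missing or corrupt file would yield empty or truncated bytes rather than an error. A minimal Java sketch of a more defensive reader follows. The class and method names (`SubprocessReader`, `readStdout`) are hypothetical, and the sketch assumes the command writes its payload to standard output, as bzcat does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: captures a command's stdout and fails on a non-zero exit status. */
final class SubprocessReader {

	static byte[] readStdout(String... command) throws IOException, InterruptedException {
		Process p = new ProcessBuilder(command)
			// Forward stderr to our own stderr so a full pipe cannot stall the child.
			.redirectError(ProcessBuilder.Redirect.INHERIT)
			.start();
		ByteArrayOutputStream out = new ByteArrayOutputStream();
		try (InputStream in = p.getInputStream()) {
			byte[] buf = new byte[512 * 1024];
			int r;
			while ((r = in.read(buf)) != -1) out.write(buf, 0, r);
		}
		// Only trust the bytes if the tool reports success.
		int exit = p.waitFor();
		if (exit != 0) throw new IOException(command[0] + " exited with status " + exit);
		return out.toByteArray();
	}

	public static void main(String[] args) throws Exception {
		if (args.length != 1) {
			System.err.println("usage: SubprocessReader <file.bz2>");
			return;
		}
		byte[] data = readStdout("bzcat", args[0]);
		System.out.println("Decompressed " + data.length + " bytes");
	}
}
```

In the Groovy script, the equivalent change would be to call `p.waitFor()` after the read loop and stop if the result is non-zero, before handing the bytes to SCIFIO.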

The executive summary here is that @gab1one and I are actively working on improving the SciJava Common + SCIFIO data handle API and supported plugins, so things will get better. But for bzip2 specifically, we will likely need to profile and address where CBZip2InputStream is slow, and/or find a more performant bzip2 decompression routine, to make bzip2 truly usable in the way that you want.


#3

Thanks so much!

Please remember to look into why opening .zip files using SCIFIO takes so long. That is a major reason for me to always deactivate SCIFIO in Fiji. Perhaps it is the same slow-down issue as with bzip2.


#4

I do not have time right now, but it should be possible to adapt the information I posted above for ZIP and run some similar benchmarks. I doubt the slowdown with ZIP has exactly the same cause, since the java.util.zip package is part of core Java and is likely to be much better optimized than that bzip2 class.
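
As a starting point for such a benchmark, here is a minimal Java sketch that times raw decompression of the first entry of a ZIP file using only java.util.zip, with no SCIFIO involved. The class and method names (`ZipReadBenchmark`, `readFirstEntry`) are hypothetical. Comparing its timing against the time SCIFIO needs to open the same file would show whether the decompression itself or the surrounding I/O layer is the bottleneck.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical benchmark helper: measures plain java.util.zip decompression time. */
final class ZipReadBenchmark {

	/** Reads the first entry of a ZIP stream fully into memory. */
	static byte[] readFirstEntry(InputStream in) throws IOException {
		try (ZipInputStream zis = new ZipInputStream(in)) {
			ZipEntry entry = zis.getNextEntry();
			if (entry == null) throw new IOException("ZIP archive has no entries");
			ByteArrayOutputStream out = new ByteArrayOutputStream();
			byte[] buf = new byte[512 * 1024]; // same buffer size as the scripts above
			int r;
			while ((r = zis.read(buf)) != -1) out.write(buf, 0, r);
			return out.toByteArray();
		}
	}

	public static void main(String[] args) throws IOException {
		if (args.length != 1) {
			System.err.println("usage: ZipReadBenchmark <file.zip>");
			return;
		}
		long start = System.nanoTime();
		byte[] data;
		try (InputStream in = new FileInputStream(args[0])) {
			data = readFirstEntry(in);
		}
		long ms = (System.nanoTime() - start) / 1_000_000;
		System.out.println("Decompressed " + data.length + " bytes in " + ms + " ms");
	}
}
```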