Ops-experiments update - JavaCpp + FFT benchmarks



Hi all

I’ve added javacpp and nd4j as dependencies in the ops-experiments repo. My intention was to access fast FFTs through nd4j, but that was a dead end: all of the FFTs gave a “not supported” message. Apparently the interfaces exist, but they are unable to support complex numbers at this time, so the FFT implementations have been disabled.

So instead I tried javacpp-presets, and was able to write some FFTW wrapper ops and a benchmark.

javacpp-presets seems pretty useful. It provides wrappers for all kinds of native libraries, though it relies on the underlying native libraries being installed and on the native library path. Still, it may be possible to write some ops that use javacpp-presets and check for the required libraries in conforms.

javacpp-presets also wraps CUDA. @kephale, you should check out the CUDA example, as it implements a neural network piece by piece, as we attempted to do last year with ops.

Anyway, a quick question: does anyone know how to set the number of threads for JTransforms? I could not figure that out.


@bnorthan Thank you very much for the update!

I agree. It is weird to me that more people aren’t recommending or using it.

Digging through the JTransforms source, it seems there is code like this:

```java
import pl.edu.icm.jlargearrays.ConcurrencyUtils;
import org.jtransforms.utils.CommonUtils;
```

Does that help at all?
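To spell that out a bit: JTransforms delegates its thread pool to JLargeArrays’ `ConcurrencyUtils`, so something like the following sketch (class name and sizes are mine, not from the benchmark) should control the thread count:

```java
import pl.edu.icm.jlargearrays.ConcurrencyUtils;
import org.jtransforms.fft.FloatFFT_2D;

public class JTransformsThreads {
    public static void main(String[] args) {
        // JTransforms uses JLargeArrays' ConcurrencyUtils for its thread pool,
        // so this sets the number of worker threads the FFTs will use.
        ConcurrencyUtils.setNumberOfThreads(4);

        // A small 2D complex FFT just to exercise the setting.
        int rows = 256, cols = 256;
        float[] data = new float[rows * 2 * cols]; // interleaved complex layout
        data[0] = 1f; // impulse
        new FloatFFT_2D(rows, cols).complexForward(data);

        System.out.println("threads = " + ConcurrencyUtils.getNumberOfThreads());
    }
}
```

`CommonUtils` also has `setThreadsBeginN_*` methods to tune the problem size at which JTransforms starts parallelizing, if the defaults don’t suit the benchmark.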


Hi Curtis

Thank you, yes that helps. I added that code, as well as cuFFT code, to the FFT benchmark.

I also added:

A self-contained JavaCpp example showing how to convert a Java array to a pointer and pass it to C code.

A Mavenized JavaCpp example; also see the plugin section of pom.xml.
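For anyone curious, the array-to-pointer conversion boils down to something like this (a minimal sketch using JavaCpp’s `DoublePointer`; the actual native call is omitted):

```java
import org.bytedeco.javacpp.DoublePointer;

public class ArrayToPointer {
    public static void main(String[] args) {
        double[] in = {1.0, 2.0, 3.0};

        // Allocates native memory and copies the Java array into it;
        // the resulting pointer can be handed to native code as a double*.
        DoublePointer p = new DoublePointer(in);

        // ... here you would call a native function taking a double* ...

        // Copy the (possibly modified) native data back into a Java array.
        double[] out = new double[in.length];
        p.get(out);
        System.out.println(out[1]); // prints 2.0

        // Release the native memory explicitly.
        p.deallocate();
    }
}
```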


Hi all

I am continuing to work on ops-experiments as time permits. One of the main goals of this project is to work out the easiest way to integrate native and GPU-based algorithms into imagej-ops.

Earlier in the summer I was playing around with javacpp-presets. The presets are rich enough that you could write complete algorithms with them.

However, because I also want to call native algorithms from MATLAB, and potentially from pure native programs as well as from ImageJ, I have spent some time learning how to wrap my own C++ programs with JavaCpp.
I’ve added convolution and Richardson-Lucy using the MKL-FFTW wrapper to ops-experiments. (In the near future I am planning to follow the same steps to wrap a GPU implementation, of which there are several available, for example here, here, and within the SPIM codebase.)

I followed these steps (Windows 64 only right now, as this is meant as a proof of concept, not a polished build yet):

  1. Implement the native algorithm. I used CMake and MKL (for performance and licensing reasons), and implemented convolution and Richardson-Lucy here.

  2. Implement a Java wrapper. Note that you use annotations to specify the native include and link files and their locations. The Richardson-Lucy wrapper is here.

  3. Configure the javacpp plugin in the pom. See here.

  4. Add a test to verify whether it works.
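The annotation side of step 2 can be sketched roughly like this (header, library, and method names here are placeholders of my own, not the actual ops-experiments files):

```java
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.Loader;
import org.bytedeco.javacpp.annotation.Platform;

// Hypothetical wrapper: @Platform tells the javacpp plugin which native
// header to parse against and which library to link ("mklrl" here is made up).
@Platform(include = "MKLRichardsonLucy.h", link = "mklrl")
public class MKLRichardsonLucyWrapper {
    static {
        Loader.load(); // extracts and loads the generated native library
    }

    // Maps to a hypothetical C function:
    //   void richardsonLucy(float* img, float* psf, float* out,
    //                       int n, int iterations);
    public static native void richardsonLucy(FloatPointer img, FloatPointer psf,
                                             FloatPointer out, int n, int iterations);
}
```

The javacpp plugin then generates and compiles the JNI glue for the annotated class at build time, so the Java side only ever sees the `native` method.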

The next step is to try to wrap one of the GPU Richardson-Lucy implementations, then target multiple platforms and wrap it into an op.


Hi All

A few updates on the native and GPU deconvolution experiments…

I’ve added a JavaCpp wrapper to the YacuDecu CUDA Richardson-Lucy implementation by Bob Pepin.

Ops-experiments now has native and cuda sub-directories. These each have native source code and an associated CMakeLists.txt file.

To make it work, you build the native code outside the ops-experiments directory structure. If you inspect the javacpp plugin section of the pom, you can see where JavaCpp expects the native headers and libraries.

The wrapper files (i.e. CudaRichardsonLucyWrapper) specify where to look for third-party libraries such as CUDA and MKL. (Only Windows 64 is supported right now.)

There is a test that runs Ops, MKL, and Cuda deconvolution, and benchmarks them.

The test contains debugging statements, as I am still tracking down memory-corruption issues.

Preliminary times on my machine for 100 iterations of Richardson-Lucy:

Ops ~ 260 seconds
MKL ~ 16 seconds
Cuda ~ 4 seconds

Notes: I am in the process of profiling the ops version to see why it is so slow. I would expect it to be slower than the MKL version, but not by that much.


Thank you @bnorthan for your continued efforts on this! It is exciting to see someone working on native code integrations. I agree that JavaCpp is a promising tool. I look forward to hearing more about the profiling. Let me know if there is anything I can do to help.


Hi all

I thought I should post another update on the ops-experiments project, since we’ve accomplished a lot recently.

Over the last couple of weeks, @hadim and I have made some good progress on the build for YacuDecu deconvolution. Below is a summary of rough notes and build instructions I wrote down over the last few weeks. In the coming weeks I hope to clean up these instructions and add them to the readme of ops-experiments and the ImageJ wiki. I emphasize these are “rough” notes. I’m posting this for the benefit of any early adopters and for myself (so I can remember what we did to get things to work).

First off some timing results:

100 iterations Ops Richardson-Lucy ~ 220 seconds
100 iterations DeconvolutionLab2 RL ~ 100 seconds
100 iterations Ops-Experiments RL (C++ with MKL libraries) ~ 10 seconds
100 iterations Ops-Experiments CUDA (YacuDecu RL) ~ 2 seconds

These numbers are a bit deceptive, as Ops Richardson-Lucy definitely still has some inefficiencies. I suspect DeconvolutionLab2 could also be optimized further, as I’d expect the speed of an optimized Java version to be much closer to the C++ MKL version. The take-home point is that GPU deconvolution is much faster than the other implementations.

@hadim has contributed some big improvements to the javacpp build process in ops-experiments. The process is based on the javacpp-presets project.

  1. Native projects are placed in the native directory, and platforms and sub-projects are defined in cppbuild.sh.

  2. Each subproject has its native build commands defined in its own cppbuild.sh file. For example, see the YacuDecu cppbuild.sh and MKLFFTW cppbuild.sh. As things currently stand, only the Linux builds have been implemented. MKLFFTW uses CMake, while YacuDecu just uses a Makefile. (This is simply because I haven’t figured out how to get the linking to CUDA correct from CMake yet; however, it does demonstrate how you can use different native build tools for different projects, and define the native build steps for each OS.)

  3. Use exec-maven-plugin to execute the native builds.

  4. Define a java wrapper class to the native code, for example here is the wrapper to YacuDecu.

  5. Use the javacpp maven plugin to build and link the wrappers.
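Following the javacpp-presets convention, a subproject’s cppbuild.sh ends up looking roughly like this (a hypothetical sketch for a CMake-based subproject; the variable and target names are illustrative, not the actual ops-experiments scripts):

```shell
#!/bin/bash
# Hypothetical cppbuild.sh for one native subproject. In the
# javacpp-presets convention, the top-level cppbuild.sh sets $PLATFORM
# and invokes each subproject's cppbuild.sh in turn.
set -eu

PLATFORM=${PLATFORM:-linux-x86_64}

case $PLATFORM in
    linux-x86_64)
        mkdir -p build
        cd build
        cmake .. -DCMAKE_BUILD_TYPE=Release
        make -j"$(nproc)"
        ;;
    *)
        echo "Platform \"$PLATFORM\" is not supported yet"
        ;;
esac
```

The per-platform case statement is what lets each project mix build tools (CMake here, a plain Makefile for another subproject) while the Maven side stays uniform.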

To build the project, there are two native tools that need to be installed:

  1. Cuda 9.0
  2. MKL

Note: @hadim made a simplified branch that only has the YacuDecu wrapper, and thus only needs CUDA and not MKL.

After installing Cuda and MKL people should be able to build the project by simply typing mvn at the command line.

Currently, to install into Fiji, copy ops-experiments-0.1.0-SNAPSHOT.jar and javacpp-1.3.jar into Fiji.app/jars.

At this point you can test the installation by running DeconBenchmarkTheoreticalPSF.java.

More to come soon.


Hi all

Another update on ops-experiments. I’ve been hacking on it at the Moscow hackathon, and spent some time making the project multi-module. This should make it easier for people to build the individual examples they are interested in without having to install libraries they do not care about (i.e., you can now build the CUDA example without having to worry about installing MKL).

Thanks again to @hadim for contributing a clean Maven Linux build. And @eric-czech has contributed an example showing how to integrate TensorFlow with ops.

I also added an example showing how to deconvolve an image in sub-cells, based on this imglib2-cache example. The example is overkill for the small test image, but it would be very useful for a real application with big data that does not fit in GPU memory, and it could be adapted for a cluster or a multi-GPU system.