
Compare with optimized convolutions #40

Open
mratsim opened this issue Nov 26, 2018 · 1 comment
mratsim commented Nov 26, 2018

Hey there, promising work on a C++ JIT.

Can you compare your JIT results with state-of-the-art convolution implementations, or at least an im2col + GEMM convolution, and report the GFLOP/s reached versus the theoretical peak?

Here are all the resources I gathered regarding convolution optimisation.

The main issues with naive direct convolution are cache misses and poor utilisation of the CPU cache hierarchy.
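For reference, a minimal sketch of the naive direct convolution I mean (assuming NCHW layout, no padding, unit stride; names are illustrative). Every output point re-walks a strided window of the input, so the access pattern makes poor use of cache lines:

```cpp
// Naive direct 2D convolution: single image, NCHW layout, no padding,
// unit stride. The input is re-read once per (filter, kernel position)
// with strided accesses, which is what thrashes the cache.
void conv2d_naive(const float* in,  int C, int H, int W,
                  const float* ker, int F, int KH, int KW,  // F filters
                  float* out) {                             // F x OH x OW
  const int OH = H - KH + 1, OW = W - KW + 1;
  for (int f = 0; f < F; ++f)
    for (int oy = 0; oy < OH; ++oy)
      for (int ox = 0; ox < OW; ++ox) {
        float acc = 0.0f;
        for (int c = 0; c < C; ++c)
          for (int ky = 0; ky < KH; ++ky)
            for (int kx = 0; kx < KW; ++kx)
              acc += in[(c * H + oy + ky) * W + (ox + kx)]
                   * ker[((f * C + c) * KH + ky) * KW + kx];
        out[(f * OH + oy) * OW + ox] = acc;
      }
}
```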

In benchmarks on my CPU, an i5-5257U (2.7 GHz dual-core Broadwell supporting AVX+FMA), the theoretical compute peak is 172.8 GFLOP/s, yet a naive convolution reaches only 2.6 GFLOP/s. When the convolution is reframed as im2col + GEMM (matrix multiplication), I can reach 20+ GFLOP/s.
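For context, the peak figure comes out of: 2.7 GHz × 2 cores × 2 FMAs/cycle × 8 single-precision AVX lanes × 2 flops per FMA = 172.8 GFLOP/s. And here is a minimal im2col sketch (same assumptions as above: NCHW, no padding, unit stride). Once the input is unfolded this way, the convolution becomes a single F × (C·KH·KW) by (C·KH·KW) × (OH·OW) matrix multiply that a tuned GEMM handles with good cache behaviour:

```cpp
// im2col: unfold every KHxKW receptive field of the input into one
// column of a (C*KH*KW) x (OH*OW) matrix. The convolution then becomes
// a plain GEMM: out[F][OH*OW] = ker[F][C*KH*KW] * cols.
void im2col(const float* in, int C, int H, int W,
            int KH, int KW, float* cols) {
  const int OH = H - KH + 1, OW = W - KW + 1;
  int row = 0;
  for (int c = 0; c < C; ++c)
    for (int ky = 0; ky < KH; ++ky)
      for (int kx = 0; kx < KW; ++kx, ++row)
        for (int oy = 0; oy < OH; ++oy)
          for (int ox = 0; ox < OW; ++ox)
            cols[row * OH * OW + oy * OW + ox] =
                in[(c * H + oy + ky) * W + (ox + kx)];
}
```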

I haven't finished yet, but I hope to reach 120+ GFLOP/s by using my own BLAS, which attains 98% of OpenBLAS's speed (72.4 vs 73.8 GFLOP/s single-threaded, 136 vs 145 GFLOP/s multithreaded), and by fusing im2col with the matrix multiplication's repacking step.
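To sketch what I mean by fusing (hypothetical and heavily simplified; a real BLAS packs into architecture-specific micro-panels, which I omit here): instead of materializing the full im2col matrix to memory and then packing it again inside the GEMM, the GEMM's B-panel packing routine can compute the im2col indices on the fly:

```cpp
// Hypothetical sketch of fusing im2col into the GEMM's B-panel packing:
// rather than writing the (C*KH*KW) x (OH*OW) im2col matrix out and
// repacking it, the pack routine gathers input elements directly while
// filling the panel, saving one full pass over memory.
void pack_B_fused_im2col(const float* in, int C, int H, int W,
                         int KH, int KW,
                         int col_start, int ncols,   // slice of OH*OW
                         float* packed) {
  const int OW = W - KW + 1;
  const int K  = C * KH * KW;                 // GEMM inner dimension
  for (int j = 0; j < ncols; ++j) {
    const int col = col_start + j;
    const int oy = col / OW, ox = col % OW;   // output pixel coords
    int k = 0;
    for (int c = 0; c < C; ++c)
      for (int ky = 0; ky < KH; ++ky)
        for (int kx = 0; kx < KW; ++kx, ++k)
          packed[j * K + k] = in[(c * H + oy + ky) * W + (ox + kx)];
  }
}
```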

Other promising approaches that should reach 100+ GFLOP/s are MKL-DNN and libxsmm; the latter is described in great detail in this paper.

Halide also offers optimised JIT code generation for image processing pipelines and already relies on LLVM.

jmmartinez self-assigned this Dec 23, 2018

jmmartinez (Owner) commented

Hello,
First of all, sorry for the ridiculous delay in my response.

Do you have small reference C/C++ code I could use to benchmark any of those methods? I would love to check what happens when I use the JIT compiler on them.

One thing to take into account: the convolution benchmark I used is just there to illustrate the use of the JIT compiler. It's not meant to be a real-world scenario; the library remains a toy.

Thanks!
