Skip to content

Fast String Subsequence Kernel (SSK) implementation in Python (Cython)

License

Notifications You must be signed in to change notification settings

helq/python-ssk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fast String Kernel (SSK) implementation for python

Implementation of string kernel as described in Lodhi et al (2002) (aka. SSK).

The main function is written Cython (Python with C type-level annotations) and is the second fastest implementation of the SSK Kernel that I know of. The fastest SSK implementation I found is Shogun's SubsequenceStringKernel.cpp. I copied a small trick from Shogun's implementation that speeds the computation substantially.

Requisites

  • Python
  • Cython
  • Numpy

How to use

Take a look at main.py. That hopefully should suffice.

As an example, the following is a simple wrapper around string_kernel() to use SSK in scikit-learn:

def get_ssk_kernel_for_scikit(max_substring, lambda_decay):
    def strker(il, ir):
        #print("Shape of gramm matrix to create ({},{})".format(len(il), len(ir)))
        # assuming that `il` and `ir` are lists of strings.
        # `len(il)` may fail to give you the size real size if you're using `np.array`s
        # the idea is to reshape your data to be `np.array`s of sizes (n, 1) and (m, 1)
        l = np.array(il).reshape((len(il), 1))
        r = np.array(ir).reshape((len(ir), 1))
        return string_kernel(l, r, max_substring, lambda_decay)
    return strker

max_substring = 5
lambda_decay = .8

my_ssk_kernel = get_ssk_kernel_for_scikit(max_substring, lambda_decay)

TODO

  • string_kernel should accept arbitrary lists not only python strings and numpy arrays

How does ssk() actually works?

If you're interested on how the recursive functions from the paper got converted into the very confusing, three-looped Cython function, I recommend you to check the changes in the code from the first to the fourth commits.

License

I wanted to use WTFPL to license the code, but CC0 is recommended over it as CC0 is more "legal" ;)

About

Fast String Subsequence Kernel (SSK) implementation in Python (Cython)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages