Java Naive Bayes Classifier (JNBC)

We’ve released a new version of the open-source Java Naive Bayes Classifier (JNBC), which can now use RocksDB as its fast key-value store. It is released under the LGPLv3 license and is available from the Maven Central Repository. Learn more at https://naivebayesclassifier.org/

You can find several explanations on the Internet of how Naive Bayes classification works. The example we chose for illustration is the classic ‘Play/Don’t Play Golf’ prediction based on weather conditions (outlook, temperature, humidity and wind). You can find all the calculation details here (PDF).

Why Naive Bayes?

Naive Bayes is relatively simple to understand and does an excellent job at classification, especially when the training sets are large. It also tends not to overfit easily. The algorithm is fast and computationally efficient, and the simplicity of the underlying statistical formulas makes prediction results easy to explain.

Why a new implementation?

Every machine learning framework has its own implementation of Naive Bayes classification, but we wanted a clean, self-contained implementation with the following features:

  • simplicity
  • scalability
  • explainability

Simplicity: to represent the statistics of past observations, we use a single large key-value store.

Excerpt from ML-Classification-NaiveBayes-2014.pdf

The tables above, which summarize the observations, can be represented as a list of key-value pairs, each with a path (the key) and a counter (the value):

key (path)                              value (counter)
~gL                                     14
~gL//~cA//Yes                           9
~gL//~cA//Yes//~fE//wind=Weak           6
~gL//~cA//Yes//~fE//wind=Strong         3
~gL//~cA//Yes//~fE//wind                9
~gL//~cA//Yes//~fE//temp=Mild           4
~gL//~cA//Yes//~fE//temp=Hot            2
~gL//~cA//Yes//~fE//temp=Cool           3
~gL//~cA//Yes//~fE//temp                9
~gL//~cA//Yes//~fE//outlook=Sunny       2
~gL//~cA//Yes//~fE//outlook=Rain        3
~gL//~cA//Yes//~fE//outlook=Overcast    4
~gL//~cA//Yes//~fE//outlook             9
~gL//~cA//Yes//~fE//humidity=Normal     6
~gL//~cA//Yes//~fE//humidity=High       3
~gL//~cA//Yes//~fE//humidity            9
~gL//~cA//No                            5
~gL//~cA//No//~fE//wind=Weak            2
~gL//~cA//No//~fE//wind=Strong          3
~gL//~cA//No//~fE//wind                 5
~gL//~cA//No//~fE//temp=Mild            2
~gL//~cA//No//~fE//temp=Hot             2
~gL//~cA//No//~fE//temp=Cool            1
~gL//~cA//No//~fE//temp                 5
~gL//~cA//No//~fE//outlook=Sunny        3
~gL//~cA//No//~fE//outlook=Rain         2
~gL//~cA//No//~fE//outlook              5
~gL//~cA//No//~fE//humidity=Normal      1
~gL//~cA//No//~fE//humidity=High        4
~gL//~cA//No//~fE//humidity             5
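To make the key layout concrete, here is a minimal sketch of how one observation could update such counters. It is not the JNBC API: the class `PathCounterSketch`, the method `learn` and the use of a `TreeMap` are hypothetical choices for illustration, and only the key format (`~gL`, `//~cA//`, `//~fE//`) is taken from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not the JNBC API): builds path counters like those in the table. */
class PathCounterSketch {
    static final String ROOT = "~gL";      // global counter: all observations
    static final String CAT = "//~cA//";   // separator before a category (class) name
    static final String FEAT = "//~fE//";  // separator before a feature name

    /** Sorted map so that printed keys are grouped by path prefix. */
    final Map<String, Long> counters = new TreeMap<>();

    private void increment(String key) {
        counters.merge(key, 1L, Long::sum);
    }

    /** Records one observation: a category plus its feature name/value pairs. */
    void learn(String category, Map<String, String> features) {
        increment(ROOT);
        String categoryKey = ROOT + CAT + category;
        increment(categoryKey);
        for (Map.Entry<String, String> feature : features.entrySet()) {
            String featureKey = categoryKey + FEAT + feature.getKey();
            increment(featureKey);                             // e.g. ...//wind
            increment(featureKey + "=" + feature.getValue());  // e.g. ...//wind=Weak
        }
    }

    public static void main(String[] args) {
        PathCounterSketch sketch = new PathCounterSketch();
        Map<String, String> day = new LinkedHashMap<>();
        day.put("outlook", "Sunny");
        day.put("wind", "Weak");
        sketch.learn("No", day);
        sketch.counters.forEach((key, count) -> System.out.println(key + " " + count));
    }
}
```

Each observation touches one global counter, one category counter and two counters per feature, so training amounts to a handful of increments per fact.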

For Laplace smoothing, we store a few additional counters, such as the number of distinct values for a particular feature (e.g. Outlook=2 and Temp=3).
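As an illustration of why the distinct-value counter is needed, the sketch below applies the textbook form of additive (Laplace) smoothing to a single likelihood. The class and method names are hypothetical, and the exact formula used inside JNBC may differ; the counts in `main` are the `temp` counters for the `No` category from the table, with 3 distinct temperature values.

```java
/** Hypothetical sketch of additive (Laplace) smoothing; not taken from the JNBC source. */
class LaplaceSmoothingSketch {

    /**
     * Smoothed estimate of P(feature=value | category).
     *
     * @param valueCount     counter for category + feature=value
     * @param featureCount   counter for category + feature (all values)
     * @param distinctValues number of distinct values of the feature
     * @param alpha          smoothing strength (1.0 is classic Laplace smoothing)
     */
    static double smoothedLikelihood(long valueCount, long featureCount,
                                     long distinctValues, double alpha) {
        return (valueCount + alpha) / (featureCount + alpha * distinctValues);
    }

    public static void main(String[] args) {
        // temp=Hot given No: (2 + 1) / (5 + 1 * 3)
        System.out.println(smoothedLikelihood(2, 5, 3, 1.0));
        // A value never observed with this category still gets a non-zero estimate:
        // (0 + 1) / (5 + 1 * 3)
        System.out.println(smoothedLikelihood(0, 5, 3, 1.0));
    }
}
```

Without smoothing, a single zero counter would force the whole product of likelihoods for that category to zero, whatever the other features say.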

Scalability: the library comes in two implementation flavors:

  • NaiveBayesClassifierTransientImpl is very fast because it stores the key-value pairs in memory, using a ConcurrentHashMap. It can scale to billions of facts on a computer with enough memory (we use servers with 148GB RAM in production);
  • NaiveBayesClassifierRocksDBImpl scales vertically, since the key-value store can live on a hard drive or SSD, and it remains very fast because it relies on the RocksDB key-value store. NB: the previous version used LevelDB, but we decided to move to RocksDB.
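The idea of one counter store with interchangeable backends can be sketched as follows. This is an assumption about the general design, not the actual JNBC class hierarchy: `CounterStore`, `InMemoryCounterStore` and `concurrentDemo` are hypothetical names, and only the in-memory variant built on `ConcurrentHashMap` is shown. A disk-backed variant would implement the same interface on top of a persistent key-value store.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical abstraction over the counter storage; not the JNBC interface. */
interface CounterStore {
    long increment(String key);
    long get(String key);
}

/** In-memory flavor: ConcurrentHashMap.merge performs the update atomically per key. */
class InMemoryCounterStore implements CounterStore {
    private final ConcurrentMap<String, Long> counters = new ConcurrentHashMap<>();

    @Override
    public long increment(String key) {
        return counters.merge(key, 1L, Long::sum);
    }

    @Override
    public long get(String key) {
        return counters.getOrDefault(key, 0L);
    }

    /** Increments one key from several threads and returns the final count. */
    static long concurrentDemo(int threads, int incrementsPerThread) {
        CounterStore store = new InMemoryCounterStore();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    store.increment("~gL");
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return store.get("~gL");
    }

    public static void main(String[] args) {
        System.out.println(concurrentDemo(4, 1000));
    }
}
```

Because every update is a single atomic increment on one key, several threads can learn from different observations at the same time without losing counts.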

Explainability: the formula used to make a prediction is very simple. All you need to explain a decision is the algorithm input (the feature values) and a snapshot of the key-value store.
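For reference, the standard Naive Bayes rule and a worked example using the golf counters listed earlier are given below. The example scores a day with outlook=Sunny, temp=Cool, humidity=High and wind=Strong, and is deliberately computed without Laplace smoothing to keep the arithmetic readable, so the figures JNBC itself returns may differ.

```latex
% Posterior of category c given feature values x_1..x_n (naive independence assumption)
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

% Unsmoothed scores read directly from the counters
\mathrm{score}(\mathrm{Yes}) = \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \approx 0.0053
\mathrm{score}(\mathrm{No})  = \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{1}{5} \cdot \tfrac{4}{5} \cdot \tfrac{3}{5} \approx 0.0206

% Normalizing the two scores
P(\mathrm{No} \mid x) = \frac{0.0206}{0.0053 + 0.0206} \approx 0.80
```

Every factor is a ratio of two counters from the store, which is why the inputs plus a snapshot of the counters are enough to reproduce a decision.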

What’s the road map for future developments?

Firstly, we’d like to have a better interface for the library, especially for the explainability part. We could, for example, store an audit trail of the algorithm inputs (the feature values), the key-value store snapshot and the algorithm outputs. This can be a critical feature for organisations (banks, recruiters, etc.) that need to explain any past decision made using so-called artificial intelligence (A.I.), especially with respect to any gender, ethnic or racial bias.

Secondly, we’d like to improve the functional test cases and add some technical tests (performance, cache tuning etc.), as we will migrate from our current LevelDB implementation to the new RocksDB implementation.

Thirdly, we might consider offering horizontal scalability using a distributed key-value store like Redis.

If you would like to contribute, please feel free to join the open source project on GitHub at https://github.com/namsor/Java-Naive-Bayes-Classifier-JNBC

About NamSor

NamSor™ Applied Onomastics is a European vendor of sociolinguistics software (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people. We proudly support Gender Gap Grader.
Reach us at: contact@namsor.com