Java Neuroph tutorial – The code classifier

An article written a while back describes how neural networks can be used to classify source code. Yes, the same source code that you feed to compilers and interpreters.

The article explains, at a high level, a method for doing this. In the end the author claims some level of success and wonders how other neural-network implementations and techniques would solve the same problem. This made me curious enough to spend a weekend trying to crack it with Neuroph, the neural network library for Java. My analysis and results are below. For the impatient, here is the code-classifier DEMO.

Why solve this problem with neural networks?

So where do we begin? How about the question ‘Why do we need neural networks to solve this problem?’ Actually, a neural network might not be the ‘best’ way to solve it. A lexer is probably a better solution, since it can scan the source file and would likely give much better classification accuracy. But neural networks also suit this problem because they can keep learning from the input they are given. Add one more programming language to the bag and the network can learn it from examples, without a programmer needing to write extra code for a lexer.

So how do we do it?

One of the features that differentiates one kind of source code from another is the set of keywords used by the programming language in question. So let's target that. Our neural network will try to identify whether source code belongs to the Java language or the Python language.

Its input layer has one input for each keyword available in either language. Mind you, there is some overlap between the keywords: ‘if’, for example, is used in both languages. Other keywords are used in one language only.
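To make the overlap concrete, here is a minimal sketch of how such a keyword list could be built. The class name `KeywordVocabulary` and the abbreviated keyword lists are my own illustrative assumptions, not the code behind the demo; a real run would use the full keyword sets of both languages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class KeywordVocabulary {
    // Abbreviated, illustrative keyword lists (assumption: not the full sets).
    static final List<String> JAVA_KEYWORDS =
            Arrays.asList("public", "static", "class", "void", "if", "else", "import", "return");
    static final List<String> PYTHON_KEYWORDS =
            Arrays.asList("def", "elif", "lambda", "if", "else", "import", "return", "pass");

    // Union of both lists in first-seen order, so every keyword gets exactly
    // one stable input index even when both languages share it.
    static List<String> build() {
        Set<String> merged = new LinkedHashSet<>(JAVA_KEYWORDS);
        merged.addAll(PYTHON_KEYWORDS);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> vocabulary = build();
        System.out.println(vocabulary.size() + " inputs: " + vocabulary);
    }
}
```

With these sample lists the shared keywords (‘if’, ‘else’, ‘import’, ‘return’) appear once each, so the network would have 12 inputs rather than 16.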

Neural network viewed in graph mode: (The network is too huge to fit into this picture :) )

A middle layer contains a few nodes, and the network is trained by back propagation. The output layer (just one node) decides whether the language is Java or Python. Its value ranges from 0 to 1, where 0 represents closeness to Python and 1 represents closeness to Java.
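As a rough illustration of how an output between 0 and 1 comes about, here is a plain-Java sketch of a forward pass through one hidden layer and a single output node. It is not Neuroph code: the class `TinyNetwork`, the omission of bias terms, the sigmoid activation and the 0.5 cut-off in `label` are all assumptions made for the example.

```java
public class TinyNetwork {
    // Sigmoid squashes any real number into the open interval (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // hiddenWeights[h][i] is the weight from input i to hidden node h;
    // outputWeights[h] is the weight from hidden node h to the output node.
    // Bias terms are left out to keep the sketch short.
    static double forward(double[] input, double[][] hiddenWeights, double[] outputWeights) {
        double outputSum = 0.0;
        for (int h = 0; h < hiddenWeights.length; h++) {
            double hiddenSum = 0.0;
            for (int i = 0; i < input.length; i++) {
                hiddenSum += hiddenWeights[h][i] * input[i];
            }
            outputSum += outputWeights[h] * sigmoid(hiddenSum);
        }
        return sigmoid(outputSum);
    }

    // Assumed decision rule: 0.5 and above counts as Java, below as Python.
    static String label(double output) {
        return output >= 0.5 ? "java" : "python";
    }

    public static void main(String[] args) {
        double[] keywordCounts = {10, 5};                  // e.g. 'public' x10, 'static' x5
        double[][] hiddenWeights = {{0.2, 0.1}, {-0.3, 0.4}}; // arbitrary, untrained weights
        double[] outputWeights = {1.5, -0.5};
        double output = forward(keywordCounts, hiddenWeights, outputWeights);
        System.out.println(output + " -> " + label(output));
    }
}
```

The weights here are arbitrary; training is what moves them so that Java-like inputs end up near 1 and Python-like inputs near 0.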

How do we train it?

Training is made easy by the Neuroph API. First we need a few Java and Python files. I wrote a program that reads each file and splits it on whitespace. Then, based on a known set of keywords, the network is fed the number of times each keyword appears in a source file.
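The splitting and counting step could look roughly like the sketch below. The class `KeywordCounter` and its method are hypothetical names for illustration; the original program is not shown in this post.

```java
import java.util.Arrays;
import java.util.List;

public class KeywordCounter {
    // Splits the source on whitespace and counts how often each vocabulary
    // keyword appears as a whole token. counts[i] belongs to vocabulary.get(i).
    static double[] countKeywords(String source, List<String> vocabulary) {
        double[] counts = new double[vocabulary.size()];
        for (String token : source.trim().split("\\s+")) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("public", "static", "if", "return");
        String source = "public static void main(String[] args) { if (x) return; }";
        // Prints [1.0, 1.0, 1.0, 0.0]: "return;" is not counted because the
        // semicolon is still attached to the token.
        System.out.println(Arrays.toString(countKeywords(source, vocabulary)));
    }
}
```

Note the limitation of a pure whitespace split: tokens such as `return;` or `if(x)` do not match their keywords. That is one reason a real lexer would do better.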

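// Build one supervised training element per source file.
// keywordStrength presumably holds one map of keyword counts per file;
// output (the expected label) and trainingSet are defined elsewhere.
for(Map<Double, Double> map : keywordStrength)
{
    // The counts for this file become the network's input vector.
    Collection<Double> values = map.values();
    Vector<Double> input = new Vector<Double>();
    input.addAll(values);
    // Pair the input with its expected output and add it to the training set.
    SupervisedTrainingElement supervisedTrainingElement = new SupervisedTrainingElement(input,output);
    trainingSet.addElement(supervisedTrainingElement);
}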

For example, the recognized keywords might be ‘public’, represented by input index 0, and ‘static’, represented by input index 1. A Java file may contain the word public 10 times and static 5 times. The input to the network in this case would be 10 5. We can now use these numbers to tell the network ‘Hey, if you see this pattern, it is likely that the source code belongs to language X’. Neat! Once the network is trained you can load it and use the same process to classify source code from any given input String.
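To show what ‘telling the network’ amounts to, here is a deliberately simplified sketch: a single sigmoid neuron trained with per-sample gradient descent on made-up keyword counts. This is not Neuroph's back propagation and it has no hidden layer; the class `TinyTrainer`, the toy data, the learning rate and the epoch count are all assumptions chosen for illustration.

```java
public class TinyTrainer {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // The last entry of weights is the bias; the rest pair up with the inputs.
    static double predict(double[] weights, double[] input) {
        double sum = weights[weights.length - 1];
        for (int i = 0; i < input.length; i++) {
            sum += weights[i] * input[i];
        }
        return sigmoid(sum);
    }

    // After each sample, nudge every weight in the direction that reduces
    // the difference between the target label and the current prediction.
    static double[] train(double[][] inputs, double[] targets, int epochs, double rate) {
        double[] weights = new double[inputs[0].length + 1];
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int s = 0; s < inputs.length; s++) {
                double error = targets[s] - predict(weights, inputs[s]);
                for (int i = 0; i < inputs[s].length; i++) {
                    weights[i] += rate * error * inputs[s][i];
                }
                weights[weights.length - 1] += rate * error;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up counts for the keywords: public, static, def
        double[][] inputs = {{10, 5, 0}, {3, 2, 0}, {0, 0, 4}, {0, 0, 7}};
        double[] targets = {1, 1, 0, 0}; // 1 = java, 0 = python
        double[] weights = train(inputs, targets, 1000, 0.1);
        System.out.println("java-like file:   " + predict(weights, new double[]{8, 4, 0}));
        System.out.println("python-like file: " + predict(weights, new double[]{0, 0, 3}));
    }
}
```

After training, an unseen file with many ‘public’ and ‘static’ tokens scores above 0.5 and one with only ‘def’ tokens scores below it, which is the same idea the full network applies across all keywords.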

Well… show us some stats:

Before I say anything else: I did not expect this approach to succeed more than 10-20% of the time, because I trained the network with only around 20 files, Java and Python combined. But I was surprised to see the network classify the data correctly most of the time. There were very few cases where it failed.

The real test for this network is now in your hands. I built a Java web app on top of this ‘code classifier’ application, with a simple UI using jQuery. You can type in any code or junk you want and run it against the trained network. If it manages to classify your code correctly at least 60% of the time, I would consider that an enormous victory (for Neuroph?) :)

So what are you waiting for? Take it for a spin.

How do I game the results:

This is pretty simple. There are certain rules you must follow. Not following them will give you weird results :D

1. Make sure that your code does not contain comments. I do not strip comments from the code before analyzing it for keywords, so Java-like keywords in a Python file's comments will confuse the network.
2. Input only files that will compile / run correctly as Java / Python. In theory you could feed a Shakespeare novel to my network and it would still try to classify it as source code :D
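The classifier does not strip comments, but if you want to clean your input yourself before pasting it in, a naive regex-based sketch like the following would do for simple files. The class `CommentStripper` is hypothetical, and the patterns are assumptions that ignore comment markers inside string literals and do not handle Python docstrings.

```java
public class CommentStripper {
    // Removes /* ... */ blocks (across lines) and // line comments.
    // Naive: a "//" or "/*" inside a string literal is treated as a comment too.
    static String stripJavaComments(String source) {
        return source
                .replaceAll("(?s)/\\*.*?\\*/", " ")
                .replaceAll("//[^\\n]*", "");
    }

    // Removes everything from '#' to the end of the line.
    // Naive: a '#' inside a string literal is treated as a comment too.
    static String stripPythonComments(String source) {
        return source.replaceAll("#[^\\n]*", "");
    }

    public static void main(String[] args) {
        String java = "int x; // public static\n/* class */ int y;";
        String python = "x = 1 # public static\ny = 2";
        System.out.println(stripJavaComments(java));
        System.out.println(stripPythonComments(python));
    }
}
```

The two styles are kept in separate methods on purpose: `//` is floor division in Python, so running the Java stripper over Python code would delete real code.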

So does the network work for you? Feel free to leave a comment.

I also wrote a simpler tutorial for Neuroph that is less practical and more academic. Check it out if it interests you.





Categories: java
  1. brainless
    July 27th, 2010 at 12:39 | #1

    x86 assembly (wikipedia hello world sample) looks like assembly at 56% :p

  2. brainless
    July 27th, 2010 at 12:40 | #2

    in fact it’s 82%

  3. July 27th, 2010 at 16:27 | #3

    @brainless

    Yeah the possibility of wrongly classifying input data is great when the input is not pre-formatted. For example the input string “The code classifier has been release to the public” will result in the classification – java :D

    Thanks for trying out the tool though
