Projects



Source code retrieval for bug localization


The goal of automatic bug localization is to find the source code files in a large software repository that are responsible for a reported bug. I use Information Retrieval (IR) techniques to build a software search engine: given a bug report as the query, an IR-based bug localization system returns a ranked list of relevant source code files. The traditional approach in bug localization research is the well-known Bag-of-Words (BOW) model. A BOW model considers only the frequencies of the terms that appear in bug reports and source code files; the position and ordering of those terms carry no meaning. This is a problem because words in any language, whether a programming language or a natural language, have context. A term-term dependency model that imposes positional and ordering constraints is therefore needed. I use a Markov Random Field (MRF) based approach to enforce position and ordering constraints in the IR model. If you are interested in how MRF can be used to build a search engine, you will enjoy reading our MRF paper. For datasets, you can refer to the BUGLinks dataset or the moreBugs dataset.
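To make the difference between BOW and a term-term dependency model concrete, here is a minimal sketch. It is not the MRF model from the paper: the function names, the whitespace tokenizer, and the weights 0.8 and 0.2 are all assumptions chosen for illustration. It only shows that a plain term-frequency score cannot tell two documents apart when they contain the same words in a different order, while an added ordered-pair term can.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bow_score(query_terms, doc_terms):
    """Bag-of-Words: sum of query-term frequencies in the document."""
    counts = Counter(doc_terms)
    return sum(counts[term] for term in query_terms)


def ordered_pair_score(query_terms, doc_terms):
    """Count how often adjacent query terms appear adjacent, in order, in the document."""
    doc_pairs = Counter(zip(doc_terms, doc_terms[1:]))
    return sum(doc_pairs[pair] for pair in zip(query_terms, query_terms[1:]))


def dependency_score(query, doc, weight_terms=0.8, weight_pairs=0.2):
    """Weighted mix of the two scores; the weights are illustrative assumptions."""
    q, d = tokenize(query), tokenize(doc)
    return weight_terms * bow_score(q, d) + weight_pairs * ordered_pair_score(q, d)


query = "null pointer exception"
doc_a = "null pointer exception thrown in parser"
doc_b = "exception pointer null thrown in parser"

# BOW sees the two documents as equally relevant.
bow_a = bow_score(tokenize(query), tokenize(doc_a))
bow_b = bow_score(tokenize(query), tokenize(doc_b))

# The ordering-aware score prefers the document that preserves the query's word order.
dep_a = dependency_score(query, doc_a)
dep_b = dependency_score(query, doc_b)
```

In this toy example both documents get the same BOW score, but only `doc_a` is rewarded for containing "null pointer" and "pointer exception" as ordered pairs.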

I also created SCOR, a source code retrieval tool that combines the power of MRF with semantic word embeddings to enhance retrieval precision. SCOR was published at the MSR conference in 2019. I have also published the semantic word embeddings used in SCOR online. The SCOR word embeddings were trained with the popular word2vec algorithm on 35,000 Java repositories that contained over 30 million source code files and 1 billion software term tokens. They provide semantic word vectors for half a million software vocabulary terms.
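As a rough illustration of how word embeddings can help retrieval, the sketch below scores a document by how close its terms are, in vector space, to the query terms. It is a simplified stand-in, not SCOR's actual scoring function: the two-dimensional toy vectors, the `semantic_score` name, and the "best match per query term, then average" rule are assumptions made for this example.

```python
import math

# Toy 2-dimensional vectors standing in for real word2vec embeddings (hypothetical values).
TOY_EMBEDDINGS = {
    "delete": [1.0, 0.0],
    "remove": [0.9, 0.1],
    "render": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_score(query_terms, doc_terms, embeddings):
    """Average, over query terms, of the best cosine match among the document terms."""
    best_matches = []
    for q in query_terms:
        if q not in embeddings:
            continue  # skip terms outside the vocabulary
        sims = [cosine(embeddings[q], embeddings[d]) for d in doc_terms if d in embeddings]
        if sims:
            best_matches.append(max(sims))
    return sum(best_matches) / len(best_matches) if best_matches else 0.0


# "delete" never appears in either document, so an exact-match model scores both as 0.
score_related = semantic_score(["delete"], ["remove"], TOY_EMBEDDINGS)
score_unrelated = semantic_score(["delete"], ["render"], TOY_EMBEDDINGS)
```

With these toy vectors, a file that says "remove" scores well for a bug report that says "delete", even though the two share no exact term.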

In addition to SCOR, I created Bugzbook, a large-scale and diverse bug localization dataset containing over 20,000 bug reports from projects written in Java, C/C++, and Python. Bugzbook will also be made available here. In experiments on Bugzbook with eight retrieval algorithms, we showed that SCOR outperforms all previous algorithms, including the MRF-based ones.
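Comparisons like this rest on a ranking metric computed per bug report and averaged over the dataset. The text above does not say which metric was used, so the sketch below assumes Mean Average Precision (MAP), a common choice in IR evaluation; the file names and rankings are made up.

```python
def average_precision(ranked_files, relevant_files):
    """Average of precision values at each rank where a relevant file is found."""
    if not relevant_files:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, path in enumerate(ranked_files, start=1):
        if path in relevant_files:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_files)


def mean_average_precision(results):
    """Mean of average precision over (ranking, relevant set) pairs, one per bug report."""
    if not results:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in results) / len(results)


# Two hypothetical bug reports with the ranking an algorithm returned for each.
results = [
    (["A.java", "B.java", "C.java"], {"A.java", "C.java"}),
    (["X.py", "Y.py"], {"Y.py"}),
]
map_score = mean_average_precision(results)
```

An algorithm that places the truly buggy files nearer the top of its ranked lists gets a higher MAP.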



Mapsd: A map of the world based on software development activity


Using open source GitHub repositories and the location information associated with each GitHub developer account, I created a map of the world based on the software development activity happening in different parts of the world. The Medium blog post discussing the method and results is publicly available.
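The core aggregation step behind such a map can be sketched as follows. This is only an illustration under stated assumptions, not the method from the blog post: the lookup table, the record fields `location` and `repos`, and the choice of repository count as the activity measure are all hypothetical.

```python
from collections import Counter

# Tiny hypothetical lookup table; a real pipeline would need a proper geocoder.
LOCATION_TO_COUNTRY = {
    "san francisco": "United States",
    "berlin": "Germany",
    "bangalore": "India",
    "bengaluru": "India",
}


def country_of(location):
    """Map a free-text profile location to a country, or None if it is unknown."""
    if not location:
        return None
    key = location.strip().lower()
    # Try the whole string first, then the part before the first comma.
    for candidate in (key, key.split(",")[0].strip()):
        if candidate in LOCATION_TO_COUNTRY:
            return LOCATION_TO_COUNTRY[candidate]
    return None


def activity_by_country(developers):
    """Sum a per-developer activity measure (here: repository count) per country."""
    totals = Counter()
    for dev in developers:
        country = country_of(dev.get("location"))
        if country is not None:
            totals[country] += dev.get("repos", 0)
    return totals


developers = [
    {"login": "dev1", "location": "Berlin, Germany", "repos": 4},
    {"login": "dev2", "location": "Bengaluru", "repos": 7},
    {"login": "dev3", "location": "bangalore, India", "repos": 2},
    {"login": "dev4", "location": None, "repos": 9},
]
totals = activity_by_country(developers)
```

Accounts with a missing or unrecognized location are simply skipped here, which is one of several possible ways to handle them.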



3D Modeling of Dormant Fruit Trees


Creating a highly accurate 3D model of a fruit tree is a challenging task. The ultimate goal of this project is to automate the pruning of fruit trees. For this precision agriculture task, a 3D model of the tree is required to locate candidate branches to prune. For details about this project, please refer to the project webpage here. If you want to play with Kinect2 depth images of indoor and outdoor dormant trees, you can download the datasets here, here, and here. If you use these datasets, please give us credit by citing our work. Collecting depth images of outdoor orchard trees was quite an experience, and in the snow at that! Most of the source code is in MATLAB and C++.
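If you want a starting point for working with depth images like these, a first step is usually to back-project each depth pixel into a 3D point. The sketch below uses the standard pinhole camera model in plain Python. It is not the project's code: the intrinsics (`fx`, `fy`, `cx`, `cy`) are placeholder values rather than real Kinect2 calibration, and depth is assumed to be stored in millimeters with 0 meaning "no reading".

```python
def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image (rows of millimeter values) to (x, y, z) points in meters."""
    points = []
    for v, row in enumerate(depth_mm):      # v: pixel row
        for u, d in enumerate(row):         # u: pixel column
            if d <= 0:
                continue                    # assumed convention: 0 means no depth reading
            z = d / 1000.0                  # millimeters to meters
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A 2x2 toy depth image with two valid pixels; the intrinsics are placeholders.
toy_depth = [
    [0, 1000],
    [2000, 0],
]
cloud = depth_to_points(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

The resulting list of points can then be passed to whatever point cloud tooling you prefer for filtering and modeling.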