A Decision Tree Based Approach to Filter Candidates for Software Engineering Jobs Using GitHub Data
A challenge for companies hiring software engineers is the large number of candidate profiles on LinkedIn, Monster.com, and other job websites, and the difficulty of filtering top candidates from these lists. In this paper, we propose a novel approach that uses the social network structure of GitHub and a decision tree algorithm to filter candidate software engineers efficiently. The approach is based on the idea that the centrality value of a node (i.e., a candidate engineer) in the graph of GitHub users is an approximate indicator of the value of the programmer. To reduce the number of candidates considered in a job selection process, a threshold centrality value can be used to filter job candidates based on their importance in the GitHub user graph. A challenge with this approach is that, since GitHub has millions of users, calculating the centrality of every node in the GitHub user graph is an expensive operation. To overcome this challenge, we train a decision tree to predict whether a user meets the centrality threshold based on a limited subset of their attributes. To generate training data for the decision tree from the unlabeled GitHub user graph, a threshold centrality value is chosen and a portion of the user graph is labeled Accepted or Rejected according to whether the corresponding user meets the threshold centrality. We also collect the total number of each kind of public GitHub event that each user has generated, and we use these event counts as the training attributes for each user in the training dataset. Once decision trees are built from this training dataset, recruiters can use them to process large numbers of software engineering job candidates and to support the judgment of HR departments. Based on empirical results from experiments that we conducted with GitHub user data, our approach can reach a precision of 96%.
Moreover, this method avoids repeating the expensive network centrality computation as the GitHub social graph changes over time.
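The labeling and training steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: it assumes normalized degree centrality as the centrality measure, uses a hypothetical five-user graph with made-up event counts, and fits only a depth-1 decision tree (a decision stump) that predicts Accepted when one event count exceeds a cut value. The function names and the threshold of 0.5 are likewise assumptions chosen for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality of an undirected edge list.

    Assumption: degree centrality stands in for whichever
    centrality measure is actually used.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {user: len(adj) / (n - 1) for user, adj in neighbors.items()}


def label_users(centrality, threshold):
    """Label each user Accepted or Rejected against the threshold."""
    return {
        user: "Accepted" if value >= threshold else "Rejected"
        for user, value in centrality.items()
    }


def train_stump(rows, labels):
    """Fit a depth-1 decision tree on event counts.

    Tries every (event type, cut) pair and keeps the one with the
    fewest training errors; ties keep the first pair found.
    """
    best = None
    features = sorted({name for row in rows for name in row})
    for feature in features:
        for cut in sorted({row.get(feature, 0) for row in rows}):
            errors = sum(
                ("Accepted" if row.get(feature, 0) > cut else "Rejected") != y
                for row, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, cut)
    return best[1], best[2]


def predict(stump, row):
    """Classify a user from event counts alone, without the graph."""
    feature, cut = stump
    return "Accepted" if row.get(feature, 0) > cut else "Rejected"


# Hypothetical toy graph and per-user public event counts.
edges = [("ann", "bob"), ("ann", "cy"), ("ann", "dee"),
         ("bob", "cy"), ("dee", "eve")]
events = {
    "ann": {"PushEvent": 40, "WatchEvent": 5},
    "bob": {"PushEvent": 12, "WatchEvent": 9},
    "cy": {"PushEvent": 15, "WatchEvent": 1},
    "dee": {"PushEvent": 9, "WatchEvent": 3},
    "eve": {"PushEvent": 2, "WatchEvent": 7},
}

centrality = degree_centrality(edges)
labels = label_users(centrality, threshold=0.5)  # assumed threshold

users = sorted(events)
stump = train_stump([events[u] for u in users], [labels[u] for u in users])
print(stump, predict(stump, {"PushEvent": 30}))
```

Once the stump is fitted, new candidates are classified from their event counts only, so the centrality of the full graph does not need to be recomputed for them. A full decision tree would apply the same split search recursively to each resulting subset.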