God Components in Apache Tika

How do big, bulky software components come into being? In this project, we explore the evolution of so-called God Components, a software anti-pattern in which components accumulate so many classes or lines of code over time that they become hard to maintain and reason about.

The codebase we chose is Apache Tika, a content analysis toolkit written in Java. This Jupyter Notebook provides a structured analysis of data mined using Designite and of data from the Git repository. A report on the Git contributors can be found here (made using gitinspector).

→ By Jeroen Overschie and Konstantina Gkikopouli.

Loading datasets

Import dependencies.

Load commit data.

Note that some commit authors are actually the same person, but under different aliases. Let's fix this using a simple mapping.
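A minimal sketch of such an alias mapping with pandas. The column names (`commit`, `author`), the sample rows, and the aliases are hypothetical placeholders, not values from the Tika history.

```python
import pandas as pd

# Hypothetical commit data; the notebook loads the real data from file.
commits = pd.DataFrame({
    "commit": ["a1", "b2", "c3", "d4"],
    "author": ["Jane Doe", "jdoe", "John Smith", "J. Doe"],
})

# Hypothetical alias mapping: every alias points to one canonical name.
author_aliases = {
    "jdoe": "Jane Doe",
    "J. Doe": "Jane Doe",
}

# replace() swaps exact matches and leaves all other names untouched.
commits["author"] = commits["author"].replace(author_aliases)
```

Keeping the mapping in a plain dictionary makes it easy to review and extend when new aliases turn up.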

Drop some columns we don't need.

Make sure the date values are indeed parsed as datetimes.
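One way to do this check, sketched under the assumption that the commit dates live in a string column called `date` (a hypothetical name) and carry a UTC offset:

```python
import pandas as pd

# Hypothetical commit data with dates still stored as strings.
commits = pd.DataFrame({
    "commit": ["a1", "b2"],
    "date": ["2010-03-01 12:30:00+00:00", "2011-07-15 08:00:00+00:00"],
})

# Convert to timezone-aware datetimes, normalised to UTC.
commits["date"] = pd.to_datetime(commits["date"], utc=True)

# Fail early if the column is not a datetime column after conversion.
assert pd.api.types.is_datetime64_any_dtype(commits["date"])
```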

Load the aggregated report file, all_reports.csv. It contains Designite report data for every single Tika commit.

Drop some columns.

Add commits to report data, combining them into one big dataset, gcdata.
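The combination step could look roughly like the left join below. Only the name `gcdata` comes from the notebook; the join key `commit` and all other column names and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical commit metadata: one row per commit.
commits = pd.DataFrame({
    "commit": ["a1", "b2"],
    "author": ["Jane Doe", "John Smith"],
    "date": pd.to_datetime(["2010-03-01", "2011-07-15"]),
})

# Hypothetical report data: one row per component per commit.
reports = pd.DataFrame({
    "commit": ["a1", "a1", "b2"],
    "component": ["org.example.parser", "org.example.core", "org.example.parser"],
    "n_classes": [31, 40, 35],
})

# Attach commit metadata to every report row. validate= raises an error
# if a commit hash unexpectedly appears more than once in `commits`.
gcdata = reports.merge(commits, on="commit", how="left", validate="many_to_one")
```

A left join keeps every report row, so report rows without a matching commit show up with missing author and date values instead of silently disappearing.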

God component lifetime

General statistics on lifetime.

Average over the above stats
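As a sketch of how such lifetime statistics might be derived, the example below takes the first and last commit date at which each God Component is reported and averages the resulting lifetimes. The column names and dates are hypothetical, and defining lifetime as "last seen minus first seen" is an assumption rather than necessarily the definition used here.

```python
import pandas as pd

# Hypothetical data: one row per (God Component, commit date) observation.
gcdata = pd.DataFrame({
    "component": ["p.a", "p.a", "p.a", "p.b", "p.b"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-06-01", "2011-01-01",
        "2012-01-01", "2012-01-31",
    ]),
})

# First and last observation per component.
lifetime = gcdata.groupby("component")["date"].agg(first_seen="min", last_seen="max")

# Lifetime in whole days, then the average over all components.
lifetime["days"] = (lifetime["last_seen"] - lifetime["first_seen"]).dt.days
mean_days = lifetime["days"].mean()
```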

Total number of God Components

Number of classes per God Component

Computing the chronological difference (delta) in # classes
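A possible way to compute this delta is a per-component `diff()` over rows sorted by date. The column names `component`, `date`, and `n_classes` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical class counts per God Component at successive commits.
gcdata = pd.DataFrame({
    "component": ["p.a", "p.b", "p.a", "p.b", "p.a"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-01", "2010-02-01", "2010-02-01", "2010-03-01",
    ]),
    "n_classes": [30, 50, 35, 50, 33],
})

# Sort chronologically within each component, then take the difference
# with the previous observation of the same component.
gcdata = gcdata.sort_values(["component", "date"])
gcdata["delta"] = gcdata.groupby("component")["n_classes"].diff()
```

The first observation of each component has no predecessor, so its delta is `NaN`. Whether that should count as zero or as the initial size is a modelling choice.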

Save a small version of the God Component # classes dataframe, keeping only rows where the absolute difference is > 3

God Component growth in terms of Lines Of Code (LOC)

Load data on Lines Of Code for every God Component at every commit.

Add commit datetime.

Investigating what developers contribute to GC buildup

How many developers contributed to God Components? We aim to answer this question in terms of both God Component (1) buildup and (2) refactoring. We do this by considering the # classes added and # classes removed by each developer.
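The per-developer tally can be sketched as follows, assuming each row holds the change in # classes (`delta`) that one commit caused in a God Component. Positive deltas count as buildup and negative deltas as refactoring; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-commit class deltas, labelled with the commit author.
changes = pd.DataFrame({
    "author": ["Jane Doe", "John Smith", "Jane Doe", "John Smith", "Ann Lee"],
    "delta": [5, -2, 3, -4, 0],
})

# Split each delta into an "added" and a "removed" part.
changes["added"] = changes["delta"].clip(lower=0)
changes["removed"] = -changes["delta"].clip(upper=0)

# Sum both parts per developer.
per_dev = changes.groupby("author")[["added", "removed"]].sum()

# Developers with at least one class added to or removed from a God Component.
n_contributors = int(((per_dev["added"] > 0) | (per_dev["removed"] > 0)).sum())
```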

Also, show that only a few developers work on the project as a whole. Developers versus LOC added/removed:

Jira Issues analysis

Add the Tika Jira issue tracker information as a data source.

Keep only issue key and issue type columns.

Generally, how many and what types of issues are in Apache Tika's Jira issue tracker?

... of which these numbers are involved in God Component commits:

... which is this percentage:
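A rough sketch of these counts and percentages, assuming the set of issue keys referenced by God Component commits is already known. The keys, the issue types, and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical issue tracker export: issue key and issue type only.
issues = pd.DataFrame({
    "key": ["TIKA-1", "TIKA-2", "TIKA-3", "TIKA-4"],
    "type": ["Bug", "Improvement", "Bug", "New Feature"],
})

# Hypothetical set of issue keys that appear in God Component commits.
gc_issue_keys = {"TIKA-1", "TIKA-4"}

# Issue counts per type: overall, and restricted to God Component commits.
all_counts = issues["type"].value_counts()
in_gc = issues["key"].isin(gc_issue_keys)
gc_counts = issues.loc[in_gc, "type"].value_counts()

# Share per type (types absent from GC commits become 0) and overall share.
pct_per_type = (gc_counts / all_counts * 100).fillna(0)
pct_overall = in_gc.mean() * 100
```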

We can also check which issue types are most represented in the God Component commits:

Build a pivot table and show heatmap.
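The pivot step might look like the frequency table below, built with `pd.crosstab`. Using God Components as rows and issue types as columns is an assumption, as are the names and values.

```python
import pandas as pd

# Hypothetical rows: one per (God Component commit, linked issue).
gc_issues = pd.DataFrame({
    "component": ["p.a", "p.a", "p.b", "p.b", "p.b"],
    "type": ["Bug", "Bug", "Bug", "Improvement", "Improvement"],
})

# Count how often each issue type occurs for each component.
# Combinations that never occur are filled with 0.
pivot = pd.crosstab(gc_issues["component"], gc_issues["type"])
```

The resulting table can be passed directly to a heatmap plotting function such as seaborn's `heatmap`.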

Issue types that contribute to GC buildup

This time around, only include those commits that actually 'build up' or 'decrease' the size of a God Component, i.e. those commits that add classes to or remove classes from a GC.
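This filter can be sketched as below, assuming every God Component commit row carries the issue type and the change in # classes (`delta`). All column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical God Component commits with linked issue type and class delta.
gc_commits = pd.DataFrame({
    "commit": ["a1", "b2", "c3", "d4"],
    "type": ["Bug", "Improvement", "New Feature", "Bug"],
    "delta": [0, 4, 2, -3],
})

# Keep only commits that change the number of classes in a God Component.
changed = gc_commits[gc_commits["delta"] != 0].copy()

# Label each remaining commit as buildup (classes added) or decrease.
changed["direction"] = changed["delta"].gt(0).map({True: "buildup", False: "decrease"})

# Issue types versus direction of change.
by_type = pd.crosstab(changed["type"], changed["direction"])
```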