Analyzing a codebase through Git history

Date: October 24, 2016

Reading time: 10 min

Author: Jarosław Porwoł

A few weeks ago I participated in one of the most popular developer conferences in Poland: DevDay 2016. The conference is organized by ABB in Cracow, and this year we saw the 6th edition of what has become a two-day conference with three rooms full of people and speakers from all over the world.

One of the sessions caught my eye: Seven Secrets of Maintainable Codebases by Adam Tornhill. While I didn’t really count all those secrets, a few things were really worth noting.

What Adam was talking about is not really how to organise your code or how to build maintainable systems. He assumed that everyone knows that or, at the very least, researches it on their own, aspiring to be a better software engineer. Instead, he concentrated on something very dear to the hearts of masses of developers all over the world – legacy code. Let’s be honest – legacy code isn’t just something that a previous team created and left for you to maintain. At least not all legacy code is that kind of code. A big part of legacy is right here, under your nose. Code you and I create every day, writing quick and dirty proofs of concept that somehow get moved to production. Code without tests and documentation. Design decisions made in discussions and staying only in the minds of the few people who took part in them, never to be stored in code comments, documents, Confluence or whatever it is for you. Yet we have to keep this code alive, since trying to rewrite the software each time and discovering the complexity of such an approach is the shortest path to losing one’s mind.

There is another way, though. Small improvements. Incremental steps forward, where the goal is to leave the place a little bit better every day. Being agile about code, dare I say. You go and find one place that you can fix and you do that. You go about fixing bugs, and when you see some dragons (https://en.wikipedia.org/wiki/Here_be_dragons) – you fight them and conquer them. However, there are some gotchas. It is easy to get carried away, so be modest with your changes, make sure you know why something is the way it is, and take care not to break unspoken rules, which could create even more issues with the code. And most importantly – pick your battles.

But how do you know which battle is worth fighting? Most of us believe we know bad code when we see it. Those few hundred lines of code, with ifs all over the place, nested five levels deep? You get it, that’s asking for trouble – and you and I both want to fix this. Do you remember the code that you see from time to time when debugging, and that gives you the creeps? We both know we want to get our hands dirty and fix it – after all, every problem seems to touch it.

Is it really so, though? Adam proposes a different approach. A scientific one. Get your data and make it talk to you, reveal what is really going on. After all – we have all the data we need, right at our fingertips. Every one of us uses some kind of version control tool. Why not go right there and get hard evidence? One of the simplest things to look for would be the rate of change for each file. Or simply the number of changes that have been made to a file during the project’s lifetime. If some files are constantly being changed, doesn’t that sound like trouble? Of course, that is a simplified view of things; after all, the file might just be a csproj file, updated each time you add a new class or database migration. It may be a file that was changing a lot, but that you have just finished refactoring. However, that would still be a good place to start, I think.

Any source control system offers ways to access data on file history. Given its popularity around the world and in our company, I’ve decided to figure out how to get this data from git. It stores every version of every file since the project’s start, and you can access this historical data easily by looking through git’s history of commits. Every commit carries information about what has changed and what the state of the project was (meaning: how every file in the project looked at the time of the commit), as well as who made the change, on what day and at what time, etc. More than enough to get us going.

I’ve picked the LibGit2Sharp library to help me access git data programmatically. This library is actually a whole git client on its own – which means it doesn’t communicate with your command line tools but works directly on git files on disk. You could also do it by parsing the text output of git commands run on the command line, but using an object model seems to simplify things.
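For comparison, the command-line route mentioned above could look roughly like the sketch below. It is not the approach used in the rest of this post, just an illustration: it assumes `git` is available on the PATH, runs `git log --name-only --pretty=format:` (which prints one changed path per line, with blank lines between commits) and counts how often each path appears. `GitLogParser` and `CountChanges` are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class GitLogParser
{
    // Counts how often each path appears in the text output of
    // `git log --name-only --pretty=format:`. Blank lines only
    // separate commits, so they are ignored.
    public static Dictionary<string, int> CountChanges(IEnumerable<string> lines)
    {
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .GroupBy(path => path)
            .ToDictionary(group => group.Key, group => group.Count());
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path to a git working directory.
        var startInfo = new ProcessStartInfo("git", "log --name-only --pretty=format:")
        {
            WorkingDirectory = args[0],
            RedirectStandardOutput = true,
            UseShellExecute = false
        };

        using (var git = Process.Start(startInfo))
        {
            // Read everything git prints before waiting for it to exit.
            var lines = new List<string>();
            string line;
            while ((line = git.StandardOutput.ReadLine()) != null)
            {
                lines.Add(line);
            }
            git.WaitForExit();

            foreach (var entry in CountChanges(lines).OrderByDescending(e => e.Value))
            {
                Console.WriteLine($"{entry.Value}\t{entry.Key}");
            }
        }
    }
}
```

Keep in mind that the numbers may differ slightly from a tree-based comparison: by default `git log` lists no files for merge commits, and it may quote paths that contain unusual characters.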

The idea is simple – go through the whole history of the project in git and, for each commit, get a list of files that were changed. Then summarise this data by counting the number of occurrences of each file. That will give us a nice picture of what is going on. Let me share my solution with you. In two languages – C# and F#.

C#:

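    using System;
    using System.Collections.Generic;
    using System.Linq;
    using LibGit2Sharp;

    class Program
    {
        // Maps a file path to the number of commits that touched it.
        private static Dictionary<string, int> filesFrequency = new Dictionary<string, int>();

        static void Main(string[] args)
        {
            // args[0] is the path to the repository to analyze.
            using (var repo = new Repository(args[0]))
            {
                foreach (var log in repo.Commits.QueryBy(new CommitFilter() { SortBy = CommitSortStrategies.Time }))
                {
                    if (log.Parents.Any())
                    {
                        // Compare with the first parent to find the files changed by this commit.
                        var oldTree = log.Parents.First().Tree;
                        var changes = repo.Diff.Compare<TreeChanges>(oldTree, log.Tree);
                        foreach (var change in changes)
                        {
                            UpdateFileFrequency(change.Path);
                        }
                    }
                    else
                    {
                        // The root commit has no parent, so every file in its tree counts as changed.
                        foreach (var entry in log.Tree)
                        {
                            CountUpdatedFilesFrequency(entry);
                        }
                    }
                }
            }

            // Print the files, most frequently changed first.
            foreach (var f in filesFrequency.Select(f => new { Frequency = f.Value, File = f.Key })
                                            .OrderByDescending(f => f.Frequency))
            {
                Console.WriteLine($"{f.Frequency}\t{f.File}");
            }

            Console.ReadKey();
        }

        // Walks a tree entry recursively and counts every file found below it.
        private static void CountUpdatedFilesFrequency(TreeEntry file)
        {
            if (file.Mode == Mode.Directory)
            {
                foreach (var child in (Tree)file.Target)
                {
                    CountUpdatedFilesFrequency(child);
                }
            }
            else
            {
                UpdateFileFrequency(file.Path);
            }
        }

        // Increments the counter for a file; a path seen for the first time starts at 0.
        private static void UpdateFileFrequency(string file)
        {
            int frequency;
            filesFrequency.TryGetValue(file, out frequency);
            filesFrequency[file] = frequency + 1;
        }
    }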

F#:

 
    open System
    open LibGit2Sharp

    // Recursively collects the paths of all files below a tree entry.
    let rec getFileNamesFromTree (treeEntry: TreeEntry) : seq<string> =
        match treeEntry.Mode with
        | Mode.Directory ->
            treeEntry.Target :?> Tree
            |> Seq.collect getFileNamesFromTree
        | _ -> Seq.singleton treeEntry.Path

    // The root commit has no parent, so every file in its tree counts as changed.
    let getRepoFiles (log: Commit) =
        log.Tree
        |> Seq.collect getFileNamesFromTree

    // Compares a commit with its first parent and returns the changed paths.
    let getCommitChangedFiles (repo: Repository) (log: Commit) =
        let parent = log.Parents |> Seq.head
        repo.Diff.Compare<TreeChanges>(parent.Tree, log.Tree)
        |> Seq.map (fun (change: TreeEntryChanges) -> change.Path)

    let getChangedFiles (repo: Repository) (log: Commit) =
        if Seq.isEmpty log.Parents then getRepoFiles log
        else getCommitChangedFiles repo log

    [<EntryPoint>]
    let main argv =
        printfn "%A" argv
        // argv.[0] is the path to the repository to analyze.
        use repo = new Repository(argv.[0])
        let filter = new CommitFilter()
        filter.SortBy <- CommitSortStrategies.Time
        repo.Commits.QueryBy(filter)
        |> Seq.collect (getChangedFiles repo)
        |> Seq.countBy id
        |> Seq.sortByDescending snd
        |> Seq.iter (fun (file, count) -> printfn "%s %i" file count)

        Console.ReadKey() |> ignore

        0

And here are the results of running this code on one of my side projects. It gives a list of all files that have ever belonged to the project, along with the number of commits in which a given file was changed. The list is ordered from the most frequently changed files to the ones that were barely touched.


	83 PaCode.Raim\Content\Arena.js
	67 PaCode.Raim\Content\raimGraphics.js
	51 PaCode.Raim\Model\Arena.cs
	43 PaCode.Raim\Home\RaimHub.cs
	41 PaCode.Raim\PaCode.Raim.csproj
	41 PaCode.Raim\Model\Player.cs
	34 PaCode.Raim\Content\Raim.js
	26 PaCode.Raim\Content\userInput.js
	25 PaCode.Raim\Home\Index.html
	24 PaCode.Raim\Model\Bullet.cs
	22 PaCode.Raim\Model\MapGenerator.cs
	19 PaCode.Raim\Content\raim.css
	14 PaCode.Raim\Content\PlayersList.js
	12 PaCode.Raim\Home\ArenaTicker.cs
	12 PaCode.Raim\Content\mapBuilder.js
	11 PaCode.Raim\Model\CollisionEngine.cs
	9  PaCode.Raim\Model\Vector2d.cs
	8  PaCode.Raim\Model\IGameObject.cs
	8  PaCode.Raim\Model\Obstacle.cs
	7  PaCode.Raim\gulpfile.js
	7  PaCode.Raim\ArenaDefinitions\build1.txt
	7  PaCode.Raim\Content\KeyboardInput.js
	6  PaCode.Raim\Model\BoundingBox.cs
	6  PaCode.Raim\Scripts\helpers.js
	5  PaCode.Raim\Bundles.json
	5  PaCode.Raim\Home\MapBuilder.html
	5  PaCode.Raim\Startup.cs
	4  PaCode.Raim\web.config
	4  PaCode.Raim\ArenaDefinitions\Arena51.txt
	4  PaCode.Raim\Model\Range.cs
	3  PaCode.Raim\Home\HomeModule.cs
	3  PaCode.Raim\Model\IDestroyable.cs
	3  PaCode.Raim\Home\PlayerInput.cs
	3  PaCode.Raim\packages.config
	2  PaCode.Raim\package.json
	2  PaCode.Raim\Model\QuadTree.cs
	2  PaCode.Raim\Content\favicon.png
	2  PaCode.Raim\PaCode.Raim.sln
	2  README.md
	2  PaCode.Raim\Content\DSP2016.png
	2  PaCode.Raim\Home\MoveDirection.cs
	2  .gitignore
	2  PaCode.Raim\Bundles\raim_main.js
	2  PaCode.Raim\Bootstrapper.cs
	2  PaCode.Raim\Scripts\jquery-1.6.4-vsdoc.js
	2  PaCode.Raim\Scripts\jquery-1.6.4.js
	2  PaCode.Raim\Scripts\jquery-1.6.4.min.js
	1  PaCode.Raim\Model\ILimitedTimelife.cs
	1  PaCode.Raim\Content\keysInput.js
	1  PaCode.Raim\Content\moveDirections.js
	1  PaCode.Raim\Scripts\jquery-2.2.1.js
	1  PaCode.Raim\Scripts\jquery-2.2.1.min.js
	1  PaCode.Raim\Scripts\jquery-2.2.1.min.map
	1  PaCode.Raim\Scripts\jquery.signalR-2.2.0.js
	1  PaCode.Raim\Scripts\jquery.signalR-2.2.0.min.js
	1  PaCode.Raim\HomeModule.cs
	1  PaCode.Raim\Properties\AssemblyInfo.cs
	1  PaCode.Raim\web.Debug.config
	1  PaCode.Raim\web.Release.config
	1  LICENSE

Arena-related files, as well as graphics handling and the hub class – something seems to be wrong there! Those files change way too often compared to other files (and this is the measurement I decided to use to pinpoint potential troublemakers). So if I were to concentrate on fixing something in this project, those files would be worth looking into. And indeed, those files are responsible for much of the project’s complexity. I don’t even know how much time I’ve spent working on the arena code on the client and server side. Those are the most complicated elements of the project, responsible for most of the functionality in the application. And that simply means those elements are too big. Functionality should have been moved to more specialized modules.

Of course this is not a perfect solution. What we have here barely scratches the surface. It is just a very simplified piece of functionality to show the idea. It doesn’t take into account that even though the hub file was indeed changing quite a bit for quite some time, it has since been refactored and its functionality moved somewhere else. So looking just at frequency might give you some false positives. But it is a start, the first piece of data on the long journey to a great codebase. Other measurements that you could look at include the change rate for files over the last few weeks, or maybe months. The days of the week on which those changes happen (are those genuine changes, or are people just trying to fix stuff at the last minute, just before a code freeze or release?). You can look at the authors of those files and look for knowledge silos. And I am confident that you can come up with more things to measure that could help you improve the state of your project.
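The extra measurements listed above could be sketched along the following lines. This is only an illustration under a few assumptions: `FileChange`, `ChangeMetrics` and its methods are hypothetical names introduced here, root commits are skipped for brevity, and the 30-day window in `Main` is an arbitrary choice. The commit author and timestamp are read from LibGit2Sharp’s `commit.Author` signature.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using LibGit2Sharp;

// One record per (commit, changed file) pair. Hypothetical helper type.
class FileChange
{
    public string Path;
    public string Author;
    public DateTimeOffset When;
}

static class ChangeMetrics
{
    // Flattens the history into FileChange records.
    // Root commits have no parent to compare against and are skipped here.
    public static IEnumerable<FileChange> ReadChanges(Repository repo)
    {
        foreach (var commit in repo.Commits)
        {
            var parent = commit.Parents.FirstOrDefault();
            if (parent == null) continue;

            foreach (var change in repo.Diff.Compare<TreeChanges>(parent.Tree, commit.Tree))
            {
                yield return new FileChange
                {
                    Path = change.Path,
                    Author = commit.Author.Name,
                    When = commit.Author.When
                };
            }
        }
    }

    // Number of changes per file at or after the given moment.
    public static Dictionary<string, int> ChangesSince(IEnumerable<FileChange> changes, DateTimeOffset since)
    {
        return changes
            .Where(c => c.When >= since)
            .GroupBy(c => c.Path)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Number of changes per day of the week, using each commit's own UTC offset.
    public static Dictionary<DayOfWeek, int> ChangesPerDayOfWeek(IEnumerable<FileChange> changes)
    {
        return changes
            .GroupBy(c => c.When.DayOfWeek)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Number of distinct authors per file; a value of 1 may hint at a knowledge silo.
    public static Dictionary<string, int> AuthorsPerFile(IEnumerable<FileChange> changes)
    {
        return changes
            .GroupBy(c => c.Path)
            .ToDictionary(g => g.Key, g => g.Select(c => c.Author).Distinct().Count());
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path to the repository.
        using (var repo = new Repository(args[0]))
        {
            var changes = ReadChanges(repo).ToList();
            var recent = ChangesSince(changes, DateTimeOffset.Now.AddDays(-30));
            foreach (var entry in recent.OrderByDescending(e => e.Value))
            {
                Console.WriteLine($"{entry.Value}\t{entry.Key}");
            }
        }
    }
}
```

Keeping the history as a flat list of records means each new measurement is just another grouping over the same data, so the repository only has to be walked once.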

So take this away from this blog post: look for the deeper truth about your codebase; don’t judge based only on your gut feeling or emotions. The dragons are there, but their lairs are hidden, and not just anyone will discover them.