Sunday, March 25, 2007

Compression and Communication in Software Development

The process of refactoring software code is akin to a compression algorithm. As an example, take a simple lossless image algorithm that finds runs of the same colour and stores the colour once along with the number of pixels in the run. The refactoring equivalent is finding two functions that do the same thing but have different names, extracting a single function and pointing the original two callers at it. This process is about abstraction.
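The run-length idea can be sketched in a few lines of Python (a toy illustration of the analogy, not a real image codec - the function name is invented):

```python
# A minimal run-length encoder: collapse runs of identical values into
# (value, run_length) pairs - the "store it once" move that refactoring
# mirrors when it extracts a duplicated function.
from itertools import groupby

def rle_encode(pixels):
    return [(value, len(list(run))) for value, run in groupby(pixels)]

row = ["red", "red", "red", "blue", "blue", "red"]
print(rle_encode(row))  # → [('red', 3), ('blue', 2), ('red', 1)]
```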

Fair enough, you might say, but we can take this a bit further. Thinking about compression triggered some thoughts about the Information Theory I learnt at university - namely, Shannon's Law. So I looked it up on Wikipedia, and here is what it had to say:-

  • it is a felony offense to discharge firearms randomly into the air, resulting in injury or death.

Err, ok, maybe not that Shannon's Law (that one is in Arizona). A little further digging on the right pages reveals Coding theory (no, it's not about programming!!!!!). Ah, and now I have found the bit of information I am after (and it's a 1 - sorry, poor joke - it's a Sunday evening!!!)

  1. Entropy of a source is the measure of information. Basically, source codes try to reduce the redundancy present in the source, and represent the source with fewer bits that carry more information.
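The entropy measure itself is easy to compute - a quick sketch of Shannon's formula H = -Σ p·log2(p), with made-up probabilities purely for illustration:

```python
# Shannon entropy of a symbol source: H = -sum(p * log2(p)).
# A fair coin carries a full 1 bit per symbol; a biased source,
# being more predictable (more redundant), carries less.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # → 1.0 (fair coin)
print(entropy([0.9, 0.1]))  # ≈ 0.469 (biased coin)
```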


Translated to software development: for each program there is a minimum number of lines of code at which absolutely no redundancy remains. This can commonly be seen in some coding competitions, where the winner of a challenge solves the problem in the fewest lines of code.

Putting aside readability, quality, comments, maintainability, the meaning of your code, communicating with other developers etc., we can surmise the following: when your code is represented in the most concise manner, no two lines or pieces of code should look the same, do the same thing or repeat any other lines of code. Any lines of code carrying the same information are redundant and transfer nothing new to the compiler.

So here is the problem: in many languages it is IMPOSSIBLE to make code fully concise and remove redundancy. Take design patterns, for example - they are seriously REDUNDANT!!!! The mere fact that they exist, that the same construct gets re-written in scenario after scenario, really means they should have been written once, not millions of times by millions of developers!!! (Mind you, I have now used generics to write a Singleton - ONCE.)
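In Python, for instance, the "write it once" version of Singleton can be sketched as a reusable decorator - a rough equivalent of the generics trick (the decorator and class names here are invented for illustration):

```python
# A Singleton written once and applied anywhere, instead of being
# re-implemented class by class - the pattern's redundancy factored out.
def singleton(cls):
    instances = {}
    def get_instance(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]
    return get_instance

@singleton
class Config:
    def __init__(self):
        self.settings = {}

assert Config() is Config()  # every call returns the same instance
```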

C# has just added some "syntactic sugar" to reduce the amount of code required to write a property, which meets one of the key tenets of Information Theory. Namely, use short codes for the most frequently used words!

So, on the flip side, there is channel encoding: when communicating information we need to add redundant information to ensure our message is understood - think of listening to a fuzzy radio station - sometimes it is possible to understand without hearing every word!
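Channel encoding in miniature is the repetition code - send every bit three times and take a majority vote at the other end. A toy sketch (not a practical code, but it shows why redundancy helps):

```python
# 3x repetition code: redundancy lets the receiver recover the message
# even when the channel flips a bit - the "fuzzy radio" effect.
def encode(bits):
    return [b for b in bits for _ in range(3)]

def decode(received):
    return [int(sum(received[i:i + 3]) >= 2)
            for i in range(0, len(received), 3)]

message = [1, 0, 1]
sent = encode(message)          # [1,1,1, 0,0,0, 1,1,1]
sent[4] = 1                     # the channel corrupts one bit
assert decode(sent) == message  # majority vote still recovers it
```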

So, in programming terms, we have two key audiences/recipients for our code:-

  1. The compiler
  2. You, the developer, and other developers

We add large amounts of redundancy - in comments and in code structure - in order to effectively communicate the meaning of our code. At this point, due to tiredness, I am starting to forget what I really wanted to say! A quick wrap-up and I'll think about it some more!!!

1. We can apply some of the theories of Information theory to Software development

2. We apply Source Encoding as we refactor/abstract our code

3. We apply Channel Encoding to improve the communication of our code

4. Finally, we are slaves to our compilers when it comes to removing redundant lines of code


At 10:44 am, Blogger Jim said...

In enterprise development I think it's mostly about the channel encoding to optimize developer to developer communication and maintainability - we refactor to make cleaner, more comprehensible structures.

I'd regard the syntax tweaks in C# and Java as less lossy channel encodings.

It's all different in mobile phones and games consoles and their ilk - the human definitely comes second!

It would be interesting to see if you could apply any signal theory to derive better code metrics than we use now.

At 1:05 pm, Blogger MAG said...

Errr damn, you are absolutely spot on. The question is: can the entropy of the code tell us

a. How well factored the code is?
b. Whether one language is more efficient than another?

And quite right again - is it possible to separate the source and channel encoding, i.e. the communication/maintainability of the code from how well abstracted it is?

My brain is already going on how a code metric algorithm could be devised - if you have any ideas Jim, that would be great.
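A first stab at such a metric might be token-level entropy of the source text - purely a sketch under naive assumptions (whitespace tokenisation, no weighting), nothing like the metrics in the literature:

```python
# A crude "code entropy" metric: Shannon entropy over the tokens of a
# snippet. Lower values suggest more repetition/redundancy in the code.
from collections import Counter
from math import log2

def token_entropy(source):
    tokens = source.split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((n / total) * log2(n / total) for n in counts.values())

repetitive = "x = 1 ; x = 1 ; x = 1"
varied = "a = 1 ; b = 2 ; c = 3"
assert token_entropy(repetitive) < token_entropy(varied)
```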

At 9:46 pm, Blogger MAG said...

So, a little bit more research has revealed a whole field I didn't know about as a developer. Tapping "measuring software entropy" into Google revealed a whole bunch of IEEE and HP papers on the subject.

Having dug around a little, most sources are not free, but this article on OO Software and Entropy may provide a start:

and this one:

A bit academic, but I haven't found any tools that really utilise it; they prefer simpler metrics.

