Saturday, June 23, 2007

Searching Code

Integrated Development Environment's (IDE's) are stuck in the doldrums when it comes to finding the right bit of code for a job. Today's web 2.0 applications lead the way in information retrieval, but these tools are not available in the world of software development.

IDE's today

Current IDE's are based heavily around directory structures (namespaces) for grouping similar code together. Think System.XYZ. They have some simple tools for following references, i.e. go to the definition of a function or find all references to a function, and a rudimentary search engine.

The current state-of-the-art in Web 2.0 Information Retrieval

For large sets of data Search has kicked Directory structures into touch. Search however is not contextual (its dumb and has no understanding of context). This leads to billions of results may of which are irrelevant. Web 2.0 companies have tried to overcome this through tagging and mark-up (the Semantic Web), which has more meaning, but requires effort on the part of the user (i.e. adding that tag). This is being overcome by smart algorithms that personalise your information based on your previous history of activities - think Amazon and "People who read this also read....".

What are we missing??
  1. Viewing related information to the task currently being undertaken - hey, you are changing interface X - here are all the classes, which implement it. Or, you have just named a class Y here are all the other classes with similar names
  2. Better Search - common can we get a "Code Rank" algorithm - here are the most popular functions/classes based on their links (references) - and don't give me "you need to decouple your code more", some algorithms are more fundamental - like adding to a list
  3. Personalised Search - using the historical results from the work you are doing to hone your search into something useful.
  4. Better Collaboration - none of this I'm going to shove IM in a toolbar and let you get on with it. I want to get realtime visualisation of which of my team members are working on which bits of code etc

The Solution

  1. Make the IDE a Web-App running on a server, with integrated backend into the Repository = Solves 4 for Collaboration and also Version Control
  2. Builds all running on the Server-side = Faster Builds
  3. UI update to include contextual results panes - Solves 1
  4. Server-side search engine to cross correlate user's activity - Solves 2 and 3 for personalisation

Of course, this does come with a downside - us developers like to have the fastest, the whizziest, pc beast on the block, and if its running like a webapp we're not going to get the budget - Damn!

Monday, May 07, 2007

Nesting Splitters

To recap on my previous post - C# provides the "yield" keyword that allows any function that returns an IEnumerable object to provide lazy iteration capability (a Custom Enumerator). This allows us to iterate across an array or list and filter elements, convert them to different types, aggregate them with another list etc.

My last post detailed how to "Split" an IEnumerable into two separate mutually exclusive enumerators. I.e. the equivalent of an if-else statement acting on each element.

At the end of the post I suggested that it would be great to output multiple enumerators i.e. the equivalent of if-then-else or switch type patterns. This may not be required as we can Nest Enumerators!

It goes like this:

inputList => split1 and split2 using our first test/predicate
split1 => split3 and split4 using our second test/predicate on the results of split1

We can then split the results add infinitum - creating a Binary Tree Splitter. At each node in the tree we queue the results of the "lagging" enumerator until it catches up. What is the benefit?

We can split out multiple enumerators and process each of them differently.

Wednesday, May 02, 2007

Splitting an IEnumerable into Multiple Custom Iterators

Bill Wagners MSDN column on custom iterators is excellent. It demonstrates clearly how to create a custom iterator so that IEnumerable iterators can iterate through collections in a lazy fashion. That is only processing the elements in the collection that are required.

As soon as it becomes evident that custom iterators take an IEnumerable as an input and output an IEnumerable it also becomes evident that we can "pipeline" custom iterators.

For Example:


  1. Start with a Collection of People
  2. Filter out all people with surname Grant
  3. Filter out all people with first name Mark
  4. Display List of Results on screen

This is a simple pipeline like so: 1 -> 2 -> 3 -> 4

We can also Transform iterators into different types and we can Merge or Aggregate multiple iterators into one be iterating over each in turn. This got me thinking is it possible to split an iterator into mutiple iterators.

By splitting an iterator it allows us to process the different sets of results differently. As an example:

We have an iterator that recurses a file structure. We need to split out two iterators -one for files and one for directories.

Out input iterator is of type IEnumerable and splits into a directoryIterator and fileIterator of type IEnumerable. These can then be Transformed into an IEnumerable and an IEnumerable respectively.

Having asked Bill if he could blog a solution I had some inspiration! Here is a test case for the solution:


using MarkGrant.Collections;

using NUnit.Framework;
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

namespace MarkGrantTest
{

/// <summary>
///
/// </summary>
/// <originalAuthor>mark.grant</originalAuthor>
/// <creationDate>01/05/2007 23:58:19</creationDate>
[TestFixture]
public class TestSplitter
{
[Test]
public void Test()
{
List<int> list = new List<int>();
for (int i = 0; i < 20; i++)
{
list.Add(i);
}

// intialize our splitter
Splitter<int> splitter = new Splitter<int>(list, delegate(int number) { return number % 3 == 0; });
IEnumerable<int> factor3 = splitter.GetMatchingElements();
IEnumerable<int> nonFactor3 = splitter.GetNonMatchingElements();

foreach (int factor in factor3)
{
Assert.IsTrue(factor % 3 == 0);
}

foreach (int nonFactor in nonFactor3)
{
Assert.IsTrue(nonFactor % 3 == 1 nonFactor % 3 == 2);
}
}
}
}


The test creates a Splitter class that Matches multiples of 3 and non-multiples. The IEnumerables can be called in any order and can overlap. As we don't know which iterator will be called the splitter will need to record the current state of the two iterators - i.e. which elements are left, which iterator they are related too etc. This can be achieved by queuing elements that are requested but belong to the other iterator.

The Splitter class:


using System;
using System.Collections.Generic;
using System.Text;

namespace MarkGrant.Collections
{
/// <summary>
/// Split an input enumerable into two mutually exclusive enumerables
/// </summary>
/// <typeparam name="T"></typeparam>
public class Splitter<T>
{

/// <summary>
/// Are we queuing the results from the first enumerator or the second - it
/// depends on which one is ahead
/// </summary>
private bool _queuingFirstIterator;

/// <summary>
/// We Queue one Enumerator unless the two are perfectly synchronised
/// </summary>
private Queue<T> _unyieldedElements = new Queue<T>();

/// <summary>
/// Our filter
/// </summary>
private Predicate<T> _predicate;

/// <summary>
/// Our input enumerator
/// </summary>
private IEnumerator<T> _inputEnumerator;

/// <summary>
/// Construct our Splitter with the input enumerator and the filter
/// </summary>
/// <param name="inputEnumerable">the input enumerator</param>
/// <param name="predicate">our filter</param>
public Splitter(IEnumerable<T> inputEnumerable, Predicate<T> predicate)
{
if (inputEnumerable == null)
{
throw new ArgumentNullException("inputEnumerable");
}

if (predicate == null)
{
throw new ArgumentNullException("predicate");
}
_inputEnumerator = inputEnumerable.GetEnumerator();
_predicate = predicate;
}

/// <summary>
/// All elements that match the predicate
/// </summary>
/// <returns></returns>
public IEnumerable<T> GetMatchingElements()
{
return GetNextElement(true);
}

/// <summary>
/// All elements that don't match the predicate
/// </summary>
/// <returns></returns>
public IEnumerable<T> GetNonMatchingElements()
{
return GetNextElement(false);
}

/// <summary>
/// Returns the next element for our enumerator
/// </summary>
/// <param name="isMatching"></param>
/// <returns></returns>
private IEnumerable<T> GetNextElement(bool isMatching)
{
while (true)
{
// if we have an element in the queue for this iterator then
// yield it otherwise next element in the inputEnumerator is
// for this enumerable then yield it otherwise add it to the queue
// break when no more elements remain in either
if (_queuingFirstIterator == isMatching && _unyieldedElements.Count > 0)
{
yield return _unyieldedElements.Dequeue();
}
else if (_inputEnumerator.MoveNext())
{
if (_predicate(_inputEnumerator.Current) == isMatching)
{
yield return _inputEnumerator.Current;
}
else
{
_unyieldedElements.Enqueue(_inputEnumerator.Current);

_queuingFirstIterator = !isMatching;
}
}
else
{
break;
}
}
}

}
}





The Splitter class is initialized with the input IEnumerable and the Predicate that is used to filter the elements into the Matching and NonMatching itereators. As we enumerate through the input iterator, we keep iterating until we find an element that matches the one we are after, all the elements that we don't want we queue and a flag is set to record which iterator is "lagging" behind and therefore requires its elements to be queued.

We could for example alternate the requesting of results for either iterator. For example when matching factors of 3 starting with the iterator that matches 3, 6, 9 etc and an input list of 1,2,3,4,5,6 we get, where M is the Matching iterator (i.e. anything which is a factor of 3):

  1. M - Queue (1)
  2. Yield (1)
  3. M - Queue (2)
  4. Yield (2)
  5. M - Yield (3)
  6. Yield (4)
  7. M - Queue (5)
  8. Yield (5)
  9. M - Yield (6)

If anyone fancies improving the code to include multiple predicates, please let me know.