Site data query and parallel queries

As I showed in my first post, I also have my Activities entity – and from reading (way) too much through the various literature on how to make SharePoint scale with large lists, I see that one of the recommended ways of scaling is to distribute your data into multiple lists. Also with my last post in mind, about throttling, I figure, that keeping many relatively small lists would be better, that one big potentially unreadable list.

We also have the Activity type to take into consideration, so let us decide, that a list, with a given list template is the “Activity type”, and the individual items in the list are the “Activities”. I’ll even index the appropriate columns, to make them searchable if we should get above the throttle level, since the queries for activities aren’t that complex – mostly something about searching for the activities belonging to a particular “Case”, or with a due-date within a given time span.

Kimblog1

I can then use the SPSiteDataQuery to search for my activities, as long as I keep of my “too big” “Cases” list, so I start searching for all the “blue” lists, where the Case ID equals a specific case, and I get a “ThrottleException”.

That was a bit of a turn off, particularly since I know that there are no lists with items above the throttle level – and if there where, the Case ID is indexed, and there will be a maximum of 2-3 activities in each blue list, with a given Case ID – and that’s on a good day.

It seems that the reason for this is, that the template matching happens after the site data query has been run, which again means that it will search my cases list. In this list the Case ID is not indexed (well – it doesn’t even exist in the Cases list, so there’s absolutely no point in indexing it), and since doing a *potential* search for Case ID in the Cases list would require a full read of all items, and the number of items are above the throttle level, we get the evil ThrottleException.

Kimblog2

But we still need the to get the items out of the blue lists – so the only other option is to start searching each lists sequentially – or even better : in parallel.

Multithreading in Web application has always annoyed me a bit: We are not allowed to use the threading classes, but are left with an implementation that are bound to the code-behind of our Forms. Doing a layered application, this poses some difficulties, because the form code-behind should have no involvement in how (and where) I fetch my entities.

What I (until now) have been able to get is:

kimblog3

Where what I want is:

kimblog4

I read on a blog somewhere (and unfortunately I can’t seem to remember where – but it was a “Microsoft official“ one), that from .NET 3.5 you are allowed to use the threading library in your web applications, which is good, since we are now able to implement the wanted solution into our search classes.

First of all, we need to refactor our original “SPQuerySource” model slightly – it’s nothing major, but I want to

factor out the generation of the query itself into methods I can use in my parallel queries:

        protectedstring _listName = string.Emtpy;publicSPWeb Web{get{return _web;}}protectedvirtualstring GetQueryString(DataQuery q){string qstr = “”;for (int i = 0; i < q.Criterias.Count; i++){if (i == q.Criterias.Count – 1)

{

qstr += getCriteria(_listName, q.Criterias[i]);

}

else

{

qstr += “<And>” + getCriteria(_listName, q.Criterias[i]);

}

}

for (int i = 0; i < q.Criterias.Count – 1; i++)

{

qstr += “</And>”;

}

return qstr;

}

protectedvirtualstring GetSortString(DataQuery q)

{

string sortStr = “”;

if (q.SortBy.Count > 0)

{

sortStr += “<OrderBy>”;

foreach (SortOrder so in q.SortBy)

{

sortStr += getSortField(_listName, so);

}

sortStr += “</OrderBy>”;

}

return sortStr;

}

publicvirtualList<Target> Search(DataQuery q, ToEntityFunction<Target, SPListItem> func)

{

var returns = newList<Target>();

SPQuery query = newSPQuery();

query.Query = “<Where>” + GetQueryString(q) + “</Where>” + GetSortString(q);

query.RowLimit = (uint)q.MaxItems;

SPList l = Web.Lists.TryGetList(_listName);

if (l != null)

{

 

SPListItemCollection col = l.GetItems(query);

foreach (SPListItem item in col)

{

if (func != null)

{

Target Entity = func(item);

returns.Add(Entity);

}

}

}

return returns;

}

 

Next I define a couple of helper objects for my parallel query. I need some kind of “configuration” object, telling what site and web I’ll be searching in. Since the queries will run in its own thread, I can’t use the main site/web construct – I can’t use the returned SPListItems from the threads in the main thread either, since these will be bound to the thread context, but luckily I really don’t care about the SPListItems: I want to return my own entity objects (so you can see everything coming together nicely). I also need something configuring explicitly the individual threads.

The configuration object will also hold the returned values.

I’ll also be using the ManualResetEvent, and send a “Set” when all threads have been finished. The automatic reset event has a limit to, how many threads can be spawned and waited for (I think it is 64), and I don’t want to impose such a restriction.

 

        privateclassParallelThread<Target>{publicstring ListName = “”;publicParallelConfiguration<Target> Configuration = null;public ParallelThread(ParallelConfiguration<Target> configuration, string listName){Configuration = configuration;ListName = listName;}}privateclassParallelConfiguration<Target>{publicList<Target> ReturnsObjects = newList<Target>();

publicobject LockHandle = newobject();

publicManualResetEvent CommonWaitHandle = newManualResetEvent(false);

publicint NumberOfThreadsNotCompleted;

publicGuid SiteColID;

publicGuid WebId;

public string QueryString = string.Empty;

publicToEntityFunction<Target, SPListItem> EntityFunction;

public ParallelConfiguration(

Guid siteColID,

Guid webId,

int numberOfThreads,

ToEntityFunction<Target, SPListItem> entityFunction

string queryString)

{

NumberOfThreadsNotCompleted = numberOfThreads;

Query = query;

SiteColID=siteColID;

WebId=webId;

EntityFunction = entityFunction;

QueryString = queryString;

}

 

publicvoid Finally()

{

if (Interlocked.Decrement(ref NumberOfThreadsNotCompleted) == 0)

{

CommonWaitHandle.Set();

}

}

}

 

Then I just need to subclass my original SPQueryObject, and redo the “search” method, to search in parallel in the supplied lists. I’ll be using the “ThreadPool” object, since I don’t want to concern myself too much about scheduling etc., and the ThreadPool should exactly take of all these details for me. Since the entities will enter my result list, in an undefined order, and I still want to be able to do some sorting, I need to take care of this manually at the end of my parallel query – which, because of the generic implementation, will require some reflection – but nothing too fancy, so we will survive.

 

    publicclassSPParallelQuery<Target>: SPQuerySource<Target> where Target:class{privateList<string> _listsToSearchIn = null;public SPParallelQuery(SPWeb web,List<string> listsToSearchIn):base(web,””){_listsToSearchIn = listsToSearchIn;}publicoverrideList<Target> Search(DataQuery q, ToEntityFunction<Target, SPListItem> func){if (string.IsNullOrEmpty(_listName)){_listName = _listsToSearchIn[0];} 

string queryString = “<Where>” + GetQueryString(q) + “</Where>” + GetSortString(q);

ParallelConfiguration<Target> conf = newParallelConfiguration<Target>

(Web.Site.ID,

Web.ID,

_listsToSearchIn.Count,

func,

queryString);

foreach (string s in _listsToSearchIn)

{

 

ParallelThread<Target> t = newParallelThread<Target>(conf, s);

ThreadPool.QueueUserWorkItem(newWaitCallback(ExecuteQueryOnList), t);

}

conf.CommonWaitHandle.WaitOne(newTimeSpan(0, 2, 0));

//wait maximum 2 minutes…… the return whatever we got

 

List<Target> returns = conf.ReturnsObjects;

if (q.SortBy.Count > 0)

{

try

{

SimpleSortOrder sort = q.SortBy[0] asSimpleSortOrder;

if (sort != null)

{

 

string fieldName = sort.Field;

if (typeof(Target).GetProperty(fieldName) == null)

{

foreach(PropertyInfo p intypeof(Target).GetProperties())

{

object[] attributes = p.GetCustomAttributes(

typeof(LinqExtender.Attributes.NameAttribute), true);

if (attributes.Length == 1 &&

((LinqExtender.Attributes.NameAttribute)attributes[0]).

Name ==fieldName)

{

fieldName = p.Name;

break;

}

}

}

 

if (typeof(Target).GetProperty(fieldName) != null)

{

returns.Sort((target1,target2) =>

{

try

{

PropertyInfo prop1 = target1.GetType().GetProperty(fieldName);

PropertyInfo prop2 = target2.GetType().GetProperty(fieldName);

if ((prop1.GetValue(target1) isIComparable) &&

(prop2.GetValue(target2) isIComparable))

{

IComparable val1 = prop1.GetValue(target1) asIComparable;

IComparable val2 = prop1.GetValue(target2) asIComparable;

return val1.CompareTo(val2);

}

}

catch (Exception ex)

{

// unable to compare

}

return 0;

});

if (sort.SortOrder == SortDirection.descending)

{

returns.Reverse();

}

}

}

}

catch (Exception ex)

{

// unable to sort

}

}

return returns;

}

privatevoid ExecuteQueryOnList(object state)

{

ParallelThread<Target> t = state asParallelThread<Target>;

if (t != null)

{

try

{

using (SPSite site = newSPSite(t.Configuration.SiteColID))

{

using (SPWeb web = site.OpenWeb(t.Configuration.WebId))

{

SPQuery query = newSPQuery();

query.Query = t.Configuration.QueryString;

query.RowLimit = (uint)5000;

SPList l = web.Lists.TryGetList(t.ListName);

if (l != null)

{

 

try

{

SPListItemCollection col = l.GetItems(query);

foreach (SPListItem item in col)

{

lock (t.Configuration.LockHandle)

{

if (t.Configuration.EntityFunction != null)

{

Target Entity = t.Configuration.EntityFunction(item);

t.Configuration.ReturnsObjects.Add(Entity);

}

}

}

}

catch (Exception ex)

{

// nothing much we can do about it here……

}

}

}

}

}

finally

{

t.Configuration.Finally();

}

}

}


Kimblog6

Now we can make our “SPActivityData” data access object, that will search for all activities, matching a given criteria.

    publicclassSPActivityData : IEntityData<Entities.Activity>{privateSPWeb _web = null;privateIQuerySource<Entities.Activity, SPListItem> _searchSource = null;public SPActivityData(){// THIS WOULD NORMALLY BE FACTORED OUT_web = SPContext.Current.Web;List<string> Lists=newList<string>();Lists.Add(“TestA”);Lists.Add(“TestB”);_searchSource = newSPParallelQuery<Entities.Activity>(_web, Lists);}public SPActivityData(SPWeb web)

{

// THIS WOULD NORMALLY BE FACTORED OUT

List<string> Lists = newList<string>();

Lists.Add(“TestA”);

Lists.Add(“TestB”);

_web = web;

_searchSource = newSPParallelQuery<Entities.Activity>(_web, Lists);

}

 

publicList<Entities.Activity> GetItems(Query.DataQuery q)

{

return _searchSource.Search(q, ListItem => ListItem.ToActivity());

}

}

And I can then write our ActivityRepository (with Linq support, of course).

    publicclassActivityRepository{privatereadonlyIEntityData<Entities.Activity> _caseDataInterface = null;public ActivityRepository(){_caseDataInterface = new DataAccess.SharePoint.SPActivityData(SPContext.Current.Web);}public ActivityRepository(IEntityData<Entities.Activity> dataContext){_caseDataInterface = dataContext;}publicList<Entities.Activity> GetActivities(Entities.Case parentCase){returnnewList<Entities.Activity>();

}

publicIQueryContext<Entities.Activity> Activities

{

get

{

IQueryContext<Entities.Activity> returns = new

SPQueryContext<Entities.Activity>(_caseDataInterface);

return returns;

}

}

publicList<Entities.Activity> GetAllWithTitleA()

{

var l = from activity in Activities

where activity.Title == “Title a”

orderby activity.Title

select activity;

return l.ToList();

}

}

As I see it, there are some inconsistencies in the SharePoint query engine, when it comes to the throttling. If doesn’t feel like it has been implemented fully, and I don’t think that the statement “SharePoint now supports large lists” completely holds true – but as you hopefully can see, it is possible to get descent results out of your SharePoint installation.

And as I wrote in the previous post: As long as you have analyzed your business requirements, and verified that you are able to write queries on indexed expressions, you can make workable solutions with large lists.

Kimblog7If you wish, you can write an e-mail to me kim.hermansen@visma.com and get some of the code handed out.