The Old Joining Dots Blog

12 June 2006

Microsoft Search Strategy - Part 2

Part 1 looked at the current portfolio of search products and technologies available from Microsoft. In this post, we'll look at new features being provided in the next releases. As before, the content here is based on the search conference I attended last week, combined with some notes from the SharePoint conference held in May and some thoughts I had along the way...

As with the current portfolio, the products and technologies span four areas - operating system, enterprise, web and desktop. Microsoft groups them together in three buckets - web, desktop and enterprise - but it caused some confusion during the conference because the OS-provided search cropped up in various places, so I'll separate it out here...

  • Operating System: MS Search (free, provided within the OS)
  • Enterprise: SharePoint Server 2007 (product licences)
  • Web: Live Search (replacing MSN Search, still free)
  • Desktop: Live Search for Desktop (replacing MSN Desktop Search, still a free download)

Operating System

The major operating system release in the next 12 months will be Windows Vista, to replace Windows XP on desktop computers. Way back when, Vista was due to ship with a new file system - WinFS - that would have all sorts of funky features for improving storage and retrieval of modern content (i.e. life beyond docs). But WinFS got dropped from Vista last year and took a chunk of search improvements with it. Vista will still contain the basic full-text indexing and include some new features, such as property-based filtering, saved searches, the ability to search across other computers and enhanced security (for example, the index will be obfuscated using the user's security identifier, or SID). I believe Windows Vista will also ship with Live Search for Desktop (details later in this post) installed by default, but I could be wrong.

MS Search is also being improved within server products. For example, Exchange Server 2007 will do full-text indexing of everything (as opposed to just message bodies) and will now index natively instead of first translating content into raw text - both features will help with compliance requirements. Within Outlook 2007, you get a new advanced search box that includes the ability to search by message properties (from, to, subject etc.) and hit highlighting within results.

Enterprise

For the enterprise, SharePoint Server 2007 includes a number of improvements to its core search and indexing capability. From a technology standpoint, SharePoint Server is introducing new and improved ranking algorithms to improve the relevancy of search results, and is scaling up the index size (currently being tested up to 50 million documents indexed). Two examples of the new ranking signals are click distance (promoting results based on their distance from authoritative sites) and URL depth (relevance weakens for content located further down a site's navigation structure). Based on internal testing, Microsoft is claiming a 500% improvement in relevance based on common queries. (There is currently no external research to support the claim.)

The user interface is also being enhanced. The search and results pages are all based on ASP.NET v2 web parts, making it much easier to customise the standard search interface and to add search capabilities to non-SharePoint web sites that are built on ASP.NET v2. Alerts are being improved and will benefit from the addition of RSS feeds. And Microsoft is improving management and scale beyond increasing the quantity of documents indexed. One of the neatest improvements to hear about was the introduction of a security-only crawl, which just re-indexes the ACLs (access control lists - they determine who has access to what) on content. That is very useful for records (i.e. content doesn't change, but access to it might).
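To make the click distance and URL depth signals a bit more concrete, here is a rough sketch in Python. It is not Microsoft's algorithm (no implementation details were shown at the conference). The link graph, the intranet URLs and the weighting formula in static_boost are all invented for illustration, assuming only that pages closer to an authoritative site and nearer the site root should score higher.

```python
from collections import deque
from urllib.parse import urlparse


def click_distances(links, authoritative):
    """Breadth-first search: fewest clicks from any authoritative page."""
    dist = {page: 0 for page in authoritative}
    queue = deque(authoritative)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:  # first visit is the shortest route
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def url_depth(url):
    """Number of path segments below the site root."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def static_boost(url, dist):
    """Hypothetical weighting: closer and shallower pages score higher."""
    clicks = dist.get(url)
    if clicks is None:
        return 0.0  # not reachable from any authoritative page
    return 1.0 / (1 + clicks + url_depth(url))


# Made-up link graph: page -> pages it links to.
links = {
    "http://intranet/": ["http://intranet/hr/", "http://intranet/sales/"],
    "http://intranet/hr/": ["http://intranet/hr/policies/leave.doc"],
    "http://intranet/sales/": [],
}
dist = click_distances(links, ["http://intranet/"])

for url in sorted(dist, key=lambda u: static_boost(u, dist), reverse=True):
    print(f"{static_boost(url, dist):.2f}  {url}")
```

In a real engine a boost like this would be combined with the usual query-dependent text match score rather than used on its own.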

In the current version (SharePoint Portal Server 2003), the focus has been on unstructured documents with some limited features for also locating people (through user profiles and audience targeting). SharePoint Server 2007 moves beyond unstructured documents to also target people (social networks) and structured content (application data) through the introduction of two new search features: Knowledge Network and Business Data Catalog.

Knowledge Network

Microsoft will provide three levels of people search in SharePoint Server 2007:

  • Basic: User profiles within SharePoint Server 2007 (similar to current version)
  • Intermediate: Combine SharePoint Server 2007 with Outlook 2007 to provide results based on social distance (by analysing email sent/received)
  • Advanced: Knowledge Network - Requires SharePoint Server 2007 and Outlook 2003/2007, to create skills profiles and discover contacts (internal and external)

The Knowledge Network (KN) is populated with data based on the contents of email, activities and contacts within Outlook (2003 or 2007) and extends the user profiles and search results within SharePoint Server 2007. Results are organised based on social distance - direct contacts (know you) and indirect contacts (know your colleagues). To get in contact with an indirect contact, you click on the link and are presented with a list of people you know who know this person. When Microsoft tested KN internally with 1,000 users, they uncovered 80,000 external contacts that were not centrally recorded and are now contactable through this method. It will be interesting to compare the quality of the results in this feature versus the traditional customer relationship management (CRM) system... (the decision to publish external contacts is controlled by the owner of those contacts). For more information, see the related post: MS Knowledge Network.
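The direct/indirect split described above can be sketched with a simple contact graph. This is only an illustration of the idea, not how KN is built: the real thing derives its relationships from analysing email and Outlook data, whereas the graph, the names and the classify_contact function below are hypothetical.

```python
def classify_contact(graph, me, person):
    """Return (kind, introducers) for `person` as seen from `me`.

    kind is "direct" (I know them), "indirect" (someone I know knows
    them, listed in introducers) or "unknown".
    """
    mine = graph.get(me, set())
    if person in mine:
        return "direct", []
    # People I know who also know the person I'm trying to reach.
    introducers = sorted(c for c in mine if person in graph.get(c, set()))
    if introducers:
        return "indirect", introducers
    return "unknown", []


# Made-up contact graph: person -> set of people they know.
graph = {
    "me": {"alice", "bob"},
    "alice": {"me", "carol"},
    "bob": {"me", "carol", "dave"},
}

for person in ["alice", "carol", "dave", "erin"]:
    print(person, classify_contact(graph, "me", person))
```

The introducers list corresponds to what you see when you click on an indirect contact: the people you know who know this person.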

Business Data Catalog

The Business Data Catalog (BDC) introduces structured content searches to SharePoint Server and potentially turns SharePoint Server 2007 into an enterprise platform for creating composite applications (aka 'mash-ups'). The BDC enables you to add applications as content sources and expose key properties within search results. For example, a search on a customer name would return the usual unstructured stuff - documents - but could also return records from applications, publishing information such as account team, order history and credit rating. Clicking on a record can present a customised results page that provides more information from the application in a readable format. This is a huge leap forward for Microsoft. Previously, not only was it difficult to index structured content in the first place (no native connectors provided out of the box) but it was impossible to tailor the results into a user-friendly readable format.
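As a rough illustration of the idea (and nothing to do with the actual BDC API, which I haven't seen in enough detail to reproduce), here is a Python sketch of a search that returns documents alongside application records, exposing only nominated properties. The Result class, both search functions, the sample CRM data and the property names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    source: str  # "document", or the name of the application
    title: str
    properties: dict = field(default_factory=dict)


def search_documents(query, documents):
    """Plain unstructured search: match the query against document text."""
    q = query.lower()
    return [Result("document", d["title"]) for d in documents
            if q in d["text"].lower()]


def search_records(query, app_name, records, exposed):
    """Match on the exposed properties and publish only those properties."""
    q = query.lower()
    hits = []
    for rec in records:
        if any(q in str(rec.get(p, "")).lower() for p in exposed):
            visible = {p: rec[p] for p in exposed if p in rec}
            hits.append(Result(app_name, rec["name"], visible))
    return hits


# Made-up content sources.
documents = [{"title": "Contoso proposal", "text": "Proposal for Contoso Ltd"}]
crm = [{"name": "Contoso Ltd", "account_team": "North",
        "credit_rating": "A", "internal_notes": "not for publication"}]
exposed = ["name", "account_team", "credit_rating"]

combined = (search_documents("contoso", documents)
            + search_records("contoso", "CRM", crm, exposed))
for result in combined:
    print(result.source, "-", result.title, result.properties)
```

The point of the exposed list is that the application stays in charge of its data: only the key properties chosen when the content source is added show up in results.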

So with all that love, what's the bad news about the BDC? Well, for starters the BDC will only be available in one licensing flavour of SharePoint - Enterprise edition. (See Licensing later in this post.) Second, Microsoft are keen to state that the BDC doesn't require any code to be written to create connections to applications. That's true if you don't consider writing XSLT statements as coding. Personally, I call that declarative programming, i.e. code. Sure, it's pretty straightforward compared to the type of code you'd normally have to write, but claiming no code is required and then quickly doing a cut/paste of some XSLT 'lines of data' whilst the audience blinks is stretching the truth a little (note: this happened at the SharePoint conference in Seattle but the same 'no code' statement was also made at the Search conference in London). The third issue that came to mind (it wasn't covered at the search conference) is licensing requirements for the applications being indexed. For example, an organisation may have deployed a CRM application only to their sales people. The BDC could be used to tie the CRM application to a manufacturing application and enable support engineers to view results based on warranty repairs... but to do so would likely require the organisation to purchase additional licences for everyone instead of just a few users. I can see this causing a few headaches because the potential for the BDC to improve the accessibility and usability of information is huge and disruptive and, well, just plain lovely. See related blog post: SharePoint Mash.

Licensing

From a licensing standpoint, SharePoint Server will be available in two forms - the full product and a subset 'Search only'. SharePoint Server 2007 Search Edition will provide the base indexing and search functionality, with two licence options - Standard (limited to indexing 400,000 documents) and Enterprise (unlimited indexing). The Search Edition will not include either the Knowledge Network or Business Data Catalog. It is targeted at indexing unstructured content across the corporate intranet. The full SharePoint Server 2007 product comes with two licence options - Standard and Enterprise. The Standard edition does not include the Business Data Catalog.

                                SharePoint Server 2007 Search Edition    SharePoint Server 2007
Version                         Standard          Enterprise             Standard      Enterprise
Search and indexing             yes               yes                    yes           yes
Index size                      400,000 docs      unlimited*             unlimited*    unlimited*
Knowledge Network               no                no                     yes           yes
Business Data Catalog           no                no                     no            yes
Platform for building apps**    no                no                     yes           yes

* index has been tested up to 50 million documents at time of writing. Final recommended figures will be confirmed at launch

**Features beyond search, e.g. business intelligence (Excel Services and Report Center), collaboration (Workflow), and process automation (Forms server)

This table was based on a slide at the conference. At some point, I'll try and create a table that should help explain what you get with each version.
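In the meantime, the table above can be expressed as a small lookup for working out which versions fit a given requirement. This is just a sketch: the values are copied from the table, but the dictionary layout, the short feature names and the editions_supporting function are my own invention, and the final figures may change at launch.

```python
# Values copied from the table above; None means no stated index limit.
EDITIONS = {
    "Search Edition - Standard": {
        "index_limit": 400_000,
        "knowledge_network": False, "bdc": False, "app_platform": False,
    },
    "Search Edition - Enterprise": {
        "index_limit": None,
        "knowledge_network": False, "bdc": False, "app_platform": False,
    },
    "Standard": {
        "index_limit": None,
        "knowledge_network": True, "bdc": False, "app_platform": True,
    },
    "Enterprise": {
        "index_limit": None,
        "knowledge_network": True, "bdc": True, "app_platform": True,
    },
}


def editions_supporting(required_features, documents):
    """Versions that include every required feature and can index
    the given number of documents."""
    matches = []
    for name, spec in EDITIONS.items():
        limit = spec["index_limit"]
        fits = limit is None or documents <= limit
        if fits and all(spec[feature] for feature in required_features):
            matches.append(name)
    return matches


print(editions_supporting({"bdc"}, 1_000_000))
print(editions_supporting({"knowledge_network"}, 500_000))
```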

Web

As you can probably tell by now, a large percentage of the search conference time was spent on SharePoint Server 2007. So what's coming up on the Web...

Live Search is the new name for MSN Search, which joins the portfolio of Live services currently in beta. It will include some interesting new enhancements, perhaps the most novel being the delivery of all results on a single page (you can see it for yourself by doing a search on http://www.live.com). It also includes a density bar that changes the concentration of results. For me, the most interesting feature demo'd at the conference was the user interface (UI) - it had a similar look and feel to the new UI in Live Search for Desktop (see next section of this post), very much in the same way that Outlook Web Access (OWA) mimics the full Outlook client as much as possible within the confines of the browser.

Desktop

Live Search for Desktop is the new name for MSN Desktop Search, and also joins the portfolio of Live services currently in beta. As already mentioned, the new user interface has many similarities to the new UI for Live Search in the web browser (for example, they both have tabs and a density bar) whilst retaining features from the current version such as the 'deskbar' view (a pane that pops up results as you type in the search, etc.). It now includes hit highlighting within results and the ability to drag and drop a file from a search result into a message in Outlook (currently you'd either have to remember where it was saved and retrieve it from the file store or open the document and use 'Send to' within the native application, if supported). There will be out-of-the-box support for over 200 file types (I did chuckle on hearing that, it sounds just like what Autonomy used to say...) plus a software development kit for additional custom types.

Search extends beyond the local computer to include indexing Exchange (both the inbox and public folders) and network file shares, as well as connecting to the index generated by SharePoint Server 2007 for intranet searches and Live Search on the web for Internet searches. Indexing network file shares on an individual basis should be treated with caution. Multiple users each indexing the same file share could clog up network bandwidth. Indexing Exchange content carries the same recommendation as Knowledge Network - make sure Outlook is configured for cached mode to prevent affecting performance of the Exchange Server.

As mentioned in the first post, Microsoft's Desktop Search is aiming to become the gateway for accessing all forms of content, structured (application data) and unstructured (documents and pages), local (the desktop) and networked (internal and external to the organisation). It will be interesting to see how this approach plays against Google's efforts to extend into the enterprise.

Not Included

So with the journey from OS to enterprise to web back to the desktop where it all started complete, what didn't get mentioned at the conference?

There was no discussion about futures regarding the long awaited replacement file system - WinFS, and the new search capabilities it was set to introduce. And there was no mention of Longhorn Server. More surprising was a lack of focus on tagging, given its surge in popularity over the past two years on the web. There are two types of tagging - manual (generated by users, examples include Amazon wish lists, Del.icio.us, Digg and blogging tools like TypePad and WordPress) and automatic (generated by the system, based on a training set of docs). Both have benefits for quickly locating related information, especially within social networks of peers who want to share their content. SharePoint Portal Server 2001 did include an automatic tagging feature (categorisation) that was actually quite good as long as your index was no larger than a couple of thousand documents. Lack of scale led to it disappearing and, according to the conference panel, it is not set to return just yet. SharePoint Portal Server 2003 does include some manual tagging features ('Add to Portal' link) and categories (where content can be promoted to the top of results by nominated moderators). I'm surprised if there really are no new developments in this space - the urgency to include blogs, wikis and RSS support in SharePoint Server 2007 should also have been applied to tagging.

There do not appear to have been improvements to indexing and searching media files. This is still a very immature area, but there are some great advancements being demonstrated by start-ups on the web, such as face recognition applied to pictures to auto-categorise your photos and early attempts at indexing audio and video content. Perhaps we will see developments from the Speech team and MS Research transferring to Search in the next release... (in fact, a session from the MS Research guys would have been a great ending to a very useful seminar.)

Finally, it is worth noting that Groove has not yet been added to the list of supported content sources indexed by SharePoint Server 2007. According to the panel on the day, discussions are taking place with the Groove team. I hope they end positively. Groove has an unclear future at the moment - its sessions at the SharePoint conference were not inspiring - and not being included in a SharePoint index should raise eyebrows.

Final final note - this is the fourth demo of SharePoint Server 2007 search I've seen that begins with a web page that looks oh so like Google. Maybe I'm just a stooge, but I don't like this approach (caveat: it was amusing the first time, which was an internal demo when I was still at MS). If the demo starts with a blatant copy of something well-established, it risks switching off the audience (or, worse for MS, makes them wonder if that's how the Google Enterprise Search appliance works...). If it were me, I'd start with a search box integrated into a normal intranet home page and also show it integrated into desktop search - I think that's how many organisations would use it by default and most don't really care that Microsoft has woken up to the benefits of a clean web page with a simple search box on the web.

That's all for now... this has turned into a much longer post than originally planned. I'll try and write a short summary of both posts, but it will have to join the queue - bit behind with writing at the moment...