Implementing Incremental Crawling for BDC data
Search is a very powerful component of the Business Data Catalog as it allows your users to find information about your business from many different line of business systems. The actual indexing and crawling process can be quite intensive both from a SharePoint perspective, but also in terms of your line of business system. If you have 100,000 customer records in your database you may want to be careful when want SharePoint to be querying that much data so not affect end user performance.
Thankfully when you configure the BDC and Search you can configure a full search and incremental searches the same way you can do when crawling normal SharePoint content such as sites and documents
– A full BDC crawl will index all the records
– An incremental BDC crawl will index only records who’s data has changed
But how does the indexer know which BDC records have changed? For it to know this we have to implement a property in our Entity called the __BdcLastModifiedTimestamp. Nice name huh!
Now a small admission also. Whenever we describe the IdEnumerator method we always say that it only returns the primary key fields for an entity. This is generally true – except for when you want to implement an incremental crawl. If you want to do this, your IdEnumerator method must also return a DateTime field that will indicate to the indexer when it was last modified. The indexer can then compare this to the previous LastModified value it holds and if it is different, it can index the entire row of data.
How to implement the BdcLastModifiedTimestamp in BDC Meta Man
1, Open BDC Meta Man and connect to SQL Server, for this example we are going to use the sample database AdventureWorks2000.
2, Drag and drop the Addresses table onto the design surface.
Notice this table has a field called ModifiedDate. This is updated for each row everytime the information in it is changed. We’ll use this as our BdcLastModifiedTimestamp column.
3, Right click on the entity and choose Edit Entity
4, In the Entity Admin screen select the Methods tab along the top
5, Select the IdEnumerator method from the list
6, When you select this you’ll notice a little exclamation mark appear towards the bottom right of the form. This indicates that the BdcLastModifiedTimestamp field has not been set for this entity.
Select ModifiedDate from the available columns and then click the ‘Set BdcLastModified Timestamp’ button and the highlighted column will move across.
7, Now save the entity, generate the XML application definition file, and import into SharePoint.
8, Please check this article on Business Data Catalog Search on how to configure your BDC content source.
How do we know an incremental crawl is working?
When you set an incremental crawl going (or it executes via schedule), how do you know the indexer is only getting the records that have changed and not the entire lot again?
Well if the crawl runs and takes only a few seconds rather than minutes – that’s a good indicator! But I’d rather be certain! Here’s how to do it with SQL Server 2008
1, Open up SQL Server 2008 Enterprise Manager
2, From the main menu open Tools – SQL Server Profiler. With this tool we can see exactly what queries are being executed against our database.
3, After choosing which server to connect to, in the Trace Properties window move to the Events Selection tab.
4, Click the Show all Columns checkbox, and then move the scroll bar along so you can see the Database Name column in the window. Select this column for all the types of events.
5, Now you need to add a filter, so click the Column Filters button
6, Select the Database Name column from the list on the left, expand the Like tree node and add in the name of your database. Finally click the checkbox to exlude rows that do not contain values.
7, Click OK to add the Filter, and then Run back on the Events Selection Window.
Now Query Profiler is looking at every query being run on our AdventureWorks2000 database. Make sure you have at least one record of data modified since you did a full crawl of your content source and that the necessary BdcLastModifiedTimestamp field has been updated. Now start an incremental crawl going. You will see at least two queries being executed:
1, The IdEnumerator query is run and pulls back all the primary key fields and BdcLastModifiedTimestamp values.
2, After checking the BdcLastModifiedTimestamp the indexer will execute a specific finder query for any rows that have been updated.