ShowTable of Contents
Introduction
By default, incremental crawling of IBM® Web Content Manager
TM (WCM) content should be quite fast and perform well because, by its very nature, it should be handling changes in content since the prior crawl. However, depending on various factors, some customers may notice that performance is affected significantly enough that sometimes it takes just as long---if not longer---than a full base crawl that would ordinarily handle all content. Typically the following factors are relevant:
- Number of site areas and content in general. More site areas means more path and content to process.
- Schedule and frequency of crawl. Crawling during minimum activity periods ensures the activity and changes on the server are minimal, and the crawl is likely to finish faster.
- Frequency of design changes. Modifying, adding, or deleting site areas implies changes for that site area and all content under that path, meaning more work during the crawl
This article discusses some simple steps to identify possible problems from incremental crawl logs, beginning with the necessary and useful trace parameters, followed by the process to measure time to generate a seedlist.
The queries associated with the seedlist in an incremental crawl are listed, and we examine the performance of the individual queries and discuss an approach to duplicate and verify any problem query performance, finishing up with possible recommendations to enhance incremental crawl performance.
Relevant troubleshooting trace parameters
The following Java
TM Content Repository (JCR) and WCM traces are useful in debugging and troubleshooting crawling issues:
com.ibm.icm.*=finest:com.ibm.icm.ci.schema.impl.SchemaService=info
com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer=all
com.ibm.workplace.wcm.seedlist.*=finer
The following IBM Support MustGather Technotes detail all the possible trace options:
• "Collecting Data: Problems with JCR for WebSphere Portal 7.0"
• "Collecting Data: Seedlist and Search for Web Content Manager 7.0"Seedlist operations and main queries
Much of the discussion in this document centers around understanding the activities involved in a crawl. The primary activity is generating an incremental crawl seedlist that accounts for all the changes since the previous crawl. The seedlist is then used as the basis for documents that need to be fetched and added to the search collection.
The seedlist generation process consists of various queries that are used to determine what has been changed (added, modified, deleted). The main queries are listed below, but here is a sample use case for seedlist generation:
During a crawl (and/or generic WCM activities) IBM WebSphere® Portal sends a request (for an incremental seedlist) in the form of an XPath query. The XPath is then translated into an SQL query, executed at the database, and the results are returned to the Portal server for further processing and later fetching of each document noted in the returned seedlist.
So, obviously, there are two views of each query, XPath and SQL. For simplicity's sake, the SQL and XPath statements matching the queries below are in Section 8 of this document, “
Appendix - XPATH and SQL syntax for core queries." Refer to that section for the raw SQL and XPath statements.
Below are the five main queries. Note that the various IDs, timestamps, etc. differ for every WCM implementation and are only used for demonstration purposes.
QUERY 1: Find all content, site areas, and content links under the site area “53227d92-aa26-4fed-9462-71c2aaed3cc9” that are published or expired, have been modified since <timestamp>, and order the results by the UUID value.
QUERY 2: Find all draft content, content links, or file components in library with UUID “027ef7bf-0efd-4876-92ed-36ca4c1a15e3” modified since <timestamp>.
QUERY 3: Find all deleted content, content links, or file components in library with UUID “ff27da4f-bb61-4cb5-84b6-e4d7523b8bbe” modified since <timestamp>.
QUERY 4: Find all sites areas under a specified UUID where the site areas have been modified since <TIMESTAMP>.
QUERY 5: Find all site area items (siteareas, webcontent, webcontlink) under specified UUID with workflowstatus=1.
We will have more discussion on the queries later but, for troubleshooting purposes, we are mostly interested in how long the seedlist generation process takes. For this reason you can run various commands to time or assess the seedlist generation performance in lieu of the whole incremental crawl process, which where the most problems and debugging take place.
To this end, you can use the following time command and the wget program (Linux®/UNIX®) to measure the entire seedlist creation time, replacing tokens with appropriate ones from your environment:
time wget -O Seedlist.xml -T 7200 "https://<host>:<secure-port>/seedlist/myserver?SeedlistId=<someID>&Source=com.ibm.workplace.wcm.plugins.seedlist.retriever.WCMRetrieverFactory&Action=GetDocuments&Range=10000" --no-check-certificate --http-user=<searchUser> --http-password=<searchUserPwd>
If successful, the time it took for the seedlist to be generated will be noted, and a seedlist.xml file with seedlist information is created at the location from which the command was issued. The time should be used as reference for future debugging.
Note also that the time returned is for both executing the queries and then building the seedlist. The query execution is usually in seconds, but the seedlist document creation can take some more time. Obviously, the amount of content information contained in the seedlist affects how long it takes to be created. Our principal interest is in the query execution performance, which is discussed in the next section.
Query performance
Determining how much time is spent in each query associated with the seedlist generation process is often the primary focus of troubleshooting incremental crawl. With appropriate traces enabled, the following shows excerpt sample trace entries (in trace.log) that indicate how much time is spent in each of the five main queries:
QUERY 1 (4 seconds)
6/4/12 13:05:46:126 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery Processing query: //element(*, ibmcontentwcm:siteArea) [@jcr:uuid = '53227d92-aa26-4fed-9462-71c2aaed3cc9']//(element(*,ibmcontentwcm:siteArea) | element(*, ibmcontentwcm:webContent) | element(*, ibmcontentwcm:webContentLink))
[(@ibmcontentwcm:workflowStatus = 1 or @ibmcontentwcm:workflowStatus = 2) and @icm:lastModified >= '2012-04-29T07:51:17.872-05:00'] order by @jcr:uuid ascending
...
[6/4/12 13:05:50:198 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery
Finished Query in 4,072 ms.
QUERY 2 (less than 1 second)
[6/4/12 13:06:05:798 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery Processing query: (//element(*, ibmcontentwcm:webFolder) [@jcr:uuid = '027ef7bf-0efd-4876-92ed-36ca4c1a15e3'])/element(*, ibmcontentwcm:draftSummary)[@icm:lastModified > '2012-04-29T07:51:17.872-05:00' and@ibmcontentwcm:draftClassification = ('Content','Content Link','File Component')]
...
.[6/4/12 13:06:06:445 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery
Finished Query in 647 ms.
QUERY 3 (less than 1 second)
Find all deleted content, content links, or file components in library 'ff27da4f-bb61-4cb5-84b6-e4d7523b8bbe' modified since <timestamp>
[6/4/12 13:06:06:455 CDT] 00000051 JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer processQuery
Processing query: (//element(*, ibmcontentwcm:webFolder)
[@jcr:uuid = 'ff27da4f-bb61-4cb5-84b6-e4d7523b8bbe'])/element(*,
ibmcontentwcm:deletionSummary)[@icm:lastModified > '2012-04-29T07:51:17.872-05:00' and @ibmcontentwcm:deletionClassification = ('Content','Content Link','File Component')]
….
[6/4/12 13:06:07:248 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery
Finished Query in 793 ms.
QUERY 4 (2 seconds)
[6/4/12 13:06:07:562 CDT] 00000051 JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
hasUpdatedSiteAreas Processing query: //element(*, ibmcontentwcm:siteArea)[@jcr:uuid = '53227d92-aa26-4fed-9462-71c2aaed3cc9']//element(*,ibmcontentwcm:siteArea) [@jcr:lastModified>='2012-04-29T07:51:17.872-05:00']
….
[6/4/12 13:06:09:626 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
hasUpdatedSiteAreas
Finished Query in 2,064 ms.
QUERY 5 (3572 seconds!!!!)
[6/4/12 13:06:09:642 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery Processing query: //element(*, ibmcontentwcm:siteArea)
[@jcr:uuid = '53227d92-aa26-4fed-9462- 71c2aaed3cc9']//element(*,ibmcontentwcm:siteArea)
[@jcr:lastModified>='2012-04-29T07:51:17.872-05:00']//(element(*, ibmcontentwcm:siteArea) | element(*, ibmcontentwcm:webContent) | element(*, ibmcontentwcm:webContentLink)) [@ibmcontentwcm:workflowStatus = 1] order by @jcr:uuid ascending
….
[6/4/12 14:05:42:038 CDT] 00000051
JCRItemGather 2 com.ibm.workplace.wcm.seedlist.gatherer.JCRItemGatherer
processQuery
Finished Query in 3,572,396 ms.
In general each query should take less than 10 seconds and, in most cases, takes significantly less. Query 5 above took nearly an hour (3600 seconds) and clearly would be cause for concern. The next logical step would be to find out what makes that operation so expensive.
Verifying any site area updates
Incremental crawling in an environment in which site areas are frequently modified is quite processor intensive. These modifications refer to deletions/additions/modifications of site area names (and titles) and, to a lesser extent, site area content. Search for the following trace strings for the possibility of updated site area:
“hasUpdatedSiteAreas RETURN true”
Any occurrence indicates that site areas have changed or are being updated. Here is a basic explanation as to why any such changes are usually expensive:
Seedlist generation during an incremental crawl is stateless, meaning that previous values of a site area may not be known. However, when the hasUpdatedSiteAreas call returns “true,” we know only that there have been changes to the site area and no idea where exactly the change occurred.
Therefore all paths of and all content under the site area must be added again to ensure the updates have been accounted for. This is because a path and content under the site area will have the new site area information in their path information, and thus must be essentially recrawled.
The basic key here is, when hasUpdatedSiteAreas returns true, it is likely the operations were rather expensive and played a major role in the query's performance. Obviously, the effect will depend on the number of site areas and sub-site areas, and where the changes were made with respect to the depth or level of the site-area navigation structure.
Investigating individual queries
This section uses Query 5 from above as the basis for investigating the problem query. We isolate the query and attempt to duplicate the performance, using either or both XPath queries per the IBM Collaboration Solutions Catalog site, “
IBM Support Tools portlet for IBM Web Content Manager and raw SQL queries via the db2batch benchmarking tool.
Using XPATH queries
1. Download the
IBM Support Tools portlet for Lotus WCM, install it, and add the portlet to a page.
2. Access the portlet, select “Run XPath Query”, enter the appropriate XPath query as below, and execute:
element(*, ibmcontentwcm:siteArea)[@jcr:uuid = '53227d92-aa26-4fed-9462-71c2aaed3cc9']element(*,ibmcontentwcm:siteArea)
[@jcr:lastModified>='2012-04-29T07:51:17.872-05:00'(element(*, ibmcontentwcm:siteArea) | element(*, ibmcontentwcm:webContent) | element(*, ibmcontentwcm:webContentLink)) [@ibmcontentwcm:workflowStatus = 1] order by @jcr:uuid ascending
Upon completion, the query times should be noted and, ideally, should confirm the earlier analysis as the problem activity. Comprehensive information about using the WCM Tools portlet is in the developerWorks white paper, “
IBM Support Tools portlet for IBM Web Content Manager: Empowering you!”. Section 3.2 of the paper, “Run Xpath Query,” provides details on running XPATH queries.
Using SQL queries
Typically you should also run the query at the database end. It is important to note that the query must be run with the db2batch benchmarking tool and with the option of using the parameter marker. If run the db2batch command with the SQL statement and just the values, it will likely not show any significant signs of poor query performance, so be sure to run the query via DB2batch and with parameter markers.
Below are the strings from the log for the problem query. You could search for “Generated SQL with param markers included:” between the start and end of the query, as noted in Section 4 above. The relevant
portion is below, but you may need your Database Administrator or the appropriate database support to assist in the data collection. For DB2, refer to the MustGather document, “
Collecting Data for DB2 Compiler Issues
To run db2batch for a query with parameter markers, use the -m option and specify the values in a file:
Generated SQL with param markers included:
WITH ALLLEVELS AS (SELECT Links_21.SIID , Links_21.SVID , Links_22.TIID , Links_22.TVID , Links_22.TIX , Links_22.TCTID , Links_22.TNAME , Links_22.TNSID FROM jcr.ICMSTJCRLV00001 Links_21, jcr.ICMSTJCRLR00001 LinkRel_23, jcr.ICMSTJCRLV00001 Links_22 WHERE (Links_21.LID = LinkRel_23.ALID) AND (Links_22.LID = LinkRel_23.DLID)) SELECT DISTINCT ALLLEVELS_4.TIID , 1 WSID , NodesTab_17.VID , NodesTab_17.CTID , NodesTab_17.UUID , NodesTab_17.COMPID , Properties_19.PROPVAL ORDERVAL20 FROM ALLLEVELS ALLLEVELS_4, jcr.ICMSTJCRLV00001 Links_1, jcr.ICMSTJCRN00001 NodesTab_17, (SELECT NodesTab_18.IID ITEMID , NodesTab_18.VID VERSIONID , NodesTab_18.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_18 ) Properties_19 WHERE (((((((ALLLEVELS_4.TCTID = 1580) OR (ALLLEVELS_4.TCTID = 1668)) OR (ALLLEVELS_4.TCTID = 1331)) AND ((Links_1.TCTID = 1668) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT NodesTab_2.IID ITEMID , NodesTab_2.VID VERSIONID , NodesTab_2.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_2 ) Properties_3 WHERE ((Properties_3.ITEMID = Links_1.TIID) AND (Properties_3.VERSIONID = Links_1.TVID)) AND (Properties_3.PROPVAL = x'35333232376439322d616132362d346665642d393436322d373163326161656433636339'))))) AND (ALLLEVELS_4.SIID = Links_1.TIID)) AND (((EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_5.ITEMID , ibmcontentwcm_5.VERSIONID , ibmcontentwcm_5.ATTR0000001129 PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_5 UNION ALL SELECT ibmcontentwcm_6.ITEMID , ibmcontentwcm_6.VERSIONID , ibmcontentwcm_6.ATTR0000001129 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_6 UNION ALL SELECT ibmcontentwcm_7.ITEMID , ibmcontentwcm_7.VERSIONID , ibmcontentwcm_7.ATTR0000001129 PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_7 ) Properties_8 WHERE ((Properties_8.ITEMID = ALLLEVELS_4.TIID) AND (Properties_8.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_8.PROPVAL = 1))) OR (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_9.ITEMID , ibmcontentwcm_9.VERSIONID , ibmcontentwcm_9.ATTR0000001129 PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_9 UNION ALL SELECT ibmcontentwcm_10.ITEMID , ibmcontentwcm_10.VERSIONID , ibmcontentwcm_10.ATTR0000001129 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_10 UNION ALL SELECT ibmcontentwcm_11.ITEMID , ibmcontentwcm_11.VERSIONID , ibmcontentwcm_11.ATTR0000001129 PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_11 ) Properties_12 WHERE ((Properties_12.ITEMID = ALLLEVELS_4.TIID) AND (Properties_12.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_12.PROPVAL = 2)))) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_13.ITEMID , ibmcontentwcm_13.VERSIONID , ibmcontentwcm_13.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_13 UNION ALL SELECT ibmcontentwcm_14.ITEMID , ibmcontentwcm_14.VERSIONID , ibmcontentwcm_14.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_14 UNION ALL SELECT ibmcontentwcm_15.ITEMID , ibmcontentwcm_15.VERSIONID , ibmcontentwcm_15.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_15 ) Properties_16 WHERE ((Properties_16.ITEMID = ALLLEVELS_4.TIID) AND (Properties_16.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_16.PROPVAL >= '2012-04-29 12:51:17.872'))))) AND (ALLLEVELS_4.TIID = NodesTab_17.IID)) AND ((Properties_19.ITEMID = ALLLEVELS_4.TIID) AND (Properties_19.VERSIONID = ALLLEVELS_4.TVID)) ORDER BY ORDERVAL20 ASC
Recommendations
Once the problem query has been identified and confirmed, you can take steps to either improve the query performance or make other changes to alleviate its effects, for example:
- If you change your site areas all the time, then using incremental search is NOT ADVISED! In this scenario you should ALWAYS crawl ALL ITEMS anew. So in the crawler, enable "Force full crawl", to disable incremental crawling.
- If incremental crawls are necessary in your environment, they should be infrequent and done during periods of minimal activity on the server.
- Explore and add additional indexes for the problem queries, starting with the Technote, “Additional Indexes on JCR table ICMSTJCRLINKS may be necessary to improve WCM incremental crawl performance."
Basically, work on the various optimization levels for the queries and review which indexes are useful for the your environment. Everyone has different WCM implementation and usage patterns, so the exact indexes will be unique to your environment. The important point is that you should engage your Database Administrator to investigate optimizing any problem queries and adding necessary indexes.
Appendix: XPATH and SQL syntax for core queries
QUERY 1
Find all content, site areas, and content links under the site area '53227d92-aa26-4fed-9462-71c2aaed3cc9' that are published or expired, have been modified since <timestamp>, and order the results by the UUID value:
XPATH
//element(*, ibmcontentwcm:siteArea) [@jcr:uuid = '53227d92-aa26-4fed-9462-71c2aaed3cc9']//(element(*,ibmcontentwcm:siteArea) | element(*, ibmcontentwcm:webContent) | element(*, ibmcontentwcm:webContentLink))[(@ibmcontentwcm:workflowStatus = 1 or @ibmcontentwcm:workflowStatus = 2) and @icm:lastModified >= '2012-04-29T07:51:17.872-05:00'] order by @jcr:uuid ascending
SQL
Generated SQL with param markers included:
WITH ALLLEVELS AS (SELECT Links_21.SIID , Links_21.SVID , Links_22.TIID , Links_22.TVID , Links_22.TIX , Links_22.TCTID , Links_22.TNAME , Links_22.TNSID FROM jcr.ICMSTJCRLV00001 Links_21, jcr.ICMSTJCRLR00001 LinkRel_23, jcr.ICMSTJCRLV00001 Links_22 WHERE (Links_21.LID = LinkRel_23.ALID) AND (Links_22.LID = LinkRel_23.DLID)) SELECT DISTINCT ALLLEVELS_4.TIID , 1 WSID , NodesTab_17.VID , NodesTab_17.CTID , NodesTab_17.UUID , NodesTab_17.COMPID , Properties_19.PROPVAL ORDERVAL20 FROM ALLLEVELS ALLLEVELS_4, jcr.ICMSTJCRLV00001 Links_1, jcr.ICMSTJCRN00001 NodesTab_17, (SELECT NodesTab_18.IID ITEMID , NodesTab_18.VID VERSIONID , NodesTab_18.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_18 ) Properties_19 WHERE (((((((ALLLEVELS_4.TCTID = 1580) OR (ALLLEVELS_4.TCTID = 1668)) OR (ALLLEVELS_4.TCTID = 1331)) AND ((Links_1.TCTID = 1668) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT NodesTab_2.IID ITEMID , NodesTab_2.VID VERSIONID , NodesTab_2.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_2 ) Properties_3 WHERE ((Properties_3.ITEMID = Links_1.TIID) AND (Properties_3.VERSIONID = Links_1.TVID)) AND (Properties_3.PROPVAL = x'35333232376439322d616132362d346665642d393436322d373163326161656433636339'))))) AND (ALLLEVELS_4.SIID = Links_1.TIID)) AND (((EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_5.ITEMID , ibmcontentwcm_5.VERSIONID , ibmcontentwcm_5.ATTR0000001129 PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_5 UNION ALL SELECT ibmcontentwcm_6.ITEMID , ibmcontentwcm_6.VERSIONID , ibmcontentwcm_6.ATTR0000001129 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_6 UNION ALL SELECT ibmcontentwcm_7.ITEMID , ibmcontentwcm_7.VERSIONID , ibmcontentwcm_7.ATTR0000001129 PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_7 ) Properties_8 WHERE ((Properties_8.ITEMID = ALLLEVELS_4.TIID) AND (Properties_8.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_8.PROPVAL = 1))) OR (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_9.ITEMID , ibmcontentwcm_9.VERSIONID , ibmcontentwcm_9.ATTR0000001129 PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_9 UNION ALL SELECT ibmcontentwcm_10.ITEMID , ibmcontentwcm_10.VERSIONID , ibmcontentwcm_10.ATTR0000001129 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_10 UNION ALL SELECT ibmcontentwcm_11.ITEMID , ibmcontentwcm_11.VERSIONID , ibmcontentwcm_11.ATTR0000001129 PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_11 ) Properties_12 WHERE ((Properties_12.ITEMID = ALLLEVELS_4.TIID) AND (Properties_12.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_12.PROPVAL = 2)))) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_13.ITEMID , ibmcontentwcm_13.VERSIONID , ibmcontentwcm_13.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_13 UNION ALL SELECT ibmcontentwcm_14.ITEMID , ibmcontentwcm_14.VERSIONID , ibmcontentwcm_14.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_14 UNION ALL SELECT ibmcontentwcm_15.ITEMID , ibmcontentwcm_15.VERSIONID , ibmcontentwcm_15.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_15 ) Properties_16 WHERE ((Properties_16.ITEMID = ALLLEVELS_4.TIID) AND (Properties_16.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_16.PROPVAL >= '2012-04-29 12:51:17.872'))))) AND (ALLLEVELS_4.TIID = NodesTab_17.IID)) AND ((Properties_19.ITEMID = ALLLEVELS_4.TIID) AND (Properties_19.VERSIONID = ALLLEVELS_4.TVID)) ORDER BY ORDERVAL20 ASC
QUERY 2
Find all draft content, content links, or file components in library 027ef7bf-0efd-4876-92ed-36ca4c1a15e3 modified since <timestamp>:
XPATH
//element(*, ibmcontentwcm:webFolder) [@jcr:uuid = '027ef7bf-0efd-4876-92ed-36ca4c1a15e3'])/element(*,
ibmcontentwcm:draftSummary)[@icm:lastModified > '2012-04-29T07:51:17.872-05:00' and @ibmcontentwcm:draftClassification = ('Content','Content Link','File Component')]
SQL
Generated SQL with param markers included:
SELECT DISTINCT Links_4.TIID , 1 WSID , NodesTab_9.VID , NodesTab_9.CTID , NodesTab_9.UUID , NodesTab_9.COMPID FROM jcr.ICMSTJCRLV00001 Links_4, jcr.ICMSTJCRLV00001 Links_1, jcr.ICMSTJCRN00001 NodesTab_9 WHERE ((((Links_4.TCTID = 1602) AND ((Links_1.TCTID = 1230) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT NodesTab_2.IID ITEMID , NodesTab_2.VID VERSIONID , NodesTab_2.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_2 ) Properties_3 WHERE ((Properties_3.ITEMID = Links_1.TIID) AND (Properties_3.VERSIONID = Links_1.TVID)) AND (Properties_3.PROPVAL = x'30323765663762662d306566642d343837362d393265642d333663613463316131356533'))))) AND (Links_4.SIID = Links_1.TIID)) AND ((EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_5.ITEMID , ibmcontentwcm_5.VERSIONID , ibmcontentwcm_5.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01602001 ibmcontentwcm_5 ) Properties_6 WHERE ((Properties_6.ITEMID = Links_4.TIID) AND (Properties_6.VERSIONID = Links_4.TVID)) AND (Properties_6.PROPVAL > '2012-04-29 12:51:17.872'))) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_7.ITEMID , ibmcontentwcm_7.VERSIONID , ibmcontentwcm_7.ATTR0000001388 PROPVAL FROM jcr.ICMUT01602001 ibmcontentwcm_7 ) Properties_8 WHERE ((Properties_8.ITEMID = Links_4.TIID) AND (Properties_8.VERSIONID = Links_4.TVID)) AND (Properties_8.PROPVAL IN ('Content' , 'Content Link' , 'File Component' )))))) AND (Links_4.TIID = NodesTab_9.IID)
QUERY 3
Find all deleted content, content links, or file components in library 'ff27da4f-bb61-4cb5-84b6-e4d7523b8bbe' modified since <timestamp>
XPATH
//element(*, ibmcontentwcm:webFolder) [@jcr:uuid = 'ff27da4f-bb61-4cb5-84b6-e4d7523b8bbe'])/element(*, ibmcontentwcm:deletionSummary)[@icm:lastModified > '2012-04-
29T07:51:17.872-05:00' and @ibmcontentwcm:deletionClassification = ('Content','Content Link','File Component')]
SQL
Generated SQL with param markers included:
SELECT DISTINCT Links_4.TIID , 1 WSID , NodesTab_9.VID , NodesTab_9.CTID , NodesTab_9.UUID , NodesTab_9.COMPID FROM jcr.ICMSTJCRLV00001 Links_4, jcr.ICMSTJCRLV00001 Links_1, jcr.ICMSTJCRN00001 NodesTab_9 WHERE ((((Links_4.TCTID = 1692) AND ((Links_1.TCTID = 1230) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT NodesTab_2.IID ITEMID , NodesTab_2.VID VERSIONID , NodesTab_2.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_2 ) Properties_3 WHERE ((Properties_3.ITEMID = Links_1.TIID) AND (Properties_3.VERSIONID = Links_1.TVID)) AND (Properties_3.PROPVAL = x'66663237646134662d626236312d346362352d383462362d653464373532336238626265'))))) AND (Links_4.SIID = Links_1.TIID)) AND ((EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_5.ITEMID , ibmcontentwcm_5.VERSIONID , ibmcontentwcm_5.LASTCHANGEDTS PROPVAL FROM jcr.ICMUT01692001 ibmcontentwcm_5 ) Properties_6 WHERE ((Properties_6.ITEMID = Links_4.TIID) AND (Properties_6.VERSIONID = Links_4.TVID)) AND (Properties_6.PROPVAL > '2012-04-29 12:51:17.872'))) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_7.ITEMID , ibmcontentwcm_7.VERSIONID , ibmcontentwcm_7.ATTR0000001395 PROPVAL FROM jcr.ICMUT01692001 ibmcontentwcm_7 ) Properties_8 WHERE ((Properties_8.ITEMID = Links_4.TIID) AND (Properties_8.VERSIONID = Links_4.TVID)) AND (Properties_8.PROPVAL IN ('Content' , 'Content Link' , 'File Component' )))))) AND (Links_4.TIID = NodesTab_9.IID)
QUERY 4
Find all sitesAreas under a specified UUID where the site areas have been modified since <TIMESTAMP>
XPATH
//element(*,
ibmcontentwcm:siteArea)[@jcr:uuid = '53227d92-aa26-4fed-9462-
71c2aaed3cc9']//element(*,ibmcontentwcm:siteArea)
[@jcr:lastModified>='2012-04-29T07:51:17.872-05:00']
SQL
Generated SQL with param markers included:
WITH ALLLEVELS AS (SELECT Links_8.SIID , Links_8.SVID , Links_9.TIID , Links_9.TVID , Links_9.TIX , Links_9.TCTID , Links_9.TNAME , Links_9.TNSID FROM jcr.ICMSTJCRLV00001 Links_8, jcr.ICMSTJCRLR00001 LinkRel_10, jcr.ICMSTJCRLV00001 Links_9 WHERE (Links_8.LID = LinkRel_10.ALID) AND (Links_9.LID = LinkRel_10.DLID)) SELECT DISTINCT ALLLEVELS_4.TIID , 1 WSID , NodesTab_7.VID , NodesTab_7.CTID , NodesTab_7.UUID , NodesTab_7.COMPID FROM ALLLEVELS ALLLEVELS_4, jcr.ICMSTJCRLV00001 Links_1, jcr.ICMSTJCRN00001 NodesTab_7 WHERE ((((ALLLEVELS_4.TCTID = 1668) AND ((Links_1.TCTID = 1668) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT NodesTab_2.IID ITEMID , NodesTab_2.VID VERSIONID , NodesTab_2.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_2 ) Properties_3 WHERE ((Properties_3.ITEMID = Links_1.TIID) AND (Properties_3.VERSIONID = Links_1.TVID)) AND (Properties_3.PROPVAL = x'35333232376439322d616132362d346665642d393436322d373163326161656433636339'))))) AND (ALLLEVELS_4.SIID = Links_1.TIID)) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_5.ITEMID , ibmcontentwcm_5.VERSIONID , ibmcontentwcm_5.ATTR0000001002 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_5 ) Properties_6 WHERE ((Properties_6.ITEMID = ALLLEVELS_4.TIID) AND (Properties_6.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_6.PROPVAL >= '2012-04-29 12:51:17.872')))) AND (ALLLEVELS_4.TIID = NodesTab_7.IID)
QUERY 5
Find all site area items (site areas, webcontent, webcontlink) under specified UUID with workflowstatus=1:
XPATH
//element(*, ibmcontentwcm:siteArea)
[@jcr:uuid = '53227d92-aa26-4fed-9462-
71c2aaed3cc9']//element(*,ibmcontentwcm:siteArea)
[@jcr:lastModified>='2012-04-29T07:51:17.872-05:00']//(element(*,
ibmcontentwcm:siteArea) | element(*, ibmcontentwcm:webContent) |
element(*, ibmcontentwcm:webContentLink)) [@ibmcontentwcm:workflowStatus = 1] order by @jcr:uuid ascending
SQL
Generated SQL with param markers included:
WITH ALLLEVELS AS (SELECT Links_16.SIID , Links_16.SVID , Links_17.TIID , Links_17.TVID , Links_17.TIX , Links_17.TCTID , Links_17.TNAME , Links_17.TNSID FROM jcr.ICMSTJCRLV00001 Links_16, jcr.ICMSTJCRLR00001 LinkRel_18, jcr.ICMSTJCRLV00001 Links_17 WHERE (Links_16.LID = LinkRel_18.ALID) AND (Links_17.LID = LinkRel_18.DLID)) SELECT DISTINCT ALLLEVELS_7.TIID , 1 WSID , NodesTab_12.VID , NodesTab_12.CTID , NodesTab_12.UUID , NodesTab_12.COMPID , Properties_14.PROPVAL ORDERVAL15 FROM ALLLEVELS ALLLEVELS_7, ALLLEVELS ALLLEVELS_4, jcr.ICMSTJCRLV00001 Links_1, jcr.ICMSTJCRN00001 NodesTab_12, (SELECT NodesTab_13.IID ITEMID , NodesTab_13.VID VERSIONID , NodesTab_13.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_13 ) Properties_14 WHERE (((((((ALLLEVELS_7.TCTID = 1580) OR (ALLLEVELS_7.TCTID = 1668)) OR (ALLLEVELS_7.TCTID = 1331)) AND ((((ALLLEVELS_4.TCTID = 1668) AND ((Links_1.TCTID = 1668) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT NodesTab_2.IID ITEMID , NodesTab_2.VID VERSIONID , NodesTab_2.UUID PROPVAL FROM jcr.ICMSTJCRN00001 NodesTab_2 ) Properties_3 WHERE ((Properties_3.ITEMID = Links_1.TIID) AND (Properties_3.VERSIONID = Links_1.TVID)) AND (Properties_3.PROPVAL = x'35333232376439322d616132362d346665642d393436322d373163326161656433636339'))))) AND (ALLLEVELS_4.SIID = Links_1.TIID)) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_5.ITEMID , ibmcontentwcm_5.VERSIONID , ibmcontentwcm_5.ATTR0000001002 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_5 ) Properties_6 WHERE ((Properties_6.ITEMID = ALLLEVELS_4.TIID) AND (Properties_6.VERSIONID = ALLLEVELS_4.TVID)) AND (Properties_6.PROPVAL >= '2012-04-29 12:51:17.872'))))) AND (ALLLEVELS_7.SIID = ALLLEVELS_4.TIID)) AND (EXISTS (SELECT CAST(NULL AS CHAR(1)) FROM (SELECT ibmcontentwcm_8.ITEMID , ibmcontentwcm_8.VERSIONID , ibmcontentwcm_8.ATTR0000001129 PROPVAL FROM jcr.ICMUT01580001 ibmcontentwcm_8 UNION ALL SELECT ibmcontentwcm_9.ITEMID , ibmcontentwcm_9.VERSIONID , ibmcontentwcm_9.ATTR0000001129 PROPVAL FROM jcr.ICMUT01668001 ibmcontentwcm_9 UNION ALL SELECT ibmcontentwcm_10.ITEMID , ibmcontentwcm_10.VERSIONID , ibmcontentwcm_10.ATTR0000001129 PROPVAL FROM jcr.ICMUT01331001 ibmcontentwcm_10 ) Properties_11 WHERE ((Properties_11.ITEMID = ALLLEVELS_7.TIID) AND (Properties_11.VERSIONID = ALLLEVELS_7.TVID)) AND (Properties_11.PROPVAL = 1)))) AND (ALLLEVELS_7.TIID = NodesTab_12.IID)) AND ((Properties_14.ITEMID = ALLLEVELS_7.TIID) AND (Properties_14.VERSIONID = ALLLEVELS_7.TVID)) ORDER BY ORDERVAL15 ASC
Conclusion
Incremental crawling is a powerful addition to maintaining an up-to-date search collection. With efficient scheduling and necessary custom index tuning, you can adequately maintain the primary feature of incremental crawling, that is, performance.
Tell us what you think
Please visit this link to take a one-question survey about this article:
About the author
Mark Owusu-Ansah is an Advisory Software Engineer who has worked on ICS products (Lotus Domino, Lotus Sametime, WebSphere Portal, etc.) since joining IBM in 1999. His current focus is on WebSphere Portal Runtime and Performance, and on Search Support. He holds a Master of Software Engineering Degree from Pennsylvania State University. He can be reached at
Mark_Owusu-Ansah@us.ibm.com.
Acknowledgement
The author extends a special thanks to Sascha Sambale.