IBM WebSphere Portal and Lotus Web Content Management search: Essentials and best practices
Anuradha D Chitta
Advisory Software Engineer
IBM Software Group
Bangalore, KA India
July 2009
Summary: This article presents a comprehensive review of IBM® WebSphere® Portal / Lotus® Web Content Management search artifacts and the best practices for realizing optimal search performance and scalability.
Contents
1 Basics of search engine functionality
2 Search services
3 Best practices
4 Conclusion
5 Resources
6 About the author
1 Basics of search engine functionality
Search functionality is an integral part of any Web site. WebSphere Portal provides all the search features out of the box to enable Web site/portal search with very few configuration changes.
What are search crawlers and how do they work?
A search crawler runs in the background collecting and indexing documents configured under the Content Source URL. Each search collection can have multiple Content Sources and can be scheduled to crawl at certain time intervals. When a new collection is created, it creates a directory with the name of the collection under the path specified in “DefaultCollectionsDirectory” specific to the service being used.
What are robots?
The robots.txt file tells search engines which parts of your site they may or may not “crawl”. Search engines, also referred to as “robots”, first look at the site’s /robots.txt to determine whether they are allowed to crawl the site.
The following example restricts any agent from crawling this site:
User-agent: *
Disallow: /
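As a minimal sketch of how a well-behaved crawler might evaluate such a file, the following uses the robots.txt parser from the Python standard library. The agent name and URLs are placeholders chosen for illustration; this is not how the Portal crawler itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" and the URL are hypothetical values used only for this sketch.
blocked = may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.example.com/wps/portal")
```

With the file shown above, `may_crawl` returns False for every agent and every path, because `Disallow: /` covers the whole site.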
What is a seedlist?
When a Portal Site crawl is selected while creating a new Content Source, the seedlist URL is auto-generated by Portal Server. This URL generates a list of all the portlets and the associated metadata, including security/access information.
The Web Content Management seedlist URL is created as part of the Content Source when the site is made searchable from the Authoring UI Site Form. This seedlist contains a list of all the content items that are part of that site. The seedlist output includes all the Web Content Management metadata like the title, author, templates, and security info.
To view the contents of this seedlist, access the content source URL in a browser and append &userid=&password=&debug=1. This shows the links being crawled, across multiple pages.
What is a cleanup daemon and how are the documents kept updated?
When a search crawl is run, it compares the existing links to the link being fetched, and if there is any change, it will update the index. If the link does not exist anymore, it will mark the link as invalid.
The cleanup daemon runs at midnight every day and checks for these broken links. It checks whether the time set in the Content Source property “Remove broken links after” has elapsed and, if so, removes the broken links. Broken links are deleted only after a crawl has marked them as “broken” and a subsequent cleanup daemon run has processed them.
The cleanup daemon can be configured to run immediately after a crawl, or at a time other than midnight, by setting the property CLEAN_UP_TIME_OF_DAY_HOURS to a value from 0 to 24. Based on this setting, the cleanup daemon runs at the specified time to remove outdated files and broken links.
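The two-step rule described above (a crawl marks a link as broken, and a later cleanup run removes it once the retention period has passed) can be sketched as follows. The class and function names are hypothetical, and the comparison logic is an assumption made for illustration, not the daemon's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IndexedLink:
    url: str
    # Set by a crawl when the link can no longer be fetched; None means valid.
    broken_since: Optional[datetime] = None


def links_to_remove(links: List[IndexedLink],
                    now: datetime,
                    remove_after: timedelta) -> List[str]:
    """Return URLs a cleanup run would delete.

    A link qualifies only if a crawl has already marked it broken and the
    "Remove broken links after" period (remove_after) has elapsed.
    """
    return [
        link.url
        for link in links
        if link.broken_since is not None
        and now - link.broken_since >= remove_after
    ]
```

A link that has disappeared but has not yet been visited by a crawl still has `broken_since` set to None, so it survives the cleanup run, which matches the behavior described above.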
Client interfaces
Search & Browse (S&B) and Search Center portlets are provided out of the box with WebSphere Portal. These portlets can be used to search the collections created in “Manage Search”. S&B can be used to search a single collection, whereas Search Center can be used to search all the collections across different scopes.
The Search and Indexing API (SIAPI) can be used to create your own customized search client and to add more advanced search options. The Web Content Management Search component can be used to search Web Content Management content, allowing for a custom search interface that uses the metadata. It also enables searching by authoring template or author, for example, and custom formatting of the results.
2 Search services
Search services represent separate instances of the search engine provided by WebSphere Portal. When you create a search collection, you must select a search service. That search service will be used to perform searches that users request on that collection and can be used for searching multiple search collections.
You can set parameters to configure a Portal Search Service, allowing you to set up separate instances of search services with different configurations. You can also set up multiple Portal Search Services and thereby distribute the search load over several nodes.
The search services described below are provided by WebSphere Portal by default:
Portal search service
Select the Portal search service to manage search collections that contain portal pages, content managed by Web Content Management, or indexed Web pages. Note that, for a clustered portal environment, you must set up a remote search service. For details about how to do this, refer to the Portal Search topics in the WebSphere Portal Information Center.
NOTE: The HTTP crawler of the Portal Search Service does not support JavaScript. Text that is generated by JavaScript might not be available for search.
Content Model search service
Select the Content Model search service to manage the search collection that contains content stored on the Java Content Repository (JCR). At this time the Content Model search service has only one search collection, which is provided with the portal installation by default.
This collection is used by Search Center to access the Portal Document Manager documents from the JCR index. You cannot modify this default Content Model search collection or create additional search collections under the Content Model search service.
Search scopes
Search scopes restrict a search to a subset of the indexed content. Administrators can view and manage search scopes and custom links. The search scopes are displayed to users as search options in the drop-down list of the search box in the banner and in the Search Center portlet. Users can select the scope relevant for their search queries.
You can configure scopes by using one of the following:
· One or more search locations (content sources)
· Document features or characteristics, such as the document type
WebSphere Portal ships with three scopes:
· All Sources. This includes documents with all features from all content sources in a user’s search.
· Managed Web Content. This restricts the search to sites that were created by Web Content Management.
· Library Documents. This restricts the search to documents that are available in Portal Document Management libraries that portal search can access.
You can add your own custom search scopes, and you can add an icon to each scope. Users will see this icon for the scope in the pull-down selection list of scopes.
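To make the two configuration dimensions concrete, here is a small sketch that models a scope as an optional set of content sources plus an optional set of document types. All class names, field names, and sample values are hypothetical; the sketch illustrates the filtering idea only and does not reflect the Portal data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Document:
    url: str
    content_source: str
    doc_type: str


@dataclass(frozen=True)
class SearchScope:
    name: str
    # None means "no restriction" on that dimension.
    content_sources: Optional[FrozenSet[str]] = None
    doc_types: Optional[FrozenSet[str]] = None

    def includes(self, doc: Document) -> bool:
        source_ok = (self.content_sources is None
                     or doc.content_source in self.content_sources)
        type_ok = self.doc_types is None or doc.doc_type in self.doc_types
        return source_ok and type_ok


def filter_by_scope(docs: List[Document], scope: SearchScope) -> List[Document]:
    """Keep only the documents that fall inside the given scope."""
    return [doc for doc in docs if scope.includes(doc)]


# Hypothetical sample data.
DOCS = [
    Document("/wcm/news.html", "wcm-site", "html"),
    Document("/wcm/report.pdf", "wcm-site", "pdf"),
    Document("/portal/home", "portal-site", "html"),
]
ALL_SOURCES = SearchScope("All Sources")
WCM_ONLY = SearchScope("Managed Web Content",
                       content_sources=frozenset({"wcm-site"}))
PDF_ONLY = SearchScope("PDF only (custom)", doc_types=frozenset({"pdf"}))
```

In this model, a scope with no restrictions behaves like All Sources, while a custom scope narrows the result set by source, by document type, or by both.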
3 Best practices
A search crawler can affect the performance of the server, depending on how frequently it crawls and on the nature of the content itself.
What client interface to use
Depending on what content is being crawled, you can elect to use the client that best fits the search requirement:
· Search Center. This ships with WebSphere Portal and can be used when searching collections that span multiple scopes. The output of the results will be the same across all collections and cannot be customized.
· Search & Browse. This can be used to search any single collection.
· WCM Search Component. This is the preferred mode of search when searching WCM collections. The interface can be used to customize the search queries and the results that are displayed from the metadata.
· SIAPI. You can develop custom search interfaces, using SIAPI to customize the search queries and results.
When to schedule crawls
The search crawler indexes the content by making HTTP requests to the server. These requests can increase the server load and affect the performance, so it’s advisable to run the crawler at non-peak hours.
If the content is not changing frequently, space out the crawler schedule. For example, a Web Content Management content source is created with a default interval of 4 hours between crawls; however, if the content is not changing that frequently, the crawls can be scheduled once a day, to help reduce the load on the server.
How many parallel threads to use
This determines the number of threads the crawler uses in a crawling session. Increasing this number can help speed up the crawler process, at the cost of some resources.
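The trade-off can be illustrated with a generic thread pool. In this sketch, `fetch` stands in for whatever retrieves a document, and `num_threads` plays the role of the parallel-threads setting; both names are hypothetical and the code is not the crawler's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def crawl_in_parallel(urls: List[str],
                      fetch: Callable[[str], str],
                      num_threads: int) -> Dict[str, str]:
    """Fetch every URL using at most `num_threads` worker threads.

    More threads shorten the wall-clock time of a crawl, but each thread
    issues requests concurrently and so adds load to the crawled server.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input URLs.
        bodies = list(pool.map(fetch, urls))
    return dict(zip(urls, bodies))
```

Calling `crawl_in_parallel(urls, fetch, 1)` degenerates to a sequential crawl, which is the gentlest setting for the server and the slowest for the crawler.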
Level of links to follow while crawling Web sites
This determines the crawling depth, which is the maximum number of levels of nested links that the crawler will follow from the root URL. The greater this number, the more nested content that is being fetched.
Performance issues can occur if the crawler hangs on deeply nested fetches. Adjust this value to find an optimum that balances search relevance against crawler performance.
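The effect of the depth setting can be sketched as a breadth-first traversal that stops following links once the configured level is reached. The link graph here is an in-memory dictionary standing in for real pages, and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List


def crawl_to_depth(root: str,
                   links: Dict[str, List[str]],
                   max_depth: int) -> List[str]:
    """Return the URLs reached from `root` by following at most `max_depth`
    levels of nested links (the root itself is level 0)."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links found at the deepest allowed level
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical site structure used only for illustration.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
}
```

With this sample site, a depth of 1 reaches only "/", "/a", and "/b"; each additional level pulls in one more layer of nested pages, which is where the extra crawl time comes from.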
Setting the maximum file sizes to fetch
Very large files fetched by the crawler can fragment the Java Virtual Machine (JVM) heap and cause the server to crash. You can limit the size of files that the crawler fetches by setting these service properties:
· HTTP_NON_APPL_MAX_BODY_SIZE_MB
· HTTP_MAX_BODY_SIZE_MB
Also, you can limit the size of the content fetched by seedlist crawls by updating the property HTTP_MAX_SEEDLIST_SIZE_MB, which determines the amount of space that is reserved for listing portal site resources or managed Web content resources.
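The following sketch shows one way a size check driven by these properties could work. It assumes, based only on the property names, that HTTP_MAX_BODY_SIZE_MB applies to "application/*" content types and HTTP_NON_APPL_MAX_BODY_SIZE_MB to everything else; confirm the exact semantics in the product documentation. The numeric values are hypothetical, not product defaults.

```python
from typing import Dict

BYTES_PER_MB = 1024 * 1024

# Hypothetical values for illustration; these are NOT the product defaults.
SERVICE_PROPS: Dict[str, float] = {
    "HTTP_MAX_BODY_SIZE_MB": 20.0,
    "HTTP_NON_APPL_MAX_BODY_SIZE_MB": 1.0,
}


def max_body_bytes(content_type: str, props: Dict[str, float]) -> int:
    """Pick the limit that applies to a content type (assumed mapping)."""
    if content_type.startswith("application/"):
        key = "HTTP_MAX_BODY_SIZE_MB"
    else:
        key = "HTTP_NON_APPL_MAX_BODY_SIZE_MB"
    return int(props[key] * BYTES_PER_MB)


def should_fetch(content_type: str, size_bytes: int,
                 props: Dict[str, float]) -> bool:
    """Return True if a document of this type and size is within the limit."""
    return size_bytes <= max_body_bytes(content_type, props)
```

Under these sample values, a 5 MB PDF would be fetched while a 5 MB HTML page would be skipped, which keeps unusually large text pages out of the heap.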
Search in a Portal Cluster setup
When running WebSphere Portal in a cluster, make sure to use remote search, which delegates the search workload to the remote server. Remote search also helps maintain the collection information at one central location that is accessible by all nodes in the cluster (see figure 1).
Figure 1. Remote search request flow
[Image: portal_remote_search.bmp]
NOTE: It is a myth that, when setting up remote search, you must have a shared drive accessible by all portal nodes. In fact, there is no need for a shared drive; the collection directory is physically located on the remote server, and search requests are delegated through SOAP or EJB calls that access the remote collection and send back the results.
Virtual portals
Portal Search Engine search services and search collections are scoped internally for virtual portals. Any collection created in the main portal is not accessible from other virtual portals. Each virtual portal must have its own collections; they cannot be shared with other virtual portals.
Backup and recovery
Collections often get corrupted when a server is not gracefully shut down during a crawl. As a best practice, make sure to back up all the collections after any configuration changes are made. Starting in WebSphere Portal 6.0.1.3, search collections are automatically backed up under the /collections_config_backup/ directory. This location can be customized by setting RECOVERY_BACKUP_LOCATION to the desired location; setting RECOVERABLE_INDEX=true helps prevent the collections from getting corrupted.
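For the manual backup recommended above, a simple file-level copy of the collections directory is one option. The sketch below is a generic directory copy with hypothetical paths and naming; it assumes that the collections live in an ordinary directory and that no crawl is writing to it while the copy runs. It is not a product-supplied backup tool.

```python
import shutil
from pathlib import Path


def backup_collections(collections_dir: Path,
                       backup_root: Path,
                       stamp: str) -> Path:
    """Copy the whole collections directory to backup_root/collections-<stamp>.

    `stamp` is any label that makes the backup unique, for example a date.
    The copy fails if the target already exists, so an earlier backup is
    never silently overwritten.
    """
    target = backup_root / f"collections-{stamp}"
    shutil.copytree(collections_dir, target)
    return target
```

A call such as `backup_collections(Path("/path/to/collections"), Path("/path/to/backups"), "20090701")` uses placeholder paths; substitute the directory configured for your search service.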
4 Conclusion
You should now have a good understanding of all the features and options provided by Portal Search and the best practices for implementing search functionality for your Web sites.
5 Resources
developerWorks WebSphere Portal zone:
http://www.ibm.com/developerworks/websphere/zones/portal/
WebSphere Portal and Lotus Web Content Management product documentation:
http://www.ibm.com/developerworks/websphere/zones/portal/proddoc.html
“Integrating IBM WebSphere Portal Search with IBM Workplace Web Content Management for version 6”:
http://www.ibm.com/developerworks/lotus/library/wcm-search/
IBM Search and Indexing API (SIAPI):
http://www.ibm.com/developerworks/websphere/library/specs/0511_portal-siapi/0511_portal-siapi.html
“Making content searchable anywhere using IBM WebSphere Portal's publishing Seedlist Framework”:
http://www.ibm.com/developerworks/websphere/zones/portal/proddoc/dw-w-seedlist/
“High availability options for IBM WebSphere Portal 6.1 search”:
http://www.ibm.com/developerworks/websphere/zones/portal/proddoc/dw-w-portalsearch/
6 About the author
Anuradha Chitta is an Advisory Software Engineer working with the Web Content Management team at IBM’s Pune, India, facility. She was a Team Lead for the Portal search component support in IBM US, and worked extensively with Portal/WCM Search administration, configuration, and integration issues before relocating to IBM India. Anu holds a Master’s degree in Computer Science from LSU, and is an IBM Certified WebSphere ND 6.1 and Portal V6.0 System Administrator.