Users familiar with managing search engines and crawlers are tempted to check whether crawler filtering, mostly based on URL patterns, is available when configuring the crawler. Although this contradicts the notion behind the seedlist architecture, it is a path some have taken (with or without success). For good reason, however, this was only an option up to Portal Search V7 and is no longer available in V8. The same also holds true for users integrating with IBM OmniFind, now IBM Content Analytics Enterprise Search (ICA-ES).
Another option had been to intercept the incoming seedlist data stream and purge unwanted entries based on further criteria, e.g. metadata field values.
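The interception idea can be illustrated with a minimal sketch. It assumes a simplified Atom-style feed; the metadata namespace `urn:example:seedlist-meta`, the `field` element, and the helper names `purge_entries` and `entry_titles` are hypothetical and chosen for illustration only. The actual seedlist schema may look different.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Hypothetical namespace for metadata fields; not the real seedlist schema.
META_NS = "urn:example:seedlist-meta"

# Made-up sample feed with two entries carrying a 'status' metadata field.
SAMPLE_SEEDLIST = f"""<feed xmlns="{ATOM_NS}" xmlns:m="{META_NS}">
  <entry><title>Alice</title><m:field id="status">active</m:field></entry>
  <entry><title>Bob</title><m:field id="status">inactive</m:field></entry>
</feed>"""


def purge_entries(seedlist_xml, field_id, unwanted_value):
    """Return the parsed feed with every entry removed whose metadata
    field `field_id` equals `unwanted_value`."""
    root = ET.fromstring(seedlist_xml)
    # findall() returns a list, so removing entries while looping is safe.
    for entry in root.findall(f"{{{ATOM_NS}}}entry"):
        for field in entry.findall(f"{{{META_NS}}}field"):
            text = (field.text or "").strip()
            if field.get("id") == field_id and text == unwanted_value:
                root.remove(entry)
                break
    return root


def entry_titles(feed):
    """Collect the titles of all entries left in the feed."""
    return [e.findtext(f"{{{ATOM_NS}}}title")
            for e in feed.findall(f"{{{ATOM_NS}}}entry")]


filtered = purge_entries(SAMPLE_SEEDLIST, "status", "inactive")
print(entry_titles(filtered))
```

The purge happens before the search engine sees the data, so the unwanted entries never reach the index.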
And finally, as the third option: consume everything the seedlist delivers, but define a default Search Scope to filter out unwanted search result entries.
An example is the seedlist produced by IBM Connections Profiles, which contained both active and inactive Person entries. In most cases you only want to find active people, so a Search Scope could be added that only matches entries whose user status is 'active'.
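Conceptually, such a scope acts as a filter applied at query time rather than at crawl time. The following sketch only models that idea with plain dictionaries; the function name `apply_search_scope`, the field name `status`, and the sample data are assumptions for illustration, not part of the Portal Search API.

```python
def apply_search_scope(results, field, allowed_value):
    """Keep only result entries whose `field` equals `allowed_value`.

    Everything stays in the index; the scope merely hides entries
    that do not match when results are returned.
    """
    return [r for r in results if r.get(field) == allowed_value]


# Made-up result entries as they might come back from a Profiles index.
all_results = [
    {"name": "Alice", "status": "active"},
    {"name": "Bob", "status": "inactive"},
    {"name": "Carol", "status": "active"},
]

visible = apply_search_scope(all_results, "status", "active")
print([r["name"] for r in visible])
```

Unlike purging entries from the seedlist, this approach leaves the inactive entries indexed, so a different scope could still expose them.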
As indicated above, the crawler filter capability was dropped in V8 and replaced with more elegant options:
- Portlets can be filtered by defining a special property setting
- Web content management content can be marked to be searchable or not
“INCLUDE_IN_SEARCH_INDEX” is a portlet property setting with the following available values:
- 'false' - the portlet will not be published in the seedlist
- 'true' - the portlet will be considered for indexing - this is the default
Note: the WCM Web Content Viewer portlet is set to “INCLUDE_IN_SEARCH_INDEX=false”. In the past, without this filter option, the portlet caused duplicate result list entries: one from processing the portlet with the default content, and a second from the WCM library, which delivered the default content item once more.
Note: if the property is omitted (this is the default), that portlet will be indexed by the search engine.
Final note: this property is only applicable to portlets and is not available for Portal pages. If none of the portlets on a Portal page is indexed, that page will not appear in any search result.
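The rules from the notes above can be summarized in a small sketch. It models portlet properties as plain dictionaries; the function names are hypothetical, and treating any value other than 'false' as "include" is an assumption of this sketch, since only 'true', 'false', and the omitted case are described here.

```python
PROPERTY_NAME = "INCLUDE_IN_SEARCH_INDEX"


def portlet_is_indexed(properties):
    """A portlet is published in the seedlist unless the property is 'false'.

    An omitted property behaves like 'true' (the default).
    Assumption: any value other than 'false' also counts as 'true'.
    """
    return properties.get(PROPERTY_NAME, "true") != "false"


def page_appears_in_search(portlets_on_page):
    """A page can only show up in results if at least one portlet is indexed."""
    return any(portlet_is_indexed(p) for p in portlets_on_page)


# Made-up pages: each page is a list of portlet property dictionaries.
mixed_page = [{PROPERTY_NAME: "false"}, {}]
hidden_page = [{PROPERTY_NAME: "false"}, {PROPERTY_NAME: "false"}]

print(page_appears_in_search(mixed_page))
print(page_appears_in_search(hidden_page))
```

Setting the property to 'false' on every portlet of a page is therefore the only way, in this model, to keep the page itself out of the search results.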
Web content management seedlist filter
Web content management delivers a sample Authoring template out of the box. In its 'hidden fields' section it provides a checkbox to specify whether the content should be searchable or not.
The checkbox has the title "Search Collection Visibility" and is selected by default. Thus any content created with this unmodified Authoring template will be published through the WCM seedlist and is available for processing by the search engine.
A second option is to create a copy of this sample Authoring template. In this copy, deselect the "Search Collection Visibility" checkbox. Finally, give the Authoring template a meaningful name, so that when editors create content they can choose the right Authoring template based on whether that content should be searchable or not.
Web content management seedlist filter - file attachments
The above filters apply to content items only. For file attachments and the filter options available for them, see the following article: Seedlist filter for WCM to exclude attachments based on file extension
Even though WebSphere Portal V8 dropped the option to define crawler filters for seedlist crawling, a more elegant solution is now available to filter content from the seedlist. For portlets, this is the portlet property setting "INCLUDE_IN_SEARCH_INDEX" with the self-explanatory values 'true' and 'false'. For web content management content, it is an attribute associated with the content item itself and available with the out-of-the-box sample Authoring template.