This page covers various topics related to search. There are two primary areas of discussion. The first covers the challenges and possible solutions for having content in the portal successfully show up on Internet search engines. The second is on integrating search engine capabilities and results within a Portal-based Web site.
Search engine optimization
In the past, one of the challenges for external facing Portals was ensuring the public pages appear as expected in the search results of Internet search services such as Google or Yahoo. Some optimization of the Portal is required to achieve this.
Overview of Search Engine Optimization
Search engine optimization (SEO) is defined in Wikipedia as the process of improving the volume or quality of traffic to a web site from search engines via "natural" or un-paid ("organic" or "algorithmic") search results as opposed to search engine marketing (SEM) which deals with paid inclusion.
Getting key marketing Web sites to appear in search engine results has become a mature sub-discipline of Web site design, so there are many documents, tools, and search engine optimization (SEO) sites to help sites do well in search engines.
This large amount of information is very helpful, but it can also complicate things, especially if you are starting from scratch and just want to know how to optimize an existing site.
If you have an accessible and usable site, you are already on your way. If not, this quick overview will help by providing a list of SEO techniques taken from some of the best online resources, combined with advice about how to implement them in WebSphere Portal.
Definition of terms
The following terms are important to understand. These definitions are from the following article:
http://www.ibm.com/developerworks/web/library/wa-seo1.html
Directory - A directory is a human-compiled search. Most directories rely on submissions instead of spiders.
Keywords, keyterms, and keyphrases - The words you want your Web site to rank well for in search engine results pages (SERPs).
Link farm - In SEO, a link farm is a page full of links that have very little to do with each other and exist just as links without any real context. People use link farms to increase the number of links to a page in hopes of fooling a search engine, such as Google, into thinking the page is more link-worthy than it actually is.
Organic listings - Organic listings are the free listings in the SERPs. SEO for organic listings usually involves improving the actual content of your Web site, often at the page or infrastructure level.
PageRank - PageRank is a measurement that the Google-obsessed use to test their rankings in Google. SEO and search engine marketing (SEM) professionals also use the term to describe your ranking in the SERPs and the ranking algorithm points given to your site by Google. No matter how you define it, PageRank is an important part of your SEO success.
Paid listing - Like the name implies, paid listings are paid for in search engines. Depending on the search engine, a paid listing can mean paying for inclusion in the index, pay per click (PPC), a sponsored link, or other ways of making your site show up in the SERPs for targeted keywords and phrases.
Ranking - A ranking is where your page is listed in the SERPs for your targeted keywords. The goal of SEO is high rankings for the keywords that your Web pages target.
Ranking algorithm - A ranking algorithm is the set of rules that a search engine uses to evaluate and rank the listings in its index. The ranking algorithm is what determines which results are relevant to a specific query.
Search engine optimization (SEO) - SEO involves creating Web pages that are picked up by the search engines through optimizing your content for search engine attractiveness and visibility. SEO is mostly used to increase the rankings of your organic listings. We'll use the term SEO to describe the techniques I recommend, although many of these techniques also fall under the umbrella of SEM.
Search engine results page (SERP) - SERPs are the listings, or results, displayed for a particular search. SERP is sometimes defined as search engine results placement. For the purposes of this article, we'll refer to it as a page rather than a placement. In the world of SEO, a good showing in the SERPs is what it's all about.
Spamming - Spamming is a method of SEO that attempts to trick a spider and scam loopholes in the ranking algorithm to influence rankings for targeted keywords. Spamming can take many forms, but the most simple definition for spam is any technique a Web site uses to misrepresent itself and influence ranking. The two methods of SEO are based on whether you want to spam or not.
- Black hat SEO: Spamming the search engines. Black hat SEO is lying, cheating, and stealing your way to the top of the SERPs.
- White hat SEO: Optimizing your site so it serves the user, as well as attracts spiders. In white hat SEO, anything that leads to a good user experience is considered also good for SEO.
Spider - A spider crawls through the Web looking for listings to add to a search engine index. It is sometimes referred to as a Webcrawler, robot, or bot. When optimizing your page for organic listings, you are catering to the spider.
How search engines work
This information is based on content from
http://www.seomoz.org/files/articles/beginners-guide-to-search-engine-optimization.sxw.
Search engines collect information about the content published on the Web and build a huge database that relates pages to the terms they contain through a four-step process:
1. Crawling the Web
Search engines run scheduled processes called "bots" or "spiders" that use the hyperlinks found on the Web to "crawl" pages and documents.
2. Indexing Documents
Once a page has been crawled, its contents are analyzed and keywords are extracted to build a database of documents that makes up a search engine's "index".
3. Processing Queries
When a request for information comes into the search engine, the engine retrieves from its index all the documents that match the query.
4. Ranking Results
Once the search engine has determined which results are a match for the query, an algorithm runs calculations on each of the results to determine which is most relevant to the given query.
They sort these on the results pages in order from most relevant to least so that users can make a choice about which to select.
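The four steps can be illustrated with a toy example. The following is a minimal Python sketch, not how any real search engine is implemented: the crawl step is replaced by a hard-coded dictionary of pages, and the function names `build_index` and `search`, the sample pages, and the ranking rule (number of query term occurrences) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(pages):
    """Indexing: map each lowercase term to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, pages, query):
    """Query processing and ranking: return URLs that contain every query term,
    ordered by how often the terms occur (ties broken alphabetically)."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))

    def score(url):
        words = pages[url].lower().split()
        return sum(words.count(term) for term in terms)

    return sorted(matches, key=lambda url: (-score(url), url))

# Stand-in for the crawl step: hypothetical pages already fetched by a spider.
pages = {
    "/holidays": "cheap holidays and more holidays in spain",
    "/guides": "destination guides for spain",
    "/contact": "contact our travel team",
}
index = build_index(pages)
print(search(index, pages, "holidays spain"))
```

Real ranking algorithms weigh many more signals than term counts, link analysis among them.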
More on Google
Google is one of the most popular search engines. Understanding a bit more about Google can be very helpful in SEO.
Google's search index accounts for a majority of all search-related traffic, so starting by optimizing your site for Google can make a lot of sense. Google ranks sites by link analysis; if other sites do not lead Google to your site to be indexed, Google might never give you a high ranking.
Google optimization basics
The key to ranking well in Google is optimizing the visible keywords on a page.
A successful keyword strategy has two steps:
- Keyword selection: Determine which words your potential audience might use to search for your page and create keywords based on those words.
- Keyword optimization: Apply these keywords to the appropriate pages (3 - 5 keywords per page is the recommended amount) and optimize them from the top left, and then down. Often this will be the first 200 words on your page -- title tag, headings, abstract, and such.
Users will initially view your Web site the same way the spider does, so emphasizing the keywords from the top-left-down is a good Web design practice as well.
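As a rough way to check keyword placement, the sketch below tests whether each target phrase occurs within the first 200 words of a page's visible text. It is only an illustration under simplifying assumptions: the function name `keywords_in_lead` and the sample text are hypothetical, the input is plain text rather than HTML, and a simple substring match stands in for real keyword analysis.

```python
def keywords_in_lead(text, keywords, lead_words=200):
    """Report, per keyword phrase, whether it occurs in the first `lead_words` words."""
    lead = " ".join(text.lower().split()[:lead_words])
    return {keyword: keyword.lower() in lead for keyword in keywords}

# Hypothetical page text: one phrase near the top, one far below the first 200 words.
page_text = "Spain holidays from our travel experts. " + "filler " * 300 + "city breaks"
report = keywords_in_lead(page_text, ["spain holidays", "city breaks"])
print(report)
```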
Other factors that will affect your ranking
Organic SEO goes beyond a good keyword optimization strategy. This table presents other optimization techniques extracted from Google's Search Engine Optimization Starter Guide Version 1.1 (http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf):
SEO goals and tasks | Portal tasks |
Create unique, accurate page titles:
- Accurately describe the page's content
- Create unique title tags for each page
- Use brief, but descriptive titles
|
|
Make use of the "description" meta tag
- Accurately summarize the page's content
- Use unique descriptions for each page
| Use Portal pages metadata to allow end users to enter meaningful description meta tags |
Improve the structure of your URLs
- Use words in URLs
- Create a simple directory structure
- Provide one version of a URL to reach a document
| Use friendly URLs and URLs mappings to create semantic URLs to your pages |
Make your site easier to navigate:
- Create a naturally flowing hierarchy
- Use mostly text for navigation
- Use "breadcrumb" navigation
| Keep your page hierarchy as simple as possible in Portal
Incorporate the breadcrumb component to your themes |
Offer quality content and services
| Use portlets and WCM to create dynamic content and keep it updated.
Define validation rules for content elements in WCM to ensure that content is properly entered in the system. |
Write better anchor text
| Use validation rules in WCM to ensure link text is relevant.
Provide contextual help in authoring templates to remind content contributors of the importance of writing relevant link text. |
Use heading tags appropriately:
- Imagine you're writing an outline
- Use headings sparingly across the page
| Define guidelines for using heading tags in portlets and WCM templates and share them with your developers |
Optimize your use of images:
- Use brief, but descriptive filenames and alt text
- Supply alt text when using images as links
- Store images in a directory of their own
- Use commonly supported filetypes
| Use validation rules in WCM to validate the use of alt text in images and restrict the file types for images.
Use Ephox accessibility functions to ensure that images and media content include alt text. |
Make use of free webmaster tools:
- See which parts of a site Googlebot had problems crawling
- Upload an XML Sitemap file
- Analyze and generate robots.txt files
- Identify issues with title and description meta tags
- Understand the top searches used to reach a site
- Get a glimpse at how Googlebot sees pages
- Remove unwanted sitelinks that Google may use in results
- Receive notification of quality guideline violations and file for a site reconsideration
| Follow the instructions provided in this chapter to configure and use the Portal sitemap portlet
Review your portal site with external tools to be sure that everything is indexed as expected |
Sitemaps
The following information is based on content from
http://www.sitemaps.org/
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling.
In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata.
Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
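To make the file format concrete, here is a minimal Python sketch that writes a Sitemap in the form described at sitemaps.org. The element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the Sitemap protocol; the function name `build_sitemap` and the example URLs and values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return a Sitemap XML string. Each entry needs 'loc'; the other fields are optional."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical site with two pages.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2010-01-01",
     "changefreq": "daily", "priority": 0.8},
    {"loc": "http://www.example.com/guides"},
])
print(sitemap_xml)
```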
WebSphere Portal provides a sitemap portlet to assist in meeting the best practices recommended by most of the Internet search engines. You can find further information about the Portal sitemap portlet in the Sitemap Portlet section of this chapter.
How Internet search engines work
When requesting that an Internet search service includes results from a Web site, typically the Web site administrator will specify the URL of the home page or a site map and the search service will crawl or traverse the hyperlinks, indexing information from each page. Typically only the hyperlinks are followed and information stored in JavaScript functions or HTML meta data is ignored by the search engine crawler. The search engine crawler is sometimes referred to as a ‘robot’.
Challenges with portal crawlability
When the Internet search crawlers encounter a Portal site, the nature of the Portal URLs is not what they are expecting. The problem is that Portal encodes information into the URL called ‘navigational state’ as a string of encrypted characters. The navigational state contains information about the state of the portal, for example, current page and theme template in use. It also contains information about portlet state, for example, portlet mode (edit, view, help), window state (minimized, maximized), and render parameters. The main reason for including all this information in the URL is to support bookmarking of Portal pages, maintaining the exact layout and view information at that time. The exact problem for search engine crawlers is illustrated by the following figure.
The URL to any page contains information about the source page, so in the illustration URL-A and URL-D point to the same page, yet they have different URLs. This means that a Portal with, say, one hundred pages can actually involve more like one thousand unique URLs. Search engine crawlers expect that a URL to a page will be unique. When crawling Portal, the crawler will typically encounter more unique URLs than it deems reasonable, give up, and terminate the crawl. Depending on the search engine, this results in none or only some of the Portal pages being included in the search engine’s index, meaning the Portal site is difficult to find in Internet search results.
Portal crawler awareness and normalized URLs
To address the challenges of crawling, Portal is now ‘crawler aware’. When a crawler visits a Web site, it identifies itself using a ‘user agent’ string in the HTTP header, just as Web browsers do, enabling the Web site to determine the browser type. Portal is pre-configured to recognize around fifty common search engines and more can be added.
When Portal recognizes that it is being visited by a crawler, it automatically adjusts the URLs for each page to be ‘normalized’. This means that Portal filters out most of the information normally stored in the navigational state, and only the mandatory information required to display the page remains. With the normalized URLs, if there are links to the same page from different source pages, they will now be the same. If the crawler encounters a link to a page for a second time, it will now be able to identify that the page has already been visited and move on to the next one. Additionally, Portal removes any ‘action URLs’ from the pages when being crawled. This ensures there are no HTTP POST actions that could perform potentially undesirable operations, such as a link saying "Delete this document". Even though the normalized URLs may still contain some navigational state information and be quite lengthy, Internet search engines have no problem crawling this type of URL.
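The idea behind crawler awareness and URL normalization can be sketched in a few lines of Python. This is not Portal's implementation: real navigational state is carried as a string of characters in the URL path, whereas this sketch models it as ordinary query parameters to keep the example readable. The names `is_crawler`, `normalize_url`, `MANDATORY_PARAMS` and the parameter names are hypothetical; the user agent tokens are a small illustrative subset.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative subset of crawler user agent tokens (Portal ships its own list).
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")

# Hypothetical: the only state needed to display a page in this model.
MANDATORY_PARAMS = {"page"}

def is_crawler(user_agent):
    """Detect a crawler from the User-Agent header value."""
    lowered = user_agent.lower()
    return any(token in lowered for token in CRAWLER_TOKENS)

def normalize_url(url):
    """Drop all state except the mandatory parameters, in a stable order."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in MANDATORY_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# Two links to the same page, reached from different source pages.
url_a = "http://example.com/wps/portal?page=offers&source=home&mode=view"
url_d = "http://example.com/wps/portal?source=guides&page=offers"
print(normalize_url(url_a) == normalize_url(url_d))
```

Once both links normalize to the same string, a crawler that tracks visited URLs sees one page instead of two.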
Sitemap portlet
The sitemap portlet is provided by Portal to assist in meeting the best practices recommended by most of the Internet search engines. They suggest either pointing the crawler directly at a site map, or at least having a link to the site map somewhere on the Web site’s home page. In this way, the crawler will be sure to index the most important pages before it reaches the finite limit of pages that most search engines will adhere to. The best practice for using Portal’s site map portlet would be to place a link to it near the top of the page, for example inside the Portal theme.
As the generated site map is likely to contain all the pages and content that should be crawled, additional information called robots directives can be added to a page instructing the crawler not to crawl anything more than the site map page. A robots directive is meta data included in the HTML, or in a special robots.txt file, that tells the crawler whether crawling sections of the site is allowed or disallowed. The semantics of robots directives rely on URLs describing the site in a structured way. For example, there might be an instruction to allow /home/public/* but have an exception by disallowing /home/public/employees. Portal URLs are less well structured: although they can start in a ‘friendly’ structured way (for bookmarking), after further links are clicked the URLs become more complex and unstructured. For this reason, the best practice for using robots directives with Portal is to place them as HTML meta elements in the theme, with logic to allow or disallow individual pages. For example, if the Portal has a site map, the theme would output this for the site map page:
<meta name="robots" content="noindex,follow">
This would ensure the crawler followed all links from the site map page, but did not include the actual sitemap page in the results. For all other pages, the theme would output:
<meta name="robots" content="index,nofollow">
This would ensure the pages were added to the index, but links not followed, as the crawler will have already done this via the sitemap page instead.
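For comparison, the path-based allow and disallow semantics of robots.txt mentioned earlier (allow /home/public/ with an exception for /home/public/employees) can be tried out with Python's standard `urllib.robotparser` module. The robots.txt content below is a made-up example for a conventionally structured site, not something Portal generates, and the paths are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. Python's parser applies the first matching rule,
# so the more specific exception is listed before the broader allow.
ROBOTS_TXT = """\
User-agent: *
Disallow: /home/public/employees
Allow: /home/public/
Disallow: /
"""

def make_parser(robots_txt):
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = make_parser(ROBOTS_TXT)
for path in ("/home/public/offers", "/home/public/employees/list", "/private"):
    print(path, parser.can_fetch("Googlebot", path))
```

Rules like these only work when URL paths mirror the site structure, which is why the theme-based meta elements are preferred for Portal.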
External crawlability of Portal with WCM content
If the Portal includes WCM content, extra consideration is required to ensure all content is indexed and the results show the content in the correct context, that is, on the correct Portal page. If the relationship between content and pages is one to one, that is, each page shows exactly one piece of content, the sitemap portlet and robots directive approach detailed previously will be very effective. However, if the page includes components to select further content (such as a WCM navigator), the sitemap approach with the robots directives described previously will not work well.
When a WCM navigator is used with a content rendering portlet, the URL to the Portal page contains a request parameter that the rendering portlet uses to determine which piece of content to display. To index all content referenced by the WCM Navigator, the crawler would be visiting the same page many times, each time with a different WCM request parameter. The robot directives discussed in the previous section would prevent this and should therefore not be used in this scenario.
If using a WCM navigator or any other component that relies on the WCM request parameter, it is necessary to reconfigure the way Portal presents normalized URLs. As discussed previously, when Portal detects it is being visited by a search engine crawler, it adjusts the links in every page to remove navigational state, and links that could trigger actions. It also removes request parameters and without them the WCM rendering portlet would simply show its default content. To counter this problem, the extent of URL normalization can be customized; this will be demonstrated in the case study later on this page. More information can be found in the Portal InfoCenter:
http://publib.boulder.ibm.com/infocenter/wpdoc/v6r0/topic/com.ibm.wp.zos.doc/wps/srvcfgref.html#srvcfgref__state_manager
Web 2.0 and search
Since version 6.1, Portal has provided a client side aggregation theme called PortalWeb2. Because of the significant use of JavaScript in this theme, the crawlability of the Portal will be compromised. Please see the Web 2.0 chapter of this Redwiki for more information.
Case study using the sitemap portlet
The following figures containing example screens illustrate the steps to configure and add the sitemap portlet and robots directive to an externally facing Portal.
The sitemap portlet lists links up to a maximum figure per page and then moves to the next page. The maximum links value should be configured according to the recommendations of Internet search engines, and according to the typical number of links on the Portal’s pages. Typically, this will be between seventy-five and two hundred. The value can be configured as follows:
Use Portlet Management->Portlets and search for wps.p.Sitemap. Click the wrench icon to configure the parameter:
In a default Portal 6.1.5 install, the sitemap can be found on the site map page. It looks like this:
If it is acceptable to make the site map visible to users via the Portal’s navigation, the sitemap portlet can be placed on any public page and this would be the crawl starting point for the Internet search engines. If, however, it is preferable to hide the site map from the navigation, there are two similar options:
- Place the sitemap portlet on a separate page, and omit that page from the portal navigation. This specific page should be used as the crawler starting URL.
- Place a link to the hidden sitemap page in the theme of the portal home page. Now the regular Portal homepage can be used as the crawler starting URL.
For this case study, both methods will be illustrated.
Adding sitemap portlet to a page outside of the default navigation
Use the manage pages portlet to create a new page under the "Context Root":
Assign the new page a unique name and friendly URL name:
Add the sitemap portlet to the new page by clicking the pencil icon, searching for portlet "wps.p.Sitemap" and placing it on the page.
Edit the access permissions for the page to allow anonymous access:
Now edit the access permissions for the sitemap portlet so it can be viewed by anonymous users. Locate the sitemap portlet using Portlet Management->Portlets and search for "wps.p.Sitemap". Click the key icon and permit anonymous users to perform the "User" role:
If you exit the administration pages and click Portal home, the Travel Site Map page will be visible in the navigation:
To hide the page from the navigation, a property of the page must be amended. This is only possible by editing the page’s XML properties. Use the manage pages portlet to locate the Travel Site Map page, click the export button and save the XML definition to the filesystem:
Edit the exported XML file and add a new piece of meta data that will instruct Portal to omit this page from the navigation (you will see existing <parameter> elements for other properties):
<parameter name="com.ibm.portal.Hidden" type="string" update="set"><![CDATA[true]]></parameter>
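If the edit needs to be repeated for several pages, it could be scripted. The following Python sketch adds the parameter to an exported definition using the standard library. It rests on assumptions that should be checked against a real export: that each page is a `content-node` element carrying a `uniquename` attribute, and that the new `parameter` should sit next to the existing ones. The function name `hide_page` and the sample XML are hypothetical, and the value is written as plain text rather than a CDATA section, which XML parsers treat as equivalent.

```python
import xml.etree.ElementTree as ET

HIDDEN_PARAM = "com.ibm.portal.Hidden"

def hide_page(xml_text, unique_name):
    """Set the hidden parameter on the page with the given unique name."""
    root = ET.fromstring(xml_text)
    for node in root.iter("content-node"):  # assumed element name for pages
        if node.get("uniquename") != unique_name:
            continue
        params = node.findall("parameter")
        existing = [p for p in params if p.get("name") == HIDDEN_PARAM]
        if existing:
            param = existing[0]
        else:
            param = ET.Element(
                "parameter",
                {"name": HIDDEN_PARAM, "type": "string", "update": "set"},
            )
            # Keep the new element next to the existing <parameter> elements.
            position = list(node).index(params[-1]) + 1 if params else 0
            node.insert(position, param)
        param.text = "true"
    return ET.tostring(root, encoding="unicode")

# Heavily trimmed, hypothetical stand-in for an exported page definition.
SAMPLE_EXPORT = (
    '<request><portal action="locate">'
    '<content-node action="update" uniquename="ibm.portal.Travel Site Map">'
    '<parameter name="other" type="string" update="set">x</parameter>'
    '<access-control/>'
    '</content-node></portal></request>'
)
modified = hide_page(SAMPLE_EXPORT, "ibm.portal.Travel Site Map")
print(modified)
```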
The modified XML file must now be imported using the XMLaccess command line tool, replacing the values in <> with the values for your Portal:
<Portal Server Home>\bin>xmlaccess -url http://localhost:<port>/wps/config -in <modified xml file>
Now the Travel Site Map page no longer appears in the navigation and it can only be viewed via the friendly URL specified when the page was created. In this case, the URL to the page for an anonymous user is http://luxor.hursley.ibm.com:10040/wps/portal/travelsitemap and this URL could be provided to the Internet search engines as the starting point for their crawling.
Note that in a default Portal installation, there are no other pages visible to anonymous users so to test the site map page, add some additional pages under the "Home" label and allow all anonymous users to view them:
Now to an anonymous user, the Travel Site Map page looks like this:
Adding sitemap portlet to the theme
If it is more desirable to use the Portal’s normal home page as the starting point for the Internet search engines to crawl from, a link to the public site map page can be added to the Default.jsp of the theme:
<portal-navigation:urlGeneration contentNode="ibm.portal.Travel Site Map"><a href="<%wpsURL.write(out);%>"></a></portal-navigation:urlGeneration>
This will render an invisible link that the crawler can follow from the Portal web site’s home page. Note the contentNode value relates to the unique name of the site map page made previously. The site map page can be either included or excluded from the Portal navigation.
Configuring robots directives
As discussed previously, it is good practice to use robot directives to force the Internet search crawlers to the site map page, and disallow any further access to the Portal to avoid any unnecessary duplication of page crawling. In Portal, the most effective way of doing this is to add some logic to the theme that emits the following meta data if the current page is the site map:
<meta name="robots" content="noindex,follow">
To do this, create a custom theme if the page currently uses the Portal default. Then edit the theme’s Default.jsp and add the following statement:
<portal-logic:if selection="ibm.portal.Travel Site Map">
<meta name="robots" content="noindex,follow">
</portal-logic:if>
Similar statements should be added for other pages included in the site map:
<portal-logic:if selection="ibm.portal.Vacations Home Page">
<meta name="robots" content="index,nofollow">
</portal-logic:if>
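The decision the theme makes can be summarized in a short Python sketch. It is not theme code, only the same rule written as a function: the site map page gets "noindex,follow" and other crawlable pages get "index,nofollow". The function name `robots_meta` is hypothetical; the unique name is the one used in this case study.

```python
SITE_MAP_PAGE = "ibm.portal.Travel Site Map"

def robots_meta(page_unique_name, sitemap_page=SITE_MAP_PAGE):
    """Return the robots meta element for a page, following the case study rule."""
    if page_unique_name == sitemap_page:
        content = "noindex,follow"   # follow the links, keep the site map out of results
    else:
        content = "index,nofollow"   # index the page, links were already found via the site map
    return f'<meta name="robots" content="{content}">'

print(robots_meta("ibm.portal.Travel Site Map"))
print(robots_meta("ibm.portal.Vacations Home Page"))
```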
Alternatively, for a more dynamic solution the robots information could be stored to each page as Portal meta data, in a similar way to how the site map page was made invisible previously in this case study. Then a different Portal tag can be added to the Default.jsp to extract this information as HTML meta data. As a starting point, this statement would render all meta data:
Reconfiguring normalized URLs to include request parameters
As discussed previously, if the Portal uses WCM components that rely on a rendering request parameter, it is necessary to redefine the default behavior of Portal when it normalizes URLs to ensure the parameters are not removed. Portal ships two XSL transformation files that illustrate the available options for URL normalization. By default, Portal uses the UrlNormalization_MIN.xsl file, which leaves the minimum amount of state information in the URL. The example configuration files are packaged in a jar file whose location varies between Portal versions. For v6.0 see this IBM technote:
http://www-01.ibm.com/support/docview.wss?uid=swg21373541
For v6.1 see this IBM technote:
http://www-01.ibm.com/support/docview.wss?uid=swg21377436
For v6.1 the location of the jar file is:
<WP_ROOT>\base\wp.engine.impl\shared\app
The UrlNormalization_MIN.xsl file should be extracted from the jar and renamed, for example to UrlNormalization_CUSTOM.xsl. The new file should be placed in the following directory:
<WP_ROOT>\shared\app\com\ibm\wps\state\outputmediators
The UrlNormalization_CUSTOM.xsl should be modified to add the sections highlighted below:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="text()">
</xsl:template>
<xsl:template match="root">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates select="state[@type='navigational']"/>
</xsl:copy>
</xsl:template>
<xsl:template match="state">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates select="selection"/>
<xsl:apply-templates select="locale"/>
<xsl:apply-templates select="portlet"/>
</xsl:copy>
</xsl:template>
<xsl:template match="portlet">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates select="parameters"/>
</xsl:copy>
</xsl:template>
<xsl:template match="parameters">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="selection">
<xsl:copy>
<xsl:copy-of select="@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="locale">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
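To see what this stylesheet keeps and drops, the following Python sketch applies the same selection rules by hand to a small state document. It is an emulation for illustration, not an XSLT processor and not how Portal runs the file. The element names `root`, `state`, `selection`, `locale`, `portlet` and `parameters` are taken from the stylesheet; everything else in the sample document (attribute names, the child elements that get dropped, the parameter name and value) is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def deep_copy(element):
    """Copy an element and its subtree, without trailing text (like xsl:copy-of)."""
    clone = copy.deepcopy(element)
    clone.tail = None
    return clone

def normalize_state(xml_text):
    """Keep selection (attributes only), locale, and portlet parameters."""
    source = ET.fromstring(xml_text)
    result = ET.Element(source.tag, dict(source.attrib))
    for state in source.findall("state[@type='navigational']"):
        out_state = ET.SubElement(result, "state", dict(state.attrib))
        for selection in state.findall("selection"):
            ET.SubElement(out_state, "selection", dict(selection.attrib))
        for locale in state.findall("locale"):
            out_state.append(deep_copy(locale))
        for portlet in state.findall("portlet"):
            out_portlet = ET.SubElement(out_state, "portlet", dict(portlet.attrib))
            for parameters in portlet.findall("parameters"):
                out_portlet.append(deep_copy(parameters))
    return result

# Hypothetical state document, only loosely modelled on the stylesheet's matches.
SAMPLE_STATE = (
    '<root><state type="navigational">'
    '<selection selection-node="page1"><mapping/></selection>'
    '<locale>en</locale>'
    '<theme-template>Default</theme-template>'
    '<portlet id="p1"><portlet-mode>view</portlet-mode>'
    '<parameters><param name="example"><value>/lib/site/content1</value></param></parameters>'
    '</portlet></state><state type="other"/></root>'
)
print(ET.tostring(normalize_state(SAMPLE_STATE), encoding="unicode"))
```

In this model the portlet mode and theme template disappear from the output while the portlet's parameters survive, which is the behavior the WCM rendering portlet depends on.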
Finally, we need to configure the Portal State Manager Service to reference the new transformation file. Open the WebSphere Administration Console for the Portal installation and navigate to:
Resource environment providers > WP StateManagerService > Custom properties
Add or amend property "com.ibm.wps.state.outputmediators.OutputMediatorFactory.normalization_xsl_file" and set the value to UrlNormalization_CUSTOM.xsl. Do not use the full path to the file or URI format, simply enter the filename. Restart the Portal.
Testing the new setting is possible using browser tools to customize the user agent string, for example ‘User Agent Pick’ is suitable for Internet Explorer 8:
http://www.enhanceie.com/ietoys/uapick.asp
The user agent string for Google is "Googlebot/2.1 (+http://www.google.com/bot.html)". By setting the user agent string to that of an Internet search crawler, you will be able to confirm that Portal URLs are normalized to a greater or lesser extent according to the transformations specified in your XSL customization file. Also note that if you change the user agent string in this way, you will not be able to log in as Portal will not permit a crawler to visit pages that require authentication.
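The same check can be scripted instead of using a browser add-on. The sketch below builds an HTTP request that presents the Googlebot user agent string quoted above, using Python's standard `urllib.request`. The function names and the example URL are assumptions; `fetch_as_crawler` performs a real network call and is therefore only defined here, not invoked.

```python
import urllib.request

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"

def crawler_request(url, user_agent=GOOGLEBOT_UA):
    """Build a request that identifies itself with a crawler user agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_as_crawler(url, timeout=10):
    """Fetch a page as the crawler would and return the HTML as text."""
    with urllib.request.urlopen(crawler_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

# Hypothetical Portal URL; replace the host and port with your own before fetching.
request = crawler_request("http://localhost:10040/wps/portal")
print(request.get_header("User-agent"))
```

Comparing the links in the HTML returned by `fetch_as_crawler` with those a normal browser receives should show whether the URLs are being normalized as configured.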
Searching and crawling portal and other sites with portal search
Many Portals use a combination of external search engines and Portal search to help their users find the information they are looking for. Recall that Internet search services will only include the pages of the Portal that do not require authentication. Using Portal search enables users to find content via standard search portlets or the search box in the theme. The Portal search results can include the secured content that Internet search will not display, and additional search results from other internal or external systems that may complement the original search.
Understanding portal search
Portal search consists of a variety of portlets to administer search activities and display search results. In addition, there is a core search service that includes a variety of crawlers including Web site, Portal site, WCM and seedlist (for Quickr and other content types). During crawler processing, document filters are used to interpret more than 250 document formats. A categorizer organizes the results based on rules, and a summarizer generates a synopsis to be displayed in search results. The crawlers can be scheduled periodically and their output is a search collection, also known as an index file. See the following figure.
Using an appropriate crawler is important. You may wonder why it is not appropriate to simply crawl for all content using a Web crawler. The reason is that Web crawlers do not cater for additional meta data, such as user access rights associated with some content sources. Portal pages, portlets and WCM content are constrained using Portal access controls, for example, LDAP groups if the Portal has been configured to use an external directory. The Portal, WCM and seedlist crawlers work differently than the Web site crawler to ensure this additional security information is included in the search collection, thus ensuring that Portal users are not presented with search results they are not permitted to see. Note that the concept of seedlists is discussed in more detail later.
Portal search architecture
The Portal search core service can be configured to run on the local Portal, or the workload can be delegated to one or more dedicated servers. Note that when running a local Portal search engine, vertical clustering must not be used. If there were two instances of Portal running on the same hardware, the Portal search engine would run in both nodes and attempt to write to a single index file on the filesystem, resulting in file deadlocks and an incomplete file.
When using remote search architecture, a Web application is installed to a remote WebSphere Application Server and accessed by one or more Portal nodes. In a clustered Portal environment, remote search must be used, otherwise individual nodes might return different search results; see the figure below for an illustration of remote search in a clustered environment:
A choice of two communication protocols is available: EJB or SOAP, which have differences related to security. Portal will always maintain two types of security regardless of EJB or SOAP protocol:
- Collection level security - used to associate collections with sets of authorized users. Only authorized users can search in a collection.
- Document level security - ensures that users have proper authorizations on pages and portlets before search results are presented to them.
These two levels of security are enforced on the Portal server (where the Portal access controls are known). However, when using SOAP over HTTP, an unauthorized user could bypass Portal and access the remote Web service directly on the SOAP port. They could receive an unfiltered list of search results that contain a document summary, although they could not view the actual document or page. This is not a trivial task but it is theoretically possible. With EJB it is possible to completely secure the remote search application from unauthorized access.
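The two checks can be pictured with a minimal Python sketch. It is a conceptual model only, not Portal's implementation: the `Document` and `Collection` classes, the group names and the substring match are all hypothetical, chosen to show where each level of filtering applies.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    summary: str
    allowed_groups: set  # hypothetical: groups permitted to view this page or portlet


@dataclass
class Collection:
    name: str
    authorized_groups: set  # hypothetical: groups permitted to search this collection
    documents: list = field(default_factory=list)


def search(collections, user_groups, term):
    """Return the documents matching `term` that the user is allowed to see."""
    hits = []
    for collection in collections:
        # Collection level security: skip collections the user may not search.
        if not (collection.authorized_groups & user_groups):
            continue
        for doc in collection.documents:
            if term.lower() not in doc.summary.lower():
                continue
            # Document level security: drop results the user may not view.
            if doc.allowed_groups & user_groups:
                hits.append(doc)
    return hits
```

In this model both checks run in one place, which mirrors the point made above: a caller that reaches the index without passing through the component that knows the access controls would see unfiltered summaries.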
In a clustered Portal environment, typically multiple nodes will access a remote search server. It is also possible to configure load balancing between multiple remote search servers to further share workload or for redundancy and failover. It is also possible to have multiple but unrelated remote search servers, perhaps to separate search related administrative tasks between organizational units. For more information on configuring high availability for Portal search, see this white paper:
http://www.ibm.com/developerworks/websphere/zones/portal/proddoc/dw-w-portalsearch/
Search services, collections and scopes
These three concepts are illustrated in the figure below.
The figure above shows the relationships between search services, collections and scopes.
A search service is an instance of the Portal search engine that performs the crawling and document processing described previously. The manage search portlet is used to configure either local or remote search service(s).
Search collections are the result of crawling activities by the search services. Each search service may be responsible for one or more collections, and each collection is built or added to by a number of content sources. The content sources are the crawlers, for example, WCM or Portal crawlers. The collection is represented by a single file on the filesystem known as the index. For example, on a travel Web site there could be a collection for the ‘latest holidays’ that contains a Web site content source scheduled to be re-crawled daily. Another collection for the ‘destination guides’ could contain a WCM content source that is re-crawled weekly. The search collections are configured via the manage search portlet regardless of the search service performing the crawling. The configuration of search collections can be exported to automatically configure a collection on another search service if required.
Search scopes can be configured to categorize search results. This gives the user an opportunity to filter according to pre-defined categories, for example to only show search results for a particular holiday destination. A search scope can be configured to include only results from a particular search collection, or from several search collections. In addition, query text can be configured to include only results that match a certain string. For example, in a travel Web site the ‘destination guides’ search scope might be configured to include all results from the ‘European destination guides’ and ‘Americas destination guides’ search collections relying on a search scope query string ‘About Your Destination’.
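The relationship between collections and scopes can be sketched in a few lines of Python. This is an illustrative model rather than Portal code: the `Result` and `Scope` classes are hypothetical, and treating the scope query text as a case-insensitive substring match is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    title: str
    collection: str  # name of the search collection the result came from
    text: str


@dataclass(frozen=True)
class Scope:
    name: str
    collections: frozenset            # collections this scope includes
    query_text: Optional[str] = None  # optional extra filter string


def apply_scope(results, scope):
    """Keep only the results that fall inside the given scope."""
    kept = []
    for result in results:
        if result.collection not in scope.collections:
            continue
        # Assumption: the query text is matched as a plain substring.
        if scope.query_text and scope.query_text.lower() not in result.text.lower():
            continue
        kept.append(result)
    return kept


# The travel site example from the text, expressed in this model.
destination_guides = Scope(
    name="destination guides",
    collections=frozenset({"European destination guides", "Americas destination guides"}),
    query_text="About Your Destination",
)
```

A scope here is nothing more than a named filter over results that already exist in the collections, which is why it can be added or changed without re-crawling.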
Search and administration user interfaces
The search portlets provided by Portal are summarized below and some will be explored in more detail in the case study later in this section.
Search center
This portlet is used to display search results from multiple Portal search collections or WebSphere OmniFind (IBM’s enterprise search product). The portlet uses Web 2.0 features such as ‘type ahead’ in the search box to provide an up-to-date user experience.
The user can also enter a query in the small search box from the Portal theme, and on submitting the search they will be directed to a page showing the search center portlet.
The results are shown, together with the summary and ranking for each. In addition, a drop down selector of search scopes allows the user to filter the results. For example, by default there is a search scope that only shows WCM results. Search scopes can be customized using the manage search portlet. Recall that a scope can be tied to a single search collection, or span multiple collections. The figure below shows the search center portlet.
The search center portlet offers various points of customization by editing the JSPs. For more information see this developerWorks article:
http://www.ibm.com/developerworks/websphere/library/techarticles/0809_shapiro/0809_shapiro.html
The following figure illustrates the extent of the customizations that are possible.
Search and browse
The search and browse portlet allows users to perform more advanced searches compared with the search center portlet. For example, users can specify multiple search conditions and fields, or browse all available results. This portlet allows searches to be made on only one search collection at a time and it does not benefit from the latest Web 2.0 features as the search center portlet does. It is preinstalled but not deployed in the default Portal installation. Its portlet parameters must be configured before use.
Suggested links
This portlet can be used to configure the display of recommended search results based on keywords entered in the search request. Administrators can manually map key sources of information/documents to search terms, and deliver priority results to users. The results are displayed alongside search results from other search portlets.
This portlet is explored in more detail in the case study of this chapter.
External links
The portlet can be used to display the search results obtained from an internal or external search service that provides results as a feed. The results are displayed alongside search results from other search portlets. The search engine service must provide a public Web-interface and return the search result as either an RSS or Atom feed. Regular HTML results pages cannot be rendered within the external links portlet.
This portlet is explored in more detail in the case study of this chapter.
Manage search
This portlet allows the configuration and management of search services, collections and scopes.
Portal Search Toolbox
These portlets can be found on developerWorks and are not included with the Portal product.
http://www.ibm.com/developerworks/websphere/library/samples/pst.html
They can be used to experiment with more adventurous search scenarios that go beyond the supplied search center. The portlets are supplied with source code that can provide the basis for a bespoke search solution. Since this toolbox was published, some of the portlets, such as suggested links and external links, have evolved and benefited from formal testing and are now included with Portal.
WCM search component
This is not a portlet but a WCM component that allows search results to be embedded in WCM content or templates. The component is configured to present results for a Portal search collection. It is a good candidate if the entire Portal is based on WCM content, simply to have all elements of the page constructed from one source.
Portal search APIs
Portal provides the Search and Index API (SIAPI) that can be used to develop custom search portlets to perform search and index operations. This API was used to create the portlets from the Portal Search Toolbox discussed previously. The SIAPI is common to both Portal and WebSphere OmniFind.
In addition to SIAPI, a RESTful interface to Portal search is available that can return search results. A RESTful request could be made over HTTP via a browser and a variety of parameters are available. An example of a simple query is:
http://www.<hostname>:<port>/searchfeed/myportal/search?query=test&results=10
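As a minimal sketch, a query URL of this shape can be built in Python with the standard library. It assumes that `query` and `results` are two separate parameters joined by `&`; the host name and port in the usage line are placeholders, and no other parameters of the search feed are shown because the text above does not list them.

```python
from urllib.parse import urlencode


def build_search_url(host, port, query, results=10):
    """Build a search feed URL; urlencode escapes spaces and special characters."""
    params = urlencode({"query": query, "results": results})
    return f"http://{host}:{port}/searchfeed/myportal/search?{params}"


if __name__ == "__main__":
    # Placeholder host and port; substitute the values for your own Portal.
    print(build_search_url("portal.example.com", 10040, "web content management"))
```

Encoding the query through `urlencode` rather than concatenating strings avoids malformed URLs when the search term contains spaces, ampersands or non-ASCII characters.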
Seedlist framework
As previously mentioned, the Portal, WCM and seedlist content sources work differently from Web site content sources in order to consume additional meta data such as user access rights. When configuring a content source to crawl WCM or Portal content, a seedlist is automatically generated by Portal. The seedlist differs from a sitemap but is complementary. Recall that when the Portal is being crawled by an Internet search engine, the sitemap portlet is often used to generate a site map. The sitemap produces an XML document in a standard format for consumption by most Internet web crawlers. A seedlist is like an extension to a site map and is based on the ATOM syndication format [RFC4287]. The need to develop a single format emerged from the following challenges:
- Search engines cannot develop crawlers fast enough to keep pace with the proliferation of new internal content sources and new third-party content systems.
- Standard web crawling is becoming more and more inefficient because web content is created and changed more rapidly today than ever before. The crawler crawls an ever-growing set of documents, while it actually needs only the delta of modified or newly created documents.
- Web crawling can't reach all content, for example, most crawlers can't follow links that are manipulated by JavaScript code.
- Content meta-data is growing rapidly. It needs to be indexed in a generic and consistent way among all types of content.
The seedlist format is rich enough to describe content and address the challenges described above. This format can be crawled by Portal search and OmniFind, although not yet by Internet search engines. IBM provides a white paper and source code to facilitate the creation of seedlists for any manner of content repositories. For example, if an organization has a document management system that is not supported by Portal, the seedlist framework could be used, together with the API of the document management system to create an ATOM feed in the seedlist format. Additional content sources can then be added to Portal to crawl this new seedlist feed, and add the documents to the search collection.
For more details see:
http://www.ibm.com/developerworks/websphere/zones/portal/proddoc/dw-w-seedlist/
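To give a feel for the approach, the following Python sketch produces a plain Atom feed (RFC 4287) from a list of documents, as might be exported from a document management system. It is deliberately not a complete seedlist: the real seedlist format adds its own extension elements for meta data such as access rights, which are described in the white paper and are not reproduced here. The function name, the dictionary keys and the example values are all hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_feed(feed_id, title, updated, author, documents):
    """Serialize documents as a basic Atom feed and return it as a string."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = updated
    author_el = ET.SubElement(feed, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    for doc in documents:
        # One entry per document; `updated` lets a crawler pick up only the delta.
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = doc["id"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = doc["title"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = doc["updated"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=doc["url"])
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    docs = [
        {
            "id": "urn:example:doc:1",
            "title": "Travel insurance policy",
            "updated": "2010-01-15T09:00:00Z",
            "url": "http://dms.example.com/docs/1",
        }
    ]
    print(build_feed("urn:example:feed", "Example documents",
                     "2010-01-15T09:00:00Z", "Document system", docs))
```

Because every entry carries its own `updated` timestamp, a crawler reading such a feed can limit itself to new or modified documents, which addresses the second challenge listed above.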
Crawlability of portal with WCM content
As with external search, there are additional factors that need to be considered if the Portal contains WCM content. The WCM content should of course be indexed using Portal’s WCM crawler. However, note that this will index links to the content and not to the Portal page that would normally render the content item. For example, a WCM content item representing some news would be displayed in isolation, not on the Portal ‘News’ page with other related portlets. This is because until recently, there was no link between the content and the Portal page it should be displayed upon.
In the past, this problem was resolved by using rule-based mechanisms to adjust the URLs of the WCM search results so they included the correct Portal page. This could be achieved by reformatting the URLs output by a WCM navigator component and using that as the crawler start point, or alternatively by making a custom search portlet to reformat the search result URLs before presenting them to the user. In both cases, this was often driven by fixed rules and was not a very dynamic solution if new site areas and Portal pages were added to the site.
This situation is much improved since the arrival of the Lotus Web Content Management Rendering Portlet in the portlet catalog:
http://www-01.ibm.com/software/brandcatalog/portal/portal/details?catalog.label=1WP1001S6
Now using this rendering portlet with Portal v6.1.0.1 and above, when a search result is clicked from the search result portlet, the Portal will automatically determine the page responsible for rendering that content item, select the page, and set the page context to the content item. Thus, the user will see the selected content item on the page on which that item would normally be seen, including all the other portlets that should be on that page.
Categorization and taxonomy
Each search collection can optionally have one categorizer to organize documents into a taxonomy. The default collections in Portal 6.1.5 do not use a categorizer. If a categorizer is used, the taxonomy of search results only becomes apparent when using the search and browse portlet which enables browsing by category.
Portal Search provides two types of categorizers:
- Predefined static categorizer - This works by organizing documents into a predefined static set of over 2,300 categories for subject areas across twelve business disciplines. Note that this categorizer is disabled by default and cannot be selected when making a new search collection. If it is required, follow the instructions of this technote to enable it:
http://www-01.ibm.com/support/docview.wss?uid=swg21408370&myns=swgws&mynp=OCSSHRKX&mync=R
- User-defined rule-based categorizer - This allows categorization of documents by user-defined rules. The manage search portlet can be used to manage the taxonomy of rule-based categories. User defined rules can be based on URLs or search text. For example, there could be a rule to specify that all content with a URL including "/FAQ/*" should be categorized in the "Frequently Asked Questions" category.
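The idea behind URL-based rules can be illustrated with a short Python sketch. It does not reflect how Portal stores or evaluates its rules; the `RULES` list and the `categorize` function are hypothetical, and the glob pattern `*/FAQ/*` is used here to express "a URL including /FAQ/".

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (URL pattern, category name).
RULES = [
    ("*/FAQ/*", "Frequently Asked Questions"),
]


def categorize(url, rules=RULES):
    """Return every category whose URL pattern matches the given URL."""
    # fnmatchcase is case-sensitive, so "/faq/" would not match "/FAQ/".
    return [category for pattern, category in rules if fnmatchcase(url, pattern)]
```

A document can match several rules in this model, so the function returns a list rather than a single category.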
Case study: crawling the portal and displaying search results
After installing Portal 6.1.5, by default there is one search service (local) that contains two search collections: "Portal Content" and "WebContentCollection". The "Portal Content" contains a content source (or Portal crawler) configured to add the Portal pages to the collection, although by default it has not collected any pages. The "WebContentCollection" is a placeholder for content sources (or web crawlers) that gather web pages. For example, an organization may have some information in static HTML on a web server that is not integrated with the Portal.
To populate the "Portal Content" collection with some documents (portal pages), initiate its crawl activities. Click the Search Administration link from the Search Welcome page to reach the manage search portlet:
The ‘+’ symbol allows individual documents to be added at any time without requiring a complete re-crawl; however, in this case it is better to trigger the complete crawl. Select the "Portal Content" collection and then press the "play" arrow to begin the crawl. The refresh button can be used to update the status:
The time to complete of course depends on the amount of content in the Portal. For a fresh install of Portal 6.1.5 there are only a few pages so the crawl completes in just a few minutes. Now visit the search center page, or use the search box in the theme to make a search for "web content management". The results are displayed in order of relevance:
Suggested links
To configure the suggested links portlet, use the manage search portlet to add a new search collection called "Suggested links":
Use the add document button on the new search collection to add a new URL about Web Content Management,
http://www-01.ibm.com/software/lotus/products/webcontentmanagement/. Note that the link must be a public link rather than a secure link.
The meta-data from the Web page is used to populate the information about the page in the search collection, although it can be edited:
Next, the suggested links portlet should be configured to use the new search collection:
The keyword "Web Content Management" is already populated in the "Suggested Links" collection so this search term should now trigger a result in the suggested links portlet:
External search results
The final task of this case study is to configure the external search results portlet. This requires a search service that can provide results in RSS or ATOM syndication format, that is, the same feeds that newsreader clients subscribe to. For this case study, we will use the feeds service from Yahoo to search for news stories relating to our search term. From the Yahoo web site, it is possible to build custom RSS feeds. For example, a feed of all news stories about IBM looks like this:
http://news.search.yahoo.com/news/rss?p=ibm&ei=UTF-8&fl=0&x=wrt
The URL can then be modified to insert a place holder for the search query entered in the search portlet:
This needs to be configured as a parameter of the external search results portlet. Access the portlet’s parameters via Administration->Portlet Management->Portlets and search for "External Search Results". Set the "searchEngineUrl" parameter to the URL above. Note that you must press ‘OK’ after editing the individual parameter and again on the manage portlet:
The external search is actually made as an Ajax request from the browser, and this requires the use of an Ajax proxy. Browsers are not permitted to make requests to servers outside of the domain that served the original page. Instead, the portlet will make a request to Portal which in turn makes the request to Yahoo, assuming it is configured as an allowed destination in Portal’s Ajax proxy configuration.
The external search portlet is packaged in the searchCenter.war file, so the Ajax proxy configuration must be amended here to permit access to Yahoo. Locate the following file:
<Portal Install Path>\wp_profile\installedApps\luxor\PA_Search_Center.ear\searchCenter.war\WEB-INF\proxy-config.xml
Add an additional <proxy:policy> element, in this case to enable access to feeds at any host:
<proxy:policy url="*">
  <proxy:actions>
    <proxy:method>GET</proxy:method>
    <proxy:method>HEAD</proxy:method>
  </proxy:actions>
  <proxy:mime-types>
    <proxy:mime-type>text/xml*</proxy:mime-type>
    <proxy:mime-type>application/xml*</proxy:mime-type>
    <proxy:mime-type>application/atom+xml*</proxy:mime-type>
    <proxy:mime-type>application/rss+xml*</proxy:mime-type>
  </proxy:mime-types>
</proxy:policy>
After editing this file the web module needs to be restarted. This can be done using the WebSphere Application Server administration console, or by restarting the entire Portal.
Now when a search is made via the search center, external news results from Yahoo appear alongside: