Tutorial: XSL on XHTML.

A Gentle Introduction to XML


Question 1: Extracting data from Amazon

The popular online bookstore Amazon has details of many thousands of music albums.

Their page devoted to "The Very Best of Elvis Costello and the Attractions" may be found using the unique identifier B00005ARFU.

Sadly this HTML cannot be processed by xalan because it is not well formed. Happily we can clean it up by doing a little pre-processing and running it through Dave Raggett's tidy. The result of this tidying can be found in a number of formats: original, text, xml and clean xhtml.

We wish to extract various details about this (or indeed any) album:

  1. Find the price. You will notice that the Amazon price is $20.99; we want to extract this value from the text. Examination of the raw text shows the price on line 445, in a b tag with the attribute class set to "price". The style sheet shown looks for this tag and dumps the details.
  2. There is a meta tag that gives the title and artist of the CD. The tag has name="description" and content="Title, Artist". We can recover this data using the template:
      <xsl:template match="htm:meta[@name='description']">
        album: <xsl:value-of select="@content"/>
      </xsl:template>
  3. The sales rank for this album is 901. We can find this data within a span tag, just after the node <b>Amazon.com Sales Rank:</b>
    The following template will find the required node:
      <xsl:template match="htm:b[text()='Amazon.com Sales Rank:']">
        rank: <xsl:value-of select="."/>
      </xsl:template>
    By changing the select attribute of the value-of element we can navigate to the required information: select="../text()" gives us the text node belonging to the parent of the matched node, which should be the number 901.
  4. Try your style sheet on another album such as Sgt. Pepper's Lonely Hearts Club Band at http://xmlzoo.net/amazon/B000002UAU (available as original, txt, xml and cleaned).
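
Putting the templates from steps 1 to 3 together, a complete style sheet might look like the sketch below. It rests on several assumptions that the page itself does not confirm: that the htm prefix is bound to the XHTML namespace http://www.w3.org/1999/xhtml (the namespace tidy normally puts on its XHTML output), that the price sits in a b element with class "price" as described in step 1, and that an empty template for text() is wanted so the built-in rules do not copy the rest of the page into the output. Adjust these details to match the tidied file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the namespace binding for htm is an assumption. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:htm="http://www.w3.org/1999/xhtml">

  <!-- We only want plain text lines, not markup, in the result. -->
  <xsl:output method="text"/>

  <!-- Override the built-in rule that would otherwise echo every text node. -->
  <xsl:template match="text()"/>

  <!-- Step 1: the price is held in a b element with class="price". -->
  <xsl:template match="htm:b[@class='price']">
    price: <xsl:value-of select="."/>
  </xsl:template>

  <!-- Step 2: title and artist come from the description meta tag. -->
  <xsl:template match="htm:meta[@name='description']">
    album: <xsl:value-of select="@content"/>
  </xsl:template>

  <!-- Step 3: find the label, then step up to the parent to read the rank. -->
  <xsl:template match="htm:b[text()='Amazon.com Sales Rank:']">
    rank: <xsl:value-of select="../text()"/>
  </xsl:template>

</xsl:stylesheet>
```

In XSLT 1.0, value-of applied to several nodes outputs only the first one in document order. If the parent happens to have a white-space text node before the b element, select="../text()[normalize-space()]" skips empty text nodes and may be a safer choice.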



Question 2: Creating a data-oriented node

Use XSL to create an XML file that contains details about any album. It should have a structure like:

<?xml version="1.0" encoding="UTF-8"?>
  <album>Sgt. Pepper's Lonely Hearts Club Band</album>
  <artist>The Beatles</artist>
  <date>: June 1, 1967</date>
  <label> Capitol</label>
  <asin> B000002UAU</asin>
  <rank> 183</rank>
  <track>1. Sgt. Pepper's Lonely Hearts Club Band</track>
  <track>2. With a Little Help from My Friends</track>
  <track>3. Lucy in the Sky With Diamonds</track>
  <track>4. Getting Better</track>
  <track>5. Fixing a Hole</track>
  <track>6. She's Leaving Home</track>
  <track>7. Being for the Benefit of Mr. Kite</track>
  <track>8. Within You, Without You</track>
  <track>9. When I'm Sixty-Four</track>
  <track>10. Lovely Rita</track>
  <track>11. Good Morning, Good Morning</track>
  <track>12. Sgt. Pepper's Lonely Hearts Club Band (Reprise)</track>
  <track>13. Day in the Life</track>
  <style>Styles > Classic Rock > British Invasion</style>
  <style>Styles > Classic Rock > Supergroups</style>
  <style>Styles > Classic Rock > General</style>
  <style>Styles > Classic Rock > Psychedelic Rock</style>


It may help if you turn off "show as text"
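
As a starting point, the style sheet below is a minimal sketch that produces only the album, artist and rank elements, reusing the patterns from Question 1. Several things here are assumptions rather than part of the exercise: the wrapper element name cd is hypothetical (an XML document needs a single root element, and the sample above does not name one), the htm prefix is assumed to be bound to the XHTML namespace, and the description is assumed to split cleanly at the first ", ". The date, label, asin, track and style elements depend on parts of the page not described here, so they are left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the root element name "cd" and the htm namespace are assumptions. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:htm="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="htm">

  <!-- Emit XML this time, with an XML declaration and indentation. -->
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Build the result document and pull in only the nodes we care about. -->
  <xsl:template match="/">
    <cd>
      <xsl:apply-templates select="//htm:meta[@name='description']"/>
      <xsl:apply-templates select="//htm:b[text()='Amazon.com Sales Rank:']"/>
    </cd>
  </xsl:template>

  <!-- content="Title, Artist": split at the first comma followed by a space. -->
  <xsl:template match="htm:meta[@name='description']">
    <album><xsl:value-of select="substring-before(@content, ', ')"/></album>
    <artist><xsl:value-of select="substring-after(@content, ', ')"/></artist>
  </xsl:template>

  <!-- The rank is the text belonging to the parent of the label node. -->
  <xsl:template match="htm:b[text()='Amazon.com Sales Rank:']">
    <rank><xsl:value-of select="../text()"/></rank>
  </xsl:template>

</xsl:stylesheet>
```

Because substring-before stops at the first ", ", an album whose title itself contains a comma would be split in the wrong place; that case would need a different rule.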
