Using SimpleXML with sfWebBrowser to parse HTML documents

Warning: This blog post was written a long time ago and might be no longer relevant.

sfWebBrowser is a class that emulates web browser calls. It gives us a nice object-oriented interface for navigating a document's structure programmatically. It can return the response as a SimpleXMLElement, which lets us run XPath queries against the document being parsed. We can then extract the part of the page we need with a single statement:

$xml->xpath('//table[@class="main"]//tr[@class="odd" or @class="even"]');
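
To see what that query selects, here is a minimal, self-contained sketch that runs it with plain SimpleXML, without sfWebBrowser. The markup, the table contents and the variable names are made up for illustration; with sfWebBrowser the element would come from getResponseXML() instead of simplexml_load_string().

```php
<?php
// Hypothetical, well-formed markup standing in for a fetched page.
$markup = <<<XML
<html>
  <body>
    <table class="main">
      <tr class="header"><th>Name</th></tr>
      <tr class="odd"><td>Alice</td></tr>
      <tr class="even"><td>Bob</td></tr>
    </table>
    <table class="other">
      <tr class="odd"><td>Ignored</td></tr>
    </table>
  </body>
</html>
XML;

$xml = simplexml_load_string($markup);

// Select only the "odd" and "even" rows inside the table with class "main".
$rows = $xml->xpath('//table[@class="main"]//tr[@class="odd" or @class="even"]');

// Collect the text of the first cell of every matched row.
$names = array();
foreach ($rows as $row) {
    $names[] = (string) $row->td;
}

echo implode(', ', $names), "\n"; // prints "Alice, Bob"
```

The header row and the second table are skipped because they do not satisfy both conditions of the query.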

Unfortunately, HTML pages are hardly ever well-formed XML documents, so sfWebBrowser's getResponseXML() method throws an exception more often than it returns a SimpleXMLElement. Luckily, there's a workaround. DOMDocument's loadHTML() method tolerates malformed markup, so we can override getResponseXML() to build the SimpleXMLElement from a DOMDocument whenever the original method fails.

<?php

/*
 * (c) 2008 Jakub Zalas
 *
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
 */

/**
 * Web browser
 *
 * @package    zToolsPlugin
 * @subpackage lib
 * @author     Jakub Zalas <jakub@zalas.pl>
 * @version    SVN: $Id$
 */
class zWebBrowser extends sfWebBrowser
{
  /**
   * Returns response as XML
   *
   * If the response is not valid XML, the SimpleXMLElement is built
   * from a DOM document created from the text response
   * (this is the case for HTML documents that are not well-formed XML).
   *
   * @return SimpleXMLElement
   */
  public function getResponseXML()
  {
    try
    {
      $this->responseXml = parent::getResponseXML();
    }
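    catch (Exception $exception)
    {
      // The response is not well-formed XML. Fall back to DOMDocument's
      // lenient HTML parser and collect libxml errors internally, so that
      // malformed markup does not emit PHP warnings.
      $previous = libxml_use_internal_errors(true);
      $doc = new DOMDocument();
      $doc->loadHTML($this->getResponseText());
      libxml_clear_errors();
      libxml_use_internal_errors($previous);

      $this->responseXml = simplexml_import_dom($doc);
    }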

    return $this->responseXml;
  }
}
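
The same fallback can be tried outside of symfony. The sketch below is only an illustration of the idea, not part of sfWebBrowser or zToolsPlugin: parseMarkup() is a hypothetical helper, and the HTML snippet is invented. It tries the strict XML parser first and switches to DOMDocument when that fails.

```php
<?php
/**
 * Hypothetical helper (not part of sfWebBrowser): parses markup as XML
 * and falls back to DOMDocument's lenient HTML parser when that fails.
 *
 * @param  string $text
 * @return SimpleXMLElement
 */
function parseMarkup($text)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $xml = simplexml_load_string($text);
    if ($xml === false) {
        $doc = new DOMDocument();
        $doc->loadHTML($text);
        $xml = simplexml_import_dom($doc);
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml;
}

// Not well-formed XML: the <br> elements are never closed.
$html = '<html><body><table class="main">'
      . '<tr class="odd"><td>Alice<br></td></tr>'
      . '<tr class="even"><td>Bob<br></td></tr>'
      . '</table></body></html>';

$xml  = parseMarkup($html);
$rows = $xml->xpath('//table[@class="main"]//tr[@class="odd" or @class="even"]');

echo count($rows), "\n";
```

Because the result is an ordinary SimpleXMLElement either way, the XPath query from the beginning of the post works unchanged on both well-formed and sloppy pages.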