Before any work can be done with an XML document it needs to be parsed; that is, broken down into its constituent parts with some sort of internal model built up. Although XML files are simply text, it is not usually a good idea to extract information using traditional methods of string manipulation such as Substring, Length, and various uses of regular expressions. Because XML is so rich and flexible, for all but the most trivial processing, code using basic string manipulation will be unreliable.
Instead, a number of XML parsers are available - some free, some as commercial products - that facilitate the breakdown and yield more reliable results. You will be using a variety of these parsers throughout this book. One justification for using a handmade parser in the early days of XML was that pre-built ones were overkill for the job and had too large a footprint, both in actual size and in the amount of memory they used. Nowadays some very efficient and lightweight parsers are available; with these to hand, developing your own is a waste of resources and not a task to be undertaken lightly.
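To make the contrast concrete, here is a minimal sketch in Python. It uses the parser in Python's standard library (`xml.etree.ElementTree`), which is not one of the parsers discussed in this book, and the two sample documents and both function names are invented for illustration. The two documents carry the same information, but only the parser-based function treats them alike.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that an XML parser treats as equivalent:
# same element and attribute value, but different quoting, whitespace
# and character escaping.
DOC_A = '<book title="Tom &amp; Jerry"/>'
DOC_B = "<book\n   title = 'Tom &#38; Jerry' ></book>"


def title_by_string_search(xml_text):
    """Naive extraction: take the text between title=" and the next quote."""
    marker = 'title="'
    start = xml_text.find(marker)
    if start == -1:
        # Single quotes or spaces around '=' defeat the search entirely.
        return None
    start += len(marker)
    end = xml_text.find('"', start)
    # The raw text is returned, so escapes such as &amp; are left undecoded.
    return xml_text[start:end]


def title_by_parser(xml_text):
    """Parser-based extraction: quoting, whitespace and escapes are handled."""
    return ET.fromstring(xml_text).get("title")


if __name__ == "__main__":
    for doc in (DOC_A, DOC_B):
        print(repr(title_by_string_search(doc)), repr(title_by_parser(doc)))
```

The string search returns the undecoded `Tom &amp; Jerry` for the first document and nothing at all for the second, whereas the parser yields the same attribute value for both.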
Some of the more common parsers used today include the following:
• MSXML (Microsoft Core XML Services): This is Microsoft's standard set of XML tools including a parser. It is exposed as a number of COM objects so it can be accessed using older forms of Visual Basic (6 and below) as well as from C++ and script. The latest version is 6.0 and, as of this writing, it is not being developed further, although service packs are still being released that address bugs and security issues. Although you probably wouldn't use this parser when writing your own application from scratch, it is the only option when you need to parse XML from within older versions of Internet Explorer (6 and below). In these browsers the MSXML parser is invoked using ActiveX technology, which can present problems in some secure environments. Fortunately, versions 7 and later have a built-in parser, which cross-browser libraries can also use; choose it in preference to MSXML if it's available.
• System.Xml.XmlDocument: This class is part of Microsoft's .NET library, which contains a number of different classes related to working with XML. It has all the standard Document Object Model (DOM) features plus a few extra ones that, in theory, make life easier when reading, writing, and processing XML. However, since the world is trending away from using the DOM, Microsoft also has a number of other ways of tackling XML, which are discussed in later chapters.
• Saxon: Ask any group of XML cognoscenti what the leading XML product is and Saxon will likely be the majority verdict. Saxon's offerings contain tools for parsing, transforming, and querying XML, and it comes from the software house of Dr. Michael Kay, who has written a number of Wrox books on XML and related technologies. Although Saxon offers ways to interact using the Document Object Model, it also has a number of more modern and user-friendly interfaces available. Saxon offers versions for Java and .NET; the basic edition is free to download and use.
• Java built-in parser: The Java library has its own parser. It has a reputation for being a bit basic but is suitable for many XML tasks such as parsing and validation of a document. The library is designed such that you can replace the built-in parser with an external implementation such as Xerces from Apache or Saxon.
• Xerces: Xerces is implemented in Java and is developed by the Apache Software Foundation, which is well known for its open source projects. It is used as the basis for many Java-based XML applications and is a more popular choice than the parser that comes with Java.
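Several of the parsers above expose a Document Object Model (DOM) interface. As a rough illustration of what DOM-style access looks like, the following sketch uses Python's standard `xml.dom.minidom` module, which is not one of the parsers listed; the sample document and the `list_books` function are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, used only for this illustration.
SAMPLE = (
    "<library>"
    "<book id='b1'>XML Basics</book>"
    "<book id='b2'>XPath Notes</book>"
    "</library>"
)


def list_books(xml_text):
    """Parse the text into a DOM tree and collect (id, title) pairs."""
    document = parseString(xml_text)
    books = []
    for element in document.getElementsByTagName("book"):
        # In this sample each <book> has exactly one child: its text node.
        books.append((element.getAttribute("id"), element.firstChild.data))
    return books


if __name__ == "__main__":
    for book_id, title in list_books(SAMPLE):
        print(book_id, title)
```

The parser builds the whole tree in memory first, and the code then walks it using standard DOM methods such as `getElementsByTagName` and `getAttribute`.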
About The Authors
Joe Fawcett has been writing software, on and off, for forty years. He was one of the first people to be awarded the accolade of Most Valuable Professional in XML by Microsoft. Joe is head of software development for Kaplan Financial UK in London, which specializes in training people in business and accountancy and has one of the leading accountancy e-learning systems in the UK. This is the third title for Wrox that he has written in addition to the previous editions of this book.
Liam Quin is in charge of the XML work at the World Wide Web Consortium (W3C). He has been involved with markup languages and text since the early 1980s, and was involved with XML from its inception. He has a background in computer science and digital typography, and also maintains a website dedicated to the love of books and illustrations at fromoldbooks.org. He lives on an old farm near Milford, in rural Ontario, Canada.
Danny Ayers is an independent researcher and developer of Web technologies, primarily those related to linked data. He has been an XML enthusiast since its early days. His background is in electronic music, although this interest has taken a back seat since the inception of the Web. Offline, he's also an amateur woodcarver. Originally from the UK, he now lives in rural Tuscany with two dogs and two cats.
Editorial Review
From simple data transfers to providing multi-channeled content, there's so much you can do with XML and this guide will get you started. It walks you through everything you need to know about this powerful language, including what it is, how it works, what technologies accompany it, and how you can apply it. You'll quickly discover how to manipulate XML documents, store XML in databases, extract data, utilize web services, and even use it for web page and image display. With the help of a case study, you'll even learn how to apply this information to give your programming a boost.
• Covers the goals of XML and the rules for constructing it
• Explores different techniques that help you verify that the XML is in the correct format
• Shows how to work with XQuery to create new XML documents and query existing data
• Explains how to retrieve data using DOM, XPath, and LINQ to XML
• Examines programming techniques specifically designed to cope with large documents
• Details how to present data for use by different systems
• Demonstrates a realistic XML pipeline used in a publishing business
Wrox Beginning guides are crafted to make learning programming languages and technologies easier than you think, providing a structured, tutorial format that guides you through all the techniques involved.
Reader Review
A reader in the United Kingdom says, "It's almost impossible nowadays to be in the IT industry without having to deal with XML files. Whatever you do, be it DBA, software developer, system, web designer, or even a heavy Office user - sooner or later you'll have to acquire some knowledge of the subject. XML is fast becoming a major underlying standard when dealing with interconnecting platforms, databases, programming languages, applications - you name it. It looks like it's here to stay for a long time, so if you're making a living from any of the above-mentioned fields - you'd better add it to your list of skills.
"While this book is certainly not one to be read cover to cover, and probably not all implementations are relevant to everyone, I felt that the subjects which interested me were presented in a very clear and methodical manner. A short look at the Index and Table of Contents available with the "Look Inside" feature reveals the scope of this book of more than 800 pages. Published in July 2012 - it covers up-to-date technologies and products that coincide with XML, and as it's a 5th edition the errata section on the publisher's site was reduced to 0.
"Mentioning the publisher's site - it contains 20 download files for each of the chapters of the book, to be used both in the "Try it Out" sections and the exercises at the end of it. I feel that this book will stay relevant for a long time and won't become obsolete as quickly as so many other computer related books nowadays. From the many computer books that I own, it's one of the few that rarely returns to the shelf, and the amount of bookmarks that pop out of it make it look like a hedgehog. It's well constructed, and the subjects are clearly explained, with every "Try it Out" section followed by a clear "How it Works", and over the almost two years I have owned it I still haven't come across any errata, poor grammar or bad writing.
"I think the book does a very good job of presenting the covered topics and referring when needed to other sources, which is exactly what should be expected from a book, in contrast with web searches. One example I can give is with SQL interaction: If you look for a way to export a SQL query to XML, and look for information on the web or MS-SQL BOL, you'll probably end up with many not so simple examples and explanations. Reading on this subject in the book will guide you from the extremely simple option of simply adding "FOR XML RAW" to your query to the more complex EXPLICIT option, and then refer the reader for more details on BOL.
"I'll remind again that I refer to the fifth printed edition from 2012, and not the Kindle edition, although I believe there's not much difference. I also can't compare this book with other books on the subject, as I haven't had the chance to read any of the available ones, mainly because I found this one covered well the scope from simple to advanced."