Note that this article was first published on 02/01/2003. The original article is available on DotNetJohn.
Introduction
The XML Schema definition language (XSD) enables you to define the structure (elements and attributes) and data types for XML documents. It enables this in a way that conforms to the relevant W3C recommendations for XML schema. XSD is just one of several XML schema definition languages but is the one best supported by Microsoft in .NET.
The schema specifies the ordering of tags in the document, indicates fields that are mandatory or that may occur different numbers of times, gives the datatypes of fields and so on. The schema importantly is able to ensure that data values in the XML file are valid as far as the parent application is concerned.
Schemas are also useful when developers in different companies or even in different parts of the same company read and write XML documents that they will share. The schema acts as a contract specifying exactly what one application or part of an application must write into an XML file and another program can expect to be there. The schema unambiguously states the correct format for the shared XML.
A well formed XML document is one that satisfies the usual rules of XML. For example, in a well formed document there is exactly one data root node, all opening tags have corresponding closing tags, tag names do not contain spaces, the names in opening and closing tags are spelt in exactly the same way, tags are properly nested, etc.
A valid document is one that is well formed and that satisfies a schema.
Visual Basic .NET provide several methods for validating an XML document against a schema. There are articles on how to do this already on dotnetjohn ( Upload an XML File and Validate Against a Schema ). The focus of this article however shall be on the basic elements of the Microsoft preferred schema language (XSD), after a brief history lesson / an introduction to other common types of schema you may come across and why the XSD alternative was developed.
DTD and XDR
While XML is a relatively new technology the need for schemas was recognised early and so several have already been created. Microsoft focuses heavily on the most recent version, XSD, so VB has the most support for this form of schema, and hence will probably be the one Microsoft developers use most.
However, you may well happen upon the situation, particularly with enterprise development, where you are required to work with other forms of XML schema. While VB has few tools for building other types of schema, it can validate data using DTD and XDR.
The first schema standard was developed alongside XML v1.0 and is DTD (Document Type Definition) schemas. This, many believed, was not an ideal solution as a schema definition language which is why Microsoft came up with XSD as its own suggested replacement and submitted this to the W3C for consideration. One of the problems was, and is, that DTDs are not XML based so you have yet another language to learn to go with the proliferation that comes with XML (XPath and XSL for example). Further, developers also found that DTD lacked the power and flexibility they needed to completely define all of the datatypes they wanted to represent in XML. A schema that can’t validate all of the data’s requirements is of limited use.
XDR (XML Data Reduced) is another schema language, this time XML based and providing a superset of the functionality of DTDs. XDR should not be confused with Sun’s XDR (External Data Representation) … another format for data description but in this case physical representation of data rather than logical representation as per XML and XML Data Reduced schemas.
The last few paragraphs were just to let you know there are other schema formats out there, some of which have limited support in .NET. Now we’ll focus on XSD.
XSD
Wherever you see 'schema' from now we’re referring to XSD. As per many topics relating to XML (see my article on XSL Understanding How to Use XSL Transforms) the XSD specification is complex as well as being quickly evolving. The following will cover the basics of XSD so you can start to construct some useful schemas for use in your own applications. You’ll then need to follow up the information presented elsewhere.
Note that Visual Studio .NET includes an XSD editor that makes generating schemas relatively painless. Unless you understand some of the basic rules of XSD, however, the editor may prove a tad confusing.
Types and Elements
XSD schemas contain type definitions and elements. A type definition defines an allowed XML data type. An 'address' might be an example of a type you might want to define. An element represents an item created in the XML file. If the XML file contains an Address tag, then the XSD file will contain a corresponding element named Address. The data type of the Address element indicates the type of data allowed in the XML file’s Address tag.
Type definitions may be simple or complex. Simple and complex types allow definition of the new data types in addition to the 19 built in primitive data types which include string, Boolean, decimal, date, etc.
A simpleType allows a type definition for a value that can be used as the content of an element or attribute. This data type cannot contain elements or have attributes.
A complexType allows a type definition for elements that can contain attributes and elements.
Let’s pause here and take a look at an example. Let’s work backwards from an XML document as I’ll assume we’re all reasonably familiar with XML but less so with XSD. Here’s an XML file representing a simplified contacts database containing just one record currently:
<?xml version="1.0" encoding="utf-8" ?>
<Contacts>
<Contact>
<FirstName>Chris</FirstName>
<Surname>Sully</Surname>
<Address>
<Street>22 Denton Road</Street>
<City>Cardiff</City>
<Country>Wales</Country>
</Address>
<Tel>02920371877</Tel>
</Contact>
</Contacts>
In fact in Visual Studio .NET you can simply right click on this XML file and generate the schema from it. Of course it may well not be quite correct for your needs, as it shall be based on one record of data. Not even Visual Studio .NET can predict the future with any accuracy … ;) Here’s what it comes up with:
<?xml version="1.0" ?>
<xs:schema id="Contacts" targetNamespace="http://tempuri.org/XMLFile1.xsd" xmlns:mstns="http://tempuri.org/XMLFile1.xsd" xmlns="http://tempuri.org/XMLFile1.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" attributeFormDefault="qualified" elementFormDefault="qualified">
<xs:element name="Contacts" msdata:IsDataSet="true" msdata:Locale="en-GB" msdata:EnforceConstraints="False">
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element name="Contact">
<xs:complexType>
<xs:sequence>
<xs:element name="FirstName" type="xs:string" minOccurs="0" />
<xs:element name="Surname" type="xs:string" minOccurs="0" />
<xs:element name="Tel" type="xs:string" minOccurs="0" />
<xs:element name="Address" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="Street" type="xs:string" minOccurs="0" />
<xs:element name="City" type="xs:string" minOccurs="0" />
<xs:element name="Country" type="xs:string" minOccurs="0" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>
Picking out some key elements:
An XML Schema is composed of the top-level schema element. The schema element definition must include the following namespace:
http://www.w3.org/2001/XMLSchema
and you can see from the above this isn’t all that is generated, but we’ll ignore the extra elements for now.
The actual definition commences with the first <xs:element… definition which has the name attribute 'Contacts'. Again the other attributes we can ignore for now. Contacts is necessarily defined as a complex type as it contains other elements.
We then encounter <xs:choice…: a choice element allows the XML file to contain one of the elements inside the choice element. The attribute maxOccurs="unbounded" is used to indicate that the Contacts element can contain any number of Contact elements.
The contact element is again a complex type comprised of a sequence of further elements. A sequence element requires the XML document to contain the items inside the sequence in order. By default sequence elements must appear exactly once; this can be overridden using the minOccurs and maxOccurs attributes to indicate it can occur any number of times (including 0).
The individual elements are defined to be of type string (a simple type). Address is similar defined as a complex type of a sequence of elements of type string.
Hopefully that has been an informative introduction to some commonly encountered constructs by way of an example. We’ll now continue on to look at some of the XSD language constructs in a little more detail, starting with elements.
Elements and their attributes
e.g. <xs:element name="Street" type="xs:string" minOccurs="0" />
An element defines an entity in an XML file. So the above defines an element of name <Street> and type string. The element can have several attributes which modify the elements behaviour, for example:
minOccurs and maxOccurs: as indicated already these give the minimum and maximum allowed number of times an element can occur within a complex type. To make an element optional minOccurs is set to 0. To allow an unlimited number of the element maxOccurs is set to 'unbounded'.
ref: makes the element a copy of another element defined in the schema. This is best avoided however... it is better to define a distinct type and base both element definitions on this type rather than introduce such dependencies into the schema, e.g.
<xsd:simpleType name=”PhoneNumberType”>
<xsd:restriction base=”xsd:string” />
</xsd:simpleType>
...
<xsd:complexType name=”Contact”>
<xsd:sequence>
...
<xsd:element name=”HomeTel” type=”PhoneNumberType”/>
<xsd:element name=”WorkTel” type=”PhoneNumberType”/>
...
<xsd:sequence>
</xsd:complexType>
Though you might at the same time like to tie down your definition of the PhoneNumberType more tightly. We’ll return to <xsd:restriction … shortly.
default: assigns a default value to the element in which case if the XML document omits the corresponding field it will be assumed to have this value. An element that has a default value should also have minOccurs set to 0 so the XML document may omit it.
fixed: gives the element an unchangeable value. The corresponding XML element cannot have another value, although it may be omitted if minOccurs is 0. Why is this useful? Well, you may want to ensure that an XML data field has the same value throughout the document; for example, you may want to add a new Country field to an existing XML document and ensure that its value is UK for every record.
Types
Type definitions have two goals:
- To describe the data allowed in a simple field, e.g. text format of an e-mail address. Simple types achieve this goal.
- To describe relationships amongst different fields, e.g. a contact type consists of a sequence of firstname, surname, telephone, etc. Complex types achieve this goal of designing more complex data types.
In addition to simple and complex types there are built in types, similar to simple data types such as integers, dates, etc. in other programming languages or .NET’s value data types provided by the Common Type System (CTS). We’ve seen one of the built in types already in our element definitions in the form of the often-employed string type. These built in types are W3C defined and include date, dateTime, decimal, double, float, Year, etc. etc. See the SDK documentation for an authoritative list.
A facet is a characteristic of a data type that you can use to restrict the values allowed by a type. Facets are effectively attributes of the data type. For example, the string datatype has a maxLength facet. Again for further details of the facets of each built in type see the SDK documentation.
Facets enable short cuts to building simple types by restricting another data type. We’ve already seen the restriction construct in example code above; using this and the enumeration facet of the string built in data type we can define allowable values for a type, e.g.
<xsd:simpleType name=”Colours”>
<xsd:restriction base=”xsd:string”>
<xsd:enumeration value=”red” />
<xsd:enumeration value=”green” />
<xsd:enumeration value=”blue” />
</xsd:restriction>
</xsd:simpleType>
The pattern facet is particularly powerful as it specifies a regular expression that the XML field data must match. Regular Expressions are worthy of an article or three in themselves, and there are several books on the subject if interested in improving your knowledge. Look out for an article on regular expressions on dotnetjohn in the not too distant future! For now, we’ll largely skip over the topic of regular expressions though here is an example:
<xsd:simpleType name=”emailType”>
<xsd:restriction base=”xsd:string”>
<xsd:pattern value=”[^@]+@[^@]+\.[^@]+” />
<xsd:restriction>
<xsd:simpleType />
Let’s decipher ”[^@]+@[^@]+\.[^@]+” – this matches an e-mail address of the form a@b.c where a,b and c are any strings that do not contain the @ symbol. The value string equates to 'match any character other than the @ symbol one or more times; then match an @ symbol; then again match any character other than the @ symbol one or more times; next match a full stop and then once more any character other than the @ symbol one or more times'.
The use of the length, minLength, maxLength, totalDigits, fractionDigits, minExclusive, maxExclsuive, minInclusive, maxInclusive facets are all self-describing but it’s important to know they are available.
In addition to the primitive built in types there exist built in data types derived from these primitive types. These derived built in data types refine the definition of primitive types to create more restrictive types. They are based on the string and decimal primitive types.
The string derived types represent various entities that occur in XML syntax itself. For example, the Name type represents a string that satisfies the form of XML token names – it begins with a letter, underscore or colon and the rest of the string contains letters and digits.
The decimal derived types represent various kinds of numbers and thus are considerably more useful for validating data. There are thirteen such decimal derived types, e.g. byte, int, negativeInteger. See the SDK documentation for the full list.
Attributes
Just as you use an XSD schema’s element entities to define the data that can be contained in the corresponding XML data elements you can use attribute entities to define the attributes the XML element can have. Let’s return to Visual Studio .Net and see what schema it comes up with for the following small attribute-centric piece of XML.
<contacts>
<contact firstname=”Chris” Surname=”Sully” />
</contacts>
<?xml version="1.0" ?>
<xs:schema id="contacts" targetNamespace="http://tempuri.org/attribute_centric.xsd" xmlns:mstns="http://tempuri.org/attribute_centric.xsd" xmlns="http://tempuri.org/attribute_centric.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" attributeFormDefault="qualified" elementFormDefault="qualified">
<xs:element name="contacts" msdata:IsDataSet="true" msdata:Locale="en-GB" msdata:EnforceConstraints="False">
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element name="contact">
<xs:complexType>
<xs:attribute name="firstname" form="unqualified" type="xs:string" />
<xs:attribute name="Surname" form="unqualified" type="xs:string" />
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>
You can see that attributes equate to the
<xs:attribute name="firstname" form="unqualified" type="xs:string" />
construct. The form attribute of the attribute tag (yes that does make sense!) is set to unqualified. This means that the attributes in the XML file do not need to be qualified by the schema’s namespace.
Why use attributes rather than elements (referred to as attribute-centric and element-centric XML)? Well, they are often interchangeable and it is largely a matter of taste. Generally however, elements should contain data and attributes should contain information that describes the data. So for contacts one could recommend an attribute centric approach.
However, such decisions are mitigated by the following:
- the attribute centric approach consumes less file space
- attributes can specify default values whereas elements generally do not
- you can order elements via the sequence construct; there is no method of enforcing order with attributes
- elements can occur more than once in complex types but an attribute can occur only once
Complex Types
As previously stated, whereas a simple type determines the type of data a simple text field can hold, a complex type defines relationships among other types. For example we defined a contact record to include fields to store first name and surname. Simple types are then used to define the allowable values in the fields. The complex type determines the fields that make up the contact type.
Complex types are also useful for defining XML elements that can have attributes – simple types cannot have attributes. A complex type can contain only one of a small number of elements. The elements within that element define the relationship the complex type represents. The most common elements are simpleContent, sequence, choice and all, as follows:
simpleContent: a complex type that contains a simpleContent element must contain only character data or a simple type. This construct is primarily so one may add attributes to a simple type.
sequence: as we’ve seen this allows specification of a required order to elements of a complexType.
choice: again as we’ve seen the corresponding XML data must include exactly one of the elements listed inside the choice. Note this is entirely different from the enumeration facet previously introduced: rather than a fixed set of values the choice construct allows the complex type to contain one of several types.
all: when a complex type includes the all element the corresponding XML data can include some or all of the listed elements in any order.
Named and Unnamed Types
Finally, to finish off our overview: if you will use a type only once there is no need to give it a name as you will not need to reference it again. You may choose to include the type definition in the code which uses it. This is the case for the Visual Studio generated schemas above, e.g.:
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element name="contact">
<xs:complexType>
<xs:attribute name="firstname" form="unqualified" type="xs:string" />
<xs:attribute name="Surname" form="unqualified" type="xs:string" />
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
If the schema referenced this complexType again it would be more succinct to add a name attribute to the <xs:complexType … definition so you could reference it again later in the schema definition.
To clarify by example, the following uses the email simple type we introduced earlier to reduce the size of an ‘EmailContactType’ definition:
<xsd:simpleType name=”emailType”>
<xsd:restriction base=”xsd:string”>
<xsd:pattern value=”[^@]+@[^@]+\.[^@]+” />
<xsd:restriction>
<xsd:simpleType />
<xsd:complexType name=”emailContactType”>
<xsd:sequence>
<xsd:element name=”name” type=”xsd:string” />
<xsd:element name=”email” type=”emailType” />
</xsd:sequence>
</xsd:complexType>
Alternatively you could have defined the email type within the email element. Whether reusing or not, employing this convention will generally make your code tidier.
Conclusion
There we shall halt our introduction to XML Schemas and to the basic XSD constructs specifically and hope you are better placed to understand why and how to use XML schemas in .NET
References
.NET Framework SDK documentation
Visual Basic .Net and XML
Stephens and Hochgurtel
Programming Visual Basic .NET
Francesco Balena
Microsoft Press