2010-11-25

What are the Long Tail Effect and Streisand Effect? How are they related to Web 2.0? Give real-life examples how they take place in Web 2.0 age.




In this article, I will describe the Long Tail Effect and the Streisand Effect on today's Web and how they relate to Web 2.0.

Long Tail Effect

The term Long Tail (or long tail) refers to a phenomenon first described by Chris Anderson in a 2004 Wired article. It describes how product offerings usually considered "unpopular" because of low sales volumes can make up a significant portion of an online business, because the total volume of these "unpopular" items can be significantly large. The term has become popular in recent times due to the rising trend of selling a large number of hard-to-find items in relatively small quantities, especially on the World Wide Web. One reason for this popularity is the significant profit that can be gained from this strategy compared to selling only a few popular items in large quantities. The total sales of this large number of "non-hit" items are what constitute the Long Tail Effect.

The Long Tail Effect only works when the cost of inventory storage and distribution is insignificant, making it economically viable to sell relatively unpopular products. This can in turn increase the competitiveness of those unpopular goods and reduce the demand for the most popular ones. For example, Web content businesses with broad coverage such as Yahoo! and Google may be threatened by the rise of smaller Web sites that focus on niche topics and cover them better than the larger sites. This is made possible by easy and cheap Web site software and the spread of RSS, which greatly reduce the cost of establishing and maintaining a Web site and the effort readers need to find these small Web sites on the Internet.

The Long Tail Effect is very much alive on the modern World Wide Web, and it is related to Web 2.0 because many of the successful Internet businesses that contribute to Web 2.0 have leveraged the Long Tail as part of their business. Examples include eBay (auctions), Amazon (retail), the iTunes Store (music and podcasts), and Yahoo! and Google (web search) among the major companies, along with smaller Internet companies like Audible (audio books) and Netflix (video rental).

Take Amazon and Netflix, for instance, which have used the Long Tail Effect since the introduction of Web 2.0 to gain a competitive edge over the neighborhood Blockbuster, thanks to the low overhead cost of stocking exotic and unusual books and DVDs in large centralized warehouses. In contrast, Blockbuster has to spend heavily every month for every square foot of its retail stores, which are built in visible and thus expensive neighborhood locations. These costs are probably multiples of what a large warehouse in the middle of nowhere costs. Ironically, such heavy spending on inventory storage does not guarantee popularity; in fact, Amazon and Netflix are now more popular than Blockbuster, thanks to the convenience of their online services, which do not even require customers to leave their houses to buy goods.

Streisand Effect

According to Wikipedia, the Streisand effect is primarily an online phenomenon in which an attempt to censor or remove a piece of information has the unintended consequence of causing the information to be publicized more widely than it would have been if no censorship had been attempted. It is named after American entertainer Barbra Streisand, whose 2003 attempt to suppress photographs of her residence inadvertently generated further publicity.

One of the reasons for the Streisand effect is the increased accessibility of the Internet, especially with the introduction of Web 2.0 technologies such as Twitter, Facebook, YouTube, forums and BitTorrent, which make the exchange of information easier, faster and more open to the public than before. Whenever someone tries to suppress certain information, typically via lawsuits, the blogosphere can intentionally and easily spread the news, which usually contains controversial information, across the Internet. Ultimately, this may do even more damage to the complainant's reputation than if he had just let the matter slide.

One famous real-life example of the Streisand effect in the Web 2.0 age is the Edison Chen photo scandal, which shook the Hong Kong entertainment industry in early 2008. The scandal involved the illegal distribution over the Internet of intimate and private photographs of famous Hong Kong actor Edison Chen with various women, including actresses Gillian Chung, Cecilia Cheung and Bobo Chan. It started when someone accidentally gained possession of these photographs and posted them on the Internet to share with his friends. This ultimately led to the distribution of the photographs across the Internet, and the news spread to other Internet users as visitors forwarded the images to different forums in Hong Kong.

Chen's effort to suppress the distribution of his private photographs ironically led to more photographs being put onto the Internet and more news coverage of the scandal, making it one of the hottest topics in the history of the Hong Kong entertainment industry. The rapid spread of the news and the photographs was made possible by Web 2.0 technologies: forums, which make communication among people much easier and almost real-time, and BitTorrent / Foxy, which make it possible to share photographs across different computers on the Internet.

The Long Tail Effect and Streisand Effect do exist in today's Web 2.0 world, and we have probably all taken part in them at one time or another. For the Long Tail Effect, small businesses especially should learn to take advantage of it, for example by considering niche marketing to maximize profit. For the Streisand Effect, although strict copyright protection of online content can discourage it, this does not guarantee total safety, as it is ultimately up to a judge or jury to decide what the law says and how it applies. As a result, we should use common sense when making speeches or posting content online, explicitly or implicitly, to prevent future confrontations.




2010-11-18

What is StAX? What are its advantages over SAX and DOM?


StAX is a standard XML API that can stream XML data to and from an application, just like other streaming APIs such as SAX and XNI. StAX is based on a standard pull-parser interface for the Java API. In StAX, the client application asks the parser for the next piece of information, rather than the parser telling the client application when the next datum is available. To put it simply, in StAX the client application drives the parser instead of the opposite. Moreover, StAX shares with SAX the ability to read arbitrarily large documents, and it is a bidirectional API that allows applications both to read existing XML documents and to create new ones.

One of the features of StAX is that it provides efficient XML access through a cursor API such as XMLStreamReader. The cursor moves across an XML document from beginning to end, one item at a time and only in a forward direction. The item the cursor points at can be a text node, a start-tag, a comment or the beginning of the document. One can retrieve information about the current item by invoking methods such as getName and getText on the XMLStreamReader. The example below demonstrates how an instance of XMLStreamReader is loaded using the XMLInputFactory class in a typical StAX program:

// Requires java.net.URL, java.io.InputStream and javax.xml.stream.*
URL u = new URL("http://rom1023.blogspot.com/");
InputStream in = u.openStream();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(in);

There are many getter methods on the XMLStreamReader for retrieving different types of information about the current item: for example, the element name, its text, its attribute count and so on. Here is a sample that iterates through the XML document, printing the name of each element the cursor encounters and the text content whenever a characters event is met:

for (int event = parser.next(); event != XMLStreamConstants.END_DOCUMENT; event = parser.next()) {
    switch (event) {
        case XMLStreamConstants.START_ELEMENT:
            System.out.println(parser.getLocalName());
            break;   // without this break, execution falls through to getText(),
                     // which throws IllegalStateException on a start-element event
        case XMLStreamConstants.CHARACTERS:
            System.out.println(parser.getText());
            break;
    }
}

The above loop with a switch statement is a very common pattern in StAX programs, used instead of a stack of if-else statements. However, it is also one of the major criticisms of StAX: the integer type codes for determining what kind of item the cursor is at, and the big switch statements, do not align with object-oriented programming, which is based on classes, inheritance and polymorphism. In an object-oriented design, the next method of the XMLStreamReader class would instead return an XMLEvent object with subclasses such as StartElement, Characters and EndDocument. The main reason for using integer type codes instead of classes is to avoid the cost of creating an event object for every item, but it does sacrifice the advantages of object-oriented programming.
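In fact, StAX does offer this object-oriented style through its iterator API, XMLEventReader, where each call to nextEvent returns an XMLEvent object instead of an integer code. Below is a minimal sketch; the sample XML string and the class name EventReaderDemo are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class EventReaderDemo {
    // Collects the local names of all start elements, separated by spaces.
    public static String startElementNames(String xml) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringBuilder names = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();   // an XMLEvent object, not an int code
            if (event.isStartElement()) {
                if (names.length() > 0) names.append(' ');
                names.append(event.asStartElement().getName().getLocalPart());
            }
        }
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(startElementNames("<greeting><to>World</to></greeting>"));
        // prints "greeting to"
    }
}
```

Each XMLEvent can be tested with methods such as isStartElement or isCharacters and then narrowed with asStartElement or asCharacters, which replaces the big switch statement at the cost of allocating an event object per item.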

The simple example above perhaps doesn't demonstrate the full power of StAX. As mentioned earlier, StAX is a bidirectional API that allows both reading from and writing to XML documents. For output, instead of the XMLStreamReader introduced earlier, we can use the XMLStreamWriter, an interface that provides methods to write elements, attributes, comments, text and so on to an XML document. Below is an example of how an instance of XMLStreamWriter can be loaded using the XMLOutputFactory class:

OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);

Different types of data can be written to the output stream using the various writeFOO methods provided by the XMLStreamWriter class, such as writeStartDocument, writeStartElement, writeEndElement, writeCharacters and writeComment. Below is an example of writing a hello-world XML document:

writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeCharacters("Hello World");
writer.writeEndDocument();   // also closes the still-open <greeting> start-tag
writer.flush();

There are many advantages to using XMLStreamWriter to write data to an XML document. One is that it helps maintain well-formedness constraints. For example, writeEndDocument will close any unclosed start-tags, and writeCharacters performs any necessary escaping of special characters such as < and &. Moreover, it is able to deal with documents with multiple roots and namespaces, and with element names that contain whitespace. Overall, creating an XML document with XMLStreamWriter is more efficient and faster than building a DOM tree.
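The automatic escaping can be demonstrated by writing to a StringWriter instead of a file; this is a small sketch, and the class name EscapeDemo is made up for illustration:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class EscapeDemo {
    // Writes the given text inside a <note> element and returns the XML produced.
    public static String writeNote(String text) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("note");
        writer.writeCharacters(text);   // special characters are escaped automatically
        writer.writeEndDocument();      // also closes the still-open <note> start-tag
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeNote("5 < 6 & 7"));
        // prints <note>5 &lt; 6 &amp; 7</note>
    }
}
```

Note that the client never has to write "&lt;" or "&amp;" by hand; the writer keeps the output well-formed for any input text.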

So, after introducing StAX, the big question is: what are its advantages over SAX and DOM? Starting with SAX, one of the major differences between the StAX and SAX APIs is that StAX is a streaming pull parser whilst SAX is a streaming push parser. A streaming pull parser is a programming model in which the client application takes the initiative, calling methods on the XML parsing library when it needs to interact with the XML document; that is, the client only gets (pulls) XML data when it explicitly asks for it. A streaming push parser, on the other hand, is a programming model in which the XML parser sends (pushes) data to the client application as it moves across the document from one element to the next; that is, the parser sends the data regardless of whether the client is ready to use it.

The advantages of StAX as a streaming pull parser over SAX which is a streaming push parser are summarized as below:
  • A StAX parser is more flexible as it allows the client to control the application thread and call methods on the parser as needed. By contrast, SAX takes control of the application thread, and the client can only accept invocations from the parser.
  • The pull parsing libraries of StAX are much smaller than those of SAX, and StAX client code is simpler and easier to write, especially when dealing with complex XML documents.
  • A StAX pull parser can read multiple documents at one time with a single thread, which cannot be done with a SAX parser.
  • A StAX pull parser can filter out elements the client wants to ignore, and it can support XML views of non-XML data. A SAX push parser does not support such functions.
  • Unlike SAX, StAX is a bidirectional API that allows programs both to read existing XML documents and to create new ones. This gives StAX an edge over SAX by providing the user with more functions and alternatives.
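To make the push model concrete, here is a minimal SAX counterpart of the earlier StAX loop; note that the parser calls into our handler rather than the other way round. The sample XML string and the class name SaxPushDemo are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxPushDemo {
    // Collects element names and text pushed to us by the SAX parser.
    public static String parse(String xml) throws Exception {
        StringBuilder sb = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        sb.append(qName).append(':');   // the parser pushes this event to us
                    }
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        sb.append(ch, start, length);
                    }
                });
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<greeting>Hello World</greeting>"));
        // prints greeting:Hello World
    }
}
```

The inversion of control is visible here: once parse is called, the application thread belongs to the parser until the whole document has been scanned, exactly the limitation the first bullet above describes.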
Next, let's talk about the advantages of StAX over DOM. Generally speaking, there are two types of programming models for working with XML documents: document streaming (SAX and StAX) and the document model (DOM). Streaming models for XML processing are particularly useful when memory is limited or when the application has to process several requests simultaneously. In fact, it can be argued that the majority of XML business logic benefits more from the streaming processing style than from the DOM-tree processing style, which demands in-memory maintenance of entire DOM trees.

To summarize, here are the advantages of StAX as a document streaming model over DOM, which uses the document tree model:
  • StAX works better than DOM when processing a large XML document, more than a few megabytes in size, or in memory-constrained environments such as J2ME.
  • The StAX API is generally faster than DOM because it can start generating output from the input almost immediately, without waiting for the entire document to be read, whereas DOM needs to build a complex tree data structure while reading the document.
  • The StAX API can work in applications that require constant streaming of XML to retrieve real-time data, such as Web Services or instant messaging applications. DOM is unsuitable for these applications, as it would be inappropriate to wait for the stream's closing tag (needed to finish building the DOM tree) when the XML document streams continuously.
In conclusion, StAX is a fast, straightforward and memory-thrifty way of loading data from an XML document. Although it still has its shortcomings, such as not supporting random access to the XML document after loading and not working well when the structure of the document is very complex, many of the toughest XML processing problems encountered today come from exactly the domain where StAX works well compared to SAX and DOM.



2010-11-17

What is "session hijacking"? What are its security threats? How can web developers avoid it?



Session hijacking is the act of exploiting a valid user session to gain unauthorized access to information or services after successfully obtaining or generating an authentication session ID. The session ID and other information unique to a particular user's Web application session are usually saved in HTTP cookies on the user's computer, and these cookies can be stolen by an attacker using an intermediary computer while the session is still in progress. This makes it unsafe for applications built on stateless HTTP to rely solely on such session parameters when the user logs into the application. Before tackling the problem of session hijacking, we should first understand how sessions are used in a Web application, how session hijacking can be carried out and what its security threats are, which will be discussed in the following paragraphs.

A session is a succession of interactions between two communication end points (usually the client's computer and the Web server) that occurs during the span of a single connection. When a user logs into an application and passes authentication, a session is created on the server to maintain state for subsequent requests from the same user. Applications use sessions to store parameters specific to the user in a particular login session. The session is kept "alive" on the server as long as the user is logged on, and it is destroyed only when the user logs out or the session times out after a predefined period of inactivity. The user's application parameters are deleted from the allocated memory space once the session is destroyed.

Sessions are independent of each other and are uniquely identified by a session ID, usually a long, random, alphanumeric string transmitted between the client's computer and the server. Session IDs are commonly stored in HTTP cookies, URLs and hidden fields of HTML Web pages. A URL containing a session ID typically looks like this:

http://rom1023.blogspot.com/view.do?sessionID=7AD30725112120802

In an HTML Web page, a session ID may be stored as a hidden field of a form:

<input type="hidden" name="sessionID" value="7AD30725112120802">

There are problems with session IDs. If the algorithm used to generate them is based on easily predictable variables such as time or IP address, and encryption (typically SSL) is not used during transmission, session IDs are susceptible to theft by attackers, which may then lead to session hijacking.

There are many methods for hijacking sessions:
  • Session fixation – the attacker tricks the user into using a session ID known to the attacker, for example by sending the user an email with a link containing that session ID, and then waits for the user to log into that session.
  • Session sidejacking – the attacker uses packet sniffing to read network traffic between two parties and steal the session cookie, especially when the web site does not use encryption to protect the session data.
  • Cross-site scripting – the attacker tricks the user's computer into executing a malicious script that is treated as trustworthy because it appears to belong to the server, redirecting the user's private information, including cookies, to the attacker.
The above are just some of the many methods that can be used to hijack sessions today. Alarmingly, they demonstrate how easily a session can be hijacked, even without a highly skilled attacker. As a result, it is important to examine the security threats of session hijacking and develop methods to protect against such attacks.

One security threat of session hijacking is that the attacker gains complete access to the private data of the user whose session has been hijacked and can perform operations on the user's behalf, which may be extremely harmful, especially if the system deals with money. The hijacked user may suffer financial loss if the attacker can perform operations such as money transfers, or the attacker may use the stolen private data for other illegal purposes. Also, the company that operates the hijacked system may be held accountable for the security breach and required to pay a large sum in compensation.

Moreover, the system may suffer a denial-of-service attack if the attacker gains authorized access and starts bombarding the server with requests to consume all available system resources, or passes malformed input data that can crash an application process. Elevation of privilege may also occur during session hijacking when the attacker assumes the identity of a privileged user to gain access to a highly privileged and trusted process or account. All these security threats arising from session hijacking must be prevented in order to provide a secure and trustworthy application for users.

There are many methods to prevent session hijacking which can be followed by web developers. Some of these methods are summarized as below:
  • Use a long, random and unpredictable alphanumeric string as the session ID, as this reduces the risk that an attacker can guess a valid session ID by trial and error or brute-force attacks.
  • Regenerate the session ID after a successful login to prevent session fixation, since the attacker will then not know the user's session ID after login.
  • Encrypt important data such as the session ID passed between the client's computer and the server by using a technology such as SSL, to prevent sniffing-style attacks.
  • Make a secondary check of the user's identity for each request made to the server. The check can compare the IP address of the current request with that of the previous one, and block the user if they do not match. The drawback of this method is that it does not prevent attacks by somebody who shares the same IP address.
  • Change the value in the HTTP cookie with each and every request, to prevent attackers from retrieving session data through the cookie. However, this may not be applicable for applications that depend on session parameters passed between pages.
  • Expire the session as soon as the user logs out, and clear any session parameters that may have been stored in the HTTP cookies.
  • Set a timeout and reduce the life span of a session or cookie to prevent the accumulation of outdated session data, which may be of interest to attackers.
  • Use third-party software such as ArpON to guard against possible hijacking of the website.
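As a sketch of the first guideline above, a long, unpredictable session ID can be generated from a cryptographically strong random source. The class and method names here are illustrative only, not taken from any particular framework:

```java
import java.security.SecureRandom;

public class SessionIdGenerator {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final char[] ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789".toCharArray();

    // Returns a random alphanumeric session ID of the given length.
    // SecureRandom, unlike java.util.Random, is not seeded from predictable
    // values such as the current time.
    public static String newSessionId(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET[RANDOM.nextInt(ALPHABET.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(newSessionId(32).length());   // prints 32
    }
}
```

A 32-character ID over a 62-symbol alphabet carries roughly 190 bits of entropy, far more than trial-and-error guessing can cover.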
All in all, although the above methods may not guard against every form of session hijacking, they do provide guidelines for Web developers to produce a more secure system and reduce the threats that session hijacking poses. Prevention is better than cure, and there is no harm in following these methods, especially when they do not require dramatic changes to your system. At the very least, you will not be caught off guard if session hijacking does suddenly occur in your system.




2010-11-10

What is your vision on Web 3.0? What are the key technologies that will make it happen?



The term Web 3.0 is a very hot topic nowadays. However, popular as it is, its definition is still not settled, thanks to the many different meanings given to it by experts, which makes the definition of Web 3.0 one of the hottest debates in the history of the World Wide Web. According to Wikipedia, Web 3.0 is a third generation of Internet-based Web services which emphasizes machine-facilitated understanding of information in order to provide a more productive and intuitive user experience. The term "machine-facilitated" is very important, as it summarizes what Web 3.0 is all about. Unlike Web 2.0, which is about people communicating, contributing and collaborating, with the results coming from the wisdom of crowds, Web 3.0 derives its "wisdom" from software that learns by searching, analyzing and drawing conclusions from online content. Instead of depending on people to refine information and opinion, intelligent software would do the same thing. This transformation of the Web into an artificial intelligence technology, the Semantic Web, is in my opinion what defines Web 3.0.

From the perspective of the Semantic Web, Web 3.0 is an evolving extension of the World Wide Web in which web content can be expressed not only in natural language but also in a form that can be understood, interpreted and used by software applications, thus permitting them to find, share and integrate information more easily. The Semantic Web was first conceived by Tim Berners-Lee, the inventor of the World Wide Web, and expressed as follows:

"I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A 'Semantic Web', which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The 'intelligent agents' people have touted for ages will finally materialize."
– Tim Berners-Lee, 1999

What is my rationale for defining Web 3.0 as the Semantic Web? I see the Semantic Web as a huge engineering solution for generalizing the format of data on the Web into a form that can easily be processed by anyone and any system, which makes the idea of Web 3.0 possible. The problem with Web 2.0 is that the majority of data published on the Web is hidden away in HTML files, and because HTML is not as standardized as one might want, it is difficult for other systems and software applications to extract the data from an HTML web page. For example, if an HTML web page used to provide weather information to other software applications changes its design and style, then those applications have to change their methods of extracting information too, to adapt to the new format. This is not workable, as one cannot know when other web pages will change their HTML design and style. As a result, more and more web technologies such as RDF, OWL, SWRL and SPARQL have been developed with the aim of generalizing the format of data on the Web in order to make it understandable by other systems and software applications, and thus evolve the World Wide Web into a new generation of Web 3.0 and the Semantic Web. Below, I will describe some of the Semantic Web technologies that will make this evolution possible.

The Semantic Web is generally built on syntaxes which use triple-based structures of Uniform Resource Identifiers (URIs) to represent data. These syntaxes are called Resource Description Framework (RDF) syntaxes. RDF is designed to be read and understood by computers, not to be displayed to people. It is usually written in XML and makes use of XML features such as namespaces. RDF is often used as a framework for describing Web resources such as the title, author, modification date, content and copyright information of a Web page, as illustrated in the following example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://rom1023.blogspot.com/">
    <rdf:Description rdf:about="http://rom1023.blogspot.com/">
        <dc:title>What is your vision on Web 3.0? What are the key technologies that will make it happen?</dc:title>
        <dc:author>Roman</dc:author>
    </rdf:Description>
</rdf:RDF>

This piece of RDF basically says that this article has the title "What is your vision on Web 3.0? What are the key technologies that will make it happen?" and was written by someone whose name is "Roman".

The benefit of using RDF is that the information maps directly and unambiguously to a decentralized model, and data represented in RDF is easy to process since RDF is a generic format which already has many parsers. This means that in an RDF application, you know which data are the semantics of the application and which are just syntactic fluff. Thus, data represented in RDF can become part of the Semantic Web.

Another popular Semantic Web technology is the Web Ontology Language (OWL), which is designed for software applications that need to process online content, rather than just presenting information to humans. OWL helps machines interpret Web content by providing additional vocabulary along with a formal semantics to describe the meaning of the terminology used in Web documents. This lets machines perform useful reasoning tasks on these documents and thus process Web content published by anyone. OWL provides three sublanguages designed for specific communities of implementers and users: OWL Lite, OWL DL and OWL Full, in order of increasing expressiveness. Each of these sublanguages is an extension of its simpler predecessor. The following relations hold, but their inverses do not:

  • Every legal OWL Lite ontology is a legal OWL DL ontology.
  • Every legal OWL DL ontology is a legal OWL Full ontology.
  • Every valid OWL Lite conclusion is a valid OWL DL conclusion.
  • Every valid OWL DL conclusion is a valid OWL Full conclusion.
The OWL family of languages supports a variety of syntaxes, grouped into two categories: high-level syntaxes and exchange syntaxes. High-level syntaxes are used for specification, while exchange syntaxes are more suitable for general use. The examples below, taken from the W3C, show the OWL 2 Functional Syntax, a high-level syntax, and the OWL 2 XML Syntax, an exchange syntax.

OWL2 Functional Syntax

Ontology(
Declaration( Class( :Tea ) )
)

OWL2 XML Syntax

<Ontology ontologyIRI="http://example.com/tea.owl" ...>
    <Prefix name="owl" IRI="http://www.w3.org/2002/07/owl#"/>
    <Declaration>
        <Class IRI="Tea"/>
    </Declaration>
</Ontology>

Consider an ontology for tea based on a Tea class, as shown above. Every OWL ontology must be identified by a URI (http://example.com/tea.owl in this case). To save space, preambles and prefix definitions have been omitted from the examples above.

The main power of Semantic Web languages is that anyone can easily publish data using RDF to describe a set of URIs, without fear that it might be misinterpreted or stolen, and with the knowledge that anyone in the world with a generic RDF processor can use it. However, although the future of Semantic Web technologies appears bright, they are still very much in their infancy, and there seems to be little consensus about the likely direction and characteristics of the early Semantic Web. Nevertheless, I still believe the Semantic Web is the vision of the future of the World Wide Web and is what defines the Web 3.0 generation.




2010-11-05

What are ID, IDREF, and IDREFS simple types in XSD? How is xs:unique used to constrain values? Give examples to explain the usage of these XSD constructs.




The purpose of the ID simple type in XSD is to define unique identifiers that are global to a document, emulating the ID attribute type available in XML DTDs. The ID simple type provides a way to uniquely identify the containing element of the attribute defined with this data type, through IDREF and IDREFS.

IDREF and IDREFS are simple types in XSD which can be used to refer to the values of the attributes with data type defined as ID and thus enabling links between documents. The difference between IDREF and IDREFS simple types is that IDREF simple type is a reference to the identifiers defined by the ID simple type whilst IDREFS is derived as a whitespace-separated list of what IDREF simple type can refer to.

The relationships between ID and IDREF / IDREFS simple types are synonymous with the primary key and foreign key relationship in the database. For every foreign key (IDREF / IDREFS), there must be a matching distinct and unique primary key (ID) which it is referring to.

For the identifiers defined with the ID, IDREF and IDREFS simple types to be valid:
  • The value of an ID simple type must be unique and distinct within the XML document.
  • Every identifier referenced by an IDREF or IDREFS value must exist as an ID value in the same XML document.
  • ID, IDREF and IDREFS values must be name tokens: they cannot contain whitespace or begin with a digit (for example, the integer value 202 cannot be an ID value).
The following example illustrates the ID, IDREF and IDREFS approach. Notice that the element “orderID” is declared with the ID simple type in the XSD and appears with two values, “k1” and “k2”, in the example XML document. In the XML document, the value of the element “orderIDREF” must match the value of some “orderID” element (here “k1” or “k2”) for the document to be valid. The value of the element “orderIDREFS” can be either a single “orderID” value, just like “orderIDREF”, or a whitespace-separated list of such values, and every token in the list must match some “orderID” element.

XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="orders">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="order" type="orderDetails" />
                <xsd:element name="orderlist" type="orderLists" />
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    <xsd:complexType name="orderDetails">
        <xsd:sequence>
            <xsd:element name="customerName" type="xsd:string"/>
            <xsd:element name="customerAddress" type="xsd:string"/>
            <xsd:element name="customerContact" type="xsd:string"/>
            <xsd:element name="orderIDREF" type="xsd:IDREF"/>
            <xsd:element name="orderIDREFS" type="xsd:IDREFS"/>
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="orderLists">
        <xsd:sequence>
            <xsd:element name="orderID" type="xsd:ID" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

XML Document

<?xml version="1.0" encoding="UTF-8"?>
<orders>
    <order>
        <customerName>Test</customerName>
        <customerAddress>Test Address</customerAddress>
        <customerContact>12345678</customerContact>
        <orderIDREF>k1</orderIDREF>
        <orderIDREFS>k1 k2</orderIDREFS>
    </order>
    <orderlist>
        <orderID>k1</orderID>
        <orderID>k2</orderID>
    </orderlist>
</orders>
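An XSD-aware validator enforces the ID/IDREF/IDREFS rules automatically. As a rough sketch of what that check involves (using only the Python standard library, with the element names of this example hard-coded), the first two rules from the bullet list above can be replayed against the sample document:

```python
import xml.etree.ElementTree as ET

# The sample XML document from above.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<orders>
    <order>
        <customerName>Test</customerName>
        <customerAddress>Test Address</customerAddress>
        <customerContact>12345678</customerContact>
        <orderIDREF>k1</orderIDREF>
        <orderIDREFS>k1 k2</orderIDREFS>
    </order>
    <orderlist>
        <orderID>k1</orderID>
        <orderID>k2</orderID>
    </orderlist>
</orders>"""

def check_id_constraints(xml_text):
    """Return a list of constraint violations (empty list = valid)."""
    root = ET.fromstring(xml_text)
    errors = []
    # Rule 1: every ID value (orderID here) must be unique in the document.
    ids = [e.text for e in root.iter("orderID")]
    if len(ids) != len(set(ids)):
        errors.append("duplicate ID value")
    id_set = set(ids)
    # Rule 2: every IDREF / IDREFS token must match some ID in the document.
    refs = [e.text for e in root.iter("orderIDREF")]
    for e in root.iter("orderIDREFS"):
        refs.extend(e.text.split())  # IDREFS is a whitespace-separated list
    for ref in refs:
        if ref not in id_set:
            errors.append("dangling reference: " + ref)
    return errors

print(check_id_constraints(DOC))  # [] -> the sample document is valid
```

Changing “orderIDREF” to a value such as “k9”, which matches no “orderID”, would make the check report a dangling reference, just as a real validator would reject the document.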

The W3C XML Schema language provides flexible XPath-based features for describing uniqueness constraints. One of these is declared with the xs:unique element, which specifies that an element or attribute value must be unique within a given scope.

The xs:unique element must contain the following (in order):
  • exactly one selector element, containing an XPath expression that selects the set of elements across which the values specified by the fields must be unique
  • one or more field elements, each containing an XPath expression that specifies a value that must be unique across the elements selected by the selector. When there are several field elements, the value of a single field need not be unique on its own, but the combination of all the fields must be unique
The following example illustrates a uniqueness constraint declared with the xs:unique element. Each product element must have an id child element whose value is unique within products. If any product id is duplicated within the document, the document will be reported as invalid against the schema.

XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="products">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="product" maxOccurs="unbounded">
                    <xsd:complexType>
                        <xsd:sequence>
                            <xsd:element name="id" type="xsd:integer" />
                            <xsd:element name="name" type="xsd:string" />
                            <xsd:element name="price" type="xsd:decimal" />
                        </xsd:sequence>
                    </xsd:complexType>
                </xsd:element>
            </xsd:sequence>
        </xsd:complexType>
        <xsd:unique name="prodIDUnique">
            <xsd:selector xpath="./product"/>
            <xsd:field xpath="id"/>
        </xsd:unique>
    </xsd:element>
</xsd:schema>

XML Document

<?xml version="1.0" encoding="UTF-8"?>
<products>
    <product>
        <id>546</id>
        <name>Nike Soccer Ball</name>
        <price>120</price>
    </product>
    <product>
        <id>547</id>
        <name>Adidas Sports Shoes</name>
        <price>600</price>
    </product>
</products>
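A schema validator applies the xs:unique constraint for us, but its semantics can be sketched in a few lines. The function below is a simplified emulation, not a real validator: it only handles child-element selectors and fields, but it shows both the selector/field split and the compound-field rule, since it collects the tuple of field values for each selected element and rejects the document on the first duplicate.

```python
import xml.etree.ElementTree as ET

# The sample XML document from above (whitespace condensed).
DOC = """<products>
    <product><id>546</id><name>Nike Soccer Ball</name><price>120</price></product>
    <product><id>547</id><name>Adidas Sports Shoes</name><price>600</price></product>
</products>"""

def is_unique(xml_text, selector, fields):
    """Emulate xs:unique: the tuple of field values must be unique
    across the elements matched by the selector path."""
    root = ET.fromstring(xml_text)
    seen = set()
    for elem in root.findall(selector):
        key = tuple(elem.findtext(f) for f in fields)
        if key in seen:
            return False  # duplicate key -> document invalid
        seen.add(key)
    return True

print(is_unique(DOC, "./product", ["id"]))  # True: both ids are distinct
```

With the selector "./product" and the single field "id", this mirrors the prodIDUnique constraint in the schema above; duplicating one of the id values would make the function return False.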



2010-09-29

Brief Introduction

Hi everyone,

Welcome to my blog! As you may know, my name is Roman and I'm writing this blog for one of my postgraduate course's assignments. This may sound puzzling, as you may wonder what kind of course would treat writing a blog as a student assignment and even mark the blog posts and count them towards the final grade. Yes! This is exactly the course that I'm taking now, and it is nothing if not relevant, because what this course teaches is what makes blogging online possible in the first place and allows people to interact and share their ideas and comments almost instantly through a world-wide network. You may already have guessed the answer, and it is nothing but "Web Technologies".

In this blog, you will find topics covering the latest trends in Web technologies as well as the common techniques used to display, format and organize information on a webpage. Most of these topics are written by me (remember, I'm doing this for my assignment =P), and of course there will be some useful articles and information provided by experts, which I will post separately on the "Articles" and "Presentations" pages to get you familiarized with the topics in my blog. Moreover, to make this blog more interactive, you are most welcome to leave comments on the topics posted, and you can also leave me a message on the "Blog Talk" page.

I hope this blog helps your understanding of the latest trends in Web technologies, and don't forget to rate the topics written by me so that I can at least get a glimpse of how good a writer I am =).

Cheers,
Roman