Tuesday, September 24, 2013

Structured vs. unstructured data: a new take

It's quite interesting how software designers perceive structured vs. unstructured data. The traditional premise has been that structured data is always easier to process than unstructured data. While this is largely true, there are some interesting counterpoints against structured data. The big question, to me, is who takes care of the formats and standardization of structured data: the owner, the corporation, or a standards committee? The answer varies depending upon how high profile the structured data is. For instance, if you look at clinical trial data, it's not clear at first whether the National Institutes of Health (NIH) owns the format, or the clinical companies that conduct the trials. Secondly, while there are standards to ensure such standardization, there is no enforcement, and large amounts of data were created before the standard was established, or before it was enforced and inspected. This places an undue burden on the data scientist: the need to comply with the standard semantics of the structure. Because there is an illusion of rule-based semantics associated with structured data, data mining program designs often get "lazy" over time, and no one notices changes until either the software breaks or there is an accident.
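To make that "lazy" failure mode concrete, here is a minimal sketch; the file layout, field names, and values are all invented for illustration and are not taken from any real trial standard. A parser that hard-codes the column order keeps working until the schema quietly changes underneath it:

```python
import csv
from io import StringIO

# Hypothetical clinical-trial rows, invented for illustration.
# "Version 1" layout: trial_id, dosage_mg, outcome
v1 = "T001,50,positive\nT002,75,negative\n"
# Later files quietly gain a column: trial_id, site, dosage_mg, outcome
v2 = "T003,Boston,50,positive\n"

def lazy_parse(text):
    """Hard-codes the v1 column order forever -- the 'lazy' design."""
    for row in csv.reader(StringIO(text)):
        yield row[0], float(row[1]), row[2]  # trial_id, dosage_mg, outcome

print(list(lazy_parse(v1)))  # works exactly as the author intended

try:
    print(list(lazy_parse(v2)))
except ValueError as err:
    # We got lucky: the inserted column is non-numeric, so the failure is
    # loud. Had it been numeric, every dosage would have been silently
    # wrong -- the unnoticed change that ends in an "accident".
    print("schema drift surfaced only by accident:", err)
```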

The second issue with structured data is that multiple data sets associated with the same type of application are often inconsistent: either the semantics change or the schemas differ. The true understanding is hidden only in the mind of the creator. Pretty soon, everything, including the documentation of the format, can become obsolete, and the data itself becomes the ultima ratio regum (the final argument) for the programmer, and even the user. When structured data is well documented and supposedly publicized in forums and announcements, expectations of consistency and correctness are built into the mind of the data scientist. That sense of community-based collaboration can cloud the data scientist's judgment and create a deceptive cycle of prejudice-based design for interpreting perceptually recognizable data.
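A toy illustration of that drift, with all field names and units invented for the example: two feeds describe the same kind of record, and the reconciliation code ends up encoding knowledge that lives only in someone's head.

```python
# Two hypothetical feeds describing the same kind of record; the
# field names and unit conventions here are made up for illustration.
feed_a = {"patient": "P1", "weight": 154}        # pounds, by old convention
feed_b = {"subject_id": "P2", "weight_kg": 70.0} # kilograms, renamed key

def normalize(record):
    """Reconcile both schemas into one canonical shape.

    Every branch below is tribal knowledge -- the understanding that
    lives only in the mind of the creator once the docs go stale.
    """
    if "weight_kg" in record:
        return {"id": record["subject_id"], "weight_kg": record["weight_kg"]}
    # Older feed: the key is 'weight' and the unit is pounds, a fact
    # nobody ever wrote down.
    return {"id": record["patient"], "weight_kg": record["weight"] * 0.4536}

print(normalize(feed_a))  # {'id': 'P1', 'weight_kg': 69.8544}
print(normalize(feed_b))  # {'id': 'P2', 'weight_kg': 70.0}
```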

With unstructured data, the burden on the owner is usually zero, and standards committees stay out of its way. The human expression of semantics becomes an art form rather than a science, which increases the creativity of the content in the data. Since there are no guidelines for structuring, free-form linguistic talent contributes to greater degrees of freedom in content variety. In the mind of the data scientist, user interest becomes a bigger barometer for measuring the effectiveness of this content than data correctness. Programmers have often realized that the truth about data is a relative concept: what good is truth if there are no consumers for the data? Attracting larger audiences, more collaboration, and higher interactivity becomes the priority for the data scientist. Data mining algorithms grow more creative and attentive toward semantic expression, and less toward structural integrity and consistency. The data scientist now craves to evince the various shades of meaning in the data, to bring out different perspectives for separate audiences, effectively strengthening the customer base and its diversity. More liberty for the algorithm implies greater cluster heterogeneity. Isn't this what we humans crave? It would be boring to sound like programmable robots uttering predictable commands in the correct sequence. When there is no orderliness, the overall information entropy increases. Think of why literature appeals to us: it thrives on exactly this kind of entropy.
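The entropy remark can even be made literal. A quick Shannon-entropy estimate over characters (a deliberately crude measure, on toy strings invented for the example) scores free-form prose above rigidly templated records:

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy in bits per character: H = -sum(p * log2(p))."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Rigidly templated records vs. free-form prose (toy strings).
structured = "id=1;status=OK;\n" * 3
freeform = "The trial surprised everyone; outcomes varied wildly by site.\n"

print(round(char_entropy(structured), 2))  # ~3.5: repetitive, predictable
print(round(char_entropy(freeform), 2))    # ~4.2: more varied symbols
```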