Tuesday, September 24, 2013

Structured v/s unstructured data: a new take

It's quite interesting how software designers perceive structured v/s unstructured data. The traditional premise has been that structured data is always easier to process than unstructured data. While this is largely true, there are some interesting counterpoints against structured data. The big question, to me, is who takes care of the formats and standardization of structured data. Is it the owner, the corporation or a standards committee? The answer varies depending on how high-profile the structured data is. For instance, if you look at clinical trial data, it's not very clear at first whether the National Institutes of Health (NIH) owns the format or the clinical companies that conduct the trials do. Secondly, while there are standards meant to ensure such standardization, there is no enforcement, and large amounts of data were created either before the standard was established or before it was enforced and inspected. This places an undue burden on the data scientist, who has to comply with the standard semantics of the structure even when the data itself does not. Since there is an illusion of rule-based semantics associated with structured data, data mining program designs often get "lazy" over time, and no one notices changes in the data until either the software breaks or there is an accident.

The second issue with structured data is that multiple data sets associated with the same type of application are often inconsistent. Either the semantics change or the schemas differ. The true understanding stays hidden in the mind of the creator. Pretty soon, everything including the documentation of the format can become obsolete, and the data itself becomes the ultima ratio regum of the programmer, and even the user. When structured data is well documented and supposedly publicized in forums and announcements, the data scientist comes to expect consistency and correctness. The sense of community-based collaboration can cloud the mind of the data scientist and create a deceptive cycle of prejudice-based design, in which data that merely looks familiar is interpreted as if it were consistent.
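To make the point about differing schemas concrete, here is a minimal sketch in Python of how one might compare two batches of supposedly identical structured records before trusting them. The function names (`infer_schema`, `schema_drift`), the field names and the sample records are all hypothetical and made up for illustration; they are not taken from any real clinical trial format.

```python
from typing import Any, Dict, Iterable, List, Set

Record = Dict[str, Any]


def infer_schema(records: Iterable[Record]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    schema: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def schema_drift(old: Iterable[Record], new: Iterable[Record]) -> Dict[str, List[str]]:
    """Report fields that disappeared, appeared, or changed type between batches."""
    old_schema = infer_schema(old)
    new_schema = infer_schema(new)
    shared = old_schema.keys() & new_schema.keys()
    return {
        "missing": sorted(old_schema.keys() - new_schema.keys()),
        "added": sorted(new_schema.keys() - old_schema.keys()),
        "retyped": sorted(f for f in shared if old_schema[f] != new_schema[f]),
    }


# Hypothetical sample data: two batches that claim to follow the same format.
batch_2012 = [{"trial_id": "T-001", "enrolled": 120, "phase": "II"}]
batch_2013 = [{"trial_id": "T-002", "enrolled": "120", "sponsor": "Example Labs"}]

drift = schema_drift(batch_2012, batch_2013)
print(drift)
```

A check like this only catches structural drift (fields and types). It says nothing about a field whose meaning has quietly changed, which is the harder problem described above.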

With unstructured data, the burden on the owner is usually zero. The standards committees stay out of its way. The human expression of semantics is now an art form and not a science, which increases the creativity of the content in the data. Since there are no guidelines for structuring, free-form linguistic expression contributes to greater degrees of freedom in content variety. In the mind of the data scientist, user interest becomes a bigger barometer of the effectiveness of this content than data correctness. Programmers have often realized that the truth about data is a relative concept. What good is truth if there are no consumers for their data? Bringing in larger audiences, more collaboration and higher interactivity becomes a priority for the data scientist. Data mining algorithms start becoming more creative and alert towards semantic expression, and less towards structural integrity and consistency. The data scientist now seeks to evince the various shades of meaning in the data and to bring out different perspectives for separate audiences, effectively strengthening and diversifying the customer base. Giving more liberty to the algorithm implies more varied, more heterogeneous clusters. Isn't this what we humans crave? It would be boring to sound like programmable robots that utter predictable commands in the correct sequence. When there is no orderliness, the overall information entropy increases. Think of why literature

Monday, August 26, 2013

The ontology of oncology ...

Oncology is the branch of life science that deals with the study of the biology of cancer. Cancer is an old disease of humans and animals, except that it has grown by 200% in just the last 100 years. People often blame the modern diet, but the responsible factors are numerous and include the environment, stress, genetic predisposition and more. The study of the various forms of cancer has been a daunting task for bio-researchers. About 1.5 million research papers have been published, 200 drugs have been approved, 110,000 new treatment centers have opened around the world, and many sophisticated treatment techniques have been invented. However, the disease is elusive and smart. Cancer is learning over time. It has learned to use our immune system to attack us. There is a ton of big data about cancer, and this is precisely where categorization becomes critical for semantic correlation. This is where ontology comes in. There are hundreds of ontologies, published by various governmental and non-governmental organizations, that attempt to classify the whole information base of cancer.

Research is the new media

Traditionally, media has been about reporting news, perhaps BREAKING NEWS, sensationalism and, most of all, crime. Plenty of crime. Serial killers, murderers, terrorists, abductors. Over the years, the media industry has lost track of who its consumers are. A large percentage of the news produced by next-generation media is driven by data and text analytics. The news and entertainment industry is not only analyzing big data, but also researching it. It is able to analyze consumer sentiment and behavior (Netflix) and to profile audiences (news and sports). The future we are looking at is large clouds of context-specific data that allow millions of researchers to collaborate. This may include, but is not limited to, crime fighting, national security, cancer research, financial market analysis, space research, traffic analysis, weather models, sub-atomic physics research and more. The researchers of the future are not going to be the users of the media; instead, they will be the media itself. Why do we think so? The answer lies in where the true value of the knowledge being sought resides. What are the consumers of this content or knowledge striving for? The credibility and authenticity of the data justify the need for a larger audience. Extensive research leads to better conclusions, and people feel less violated. After all, consumers pay for this experience.