Tuesday, September 24, 2013

Structured vs. unstructured data: a new take

It's quite interesting how software designers perceive structured vs. unstructured data. The traditional premise has been that structured data is easier to process than unstructured data. While this is largely true, there are some interesting counterpoints against structured data. The big question, to me, is who takes care of the formats and standardization of structured data. Is it the owner, the corporation, or a standards committee? The answer varies depending upon how high-profile the structured data is. For instance, if you look at clinical trial data, it's not clear at first whether the National Institutes of Health (NIH) owns the format, or the clinical companies that conduct the trials. Secondly, while there are standards to ensure such standardization, there is no enforcement, and large amounts of data were created either before the standard was established or before it was enforced and inspected. This places an undue burden on the data scientist: the need to comply with the standard semantics of a structure is itself a burden. Because there is an illusion of rule-based semantics associated with structured data, data mining program designs often get "lazy" over time, and no one notices changes until either the software breaks or there is an accident.
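The "lazy" failure mode is easy to sketch in a few lines of Python. The column names and records below are hypothetical, invented purely for illustration of a schema drifting under a positional parser:

```python
import csv
import io

# A hypothetical clinical-trial CSV whose schema silently changed:
# an "age" column was inserted between "patient_id" and "outcome".
OLD_DATA = "patient_id,outcome\nP001,remission\n"
NEW_DATA = "patient_id,age,outcome\nP002,54,relapse\n"

def lazy_parse(text):
    # Assumes the outcome is always the second column -- works until
    # the schema drifts, then silently returns the wrong field.
    rows = list(csv.reader(io.StringIO(text)))
    return [row[1] for row in rows[1:]]

def defensive_parse(text):
    # Reads the header, fails loudly if an expected column is missing,
    # and looks fields up by name instead of position.
    reader = csv.DictReader(io.StringIO(text))
    if "outcome" not in reader.fieldnames:
        raise ValueError("schema changed: no 'outcome' column")
    return [row["outcome"] for row in reader]

print(lazy_parse(OLD_DATA))       # ['remission'] -- correct
print(lazy_parse(NEW_DATA))       # ['54'] -- silently wrong
print(defensive_parse(NEW_DATA))  # ['relapse'] -- still correct
```

The lazy parser never raises an error on the new data; it just quietly returns the wrong column, which is exactly the "no one notices until there is an accident" scenario.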

The second issue with structured data is that multiple sets of data associated with the same type of application are often inconsistent: either the semantics change or the schemas differ. The true understanding is hidden only in the mind of the creator. Pretty soon, everything including the documentation of the format can become obsolete, and the data becomes the ultima ratio regum of the programmer, and even the user. When structured data is well documented and supposedly publicized in forums and announcements, expectations of consistency and correctness are built into the mind of the data scientist. The sense of community-based collaboration can cloud the data scientist's mind and create a deceptive cycle of prejudice-based design for interpreting perceptually recognizable data.

With unstructured data, the burden on the owner is usually zero. The standards committees stay out of its way. The human expression of semantics is now an art form and not a science, which increases the creativity of the content. Since there are no guidelines for structuring, free-form linguistic talent contributes to greater degrees of freedom in content variety. In the mind of the data scientist, user interest becomes a bigger barometer for measuring the effectiveness of this content than data correctness. Programmers have often realized that the truth about data is a relative concept. What good is truth if there are no consumers for the data? Bringing in larger audiences, more collaboration and higher interactivity becomes a priority for the data scientist. Data mining algorithms start becoming more attuned to semantic expression and less to structural integrity and consistency. The data scientist now craves to evince the various shades of meaning in the data, to bring out different perspectives for separate audiences, effectively strengthening the customer base and its diversity. More liberty for the algorithm implies greater cluster heterogeneity. Isn't this what we humans crave? It would be boring to sound like programmable robots that utter predictable commands in the correct sequence. When there is no orderliness, the overall information entropy increases. Think of why literature appeals to us: its richness comes from exactly this kind of expressive freedom.
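The entropy point can be made concrete with Shannon entropy. A minimal sketch, using two invented strings as stand-ins for rigid, robotic output and free-form expression:

```python
import math
from collections import Counter

def shannon_entropy(text):
    # Shannon entropy in bits per character: higher means less predictable.
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

robotic = "GO GO GO GO GO GO GO GO"        # predictable, repetitive commands
literary = "The quick onyx goblin jumps!"  # varied, free-form expression

print(shannon_entropy(robotic))   # low: only a few symbols, evenly repeated
print(shannon_entropy(literary))  # higher: many symbols, less predictable
```

The repetitive string scores far lower than the varied one, which is the sense in which less orderliness means more information entropy.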

Monday, August 26, 2013

The ontology of oncology ...

Oncology is the branch of life science that deals with the biology of cancer. Cancer is an old disease of humans and animals, yet its incidence has grown by 200% in just the last 100 years. People often blame the modern diet, but the responsible factors are numerous: the environment, stress, genetic predisposition and more. The study of the various forms of cancer has been a daunting task for bio-researchers. About 1.5 million research papers have been published, there have been 200 approved drugs and 110,000 new treatment centers around the world, and many sophisticated treatment techniques have been invented. However, the disease is elusive and smart. Cancer is learning over time; it has learned to use our immune system to attack us. There is a ton of big data about cancer, and this is precisely where categorization is critical for semantic correlation. This is where ontology comes in. There are hundreds of ontologies published by various governmental and non-governmental organizations that attempt to classify the whole information base of cancer.
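At its simplest, such an ontology is a graph of "is-a" relationships that lets a program correlate differently-worded records. A toy sketch in Python (the terms and hierarchy here are illustrative only, not drawn from any published ontology):

```python
# Toy "is-a" hierarchy: each term points to its broader parent category.
IS_A = {
    "glioblastoma": "brain cancer",
    "astrocytoma": "brain cancer",
    "brain cancer": "cancer",
    "melanoma": "skin cancer",
    "skin cancer": "cancer",
}

def ancestors(term):
    # Walk up the hierarchy, collecting every broader category.
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def related(term_a, term_b):
    # Two records correlate if their terms share any common ancestor.
    reachable_a = set([term_a] + ancestors(term_a))
    reachable_b = set([term_b] + ancestors(term_b))
    return bool(reachable_a & reachable_b)

print(ancestors("glioblastoma"))               # ['brain cancer', 'cancer']
print(related("glioblastoma", "astrocytoma"))  # True -- both brain cancers
print(related("glioblastoma", "melanoma"))     # True -- both cancers
```

Real ontologies add thousands of terms, synonyms and cross-references, but this is the basic mechanism by which they let semantic correlation work across records that never use the same words.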

Research is the new media

Traditionally, media has been about reporting news, perhaps BREAKING NEWS, sensationalism and, most of all, crime. Plenty of crime: serial killers, murderers, terrorists, abductors. Over the years, the media industry has lost track of who its consumers are. A large percentage of the news produced in next-generation media is driven by data and text analytics. The news and entertainment industry is not only analyzing big data but also researching it. They are able to analyze consumer sentiment (Netflix) and behavior, and profile audiences (news and sports). The future we are looking at is large clouds of context-specific data that allow millions of researchers to collaborate. This may include, but is not limited to, crime fighting, national security, cancer research, financial market analysis, space research, traffic analysis, weather models, sub-atomic physics research and more. The researchers of the future are not going to be the users of the media; instead, they will be the media itself. Why do we think so? The answer lies in where the true value of knowledge is sought. What are the consumers of this content or knowledge striving for? The credibility and authenticity of data justify the need for a larger audience. Extensive research leads to better conclusions, and people feel less violated. After all, consumers pay for this experience.

Friday, March 11, 2011

If Microsoft Had Invented The Internet (Humor)

- There would only be one website on the internet: c:\command.com

- You would have to "logon" to the internet by typing CTRL+ALT+DEL.

- MS Office would be your only browser. It would talk NetBIOS and not HTTP.

- All your search requests would be submitted to Command.Com, the Microsoft command center, the largest support office in the world. There would be 20,000 outsourced employees who would promise to get back to you within 24 hours with your search results.

- All products you ever purchased on the internet would be directly purchasable only from Microsoft.

- If there were security issues, Microsoft would "reboot the internet" by toggling the command center power supply.

- All your bookmarks would work only if they were stored in the autoexec.bat file. You would need Microsoft's permission to update your bookmarks.

- You would not be able to browse the internet. You would only be allowed to download it! Browsing would cost extra, and would be a premium service.

- Your browser cookie would be stored in config.sys and loaded along with other drivers.

- When many consumers would protest about the lack of innovation on the internet, Microsoft would make a special announcement that they would add a second website, the .Net website. To avoid any confusion, they would call it Command.Net.

Monday, June 30, 2008

Future of Google video

This is precisely what I am talking about:

http://news.cnet.com/8301-13506_3-9980495-17.html?tag=cnetfd.blogs.item

Setting up a Media 3.0 experience is a balancing act. You can't just have YouTube and call it the future of TV. Nor can you have TV and call it the future of the Internet.

Google is beginning to abandon YouTube and create a site where advertisers can come, a site with reviewed content.

Sunday, June 29, 2008

My definition of Media 3.0

I am sure you have heard of Web 2.0. So what's up with all these terms? Web 2.0, Web 3.0? Good lord. Seems never-ending and confusing. Now I am out there to add to some more confusion. Media 3.0.

So what is Media 3.0 and how is it different from all the 2.0s and 3.0s out there?

This is really not a new phenomenon; it has been in play for a while. Why do you visit Web 2.0 sites? Do you really trust the information you read? It could be a bunch of content posted by hundreds of thousands of people. Many of these are not rated content providers, and there is really no assurance of credibility for this information. But you still visit these sites, and you keep preferring them over traditional media.

But why are you visiting these Web 2.0 sites? Do you really seek any information? If so, how seriously do you take it? To answer that question, you must pick an example from the past. When was the last time you read a printed manual? Nowadays, companies that manufacture products are not even printing manuals anymore. The best example is Apple's iPhone. When I bought the iPhone, I never got a manual with it. So the question is, do I need one? And the next question is, would I really read a printed manual? How soon is the information in this manual going to be obsolete? How soon am I going to be tempted to check Apple's website for more current information?

So on one hand you have the printed manuals on the extreme right of the spectrum. These are obviously fading away. People want to read about other people's experiences rather than read the manuals. But they are looking, increasingly, for authenticity. So then, you have a lot of review sites, for example the CNET technology review site or the Amazon book review site. Why do you trust these sites more? Well, it's not really about trust. It's about knowledge, and not just information. You really need more than just a manual.

What's on the extreme left of the spectrum? It's those crazy Web 2.0 sites where anyone and just about anything gets posted. You really don't know who and what to believe. In my opinion, people visit such websites purely for entertainment. It's like how you don't disclose a lot of things to your closest friend until you've had a drink together. People want to read whether something really sucks, and the serious review sites are just not able to stoop down to that level, as they need to maintain a level of credibility and ethics.

So you had tons of Web 2.0 sites that held a lot of information, and there was really no way to correlate and organize this information automatically so you could consume it without going mad. That's where Web 3.0, in my opinion, came into play. Web 3.0 is all about the semantic relationships between content on the web, where users now have the power to assemble information and apps, and mash them up into a unique aggregate experience. This mainly helped you get to the information you needed much faster, without having to sort through a ton of information.

However, Web 3.0 made sites very complicated and eliminated a lot of the entertainment appeal that they enjoyed in the Web 2.0 era. People still needed the motivation to read Web 3.0 sites and get to the answers they needed, and this did not work well.

In the meantime, a new age was born. There were two types of companies: the technology companies and the media companies. The technology companies were trying to make their sites more interesting and entertaining. The media companies were trying to make their sites more credible and informative. This is when Media 3.0 was born. People realized that technology and media are merging into a single, almost synonymous, inter-convertible entity. In other words, the line between information and entertainment is getting blurry. And Media 3.0 is the next big wave that is beginning to sweep the internet and change it forever.

With Media 3.0, the Internet, broadcast media, print media and the mobile industry are all going to converge. There will be massive industry consolidation, but it will also lead to innovation. Data centers and satellite broadcast stations will become inter-convertible. Storage will be massively redundant, and website domains will be mobile. For instance, your mobile device will get an IP address, run a webserver and host its own website. Even the line between content and apps will be blurry. Consumers will load up on apps, and developers will load up on content. This is counter-intuitive and quite a reverse of the trend we have seen in the past.