A month ago I had the chance to present the Sinequa solution at a conference organized by one of the five big pharma companies. The theme of the conference was innovation for RDI and the first morning was dedicated to Big Data. Sinequa was invited because we have customers like Siemens or AstraZeneca whose R&D teams use our solution to find information in large amounts of internal and external documents (publications, project reports, test reports, patent filings, and even emails), but more importantly, to find expertise based on the analysis of these documents.
Following all the contributions on the topic of Big Data and a round table discussion, the audience was invited to ask questions. One of them put all the speakers on the spot:
Big Data… is it garbage in or garbage out?!
After a few seconds of hesitation, I ventured an explanation of why technologies like search could help information workers decide for themselves what is garbage and what isn’t.
But the question was definitely more complex than it seemed!
Big Data is very often associated with machine data and related storage issues. In large organisations, especially in R&D, Big Data is very often human generated data. By this I mean documents, email, research reports stored in many applications, containing years of research on a specific subject. Furthermore, in R&D, information does not only reside inside the firewall, but also outside, in specialized databases or in academic publications. To some people, hundreds of millions of such documents may appear as “garbage”, but they could turn out to be a goldmine if a scientist finds in that “garbage” research results related to his or her current research, or even better, if he or she can find an “unknown colleague” who can provide answers to some specific questions.
Then came the question: how do you actually filter the garbage for each end user and help them find the goldmine?
The first approach people take is to define the best sources for good content. With search, if you index poor quality content, you will find poor quality content!
But very often it is almost impossible for an IT department to define what is good or bad quality content. The quality of content may even be perceived differently by different users, i.e. different subject matter experts.
This is the main challenge in dealing with human generated big data!
Search is all about “Free-Form-Analytics”, contrary to the slicing and dicing in predetermined structures of data warehouses and “classic” BI tools. To offer this flexibility, data is organized during initial indexing and then during the life of the search application.
Here are the main steps to achieve this:
Step 1: We index all the content, together with the corresponding security credentials and the available metadata, for every application or data source used by information workers on a daily basis.
Step 2: During indexing we perform statistical analysis, like many other search engines, but we add our special sauce: Natural Language Processing and semantic analysis, which let us tag names of people, companies, places, etc. in full text. It seems easy, but this is the hard work that needs to be automated, at scale, for hundreds of millions of documents, if you want to get a grip on the “Garbage”.
Step 3: Once this work is done, here comes the interesting part, when we link the automated content analytics to an organization’s “DNA” (mostly contained in its business applications). Organizations have spent years trying to organize their content and will probably continue to do so forever without ever seeing the end. Why not use that available DNA (products, client information, HR data, etc.) to refine the content analytics performed in step 2? An anonymous person detected in a text then becomes a colleague, a customer, a partner, etc. A strange series of numbers and letters becomes a product ID and so on.
Structured data helps to refine the analysis of unstructured data.
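To make the three steps more concrete, here is a minimal Python sketch. It is not how Sinequa implements them: the two regular expressions are a crude stand-in for real Natural Language Processing, and every name, group, product code and lookup table in it (PEOPLE, PRODUCTS, "XK-4471", etc.) is a hypothetical example invented for illustration. It only shows the idea of indexing content with its access rights, tagging entities in the text, and then using structured reference data to turn an anonymous name or code into a colleague, a customer or a product.

```python
import re
from dataclasses import dataclass, field

# Hypothetical "DNA" from business applications (HR, CRM, product catalog).
PEOPLE = {"marie durand": "colleague", "john smith": "customer"}
PRODUCTS = {"XK-4471": "product"}

# Crude stand-ins for NLP: two capitalized words = person, AA-0000 = code.
PERSON_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
CODE_RE = re.compile(r"\b[A-Z]{2}-\d{4}\b")


@dataclass
class Document:
    doc_id: str
    text: str
    allowed_groups: set            # Step 1: security credentials kept with the content
    entities: list = field(default_factory=list)


def tag_entities(text):
    """Step 2: tag candidate people and codes found in the full text."""
    found = [{"text": m.group(), "type": "person"} for m in PERSON_RE.finditer(text)]
    found += [{"text": m.group(), "type": "code"} for m in CODE_RE.finditer(text)]
    return found


def link_entities(entities, people, products):
    """Step 3: refine the tags using structured reference data."""
    linked = []
    for entity in entities:
        if entity["type"] == "person":
            role = people.get(entity["text"].lower())
        else:
            role = products.get(entity["text"])
        linked.append({**entity, "role": role or "unknown"})
    return linked


def build_index(docs):
    """Run steps 2 and 3 on every document at indexing time."""
    for doc in docs:
        doc.entities = link_entities(tag_entities(doc.text), PEOPLE, PRODUCTS)
    return docs


def search(index, term, user_groups):
    """Return only the matching documents the user is allowed to see."""
    return [
        doc.doc_id
        for doc in index
        if term.lower() in doc.text.lower() and doc.allowed_groups & user_groups
    ]


index = build_index([
    Document("report-1",
             "Marie Durand reviewed the stability results for XK-4471 with John Smith.",
             {"rd"}),
    Document("note-2", "Notes from Paul Martin about compound QQ-0001.", {"legal"}),
])
```

In this sketch, "Marie Durand" in report-1 ends up with the role "colleague" and "XK-4471" with the role "product", while "Paul Martin" and "QQ-0001" in note-2 stay "unknown" because nothing in the reference data matches them. A call such as search(index, "stability", {"rd"}) returns report-1 only for a user who belongs to the "rd" group.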
Once you have gone through these steps, you are able to provide end users not only with a way to manipulate huge amounts of data (what you may have called garbage before may have become valuable Human Generated Big Data), but also with a way to make sense of this data by asking the free-style questions that they are interested in at a given time. There is no limit to the questions you can ask, and hence no limit to making your Big Data valuable.
By Xavier Pornain
VP Sales & Alliances at Sinequa