Artificial intelligence, machine learning, deep learning, cognitive computing… there is no doubt a lot of buzz out there, but quite a bit of confusion too in terms of expectations and prerequisites. We often hear customers and prospects say: “I want an AI assistant that tells me what to expect and what to do next,” or “It takes a lot of time to train an AI assistant to make it an expert in my field, doesn’t it? I want something that fits my budget, and I want it now.” Users also have many questions about the potential of machine learning for end users in their work environment. In this blog post, I’m sharing our initial thoughts on machine learning algorithms and how they empower cognitive search and analytics platforms to deliver better insights in a relevant work context.
Machine learning algorithms typically operate in two phases: the learning phase and the model application phase. In the learning phase, the algorithm iteratively analyzes manually classified data to extract a model. In the model application phase, that model is applied to new inputs to predict a result.
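As a minimal sketch of this two-phase pattern, consider the toy example below: the “model” is simply the mean (centroid) of each class in a two-dimensional feature space. The data, class names, and nearest-centroid approach are invented for illustration and are not any particular platform’s implementation.

```python
def learn(labeled_points):
    """Learning phase: derive a model (class centroids) from pre-labeled data."""
    sums, counts = {}, {}
    for (x, y), label in labeled_points:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def apply_model(model, point):
    """Application phase: predict the label of the nearest centroid."""
    x, y = point
    return min(model, key=lambda lbl: (model[lbl][0] - x) ** 2
                                      + (model[lbl][1] - y) ** 2)

# Invented, pre-labeled training data.
training = [((1, 1), "small"), ((2, 1), "small"),
            ((8, 9), "large"), ((9, 8), "large")]
model = learn(training)
print(apply_model(model, (1.5, 1.2)))  # → small
```

The expensive, iterative work happens once in `learn`; `apply_model` is then cheap to run on every new input, which is why the two phases are usually separated.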
Machine learning algorithms depend strongly on the quality of their input data: better data yields better results. Cognitive search and analytics platforms can use natural language processing (NLP) and other analytics to enrich structured and unstructured data from different sources (entity extraction, detection of relationships within the data, etc.). This “data pre-processing” stage lets machine learning algorithms start from enriched data and deliver relevant results much faster. These results continuously enrich the index/logical data warehouse, making it easier to answer users’ queries in real time.
A high-performing cognitive search and analytics platform must integrate machine learning algorithms with its NLP and other analytics capabilities to deliver the most intelligent and relevant search results to users. Below are five ways machine learning makes search cognitive:
- Classification by example – a supervised learning algorithm used to extract rules (create a model) that predict labels for new data, given a training set of pre-labeled data. For example, in bioinformatics, we can classify proteins according to their structures and/or sequences. In medicine, classification can be used to predict whether a tumor is malignant or benign. Marketers can also use classification by example to predict whether customers will respond to a promotional campaign by analyzing how they reacted to similar campaigns in the past.
- Clustering – an unsupervised learning algorithm that groups documents by similarity. Sinequa uses clustering when we don’t necessarily want to run a search query against the whole index; the idea is to limit the search to a specific cluster of documents. Unlike classification, the groups are not known beforehand, making this an unsupervised task. Clustering is often used for exploratory analysis. For example, marketing professionals can use clustering to discover distinct groups in their customer/prospect database and use these insights to develop targeted marketing campaigns. In pharmaceutical research, we can cluster R&D project reports based on similar drugs, diseases, molecules and/or side effects cited in these reports.
- Regression – a supervised algorithm that predicts continuous numeric values from data by learning the relationship between input and output variables. For example, in the financial world, regression is used to predict stock prices according to the influence of factors like economic growth, trends or demographics. Regression can also be used to create applications that predict traffic-flow conditions depending on the weather.
- Similarity – not a machine learning algorithm per se, but a computationally heavy process that builds a matrix capturing how similar each data sample is to every other. This matrix often serves as a basis for the algorithms cited above, and can be used to identify similarities between people in a given group. For example, pharmaceutical R&D can rely on similarity applications to assemble worldwide teams of experts for a research project based on their skills and their footprints in previous research reports and/or scientific publications.
- Recommendation – combines several of the basic algorithms above into a recommendation engine that proposes content likely to interest a given user. This is called “content-based recommendation”: it offers personalized suggestions by matching a user’s interests with the description and attributes of documents.
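To make classification by example concrete, here is a minimal sketch of the marketing scenario above using a 1-nearest-neighbour rule. The customer features, labels, and data are invented for illustration; a production classifier would use richer features and a more robust algorithm.

```python
# Invented training data: (age, past_purchases) -> responded to a
# similar past campaign?
training = [
    ((25, 1), False), ((32, 4), True), ((47, 6), True),
    ((51, 0), False), ((38, 5), True), ((29, 2), False),
]

def predict(customer):
    """Label a new customer with the label of the closest known example."""
    x, y = customer
    nearest = min(training,
                  key=lambda ex: (ex[0][0] - x) ** 2 + (ex[0][1] - y) ** 2)
    return nearest[1]

print(predict((35, 5)))  # → True: resembles customers who responded
```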
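The clustering idea can be sketched with a toy k-means loop. Documents are reduced here to invented 2-D feature vectors, and the initial centroids are passed in explicitly for determinism (real implementations use random or k-means++ seeding, and no cluster is assumed to become empty here).

```python
def kmeans(points, centroids, iterations=10):
    """Toy k-means: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x, y in points:
            i = min(range(len(centroids)),
                    key=lambda i: (centroids[i][0] - x) ** 2
                                  + (centroids[i][1] - y) ** 2)
            clusters[i].append((x, y))
        # Assumes no cluster ends up empty (true for this toy data).
        centroids = [(sum(x for x, _ in c) / len(c),
                      sum(y for _, y in c) / len(c)) for c in clusters]
    return clusters

# Two well-separated groups of invented document vectors.
docs = [(1, 2), (2, 1), (1, 1), (9, 8), (8, 9), (9, 9)]
groups = kmeans(docs, centroids=[(1, 2), (9, 9)])
```

The groups emerge from the data alone: no labels were provided, which is what makes the task unsupervised.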
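Regression can be illustrated with ordinary least squares on a single variable, in the spirit of the traffic-and-weather example above. The numbers below are invented toy data chosen to follow a clean linear trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = a*x + b with one input variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Invented data: rainfall (mm/h) vs. average traffic speed (km/h).
rain = [0, 1, 2, 4, 5]
speed = [60, 55, 50, 40, 35]
a, b = fit_line(rain, speed)
print(a * 3 + b)  # predicted speed at 3 mm/h of rain
```

Unlike classification, which predicts a discrete label, the output here is a continuous numeric value.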
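The similarity matrix can be sketched with cosine similarity over invented expert profiles, with term counts standing in for the “footprints” experts leave in reports and publications. Profile names and terms are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Invented profiles: term counts drawn from each expert's publications.
profiles = {
    "ana":  {"oncology": 4, "genomics": 2},
    "ben":  {"oncology": 3, "genomics": 1, "imaging": 1},
    "carl": {"imaging": 5, "optics": 2},
}

# Pairwise matrix: the heavy pre-computation the other algorithms build on.
names = sorted(profiles)
matrix = {(a, b): cosine(profiles[a], profiles[b])
          for a in names for b in names}
```

Computing all pairs is what makes this step heavy: the matrix grows quadratically with the number of samples, which is why it is usually precomputed rather than evaluated at query time.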
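Finally, a content-based recommender can be sketched by matching a user’s interest profile against document attributes, again with invented data and attribute names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse attribute vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Invented data: documents described by weighted attributes, and a user
# profile built from the attributes of documents they have read.
docs = {
    "doc_a": {"machine-learning": 3, "search": 1},
    "doc_b": {"finance": 2, "regulation": 2},
    "doc_c": {"machine-learning": 1, "nlp": 2},
}
user_interests = {"machine-learning": 2, "nlp": 1}

def recommend(user, docs, n=2):
    """Rank documents by how well their attributes match the user profile."""
    return sorted(docs, key=lambda d: cosine(user, docs[d]), reverse=True)[:n]

print(recommend(user_interests, docs))  # → ['doc_a', 'doc_c']
```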
All the algorithms above need to be executed in a fast and scalable computing environment to deliver the most precise results. Currently, the Spark distributed computing platform offers the most powerful capabilities for executing machine learning algorithms efficiently: it is designed to scale from a single server to thousands of machines, and it runs much faster than classic Hadoop MapReduce.
Our recent contribution in the KM World Whitepaper “Best Practices in Cognitive Computing” highlights concrete use cases, describing how cognitive information systems are capable of extracting relevant information from big and diverse data sets for users in their work context. Get your copy here.