Plato once said that a good decision is based on knowledge and not on numbers. One of the tasks of data science is to turn information into knowledge. Above all, data science is a young discipline that builds knowledge and understanding from structured and unstructured data using scientific methods, algorithms, procedures, and systems. Its goal is to deploy data-driven knowledge and actionable insights across a wide variety of application fields. Data mining, machine learning, and big data are all closely related to data science.
It all began in 1962, with the publication of John W. Tukey's paper "The Future of Data Analysis." He argued that data analysis should be recognised as a discipline in its own right, distinct from mathematical statistics, and data analysis has since evolved into an empirical discipline. In 1996, as part of the International Federation of Classification Societies, the phrase "data science" was used in a conference title for the very first time. Building on this, the Committee on Data for Science and Technology (CODATA) established the Data Science Journal in 2002. It is no surprise that, with the increased digital presence of the world's population, Google's Chief Economist said, "I keep saying the sexy job in the next ten years will be statisticians" (2009). Data-driven decision making is already well established in the economy, with adoption at roughly 50% globally and the highest uptake in the banking sector.
"I keep saying the sexy job in the next ten years will be statisticians." Hal Ronald Varian, Chief Economist at Google
This, however, does not mean that the value of data science is widely understood. Because data science combines programming skills, knowledge of mathematics and statistics, and often expert knowledge of a specific subject, it is extremely difficult to find an expert who is knowledgeable in all of these fields. While organisations continue to leverage machine learning and data science, around 65% of executives do not understand how the algorithms and predictions are made within their organisations (Wiggers, 2021). This suggests that the majority are still unclear about the areas in which data science can support a company's competitive advantage. That financial services use data science the most may be connected to the sector's greater need to safeguard against the threat of money laundering and to manage the risks associated with the volatility of financial markets.
One of the most important aspects of text analytics is named entity extraction. The purpose is to identify proper names in text and classify them into a set of predefined categories of interest, such as organizations, people, or places. This alone, however, does not reveal the connection between two entities that happen to appear in the same sentence. To let the system understand the interdependencies between the entities involved, we can turn to relationship extraction.
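A minimal sketch of entity extraction can be built with a small gazetteer (dictionary lookup); the names and categories below are illustrative assumptions, and production systems instead use trained statistical models such as those in spaCy or Stanford NER:

```python
import re

# Tiny illustrative gazetteer; real NER systems use trained statistical models.
GAZETTEER = {
    "Joe Biden": "PERSON",
    "Jill Biden": "PERSON",
    "Google": "ORGANIZATION",
    "Washington": "LOCATION",
}

def extract_entities(text):
    """Return (entity, category, offset) triples found in `text` by lookup."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((name, label, match.start()))
    return sorted(found, key=lambda t: t[2])

sentence = "Joe Biden married Jill Biden in 1977."
print(extract_entities(sentence))
```

Note that the output only lists the entities and their categories; nothing here says how the two people are related, which is the gap relationship extraction fills.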
Relationship extraction is critical for obtaining structured data from unstructured sources like raw text. To develop a searchable knowledge base, one might seek to extract relationships between people, interpret scenes in photographs, or uncover drug interactions to build a medical database. To clarify the problem, let's assume that we need to understand that "Jill Biden" is the wife of "Joe Biden" from a raw text that mentions "Joe Biden married Jill Biden in 1977". A simple method would be to look for phrases like "married" or "XXX's spouse" in news articles. Although this might produce some results, human language is inherently ambiguous, and it is impossible to enumerate all of the expressions that suggest a married relationship. Extracting relationships using machine learning techniques is the natural next step. If we had labelled training data, such as examples of couples in a married relationship, we could train a machine learning classifier to learn the patterns for us automatically.
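The brittleness of the pattern-based approach is easy to demonstrate. The sketch below, with a hand-written regular expression of my own (not from any cited system), catches the literal "X married Y" phrasing but misses an ordinary paraphrase, which is exactly what motivates the learned classifier:

```python
import re

# Hand-written pattern for a single relation type ("spouse_of").
# Capitalised word sequences stand in for person names.
SPOUSE_PATTERN = re.compile(
    r"(?P<e1>[A-Z][a-z]+(?: [A-Z][a-z]+)*) married "
    r"(?P<e2>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_spouse_relation(sentence):
    """Return a (subject, relation, object) triple if the pattern matches."""
    m = SPOUSE_PATTERN.search(sentence)
    if m:
        return (m.group("e1"), "spouse_of", m.group("e2"))
    return None

print(extract_spouse_relation("Joe Biden married Jill Biden in 1977."))
# A paraphrase of the same fact that the pattern silently misses:
print(extract_spouse_relation("Jill Biden is the wife of Joe Biden."))
```

The second sentence expresses the same relation but yields nothing, so every paraphrase would need its own pattern; a classifier trained on labelled examples generalises over such variation instead.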
Fundamental analysis is still considered a key way to analyse and predict a company's performance. Financial statements such as the income statement, the balance sheet, and the statement of cash flows feed complex valuation models used to arrive at an intrinsic value. This value later forms the basis for investment decisions, on the assumption that market prices move randomly and that when the market values an asset below its intrinsic value, this is a "buy" signal for the investor. Whereas financial analysts build their knowledge on numerical information, data science can provide a drastically different approach.
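As a rough illustration of the numerical side of fundamental analysis, here is a minimal discounted-cash-flow sketch; the cash flows, discount rate, and growth rate are invented figures, and real valuation models are far more elaborate:

```python
def intrinsic_value(cash_flows, discount_rate, terminal_growth):
    """Discount projected free cash flows plus a Gordon-growth terminal value."""
    pv = sum(cf / (1 + discount_rate) ** t
             for t, cf in enumerate(cash_flows, start=1))
    terminal = cash_flows[-1] * (1 + terminal_growth) / (discount_rate - terminal_growth)
    pv += terminal / (1 + discount_rate) ** len(cash_flows)
    return pv

# Illustrative figures only (in millions): five years of projected free cash flow.
value = intrinsic_value([100, 110, 120, 130, 140],
                        discount_rate=0.10, terminal_growth=0.02)
# Under the stated assumption, a market price below `value` is a "buy" signal.
print(round(value, 1))
```

Everything entering this calculation is numeric, which is the point of the paragraph above: the textual parts of the reports carry information that models of this kind never see.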
Investors have access to a wide range of data on companies' financial performance in the form of online records. While automatic analysis of financial figures is popular, extracting meaning from the textual sections of financial reports has proven more difficult. The textual segment of an annual report goes deeper than the financial ratios. Kloptchenko et al. (2004) combined data and text mining techniques on the qualitative content of financial reports to verify whether the textual parts contain hints about future financial performance.
"The textual part of an annual report contains richer information than the financial ratios." Kloptchenko et al.(2004)
Self-organizing maps were used for the quantitative analysis, and prototype-matching text clustering for the qualitative one. The investigation was based on the quarterly reports of three major telecommunications firms. Comparing the qualitative and quantitative results, the authors found that analysing the text of the reports can anticipate certain future changes in financial performance, which reveal themselves through changes in the writing style of the report.
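The idea behind prototype matching can be sketched very simply: represent each report as a bag-of-words vector and assign it to the most similar prototype text. The prototypes and the report below are invented for illustration and are not taken from the Kloptchenko et al. study:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_to_prototype(report, prototypes):
    """Assign a report to the closest prototype cluster by cosine similarity."""
    vec = tf_vector(report)
    return max(prototypes, key=lambda name: cosine(vec, tf_vector(prototypes[name])))

# Hypothetical prototype texts for two tones of reporting style.
prototypes = {
    "optimistic": "strong growth record revenue expanding margins outlook positive",
    "cautious": "decline restructuring uncertainty weak demand cost pressure",
}
report = "Revenue growth was strong and the outlook remains positive."
print(assign_to_prototype(report, prototypes))
```

A shift of a company's reports from the "optimistic" cluster toward the "cautious" one between quarters would be the kind of stylistic change the study links to future performance.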
While relationship extraction has traditionally required rather extensive human involvement to establish extraction rules, there are already approaches (Etzioni et al., 2008) that automate the process and aim to discover all possible relations in the text. This study presents Open Information Extraction (Open IE) from the Web, an unsupervised extraction paradigm that forgoes relation-specific extraction in favour of a single corpus-wide extraction pass during which relations of interest are automatically discovered and efficiently retained. The authors also propose TextRunner, a fully implemented Open IE system that achieves higher precision than the well-established KnowItAll system, authored and developed by the University of Washington's Turing Center.
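The key difference from relation-specific extraction can be sketched in a few lines: instead of matching one predefined relation, the extractor keeps whatever verb-like token links two name-like arguments, so the set of relations emerges from the text itself. This is a toy heuristic of my own, not the actual TextRunner pipeline, which uses a self-trained classifier over parsed sentences:

```python
import re

# Open-style triples: the middle verb itself becomes the relation label,
# so no list of target relations has to be specified in advance.
TRIPLE = re.compile(
    r"(?P<arg1>[A-Z][a-z]+(?: [A-Z][a-z]+)*)\s+"
    r"(?P<rel>[a-z]+)\s+"
    r"(?P<arg2>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def open_extract(text):
    """Return (arg1, relation, arg2) triples for any lowercase middle token."""
    return [(m.group("arg1"), m.group("rel"), m.group("arg2"))
            for m in TRIPLE.finditer(text)]

print(open_extract("Joe Biden married Jill Biden. Microsoft acquired Skype."))
```

One pass over the corpus yields both a marriage and an acquisition triple, even though neither relation was named beforehand, which is the essence of the Open IE paradigm.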
Almost all corporate events, projects, results, procedures, and decisions are documented through written communication, regulatory filings, or business and financial news. When we add all forms of individuals' digital presence on social media, we end up with a vast amount of more or less unstructured data. Making sense of this data is a challenging task for computer and data scientists. We have to be aware that whatever comes out of state-of-the-art systems largely depends on what we feed into them, colloquially known as "garbage in, garbage out". Still, a degree of arbitrary decision-making is involved when processing "Big Data". With increasingly complex and automated systems, users may experience a growing lack of control, and unexpected outcomes can lead to erroneous conclusions. Research shows that the unstructured data investigated is typically limited to a single data type such as text, image, audio, or video (Adnan & Akbar, 2019). At this point there seems to be no indication that a complex, holistic IE system exists that addresses multifaceted unstructured big data sets; data science is merely taking the first step towards a full-fledged context-aware system. Despite this, thanks to data science we have discovered fresh, uncharted territories that help us better understand the very nature of the dynamics of financial markets.