Essential Math for Data Science

Math is like an octopus: it has tentacles that can reach out and touch just about every subject. And while some subjects only get a light brush, others get wrapped up like a clam in the tentacles’ vice-like grip. Data science falls into the latter category. If you want to do data science, you’re going to have to deal with math. Mathematics is the bedrock of any contemporary discipline of science. Almost all the techniques of modern data science, including machine learning, have a deep mathematical underpinning.
If you’ve completed a math degree, or some other degree with an emphasis on quantitative skills, you’re probably wondering whether everything you learned to get your degree was really necessary. I certainly wondered the same thing. In this post, we’re going to explore what it means to do data science and talk about just how much math you need to know to get started.
Knowledge of this essential math is particularly important for newcomers arriving in data science from other professions: hardware engineering, retail, the chemical process industry, medicine and health care, business management, etc. Although such fields may require experience with spreadsheets, numerical calculations, and projections, the math skills required in data science can be significantly different. Here is a list of some of the topics that are important for excelling in the field of data science.

1. Linear Algebra: Linear algebra is the branch of mathematics concerning linear equations such as a1x1 + a2x2 + ... + anxn = b, and their representations through matrices and vector spaces.

Linear algebra is central to almost all areas of mathematics. For instance, linear algebra is fundamental in modern presentations of geometry, including for defining basic objects such as lines, planes, and rotations. Also, functional analysis may basically be viewed as the application of linear algebra to spaces of functions. Linear algebra is also used in most sciences and engineering areas, because it allows modelling many natural phenomena, and computing efficiently with such models. For nonlinear systems, which cannot be modelled with linear algebra, it is often used as a first-order approximation.

Where You Might Use It
If you have used the dimensionality reduction technique principal component analysis (PCA), then you have likely used the singular value decomposition (SVD) to achieve a compact, low-dimensional representation of your data set with fewer parameters. All neural network algorithms use linear algebra techniques to represent and process network structures and learning operations.
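As a sketch of that connection (assuming NumPy is available; the data matrix below is made up purely for illustration), centring the data and taking its SVD gives exactly the projection PCA uses:

```python
import numpy as np

# Toy data: 6 samples, 3 features (illustrative values, not a real data set)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.1],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.6],
    [3.1, 3.0, 0.2],
    [2.3, 2.7, 0.5],
])

# Centre the data, then take the singular value decomposition
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the top k right singular vectors: a k-dimensional
# representation that keeps the directions of largest variance
k = 2
X_reduced = Xc @ Vt[:k].T   # shape (6, 2)

print(X_reduced.shape)
```

The rows of `Vt` are the principal directions; keeping only the first `k` of them is what makes the representation compact.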

2. Statistics: Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation.


In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model process to be studied. Populations can be diverse topics such as “all people living in a country” or “every atom composing a crystal”. Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments. See the glossary of probability and statistics.

Where You Might Use It
You can use statistics in:

  • Data summaries and descriptive statistics, central tendency, variance, covariance, correlation
  • Basic probability: basic idea, expectation, probability calculus, Bayes’ theorem, conditional probability
  • Probability distribution functions: uniform, normal, binomial, chi-square, Student’s t-distribution, central limit theorem
  • Sampling, measurement, error, random number generation
  • Hypothesis testing, A/B testing, confidence intervals, p-values
  • ANOVA, t-test
  • Linear regression, regularisation, etc.

3. Calculus: Calculus (from Latin calculus, literally ‘small pebble’, used for counting and calculations, as on an abacus) is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalisations of arithmetic operations.

It has two major branches, differential calculus (concerning instantaneous rates of change and slopes of curves), and integral calculus (concerning the accumulation of quantities and the areas under and between curves). These two branches are related to each other by the fundamental theorem of calculus. Both branches make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.
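The fundamental theorem is easy to check numerically. The sketch below (plain Python; f(x) = 2x is chosen purely for illustration) approximates the accumulated area under f with a Riemann sum, then differentiates that area function and recovers f itself:

```python
# Numerical illustration of the fundamental theorem of calculus:
# the derivative of the accumulated area under f recovers f itself.
def f(x):
    return 2 * x            # f(x) = 2x, so the exact area from 0 to t is t**2

def integral(t, n=100000):
    """Left Riemann sum of f on [0, t]."""
    h = t / n
    return sum(f(i * h) for i in range(n)) * h

def derivative(g, t, h=1e-5):
    """Central finite difference of g at t."""
    return (g(t + h) - g(t - h)) / (2 * h)

t = 3.0
print(integral(t))              # close to 9.0, the exact area t**2
print(derivative(integral, t))  # close to 6.0, which is f(3)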
Here is a cheat sheet for calculus in Machine Learning.

Where You Might Use It

Ever wondered how exactly a logistic regression algorithm is implemented? There is a high chance it uses a method called “gradient descent” to find the minimum of the loss function. To understand how this works, you need to use concepts from calculus: gradients, derivatives, limits, and the chain rule.
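A minimal sketch of that idea, in plain Python with a made-up one-dimensional data set: each step moves the weights a little way down the gradient of the log-loss, which is computed via the chain rule through the sigmoid function.

```python
import math

# Gradient descent for logistic regression on a tiny 1-D toy data set.
xs = [0.5, 1.5, 2.0, 3.0, 3.5, 4.5]
ys = [0,   0,   0,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the average log-loss with respect to w and b
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * gw
    b -= lr * gb

# The fitted model separates the two classes (boundary near x = 2.5)
print(sigmoid(w * 1.0 + b))  # well below 0.5: class 0 side
print(sigmoid(w * 4.0 + b))  # well above 0.5: class 1 side
```

The key line is the gradient itself: the factor `(sigmoid(w*x + b) - y)` is exactly what the chain rule produces when you differentiate the log-loss through the sigmoid.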

4. Discrete Math: Discrete mathematics is the study of mathematical structures that are fundamentally discrete rather than continuous. In contrast to real numbers that have the property of varying “smoothly”, the objects studied in discrete mathematics – such as integers, graphs, and statements in logic – do not vary smoothly in this way, but have distinct, separated values. Discrete mathematics, therefore, excludes topics in “continuous mathematics” such as calculus or Euclidean geometry. Discrete objects can often be enumerated by integers. More formally, discrete mathematics has been characterized as the branch of mathematics dealing with countable sets (finite sets or sets with the same cardinality as the natural numbers). However, there is no exact definition of the term “discrete mathematics.” Indeed, discrete mathematics is described less by what is included than by what is excluded: continuously varying quantities and related notions.

Where You Might Use It

In any social network analysis, you need to know the properties of graphs and fast algorithms to search and traverse the network. For any choice of algorithm, you also need to understand its time and space complexity, i.e., how the running time and space requirements grow with the input data size, expressed in Big-O notation such as O(n).
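As a sketch (the tiny friendship graph below is invented for illustration), breadth-first search is one such traversal; it visits each vertex and edge once, so it runs in O(V + E) time:

```python
from collections import deque

# Breadth-first search over a small "social network" stored as an
# adjacency list: number of hops from one person to every other.
friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave":  ["bob", "carol", "erin"],
    "erin":  ["dave"],
}

def bfs_distances(graph, start):
    """Hop count from `start` to every reachable person."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist

print(bfs_distances(friends, "alice"))
# {'alice': 0, 'bob': 1, 'carol': 1, 'dave': 2, 'erin': 3}
```

Each person enters the queue once and each friendship is examined once, which is where the O(V + E) bound comes from.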

5. Optimization and Operations Research Topics: These topics are most relevant in specialized fields like theoretical computer science, control theory, or operations research.


But a basic understanding of these powerful techniques can also be fruitful in the practice of machine learning. Virtually every machine-learning algorithm aims to minimize some kind of estimation error subject to various constraints—which is an optimization problem.

Where You Might Use It

Simple linear regression problems using the least-squares loss function often have an exact analytical solution, but logistic regression problems don’t. To understand the reason, you need to be familiar with the concept of “convexity” in optimization. This line of investigation also illuminates why we must remain satisfied with “approximate” solutions in most machine-learning problems.
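To see what an “exact analytical solution” looks like in the linear case, here is a sketch (assuming NumPy; the noiseless line below is made up for illustration) that recovers the regression coefficients in one step from the normal equations, with no iteration required:

```python
import numpy as np

# Minimising the squared loss for linear regression leads to the
# normal equations  (X^T X) beta = X^T y,  solved directly below.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                            # noiseless line y = 2x + 1

X = np.column_stack([np.ones_like(x), x])    # add an intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(beta)  # intercept 1, slope 2
```

No such closed form exists for the logistic loss; that is why logistic regression falls back on iterative methods like gradient descent.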

6. Functions, Variables, Equations, and Graphs:
This area of math covers the basics, from the equation of a line to the binomial theorem and its properties:


  • Logarithm, exponential, polynomial functions, rational numbers
  • Basic geometry and theorems, trigonometric identities
  • Real and complex numbers, basic properties
  • Series, sums, inequalities
  • Graphing and plotting, Cartesian and polar coordinates, conic sections

Where You Might Use It

If you want to understand how a search runs faster on a million-item database after you’ve sorted it, you will come across the concept of “binary search.” To understand its dynamics, you need to understand logarithms and recurrence relations. Or, if you want to analyze a time series, you may come across concepts like “periodic functions” and “exponential decay.”
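A quick sketch in plain Python makes the logarithm visible: on a sorted list of a million items, binary search needs at most about log2(1,000,000) ≈ 20 comparisons, because it halves the remaining range at every step.

```python
# Binary search on a sorted list, counting how many comparisons it makes.
def binary_search(items, target):
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

data = list(range(1_000_000))            # already sorted
index, steps = binary_search(data, 765_432)
print(index, steps)                      # found at 765432 in at most 20 steps
```

A linear scan of the same list could take up to a million comparisons; the gap between n and log2(n) is exactly why sorting first pays off.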
