Data Science & Data Engineering Glossary
Up your data literacy and learn more about the data science and data engineering industry terms you’ll see most often in the field with this comprehensive glossary. You can download a full pdf of this glossary here.
A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.
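One common way to check whether an A/B difference is statistically significant is a two-proportion z-test. The sketch below is illustrative only: the conversion counts are hypothetical, the function name is invented here, and a pooled-variance z-test is just one of several tests that could be used.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled success rate under the null hypothesis that A and B perform equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical data: incumbent A converts 200 of 1000, rival B converts 250 of 1000.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
significant = p_value < 0.05  # 0.05 is a conventional, not mandatory, threshold
```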
In classification, accuracy is defined as the number of observations that are correctly labeled by the algorithm as a fraction of the total number of observations the algorithm attempted to label. Colloquially, it is the fraction of times the algorithm guessed “right.”
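As a minimal sketch of the accuracy definition, the function below counts matching labels and divides by the number of observations. The labels are made up for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of observations whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels, invented for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
score = accuracy(y_true, y_pred)  # 4 of the 5 guesses are right
```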
A series of repeatable steps for carrying out a certain type of task with data.
Anomaly detection, also known as outlier detection, is the identification of rare items, events, observations, or patterns which raise suspicions by differing significantly from the majority of the data.
API is an acronym for Application Programming Interface, a software intermediary that ensures a connection between applications or computers.
Artificial Intelligence (AI)
The ability to have machines act with apparent intelligence, although varying definitions of “intelligence” lead to a range of meanings for the artificial variety.
Artificial Neural Network (ANN)
An Artificial Neural Network is a machine learning model that is loosely inspired by biological neural networks in human brains.
Periodic evaluation of a trained machine learning algorithm to check whether the predictions of the algorithm have degraded over time. Backtesting is a critical component of model maintenance.
A model or heuristic used as a reference point for comparing how well a machine learning model is performing. A baseline helps model developers quantify the minimal, expected performance on a particular problem. Generally, baselines are set to simulate the performance of a model that doesn’t actually make use of our data to make predictions. This is called a naive benchmark.
A set of observations that are fed into a machine learning model to train it. Batch training is a counterpart to online learning, in which data are fed sequentially instead of all at once.
Also, Bayes’ Rule. An equation for calculating the probability that something is true if something potentially related to it is true.
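Bayes' Rule can be written as P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it to a hypothetical screening test; the prevalence and error rates are invented for illustration and the function name is not from any library.

```python
def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) computed with Bayes' Rule."""
    # P(positive) summed over both cases: condition present and condition absent.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% true positive rate, 5% false positive rate.
posterior = bayes_posterior(prior=0.01,
                            true_positive_rate=0.95,
                            false_positive_rate=0.05)
```

With these assumed numbers the posterior is about 0.16, so most positive results would still be false alarms because the condition is rare.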
Bias is a source of error that emerges from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and labels. Bias can be mitigated by adding additional features to the data or using a more flexible algorithm. See also variance, cross-validation.
Big Data refers to the ability to work with collections of data that had been impractical before because of their volume, velocity, and variety.
A distribution of outcomes of independent events with two mutually exclusive possible outcomes, a fixed number of trials, and a constant probability of success.
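For a binomial distribution with n trials and success probability p, the probability of exactly k successes is C(n, k) p^k (1 - p)^(n - k). A small sketch, using a fair coin as an assumed example:

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Assumed example: 10 flips of a fair coin.
prob_five_heads = binomial_pmf(5, 10, 0.5)
# The probabilities of all possible outcomes (0 through 10 heads) sum to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```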
Classification is one of the two major types of supervised learning models in which the labels we train the algorithm to predict are distinct categories. Usually these categories are binary (yes/no, innocent/guilty, 0/1) but classification algorithms can typically be extended to handle multiple classes (peach, plum, pear) or, in a more limited set of cases, multiple labels (an object can belong to more than one category).
A computing paradigm in which the storage and processing of data or the hosting of computing services such as databases or websites takes place on a remote system comprised of multiple individual computing units acting as one and typically owned by a cloud computing service provider.
An unsupervised learning technique that identifies group structures in data. Clusters are groups of observations that are similar to other observations in the same cluster and different from those belonging to different clusters. Clustering algorithms only consider the relationships between features in the data mathematically and not conceptually; as such, the clusters identified by these algorithms may not reflect any grouping structure that would be sensible to a human being.
A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (Ex.: x in x(y + z), 6 in 6ab).
Also, natural language processing, NLP. A branch of computer science for parsing text of spoken languages to convert it to structured data that you can use to drive program logic. Early efforts focused on translating one language to another or accepting complete sentences as queries to databases; modern efforts often analyze documents and other data to extract potentially valuable information.
A range specified around an estimate to indicate margin of error, combined with a probability that a value will fall in that range. The field of statistics offers specific mathematical formulas to calculate confidence intervals.
A variable whose value can be any of an infinite number of values, typically within a particular range.
The degree of relative correspondence, as between two sets of data.
A measure of the relationship between two variables whose values are observed at the same time; specifically, the average value of the product of the two variables diminished by the product of their average values.
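Covariance can be computed as the mean of the products minus the product of the means, cov(X, Y) = E[XY] - E[X]E[Y]. The sketch below uses that population form (dividing by n, not n - 1) on made-up data.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products minus product of the means."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_xy - mean_x * mean_y

# Hypothetical paired observations; y rises with x, so the covariance is positive.
cov = covariance([1, 2, 3, 4], [2, 4, 6, 8])
```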
The name given to a set of techniques that split data into training sets and test sets when using data with an algorithm. The training set is given to the algorithm, along with the correct answers (labels), and becomes the set used to make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated. Cross-validation repeats this splitting procedure several times and computes an average score based on the scores from each split.
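The repeated split-and-score procedure of cross-validation can be sketched as k-fold cross-validation. To stay self-contained, the "model" here is a deliberately naive majority-class guesser and the labels are invented. A real workflow would usually shuffle the data first and use a proper learning algorithm.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs that split range(n) into k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def majority_class(labels):
    """Stand-in 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labels, invented for illustration.
labels = ["a", "a", "b", "a", "b", "a", "a", "b", "a", "a"]
scores = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    guess = majority_class([labels[i] for i in train_idx])  # "train"
    hits = sum(labels[i] == guess for i in test_idx)        # "test"
    scores.append(hits / len(test_idx))

mean_score = sum(scores) / len(scores)  # average score across the 5 splits
```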
The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data and provide more consistency.
A database is a structured storage space where the data is organized in many different tables in a way such that the necessary information can be easily accessed and summarized. Databases are mostly used with a relational database management system (RDBMS) such as Oracle or PostgreSQL. The most common programming language used to interact with the data from a database is SQL.
Database Management System (DBMS)
A database management system is a software package used to easily perform different operations on the data: accessing, manipulating, retrieving, managing, and storing the data in a database. Based on the way the data is organized and structured, there are different types of DBMS: relational, graph, hierarchical, etc. Some examples of DBMS: Oracle, MySQL, PostgreSQL, Microsoft SQL Server, MongoDB.
A set of information describing the contents, format, and structure of a database and the relationship between its elements, used to control access to and manipulation of the database.
A data engineer is a specialist responsible for getting the right data into the hands of data scientists and data analysts. They design and maintain the storage infrastructure and data pipelines that move large amounts of raw data from various sources into one centralized location, where it is stored as clean, correctly formatted data that is relevant to the organization.
The data that a person creates as a byproduct of a common activity—for example, a cell call log or web search history.
A means for a person to receive a stream of data. Examples of data feed mechanisms include RSS or Twitter.
The planning, oversight, and control over management of data and data-related sources. Data governance sets the roles, responsibilities, and processes for ensuring data availability, relevance, quality, usability, integrity, and security. Data governance includes a governing body, a framework of rules and practices to meet the company’s information needs, and a program to perform these practices.
A data lake is a single storage repository containing a large amount of raw, unprocessed data of any kind from various sources, which as yet has no defined purpose. A data lake includes both structured data of different structures without any relationship between one another and, more often, unstructured data, such as documents and text files. The raw data is kept as an original source of information, without the necessity to structure and wrangle it unless the data is needed.
Data literacy is the ability to read, write, analyze, communicate, and reason with data to make better data-driven decisions.
The access layer of a data warehouse used to provide data to users.
The process of moving data between different storage types or formats, or between different computer systems.
Data Model, Data Modeling
An agreed upon data structure. This structure is used to pass data from one individual, group, or organization to another, so that all parties know what the different data components mean. Often meant for both technical and non-technical users.
The process of collecting statistics and information about data in an existing source.
The measure of data to determine its worthiness for decision making, planning or operations.
The process of sharing information to ensure consistency between redundant sources.
Data Scientists investigate, extract, and report meaningful insights in the organization’s data. They communicate these insights to non-technical stakeholders, and have a good understanding of machine learning workflows and how to tie them back to business applications. They work almost exclusively with coding tools, conduct analysis, and often work with big data tools.
A dataset is a collection of data of one or many types representing real-life or synthetically generated observations, and used for statistical analysis or data modeling.
A person responsible for data stored in a data field.
A specific way of storing and organizing data.
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.
A place to store data for the purpose of reporting and analysis.
The process of transforming and cleaning data from raw formats to appropriate formats for later use. Also called data munging.
A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.
A multilevel algorithm that gradually identifies things at higher levels of abstraction. For example, the first level identifies certain lines. The second identifies combinations of lines as shapes. Then the third identifies combinations of shapes as specific objects. Deep learning is popular for image classification. See also neural network.
The value of a dependent variable “depends” on the value of the independent variable.
Errors at Random
Errors-at-random are data errors such as missing or mismeasured data that are random with respect to the data we observe. Errors are not-at-random if the probability that an observation is missing or erroneous is correlated with the observed data. Errors-not-at-random are especially problematic if errors are correlated with labels.
ETL is short for extract, transform, load, three database functions that are combined into one tool to pull data from a primary source and place it into a database.
An expert system is a computer system that emulates the decision-making ability of a human expert. Expert systems are designed to solve complex problems processing data describing the context of the decision being made and applying logic, mainly in the form of if-then rules.
“General Architecture for Text Engineering,” an open source, Java-based framework for natural language processing tasks.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
Hadoop is a collection of software that facilitates using a network of many computers to solve problems involving large amounts of data and computation. It consists of two main functional components. One, the Hadoop Distributed File System (HDFS), is a utility that allows data to be stored over multiple networked machines in a failure-tolerant manner while still being treated as a single file from the perspective of the user. The other, Hadoop MapReduce, is a programming paradigm that allows the user to process and analyze this data in parallel over large numbers of individual processing units located across multiple machines.
Hyperparameters are attributes pertaining to a machine learning model whose value is set manually before starting the training process. Unlike the other parameters, hyperparameters cannot be estimated or learned directly from the data.
Hive is a data warehouse software project built on top of Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Imputation is the process of filling in missing values in a dataset. Imputation techniques can be either statistical (mean/mode imputation) or machine learning techniques (KNN imputation).
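Mean imputation, the simplest of the statistical techniques for imputation, can be sketched as follows. Missing values are represented as `None`, and both that convention and the sample column are assumptions made for illustration.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical column with two missing values.
ages = [30, None, 40, 50, None]
filled = impute_mean(ages)  # the mean of 30, 40, and 50 fills both gaps
```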
Internet of things (IoT)
The Internet of things (IoT) is the extension of internet connectivity into physical devices and everyday objects. Embedded with electronics, internet connectivity, artificial intelligence, and other forms of hardware, these devices can communicate and interact with others over the Internet, and they can be remotely monitored and controlled.
Java is a general-purpose, object-oriented, compiled programming language. While it is not among the most common languages used by data scientists, it and its close relative Scala are the native languages of many distributed computing frameworks such as Hadoop and Spark.
A scripting language (no relation to Java) originally designed for embedding logic in web pages, but which later evolved into a more general-purpose development language.
K-Means is a popular clustering algorithm that identifies K cluster centers (called centroids), starting from tentative coordinates in the data, and iteratively assigns each observation to its nearest centroid based on its features, recomputing the centroids until they converge. Data points are similar inside a cluster and different from the data points in the other clusters.
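The assign-then-update loop of K-Means can be sketched in a few lines. This is a minimal illustration, not a production implementation: the starting centroids are passed in by hand (real libraries typically choose them with a randomized scheme), and the points are made up.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """Iterate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visually obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
```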
K-Nearest Neighbors (kNN)
K-nearest neighbors is a supervised learning algorithm that classifies observations based on their similarity to their nearest neighbors. The most important parameters of KNN that can be tuned are the number of nearest neighbors and the distance metric (Minkowski, Euclidean, Manhattan, etc.).
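A minimal K-nearest neighbors classifier might look like the sketch below. It assumes Euclidean distance and a simple majority vote, and the training points and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["small", "small", "small", "large", "large", "large"]

near_origin = knn_predict(train_points, train_labels, query=(2, 2))
far_away = knn_predict(train_points, train_labels, query=(7, 8))
```

Changing `k` or swapping `math.dist` for another distance function corresponds to tuning the two parameters named in the definition.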
In supervised learning applications, labels are the components of the data that indicate the desired predictions or decisions we would like the machine learning algorithm to make for each observation we pass into the algorithm. Supervised learning algorithms learn to use other features in the data to predict labels so that these algorithms can learn to predict labels in other instances when the labels are not known or determined. In certain fields, labels are called targets. See also supervised learning, classification, regression.
Leakage is the introduction of information during training that will not be germane or available to the deployed algorithm.
Length measures the number of observations in our dataset.
A technique to look for a linear relationship by starting with a set of data points that don’t necessarily line up nicely. This is accomplished by computing the “least squares” line: on an x-y graph, the line that has the smallest possible sum of squared distances to the actual data point y values. Statistical software packages and typical spreadsheet packages offer automated ways to calculate this.
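The "least squares" line has a closed-form solution for a single input variable: the slope is the sum of products of the x and y deviations from their means, divided by the sum of squared x deviations. A sketch on made-up points that do not line up exactly:

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data points, invented for illustration.
slope, intercept = least_squares_line([1, 2, 3, 4], [2, 3, 5, 6])
```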
The relationship between two varying amounts, such as price and sales, that can be expressed with an equation that can be represented as a straight line on a graph.
The use of data-driven algorithms that perform better as they have more data to work with, redefining their models or “learning” from this additional data. This involves cross-validation with training and test data sets. Studying the practical application of machine learning usually means researching which machine learning algorithms are best for which situations.
Machine Learning Model
The model artifact that is created in the process of providing a machine learning algorithm with training data from which to learn.
MapReduce is a programming model and implementation designed to work with big data sets in parallel on a distributed cluster system. MapReduce programs consist of two steps. First, a map step takes chunks of data and processes it in some way (e.g. parsing text into words). Second, a reduce step takes the data that are generated by the map step and performs some kind of summary calculation (e.g. counting word occurrences). In between the map and reduce step, data move between machines using a key-value pair system that guarantees that each reducer has the information it needs to complete its calculation (e.g. all of the occurrences of the word “Python” get routed to a single processor so they can be counted in aggregate).
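The map, shuffle, and reduce stages of MapReduce can be imitated in a single process to show the flow of key-value pairs. This sketch only mimics the programming model. It does not distribute any work, and the function names and text chunks are invented for illustration.

```python
from collections import defaultdict

def map_step(chunk):
    """Map: parse a chunk of text into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group values by key, so each key ends up with one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce: summarize all the values collected for one key."""
    return key, sum(values)

# Hypothetical input, already split into chunks as a cluster would do.
chunks = ["Python is popular", "python and Scala", "Scala runs on the JVM"]
mapped = [pair for chunk in chunks for pair in map_step(chunk)]
counts = dict(reduce_step(key, values) for key, values in shuffle(mapped).items())
```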
A commercial computer language and environment popular for visualization and algorithm development.
Mean Absolute Error
Also, MAE. The average of the absolute errors of all predicted values when compared with observed values.
Mean Squared Error
Also, MSE. The average of the squares of all the errors found when comparing predicted values with observed values.
Minimum Viable Product (MVP)
The minimum viable product is the smallest complete unit of work that would be valuable in its own right, even if the rest of the project fizzled out.
The specification of mathematical or probabilistic relationships existing between different variables. Because “modeling” can mean many things, the term “statistical modeling” is often used to more accurately describe the kind of modeling that data scientists do.
A classification algorithm that predicts labels from data by assuming that the features of the data are statistically independent of each other given the label. Due to this assumption, Naive Bayes models can be easily fit on distributed systems.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a branch of data science that applies machine learning techniques to help machines learn to interpret and process textual data consisting of human language. Applications of NLP include text classification (predicting what type of content a document contains), sentiment analysis (determining whether a statement is positive, negative, or neutral), and translation. NLP also comprises techniques to encode textual content numerically to use in machine learning applications.
A machine learning method modeled after the brain. This method is extremely powerful and flexible, as it is created from an arbitrary number of artificial neurons that can be connected in various patterns appropriate to the problem at hand, and the strength of those connections is adjusted during the training process. Neural networks are able to learn extremely complex relationships between data and output, at the cost of large computational needs. They have been used to great success in processing image, movie, and text data, and in any situation with very large numbers of features.
Also, Gaussian distribution. A probability distribution which, when graphed, is a symmetrical bell curve with the mean value at the center. The standard deviation value affects the height and width of the graph.
A database management system that uses any of several alternatives to the relational, table-oriented model used by SQL databases. Originally meant as “not SQL,” it has come to mean something closer to “not only SQL” due to the specialized nature of NoSQL database management systems. These systems often are tasked with playing specific roles in a larger system that may also include SQL and additional NoSQL systems.
Online learning is a learning paradigm by which machine learning models may be trained by passing them training data sequentially or in small groups (mini-batches). This is important in instances where the amount of data on hand exceeds the capacity of the RAM of the system on which a model is being developed. Online learning also allows models to be continually updated as new data are produced.
Open source refers to free-licensed software and resources available for further modifications and sharing.
An outlier is an abnormal value in a dataset that deviates considerably from the rest of the observations.
Overfitting refers to when a model learns too much information from the training set including potential noise and outliers. As a result, it becomes too complex, too conditioned on the particular training set, and fails to adequately perform on unseen data. See variance.
An older scripting language with roots in pre-Linux UNIX systems. Perl has always been popular for text processing, especially data cleanup and enhancement tasks.
Apache Pig is a high-level platform for creating programs that run on Hadoop. Pig is designed to make it easier to create data processing and analysis workflows that can be executed in MapReduce, Spark, or other distributed frameworks.
Pivot tables quickly summarize long lists of data, without requiring you to write a single formula or copy a single cell. But the most notable feature of pivot tables is that you can arrange them dynamically.
A performance measure for classification models. Precision measures the fraction of all of the observations that a classification algorithm flagged positively that were flagged correctly. For example, if our algorithm were judging suspects, precision would measure the percentage of all the suspects declared guilty by the algorithm who actually were guilty. See also recall.
The analysis of data to predict future events, typically to aid in business planning. This incorporates predictive modeling and other techniques. Machine learning might be considered a set of algorithms to help implement predictive analytics.
The development of statistical models to predict future events.
Random Forest is a supervised learning algorithm used for regression or classification problems. It combines the outputs of many decision trees in a single model.
A performance measure for classification models. Recall measures the fraction of all of the observations that a classification algorithm should have flagged positively that were actually flagged by the algorithm.
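Precision and recall are both computed from counts of true positives, false positives, and false negatives. The sketch below uses made-up binary labels, and returning 0.0 when a denominator is zero is an assumption made here to keep the example simple.

```python
def precision_recall(true_labels, predicted_labels, positive=1):
    """Return (precision, recall) for the class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 1 = should be flagged, 0 = should not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here 3 of the 4 flagged observations are correct (precision 0.75), while only 3 of the 5 observations that should have been flagged were caught (recall 0.6).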
Regression is one of the two major types of supervised learning models in which the labels we train the algorithm to predict are ordered quantities like prices or
Reinforcement Learning (RL)
Reinforcement learning (RL) is a stand-alone branch of machine learning (neither supervised nor unsupervised) where an algorithm gradually learns by interacting with an environment.
A relational database is a type of database that stores data in several tables related to one another by means of unique IDs (keys) from which the data can be accessed, extracted, summarized, or reassembled in different ways.
Root Mean Square Error
The root mean squared error (RMSE) is the square root of the mean squared error. This evaluation metric is more intuitive than MSE because the result is expressed in the same units of measurement as the original data.
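The three error metrics in this glossary (MAE, MSE, and RMSE) can be computed side by side. The observed and predicted values below are invented for illustration.

```python
import math

def mean_absolute_error(observed, predicted):
    """Average of the absolute differences between predictions and observations."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_squared_error(observed, predicted):
    """Average of the squared differences between predictions and observations."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def root_mean_squared_error(observed, predicted):
    """Square root of the MSE, in the same units as the original data."""
    return math.sqrt(mean_squared_error(observed, predicted))

# Hypothetical observed and predicted values.
observed = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]
mae = mean_absolute_error(observed, predicted)
mse = mean_squared_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
```

Because the errors are squared before averaging, MSE and RMSE penalize the single error of 2 more heavily than MAE does.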
A scripting language that first appeared in 1995. Ruby is popular in the data science community, but not as popular as Python, which has more specialized libraries available for data science tasks.
A commercial statistical software suite that includes a programming language also known as SAS.
Scala is a Java-like programming language commonly used by data scientists. It is the native language of Spark.
The most common Python package for machine learning.
A computer’s operating system when used from the command line. Along with scripting languages such as Perl and Python, Linux-based shell tools (which are either included with or available for Mac and Windows computers) such as grep, diff, split, comm, head and tail are popular for data wrangling. A series of shell commands stored in a file that lets you execute the series by entering the file’s name is known as a shell script.
Simpson’s paradox is a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.
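A small numeric example makes the reversal in Simpson's paradox concrete. The counts below are illustrative, patterned on a commonly cited treatment-comparison example, and each tuple holds (successes, cases).

```python
# Illustrative counts: (successes, cases) per treatment within each group.
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, cases):
    return successes / cases

# Within every group, treatment A has the higher success rate...
a_wins_each_group = all(
    rate(*group["A"]) > rate(*group["B"]) for group in groups.values()
)

# ...but after combining the groups, treatment B comes out ahead.
totals = {
    treatment: tuple(sum(group[treatment][i] for group in groups.values())
                     for i in (0, 1))
    for treatment in ("A", "B")
}
b_wins_overall = rate(*totals["B"]) > rate(*totals["A"])
```

The reversal happens because the two treatments were applied to very different mixes of easy and hard cases, so the combined rates mostly reflect that mix.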
A commercial statistical software package, or predictive analytics software, popular in the social sciences. It has been available since 1968 and was acquired by IBM in 2009.
A commercial statistical software package commonly used by academics, particularly in the social sciences.
A type of machine learning algorithm in which a system learns to predict labels after being shown a set of training data and identifying statistical associations between features in the data and the labels it is given. The classic example is sorting email into spam versus ham. See also unsupervised learning, machine learning.
A commercial data visualization package often used in data science projects.
Time Series Data
A time series is a sequence of measurements of some quantity taken at different times, often but not necessarily at equally spaced intervals.
Underfitting is when a model is unable to detect the patterns in the training set because it was built on insufficient information. As a result, the model is too simple and performs poorly on both the training set and unseen data. Underfitted models have high bias.
Unstructured data is any data that does not fit a predefined data structure such as the typical row-column structure of a database. Examples of such data are images, emails, text documents, videos and audio.
A class of machine learning algorithms designed to identify (potentially) useful patterns or structures in data without being directed to perform a specific prediction or decision task.
Variance is the amount that the estimate of the target function would change if different training data were used. Another way of saying this is that variance measures the degree to which a model picks up noise as opposed to signal. High variance is associated with overfitting.
Web scraping is the process of extracting specific data from websites for further usage. Web scraping can be done automatically by writing a program to capture the necessary information from a website.
Width measures the number of features in a dataset.