# Glossary

The following glossary is collaboratively edited by CDH researchers in this Google doc. Because it’s still a work in progress, please feel free to comment in the doc to add terms, questions, or clarifications. 🚧

accuracy
"The number of correct classification predictions divided by the total number of predictions. For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of 80%." Source
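The definition above is simple arithmetic, so it can be sketched in a few lines of Python (toy labels invented for illustration, reproducing the 40-correct-out-of-50 example):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# 40 correct and 10 incorrect predictions, as in the definition above
preds = [1] * 40 + [0] * 10
truth = [1] * 50
print(accuracy(preds, truth))  # 0.8
```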

artificial intelligence
"The field of developing computers and robots that are capable of behaving in ways that both mimic and go beyond human capabilities. AI-enabled programs can analyze and contextualize data to provide information or automatically trigger actions without human interference. Today, artificial intelligence is at the heart of many technologies we use, including smart devices and voice assistants such as Siri on Apple devices." Source

benchmark
"In this paper we describe a benchmark as a particular combination of a dataset or sets of datasets (at least test data, sometimes also training data), and a metric, conceptualized as representing one or more specific tasks or sets of abilities, picked up by a community of researchers as a shared framework for the comparison of methods. The task is a particular specification of a problem, as represented in the dataset. A metric is a way to summarize system performance over some set or sets of tasks as a single number or score. The metric provides a means of counting success and failure at the level of individual system outputs and summarizing those counts over the full dataset. Models obtaining the most favourable scores on the metrics for a benchmark are considered to be "state-of-the-art" (SOTA) in terms of performance on the specified task." Source

class
"A set of instances having the same identity. For example, 'A' and 'a' belong to the same class. In machine learning, for each class we learn a discriminant from the set of its examples. Classification is the assignment of a given instance to one of a set of classes." Source

correlation
"A correlation is a relationship between two quantities such that the value of one is a signal of the value of the other. It's an example of a signal relationship. The word correlation is used in a wide variety of senses, and often simply means that there is a 'co-relation' between the two things in question. Very often, when people use the word correlated, they mean 'linear correlation', which means that the signal has a particular form. A 'positive correlation' in this sense means that if X is high (or higher than usual), then Y is high (or higher than usual), and that if X is low (or lower than usual), then Y is low (or lower than usual)." Source
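The "linear correlation" mentioned above is usually measured with Pearson's correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). A minimal sketch in plain Python, with invented toy data:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # close to 1.0: when x is high, y is high
print(pearson(x, [10, 8, 6, 4, 2]))  # close to -1.0: when x is high, y is low
```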

data science
"Billions of gigabytes of data are generated globally every day. Data science is the drive to turn this data into useful information, and to understand its powerful impact on science, society, the economy and our way of life. The study of data science brings together researchers in computer science, mathematics, statistics, machine learning, engineering and the social sciences." Source

distance
"A numerical measurement of how far apart objects or points are." Source
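The "distance" entry can be made concrete with Euclidean distance, the most common straight-line measure between points:

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points of the same dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0 (a classic 3-4-5 right triangle)
```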

decision tree
"A hierarchical model composed of decision nodes and leaves. The decision tree works fast, and it can be converted to a set of if-then rules, and as such allows knowledge extraction." Source
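The "set of if-then rules" mentioned above can be written out directly. Here is a hand-written two-level tree as a sketch; the features, thresholds, and flower labels are all invented for illustration, not learned from data:

```python
def classify(petal_length, petal_width):
    """A tiny hand-written decision tree: each `if` is a decision node,
    each `return` is a leaf."""
    if petal_length < 2.5:
        return "setosa"
    if petal_width < 1.7:
        return "versicolor"
    return "virginica"

print(classify(1.4, 0.2))  # setosa
print(classify(5.0, 2.0))  # virginica
```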

descriptive statistics
"...identifying which are the relevant variables to explore, where the outlier data points lie and what the overall distribution of data suggests. In addition to tackling these problems in research methodology, tools for working with digital humanities datasets must address the full spectrum of interpretive questions that lie at the core of traditional humanities practice." Source

distribution
"A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically." Source

exploratory data analysis (EDA)
"A term coined by the statistician John Tukey in 1977. It consists of using statistics to summarize and describe the features of a dataset and the patterns it contains. This kind of analysis — figuring out what features are present in a dataset, what subsets one might analyze — is the first step to discovering patterns in a dataset." Source

feature
"An input variable to a machine learning model." Source

K-nearest neighbors classifier
"A classifier model in which the label of a new example is predicted based on the k-closest data points in the training set." Source

machine learning
"Machine learning is one way to use AI. It was defined in the 1950s by AI pioneer Arthur Samuel as 'the field of study that gives computers the ability to learn without explicitly being programmed.' [. . . ] Machine learning takes the approach of letting computers learn to program themselves through experience. Machine learning starts with data — numbers, photos, or text, like bank transactions, pictures of people or even bakery items, repair records, time series data from sensors, or sales reports. The data is gathered and prepared to be used as training data, or the information the machine learning model will be trained on. The more data, the better the program. From there, programmers choose a machine learning model to use, supply the data, and let the computer model train itself to find patterns or make predictions. " Source

model
"A model is a mathematical representation of some larger phenomenon. Usually models take some kind of *input* (e.g., the frequencies of patterns in words) and produce an *output* (which can be a classification — a qualitative output — or a quantitative output) in order to make a 'prediction' (which can be about what category that object falls in, or how similar/dissimilar that object is in relation to other objects). A model describes some assumptions about the statistical distribution of data by predicting a range of probable outcomes given a change in a particular variable." Source

neural networks
Computational models loosely inspired by the brain: layers of simple interconnected units ("neurons"), each of which combines its inputs using learned weights and applies a nonlinear function. The weights are adjusted during training so that the network's outputs approach the desired outputs. Neural networks with many layers are the basis of deep learning.

overfitting
"Creating a model that matches the training data so closely that the model fails to make correct predictions on new data." Source

regression
"Estimating a numeric value for a given instance. For example, estimating the price of a used car given the attributes of the car is a regression problem." Source
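The simplest case is fitting a straight line by ordinary least squares. A sketch in plain Python, echoing the used-car example with invented numbers:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Toy example: price falls as mileage rises (numbers invented for illustration)
mileage = [10, 20, 30, 40]      # thousands of km
price = [9.0, 8.0, 7.0, 6.0]    # thousands of dollars
slope, intercept = fit_line(mileage, price)
print(slope, intercept)  # roughly -0.1 and 10.0: price ~= 10 - 0.1 * mileage
```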

supervised learning
"The computer is presented with example inputs and their desired outputs, given by a 'teacher,' and the goal is to learn a general rule that maps inputs to outputs. . . . We essentially tell the machine what we want to find, then fine-tune the model until we get the machine to predict what we know to be true." Source

test set
"A subset of the dataset reserved for testing a trained model. Each example in a dataset should belong to only one of the training or test sets. For instance, a single example should not belong to both the training set and the test set." Source
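The rule that each example belongs to exactly one set is what a train/test split enforces. A minimal sketch (the `test_fraction` and `seed` parameters are illustrative choices, not a standard API):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples, then reserve a fraction for testing.
    Each example lands in exactly one of the two sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))   # 8 2
print(set(train) & set(test))  # set() -- no example appears in both
```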

training data
"Known datasets for practicing and tuning the machine-learning model. Contrast with testing data and validation data." Source

unsupervised learning
"No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means toward an end (feature learning)." Source

visualization
"A presentation of a pattern within data in pictorial form (e.g., chart, graph), that does not require a narrative explanation or extensive text for the viewer to interpret." Source