Glossary
The following glossary is collaboratively edited by CDH researchers in this Google doc. Because it’s still a work in progress, please feel free to comment in the doc to add terms, questions, or clarifications. 🚧
- accuracy
- "The number of correct classification predictions divided by the total number of predictions. For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of 80%." Source
- artificial intelligence
- "The field of developing computers and robots that are capable of behaving in ways that both mimic and go beyond human capabilities. AI-enabled programs can analyze and contextualize data to provide information or automatically trigger actions without human interference. Today, artificial intelligence is at the heart of many technologies we use, including smart devices and voice assistants such as Siri on Apple devices." Source
- benchmark
- "In this paper we describe a benchmark as a particular combination of a dataset or sets of datasets (at least test data, sometimes also training data), and a metric, conceptualized as representing one or more specific tasks or sets of abilities, picked up by a community of researchers as a shared framework for the comparison of methods. The task is a particular specification of a problem, as represented in the dataset. A metric is a way to summarize system performance over some set or sets of tasks as a single number or score. The metric provides a means of counting success and failure at the level of individual system outputs and summarizing those counts over the full dataset. Models obtaining the most favourable scores on the metrics for a benchmark are considered to be "state-of-the-art" (SOTA) in terms of performance on the specified task." Source
- class
- "A set of instances having the same identity. For example, 'A' and 'a' belong to the same class. In machine learning, for each class we learn a discriminant from the set of its examples. Classification is the assignment of a given instance to one of a set of classes." Source
- correlation
- "A correlation is a relationship between two quantities such that the value of one is a signal of the value of the other. It's an example of a signal relationship. The word correlation is used in a wide variety of senses, and often simply means that there is a 'co-relation' between the two things in question. Very often, when people use the word correlated, they mean 'linear correlation', which means that the signal has a particular form. A 'positive correlation' in this sense means that if X is high (or higher than usual), then Y is high (or higher than usual), and that if X is low (or lower than usual), then Y is low (or lower than usual)." Source
- data science
- "Billions of gigabytes of data are generated globally every day. Data science is the drive to turn this data into useful information, and to understand its powerful impact on science, society, the economy and our way of life. The study of data science brings together researchers in computer science, mathematics, statistics, machine learning, engineering and the social sciences." Source
- decision tree
- "A hierarchical model composed of decision nodes and leaves. The decision tree works fast, and it can be converted to a set of if-then rules, and as such allows knowledge extraction." Source
- descriptive statistics
- "...identifying which are the relevant variables to explore, where the outlier data points lie and what the overall distribution of data suggests. In addition to tackling these problems in research methodology, tools for working with digital humanities datasets must address the full spectrum of interpretive questions that lie at the core of traditional humanities practice." Source
- distance
- "A numerical measurement of how far apart objects or points are." Source
- distribution
- "A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically." Source
- exploratory data analysis (EDA)
- "A term coined by the statistician John Tukey in 1977. It consists of using statistics to summarize and describe the features of a dataset and the patterns it contains. This kind of analysis — figuring out what features are present in a dataset, what subsets one might analyze — is the first step to discovering patterns in a dataset." Source
- feature
- "An input variable to a machine learning model." Source
- K-nearest neighbors classifier
- "A classifier model in which the label of a new example is predicted based on the k-closest data points in the training set." Source
- machine learning
- "Machine learning is one way to use AI. It was defined in the 1950s by AI pioneer Arthur Samuel as 'the field of study that gives computers the ability to learn without explicitly being programmed.' [. . . ] Machine learning takes the approach of letting computers learn to program themselves through experience. Machine learning starts with data — numbers, photos, or text, like bank transactions, pictures of people or even bakery items, repair records, time series data from sensors, or sales reports. The data is gathered and prepared to be used as training data, or the information the machine learning model will be trained on. The more data, the better the program. From there, programmers choose a machine learning model to use, supply the data, and let the computer model train itself to find patterns or make predictions. " Source
- model
- "A model is a mathematical representation of some larger phenomenon. Usually models take some kind of *input* (eg the frequencies in patterns in words) and produce an *output* (can be a classification — a qualitative output — or a quantitative output) in order to make a "prediction" (which can be about what category that object falls in, or how similar/dissimilar that object is in relation to other objects). A model describes some assumptions about the statistical distribution of data by predicting a range of probable outcomes given a change in a particular variable." Source
- neural networks
- "The field of developing computers and robots that are capable of behaving in ways that both mimic and go beyond human capabilities. AI-enabled programs can analyze and contextualize data to provide information or automatically trigger actions without human interference. Today, artificial intelligence is at the heart of many technologies we use, including smart devices and voice assistants such as Siri on Apple devices." Source
- overfitting
- "Creating a model that matches the training data so closely that the model fails to make correct predictions on new data." Source
- prediction
- "A model’s predictions are applications of the model to new, unseen inputs. While it does sometimes involve trying to predict the future (“is this patient at high risk for cancer?”), sometimes it doesn’t (“is this social media account a bot?”). Prediction can take the form of classification (determine whether a piece of email is spam), regression (assigning risk scores to defendants), or information retrieval (finding documents that best match a search query)" Source
- regression
- "Estimating a numeric value for a given instance. For exampe, estimating the price of a used car given the attributes of the car is a regression problem." Source
- supervised learning
- "The computer is presented with example inputs and their desired outputs, given by a 'teacher,' and the goal is to learn a general rule that maps inputs to outputs. . . . We essentially tell the machine what we want to find, then fine-tune the model until we get the machine to predict what we know to be true." Source
- test set
- "A subset of the dataset reserved for testing a trained model. Each example in a dataset should belong to only one of the training or test sets. For instance, a single example should not belong to both the training set and the test set." Source
- training data
- "Known datasets for practicing and tuning the machine-learning model. Contrast with testing data and validation data." Source
- unsupervised learning
- "No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means toward an end (feature learning)." Source
- visualization
- "A presentation of a pattern within data in pictoral form (e.g., chart, graph), that does not require a narrative explanation or extensive text for the viewer to interpret." Source