Data labeling and Annotation

Our mission is to accelerate the development of AI applications

Better data leads to more performant models. Performant models lead to faster deployment. We help deliver value from AI investments faster with better data by providing an end-to-end solution to manage the entire ML lifecycle.

The age of AI is here. Generative AI has the potential to unseat incumbents, catapult new leaders, or solidify existing moats.

Every industry, from the private sector to public sector is rethinking their strategies to incorporate AI. Despite this explosion in interest, there is no blueprint for organizations to go from inception to deployment for their AI initiatives.

Our products for image annotation, semantic segmentation, 3D point cloud annotation, and LIDAR and RADAR annotation are used by industry leaders and provide world-class accuracy.

Our proprietary Data Engine powers the most advanced LLMs, generative models, and computer vision models with high-quality data. We then apply our experience partnering with leading AI companies building these models to help more organizations customize and Apply AI in their organizations.

Methodology

We are proud to be trusted by leading companies to provide a data-centric, end-to-end solution to manage the entire ML lifecycle. Combining cutting edge technology with operational excellence, we help teams develop the highest-quality datasets because better data → better AI　Meet our customers

Over our M models, we let A={(m,m′):m<m′, and m,m′∈[M]}A={(m,m′):m<m′, and m,m′∈[M]} denote our comparative data set.

At time t∈Nt∈N, we serve the human a pair of models At∈AAt∈A and we have our evaluator’s response Ht∈[0,0.5,1]Ht∈[0,0.5,1]. A 1 means that model mm is preferred over model m′m′ and a 0.5 means that the models were equally preferred.

With Bradley-Terry, we use a logistic relationship to model the probability that this is true with:

P(Ht=1)=11+eξm′−ξmP(Ht=1)=1+eξm′−ξm1

Where ξξ is an M-length vector of "BT" coefficients. We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss:

s(P^)=arg min⁡ξEA,HP[l(H,11+eξA2−ξA1)]s(P^)=ξargminEA,HP[l(H,1+eξA2−ξA11)]

Where ll is the binary cross-entropy loss,

l(h,p)=−(hlog⁡(p)+(1−h)log⁡(1−p))l(h,p)=−(hlog(p)+(1−h)log(1−p))

Additionally, we’ll minimize this loss while using inverse weighting by P(At)P(At) to target a score with a uniform distribution over AA. This inverse weighting isn’t strictly necessary, however, as our pairwise comparisons between models are very close to equal. We perform the below formula to get our final BT-score.

s(P^)=arg min⁡ξ∑t=1T1P(At)l(Ht,11+eξAt,2−ξAt,1)s(P^)=ξargmin∑t=1TP(At)1l(Ht,1+eξAt,2−ξAt,11)

where At~PAt~P. This score is converted to an Elo-scale with the simple conversion 1000+s(P^)×4001000+s(P^)×400 and is sorted to get our final ranking.

Data Labeling: The Authoritative Guide
The success of your ML models is dependent on data and label quality. This is the guide you need to ensure you get the highest quality labels possible.
Machine learning has revolutionized our approach to solving problems in computer vision and natural language processing. Powered by enormous amounts of data, machine learning algorithms are incredibly good at learning and detecting patterns in data and making useful predictions, all without being explicitly programmed to do so.
Trained on large amounts of image data, computer models can predict objects with very high accuracy. They can recognize faces, cars, and fruit, all without requiring a human to write software programs explicitly dictating how to identify them.
Similarly, natural language processing models power modern voice assistants and chatbots we interact with daily. Trained on enormous amounts of audio and text data, these models can recognize speech, understand the context of written content, and translate between different languages.
Instead of engineers attempting to hand-code these capabilities into software, machine learning engineers program these models with a large amount of relevant, clean data. Data needs to be labeled to help models make these valuable predictions. Data labeling is one of machine learning's most critical and overlooked activities.
This guide aims to provide a comprehensive reference for data labeling and to share practical best practices derived from Scale's extensive experience in addressing the most significant problems in data labeling.
What is Data Labeling?
Data labeling is the activity of assigning context or meaning to data so that machine learning algorithms can learn from the labels to achieve the desired result.
To better understand data labeling, we will first review the types of machine learning and the different types of data to be labeled. Machine learning has three broad categories: supervised, unsupervised, and reinforcement learning. We will go into more detail about each type of machine learning in Why is Data Annotation Important?
Supervised machine learning algorithms leverage large amounts of labeled data to “train” neural networks or models to recognize patterns in the data that are useful for a given application. Data labelers define ground truth annotations to data, and machine learning engineers feed that data into a machine learning algorithm. For example, data labelers will label all cars in a given scene for an autonomous vehicle object recognition model. The machine learning model will then learn to identify patterns across the labeled dataset. These models then make predictions on never before seen data.
Types of Data
Structured vs. Unstructured Data
Structured data is highly organized, such as information in a relational database (RDBMS) or spreadsheet. Customer information, phone numbers, social security numbers, revenue, serial numbers, and product descriptions are structured data.
Unstructured data is data that is not structured via predefined schemas and includes things like images, videos, LiDAR, Radar, some text data, and audio data.
Images
Camera sensors output data initially in raw format and then converted to .png or preferably .jpg files, which are compressed and take up less storage than .png, which is a serious consideration when dealing with the large amounts of data needed to train machine learning models. Image data is also scraped from the internet or collected by 3rd party services. Image data powers many applications, from face recognition to manufacturing defect detection to diagnostic imaging.
Why is Data Annotation Important?
Machine learning powers revolutionary applications made possible by vast amounts of high-quality data. To better understand the importance of data labeling, it is critical to understand the different types of machine learning: supervised, unsupervised, and reinforcement learning.
Reinforcement Learning leverages algorithms to take actions in an environment to maximize a reward. For instance, Deepmind’s AlphaGo used reinforcement learning to play games against itself to master the game of GO and become the strongest player in history. Reinforcement learning does not rely on labeled data but instead maximizes a reward function to achieve a goal.
Supervised Learning vs. Unsupervised Learning
Supervised learning is behind the most common and powerful machine learning applications, from spam detection to enabling self-driving cars to detect people, cars, and other obstacles. Supervised learning uses a large amount of labeled data to train a model to accurately classify data or predict outcomes.
Unsupervised learning helps analyze and cluster unlabeled data, driving systems like recommendation engines. These models learn from features of the dataset itself, without any labeled data to "teach" the algorithm the expected outputs. A common approach is K-means clustering, which aims to partition n observations into k clusters and assign each observation to the nearest mean.
While there are many fantastic applications for unsupervised learning, supervised learning has driven the most high-impact applications due to its high accuracy and predictive capabilities.
Machine learning practitioners have turned their attention away from model improvement to improving data, coining a new paradigm: data-centric ai. Only a tiny fraction of real-world ML systems are composed of ML code. More high-quality data and accurate data labels are necessary to power better AI. As the methods to create better machine learning models shift to data-centricity, it is essential to understand the entire process of a well-defined data pipeline, from data collection methods to data labeling to data curation.
This guide focuses on the most common types of data labels and the best practices for high quality so that you can get the most out of your data and therefore get the most out of your models.
How to Annotate Data
To create high-quality supervised learning models, you need a large volume of data with high-quality labels. So, how do you label data? First, you will need to determine who will label your data. There are several different approaches to building labeling teams, and each has its benefits, drawbacks, and considerations. Let's first consider whether it is best to involve humans in the labeling process, rely entirely on automated data labeling, or combine the two approaches.
1. Choose Between Humans vs. Machines
Automated Data Labeling
For large datasets consisting of well-known objects, it is possible to automate or partially automate data labeling. Custom Machine Learning models trained to label specific data types will automatically apply labels to the dataset.
Establishing high-quality ground-truth datasets early on, and only then can you leverage automated data labeling. Even with high-quality ground truth, it can be challenging to account for all edge cases and to fully trust automated data labeling to provide the highest quality labels.
Human Only Labeling
Humans are exceptionally skilled at tasks for many modalities we care about for machine learning applications, such as vision and natural language processing. Humans provide higher quality labels than automated data labeling in many domains.
However, human experiences can be subjective to varying degrees and training humans to label the same data consistently is a challenge. Furthermore, humans are significantly slower and can be more expensive than automated labeling for a given task.
Human in the Loop (HITL) Labeling
Human-in-the-loop labeling leverages the highly specialized capabilities of humans to help augment automated data labeling. HITL data labeling can come in the form of automatically labeled data audited by humans or from active tooling that makes labeling more efficient and improves quality. The combination of automated labeling plus human in the loop nearly always outpaces the accuracy and efficiency of either alone.
2. Assemble Your Labeling Workforce
If you choose to leverage humans in your data labeling workforce, which we highly recommend, you will need to figure out how to source your labeling workforce. Will you hire an in-house team, convince your friends and family to label your data for free, or scale up to a 3rd Party labeling company? We provide a framework to help you make this decision below.
In-House Teams
Small startups may not have the capital to afford significant investments in data labeling, so they may end up having all the team members, including the CEO, label data themselves. For a small prototype, this approach may work but is not a scalable solution.
Large, well-funded organizations may choose to keep in-house labeling teams to keep control over the entire data pipeline. This approach allows for much control and flexibility, but it is expensive and much work to manage.
Companies with privacy concerns or sensitive data may choose in-house labeling teams. While a perfectly valid approach, this can be difficult to scale.
Pros: Subject matter expertise, tight control over data pipelines
Cons: Expensive, overhead in training and managing labelers
Crowdsourcing
Crowdsourcing platforms provide a quick and easy way to quickly complete a wide array of tasks by a large pool of people. These platforms are fine for labeling data with no privacy concerns, such as open datasets with basic annotations and instructions. However, if more complex labels are needed or sensitive data is involved, the untrained resource pool from crowdsource platforms is a poor choice. Resources found on crowdsourcing platforms are not trained well and lack domain expertise, often leading to poor quality labeling.
Pros: Access to a larger pool of labelers
Cons: Quality is suspect; significant overhead in training and managing labelers
3rd Party Data Labeling Partners
3rd Party data labeling companies provide high-quality data labels efficiently and often have deep machine learning expertise. These companies can act as technical partners to advise you on best practices for the entire machine learning lifecycle, including how to best collect, curate, and label your data. With highly trained resource pools and state-of-the-art automated data labeling workflows and toolsets, these companies offer high-quality labels for a minimal cost.
To achieve extremely high quality (99%+) on a large dataset requires a large workforce (1,000+ data labelers on any given project). Scaling to this volume at high quality is difficult with in-house teams and crowdsourcing platforms. However, these companies can also be expensive and, if they are not acting as a trusted advisor, can convince you to label more data than you may need for a given application.
Pros: Technical expertise, minimal cost, high quality; The top data labeling companies have domain-relevant certifications such as SOC2 and HIPAA.
Cons: Relinquish control of the labeling process; Need a trusted partner with proper certifications to handle sensitive data
3. Select Your Data Labeling Platform
Once you have determined who will label your data, you need to find a data labeling platform. There are many options here, from building in-house, using open source tools, or leveraging commercial labeling platforms.
Open Source Tools
These tools are free to use by anyone, with some limitations for commercial use. These tools are great for learning and developing machine learning and AI, personal projects, or testing early commercial applications of AI. While free, the tradeoff is that these tools are not as scalable or sophisticated as some commercial platforms. Some label types discussed in this guide may not be available in these open-source tools.
The list below is meant to be representative, but not exhaustive so many great open source alternatives may not be included.
CVAT: Originally developed by Intel, CVAT is a free, open-source web-based data labeling platform. CVAT supports many standard label types, including rectangles, polygons, and cuboids. CVAT is a collaborative tool and is excellent for introductory or smaller projects. However, web users are limited to 500 MB of data and only ten tasks per user, reducing the appeal of the collaboration features on the web version. CVAT is available locally to avoid these data constraints.
LabelMe: Created by CSAIL, LabelMe is a free, open-source data-labeling platform supporting community collaboration on datasets for computer vision research. You can contribute to other projects by labeling open datasets and label your data by downloading the tool. Labelme is quite limited compared to CVAT, and the web version no longer accepts new accounts.
Stanford Core NLP: A fully featured NLP labeling and natural language processing platform, Stanford's CoreNLP is a robust open source tool offering Named Entity Recognition (NER), linking, text processing, and more.
In-house Tools
Building in-house tools is an option selected by some large organizations that want tighter control over their ML pipelines. You have direct control over which features to build, support your desired use cases, and address your specific challenges. However, this approach is costly, and these tools will need to be maintained and updated to keep up with the state-of-the-art.
Commercial Platforms
Commercial platforms offer high-quality tooling, dedicated support, and experienced labeling workforces to help you scale and can also provide guidance on best practices for labeling and machine learning. Supporting many customers improves the quality of the platforms for all customers, so you get access to state-of-the-art functionality that you may not see with in-house or open-source labeling platforms.
Scale Studio is the industry-leading commercial platform, providing best-in-class labeling infrastructure to accelerate your team, with labeling tools to support any use case and orchestration to optimize the performance of your workforce. Easily annotate, monitor, and improve the quality of your data.
High-Quality Data Annotations
Whatever annotation platform you use, maximizing the quality of your data labels is critical to getting the most out of your machine learning applications.
The classic computer science axiom "Garbage in Garbage Out" is especially acute in machine learning, as data is the primary input to the learning process. With poor quality data or labels, you will have poor results. We aim to provide you with the most critical quality metrics and discuss best practices to ensure that you are maximizing your labeling quality.
Different Ways to Measure Quality
We cover some of the most critical quality metrics and then discuss best practices to ensure quality in your labeling processes.
Label Accuracy
It is essential to analyze how closely labels follow your instructions and match your expectations. For instance, say you have assigned tasks to data labelers to annotate pedestrians. In your instructions, you have specified labels should include anything carried (i.e., a phone or backpack), but not anything that is pushed or pulled. When you review sample tasks, are the instructions followed, or are strollers (pushed) and luggage (pulled) included in the annotations?
How accurate are labelers on benchmark tasks? These are test tasks to determine overall labeler accuracy and give you more confidence that other labeled data will also be correct. Is labeling consistent across labelers or types of data? If label accuracy is inconsistent across different labelers, this may indicate that your instructions are unclear or that you need to provide more training to your labelers.
Model Performance Improvement
How accurate is your model at its specified task? This output metric is not solely dependent on labeling quality, the quantity and quality of data play a prominent role, but labeling quality is a significant factor to consider.
Let's review some of the most critical model performance metrics.
Precision
Precision defines what proportion of positive identifications were correct and is calculated as follows:
Precision = True Positives / True Positives + False Positives
A model that produces no false positives has a precision of 1.0
Recall
Recall defines what proportion of actual positives were identified correctly by the model, and is calculated as follows:
Recall = True Positives / True Positives + False Negatives
A model that produces no false negatives has a recall of 1.0
A model with high recall but low precision returns many results, but many of the predicted labels are incorrect compared to the ground truth labels. On the other hand, a model with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the ground truth labels. An ideal model with high precision and high recall will return many results, with all results labeled correctly.
Precision recall curves provide a more nuanced understanding of model performance than any single metric. Precision and recall are a tradeoff; the exact desired values depend on your model and its application.
For instance, for a diagnostic imaging application tasked to detect cancerous tumors, higher recall is desired as it is better to predict that a non-cancerous tumor is cancerous than the alternative of labeling a cancerous tumor as non-cancerous.
Alternatively, applications like SPAM filters require high precision so that important emails are not incorrectly flagged SPAM, even though this may allow more actual SPAM to enter our inbox.
Intersection Over Union (IoU)
An indirect method of confirming label quality in computer vision applications is an evaluation metric called Intersection over Union (IoU).
IoU compares the accuracy of a predicted label over the ground truth label by measuring the ratio of the area of overlap of the predicted label to the area of the union of both the predicted label and the ground truth label.
The closer this ratio is to 1, the better trained the model is.
As we discussed earlier, the essential factors for model training are high-quality data and labels. So, IoU is an indirect measure of the quality of data labels. You may have a different threshold of quality that you are focused on depending on the class of object. For instance, if you are building an augmented reality application focused on human interaction, you may need your IoU at 0.95 for identifying human faces but only need an IoU of 0.70 for identifying dogs.
Again, it is critical to remember that other data-related factors can influence IoU, such as bias in the dataset or an insufficient quantity of data.
Confusion Matrices
Confusion matrices are a simple but powerful tool to understand class confusion in your model better. A confusion matrix is a grid of classes comparing predicted and actual (ground truth) classifications. By examining the confusion matrix, you can quickly understand misclassifications, such as your model predicting a traffic sign when the ground truth is a train.
Combining these confusions with confidence scores can provide an easy way to prioritize the object confusion, looking at those instances where the model is highly confident in an incorrect prediction. Class confusions are likely due to an issue with incorrect or missing labels or with an inadequate quantity of data containing the traffic sign and train classes.
Best Practices for Achieving High-Quality Labels:
Collect the best data possible: Your data should be high quality and as consistent as possible while avoiding the bias that may reduce your model's usefulness. Ideally, your data collection pipeline is integrated into your labeling pipeline to increase efficiency and minimize turnaround times.
Hire the right labelers for the job: Ensure that your labelers speak the right language, are from a specific region, or are familiar with a particular domain. Also, ensure that your labelers are properly incentivized to provide high-quality labels.
Combine humans and machines: use ML-powered labeling tooling with humans in the loop (HITL) for the highest accuracy labels.
Provide clear and comprehensive instructions: This will help to ensure that different labelers will label data consistently.
Curate Your data: As you look to improve your model performance, you will also want to curate your data. Use a data curation tool such as Scale Nucleus to explore your data and identify data that is completely missing labels or improperly labeled. Review your dataset's IoU, ROC curve, and confusion matrix to understand poor model performance better. The best data curation tools will also allow you to visually interact with these charts to visually inspect data related to a specific confusion and even send it to your labeling team to correct. You may also discover that you are missing data, in which case you will need to collect more data to label.
Benchmark tasks and screening: Collect high-confidence responses to a subset of labeling tasks and use these tasks to estimate the quality of your labelers. Mix these benchmark tasks into other tasks. Use the performance on benchmark tasks to determine if an individual labeler understands your instructions and is capable of providing your desired quality. You can screen labelers who do not pass your benchmark tasks and either retrain them or exclude them from your project.
Inspect common answers for specific data: Looking at common answers for labels can help you identify trends in labeling errors. If all data labelers are incorrectly classifying a particular object or mislabeling a piece of data, then maybe the issue is not with the labeler but lies somewhere else. Reevaluate your ground truth, instructions, and training processes to ensure that your expectations have been clearly communicated. Once identified, add common mistakes to your instructions to avoid these issues in the future.
Update your instructions and golden datasets as you encounter edge cases
Create calibration batches to ensure that your instructions are clear and that quality is high on a small sample of your data before scaling up your labeling tasks.
Establish a consensus pipeline: Implement a consensus pipeline for classification or text-based tasks with more subjectivity. Use a majority vote or a hierarchical approach based on the experience or proven quality of an individual or group of data labelers.
Establish layers of review: Establish a hierarchical review structure for computer vision tasks to ensure that the labels are as accurate as possible.
Randomly sample labeled data for manual auditing: Randomly sample your labeled data and audit it yourself to confirm the quality of the sample. This approach will not guarantee that the entire dataset is labeled accurately but can give you a sense of general performance on labeling tasks.
Retrain or remove poor annotators: If an annotator's performance does not improve over time and with retraining, then you may need to remove them from your project.
Measure your model performance: Whether your model performance improves or degrades can often be directly reflected in the quality of your data labels. Use model validation tools such as Scale validate to critically evaluate precision, recall, intersection over union, and any other metrics critical to your model performance.
Data Labeling for Computer Vision
Computer vision is a field in artificial intelligence focused on understanding data from 2D images, videos, or 3D inputs and making predictions or recommendations based on that data. The human vision system is particularly advanced, and humans are very good at computer vision tasks.
In this chapter, we explore the most relevant types of data labeling for computer vision and provide best practices for labeling each type of data.
1. Bounding Box
The most commonly used and simplest data label, bounding boxes are rectangular boxes that identify the position of an object in an image or video
Data labelers draw a rectangular box over an object of interest, such as a car or street sign. This box defines the object's X and Y coordinates.
By "bounding" an object with this type of label, machine learning models have a more precise feature set from which to extract specific object attributes to help them conserve computing resources and more accurately detect objects of a particular type.
Object detection is the process of categorizing objects along with their location in an image. These X and Y coordinates can then be output in a machine-readable format such as JSON.
Typical Bounding Box Applications:
Autonomous driving and robotics to detect objects such as cars, people, or houses
Identifying damage or defects in manufactured objects
Household object detection for augmented reality applications
Anomaly detection in medical diagnostic imaging
Best Practices:
Hug the border as tightly as possible. Accurate labels will capture the entire object and match the edges as closely as possible to the object's edges to reduce confusion for your model.
Avoid item overlap. Due to IoU, bounding boxes work best when there is minimal overlap between objects. If objects overlap significantly, using polygon or segmentation annotations may be better.
Object size: Smaller objects are better suited for bounding boxes, while larger objects are better suited for instance segmentation. However, annotating tiny objects may require more advanced techniques.
Avoid Diagonal Lines: Bounding boxes perform poorly with diagonal lines such as walkways, bridges, or train tracks as boxes cannot tightly hug the borders. Polygons and instance segmentation are better approaches in these cases.
2. Classification
Object classification means applying a label to an entire image based on predefined categories, known as classes. Labeling images as containing a particular class such as "Dog," "Dress," or "Car" helps train an ML model to accurately predict objects of the same class when run on new data.
Typical Classification Applications:
Activity Classification
Product Categorization
Image Sentiment Analysis
Hot Dog vs. Not Hot Dog
Best Practices:
Create clearly defined, easily understandable categories that are relevant to the dataset.
Provide sufficient examples and training to your data labelers so that the requirements are clear and ambiguity between classes is minimized.
Create benchmark tests to ensure label quality.
3. Cuboids
Cuboids are 3-dimensional labels that identify the width, height, and depth of an object, as well as the object's location.
Data labelers draw a cuboid over the object of interest such as a building, car, or household object, which defines the object's X, Y, and Z coordinates. These coordinates are then output in a machine-readable format such as JSON.
Cuboids enable models to precisely understand an object's position in 3D space, which is essential in applications such as autonomous driving, indoor robotics, or 3D room planners. Reducing these objects to geometric primitives also makes understanding an entire scene more manageable and efficient.
Typical Cuboid Applications:
Develop prediction and planning models for autonomous vehicles using cuboids on pedestrians and cars to determine predicted behavior and intent.
Indoor objects such as furniture for room planners
Picking, safety, or defect detection applications in manufacturing facilities
Best Practices:
Capture the corners and edges accurately. Like bounding boxes, ensure that you capture the entire object in the cuboid while keeping the label as tight to the object as possible.
Avoid Overlapping labels where possible. Clean, non-overlapping cuboid data annotations will help your model improve object predictions and localizations in 3D space.
Axis alignment is critical. Ensure that the alignment of your bounding boxes is on the same axis for objects of the same class.
Keep your camera intrinsics in mind. Applying cuboids without understanding the camera's location will lead to poor prediction results when objects are not in the same position related to the camera in the future. The front face of a "true" cuboid will likely not be a perfect 90-degree rectangle, especially if it isn't facing the camera head-on. Furthermore, the edges of a cuboid parallel to the ground should converge to the horizon, while the top and bottom edges of the right side of the above annotation are parallel.
Pair 2D Data with 3D Depth Data such as LiDAR.2D images inherently lack depth information, so pairing your 2D data with 3D depth data such as LiDAR will yield the best results for applications dependent on depth accuracy. See the section below on 3D Sensor fusion for more information on this topic.
4. 3D Sensor Fusion
3D sensor fusion refers to the method of combining the data from multiple sensors to accommodate for the weaknesses of each sensor. 2D images alone are not enough for current machine learning models to make sense of entire scenes. Estimating depth from a 2D image is challenging, and occlusion and limited field of view make relying on 2D images tricky. While some approaches to autonomous driving rely solely on cameras, a more robust approach is to overcome the limitations of 2D by supplementing with 3D systems using sensors such as LiDAR and Radar.
LiDAR (Light Detection and Ranging) is a method for determining the distance of objects with a laser to determine the depth of objects in scenes and create 3D representations of the scene.
Radar (Radio detection and ranging) uses radio waves to determine objects' distance, angle, and radial velocity.
This demo provides an interactive 3D sensor fused scene, and the video below gives a high-level overview of a similar scene.
Typical 3D Sensor Fusion Applications
Autonomous Vehicles
Geospatial and mapping applications
Robotics and automation
Best Practices
Ensure that your data labeling platform is calibrated to your sensor intrinsics (or better yet, ensure that your tooling is sensor agnostic) and supports different lens and sensor types, for example, fisheye and panoramic cameras.
Look for a data labeling platform that can support large scenes, ideally with support for infinitely long scenes.
Ensure that object tracking is consistent throughout a scene, even when an object leaves and returns to the scene.
Include attribute support for understanding correlations between objects, such as truck cabs and trailers.
Leverage linked instance IDs describing the same object across the 2D and 3D modalities.
5. Ellipses
Ellipses are oval data labels that identify the position of objects in an image. Data labelers draw an ellipse label on an object of interest such as wheels, faces, eyes, or fruit. This annotation defines the object's location in 2D space. The X and Y coordinates of the four extremal vertices of the ellipse can then be output in a machine-readable format such as JSON to fully define the location of the ellipse.
Applications
Face Detection
Medical Imaging Diagnosis
Wheel Detection
Best Practices:
The data to be labeled should be oval or circular; i.e., it is not helpful to label rectangular boxes with ellipses when a bounding box will yield better results.
Use ellipses where there would be high overlap for bounding boxes or where objects are tightly clustered or occluded, such as in bunches of fruit. Ellipses can tightly hug the borders of these objects and provide a more targeted geometry to your model.
6. Lines
Lines identify the position of linear objects like roadway markers. Data labelers draw lines over areas of interest, which define the vertices of a line. Labeling images with lines helps to train your model to identify boundaries more accurately. The X and Y coordinates of the vertices of the lines can then be output in JSON.
ypical Lines Applications
Label roadway markers with straight or curved lines for autonomous vehicles
Horizon lines for AR/VR applications
Define boundaries for sporting fields
Best Practices
Label only the lines that matter most to your application.
Match the lines to the shape of the lines in the image as closely as possible.
Depending on the use case, it could be important for lines not to intersect.
Center the line annotation within the line in the image to improve model performance.
7. Points
Points are spatial locations in an image used to define important features of an object. Data labelers place a point on each location of interest, representing that location's X and Y coordinates. These points may be related to each other, such as when annotating a human shoulder, elbow, and wrist to identify the larger moving parts of an arm. These labels help machine learning models more accurately determine pose estimations or detect essential features of an object.
Typical Points Applications
Pose estimation for fitness or health applications or activity recognition
facial feature points for face detection
Best Practices
Label only the points that are most critical to your application. For instance, if you are building a face detection application, focus on labeling salient points on the eyes, nose, mouth, eyebrows, and the outline of the face.
Group points into structures (hand, face, and skeletal keypoints), and the labeling interface should make it efficient for taskers to visualize the interconnections between points in these structures.
8. Polygons
While bounding boxes are quick and easy for data labelers to draw, they are not precise in mapping to irregular shapes and can leave large gaps around an object. There is a tradeoff between accuracy and efficiency in using bounding boxes and polygons. For many applications, bounding boxes provide sufficient accuracy for a machine learning model with minimal effort. However, some applications require the increased accuracy of polygons at the expense of a more costly and less efficient annotation.
Data labelers draw a polygon shape over an object of interest by clicking on relevant points of the object to complete an entirely connected annotation. These points define the vertices of the polygon. The X and Y coordinates of these vertices are then output in JSON.
Typical Polygons Applications
Irregular objects such as buildings, vehicles, or trees for autonomous vehicles
Satellite imagery of houses, pools, industrial facilities, planes, or landmarks
Fruit detection for agricultural applications
Best Practices
Objects with holes or those split into multiple polygons due to occlusion (a car behind a tree, for example) require special treatment. Subtract the area of each hole from the object.
Avoid slight overlaps between polygons that are next to each other.
Zoom in closely to each object to ensure that you place points close to each object's borders.
Pay close attention to curved edges, making sure to add more vertices to 'smooth' these edges as much as possible.
Leverage the Auto Annotate Polygon tool to efficiently label objects. Automatically and quickly generate high-precision polygon annotations by highlighting specific objects of interest with an initial, approximate bounding box.
0:28
Follow these steps to achieve success with the Auto Annotate Polygon tool:
Include all parts of the object of interest.
Exclude overlapping object instances and other objects as much as possible.
Keep the bounding box tight to the borders of the object.
Use click to include/exclude to refine the automatically-generated polygon by instantly performing local edits - include and exclude specific areas of interest.
Further, refine the polygon by increasing or decreasing vertex density to smooth curved edges.
9. Segmentation
Segmentation labels relate to pixel-wise labels on an image and come in three common types, semantic segmentation, instance segmentation, and panoptic segmentation.
Semantic Segmentation
Label each pixel of an image with a class of what is being represented, such as a car, human, or foliage. Referred to as "dense prediction," this is a time-consuming and tedious process.
With semantic segmentation, you do not distinguish between separate objects of the same class (see instance segmentation for this).
Intersection Over Union (IoU)
An indirect method of confirming label quality in computer vision applications is an evaluation metric called Intersection over Union (IoU).
IoU compares the accuracy of a predicted label over the ground truth label by measuring the ratio of the area of overlap of the predicted label to the area of the union of both the predicted label and the ground truth label.
The closer this ratio is to 1, the better trained the model is.
As we discussed earlier, the essential factors for model training are high-quality data and labels. So, IoU is an indirect measure of the quality of data labels. You may have a different threshold of quality that you are focused on depending on the class of object. For instance, if you are building an augmented reality application focused on human interaction, you may need your IoU at 0.95 for identifying human faces but only need an IoU of 0.70 for identifying dogs.
Again, it is critical to remember that other data-related factors can influence IoU, such as bias in the dataset or an insufficient quantity of data.
Confusion Matrices
Confusion matrices are a simple but powerful tool to understand class confusion in your model better. A confusion matrix is a grid of classes comparing predicted and actual (ground truth) classifications. By examining the confusion matrix, you can quickly understand misclassifications, such as your model predicting a traffic sign when the ground truth is a train.
Combining these confusions with confidence scores can provide an easy way to prioritize the object confusion, looking at those instances where the model is highly confident in an incorrect prediction. Class confusions are likely due to an issue with incorrect or missing labels or with an inadequate quantity of data containing the traffic sign and train classes.
Best Practices for Achieving High-Quality Labels:
Collect the best data possible: Your data should be high quality and as consistent as possible while avoiding the bias that may reduce your model's usefulness. Ideally, your data collection pipeline is integrated into your labeling pipeline to increase efficiency and minimize turnaround times.
Hire the right labelers for the job: Ensure that your labelers speak the right language, are from a specific region, or are familiar with a particular domain. Also, ensure that your labelers are properly incentivized to provide high-quality labels.
Combine humans and machines: use ML-powered labeling tooling with humans in the loop (HITL) for the highest accuracy labels.
Provide clear and comprehensive instructions: This will help to ensure that different labelers will label data consistently.
Curate Your data: As you look to improve your model performance, you will also want to curate your data. Use a data curation tool such as Scale Nucleus to explore your data and identify data that is completely missing labels or improperly labeled. Review your dataset's IoU, ROC curve, and confusion matrix to understand poor model performance better. The best data curation tools will also allow you to visually interact with these charts to visually inspect data related to a specific confusion and even send it to your labeling team to correct. You may also discover that you are missing data, in which case you will need to collect more data to label.
Benchmark tasks and screening: Collect high-confidence responses to a subset of labeling tasks and use these tasks to estimate the quality of your labelers. Mix these benchmark tasks into other tasks. Use the performance on benchmark tasks to determine if an individual labeler understands your instructions and is capable of providing your desired quality. You can screen labelers who do not pass your benchmark tasks and either retrain them or exclude them from your project.
Inspect common answers for specific data: Looking at common answers for labels can help you identify trends in labeling errors. If all data labelers are incorrectly classifying a particular object or mislabeling a piece of data, then maybe the issue is not with the labeler but lies somewhere else. Reevaluate your ground truth, instructions, and training processes to ensure that your expectations have been clearly communicated. Once identified, add common mistakes to your instructions to avoid these issues in the future.
Update your instructions and golden datasets as you encounter edge cases
Create calibration batches to ensure that your instructions are clear and that quality is high on a small sample of your data before scaling up your labeling tasks.
Establish a consensus pipeline: Implement a consensus pipeline for classification or text-based tasks with more subjectivity. Use a majority vote or a hierarchical approach based on the experience or proven quality of an individual or group of data labelers.
Establish layers of review: Establish a hierarchical review structure for computer vision tasks to ensure that the labels are as accurate as possible.
Randomly sample labeled data for manual auditing: Randomly sample your labeled data and audit it yourself to confirm the quality of the sample. This approach will not guarantee that the entire dataset is labeled accurately but can give you a sense of general performance on labeling tasks.
Retrain or remove poor annotators: If an annotator's performance does not improve over time and with retraining, then you may need to remove them from your project.
Measure your model performance: Whether your model performance improves or degrades can often be directly reflected in the quality of your data labels. Use model validation tools such as Scale validate to critically evaluate precision, recall, intersection over union, and any other metrics critical to your model performance.
Data Labeling for Computer Vision
Computer vision is a field in artificial intelligence focused on understanding data from 2D images, videos, or 3D inputs and making predictions or recommendations based on that data. The human vision system is particularly advanced, and humans are very good at computer vision tasks.
In this chapter, we explore the most relevant types of data labeling for computer vision and provide best practices for labeling each type of data.
1. Bounding Box
The most commonly used and simplest data label, bounding boxes are rectangular boxes that identify the position of an object in an image or video
Data labelers draw a rectangular box over an object of interest, such as a car or street sign. This box defines the object's X and Y coordinates.
By "bounding" an object with this type of label, machine learning models have a more precise feature set from which to extract specific object attributes to help them conserve computing resources and more accurately detect objects of a particular type.
Object detection is the process of categorizing objects along with their location in an image. These X and Y coordinates can then be output in a machine-readable format such as JSON.
Typical Bounding Box Applications:
Autonomous driving and robotics to detect objects such as cars, people, or houses
Identifying damage or defects in manufactured objects
Household object detection for augmented reality applications
Anomaly detection in medical diagnostic imaging
Best Practices:
Hug the border as tightly as possible. Accurate labels will capture the entire object and match the edges as closely as possible to the object's edges to reduce confusion for your model.
Avoid item overlap. Due to IoU, bounding boxes work best when there is minimal overlap between objects. If objects overlap significantly, using polygon or segmentation annotations may be better.
Object size: Smaller objects are better suited for bounding boxes, while larger objects are better suited for instance segmentation. However, annotating tiny objects may require more advanced techniques.
Avoid Diagonal Lines: Bounding boxes perform poorly with diagonal lines such as walkways, bridges, or train tracks as boxes cannot tightly hug the borders. Polygons and instance segmentation are better approaches in these cases.
2. Classification
Object classification means applying a label to an entire image based on predefined categories, known as classes. Labeling images as containing a particular class such as "Dog," "Dress," or "Car" helps train an ML model to accurately predict objects of the same class when run on new data.
Typical Classification Applications:
Activity Classification
Product Categorization
Image Sentiment Analysis
Hot Dog vs. Not Hot Dog
Best Practices:
Create clearly defined, easily understandable categories that are relevant to the dataset.
Provide sufficient examples and training to your data labelers so that the requirements are clear and ambiguity between classes is minimized.
Create benchmark tests to ensure label quality.
3. Cuboids
Cuboids are 3-dimensional labels that identify the width, height, and depth of an object, as well as the object's location.
Data labelers draw a cuboid over the object of interest such as a building, car, or household object, which defines the object's X, Y, and Z coordinates. These coordinates are then output in a machine-readable format such as JSON.
Cuboids enable models to precisely understand an object's position in 3D space, which is essential in applications such as autonomous driving, indoor robotics, or 3D room planners. Reducing these objects to geometric primitives also makes understanding an entire scene more manageable and efficient.
Typical Cuboid Applications:
Develop prediction and planning models for autonomous vehicles using cuboids on pedestrians and cars to determine predicted behavior and intent.
Indoor objects such as furniture for room planners
Picking, safety, or defect detection applications in manufacturing facilities
Best Practices:
Capture the corners and edges accurately. Like bounding boxes, ensure that you capture the entire object in the cuboid while keeping the label as tight to the object as possible.
Avoid Overlapping labels where possible. Clean, non-overlapping cuboid data annotations will help your model improve object predictions and localizations in 3D space.
Axis alignment is critical. Ensure that the alignment of your bounding boxes is on the same axis for objects of the same class.
Keep your camera intrinsics in mind. Applying cuboids without understanding the camera's location will lead to poor prediction results when objects are not in the same position related to the camera in the future. The front face of a "true" cuboid will likely not be a perfect 90-degree rectangle, especially if it isn't facing the camera head-on. Furthermore, the edges of a cuboid parallel to the ground should converge to the horizon, while the top and bottom edges of the right side of the above annotation are parallel.
Pair 2D Data with 3D Depth Data such as LiDAR.2D images inherently lack depth information, so pairing your 2D data with 3D depth data such as LiDAR will yield the best results for applications dependent on depth accuracy. See the section below on 3D Sensor fusion for more information on this topic.
4. 3D Sensor Fusion
3D sensor fusion refers to the method of combining the data from multiple sensors to accommodate for the weaknesses of each sensor. 2D images alone are not enough for current machine learning models to make sense of entire scenes. Estimating depth from a 2D image is challenging, and occlusion and limited field of view make relying on 2D images tricky. While some approaches to autonomous driving rely solely on cameras, a more robust approach is to overcome the limitations of 2D by supplementing with 3D systems using sensors such as LiDAR and Radar.
LiDAR (Light Detection and Ranging) is a method for determining the distance of objects with a laser to determine the depth of objects in scenes and create 3D representations of the scene.
Radar (Radio detection and ranging) uses radio waves to determine objects' distance, angle, and radial velocity.
This demo provides an interactive 3D sensor fused scene, and the video below gives a high-level overview of a similar scene.
Typical 3D Sensor Fusion Applications
Autonomous Vehicles
Geospatial and mapping applications
Robotics and automation
Best Practices
Ensure that your data labeling platform is calibrated to your sensor intrinsics (or better yet, ensure that your tooling is sensor agnostic) and supports different lens and sensor types, for example, fisheye and panoramic cameras.
Look for a data labeling platform that can support large scenes, ideally with support for infinitely long scenes.
Ensure that object tracking is consistent throughout a scene, even when an object leaves and returns to the scene.
Include attribute support for understanding correlations between objects, such as truck cabs and trailers.
Leverage linked instance IDs describing the same object across the 2D and 3D modalities.
5. Ellipses
Ellipses are oval data labels that identify the position of objects in an image. Data labelers draw an ellipse label on an object of interest such as wheels, faces, eyes, or fruit. This annotation defines the object's location in 2D space. The X and Y coordinates of the four extremal vertices of the ellipse can then be output in a machine-readable format such as JSON to fully define the location of the ellipse.
Applications
Face Detection
Medical Imaging Diagnosis
Wheel Detection
Best Practices:
The data to be labeled should be oval or circular; i.e., it is not helpful to label rectangular boxes with ellipses when a bounding box will yield better results.
Use ellipses where there would be high overlap for bounding boxes or where objects are tightly clustered or occluded, such as in bunches of fruit. Ellipses can tightly hug the borders of these objects and provide a more targeted geometry to your model.
6. Lines
Lines identify the position of linear objects like roadway markers. Data labelers draw lines over areas of interest, which define the vertices of a line. Labeling images with lines helps to train your model to identify boundaries more accurately. The X and Y coordinates of the vertices of the lines can then be output in JSON.
ypical Lines Applications
Label roadway markers with straight or curved lines for autonomous vehicles
Horizon lines for AR/VR applications
Define boundaries for sporting fields
Best Practices
Label only the lines that matter most to your application.
Match the lines to the shape of the lines in the image as closely as possible.
Depending on the use case, it could be important for lines not to intersect.
Center the line annotation within the line in the image to improve model performance.
7. Points
Points are spatial locations in an image used to define important features of an object. Data labelers place a point on each location of interest, representing that location's X and Y coordinates. These points may be related to each other, such as when annotating a human shoulder, elbow, and wrist to identify the larger moving parts of an arm. These labels help machine learning models more accurately determine pose estimations or detect essential features of an object.
Typical Points Applications
Pose estimation for fitness or health applications or activity recognition
facial feature points for face detection
Best Practices
Label only the points that are most critical to your application. For instance, if you are building a face detection application, focus on labeling salient points on the eyes, nose, mouth, eyebrows, and the outline of the face.
Group points into structures (hand, face, and skeletal keypoints), and the labeling interface should make it efficient for taskers to visualize the interconnections between points in these structures.
8. Polygons
While bounding boxes are quick and easy for data labelers to draw, they are not precise in mapping to irregular shapes and can leave large gaps around an object. There is a tradeoff between accuracy and efficiency in using bounding boxes and polygons. For many applications, bounding boxes provide sufficient accuracy for a machine learning model with minimal effort. However, some applications require the increased accuracy of polygons at the expense of a more costly and less efficient annotation.
Data labelers draw a polygon shape over an object of interest by clicking on relevant points of the object to complete an entirely connected annotation. These points define the vertices of the polygon. The X and Y coordinates of these vertices are then output in JSON.
Typical Polygons Applications
Irregular objects such as buildings, vehicles, or trees for autonomous vehicles
Satellite imagery of houses, pools, industrial facilities, planes, or landmarks
Fruit detection for agricultural applications
Best Practices
Objects with holes or those split into multiple polygons due to occlusion (a car behind a tree, for example) require special treatment. Subtract the area of each hole from the object.
Avoid slight overlaps between polygons that are next to each other.
Zoom in closely to each object to ensure that you place points close to each object's borders.
Pay close attention to curved edges, making sure to add more vertices to 'smooth' these edges as much as possible.
Leverage the Auto Annotate Polygon tool to efficiently label objects. Automatically and quickly generate high-precision polygon annotations by highlighting specific objects of interest with an initial, approximate bounding box.
Follow these steps to achieve success with the Auto Annotate Polygon tool:
Include all parts of the object of interest.
Exclude overlapping object instances and other objects as much as possible.
Keep the bounding box tight to the borders of the object.
Use click to include/exclude to refine the automatically-generated polygon by instantly performing local edits - include and exclude specific areas of interest.
Further, refine the polygon by increasing or decreasing vertex density to smooth curved edges.
9. Segmentation
Segmentation labels relate to pixel-wise labels on an image and come in three common types, semantic segmentation, instance segmentation, and panoptic segmentation.
Semantic Segmentation
Label each pixel of an image with a class of what is being represented, such as a car, human, or foliage. Referred to as "dense prediction," this is a time-consuming and tedious process.
With semantic segmentation, you do not distinguish between separate objects of the same class (see instance segmentation for this).

Guide to AI for eCommerce
This guide details the main applications of Artificial Intelligence for the eCommerce Industry.
Diffusion Models: A Practical Guide
Diffusion models have the power to generate any image that you can imagine. This is the guide you need to ensure you can use them to your advantage whether you are a creative artist, software developer, or business executive.
Guide to Computer Vision Applications
Understand what computer vision is, how it works, and deep dive into some of the top applications for computer vision by industry.
Guide to Large Language Models
Large language models (LLMs) are transforming how we create, understand our world, and how we work. We created this guide to help you understand what LLMs are and how you can use these models to unlock the power of your data and accelerate your business.

Industry Use Cases

Let's quickly explore how a few industries are adopting Generative AI to improve their business:

Insurance companies use AI to increase the operational efficiency of claims processing. Claims are often highly complicated, and Generative AI excels at properly routing, summarizing, and classifying these claims.
Retail and eCommerce companies have for years tried to adopt customer chatbots, but they have failed to live up to their promise of streamlining operations and providing a better user experience. But now, with the latest generative chatbots that are fine-tuned on company data, chatbots finally provide engaging discussions and recommendations that dynamically respond to customer input.
Financial services companies are building assistants for investment research that analyze financial statements, historical market data, and other proprietary data sources and provide detailed summaries, interactive charts, and even take action with plugins. These tools increase the efficiency and effectiveness of investors by surfacing the most relevant trends and providing actionable insights to help improve returns.

Overall, large language models offer a wide range of potential benefits for businesses.

A Brief History of Language Models

To better contextualize the impact of these models, it is essential to understand the history of natural language processing. While the field of natural language processing (NLP) began in the 1940s after World War II, and the concept for using neural networks for natural language processing dates back to the 1980s, it was not until relatively recently that the combination of processing power via GPUs and data necessary to train very large models became widely available. Symbolic and statistical natural language processing were the dominant paradigms from the 1950s through the 2010s.

Recurrent Neural Networks (RNNs) were popularized in the 1980s. RNNs are a basic form of artificial neural network that can handle sequential data, but they struggle with long-term dependencies. In 1997, LSTMs were invented; LSTMs are a type of RNN that can manage long-term dependencies better due to their gating mechanism. Around 2007, LSTMs began to revolutionize speech recognition, with this architecture being used in many commercial speech-to-text applications.

Throughout the early 2000s and 2010s, trends shifted to deep neural nets, leading to rapid improvement on the state of the art for NLP tasks. In 2017, the now-dominant transformer architecture was introduced to the world, changing the entire field of AI and machine learning. Transformers use an attention mechanism to process entire sequences at once, making them more computationally efficient and capable of handling complex contextual relationships in data. Compared to RNN and LSTM models, the transformer architecture is easier to parallelize, allowing training on larger datasets.

2018 was a seminal year in the development of language models built on this transformer architecture, with the release of both BERT from Google and the original GPT from OpenAI.

BERT, which stands for Bidirectional Encoder Representations from Transformers, was one of the first large language models to achieve state-of-the-art results on a wide range of natural language processing tasks in 2018. BERT is widely used in business today for its classification capabilities, as it is a relatively lightweight model and inexpensive to run in production. BERT was state of the art when it was first unveiled, but has now been surpassed in nearly all benchmarks by more modern generative models such as GPT-4.

The GPT family of models differs from BERT in that GPT models are generative and have a significantly larger scale leading to GPT models outperforming BERT on a wide range of tasks. GPT-2 was released in 2019, and GPT-3 was announced in 2020 and made generally available in 2021. Google released the open-source T5/FLAN (Fine-tuned Language Net) model and announced LaMDA in 2021, pushing the state of the art with highly capable models.

In 2022 the open-source BLOOM language model, the more powerful GPT-3 text-davinci-003, and ChatGPT were released, capturing headlines and catapulting LLMs to popular attention.In 2023, GPT-4 and Google's Bard chatbot were announced. Bard was originally running LaMDA, but Google has since replaced it with the more powerful PaLM 2 model.

There are now several competitive models including Anthropic’s Claude, Cohere’s Command Model, Stability AI’s StableLM. We expect to see these models to continue to improve and gain new capabilities over the next few years. In addition to text, multimodal models will be able to ingest and respond with images and videos, with a coherent understanding of the relationships between these modalities. Models will hallucinate less and will be able to more reliably interact with tools and databases. Developer ecosystems will proliferate around these models as a backend, ushering in an era of accelerated innovation and productivity. While we do expect to see larger models, we expect model builders will focus more on high quality data to improve model performance.

Model Size and Performance

Over time, LLMs have become more capable as they've increased in size. Model size is typically determined by its training dataset size measured in tokens (parts of words) or by its number of parameters (the number of values the model can change as it learns).

BERT (2018) was 3.7B tokens and 240 million parameters.
GPT-2 (2019) was 9.5B tokens and 1.5 billion parameters.
GPT-3 (2020) has 499B tokens and 175B parameters.
PaLM (2022) was 780 Billion tokens and 540 billion parameters

As these models scaled in size, their capabilities continued to increase, providing more incentive for companies to build applications and entire businesses on top of these models. This trend continued until very recently.

But now, model builders are grappling with the fact that we may have reached a sort of plauteau, a point at which additional model size yields diminishing performance improvements. Deepmind's paper on training compute-optimal LLMs, showed that for every doubling of model size the number of training tokens should also be doubled. Most LLMs are already trained on enormous amounts of data, including most of the internet, so expanding dataset size by a large degree is increasingly difficult. Larger models will still outperform smaller models, but we are seeing model builders focusing less on increasing size and instead focusing on incredibly high-quality data for pre-training, combined with techniques like supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and prompt engineering to optimize model performance.

Fine-Tuning Large Language Models

Fine-tuning is a process by which an LLM is adapted to specific tasks or domains by training it on a smaller, more targeted dataset. This can help the model better understand the nuances of the specific tasks or domains and improve its performance on those particular tasks. Through human evaluations on prompt distribution, OpenAI found that outputs from their 1.3B parameter InstructGPT model were preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Fine-tuned models perform better and are much less likely to respond with toxic content or hallucinate (make up information). The approach for fine-tuning these models included a wide array of different domains, though still a tiny subset compared to the entirety of internet data.

This principle of fine-tuning increasing task-specific performance also applies to single domains, such as a particular industry or specific task. Fine-tuning large language models (LLMs) makes them incredibly valuable for businesses. For example, a company that provides language translation services could fine-tune an LLM to understand better the nuances of a particular language or domain, such as legal documents or insurance claims. This understanding helps the model generate more accurate and fluent translations, leading to better customer satisfaction and potentially even higher revenues.

Another example is a business that generates product descriptions for an e-commerce website. By fine-tuning an LLM to understand the characteristics of different products and their features, the model could generate more informative and compelling descriptions, which could increase sales and customer engagement. Fine-tuning an LLM can help businesses tailor the model's capabilities to their specific needs and improve their performance in various tasks and domains.

Fine-tuning an LLM generally consists of the following high-level process:

Identify the task or domain you want to fine-tune the model for, which could be anything from language translation to text summarization to generating product descriptions.
Gather a targeted dataset relevant to the task or domain you want to fine-tune the model for. This dataset should be large enough to provide the model with sufficient information to learn from but not so large that it takes a long time to train. A few hundred training examples is the minimum recommended amount, with more data increasing model quality. Ensure the data is relevant to your industry and specific use case.
Use a machine learning framework, library, or a tool like Scale Spellbook to train the LLM on the smaller dataset. This will involve providing the model with the original data, "input text," and the corresponding desired output, such as summarization or classification of the text into a set of predefined categories, and then allowing the model to learn from this data by adjusting its internal parameters.
Monitor the model's performance as it trains and make any necessary adjustments to the training process to improve the output. This could involve changing the size of the training dataset, adjusting the model's learning rate, or modifying the model's architecture.
Once the model has been trained, evaluate its performance on the specific task or domain you fine-tuned it for. This will involve providing the model with input text and comparing its actual output to the desired output.

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement learning from human feedback (RLHF) is a methodology to train machine learning models by soliciting feedback from human users. RLHF allows for more efficient learning. Instead of attempting to write a loss function that will result in the model behaving more like a human, RLHF includes humans as active participants in the training process. RLHF results in models that align more closely with human expectations, a typical qualitative measure of model performance.

Models trained with RLHF, such as InstructGPT and ChatGPT, have the benefit of generally being more helpful and more aligned with a user's goals. These models are better at following instructions and tend not to make up facts (hallucinate) as often as models trained with other methods. Additionally, these models perform as well as traditional models but at a substantially smaller size (InstructGPT is 1.3 billion parameters, compared to GPT-3 at 175 billion parameters).

InstructGPT

OpenAI API uses GPT-3 based language models to perform natural language tasks on user prompts, but these models can generate untruthful or toxic outputs. To improve the models' safety and alignment with user intentions, OpenAI developed InstructGPT models using reinforcement learning from human feedback (RLHF). These models are better at following instructions and generate less toxic content. They have been in beta on the API for over a year and are now the default language models on the API. OpenAI believes that fine-tuning language models with human input is a powerful way to align them more closely with human values and make them more reliable.

ChatGPT

ChatGPT is a large language model that has been developed specifically for the task of conversational text generation. This model was initially trained with supervised fine-tuning with humans interacting to create a conversational dataset. The model was then fine-tuned with RLHF, with humans ranking model outputs which were then used to improve the model.

One of the key features of ChatGPT is its ability to maintain the context of a conversation and generate relevant responses. As such, it is a valuable tool for applications such as search engines or chatbots, where the ability to generate coherent and appropriate responses is essential. In addition, ChatGPT can be fine-tuned for even more specific applications, allowing it to achieve even better performance on specialized tasks.

Overall, ChatGPT has made large language models more accessible to a wider range of users than previous large language models.

While the high-level steps for fine-tuning are simple, to accurately improve the performance of the model for a specific task, expertise is required.

LLM Prompt Engineering

Prompt engineering is the process of carefully designing the input text, or "prompt," that is fed into an LLM. By providing a well-crafted prompt, it is possible to control the model's output and guide it to generate more desirable responses. The ability to control model outputs is useful for various applications, such as generating text, answering questions, or translating sentences. Without prompt engineering, an LLM may generate irrelevant, incoherent, or otherwise undesirable responses. By using prompt engineering, it is possible to ensure that the model generates the desired output and makes the most of its advanced capabilities.

Prompt engineering is a nascent field, but a new career is already emerging, that of the "Prompt Engineer." A prompt engineer for large language models (LLMs) is responsible for designing and crafting the input text, or "prompts," that are fed into the models. They must have a deep understanding of LLM capabilities and the specific tasks and applications it will be used for. The prompt engineer must be able to identify the desired output and then design prompts that are carefully crafted to guide the model to generate that output. In practice, this may involve using specific words or phrases, providing context or background information, or framing the prompt in a particular way. The prompt engineer must be able to work closely with other team members and adapt to changing requirements, datasets, or models. Prompt engineering is critical in ensuring that LLMs are used effectively and generate the desired output.

Prompt Engineering for an LLM generally consists of the following high-level process:

Identify the task or application you want to use the LLM for, such as generating text, answering questions, or summarizing reports.
Determine the specific output you want the LLM to generate, which could be a paragraph of text, a single value for classification, or lines of code.
Carefully design a prompt to guide the LLM to generate the desired output. Be as specific as possible and provide context or background information to ensure that the language is clear.
Feed the prompt into the LLM and observe the output it generates.
If the output is not what you desired, modify the prompt and try again.
Following these high level can help you get the most out of your model and make it more useful for a variety of applications. To quickly get started with prompt engineering for large language models, try out Spellbook today.

Below we provide an overview of a few popular prompt engineering techniques:

Ensuring Brand Fidelity

In combination with RLHF and domain-specific fine-tuning, prompt engineering can help ensure that model responses reflect your brand guidelines and company policies. By specifying an identity for your model in a prompt, you can enforce the desired model behavior in various scenarios.

For instance, let's say that you are Acme Corp., a financial services company. A user has landed on your website by accident and is asking for advice on a particular pair of running shoes.

This response is an example of a hallucination or the model fabricating results. Though the company does not sell running shoes, it gladly responds with a suggestion. Let's update the default prompt, or system message, to cover this edge case.

Default Prompt: We will specify a default prompt, which is added to every session to define the default behavior of the chatbot. In this example, we will use this default prompt:

"You are AcmeBot, a bot designed to help users with financial services questions. AcmeBot responses should be informative and actionable. AcmeBot's responses should always be positive and engaging. If a user asks for a product or service unrelated to financial services, AcmeBot should apologize and simply inform the user that you are a virtual assistant for Acme Corp, a financial services company and cannot assist with their particular request, but that you would be happy to assist with any financial questions the user has."

With this default prompt in place, the model now behaves as we expect:

Improved Information Parsing

By specifying the desired template for the response, you can steer the model to return data in the format that is required by your application. For example, say you are a financial institution integrating existing backend systems with a natural language interface powered by an LLM. Your backend systems require a specific format to accept any data, which an LLM will not provide out of the box. Let's look at an example:

This response is accurate, but it is missing context that our backend systems need to parse this data properly. Let's specify the template we need to receive an appropriate response. Depending on the application, this template can also be added as part of a default prompt.

Now our data can be parsed by our backend system!

Adversarial or “Red-team” prompting

Chat models are often designed to be deployed in public-facing applications, where it's important they do not produce toxic, harmful, or embarrassing responses, even when users intentionally seek such material. Adversarial prompts are designed to elicit disallowed output, tricking or confusing a chat model into violating the policies its creators intended.

One typical example is prompt injection, otherwise referred to as instruction injection. Models are trained to follow user instructions but are also given a directive by a default prompt to behave in certain ways, such as not revealing details about how the model works or what the default prompt is. However, with clever prompting, the model can be tricked to disregard its programming and follow user instructions that conflict with its training or default prompt.

Below we explore a simple example of an instruction injection, followed by an example using a model that has been properly trained, fine-tuned, and with a default prompt that prevents it from falling prey to these common adversarial techniques:

Adversarial prompt with poor response:

Adversarial prompt with desired response:

Adversarial prompt engineering is an entire topic unto itself, including other techniques such as role-playing and fictionalization, unusual text formats and obfuscated tasks, prompt echoing, and dialog injection. We have only scratched the surface of prompt engineering here, but there are a wide array of different techniques to control model responses. Prompt engineering is evolving quickly, and experienced practitioners have spent much time developing an intuition for optimizing prompts for a desired model output. Additionally, each model is slightly different and responds to the same prompts with slightly different behaviors, so learning these differences adds another layer of complexity. The best way to get familiar with prompt engineering is to get hands on and start prompting models.

Conclusion

As we have seen, LLMs are versatile tools that can be applied to a wide variety of use cases. These models have already had a transformative impact on the business landscape, with billions of dollars being spent in 2023 alone. Nearly every industry is working on adopting these models into their specific use cases, from insurance companies looking to optimize claims processing, wealth managers looking for unique insights across a large number of portfolios to help them improve returns, or eCommerce companies looking to make it easier to purchase their products.

To optimize investments in LLMs, it is critical that businesses understand how to properly implement them. Using base foundation models out of the box is not sufficient for specific use cases. These models need to be fine-tuned on proprietary data, improved with human feedback, and prompted properly to ensure that their outputs are reliable and accomplish the task at hand.

At we help companies realize the value of large language models, including those that are building large language models and those that are looking to adopt them to make their businesses better.