LLMOps

Lecture by Josh Tobin. Published May 9, 2023. Download slides.

Chapter Summaries

Why LLMOps?

  • The topic of this lecture is core to the whole ethos of Full Stack Deep Learning
  • The course started five years ago, during the AI hype cycle around deep learning
  • Most classes teach how to build with neural networks, but not how to get them into production
  • That philosophy has carried through the development of our courses
  • This lecture focuses on building applications with language models and the considerations for production systems
  • The space of real production systems built on language models is still underdeveloped
  • The lecture covers assorted topics related to building these applications
  • It provides high-level pointers, initial choices, and resources for learning more
  • The aim is to tie these topics together into a first-pass theory of "LLMOps"

Choosing your base LLM

  • Building an application on top of LLMs requires choosing which model to use; the best model depends on trade-offs, such as quality, speed, cost, tunability, and data security.
  • For most use cases, GPT-4 is a good starting point (see the sketch after this list).
  • Proprietary models, like OpenAI's GPT-4 and Anthropic's Claude, are usually higher quality, but open-source models offer more customization and better data security.
  • Consider licensing when choosing an open-source model: permissive licenses (e.g., Apache 2.0) offer more freedom, whereas restrictive licenses limit commercial use.
  • Be cautious with "open source" models released under non-commercial licenses: they restrict commercial use and are not truly open source.
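
Because the right model depends on these trade-offs, it helps to keep the model choice behind a single parameter so it is easy to swap later. Below is a minimal sketch, assuming the OpenAI Python client (v1 interface); the `complete` helper and the default model name are illustrative, not part of the lecture.

```python
# Minimal sketch: keep the base-model choice behind one parameter so it is easy
# to swap (e.g., GPT-4 -> GPT-3.5) as quality/cost/latency trade-offs change.
# Assumes the openai v1 Python client and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str, model: str = "gpt-4") -> str:
    """Send a single-turn prompt to the chosen chat model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # low temperature makes outputs easier to compare across models
    )
    return response.choices[0].message.content

# Prototype with GPT-4; downsize later if cost or latency becomes a concern.
print(complete("List three trade-offs to consider when choosing a base LLM."))
```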

Proprietary LLMs

  • The lecture compares proprietary models along several criteria: number of parameters, size of the context window, type of training data, subjective quality score, speed of inference, and fine-tunability.
  • Number of parameters and amount of training data are proxies for model quality; the context window is crucial for how useful the model is in downstream applications.
  • Four types of training data: diverse, code, instructions, and human feedback; few models use all four types.
  • Quality is best determined using benchmarks and hands-on evaluation.
  • GPT-4 is recognized as the highest-quality model, followed by GPT-3.5 as a faster and cheaper option.
  • Claude from Anthropic and Cohere's largest model compete for quality and fine-tunability.
  • If you are willing to trade some quality for speed and cost, consider the smaller offerings from Anthropic, OpenAI, and Cohere.

Open-source LLMs

  • Large language models have both proprietary and open-source options
  • Open-source options include T5, Flan-T5, Pythia, Dolly, StableLM, LLaMA, Alpaca, Vicuna, Koala, and OPT
  • T5 and Flan-T5 have permissive licenses, while other options may have license restrictions (see the sketch after this list)
  • The LLaMA ecosystem is well supported by the community, but not ideal for production use
  • Benchmarks can mislead; assess language model performance on your specific tasks
  • Start projects with GPT-4 for prototyping, then downsize to GPT-3.5 or Claude if cost or latency is a concern
  • Cohere is the best for fine-tuning among commercial providers
  • Open-source models may catch up to GPT-3.5-level performance by the end of the year
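
To experiment with one of the permissively licensed open-source options, Flan-T5 is easy to try locally. A minimal sketch using the Hugging Face `transformers` pipeline; `google/flan-t5-base` is one published checkpoint, and the prompt is illustrative.

```python
# Minimal sketch: run a permissively licensed open-source model (Flan-T5) locally.
# Assumes `pip install transformers torch`; google/flan-t5-base is one published
# Flan-T5 checkpoint released under Apache 2.0.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

result = generator(
    "Summarize: Open-source models trade some quality for control and data security.",
    max_new_tokens=48,
)
print(result[0]["generated_text"])
```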

Iteration and prompt management

  • I believe prompt engineering is currently missing tools to make it more like engineering and less like ad hoc experimentation.
  • Experiment management was impactful in the deep learning world because experiments took a long time to run and there were many of them in parallel; prompt engineering typically doesn't share those characteristics.
  • I suggest three levels of tracking experiments with prompts and chains: 1) Doing nothing and using OpenAI Playground, 2) Tracking prompts in Git, and 3) Using specialized tracking tools for prompts (if necessary).
  • Most teams should track prompts in Git: it's easy and fits into their current workflows (see the sketch after this list).
  • Specialized prompt-tracking tools should decouple prompt iteration from Git and provide a UI for non-technical stakeholders.
  • Keep an eye out for new tools in this space, as it's rapidly evolving with recent announcements from major providers like Weights & Biases, Comet, and MLflow.
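
In practice, "tracking prompts in Git" can be as simple as keeping each prompt as a plain-text template file in the repository and loading it at call time, so prompt changes go through normal code review and history. A minimal sketch under that assumption; the `prompts/` layout and `load_prompt` helper are illustrative.

```python
# Minimal sketch of tracking prompts in Git: store each prompt as a plain-text
# template file in the repo (e.g., prompts/summarize.txt) so every change shows
# up in code review and git history. Paths and helper names are illustrative.
from pathlib import Path
from string import Template

PROMPT_DIR = Path("prompts")

def load_prompt(name: str, **variables: str) -> str:
    """Load a prompt template from the repo and substitute its variables."""
    template = Template((PROMPT_DIR / f"{name}.txt").read_text())
    return template.substitute(**variables)

# Example: prompts/summarize.txt might contain
#   "Summarize the following support ticket in two sentences:\n$ticket"
prompt = load_prompt("summarize", ticket="Customer cannot reset their password...")
print(prompt)
```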

Testing LLMs: Why and why is it hard?

  • To ensure changes to a model or prompt are effective, measure performance on a wide range of data representing end-user inputs.
  • User retention for AI-powered applications depends on trust and reliable output.
  • Traditional machine learning model testing involves training sets, held-out data, and measuring accuracy, but language models present unique challenges:
      • You don't know the training data used by API providers like OpenAI.
      • The production distribution is always different from the training distribution.
      • Metrics are less straightforward and might not capture the diverse behaviors of the model.
  • Language models require a more diverse understanding of behaviors and qualitative output measurement.

Testing LLMs: What works?

  • Two key questions for testing language models: what data to test them on and what metrics to compute on that data
  • Build evaluation sets incrementally, starting while you are still prototyping the model
  • Add interesting examples to the dataset, focusing on hard examples where the model struggles and unusual examples that aren't yet well represented
  • Use the language model itself to help generate diverse test cases by prompting it with the tasks you're trying to solve (see the sketch after this list)
  • As the model rolls out to more users, keep adding data to the dataset, considering user dislikes and underrepresented topics for inclusion
  • Consider the concept of test coverage, aiming for an evaluation set that covers the types of tasks users will actually perform in the system
  • Test coverage and distribution shift are closely related ideas, but they measure different relationships between your evaluation data and the data users actually send
  • Test reliability is about how closely offline performance tracks online performance, ensuring that your metrics are relevant to real-world user experience.
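
One way to put these points into practice is to keep the evaluation set as a simple JSONL file you append to whenever you find an interesting example, and to prompt the model itself for tricky new inputs. A minimal sketch; the file name, the ticket-summarizer task, and the `generate` interface are illustrative assumptions.

```python
# Minimal sketch of growing an evaluation set incrementally: store examples as
# JSONL and use the model itself to propose diverse, tricky test inputs.
# File names, the example task, and the generate() interface are illustrative.
import json
from pathlib import Path
from typing import Callable, Optional

EVAL_FILE = Path("eval_set.jsonl")

def add_example(inputs: str, reference: Optional[str] = None, tag: str = "manual") -> None:
    """Append an interesting example (e.g., one the model got wrong) to the eval set."""
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps({"inputs": inputs, "reference": reference, "tag": tag}) + "\n")

CASE_GENERATION_PROMPT = (
    "You are helping test a support-ticket summarizer. "
    "Write 5 diverse, tricky example tickets, one per line."
)

def generate_test_cases(generate: Callable[[str], str]) -> list[str]:
    """Ask an LLM (any generate(prompt) -> str function) for hard test inputs."""
    lines = generate(CASE_GENERATION_PROMPT).splitlines()
    return [line.strip() for line in lines if line.strip()]

# Usage: plug in a real model call for `generate`; here a stub stands in.
for case in generate_test_cases(lambda p: "Ticket about a refund\nTicket written in French"):
    add_example(case, tag="llm_generated")
```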

Evaluation metrics for LLMs

  • Evaluation metrics for language models depend on the availability of a correct answer, reference answer, previous answer, or human feedback.
  • If there's a correct answer, use metrics like accuracy.
  • With a reference answer, employ reference matching metrics like semantic similarity or factual consistency.
  • If there's a previous answer, ask another language model which answer is better (see the sketch after this list).
  • When human feedback is available, check if the answer incorporates the feedback.
  • If none of these options apply, verify output structure or ask the model to grade the answer.
  • Although automatic evaluation is desirable for faster experimentation, manual checks still play an essential role.
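
As a concrete example of the "ask another language model which answer is better" approach, the comparison can be framed as a grading prompt whose output you parse. A minimal sketch; the judge prompt and the `generate` interface are illustrative, and in practice you would also randomize the answer order to avoid position bias.

```python
# Minimal sketch of the "ask another LLM which answer is better" metric.
# The judge prompt and the generate(prompt) -> str interface are illustrative.
from typing import Callable

JUDGE_PROMPT = """You are comparing two answers to the same question.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? Reply with exactly "A", "B", or "TIE"."""

def compare_answers(generate: Callable[[str], str],
                    question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "TIE" according to an LLM judge."""
    verdict = generate(
        JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    ).strip().upper()
    # Fall back to TIE if the judge's output can't be parsed.
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```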

Deployment and monitoring

  • Deploying applications built on LLM APIs can be simple, but it becomes more complex if there's a lot of logic behind the API calls.
  • Techniques to improve LLM output quality include self-critique, sampling multiple outputs, and ensembling (see the sketch after this list).
  • Monitoring LLMs involves looking at user satisfaction and defining performance metrics, like response length or common issues in production.
  • Gather user feedback via low friction methods, such as thumbs up/down or short messages.
  • Common issues with LLMs in production include UI problems, latency, incorrect answers, long-winded responses, and prompt injection attacks.
  • Use user feedback to improve prompts by finding and addressing themes or problems.
  • Fine-tuning LLMs can be done through supervised fine-tuning or fine-tuning from human feedback (RLHF), though the latter is more challenging.
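
To make two of those output-quality techniques concrete, the sketch below samples several candidate outputs and then uses a self-critique prompt to pick one. The prompts and the `generate(prompt, temperature)` interface are illustrative assumptions, not the lecture's code.

```python
# Minimal sketch of two output-quality techniques: sample several candidates,
# then use a self-critique prompt to pick the best one.
# The prompts and the generate(prompt, temperature) interface are illustrative.
from typing import Callable

CRITIQUE_PROMPT = """Task: {task}

Candidate answers:
{candidates}

Pick the single best candidate. Reply with only its number."""

def best_of_n(generate: Callable[..., str], task: str, n: int = 3) -> str:
    """Sample n candidates at nonzero temperature, then self-critique to choose one."""
    candidates = [generate(task, temperature=0.7) for _ in range(n)]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    choice = generate(CRITIQUE_PROMPT.format(task=task, candidates=numbered), temperature=0.0)
    try:
        return candidates[int(choice.strip()) - 1]
    except (ValueError, IndexError):
        return candidates[0]  # fall back to the first sample if the critique is unparseable
```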

Test-driven development for LLMs

  • Rapidly evolving field with no established best practices yet
  • Aim to provide the main questions and resources for building applications with LLMs
  • Introduce a potential structured process: test-driven or behavior-driven development
  • Main components of process are prompt/chain development, deployment, user feedback, and logging/monitoring
  • Use interaction data from user feedback to improve the model, extract test data, and iterate on prompts (see the sketch after this list)
  • As complexity increases, consider fine-tuning workflow with additional training data
  • Virtuous cycle of improvement as interaction data from users increases and informs subsequent iterations
  • The process repeats, with the individual developer, the team, and end users all involved in feedback and improvements
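
A minimal sketch of how that loop might look in code: log every interaction, promote thumbs-down examples into the evaluation set, and gate prompt changes on that set before shipping. All names (`log_interaction`, `run_eval`, the file layout) are illustrative assumptions.

```python
# Minimal sketch of the test-driven loop: log every interaction, turn negative
# user feedback into new eval examples, and only ship prompt changes that do not
# regress on the evaluation set. All names and file layouts are illustrative.
import json
from pathlib import Path
from typing import Callable, Optional

LOG_FILE = Path("interactions.jsonl")
EVAL_FILE = Path("eval_set.jsonl")

def log_interaction(prompt: str, output: str, thumbs_up: Optional[bool]) -> None:
    """Record each production call together with any user feedback."""
    with LOG_FILE.open("a") as f:
        f.write(json.dumps({"prompt": prompt, "output": output, "thumbs_up": thumbs_up}) + "\n")

def harvest_hard_examples() -> None:
    """Promote thumbs-down interactions into the evaluation set."""
    with LOG_FILE.open() as logs, EVAL_FILE.open("a") as evals:
        for line in logs:
            record = json.loads(line)
            if record["thumbs_up"] is False:
                evals.write(json.dumps({"inputs": record["prompt"], "tag": "user_dislike"}) + "\n")

def safe_to_ship(run_eval: Callable[[str, Path], float], new_prompt: str, old_score: float) -> bool:
    """Gate a prompt change on the eval set: only ship if it does not regress."""
    return run_eval(new_prompt, EVAL_FILE) >= old_score
```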
