

AI on Draught: Engineering Experience with Conversational AI

How do you make your chatbot genuinely helpful, your coding agent concise, and your character agent feel like your favourite movie star? What makes Claude feel like Claude and ChatGPT feel like ChatGPT? Two models a point apart on MMLU can feel completely different over twenty turns: one turns sycophantic, one over-refuses, one has the right tone but quietly makes up citations. Users pick up on this within a handful of messages and form strong preferences that benchmarks would never predict.
The interesting question is how you engineer for that experience. A model's character (its register, its hedging, the way it pushes back, the way it recovers when wrong) is the output of hard technical decisions: data curation, post-training, system prompts, tool-use scaffolding. The techniques are out there, just rarely written down, and even more rarely compared between teams working on similar problems.
Why this topic and why now
The people taking these questions seriously are scattered across engineering, product, behavioural science, and alignment research, often without realising the others exist. This evening is about putting them in the same room.
Topics on the table
- Evals for open-ended conversation: LLM-as-judge pipelines and where they quietly mislead you (position bias, length bias, self-preference; see the sketch after this list). Behavioural regression testing across checkpoints. Eval sets that go stale faster than you expect.
- System prompts and inference-time control: Prompt architectures that survive long contexts. Persona stability across tool calls and sub-agent handoffs. When prompt-level changes stop being enough, and what you reach for next.
- Context and memory: What long-context degradation looks like in practice. Retrieval vs in-context vs fine-tuned knowledge. Summarisation for multi-session agents. When to compress, when to forget.
- Cross-model behavioural differences: What's actually observable between Claude, GPT, Gemini, Llama, and open-weight fine-tunes.
- Interpretability in the product loop: Where SAEs, probes, and attribution methods are earning their keep in eval and debugging workflows, and where they're still a research project waiting for a product.
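To make the first topic concrete, here is a minimal sketch of one common mitigation for position bias in pairwise LLM-as-judge evals: run the judge on both orderings and only keep verdicts that survive the swap. The `Judge` callable is a hypothetical stand-in for whatever judge-model call you use, not any particular library's API, and the sketch assumes a pairwise judge that answers "A" or "B".

```python
from typing import Callable, Optional

# Hypothetical judge callable: takes (prompt, answer_a, answer_b) and
# returns "A" or "B". In practice this wraps your judge-model API call.
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, prompt: str, ans_x: str, ans_y: str) -> Optional[str]:
    """Run the judge on both orderings and keep only verdicts that
    survive the swap. A position-biased judge flips its preference
    when the order flips; those comparisons come back as None."""
    first = judge(prompt, ans_x, ans_y)   # x presented as answer A
    second = judge(prompt, ans_y, ans_x)  # y presented as answer A
    if first == "A" and second == "B":
        return "x"  # consistent preference for x
    if first == "B" and second == "A":
        return "y"  # consistent preference for y
    return None     # order-dependent verdict: treat as a tie

if __name__ == "__main__":
    # Toy stand-in: a maximally position-biased judge that always
    # prefers whichever answer is shown first, so every verdict is None.
    always_first: Judge = lambda prompt, a, b: "A"
    print(debiased_verdict(always_first, "Is Mars red?", "Yes.", "Yes, iron oxide."))
```

The fraction of comparisons that come back None is itself a useful diagnostic: it bounds how much of a measured win rate position bias could be manufacturing. Length bias and self-preference need separate checks, such as correlating verdicts with answer length or with which model produced the answer.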
You'll leave with
A clearer picture of what other practitioners are actually running in production - not just what's been written up. A handful of new collaborators picked for direct overlap with your work. And probably a few opinions worth revisiting.
Who should attend
- Applied AI and product engineers. People shipping conversational agents into the wild - chat, copilots, voice, multi-turn agentic workflows. Wrestling with prompt regressions, tool-call reliability, and eval coverage.
- ML engineers. Post-training, RLHF/DPO pipelines, reward modelling, synthetic data, eval infrastructure, fine-tuning for behavioural properties.
- Researchers. Model behaviour, alignment, interpretability, and the tooling that makes any of this measurable.
Event Details
- Date: Tuesday, 26 May
- Time: 6:30 PM - 8:45 PM
- Location: London AI Hub, 140 Goswell Road, EC1V 7DY
🎟️ Please apply below to attend

