I noticed something interesting the other day: I enjoy using AI to build software more than I enjoy using most AI applications--software built with AI.
When I use AI to build software I feel like I can create almost anything I can imagine very quickly. AI feels like a power tool. It's a lot of fun.
Many AI apps don't feel like that. Their AI features feel tacked-on and useless, even counter-productive.
I am beginning to suspect that these apps are the "horseless carriages" of the AI era. They're bad because they mimic old ways of building software that unnecessarily constrain the AI models they're built with.
To illustrate what I mean by that, I'll start with an example of a badly designed AI app.
A little while ago, the Gmail team released a new feature giving users the ability to generate email drafts from scratch using Google's flagship AI model, Gemini. This is what it looks like:
Here I've added a prompt to the interface requesting a draft for an email to my boss. Let's see what Gemini returns:
As you can see, Gemini has produced a perfectly reasonable draft that unfortunately doesn't sound anything like an email I would actually write. If I'd written this email myself, it would have sounded something like this:
The tone of the draft isn't the only problem. The email I'd have written is actually shorter than the original prompt, which means I spent more time asking Gemini for help than I would have if I'd just written the draft myself. Remarkably, the Gmail team has shipped a product that perfectly captures the experience of managing an underperforming employee.
Millions of Gmail users have had this experience and I'm sure many of them have concluded that AI isn't smart enough to write good emails yet.
This could not be further from the truth: Gemini is an astonishingly powerful model that is more than capable of writing good emails. Unfortunately, the Gmail team designed an app that prevents it from doing so.
To illustrate this point, here's a simple demo of an AI email assistant that, if Gmail had shipped it, would actually save me a lot of time:
[Interactive demo: an email agent with three tools, labelEmail(label, color, priority), archiveEmail(), and draftReply(body), running against a sample inbox that mixes real correspondence ("dinner tonight", "Technical co-founder?", "advice on fundraising") with newsletters and marketing spam.]
This demo uses AI to read emails instead of writing them from scratch. Each email is categorized and prioritized, and some are auto-archived while others get an automatically drafted reply. The assistant processes emails individually according to a custom "System Prompt" that explains exactly how I want each one handled. You can try your own labeling logic by editing the System Prompt.
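Under the hood, a demo like this doesn't need much machinery. Here's a rough sketch of the core loop in Python; I'm assuming the OpenAI SDK, and the model name, JSON shape, and handle_email helper are all invented for illustration, not lifted from the demo's actual source:

```python
import json
from openai import OpenAI

client = OpenAI()

# The user-editable System Prompt: it explains how every email should be handled.
SYSTEM_PROMPT = """You are Pete's email assistant. For each email, choose one
action and respond with a JSON object like:
  {"action": "archive"}
  {"action": "label", "label": "founders", "priority": "high"}
  {"action": "draft_reply", "body": "..."}
Archive marketing emails and newsletters. Label emails from founders as high
priority and draft a short one-line reply in Pete's voice."""

def handle_email(subject: str, body: str) -> dict:
    """Ask the model how to handle a single email, per the System Prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Subject: {subject}\n\n{body}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# A toy inbox; a real client would pull these from the mail provider.
inbox = [
    ("dinner tonight", "hey pete, are we still on for 7pm?"),
    ("Exclusive Offer for New Customers", "Act now and save 20%..."),
]

for subject, body in inbox:
    print(subject, "->", handle_email(subject, body))
```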
It's obvious how much more powerful this approach is, so why didn't the Gmail team build it this way? To answer this question, let's look more closely at the problems with their design. We'll start with its generic tone.
The draft that Gmail's AI assistant produced is wordy and weirdly formal and so un-Pete that if I actually sent it to Garry, he’d probably mistake it for some kind of phishing attack. It’s AI Slop.
Everyone who has used an LLM to do any writing has had this experience. It’s so common that most of us have unconsciously adopted strategies for avoiding it when writing prompts. The simplest such strategy is just writing more detailed instructions that steer the LLM in the right direction, like this:
Here's a little draft-writer demo you can use to compare my original prompt with this expanded one:
The generated draft sounds better, but this is obviously dumb. The new prompt is even longer than the original, and I'd need to write something like this out every time I want a new email written.
There is a simple solution to this problem that many AI app developers seem to be missing: let me write my own "System Prompt".
Viewed from the outside, large language models are actually really simple. They read in a stream of words, the “prompt”, and then start predicting the words, one after another, that are likely to come next, the “response”.
The important thing to note here is that all of the input and all of the output is text. The LLM's user interface is just text[1].
LLM providers like OpenAI and Anthropic have adopted a convention to make prompt writing easier: the prompt is split into two components, a System Prompt and a User Prompt, so named because in many AI applications the app developer writes the System Prompt and the user writes the User Prompt.
The System Prompt explains to the model how to accomplish a particular set of tasks, and is re-used over and over again. The User Prompt describes a specific task to be done.
You can think of the System Prompt as a function, the User Prompt as its input, and the model's response as its output.
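In code, the analogy is nearly literal. Here's a minimal sketch, assuming the OpenAI Python SDK; the generate helper and the model name are mine, not Gmail's:

```python
from openai import OpenAI

client = OpenAI()

def generate(system_prompt: str, user_prompt: str) -> str:
    """System Prompt = the function, User Prompt = its input,
    and the model's response = its return value."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},  # reused across tasks
            {"role": "user", "content": user_prompt},      # the specific task
        ],
    )
    return response.choices[0].message.content
```

Swap in a different system_prompt and you've effectively defined a new function. That's the move I'll argue users should be allowed to make.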
In my original example, the User Prompt was the request I typed into Gmail's prompt box.
Google keeps the System Prompt a secret, but judging by the output we can guess what it looks like:
You are a helpful email-writing assistant responsible for writing emails on behalf of a Gmail user. Follow the user’s instructions and use a formal, businessy tone and correct punctuation so that it’s obvious the user is smart and serious.
Of course I'm being glib here, but the problem is not just that the Gmail team wrote a bad system prompt. The problem is that I'm not allowed to change it.
If, instead of forcing me to use their one-size-fits-all System Prompt, Gmail let me write my own, it would look something like this:
You're Pete, a 43 year old husband, father, programmer, and YC Partner.
You're very busy and so is everyone you correspond with, so you do your best to keep your emails as short as possible and to the point. You avoid all unnecessary words and you often omit punctuation or leave misspellings unaddressed because it's not a big deal and you'd rather save the time. You prefer one-line emails.
Do your best to be kind, and don't be so informal that it comes across as rude.
Intuitively, you can see what's going on here: when I write my own System Prompt, I'm teaching the LLM to write emails the way that I would. Does it work? Let's give it a try.
Try generating a draft using the (imagined) Gmail System Prompt, and then do the same with the "Pete System Prompt" above. The "Pete" version will give you something like this:
It's perfect. That was so easy!
Not only is the output better for this particular draft, it's going to be better for every draft going forward because the System Prompt is reused over and over again. No more banging my head against the wall explaining over and over to Gemini how to write like me!
And the best part of all? Teaching a model like this is surprisingly fun.
Spend a few minutes thinking about how YOU write email. Try writing a "You System Prompt" and see what happens. If the output doesn't look right, try to imagine what you left out of your explanation and try it again. Repeat that a few times until the output starts to feel right to you.
Better yet, try it with a few other User Prompts. For example, see if you can get the LLM to write these emails in your voice:
There's something magical about teaching an LLM to solve a problem the same way you would and watching it succeed. Surprisingly, it's actually easier than teaching a human because, unlike a human, an LLM will give you instantaneous, honest feedback about whether your explanation was good enough or not. If you get an email draft you're happy with, your explanation was sufficient. If you don't, it wasn't.
By exposing the System Prompt and making it editable, we've created a product experience that produces better results and is actually fun to use.
As of April 2025, most AI apps still don't (intentionally) expose their system prompts. Why not?
Whenever a new technology is invented, the first tools built with it inevitably fail because they mimic the old way of doing things. “Horseless carriage” refers to the early motor car designs that borrowed heavily from the horse-drawn carriages that preceded them. Here’s an example of an 1803 Steam Carriage design I found on Wikipedia:
The brokenness of this design was invisible to everyone at the time and laughably obvious after the fact.
Imagine living in 1806 and riding on one of these for the first time. Even if the wooden frame held together long enough to get you where you were going, the wooden seats and lack of suspension would have made the ride unbearable.
You'd probably think "there's no way I'd choose an engine over a horse". And you'd have been right, at least until the automobile was invented.
I suspect we are living through a similar period with AI applications. Many of them are infuriatingly useless in the same way that Gmail's Gemini integration is.
The "old world thinking" that gave us the original horseless carriage was swapping a horse out for an engine without redesigning the vehicle to handle higher speeds. What is the old world thinking constraining these AI apps?
Up until very recently, if you wanted a computer to do something you had two options for making that happen:

1. Write a program yourself
2. Use a program written by someone else
Programming is hard, so most of us choose option 2 most of the time. It's why I'd rather pay a few dollars for an off-the-shelf app than build it myself, and why big companies would rather pay millions of dollars to Salesforce than build their own CRM.
The modern software industry is built on the assumption that we need developers to act as middlemen between us and computers. They translate our desires into code and abstract it away from us behind simple, one-size-fits-all interfaces we can understand.
The division of labor is clear: developers decide how software behaves in the general case, and users provide input that determines how it behaves in the specific case.
By splitting the prompt into System and User components, we've created analogs that map cleanly onto these old world domains. The System Prompt governs how the LLM behaves in the general case and the User Prompt is the input that determines how the LLM behaves in the specific case.
With this framing, it's only natural to assume that it's the developer's job to write the System Prompt and the user's job to write the User Prompt. That's how we've always built software.
But in Gmail's case, this AI assistant is supposed to represent me. These are my emails and I want them written in my voice, not the one-size-fits-all voice designed by a committee of Google product managers and lawyers.
In the old world I'd have to accept the one-size-fits-all version because the only alternative was to write my own program, and writing programs is hard.
In the new world I don't need a middleman to tell a computer what to do anymore. I just need to be able to write my own System Prompt, and writing System Prompts is easy!
My core contention in this essay is this: when an LLM agent is acting on my behalf I should be allowed to teach it how to do that by editing the System Prompt.
Does this mean I always want to write my own System Prompt from scratch? No. I've been using Gmail for twenty years; Gemini should be able to write a draft prompt for me using my emails as reference examples. I'd like to be able to see that prompt and edit it, though, because the way I write emails and the people I correspond with change over time.
What about people who don't know how to write prompts, won't they need developers to do it for them? Maybe at first, but prompt-writing is surprisingly intuitive and judging by how quickly ChatGPT caught on I think people will figure it out.
What about agents that aren't so personal, like an AI accounting agent, or an AI legal agent? Wouldn't it make more sense for a software developer to hire an expert accountant or lawyer to write one-size-fits-all System Prompts in these cases?
That might make sense if I were the user. A System Prompt for doing X should be written by an expert in X, and I am not an expert in accounting or law. However, I suspect most accountants and lawyers are going to want to write their own System Prompts too, because their expertise is context-specific.
YC's accounting team, for example, operates in a way that is unique to YC. They use a specific mix of in-house and off-the-shelf software. They use YC-specific conventions that would only make sense to other YC employees. The structure of the funds they manage is unique to YC. A one-size-fits-all accounting agent would be about as useful to our team as an expert accountant who knew nothing about YC: not at all.
This is the case for every accounting team in every company I've ever worked for. It's why so much of finance still runs on Excel: it's a general tool that can handle an infinite number of specific use cases.
In most AI apps, System Prompts should be written and maintained by users, not software developers or even domain experts hired by developers.
Most AI apps should be agent builders, not agents.
If developers aren't writing prompts, what will they do?
For one, they'll create UIs for building agents that operate in a particular domain, like an email inbox or a general ledger.
Most people probably won't want to write every prompt from scratch, and good agent builders won't force them to. Developers will provide templates and prompt-writing agents that help users bootstrap their own agents.
Users also need an interface for reviewing an agent's work and iterating on their prompts, similar to the little dummy email agent builder I included above. This interface gives them a fast feedback loop for teaching an agent to perform a task reliably.
Developers will also build agent tools.
Tools are the mechanism by which agents act on the outside world. My email-writing agent needs a tool to submit a draft for my review. It might use another tool to send an email without my review (if I'm feeling confident enough to allow that), to search my inbox for previous emails from a particular address, or to check YC's founder directory to see if an email came from a YC founder.
Tools provide the security layer for agents. Whether or not an agent can do a particular thing is determined by which tools it has access to. It is much easier to enforce boundaries with tools written in code than it is to enforce them between System and User Prompts written in text.
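To make that concrete, here's what a tool list might look like using the OpenAI function-calling API. The tool names mirror the demo above; the schemas and setup are my own sketch, not a real product's:

```python
from openai import OpenAI

client = OpenAI()

# The agent's entire action space is this list. A tool that isn't on it
# simply doesn't exist as far as the model is concerned.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "draftReply",
            "description": "Draft a reply for the user's review. Never sends anything.",
            "parameters": {
                "type": "object",
                "properties": {"body": {"type": "string"}},
                "required": ["body"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "archiveEmail",
            "description": "Archive the current email.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[
        {"role": "system", "content": "You are Pete's email assistant."},
        {"role": "user", "content": "Subject: Exclusive Offer\n\nAct now and save 20%!"},
    ],
    tools=TOOLS,
)
```

Note what's absent: there is no sendEmail tool, so this agent cannot send mail no matter what any prompt says. The boundary lives in code.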
I suspect that in the future we'll look back and laugh at the idea that a "prompt injection" (like "Ignore previous instructions...") was something to be concerned about. The whole idea that developers should secure one part of the prompt from another part of the prompt is silly, and a strong signal that the abstractions we're using are broken. As this post makes clear: if any part of the prompt is in user space then the whole thing is in user space.
As I mentioned above, however, a better System Prompt still won't save me much time on writing emails from scratch.
The reason, of course, is that I prefer my emails to be as short as possible, which means any email written in my voice will be roughly the same length as the User Prompt that describes it. I've had a similar experience every time I've tried to use an LLM to write something. Surprisingly, generative AI models are not actually that useful for generating text.
The thing that LLMs are great at is reading text and transforming it, and that's what I'd like to use an agent for. Let's revisit our email-reading agent demo:
[Interactive demo: the same email-reading agent and tools (labelEmail, archiveEmail, draftReply) applied to the sample inbox again.]
It's not hard to imagine how much time an email-reading agent like this could save me. It already seems to do a better job of detecting spam than Gmail's built-in spam filter. It's more powerful and easier to maintain than the byzantine set of filters I use today. It could trigger a notification for every message that I think is urgent, and when I open them up I'd have a draft response ready to go, written in my voice. It could auto-archive the emails I don't need to read and summarize the ones I do.
Hell, with access to a few additional tools it could unsubscribe from lists, schedule appointments, and pay my bills too, all without my having to lift a finger.
This is what I really want from an AI-native email client: the ability to automate mundane work so that I can spend less time doing email[2].
This is what AI's "killer app" will look like for many of us: teaching a computer how to do things that we don't like doing so that we can spend our time on things we do.
One of the reasons I wanted to include working demos in this essay was to show that large language models are already good enough to do this kind of work on our behalf. In fact, they're more than good enough in most cases. It's not a lack of AI smarts that's keeping us from the future I described in the previous section; it's app design.
The Gmail team built a horseless carriage because they set out to add AI to the email client they already had, rather than ask what an email client would look like if it were designed from the ground up with AI. Their app is a little bit of AI jammed into an interface designed for mundane human labor rather than an interface designed for automating mundane labor.
AI-native software should maximize a user's leverage in a specific domain. An AI-native email client should minimize the time I have to spend on email. AI-native accounting software should minimize the time an accountant spends keeping the books.
This is what makes me so excited about a future with AI. It's a world where I don't have to spend time doing mundane work because agents do it for me. Where I'll focus only on things I think are important because agents handle everything else. Where I am more productive in the work I love doing because agents help me do it.
I can't wait.
Thanks to all who read drafts of this essay, including my Dad, dang, wcl & cpl, and my colleagues at YC.
[1]: I'm leaving some details out, and of course today's models can input and output sound and video, too. For our purposes we can ignore that.
[2]: There are email clients out there that are already working on this, notably Superhuman and Zero.