TL;DR

At Martian, we are fortunate to work with many of the world's most advanced users of AI. We see the problems they face on the leading edge of AI and collaborate closely with them to overcome these challenges. In this first installment of a three-part series, we share a view into the future of prompt engineering that we refer to as Automated Prompt Optimization (APO). We summarize the challenges faced by leading AI companies including Mercor, G2, Copy.ai, Autobound, 6sense, Zelta AI, EDITED, Supernormal, and others. We identify key issues like model variability, drift, and "secret prompt handshakes". We reveal innovative techniques used to address these challenges, including LLM observers, prompt co-pilots, and human-in-the-loop feedback systems for refining prompts. We conclude by inviting those who are interested to collaborate with us on future research in this problem area.
Introduction

Our expertise lies in Model Routing. We dynamically route every prompt to the optimal large language model (LLM) based on our customers' specific cost and performance needs. Many of the most advanced AI companies are implementing routing. By working alongside these AI leaders, we gain firsthand insight into the challenges they encounter and collaborate to solve them. We are working at the leading edge of commercial AI projects.
The challenges we see are likely ones you are facing now or will encounter as you advance on your AI journey. We share this information to provide a glimpse into the future of Automated Prompt Optimization (APO) and to invite the broader AI community to collaborate with us on research in this area. If you are interested in participating, please reach out to us at contact@withmartian.com.
This article is the first of a three-part series on APO.

Part One: In this article, we summarize the challenges faced by leading AI companies including Mercor, G2, Copy.ai, Autobound, 6sense, Zelta AI, EDITED, Supernormal, and others. We identify key issues like model variability, drift, and "secret prompt handshakes". We reveal innovative techniques used to address these challenges, including LLM observers, prompt co-pilots, and human-in-the-loop feedback systems to refine prompts.
Part Two: We will focus on what industry practitioners are doing differently from research and academia, aiming to drive collaboration that advances both research and real-world solutions in APO.
Part Three: We will dive into Martian's research on automated prompt optimization. The goal is to introduce concrete solutions we've found for these problems and to lay out where we intend to improve further.

By starting with the key challenges in prompt engineering encountered by leading AI companies and the solutions they have implemented, we lay the groundwork for understanding the current state of APO. We then provide detailed interview summaries for each company, showcasing how they are addressing these issues today and their innovative ideas for the future. This sets the stage for our next article, where we will explore the cutting-edge solutions researchers are developing for prompt engineering.
Key Issues and Solutions in Prompt Engineering

Challenges Faced by Prompt Engineers

Variability in Prompting Techniques Across Models
Model Differences: Each model has unique characteristics due to variations in architecture, training data, design philosophies, context window sizes, and multimodal capabilities. This diversity necessitates tailored prompts for each model.
Non-Plug-and-Play Integration: Integrating new models into an existing AI stack requires reconsidering, testing, and potentially rewriting each prompt, making the adoption of new models challenging and time-consuming.

Model Drift
Model Point Updates Impact Prompts: Updates to models can alter how they interpret and respond to prompts. A prompt that works with one version might not work with a newer version, causing inconsistency and requiring continuous adjustments.

Secret Prompt Handshakes
Hidden Optimization Tricks: Small changes in prompts can lead to significantly different outputs. These "secret handshakes" are often discovered through trial and error, making it challenging to predict and standardize effective prompting techniques. A striking illustration of this phenomenon is outlined in a recent blog post by Anthropic, which describes the research team's efforts to enhance the performance of Claude 2.1 on the Needle In A Haystack (NIAH) benchmark. This benchmark assesses a model's capacity to comprehend extensive contexts by burying a sentence within large documents and testing the model's ability to find "the needle in a haystack" across a range of document sizes and needle depths. In the blog post, the Anthropic team reports that simply adding the sentence "Here is the most relevant sentence in the context:" to the prompt boosted Claude 2.1's score from 27% to 98% in the evaluation.

End Users Prompting LLMs
Many companies enable end users to create their own prompts as part of their product experience, leading to usability issues. Allowing users to prompt LLMs directly can result in inconsistent results and quality control challenges.

Best Practices a Black Box
Limited Understanding: The best practices for prompting are not always fully known, even by the model developers. Effective prompting strategies are often discovered through experimentation and shared within the AI community.

Key Themes from the Interviews

Tailoring Prompts for Diverse Models
Companies consistently highlighted the need to customize prompts for different models, each with unique requirements and behaviors. This customization, while essential for optimal performance, is seen as excessively labor-intensive.

Managing Model Drift
Model updates and drift are common, frustrating, and time-consuming issues. Companies emphasized the ongoing effort needed to adjust prompts and maintain consistency in outputs as models evolve. This includes setting up "evals", automated methods to measure the quality of an LLM output, to alert the prompt engineer to degradations in performance. Relying on customers to complain remains the most prevalent approach. The frustration level here suggests that a product dedicated solely to managing model drift could possibly stand on its own.

End Users Prompting LLMs
Allowing users to prompt LLMs directly can result in inconsistent results and quality control challenges.
Copy.ai has implemented an AI system that acts as an automated prompt engineer to help novice users achieve better results, proudly stating, "No PhD in prompt writing required!"

Leveraging Simulation and Evaluation Tools
Advanced companies are using sophisticated tools and simulations to test and refine prompts. For example, Mercor uses AI models to simulate two-sided interviewer and candidate conversations, with an LLM judge evaluating the effectiveness of these conversations.

Human-in-the-Loop Feedback Systems
Several companies use human-in-the-loop feedback systems to continuously improve prompts. This approach allows for real-time adjustments based on user feedback (thumbs up, thumbs down, and comments), enhancing the relevance and accuracy of AI outputs.

Emergence of LLM Judges
The LLM judge pattern involves using one or more LLMs to evaluate the quality and effectiveness of prompts and model outputs. Typically associated with the "evals" framework, companies are expanding the concept beyond the evaluation and monitoring of LLM outputs.

LLM Prompt "Orchestrator"
Supernormal has implemented a prompt layer that acts as a quality monitor and orchestrator, checking whether meeting notes include follow-up action items. If no action items are found, the tokens are not sent to the part of the prompt chain that extracts action items, improving latency and lowering costs.

Automated Systems Transparency
Companies are eager to build trust in and learn from their automated systems. A system that creates automated prompt improvements should share insights with prompt engineering and product management teams. These insights are so valuable that the team at G2 considers this area potentially the most exciting in APO.

Downstream Success Signals as Feedback
Autobound focuses on using downstream success signals, such as the open and reply rates of the personalized email messages its system creates, as feedback for prompt optimization.

Recursive Iteration
G2 emphasized an interest in research into auto-recursive iteration for refining prompts and improving performance.

By understanding these key issues and the innovative solutions implemented by leading AI companies, we hope you can anticipate some of these challenges and begin planning to address them. At the same time, we are better prepared to engage research and academia to collaborate on future solutions to APO.
Next, we summarize our company interviews, showcasing each company's specific issues, how they are addressing them today, and their innovative ideas for the future.
Company Interviews

Mercor Company & AI Product Overview

At Mercor, we're transforming the talent assessment and recruitment landscape with our AI-driven approach. Our main goal is to automate the tasks that human recruiters typically handle, making the hiring process much more efficient and effective.
For instance, instead of having recruiters manually review resumes, we use AI models that can evaluate and screen resumes quickly and consistently. We've also developed a Zoom-like interface for conducting interviews, where our AI can have back-and-forth verbal conversations with candidates to assess their suitability for a role.
One of the standout features of our platform is that our AI can review the code developers have posted on GitHub, as well as other types of portfolios associated with candidates. This allows us to gain deeper insights into their skills and experience beyond what's listed on their resumes.
Overall, our approach aims to enhance the experience for both companies and candidates. Today, we have 300k+ people on the platform and 50+ companies that hire through it.
Prompt Engineering Environment In terms of our AI development environment, we use a combination of open-source and proprietary models, ensuring the most effective and efficient solutions for different tasks.
For example, we’ve developed proprietary methods for parsing resumes from PDF to JSON format, which enables better data handling and integration. For tasks requiring high-speed processing, such as semantic search and query rewriting, we use low-latency open-source models. Conversely, for more complex tasks that require advanced reasoning, like conducting AI interviews, we use frontier models like GPT-4 and Claude.
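To make this split concrete, here is a minimal sketch of routing tasks to models by their latency and reasoning profile. The model names, task labels, and the call_llm helper are illustrative placeholders, not Mercor's actual implementation.

```python
# Illustrative sketch: route tasks to models by latency vs. reasoning needs.
# Model identifiers and call_llm are placeholders for a real provider call.

LOW_LATENCY_MODEL = "open-source-small"   # e.g. semantic search, query rewriting
FRONTIER_MODEL = "frontier-large"         # e.g. AI-led interview reasoning

def pick_model(task_type: str) -> str:
    """Choose a model based on the latency/reasoning profile of the task."""
    if task_type in {"semantic_search", "query_rewrite"}:
        return LOW_LATENCY_MODEL
    return FRONTIER_MODEL

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real completion call to the chosen provider."""
    return f"[{model}] response to: {prompt[:40]}..."

print(call_llm(pick_model("query_rewrite"), "Rewrite: senior backend roles in fintech"))
print(call_llm(pick_model("interview_turn"), "Ask a follow-up about the candidate's GitHub project"))
```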
Automated Prompt Optimization
Regarding automated prompt optimization, the most complex area of Mercor's system involves crafting AI-led interview conversations to ensure context relevance and a lifelike, dynamic question flow. In the real world, interviews are dynamic, branching into new lines of questioning based on a complex set of criteria, including the candidate's responses, resume, portfolio, and other personal data. For us, this part of the experience has to be of very high quality.
A key aspect of ensuring this quality is our simulation infrastructure. We run numerous simulations where AI models act as both interviewers and candidates. This helps us test and refine our interview conversation questions and flow, ensuring they are relevant and effective across various fields, from software engineering to finance to law.
On top of these simulations, we also have an AI judge that reviews interview transcripts, flags errors, and helps us continuously improve our models. This allows us to ensure high-quality and accurate AI-driven interviews.
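As a rough illustration of this pattern, the sketch below pairs a two-sided simulation loop with a judge pass over the resulting transcript. The call_llm stub, prompts, and turn structure are assumptions for demonstration, not Mercor's production code.

```python
# Hypothetical sketch of a two-sided interview simulation with an LLM judge.

def call_llm(role_prompt: str, history: list[str]) -> str:
    """Placeholder for a real chat-completion call."""
    return f"<reply given {len(history)} prior turns>"

def simulate_interview(job_description: str, candidate_profile: str, turns: int = 6) -> list[str]:
    """Alternate interviewer and candidate turns to build a synthetic transcript."""
    transcript: list[str] = []
    for turn in range(turns):
        if turn % 2 == 0:
            msg = call_llm(f"You are an interviewer for: {job_description}", transcript)
            transcript.append(f"Interviewer: {msg}")
        else:
            msg = call_llm(f"You are this candidate: {candidate_profile}", transcript)
            transcript.append(f"Candidate: {msg}")
    return transcript

def judge_transcript(transcript: list[str]) -> str:
    """Ask a judge model to flag irrelevant or low-quality questions."""
    rubric = "Score relevance and question flow; list any errors to fix."
    return call_llm(rubric, transcript)

transcript = simulate_interview("Backend engineer, fintech", "5 yrs Python, open-source contributor")
print(judge_transcript(transcript))
```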
G2 Company & AI Product Overview
At G2, we're constantly pushing the envelope to give our buyers, sellers, and service providers the best software purchasing experience.
We launched Monty — the world’s first AI-powered B2B software recommendation assistant — to help our 90 million annual buyers instantly sort through the 2,000+ categories and 160,000+ products and services in the G2 catalog to find their next best software based on their unique requirements.
Monty is already supporting over 30,000 conversations each month with buyers, and finding the optimization point between quality and cost is always top of mind for us. While we often spring for the newest model available to give buyers the best experience, as models become incrementally more capable and less differentiated, being able to quantify the opportunity cost of our LLM changes becomes extremely valuable.
Prompt Engineering Environment
Monty has a complex architecture built around multiple decision layers and a robust suite of commands. At different stages in the chat process, we require LLMs to output varied responses, from structured responses that can be parsed programmatically to human-readable text that the end-user can consume.
Being able to automatically tweak prompts and switch between models based on performance and the task at hand is becoming increasingly important to us as our product gains complexity and functionality.
Model drift and model migration are significant challenges for us. To ensure Monty stays current with the most cutting-edge AI models available, we undergo a rigorous prompt re-engineering process to avoid prompt drift. Abstracting this work to a prompt optimization system could not only save us time but also provide confidence and directionality as we integrate newer models.
Automated Prompt Optimization
To this end, we’ve researched recursive self-improvement of prompts based on evaluations and other prompt optimization techniques. One of the most exciting aspects of this technology is the potential to gain visibility into the internal learnings of the prompt optimization system. Understanding how various models score for a given prompt and a collection of prompts across quality, cost, latency, and other metrics would be incredibly beneficial. Additionally, gaining insights into automated prompt alterations done by the system could help our prompt engineers gain exposure to a broad range of prompting techniques applicable to a wide set of LLMs.
We are excited to watch this area of LLM tooling develop and believe it will significantly enhance our ability to deliver the best software purchasing experience.
Copy.ai Company & AI Product Overview
Prompt engineering and prompt management are massive undertakings at copy.ai. Our platform is designed to power complete go-to-market strategies with AI for sales and marketing teams. It includes over 400 pre-built workflows, each containing multiple prompts, addressing numerous sales and marketing use cases. For example, workflows handle tasks such as “Conduct competitor analysis from G2 reviews” or “Build customer sales FAQs from product documents.” Our platform operates with well over 2,000 out-of-the-box LLM prompts. Additionally, we have over 15 million users on our platform, many of whom create custom workflows using our no-code workflow builder, housing prompts they’ve written. When you do the math, our platform houses and executes a staggering number of prompts.
Prompt Engineering Environment
In our efforts to automate the improvement of prompts, our platform serves a diverse range of end users with varying levels of prompting experience. For many, copy.ai is their first experience with prompting an LLM. Naturally, we want all our users to achieve the best possible results on our platform. To assist with this, we’ve developed an AI system that processes our end-user prompts and operates behind the scenes within the copy.ai product. This helps users of all experience levels get better results and has proven highly effective. In our marketing materials, we proudly state, “No PhD in prompt writing required.”
Automated Prompt Optimization
Looking at a larger prompt engineering organization and the future of automated prompt optimization, you can imagine a unified prompting layer that interfaces effectively with multiple models and is designed with the needs of the engineering organization in mind. This system would contain a translation layer that has learned the unique prompting nuances that maximize each model's performance. By building this model-level prompting intelligence into a common infrastructure that handles various models' unique prompting requirements, full-fledged prompt engineers and, by extension, end-user customers can focus more on their use case requirements and less on mastering the nuances of each model. This abstraction layer between the application and the LLM would allow for model and vendor flexibility and independence, resulting in better outcomes for users, prompt engineers, and AI development teams.
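A minimal sketch of what such a translation layer could look like follows. The style hints, model names, and rewrite_for_model helper are hypothetical, used only to show the idea of turning one canonical prompt into model-specific variants; they are not copy.ai's implementation.

```python
# Hedged sketch of a "translation layer" that rewrites one canonical prompt
# into model-specific variants based on learned prompting conventions.

MODEL_STYLE_HINTS = {
    "model-a": "Prefix instructions with an explicit system-style preamble.",
    "model-b": "Use XML-like tags to delimit context and task.",
    "model-c": "Keep instructions terse; put the output format last.",
}

def rewrite_for_model(canonical_prompt: str, model: str) -> str:
    """Apply model-specific prompting conventions to a canonical prompt."""
    hint = MODEL_STYLE_HINTS.get(model, "")
    return f"{hint}\n\n{canonical_prompt}" if hint else canonical_prompt

canonical = "Summarize the top three objections raised in these G2 reviews."
for model in MODEL_STYLE_HINTS:
    print(f"--- {model} ---")
    print(rewrite_for_model(canonical, model))
```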
Autobound Company & AI Product Overview
At Autobound, we've focused deeply on delivering hyper-personalized emails for sales organizations. To make a truly transformative impact on this high-value use case, we've developed a sophisticated system that integrates data from over 150 sources, including news, social media, press releases, SEC filings, podcast mentions, and more. Today, over 10,000 users have signed up for Autobound to significantly increase email engagement through the heightened personalization Autobound achieves.
Prompt Engineering Environment
For Autobound to realize our vision, our product had to effectively synthesize all this information in the way a top salesperson would, prioritizing potential content inputs to ensure messages are relevant and contextual to the prospect's phase in the buyer's journey. As we honed this use case, our prompt chains became quite complex. Our first major step in optimizing our prompt development process was adopting Vellum as our platform to scale prompt chaining, versioning, and evaluations, which has made a significant difference for us. Robust tooling is essential for this, and building this tooling is not where we want our core competency to be.
Currently, our prompt optimization process includes humans in the loop. Users can provide feedback on our email outputs with thumbs up or thumbs down and comments. Our prompt engineers manually review this input and make necessary adjustments to the prompts.
Automated Prompt Optimization
Imagining the future of automated prompt optimization for our use case, we see the potential to leverage user signals as well as downstream signals from our customers' email systems, capturing open and reply rates for automated prompt improvements. Data privacy issues need to be addressed; to accomplish this, we would obfuscate all email engagement data, retaining only the prompt chains associated with the engagement data. This would indicate whether a prompt or prompt chain resulted in higher engagement. This data could initially be used for manual review. Additionally, we could ask an LLM judge to evaluate prompt chains and assess the characteristics likely contributing to higher engagement rates. The LLM judge could recommend prompt improvements to further enhance email engagement rates. These changes would be reviewed by our prompt engineers and, if accepted, put into production. We would split-test these changes against our best-performing set of production prompts, creating an ongoing optimization process.
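As a rough sketch of the first step in such a loop, the snippet below ranks prompt chains by obfuscated engagement statistics so the weakest candidates can be routed to a judge or an engineer for review. The field names and numbers are illustrative assumptions, not Autobound's schema or data.

```python
# Sketch: score prompt chains by downstream engagement signals (open/reply rates)
# as a precursor to judge review and split testing. Data is illustrative only.

from dataclasses import dataclass

@dataclass
class ChainStats:
    chain_id: str
    sends: int
    opens: int
    replies: int

    @property
    def reply_rate(self) -> float:
        return self.replies / self.sends if self.sends else 0.0

def rank_chains(stats: list[ChainStats]) -> list[ChainStats]:
    """Rank prompt chains by reply rate so the laggards get reviewed first."""
    return sorted(stats, key=lambda s: s.reply_rate, reverse=True)

observed = [
    ChainStats("chain-news-hook", sends=500, opens=260, replies=41),
    ChainStats("chain-funding-hook", sends=480, opens=300, replies=57),
]
for s in rank_chains(observed):
    print(f"{s.chain_id}: reply rate {s.reply_rate:.1%}")
```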
Based on our extensive experience with LLMs, we see this as an exciting direction forward.
6sense Company & AI Product Overview
At 6sense, we have successfully integrated multiple AI use cases into our products, serving thousands of marketers and sales professionals daily. Customer-specific predictive models leverage prospect and company information from the internet as well as first-party activity data from the customer's websites, marketing automation, and CRM activity. We enhance this with company-level (account-level) intent data from our proprietary network and partners such as Bombora, G2, TrustRadius, and Gartner, driving AI-powered intelligence, automation, and personalization for B2B sales and marketing professionals. Using complex prompt chains across numerous AI models, we help B2B marketers predict pipeline and revenue, prioritize outreach, and create highly personalized messages for their prospects.
Prompt Engineering Environment
With the numerous data sources and use cases we support, complexity arises. We operate interdependent prompt chains, each calling distinct LLMs, managed by a growing team of prompt engineers. The work of one prompt engineer, the introduction of a new LLM, or LLM drift can significantly impact the final output and, consequently, our customers' experiences. To further complicate things, we foresee customers with a GenAI-first mindset creating their own prompts, which need to integrate with the prompts we provide out-of-the-box.
Automated Prompt Optimization (APO)
Our vision for automated prompt optimization (APO) keeps our prompt engineers in the loop. We envision an overarching AI that understands the goals of the overall system and the interactions between prompt chains. This system would ingest external signals, such as user feedback on account summaries, draft emails (e.g., thumbs up or thumbs down), and feedback from our prompt engineers to increase its effectiveness over time. As the APO system creates prompt revisions, our prompt engineers would review and learn from these revisions.
We see APO similarly to GitHub Copilot, as a specialized system that optimizes prompts for LLMs while having the intelligence to understand the goals of our product offerings and our LLMs as a working system.
Zelta AI Company & AI Product Overview
Zelta AI automates voice of customer analysis for software product teams. We collect data from sources like sales calls, support tickets, and social media to identify pain points, feature requests, and competitive mentions. This information aids product teams in prioritizing roadmaps and discovering new features and helps marketing teams improve messaging using their customers' own words. This ensures companies can effectively address customer needs and highlight product strengths. Today we support over 20 companies such as Bubble.io, Thoropass, Slate, Mero and Allbound.
Prompt Engineering Environment
A key challenge for our product is optimizing the prompts users enter directly into it. Our core feature allows users to create reports around specific topics of interest to them; for example, a user could input a prompt asking for all requests related to a certain feature in their product. A typical challenge here is users not providing sufficient context around their ask: they will typically use internal company jargon or shorthand in their questions, which can be misinterpreted by the LLM.
Automated Prompt Optimization (APO)
To overcome this prompt engineering obstacle, we have developed a co-pilot that asks the user for further input when confidence is low (based on many diverse data points returned from vector retrieval). The co-pilot confirms its understanding of the user's question and provides a list of data points it thinks are relevant. The user confirms the relevant data points, which are then used to run a new vector search and create the final report.
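A minimal sketch of this low-confidence clarification loop is shown below. The topic-diversity heuristic, the retrieve stub, and its canned results are assumptions for illustration, not Zelta AI's actual retrieval logic.

```python
# Sketch: ask for clarification when vector retrieval returns too many distinct topics.

def retrieve(query: str) -> list[dict]:
    """Placeholder vector search; each hit carries a text snippet and a topic label."""
    return [
        {"text": "Users ask for bulk export", "topic": "exports"},
        {"text": "Complaints about SSO setup", "topic": "auth"},
        {"text": "Questions about seat pricing", "topic": "pricing"},
    ]

def confidence_is_low(hits: list[dict], max_topics: int = 2) -> bool:
    """Treat retrieval spread across many distinct topics as a sign the ask was ambiguous."""
    return len({h["topic"] for h in hits}) > max_topics

def run_report(query: str) -> str:
    hits = retrieve(query)
    if confidence_is_low(hits):
        # In the product, the co-pilot would show these hits and ask the user to
        # confirm which ones are relevant before re-running the vector search.
        topics = sorted({h["topic"] for h in hits})
        return f"Clarify: which of these areas did you mean? {topics}"
    return f"Report built from {len(hits)} confirmed data points."

print(run_report("what are users asking for?"))
```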
We are excited about the progression in the field of APO and look forward to seeing where it goes.
EDITED Company & AI Product Overview
EDITED is a retail intelligence company that offers data analytics software to help brands and retailers stay competitive and boost margins and sales. Our platform provides retailers with market context visibility, AI-driven insights for predictability, and the ability to act in the moment. Currently, we serve over 200 global brands and retailers, including Abercrombie & Fitch, Chico’s, J.Crew, Lane Bryant, and Puma.
We use LLMs in several ways to enhance our data analytics. First, we deploy over 5,000 web crawlers to collect data from retail websites, capturing details like product prices, stock availability, descriptions, care instructions, and images. LLMs then automatically parse the page content to identify and extract relevant information, eliminating the need for manual coding and enabling scalability across different site structures and languages. This automation saves us a huge amount of time, especially when dealing with frequent site changes.
Further, we use LLMs to summarize dashboard insights for users. We load widget-level insights into the LLM and have it summarize key takeaways for the dashboard. For example, the summaries might highlight where a retailer has assortment gaps versus competitors or call out key categories, colors, and patterns being heavily discounted. These short paragraph summaries dramatically increase speed to insight for our users and make email sharing quite convenient.
We have also experimented with LLMs to summarize content and promotional differences between pages. For example, a user might search for specific promotions or offers, like back-to-school sales, and retrieve related images and text. LLMs can summarize and compare the findings, answering questions like, "What has changed between Nike’s and Adidas' promotional offerings today?" This gives clients human-like insights into competitive differences and market trends.
Additionally, we are prototyping a feature that uses LLMs to summarize products. This helps answer questions such as, "What clusters of attributes best represent a collection of new products launched by our competitors?" or "What unique, high-performing items do competitors have that we don't?" LLMs cluster products and provide summaries, allowing businesses to identify trends and follow successful competitor strategies quickly.
Prompt Engineering Environment
Today we are largely an OpenAI shop.
One area that's not clear to us in prompt engineering is the user experience for end customers to prompt our system directly. We're envisioning a prompt-based interface where users can ask questions about their data. We've been exploring different design paradigms to structure the data and guide users with suggested prompts to ensure they get accurate answers. The challenge is deciding whether to create predefined prompts or allow open-form queries. The AI is very powerful if used correctly. The concern is it could also lead users to miss important information or come to the wrong conclusions if the prompting is not done effectively.
Automated Prompt Optimization
To this end, we’ve researched ways to automate the improvement of user prompts to enhance the overall experience. We might consider an LLM Judge / Prompt Advisor concept that would monitor and refine the prompts users create, making them more effective.
Perhaps during the first phase of this LLM Judge / Prompt Advisor project, we would run the Judge as a sort of co-pilot for our prompt engineering team. It would monitor the prompts and results users are getting from the system and compare this data to a benchmark set of use cases. The result would be a ground truth of “good prompts” and “good data” outputs for common retailer queries.
The LLM Judge / Prompt Advisor would check the clarity of user prompts. Are the prompts clear and specific? Then, it would check for any ambiguous or vague terms that might lead to inaccurate answers. Relevance to the available data is also important. The LLM Prompt Advisor would be instructed to look at whether the prompts are asking for information that the system can provide based on the available data. It would determine if the prompts include all necessary details, such as dates, specific attributes, or categories, and identify common pieces of information that are often missing and result in incomplete answers.
The LLM Prompt Advisor would provide reports to our prompt engineers and product managers, surfacing what is going well, areas for improvement, and topics for deeper investigation.
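To make the idea concrete, here is a hedged sketch of how such an advisor's checklist might be composed into a judge prompt before a user query is executed. The checks, wording, and build_advisor_prompt helper are illustrative assumptions, not EDITED's implementation.

```python
# Sketch: compose a judge prompt that scores a user query against an advisor checklist.

ADVISOR_CHECKS = [
    "Is the prompt clear and specific?",
    "Does it contain ambiguous or vague terms?",
    "Does it ask for data the system actually has (prices, stock, discounts)?",
    "Does it include necessary details such as dates, categories, or attributes?",
]

def build_advisor_prompt(user_prompt: str) -> str:
    """Turn the checklist into a single judge prompt for a given user query."""
    checklist = "\n".join(f"- {c}" for c in ADVISOR_CHECKS)
    return (
        "Review the retailer query below against each check and suggest a rewrite "
        "if any check fails.\n"
        f"Checks:\n{checklist}\n\nQuery: {user_prompt}"
    )

print(build_advisor_prompt("Show discounting on dresses"))
```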
Over time you can imagine turning the LLM Prompt Advisor into a customer-facing Co-Pilot or AI Teammate that would recursively engage with users to understand the intent of their analysis, to learn from power users and help craft the best prompt to get the most complete answer our system has to offer.
Supernormal Company & AI Product Overview
At Supernormal, we're developing cutting-edge AI and new AI interfaces for enhancing workplace productivity, and these demand extensive prompt engineering and management. We want to automate and augment routine tasks like notetaking, agenda setting, and follow-up work, allowing teams to focus on the highest-value work that truly matters. Currently, we are dedicated to streamlining the meeting experience, enabling teams to manage their meetings and make the most of their time together.
Our platform offers a comprehensive set of features for meeting management, including AI-driven transcription, notes, action items, and agendas. These tools are designed to facilitate seamless team collaboration without the need for tedious coordination. Today, over 275,000 teams, including industry leaders such as Red Hat, Salesforce, and Forbes, rely on Supernormal to get more out of their meetings.
Prompt Engineering Environment
We have built a robust suite of in-house tools within our development stack for prompt management and evaluation, and we have to manage a few dozen prompts across our system using a variety of different models. A lot of our automation is for optimizing cost, latency, or some other aspect of system performance. We use LLMs for more than just generating the text, like notes or agenda items, that users see and interact with.
Automated Prompt Optimization
A key innovation is to use automated prompt evaluation to maximize resource efficiency and quality. For example, before processing a prompt to generate action items from meeting notes, a low-cost model first assesses whether action items are present. If that’s not the case, subsequent prompts are bypassed, conserving resources and improving latency. We also use prompts to check and remove defects in the generated output, like removing low-quality notes. These approaches are implemented throughout our prompt chains, ensuring cost-effective and timely meeting summarization.
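As a minimal sketch of this gating pattern, the snippet below runs a cheap yes/no check before the more expensive extraction prompt is ever invoked. The model names, prompts, and call_llm stub are placeholders for illustration, not Supernormal's stack.

```python
# Sketch: a cheap gating check decides whether the expensive extraction prompt runs.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real completion call; returns canned text for the demo."""
    if model == "small-cheap-model":
        return "yes" if "follow up" in prompt.lower() else "no"
    return "- Ana to follow up with legal by Friday"

def extract_action_items(notes: str) -> list[str]:
    cheap_check = call_llm(
        "small-cheap-model",
        f"Do these notes contain follow-up tasks? Answer yes/no.\n{notes}",
    )
    if cheap_check.strip().lower() != "yes":
        return []  # skip the expensive extraction prompt entirely
    raw = call_llm("larger-model", f"List the action items in these notes:\n{notes}")
    return [line for line in raw.splitlines() if line.strip()]

print(extract_action_items("Team agreed Ana will follow up with legal by Friday."))
print(extract_action_items("General discussion about the roadmap, no decisions."))
```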
With all of these prompts and a steady stream of model changes in the ecosystem, we can envision a future where we can automatically optimize prompts for new models as they appear or for changing user patterns, similar to how a product manager or prompt engineer might look to improve things or get them to run on less expensive models. We can imagine a system that analyzes prompt usage, identifies areas for improvement, and adapts to run on more cost-effective models when evals tell us quality stays high. This automation would not only save time and money, but also enable our engineers to focus on shipping the next innovations in the product experience.
- Jim Kleban, Head of AI @ Supernormal
Conclusion and Invitation to Participate

In this first part of our series on Automated Prompt Optimization (APO), we've explored the real-world challenges and innovative solutions implemented by leading AI companies like Mercor, G2, Copy.ai, Autobound, 6sense, Zelta AI, EDITED, and Supernormal. These companies are pioneering advancements in prompt engineering, tackling issues such as model variability, drift, and the complexities of allowing users to prompt LLMs directly.
As we continue this journey, Part Two will engage the research community and examine how industry practice differs from academic work on APO. Finally, Part Three will propose solutions to the APO challenges outlined here.
We invite you to join us in this exploration, share your thoughts, and contribute to the evolving conversation on APO. Whether you are an AI practitioner, researcher, or enthusiast, your insights and experiences are invaluable in shaping the future of automated prompt optimization. Reach out to us at contact@withmartian.com. We would love to hear from you.
Additional Thanks

In addition to the people already mentioned in this article, I would like to thank the following people for contributing:
Mahima Chhagani - AI Prompt Engineer, Meta
Kyle Coleman - CMO, Copy.ai
Akash Sharma - CEO, Vellum
Anita Kirkovska - GenAI Growth, Vellum
Jenny Gardynski - Director of Communications, G2
Thor Ernstsson - Founder, ArcticBlue.ai
Francesco Magnocavallo - Group Generative AI Strategist, Digital360
Daniel Karlsson - Co-Founder, AI Advisor, TOKSTARK
Emily E - Head of Marketing, Supernormal
Sean Anderson - Head of Product Marketing, Vectara
Ofer Mendelevitch - Head of Developer Relations, Vectara
Shellie Vornhagen - CMO, EDITED