
With 620 million monthly users, running a frontier model for every image recommendation is not a strategy; it is a bill. Pinterest CTO Matt Madrigal addressed this by stripping out Qwen3-VL’s vision layer and rebuilding it with proprietary embeddings, reducing costs by 90% and increasing accuracy by 30%.
Madrigal’s team has been investing heavily in customizing open source models “fundamentally internally.”
“If you have really unique data that you can then fine-tune an open source model with, the quality of the data will, frankly, exceed the size of the model,” Madrigal explained in a recent episode of the VB podcast Beyond the Pilot.
How Pinterest Customized Qwen for Visual Discovery
Pinterest, which has around 620 million monthly active users, has long used open source models for search and visual discovery, dating back to Google’s BERT and OpenAI’s CLIP. The company developed its own Pin CLIP from the latter, incorporating proprietary visual embeddings and image metadata.
Pinterest’s conversational shopping assistant, Navigator 1, was built on Qwen3-VL and customized in “pretty significant” ways. Essentially, Madrigal’s team stripped out Qwen’s vision encoder layer and fine-tuned the model on proprietary multimodal embeddings. This lets them capture metadata around pins and images in embeddings that can be pre-computed offline and periodically retrained with new information to deliver personalized experiences.
“Open source models, especially with open Apache licenses where you can really modify a lot of open weights and customize them for unique use cases – that’s where we’ve found open source to be so powerful for us,” Madrigal said.
Incorporating its own embeddings gives the team context around metadata, pins, and images. More importantly, the model performs better at inference time. Without these embeddings, developers would have to call the encoder on each returned image at runtime, one at a time. That results in “20 times worse” latency from an inference perspective, Madrigal said.
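The trade-off between encoding at runtime and looking up pre-computed embeddings can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_encode`, `CountingEncoder`, `precompute_embeddings`, and `embeddings_at_runtime` are hypothetical names, and the toy encoder stands in for a real vision model. It is not Pinterest’s implementation; it only shows why an offline batch job removes per-image encoder calls from the serving path.

```python
from typing import Callable, Dict, List

Vector = List[float]


def toy_encode(image_id: str) -> Vector:
    """Hypothetical stand-in for an expensive vision-encoder call."""
    # Deterministic toy features derived from the ID; a real encoder reads pixels.
    return [float(len(image_id)), float(sum(map(ord, image_id)) % 97)]


class CountingEncoder:
    """Wraps an encoder and counts how often it is invoked."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self.encode = encode
        self.calls = 0

    def __call__(self, image_id: str) -> Vector:
        self.calls += 1
        return self.encode(image_id)


def precompute_embeddings(image_ids: List[str], encoder: CountingEncoder) -> Dict[str, Vector]:
    """Offline batch job: encode every image once and store the result."""
    return {image_id: encoder(image_id) for image_id in image_ids}


def embeddings_at_runtime(
    returned_ids: List[str], store: Dict[str, Vector], encoder: CountingEncoder
) -> List[Vector]:
    """Serving path: read stored vectors, encoding only on a cache miss."""
    vectors = []
    for image_id in returned_ids:
        if image_id not in store:
            store[image_id] = encoder(image_id)
        vectors.append(store[image_id])
    return vectors


encoder = CountingEncoder(toy_encode)
store = precompute_embeddings(["pin_a", "pin_b", "pin_c"], encoder)
offline_calls = encoder.calls

vectors = embeddings_at_runtime(["pin_b", "pin_c"], store, encoder)
runtime_calls = encoder.calls - offline_calls
```

In this sketch the serving path makes no encoder calls for images already in the store, which is the property the pre-computed approach relies on. The actual latency gain depends on the encoder and the storage layer, neither of which is modeled here.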
“If it’s something that’s going to be critical to our end users, that’s going to drive engagement, that’s going to have to scale to over 600 million monthly active users, we’re probably going to build it or take advantage of open source and customize it to the max,” he said.
How a taste graph captures evolving interests
To guide users from inspiration to purchase, Madrigal’s team created a “taste graph”: a dynamic representation of what individual users really like, not just what they click on. “It’s this representation of the evolving tastes of billions of people,” he said.
People go to Google or other search engines when they have a clear idea of what they want; Pinterest is for when they’re still in the discovery phase, Madrigal said. Pinterest’s goal is to encourage “lateral exploration” and transform discovery into intent (i.e. clicking on ads or making purchases).
Under the hood, the architecture combines a graph structure with representation learning. User embeddings capture a user’s changing tastes and are continually updated based on activity, new content, and other signals. “It’s not a social graph,” Madrigal said. “It’s much more of a preference graph: What will inspire you? What are you trying to do next?”
For example, one user may be interested in mid-century modern designs; another may prefer the Nantucket aesthetic. Those preferences are captured in each user’s embeddings and, as a result, the taste graph surfaces specific and relevant products.
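As a rough illustration of how a user embedding might drift toward recent activity and then drive product ranking, here is a minimal Python sketch. The update rule (an exponential moving average) and the scoring function (cosine similarity) are assumptions chosen for simplicity; the article does not say which methods Pinterest uses. The two-dimensional vectors, the catalog items, and the function names are all hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_taste(user_vec: Vector, item_vec: Vector, rate: float = 0.5) -> Vector:
    """Move the user embedding toward an item the user engaged with (assumed EMA rule)."""
    return [(1 - rate) * u + rate * i for u, i in zip(user_vec, item_vec)]


def rank_products(user_vec: Vector, catalog: Dict[str, Vector]) -> List[str]:
    """Order catalog items from most to least similar to the user embedding."""
    return sorted(catalog, key=lambda name: cosine(user_vec, catalog[name]), reverse=True)


# Hypothetical item embeddings; axes are [mid-century modern, Nantucket].
catalog = {
    "walnut credenza": [1.0, 0.0],
    "wicker armchair": [0.0, 1.0],
    "teak sideboard": [0.9, 0.1],
}

user = [0.5, 0.5]  # a new user with no clear preference yet
for saved in ("walnut credenza", "teak sideboard"):
    user = update_taste(user, catalog[saved])

ranking = rank_products(user, catalog)
```

After two mid-century saves, the user vector leans toward the first axis, so the mid-century items rank above the Nantucket-style one. A production system would learn these embeddings rather than hand-assign them.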
“It goes from the upper funnel, the discovery of inspiration, to the intention of the lower funnel,” Madrigal said.
Listen to the full podcast to learn more about:
- How Pinterest uses sandboxes to encourage creativity in a safe and contained way;
- Why a continuous feedback loop can prevent AI visual decline;
- The importance of constant benchmarking to measure user engagement, performance, latency, and other factors.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.





