Your LLM issues are really data issues


Ryan Donovan and Harsha Chintalapani explore how schema changes, inconsistent definitions (like “customer”), and weak governance can break both your analytics and your ML models, and what companies can do to get their data AI-ready, from metadata management to observability.

Collate is a semantic intelligence platform built on a semantic metadata graph for discovery, governance, and AI observability across your data ecosystem.

Connect with Harsha on LinkedIn.

Congrats to user buttonsrtoys, who won a Famous Question badge for their question "Possible to edit PDF without embedded font installed?"

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, your host, and today we are talking about [how], with AI, the rubber hits the road when it tries to dig into production data. It can't handle all that structured data there. So, we're gonna talk about what the issue is, and potential solutions, with the co-founder and CTO at Collate, Harsha Chintalapani. So, welcome to the show, Harsha.

Harsha Chintalapani: Thanks, man.

Ryan Donovan: Of course. So, before we get into it, we'd like to get to know our guest a little bit. So, tell us a little bit about how you got into software and technology.

Harsha Chintalapani: Yeah, so my career started back at Yahoo, all the way back in 2007, when I started working on search engines in product search, and how do we rank, and everything else. So, that's how I started, and that's where I started getting exposed to big data, how you index large amounts of data.
And thankfully, Yahoo, at that time, had a research wing that was working on productionizing MapReduce, which today is Hadoop, and I got my hands on it to actually scale our indexing system for product search. So, that's how I got into data, and that's where my journey in tech started.

Ryan Donovan: We've been told one of AI's sort of very good use cases is processing and understanding data, but [we] definitely have heard of some issues where the data was not processed correctly or not understood. And today, we're talking specifically about the issues with production data, structured real-time data. What are the issues around that?

Harsha Chintalapani: Yeah, I think I can go into a few examples of where we are coming from. So again, going back to the Yahoo days, you didn't really think of real-time data itself being at a big scale back in 2008. So, it started [with] Yahoo. Obviously, Google had produced MapReduce, and Yahoo open-sourced that in the form of Hadoop. It helped web indexing and product search indexing scale to the huge amounts of data that we get across the web: how do we actually index it in real time, at scale, and put it into a search engine so that users can use it through the app? So, from there, my co-founder, who is one of the core engineers behind Hadoop, went on to be [one of the] co-founders of Hortonworks, which not only open-sourced all of the technologies that we were building in Hadoop at Yahoo, but made it a supportable, enterprise-ready, deployable solution. So, I got to meet Suresh at Hortonworks, [and] became a committer and PMC [member] in Apache Kafka and Apache Spark, which deal with real-time data and indexing in distributed systems. So, our focus during the early stage of the big data movement was how do we actually scale and deploy systems that can accept the data that an organization is storing, process it efficiently, and make it available for analytics, right?
That's where your entire business intelligence comes from, your ML models, and everything else. So, during our journey, what we end up [with is] user behavior, the company's behavior is, 'Okay, no, [inaudible] is there, and everything else, great.' And we saw the movement towards the cloud solutions themselves. What we noticed is that distributed systems, the complexity of processing, was becoming a solved problem, because you have amazing systems coming through the cloud providers, across Amazon, Google, Azure, and whatnot. And we asked: if data can be stored and processed efficiently, is that data actually useful to the organization? Is it actually moving the needle? Are they getting the right business decisions and everything else? So, at that point we thought, hey, what are the companies that are kind of [inaudible] of the challenges and solutions, in fact, right? With that, we both joined Uber, whose entire product rests on data. Like, how do you analyze all of this data efficiently and make decisions, have riders and drivers make iterations in real time, and all of that, right? So, when [looking] from the outside in, Uber looked like an amazing place because there's amazing engineering work going on. We thought, hey, we're gonna learn a lot about how to operate data at scale, not just store and process it, but understand the data at scale. But when we went there, there were problems every day or every other day, and the problems had different scopes. The problems were not so much in processing the data itself, but in understanding the data. Like, when the schema changes, what breaks downstream? If you ask [for] a business concept, such as location, it depends on which team, which user, which engineer you ask: you get a different definition. So, location is such a core concept to Uber, [and] we were not able to understand [it] efficiently and universally. Then there's the data itself: discovery problems, right?
When data infrastructure became self-service, like it did at Uber, everyone wants to play with it because everyone wants to run some experiments [and] understand the data. So, if you have a trips table, there might be hundreds of different trips tables named slightly differently. Now, if someone is building a quarterly report and they wanna show that, 'in this quarter we made X amount of trips,' which table should they go to? This is a famous incident that happened. We underreported the number of trips taken because an analyst accidentally found a table that was not kept up to date. So, how do you analyze the data that is fresh and complete? Then come the lineage problems, then come the GDPR problems. When the GDPR mandate came in, I think 200 of us manually classified all the data across the board. So, how do you manage the data at this scale? To me, to your question, is this problem new to AI just because LLMs are coming into the picture? No, it's a problem that has existed. Now, it became amplified that much more, because now you have an entire data ecosystem you're just throwing [to the] LLM, and it can't figure out what you can't as a human. So, that's the challenge that you're facing.

Ryan Donovan: It's interesting. Uber seems to have pioneered a lot of real-time problems. We've talked to a bunch of other companies that have spun out of managing the massive service ecosystem there, and also the data there. So, we're talking about these problems: are these solely problems at Uber scale, or do these affect smaller companies, too?

Harsha Chintalapani: Yeah, so our initial reaction [was], hey, this is probably very unique to Uber because of the amounts of data, the amount of people, and all the combinations that can come into play. But when we went and talked to our network, we have been building that infrastructure from the Yahoo days, [and] Hortonworks has tons of customers of varying capabilities, right?
You have Fortune 500 all the way to mid-tier organizations, but after a certain very small scale, data becomes that much harder a problem to maintain because [of] your incentive. You wanna set data engineers free, in the sense that you don't wanna gatekeep them on how to use it, because you wanna understand how they analyze the data, and how they come up with the ideas, and the explosion of ML models too, right? You're actually infusing this data that is coming through for ML models to do A/B testing. You wanna analyze, for example, in the case of Uber, what is the ETA across the restaurants, what exists in all of this. So, there are a lot of cool things that happen because of the data that is coming in. So, it became a problem even at a middle scale, or even a small scale. Outside of, I would say, when a company comes in as a startup, everyone is- we are all data engineers to an extent. That's part of the problem, right? As soon as you start building a data team, or you start saying, 'hey, we have a product, we need to analyze the data that is coming in,' this becomes a problem. Even at Collate, we are a small company right now. We see, hey, what stage is this customer at, and who is the customer? Are they happy with our product? And all of these things, we're doing our own [inaudible] in terms of data. The data compared to Uber is extremely small, right? We don't even want to compare to that. But there are inherent challenges, even with small data, that relate to understanding the data. What is a customer? What does a customer mean? What are the different metrics you wanna measure against? All of them, beyond, let's say, 10 people, where you can talk to each other, it becomes that hard to maintain.

Ryan Donovan: A lot of companies split their production database and their analytics database. Is this something that comes from that, where you're taking trips or whatever, and aggregating them as something else?
Or is this a sort of standardization-of-names problem, or some secret third thing?

Harsha Chintalapani: Just to address the first point: so, you have your traditional databases where your application's running against your users, and APIs and [inaudible]. The need for big data, the need for the data warehouse, came because you don't wanna run a l
