▲Evaluating modular RAG with reasoning modelskapa.ai

62 points by emil_sorensen 126 days ago | 31 comments

serjester 124 days ago [-]

We tried something similar and found much better results with o1 pro than o3 mini. RAG seems to require a level of world knowledge that the mini models don’t have.

This comes at the cost of significantly higher latency and cost. But for us, answer quality is a much higher priority.

eternityforest 124 days ago [-]

RAG seems to work with 0.5 and 1.5B models just fine a lot of the time, it just can't handle anything that's not directly spelled out in the documents.

Or, at least it seems to in the limited amount of testing I did in a weekend. I'm an embedded dev without any real AI experience or an actual use case for building a RAG at the moment.

Foobar8568 124 days ago [-]

RAG is basically a fancy name to augment a prompt with data.

Companies are being sold they can augment their LLM with their unstructured massive dataset but it's all wishful thinking.

Workaccount2 124 days ago [-]

Yeah, LLM capabilities are measured with fresh context windows, yet people want to use them with 50k, 100k, 500k tokens.

As you pack in more and more context the model's abilities really start to deteriorate.

The first 10k tokens are the juiciest, after that it just gets worse and worse.

eternityforest 124 days ago [-]

Oh wow, I was thinking 500 tokens was way too much, since I've only ever done anything programmatic with tiny models on CPUs....

serjester 124 days ago [-]

That's essentially what an embedding model is - a smaller, faster model that's good at finding information quickly. Then you feed that to a larger, more powerful reasoning model to synthesize and you've invented RAG.

eternityforest 123 days ago [-]

In my limited weekend testing with just a CPU, the 1.5B model is the larger and more powerful model at the end!

I'm definitely excited to see what new applications are possible with NPUs, when we can run this stuff for real on stuff anyone other than enthusiasts can afford, without waiting 40 seconds.

emil_sorensen 124 days ago [-]

Super cool! Yep, a lot seems to get lost through distillation.

SubiculumCode 124 days ago [-]

I found it interesting the parts that discussed current limitations of llm's understanding of tools, despite apparent reasoning abilities, it didn't seem to have an intuitive understanding of when to use the specific search tools.

I wonder whether this would benefit from a fine tuned llm module for that specific step, or even by providing a set of examples in the prompt of when to use what tool?

EngineeringStuf 124 days ago [-]

Am I correct in reading that the RAG pipeline runs in realtime in response to a user query?

If so, then I would suggest that you run it ahead of time and generate possible questions from the LLM based on the context of the current semantically split chunk.

That way you only need to compare the embeddings at query time and it will already be pre-sorted and ranked.

The trick, of course, is chunking it correctly and generating the right questions. But in both cases I would look to the LLM to do that.

Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

TechDebtDevin 124 days ago [-]

> Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

Go on please :)

triyambakam 124 days ago [-]

So if the user submitted a question not already generated, would that be like a cache miss and it would instead fall back to a real time query?

EngineeringStuf 124 days ago [-]

Yes, but you could optimise the generated questions over time to reduce cache-misses.

ekianjo 124 days ago [-]

> time and generate possible questions from the LLM based on the context of the current semantically split chunk.

Possible but very compute intensive. Imagine if you have hundreds of thousands of chunks...

EngineeringStuf 124 days ago [-]

The number of chunks would be the same regardless of either approach.

The generation of questions can be done out-of-band by a cheaper model.

Their current implementation approach seems to require some computation per request. It would be a balance to see which strategy provides the most value.

The speed of responses overall would be faster.

aantix 124 days ago [-]

When aggregating data from multiple systems, how do you handle the case of only searching against data chunks that the user is authorized to view? And if those permissions change?

emil_sorensen 124 days ago [-]

We focus mainly on external use cases (e.g., helping companies like Docker and Monday.com deploy customer facing "Ask AI" assistants) so we don't run into much of that given all data is public.

For internal use cases that require user level permissions that's a freaking rabbit role. I recently heard someone describe Glean as a "permissions company" more so than a search company for that reason. :)

3abiton 124 days ago [-]

> fine-tuning a model on tool usage could also allow it to gain familiarity with specific retrieval mechanisms.

I am curious if finetuning on specific usecases would outperform RAG approaches, assuming the data is static (say company documentation). I know there has been lots of posts on this, but yet to see quanitifications, especially with o3-mini.

anonymousDan 125 days ago [-]

Is RAG any good for coding tasks?

mvieira38 124 days ago [-]

For what it's worth, Anthropic decided against using RAG in Claude Code (https://news.ycombinator.com/item?id=43164089)

Workaccount2 124 days ago [-]

I used Claude 3.7 last night to write a program for building tooling code from legacy manufacturing files for cnc and electronics manufacturing. Basically it renders the old files visually (they are sorta like SVGs) and then a human can click through them to create the necessary measurements, which the program then indexes and stores for the user. It has a nice GUI with buttons, highlights your selections, graphically demarcates previous measurements, and shows a running list of calculated outputs based on your selections, which you can delete if incorrect. Then you click export it exports it in the correct modern Place File structure. Totally knocked my socks and feet off too.

There are no programs online which do this (lots of viewers, but not interpreters/converters), and I actually had gotten a quote for proprietary software that can do it, but is $1k/yr to use.

I _did not_ think claude would be able to do it, but thought I would give it a shot. It took 3 prompts to get to 95% of the way there. The last 5% was done by o3mini because Claude ran out of capacity for me.

raggedasil 125 days ago [-]

I'd say it's essential to provide whatever you're asking context. In fully local environments I've been able to integrate the responses directly without having the generalize -> generate -> de-generalize loop, highly increasing LLM's value for me.

afhammad 124 days ago [-]

Could you share more on your local setup please?

mkesper 125 days ago [-]

Latency must be brutal here. This will not be possible for any chat application, I guess.

bauefi 125 days ago [-]

It depends on how you do retrieval. If you just use dense embeddings for example you can get the latency of one search query down to maybe something like 400ms. In that case multiple sequential look ups would be ok but your embeddings need to be good enough of course

laichzeit0 124 days ago [-]

It's not just the retrieval, tool calls entail another call to the LLM (ToolMessage) and possibly the result will then require other tool calls. Massive latency.

eternityforest 124 days ago [-]

These ultra fast embeddings are really cool, because you can just spam them at everything and it's pretty much instant.

I was able to get them to answer very simple questions without any vector database or pre indexing, just expanding the search query to synonyms, then using normal fulltext search, using embeddings to match article titles to the query, plus adding a few "Personality documents" that are always in every result set no matter what.

Then I do chunking on the fly based on similarity to to query.

Retrieval takes about 1 second on a CPU, but then the actual LLM call takes 10 to 40 seconds, because you need about 1500 bytes of context to consistently get something that has the answers in it... Not exactly useful at the moment on cheap consumer hardware but still very interesting.

https://huggingface.co/blog/static-embeddings

emil_sorensen 125 days ago [-]

Yep even with a small bump in performance (which we only saw for a subset of coding questions), it wouldn't be worth the huge latency penalty. Though that will surely go down over time.

emil_sorensen 125 days ago [-]

Curious if anyone else has run similar experiments?

zurfer 125 days ago [-]

Yes. Our main finding was that o3 mini especially is great on paper but surprisingly hard to prompt, compared to non reasoning models. I don't think it's a problem with reasoning, but rather with this specific model. I also suspect that o3 mini is a rather small model and so it can lack useful knowledge for broad applications. Especially for RAG, it seems that larger and fast models (e.g. gpt4o) perform better as of today.

emil_sorensen 124 days ago [-]

I suspect you're right here! Excited to get our hands on the non-distilled o3. :)

Loading comments...

serjester 124 days ago [-]

We tried something similar and found much better results with o1 pro than o3 mini. RAG seems to require a level of world knowledge that the mini models don’t have.

This comes at the cost of significantly higher latency and cost. But for us, answer quality is a much higher priority.

eternityforest 124 days ago [-]

RAG seems to work with 0.5 and 1.5B models just fine a lot of the time, it just can't handle anything that's not directly spelled out in the documents.

Or, at least it seems to in the limited amount of testing I did in a weekend. I'm an embedded dev without any real AI experience or an actual use case for building a RAG at the moment.

Foobar8568 124 days ago [-]

RAG is basically a fancy name to augment a prompt with data.

Companies are being sold they can augment their LLM with their unstructured massive dataset but it's all wishful thinking.

Workaccount2 124 days ago [-]

Yeah, LLM capabilities are measured with fresh context windows, yet people want to use them with 50k, 100k, 500k tokens.

As you pack in more and more context the model's abilities really start to deteriorate.

The first 10k tokens are the juiciest, after that it just gets worse and worse.

eternityforest 124 days ago [-]

Oh wow, I was thinking 500 tokens was way too much, since I've only ever done anything programmatic with tiny models on CPUs....

serjester 124 days ago [-]

eternityforest 123 days ago [-]

In my limited weekend testing with just a CPU, the 1.5B model is the larger and more powerful model at the end!

I'm definitely excited to see what new applications are possible with NPUs, when we can run this stuff for real on stuff anyone other than enthusiasts can afford, without waiting 40 seconds.

emil_sorensen 124 days ago [-]

Super cool! Yep, a lot seems to get lost through distillation.

SubiculumCode 124 days ago [-]

I wonder whether this would benefit from a fine tuned llm module for that specific step, or even by providing a set of examples in the prompt of when to use what tool?

EngineeringStuf 124 days ago [-]

Am I correct in reading that the RAG pipeline runs in realtime in response to a user query?

If so, then I would suggest that you run it ahead of time and generate possible questions from the LLM based on the context of the current semantically split chunk.

That way you only need to compare the embeddings at query time and it will already be pre-sorted and ranked.

The trick, of course, is chunking it correctly and generating the right questions. But in both cases I would look to the LLM to do that.

Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

TechDebtDevin 124 days ago [-]

> Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

Go on please :)

triyambakam 124 days ago [-]

So if the user submitted a question not already generated, would that be like a cache miss and it would instead fall back to a real time query?

EngineeringStuf 124 days ago [-]

Yes, but you could optimise the generated questions over time to reduce cache-misses.

ekianjo 124 days ago [-]

> time and generate possible questions from the LLM based on the context of the current semantically split chunk.

Possible but very compute intensive. Imagine if you have hundreds of thousands of chunks...

EngineeringStuf 124 days ago [-]

The number of chunks would be the same regardless of either approach.

The generation of questions can be done out-of-band by a cheaper model.

Their current implementation approach seems to require some computation per request. It would be a balance to see which strategy provides the most value.

The speed of responses overall would be faster.

aantix 124 days ago [-]

When aggregating data from multiple systems, how do you handle the case of only searching against data chunks that the user is authorized to view? And if those permissions change?

emil_sorensen 124 days ago [-]

We focus mainly on external use cases (e.g., helping companies like Docker and Monday.com deploy customer facing "Ask AI" assistants) so we don't run into much of that given all data is public.

3abiton 124 days ago [-]

> fine-tuning a model on tool usage could also allow it to gain familiarity with specific retrieval mechanisms.

anonymousDan 125 days ago [-]

Is RAG any good for coding tasks?

mvieira38 124 days ago [-]

For what it's worth, Anthropic decided against using RAG in Claude Code (https://news.ycombinator.com/item?id=43164089)

Workaccount2 124 days ago [-]

There are no programs online which do this (lots of viewers, but not interpreters/converters), and I actually had gotten a quote for proprietary software that can do it, but is $1k/yr to use.

raggedasil 125 days ago [-]

afhammad 124 days ago [-]

Could you share more on your local setup please?

mkesper 125 days ago [-]

Latency must be brutal here. This will not be possible for any chat application, I guess.

bauefi 125 days ago [-]

laichzeit0 124 days ago [-]

It's not just the retrieval, tool calls entail another call to the LLM (ToolMessage) and possibly the result will then require other tool calls. Massive latency.

eternityforest 124 days ago [-]

These ultra fast embeddings are really cool, because you can just spam them at everything and it's pretty much instant.

Then I do chunking on the fly based on similarity to to query.

https://huggingface.co/blog/static-embeddings

emil_sorensen 125 days ago [-]

Yep even with a small bump in performance (which we only saw for a subset of coding questions), it wouldn't be worth the huge latency penalty. Though that will surely go down over time.

emil_sorensen 125 days ago [-]

Curious if anyone else has run similar experiments?

zurfer 125 days ago [-]

emil_sorensen 124 days ago [-]

I suspect you're right here! Excited to get our hands on the non-distilled o3. :)