How to Build a Powerful LLM Knowledge Base

A knowledge base is a place where you store a lot of information and make it accessible for future use. This is incredibly powerful for:

Better decision-making
Quickly picking up on past context
Aligning your team

Lately, I’ve been working a lot on setting up a knowledge base and routing as much context as possible into it, to help me improve on all of the points above. Knowledge bases were useful even before LLMs, because access to past knowledge is always valuable. However, LLMs have made knowledge bases far more powerful.

This is because of two main reasons:

You can capture more information in the knowledge bases
You can more easily query the knowledge base (you don’t have to look through it manually)

In this article, I’ll cover why you should set up your own LLM-powered knowledge base, how to capture as much information as possible, and how to actively use the knowledge base.

I’ve discussed this topic before, but I have grown more and more fond of knowledge bases because of how popular they’ve become. For example, the president of Y Combinator is building GBrain, and Andrej Karpathy is building an LLM wiki, both of which are examples of knowledge bases.

There is, of course, no ground truth for the optimal way to build a knowledge base. I think the most important thing is to actually start storing all of your context in a knowledge base and to figure out how to query it effectively in your day-to-day work, for example when writing code or in meetings.

Why you should have a knowledge base

First of all, I’d like to cover why you should have a knowledge base. You can have different knowledge bases. For example, you can have a personal one consisting of all the context that you have personally, or you can have a company-wide knowledge base consisting of knowledge or context that the company possesses.

The reason you should have a knowledge base is that information is extremely valuable. The more information you can store and then later access when needed, the better you will perform. You will, for example, be able to:

Make better decisions because you have access to more context
Pick up previous topics more quickly, without having to look through a variety of sources to find the information you had on the topic
Align different people because they share a single source of truth

The same concepts apply to both personal and company-wide knowledge bases. I also believe that knowledge bases have become far more powerful because you can query them with LLMs. Previously, you had to look through the knowledge base manually to find relevant information. You had to rely on your own memory to recall whether a certain piece of information was stored there, and then decide whether it was worth the time to find it.

Now that is completely turned around. The LLM can itself query the knowledge base, for example, with a RAG-type approach, and automatically find relevant information immediately. The LLM can itself decide when it needs to use the knowledge base.

In other words, you completely remove the human-in-the-loop requirement for accessing information in the knowledge base, which makes it so much more powerful.

Capturing information into the knowledge base

The first step is, of course, to capture information into the knowledge base. Depending on how your knowledge base is built, this can happen in a variety of ways.

However, the first thing I urge you to do is to think of all the different sources of information that you have access to, either personally or at the company. These are, for example:

Meetings
Your project management tool, such as Linear
Your coding agent, such as Claude Code or Codex: what you have been working on lately with these models, and which tasks are completed
Physical office discussions

You can probably think of many other sources of information. Of course, this depends on how and where you work. The point is that you should map out all of these information sources and figure out an automatic way to route information from them into your knowledge base.

Neither you nor other people will be willing to spend extra time manually putting things into a knowledge base. You need an automatic way of doing this to keep your knowledge base up to date.

It’s important that you fully automate the routing of information from the source to the knowledge base. If you require a manual step (for example, pasting meeting notes into the knowledge base), you’ll definitely forget about it and lose important context, which goes against the entire concept of the knowledge base. The whole point of the knowledge base is that you store absolutely all information there and don’t leave anything out. That’s what makes a knowledge base so powerful.

For example, with meeting notes, you can have a cron job that syncs daily. It takes the notes from each meeting that you or anyone in the company has had and stores them in the knowledge base. You can set up a similar cron job for Linear, or whichever project management tool you use, to sync everything that happened there. You can do the same for your coding agent: what you’ve been working on, anything you’ve discussed with it, and so on. All of this can easily be synced into the knowledge base with a daily cron job.
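A daily sync job like this can be sketched in a few lines of Python. This is a minimal sketch, assuming the knowledge base is a folder of markdown files. The function names, the folder layout, and the fields of each item (`id`, `date`, `title`, `text`) are assumptions made for illustration, and `fetch_meeting_notes` is a hypothetical placeholder for whatever export or API your meeting tool actually offers.

```python
from pathlib import Path


def fetch_meeting_notes() -> list[dict]:
    """Hypothetical placeholder: replace with your meeting tool's export or API."""
    return [
        {
            "id": "standup",
            "date": "2025-01-15",
            "title": "Daily standup",
            "text": "Decided to ship the billing fix first.",
        },
    ]


def sync_source(source_name: str, items: list[dict], kb_root: Path) -> list[Path]:
    """Write each item as a markdown file under kb_root/source_name.

    Files whose content is unchanged are skipped, so the job can safely
    run every day. Returns the paths that were created or updated.
    """
    target_dir = kb_root / source_name
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        body = (
            f"# {item['title']}\n\n"
            f"Source: {source_name}\n"
            f"Date: {item['date']}\n\n"
            f"{item['text']}\n"
        )
        path = target_dir / f"{item['date']}-{item['id']}.md"
        if path.exists() and path.read_text(encoding="utf-8") == body:
            continue  # already synced and unchanged
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```

A script that calls `sync_source("meetings", fetch_meeting_notes(), Path("knowledge_base"))` could then be scheduled with a crontab entry such as `0 6 * * * python3 /path/to/sync_kb.py`, which runs it every day at 06:00. The script name and paths here are placeholders. The same function can be reused for other sources by adding one fetcher per source.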

Physical office discussions are a point that’s harder to fully automate. I haven’t fully been able to figure this one out yet myself, but two options would be:

Record everything going on all the time, which would of course require consent
Manually write things down after having a discussion in the office

However, I think you might not even need to store office discussions explicitly. Most of the time, after I have a discussion in the office, either I or the person I talked to will take the context from that discussion and write it into a coding agent. The discussion usually came up because of a question about an implementation, so if that knowledge is actively used in your coding agent afterwards, you can fetch it from the coding agent logs.

If you have completed this step successfully and stored all the context you encounter every day in your knowledge base, you’ve done most of the work. This is the hard part of building a knowledge base. In the next section, I’ll cover the easier part, which is actively using that information when making decisions or interacting with your coding agents.

Utilizing information from the knowledge base

If you have a synced knowledge base with all the information you require, you can now move on to actively utilizing this information. I think there are two main approaches to using the information from a knowledge base:

You can just query the knowledge base if you have a question. This should, of course, be done through your coding agent. You ask it a question, and it should know that it should query the knowledge base to find the answer.
The second is to have the coding agent passively utilize the knowledge base whenever it does work.

I think the first application here is pretty self-explanatory. Just ask it the question whenever you’re unsure of something. That’s why I’ll spend more time discussing the second point here.

Having the coding agent passively utilize the knowledge base whenever it does work (for example, implementing code or fixing a bug) is very powerful. Again, I think there are two main approaches to doing this.

Grep-based inference

One is to have a top-level markdown file in the knowledge base that explains the entire knowledge base and where the different information is. The agent reads this file and then uses grep to search the files it points to. This file is, of course, updated whenever you add more information to the knowledge base.

The upside of this approach is that you’re using grep, which is usually more powerful than embedding-based search because it’s better able to find the correct information when needed. However, it also requires you to keep that markdown file in the context of the LLM all the time. The file can grow quite big, which becomes a problem after a while because it takes up input tokens on every request.
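To make the grep-based approach concrete, here is a minimal sketch, assuming the knowledge base is a folder of markdown files. The file name `INDEX.md` and the functions `build_index` and `grep_kb` are names chosen for this illustration, and the search is a plain regular-expression scan in Python rather than a call to the `grep` binary a coding agent would typically use.

```python
import re
from pathlib import Path


def build_index(kb_root: Path, index_name: str = "INDEX.md") -> Path:
    """Regenerate the top-level index: one line per file with its first non-empty line."""
    lines = ["# Knowledge base index", ""]
    for path in sorted(kb_root.rglob("*.md")):
        if path.name == index_name:
            continue  # do not list the index inside itself
        text = path.read_text(encoding="utf-8")
        first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "(empty)")
        summary = first_line.lstrip("# ").strip()
        lines.append(f"- {path.relative_to(kb_root).as_posix()}: {summary}")
    index_path = kb_root / index_name
    index_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return index_path


def grep_kb(kb_root: Path, pattern: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every line matching the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(kb_root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(kb_root).as_posix(), line_no, line.strip()))
    return hits
```

In this sketch, `build_index` would run at the end of each sync so the index never goes stale, and only the index needs to sit in the LLM’s context. The agent then calls something like `grep_kb` when the index suggests that a file is relevant.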

Embedding-based inference

The second way to use the knowledge base actively is embedding-based inference. This is what GBrain is made for. Whenever you run a query, you run an embedding search against the knowledge base, as in RAG, and it fetches some relevant chunks. If the LLM judges that the embedding search returned relevant information, it can look further into the corresponding files.
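The retrieval step can be illustrated with a small sketch. To keep it self-contained, it does not call a real embedding model: the `embed` function below is a toy stand-in that turns text into word counts, which only matches on shared words and does not capture meaning the way learned embeddings do. The chunking, scoring by cosine similarity, and top-k selection have the same shape as in a real setup. All function names here are assumptions made for illustration and are not taken from GBrain or any other tool.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def top_k(query: str, documents: dict[str, str], k: int = 3) -> list[tuple[float, str, str]]:
    """Return up to k (score, file name, chunk) tuples, best match first."""
    query_vec = embed(query)
    scored = []
    for name, text in documents.items():
        for piece in chunk(text):
            scored.append((cosine(query_vec, embed(piece)), name, piece))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [item for item in scored[:k] if item[0] > 0]
```

With a real embedding model, `embed` would return a dense vector and the chunk vectors would be computed once and stored, rather than recomputed for every query. The returned file names are what the LLM can use to look further into the relevant files.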

I think this is probably the better approach to using the knowledge base during inference because it doesn’t require an active search, and it doesn’t require spending a lot of input tokens on the knowledge base for everything that you do.

However, which approach works best will definitely depend on your use cases.

Conclusion

All in all, I urge you to:

Set up a knowledge base yourself
Write as much information into it as possible
Read about how others have set up their knowledge bases

Then you should actively use this knowledge base whenever you do work on your computer using a coding agent (which should basically be all the work that you do). I believe knowledge bases will become incredibly powerful and valuable in the years to come, and they can also give you a moat, because having access to a lot of information will be a definite advantage in the future. Furthermore, this data is specific to your company or your personal context, and in many cases only you have access to it. Thus, if you don’t store it, you’ll never be able to access that information again.
