Knowledge Base Mode that does not vectorize the text Botpress #👀feature-requests

cool-policeman-94412

09/28/2023, 8:44 AM

Hear me out on this; there are a lot more tricks to it than you might think. KB chunking has a serious problem with knowledge that spans chunks, a known weakness of the naive approach. One workaround that has been found is easy to implement in code but not really doable in Botpress: the Q&A format. Take the original text and generate Q&A pairs from it. Any decent AI does a really good job of this, and you end up with a lot of small Q&A pairs.

Here is the trick: you vectorize ONLY the question, NOT the answer, and then return the answer as the stored content for the question's embedding. Because the question is short and on a single topic, its vector is not diluted by the answer text. This needs a specific text format, maybe lines starting with Q: that are matched to the next A: line. That would allow multiple Q lines per answer and multiple Q&A pairs in one document.

The answer quality is significantly higher (if there is a hit) than what exists now, and because there are no chunk boundaries, they cannot ruin the answer. Q&A hits should be ranked higher than pure chunk hits, possibly via a scoring bonus.

The next step would be to calculate all the Q vectors for an A and remove duplicates that are far too close (say, a similarity of 0.999). Then the AI can generate a Q, create multiple variants of it, and the vectorization can decide which of those are too similar. You also get significantly more vectors: a one-page document may turn into 20 or 30 questions, automatically. But the vectors have much higher quality and the answers, again, do not cross boundaries.

This has been tested with a number of use cases and it is a much better mechanism in a lot of areas. Note that a further step for better quality may be NOT using only 5 chunks: get 15 and have them aggregated again before sending them to the main AI.
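To make the idea concrete, here is a rough Python sketch of the pipeline described above: parse Q:/A: lines, embed only the questions, drop near-duplicate question vectors, and return the stored answers on a hit. Everything in it is an assumption for illustration, not Botpress code: the function names (`parse_qa`, `build_index`, `search`), the `qa_bonus` value, and especially `embed`, which is a toy bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

def parse_qa(text):
    """Parse 'Q:' / 'A:' lines into (questions, answer) groups.
    Several Q: lines may precede one A:, and a document may hold many groups."""
    groups, questions = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            questions.append(line[2:].strip())
        elif line.startswith("A:") and questions:
            groups.append((questions, line[2:].strip()))
            questions = []
    return groups

def embed(text):
    """Toy stand-in for a real embedding model (assumption): word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def build_index(groups, dedupe_threshold=0.999):
    """Embed ONLY the questions; the answer is stored as the payload.
    Question variants of one answer that are nearly identical are skipped."""
    index = []
    for questions, answer in groups:
        kept = []
        for q in questions:
            vec = embed(q)
            if all(cosine(vec, other) < dedupe_threshold for other in kept):
                kept.append(vec)
                index.append((vec, answer))
    return index

def search(index, query, top_k=15, qa_bonus=0.05):
    """Return up to top_k (answer, score) pairs, best first.
    qa_bonus is a hypothetical boost; it only matters once these hits are
    merged with plain chunk hits that do not receive it."""
    qvec = embed(query)
    best = {}
    for vec, answer in index:
        score = cosine(qvec, vec) + qa_bonus
        best[answer] = max(score, best.get(answer, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Calling `search(build_index(parse_qa(doc)), "reset password")` on a document written in that Q:/A: format returns the matching answers with their scores, each answer appearing once even if several of its question variants matched. The larger `top_k` reflects the suggestion to fetch around 15 hits and aggregate them before passing them to the main model; the aggregation step itself is not shown.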
