native vector store support
# 👀feature-requests
w
ability to upload files through botpress and you vectorize them
a
Hey @wide-postman-24865, could you elaborate a little more on what you mean by this? Right now, documents uploaded as a Knowledge Base source get vectorized. We use OpenAI Embeddings and store the vectors in a Postgres database using pgvector. There's no way to download these vectors, though. Is that what you're referring to?
w
No, but I have a very large corpus of information and can't upload it into your system because of the limits you've put in place. So I'm thinking I'll put it into Pinecone myself, and then I'm going to have to use an API, I guess, to duplicate what I've done in Python to talk to Pinecone. It would be a lot more convenient if I could give you large files to push to Pinecone or wherever you store them (tensorflow?), for a fee of course. Seems like you could/should charge for storage space, which is pretty typical in the space from what I've seen, since some users use a ton and some use hardly any.
I'm hoping to use Botpress for many corporations' financial documents. It's not really clear how to get at the information I want using Botpress currently. The API to the SEC database returns whole documents, not answers. Botpress to Pinecone: can do that, I guess (hoping). Botpress to a structured Google search which then goes through the LLM? This I've done in Python, and it works pretty well much of the time if you structure your queries right. I'd rather not use Python and have to maintain our own codebase.
i was also hoping to use your native 'search a url' functionality, but i can't put folders into it, so i can't say 'search yahoonews.com/company/AZ'
the final thing i thought of was to put it on a subdomain of a website i control, like az.mysite.com, and then hope google indexes it all, but that could take days.
r
@wide-postman-24865 right now you can upload whole documents (and lots of them), which we vectorize and store for you. You can also use the web search, although it's less powerful. In a few days we'll have dynamic KBs that you can upload to using the API etc
w
it wasn't happy with my 100+ mb file and that was only 1 year of data
if i knew the limit it would be helpful
r
yeah there's a limit on individual file sizes but you can chunk them up. i guess we should improve the experience and expose clear limits etc
w
yes it's a complete mystery
r
@freezing-printer-49373 maybe also consider charging for going beyond limits
w
Yes
I would expect it
It's not free to store stuff on pinecone. it's not expensive but it's not free
But what is the limit per file? And if I sat here and uploaded dozens of them, when am I going to hit a limit on the number of them? I have a real deliverable here so I can't just sit here uploading files if it's not going to work. 🙂
r
how many of them do you expect to have? i'll need to check the limit but sometimes filesize !== text size due to images etc in PDFs
do you have an order-of-magnitude estimate of the total PDF pages, like 1k or more like 1M?
@wide-postman-24865 technically, the current limit is 100 docs x 50 MB each
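The chunking workaround mentioned above could look something like the sketch below, which splits a large text file into pieces of whole lines that each stay under a size cap. The 50,000,000-byte cap and the 100-document count are assumptions taken from the numbers quoted in this thread, and `chunk_lines` is a hypothetical helper name, not a Botpress API.

```python
from typing import Iterable, Iterator, List

MAX_BYTES = 50_000_000  # assumed per-file cap, from the figures quoted in this thread
MAX_DOCS = 100          # assumed per-KB document count, also from this thread


def chunk_lines(lines: Iterable[str], max_bytes: int = MAX_BYTES) -> Iterator[str]:
    """Yield chunks made of whole lines, each at most max_bytes when UTF-8 encoded."""
    buf: List[str] = []
    size = 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if n > max_bytes:
            # A single line larger than the cap cannot be placed without splitting it.
            raise ValueError("a single line exceeds max_bytes")
        if buf and size + n > max_bytes:
            # Adding this line would overflow the current chunk, so emit it first.
            yield "".join(buf)
            buf, size = [], 0
        buf.append(line)
        size += n
    if buf:
        yield "".join(buf)
```

To use it on a real file, iterate over the open file handle (`chunk_lines(open(path, encoding="utf-8"))`), write each chunk to its own file, and check that the number of chunks stays at or below `MAX_DOCS` before uploading.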
w
cool thanks, that's a lot, enough for my client demo anyway
they aren't pdfs they are text, no images
g
@wide-postman-24865 do you mind sharing how Botpress performs with that many PDFs, especially since your documents are financial? I'm just curious how well it handles finance-related documents. I tried uploading a few years of financial reports as well, and the bot could not answer as expected even though the information was in them.
w
they aren't pdfs, they are text.
I had better results when i used google to provide the result and then an LLM to process it, actually.
the fact that it's structured financial data is definitely problematic
at least for me with my current team. it's better for us to use pre-digested data like you find on the internet
r
mind sharing an example of the structure @wide-postman-24865 ?
we're working on Structured KBs (i.e., think a mix of SQL tables and documents you can query in natural language)
w
that would be super helpful
its SEC data, you can find it at SEC.gov
g
What do you mean by using google to provide the result?
w
The source is their wacky format, in which everything is labelled
r
yeah lol.. i remember working on an EDGAR parser 10 years ago
painful memories
w
I'm taking natural language questions of course, and I was using langchain agents with websearch as one of the agents. It worked pretty well because Google has already indexed everyone's commentary and analysis, so it matched users' questions relatively well. And adding Wolfram would be even better. I keep thinking I need to go back to that method, python/langchain. I just don't want to; I was hoping botpress would save me from the development costs and time associated with improving and maintaining that.
I don't like that thumbs-up, it's ominous! 🙂
r
can you shoot me an example question that works well with your custom langchain project that would be cool to have working with botpress?
g
Easy bro. Thumbsup for giving a clear explanation. Thanks
w
'what are some risk factors associated with the cybertruck', 'what was the revenue from 2020-2023', 'what's the projected revenue in 2023', 'what significant changes to the board happened in the past 5 years'. I understand there might be limitations to what I can expect, but these are all pretty easy to find out. Maybe the 'in the past 5 years' is asking a bit much, but the SEC data certainly reports significant changes to the makeup of the board. I am not at the point with this project where I need to produce charts and graphs. I just need answers to the questions an average investor might have about a public company.
r
thanks that helps!
w
Hmm well, every document i try to upload gives me a network error on upload. Doesn't seem real to me.
I got a rejection > 50,000,000
but at 17 MB something is also up. I don't think it's failing, but I don't have a way to be sure.
my network seems fine, and it continually fails on botpress only
this file was only 5 mb. and it took a looong time to respond
r
yeah .txt upload is a known issue, the workaround is to make it a PDF in the meantime
b
Another nice feature for me would be the ability to see which pages in the knowledge base the vector store indexes (for instance, the pages of a website), plus more tools to directly manipulate the search result rather than feeding the output straight into a contextual QnA. That way, the user could turn a knowledge base query into page suggestions from within the chatbot, using the LLM to transform the query result in a specialized way
In other words, I think exposing more of the internals of the knowledge base, including metadata and customizability, is key to making it much more useful than a black box
w
agree 100%
i suspect that a lot of this can be done using the API. i recall pinecone returns metadata with results if you set it to begin with.
b
btw, if any of these features are managed within the open source UI, I am open to trying to contribute something, especially if it can accelerate my use case. As far as open-ended QnA goes, so far I've seen more of what I want from Voiceflow, but I'm confident the botpress team can get botpress to the same level and I'm happy to help in any way
Personally, i find the "agents" abstraction too magical and opaque. Most of the agents (including KB agent) do not work as well as desired.
Furthermore, the knowledge agent needs to have some prompt-engineering-ability. If the user question is outside of scope, the knowledge agent should identify it, and prevent the query, before even performing a knowledge base search
One should be able to define a clear scope using a prompt. The scope should not be defined by the scope of the knowledge base, since this could include a web search
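A rough sketch of the flow described above, where a scope check runs before any knowledge base search. Everything here is hypothetical: `in_scope` stands in for a prompt-driven classifier and `search_kb` for whatever retrieval call is available; neither is an existing Botpress function.

```python
from typing import Callable, List

OUT_OF_SCOPE_REPLY = "Sorry, that question is outside what this bot covers."
NO_RESULT_REPLY = "I could not find anything relevant."


def answer(
    question: str,
    in_scope: Callable[[str], bool],          # hypothetical: e.g. an LLM call driven by a scope prompt
    search_kb: Callable[[str], List[str]],    # hypothetical: returns matching passages
) -> str:
    """Check scope first; only query the knowledge base for in-scope questions."""
    if not in_scope(question):
        # The search is skipped entirely, so out-of-scope questions cost no retrieval.
        return OUT_OF_SCOPE_REPLY
    passages = search_kb(question)
    if not passages:
        return NO_RESULT_REPLY
    return "\n".join(passages)
```

The point of passing `in_scope` separately is that the scope is defined by a prompt the bot author controls, not by whatever the knowledge base or a web search happens to return.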
w
I totally agree with this.
r
i agree @billions-kitchen-47235, thanks for the feedback. Agents (including all the prompts they use) will all be open-sourced in a few days and the community will be able to extend and improve them. They'll also have many more knobs available.. right now they are pretty basic