native vector store support Botpress #👀feature-requests

native vector store support

# 👀feature-requests

wide-postman-24865

06/07/2023, 3:06 PM

ability to upload files through botpress and have you vectorize them

acceptable-kangaroo-64719

06/07/2023, 3:09 PM

Hey @wide-postman-24865, could you elaborate a little more on what you mean by this? Right now, documents uploaded as a Knowledge Base source get vectorized. We use OpenAI Embeddings and store the vectors in a Postgres database using pgvector. There's no way to download these vectors, though. Is that what you're referring to?

wide-postman-24865

06/07/2023, 3:11 PM

No, but I have a very large corpus of information and can't upload it into your system because of the limits you've put in place. So I'm thinking I'll put it into Pinecone myself, and then I'll have to use an API, I guess, to duplicate what I've done in Python to talk to Pinecone. It would be a lot more convenient if I could give you large files to push to Pinecone or wherever you store them (tensorflow?), for a fee of course. Seems like you could/should charge for storage space, which is pretty typical in the space from what I've seen, since some users use a ton and some use hardly any.

wide-postman-24865

06/07/2023, 3:14 PM

I'm hoping to use Botpress for many corporations' financial documents. Not real clear how to get at the information I want using Botpress currently. The API to the SEC database returns whole documents, not answers. Botpress to Pinecone: can do that I guess (hoping). Botpress to a structured Google search which then goes through the LLM? This I've done in Python, and it works pretty well much of the time if you structure your queries right. I'd rather not use Python and have to maintain our own codebase.

wide-postman-24865

06/07/2023, 3:18 PM

i was hoping also to use your native 'search a url' functionality but can't put folders into it, so i can't say 'search yahoonews.com/company/AZ'

wide-postman-24865

06/07/2023, 3:18 PM

the final thing i thought of was to put it on a subdomain on a website i control like az.mysite.com and then hope google indexes it all, but that could take days.

rich-battery-69172

06/07/2023, 3:21 PM

@wide-postman-24865 you can right now upload whole documents (and lots of them).. which we vectorize and store for you. You can also use the web search although less powerful. In a few days we'll have dynamic KBs that you can upload using API etc

wide-postman-24865

06/07/2023, 3:21 PM

it wasn't happy with my 100+ mb file and that was only 1 year of data

wide-postman-24865

06/07/2023, 3:22 PM

if i knew the limit it would be helpful

rich-battery-69172

06/07/2023, 3:22 PM

yeah there's a limit on individual file sizes but you can chunk them up. i guess we should improve the experience and expose clear limits etc

wide-postman-24865

06/07/2023, 3:22 PM

yes it's a complete mystery

rich-battery-69172

06/07/2023, 3:23 PM

@freezing-printer-49373 maybe also consider charging for going beyond limits

wide-postman-24865

06/07/2023, 3:23 PM

Yes

wide-postman-24865

06/07/2023, 3:23 PM

I would expect it

wide-postman-24865

06/07/2023, 3:23 PM

It's not free to store stuff on pinecone. it's not expensive but it's not free

wide-postman-24865

06/07/2023, 3:24 PM

But what is the limit per file? And if I sat here and uploaded dozens of them, when am I going to hit a limit on the number of them? I have a real deliverable here so I can't just sit here uploading files if it's not going to work. 🙂

rich-battery-69172

06/07/2023, 3:26 PM

how many of them do you expect to have? i'll need to check the limit but sometimes filesize !== text size due to images etc in PDFs

rich-battery-69172

06/07/2023, 3:27 PM

do you have an order-of-magnitude estimate of total PDF pages, like 1k or more like 1m?

rich-battery-69172

06/07/2023, 3:28 PM

@wide-postman-24865 technically, the current limit is 100 docs x 50 MB each
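
A minimal sketch of the "chunk them up" workaround for plain-text sources, written against the limit quoted in this thread. The function name `split_text` and the constants `MAX_BYTES` and `MAX_DOCS` are hypothetical and not part of any Botpress API, and reading "50mb" as 50,000,000 bytes is an assumption.

```python
from typing import List

MAX_BYTES = 50_000_000  # assumed per-file limit ("50mb" read as 50,000,000 bytes)
MAX_DOCS = 100          # assumed limit on the number of documents


def split_text(text: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Split text into parts of at most max_bytes (UTF-8), breaking only at line ends."""
    parts: List[str] = []
    current: List[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if line_size > max_bytes:
            # A single oversized line cannot be placed without cutting mid-line.
            raise ValueError("a single line exceeds max_bytes")
        if current and size + line_size > max_bytes:
            # Adding this line would overflow the current part, so close it first.
            parts.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        parts.append("".join(current))
    return parts
```

If `len(split_text(text))` comes out above `MAX_DOCS`, the corpus would not fit under the quoted limits however it is split.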

wide-postman-24865

06/07/2023, 3:29 PM

cool thanks, that's a lot, enough for my client demo anyway

wide-postman-24865

06/07/2023, 3:29 PM

they aren't pdfs they are text, no images

great-london-12061

06/07/2023, 3:33 PM

@wide-postman-24865 do you mind sharing how botpress performs with that many PDFs? Especially since your documents are finance-related. Just curious how good the performance is when dealing with financial documents. I have tried uploading a few years of financial reports as well, and the bot could not answer as expected even though the information had been provided.

wide-postman-24865

06/07/2023, 3:33 PM

they aren't pdfs, they are text.

wide-postman-24865

06/07/2023, 3:34 PM

I had better results when i used google to provide the result and then LLM to process it, actually.

wide-postman-24865

06/07/2023, 3:34 PM

the fact that it's structured financial data is definitely problematic

wide-postman-24865

06/07/2023, 3:34 PM

at least for me with my current team. it's better for us to use pre-digested data like you find on the internet

rich-battery-69172

06/07/2023, 3:34 PM

mind sharing an example of the structure @wide-postman-24865 ?

rich-battery-69172

06/07/2023, 3:35 PM

we're working on Structured KBs (ie. think a mix of SQL tables and documents you can query in natural language)

wide-postman-24865

06/07/2023, 3:35 PM

that would be super helpful

wide-postman-24865

06/07/2023, 3:36 PM

it's SEC data, you can find it at SEC.gov

great-london-12061

06/07/2023, 3:36 PM

What do you mean by using google to provide the result?

wide-postman-24865

06/07/2023, 3:36 PM

The source is their wacky format, in which everything is labelled

rich-battery-69172

06/07/2023, 3:37 PM

yeah lol.. i remember working on an EDGAR parser 10 years ago

rich-battery-69172

06/07/2023, 3:37 PM

painful memories

wide-postman-24865

06/07/2023, 3:38 PM

I'm taking natural language questions of course, and then I was using langchain agents with websearch as one of the agents. It worked pretty well because Google has already indexed everyone's commentary and analysis, so it matched users' questions relatively well. And adding Wolfram would be even better. Keep thinking I need to go back to that method, python/langchain. I just don't want to; I was hoping botpress would save me from the development costs and time associated with improving and maintaining that.

wide-postman-24865

06/07/2023, 3:40 PM

I don't like that thumbsup, it's ominous! 🙂

rich-battery-69172

06/07/2023, 3:40 PM

can you shoot me an example question that works well with your custom langchain project that would be cool to have working with botpress?

great-london-12061

06/07/2023, 3:42 PM

Easy bro. Thumbsup for giving a clear explanation. Thanks

wide-postman-24865

06/07/2023, 3:49 PM

'what are some risk factors associated with the cybertruck', 'what was the revenue from 2020-2023', 'what's the projected revenue in 2023', 'what significant changes to the board happened in the past 5 years'. I understand there might be limitations to what I can expect, but these are all pretty easy to find out. Maybe the 'in past 5 years' is asking a bit much, but the SEC data certainly reports significant changes to the makeup of the board. I am not at the point with this project where I need to produce charts and graphs. Just answer the questions that the average investor might have about a public company.

rich-battery-69172

06/07/2023, 4:01 PM

thanks that helps!

wide-postman-24865

06/07/2023, 4:06 PM

Hmm well, every document i try to upload gives me a network error on upload. Doesn't seem real to me.

wide-postman-24865

06/07/2023, 4:07 PM

I got a rejection > 50,000,000

wide-postman-24865

06/07/2023, 4:07 PM

but at 17 mb something is also up. I don't think it's failing but I don't have a way to be sure.

wide-postman-24865

06/07/2023, 4:07 PM

my network seems fine, and it continually fails on botpress only

wide-postman-24865

06/07/2023, 4:10 PM

this file was only 5 mb. and it took a looong time to respond

rich-battery-69172

06/07/2023, 4:20 PM

yeah .txt upload is a known issue, the workaround is to make it a PDF in the meantime

billions-kitchen-47235

06/08/2023, 12:16 AM

I think another nice feature to have for me would be to be able to see which pages in the knowledge base the vector store indexes (for instance the pages in a website) and also to provide more tools to directly manipulate the result of the search rather than to directly feed the output into a contextual QnA. This way, the user could transform the knowledge base query into page suggestions from within the chat bot, utilizing the LLM to transform the knowledge base query result in a specialized way

billions-kitchen-47235

06/08/2023, 12:17 AM

In other words, I think that exposing more of the internals of the knowledge base, including metadata and customizability, is key to making it much more useful than a black box
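
A rough sketch of the "page suggestions" idea, assuming the knowledge base search returned its hits together with a score and a metadata dictionary. The `Hit` type, the `url` metadata key, and `suggest_pages` are all hypothetical names for illustration; nothing here reflects the actual Botpress knowledge base interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hit:
    """Hypothetical shape of one search result: matched text, score, and metadata."""
    text: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


def suggest_pages(hits: List[Hit], limit: int = 3) -> List[str]:
    """Return the source pages of the best hits, one entry per page, best first."""
    best: Dict[str, float] = {}
    for hit in hits:
        url = hit.metadata.get("url")  # assumed metadata key
        if url is None:
            continue  # a hit without a source page cannot be suggested
        best[url] = max(best.get(url, float("-inf")), hit.score)
    ranked = sorted(best, key=lambda u: best[u], reverse=True)
    return ranked[:limit]
```

With hits exposed this way, the bot could offer the ranked pages as suggestions in the chat, or pass them to the LLM with a custom prompt, instead of always feeding them straight into a contextual QnA.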

wide-postman-24865

06/08/2023, 2:02 AM

agree 100%

wide-postman-24865

06/08/2023, 2:02 AM

i suspect that a lot of this can be done using the API. i recall pinecone returns metadata if you set it to begin with.

billions-kitchen-47235

06/08/2023, 8:17 AM

btw, if any of these features are managed within the open source UI, I am open to trying to contribute something, especially if it can accelerate my use case. As far as open ended QnA goes, so far I've seen more of what I want from Voiceflow, but I'm confident the botpress team can get botpress to the same level and I'm happy to help in any way

billions-kitchen-47235

06/08/2023, 9:15 AM

Personally, i find the "agents" abstraction too magical and opaque. Most of the agents (including KB agent) do not work as well as desired.

billions-kitchen-47235

06/08/2023, 9:20 AM

Furthermore, the knowledge agent needs some prompt-engineering ability. If the user question is out of scope, the knowledge agent should identify that and block the query before even performing a knowledge base search

billions-kitchen-47235

06/08/2023, 9:29 AM

One should be able to define a clear scope using a prompt. The scope should not be defined by the scope of the knowledge base, since this could include a web search
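
A minimal sketch of the scope gate described above, assuming the scope is given as a prompt and checked before any knowledge base search runs. All names are hypothetical: `in_scope` stands in for an LLM call that judges the question against the scope prompt, and `search` stands in for the knowledge base query. The `keyword_scope_check` function is only a toy substitute so the example runs without a model.

```python
from typing import Callable, List, Optional

ScopeCheck = Callable[[str, str], bool]  # (scope_prompt, question) -> in scope?
Search = Callable[[str], List[str]]      # question -> retrieved passages


def answer_with_scope_gate(
    question: str,
    scope_prompt: str,
    in_scope: ScopeCheck,
    search: Search,
) -> Optional[List[str]]:
    """Run the search only when the question passes the scope check."""
    if not in_scope(scope_prompt, question):
        return None  # out of scope: the knowledge base is never queried
    return search(question)


def keyword_scope_check(scope_prompt: str, question: str) -> bool:
    """Toy stand-in for an LLM judgment: in scope if any word is shared."""
    scope_words = set(scope_prompt.lower().split())
    return any(word in scope_words for word in question.lower().split())
```

Because the scope comes from the prompt rather than from the contents of the knowledge base, the same gate applies even when the knowledge source is a web search.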

wide-postman-24865

06/08/2023, 3:11 PM

I totally agree with this.

rich-battery-69172

06/08/2023, 4:52 PM

i agree @billions-kitchen-47235 , thanks for the feedback. Agents (including all prompts they use) will all be open-sourced in a few days and the community will be able to extend and improve them. Also they'll have many more knobs available.. right now they are pretty basic
