KB Doc Upload Rate Limited
# 🤝help
i
**What I want to do**: Upload hundreds of PDFs into the KB.
**What I expect to happen**: Batch uploads should work.
**What's happening instead**: Getting rate limited for a minute after 14 uploads; uploads in the queue are rejected (so the POST calls are still being made), and uploading picks back up after the timeout ends.

OK, before I make a screen recording: is there an advantage to having the files split up? I have each PDF classified and properly named, but if the KB is ignoring file names and simply aggregating every file into one single corpus, then I'll just write myself a script to combine all of my PDFs into a single corpus.
r
@icy-address-63961 hey, I'm wondering what the use case is for hundreds of files – do you think having the ability for KBs to answer from Tables would work for you, if you could store vectorized text inside them?
i
I'm dealing with hundreds of PDFs because it's the product of my specific data aggregation method. Are you saying that each document is stored in a table with vectorized text in it?
r
Well, I'm trying to understand if document upload is the right data source for your use-case or if we should perhaps develop a new data source that is more suited to your use case. Since we have Tables in Botpress and we're soon releasing the ability to make Tables a Data Source for KBs, I thought I would ask if that would be easier for you to manage than uploading hundreds of PDFs.
Maybe it would be easier to have a "Google Drive" or "AWS S3" data source, where Botpress would do the work of periodically syncing the documents?
But yeah, to answer your initial question – you are indeed rate limited in the Studio to 15 requests per minute on most operations.
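If you're scripting the uploads, a simple client-side throttle that stays under that limit should stop queued uploads from being rejected. A minimal sketch in Python – the endpoint URL, auth header, and form field here are placeholders for illustration, not the actual Botpress API:

```python
import time
import requests  # pip install requests

API_URL = "https://example.com/kb/upload"      # placeholder, not a real Botpress endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth

MAX_PER_MINUTE = 14  # stay just under the 15 requests/minute limit

def upload_all(pdf_paths):
    window_start = time.monotonic()
    sent_in_window = 0
    for path in pdf_paths:
        # If we've used up this minute's budget, wait for the window to reset.
        if sent_in_window >= MAX_PER_MINUTE:
            elapsed = time.monotonic() - window_start
            if elapsed < 60:
                time.sleep(60 - elapsed)
            window_start = time.monotonic()
            sent_in_window = 0
        with open(path, "rb") as f:
            resp = requests.post(API_URL, headers=HEADERS, files={"file": f})
        if resp.status_code == 429:
            # Rate limited anyway: back off for a minute and retry once.
            time.sleep(60)
            with open(path, "rb") as f:
                resp = requests.post(API_URL, headers=HEADERS, files={"file": f})
        resp.raise_for_status()
        sent_in_window += 1
```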
i
Yes, this would definitely be a better solution for my needs. Does this currently exist?
r
It doesn't, but I think that would make sense product-wise.
i
But again, does the KB look at file names at all? Does it aggregate my data into one corpus before vectorization?
r
What do you mean by aggregating your data into one corpus? Do you mean reducing the document recursively to make the embedding better? Or something else?
And yes, it does look at the file names.
i
I suppose I'm unclear about how it's working under the hood. You'll have to forgive me, as I'm fairly new to embeddings/LLMs/vector DBs. Let me just share what I'm doing exactly: I'm transcribing individual pieces of content from creators, summarizing that content to generate a filename, and then exporting the raw transcription + some other data points from the content's metadata as a PDF – hence why I have file counts in the hundreds.
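Roughly, the export step looks like this (a sketch in Python using fpdf2; the metadata fields and filenames are illustrative, not exactly what my pipeline uses):

```python
from fpdf import FPDF  # pip install fpdf2

def export_transcript_pdf(transcript: str, metadata: dict, out_path: str) -> None:
    """Write one transcript plus its metadata to a single PDF file."""
    pdf = FPDF()
    pdf.add_page()
    # Core fonts cover latin-1 only; for other characters, register a
    # Unicode TTF font with pdf.add_font() first.
    pdf.set_font("Helvetica", style="B", size=14)
    pdf.multi_cell(0, 8, metadata.get("title", "Untitled"))  # heading helps retrieval
    pdf.set_font("Helvetica", size=10)
    for key, value in metadata.items():
        pdf.multi_cell(0, 6, f"{key}: {value}")
    pdf.ln(4)
    pdf.set_font("Helvetica", size=11)
    pdf.multi_cell(0, 6, transcript)
    pdf.output(out_path)

# Illustrative usage: one PDF per piece of content, named from a summary.
# export_transcript_pdf(raw_text, {"title": "Episode 42", "creator": "..."}, "episode_42.pdf")
```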
r
OK, got it, makes sense. And how will you query the KBs? Will you use metadata to filter which files you search, or always search all documents?
i
Always search all documents. This will be a sort of general Q&A for their audience.
r
OK, makes sense. Are the file names important for the KBs in that case?
i
I'm not sure – I assumed it would be best to name them in case the names were somehow relevant to making search faster/more accurate.
r
OK, so if knowing the exact file (like Joe_Rogan_2021_e2345.pdf) isn't critical for your bot, you will probably see no big difference in results between one giant file and many small files.
i
Gotcha thank you!!
r
Just make sure to use headings in your PDFs, split your paragraphs, etc. Good formatting is still important and will impact responses!
In fact, separating each transcript with a page break is best (so embeddings won't be mixed up).
i
And to be clear on this, you mean that each transcription will have its own page, correct? Not just a blank page between each?
r
Yeah, each on its own page.
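If you go the combine-into-one-file route, something like the sketch below keeps that property automatically, since appending whole PDFs never merges pages – each transcript starts on its own page. This uses pypdf as one possible tool; it's not a Botpress requirement, and the directory/output names are placeholders:

```python
from pathlib import Path
from pypdf import PdfWriter  # pip install pypdf

def combine_transcripts(pdf_dir: str, out_path: str) -> None:
    """Merge all per-transcript PDFs into one file, each starting on its own page."""
    writer = PdfWriter()
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # append() copies whole pages, so transcripts never share a page
        # and their embeddings won't get mixed up.
        writer.append(str(pdf_path))
    with open(out_path, "wb") as f:
        writer.write(f)

# combine_transcripts("transcripts/", "corpus.pdf")
```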