I have a .txt file (99,880 lines) of training data...
# 🌎general
s
I have a .txt file (99,880 lines) of training data between {{char}} and {{user}} to train the bot. The server returned an error when I tried to upload the 16.9 MB .txt. Any solution?
a
gonna add @gifted-hydrogen-5337 in here, as they have a similar question
We haven't come across any limits stemming from pure file size. Some folks have uploaded 700-page PDFs to the knowledge base without issue.
The issues that do come up from these huge sources are more about the LLM finding the right answers in a decent amount of time.
g
The LLM you use for this is GPT-3.5, right?
a
correct
Best practice for the KB is to chunk your info into relevant docs and then upload those docs as separate sources
g
how do you "give" the documents to the LLM (gpt) as context?
a
Make a KB with a good title and description, give the sources good titles and descriptions, too
g
but there is a limit to how big the "context" can get, right?
Also, I've had some problems when bulk-uploading my docs. I could only upload 20 of 200 and got a lot of "Too many requests" errors
a
ooof, I don't have any advice on the bulk uploading bit, sorry 😣 it's gonna take you a while
what do you mean by the "context"?
g
If I'm correct, there is a thing called the context window. ChatGPT-4: "In the context of language models like GPT-3, the "context window" refers to the amount of recent input text (in terms of tokens) that the model considers when generating a response. For GPT-3, this context window is 2048 tokens. This means that if the conversation exceeds 2048 tokens, the model will not be able to see or consider the text beyond that limit when generating a response." My question is: can all of the uploaded documents' data fit into GPT-3.5's "context window"? Will the model be able to answer from everything I upload, or could some parts be "forgotten"?
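For scale, I imagine something like this (assuming OpenAI's tiktoken package and a made-up filename) would show how far my file is past that 2048-token limit:
```python
# Rough token count for the big .txt. "training_data.txt" is an example
# path, and 2048 is the context-window size quoted above.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = open("training_data.txt", encoding="utf-8").read()
print(len(enc.encode(text)), "tokens vs a 2048-token window")
```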
a
We don't send whole, unadulterated KB sources to GPT. When the user asks a question, a few steps happen first (rough code sketch below):
1. When you upload a document (or text) as a KB source, it gets split into chunks of about 200-500 words. These chunks are then put into a fancy database.
2. When the user asks a question that gets sent to the knowledge base, the question is compared to all these chunks and the top 5-10 most relevant chunks are selected
3. These top 5-10 chunks are then sent, along with the user's question and some other prompt material, to GPT-3.5 Turbo to answer
4. If you have AI personality turned on, that answer is modified again by GPT to match the personality description or language
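In code terms, a rough sketch of steps 1-3 might look like this. It isn't our actual implementation, and `embed()`, `similarity()`, and `ask_gpt()` are stand-ins for the real embeddings, ranking, and chat-completion calls:
```python
# Minimal sketch of the KB flow described above (steps 1-3 only; the
# personality pass in step 4 is skipped). embed(), similarity(), and
# ask_gpt() are placeholders for the real services.

def chunk_text(text, max_words=300):
    """Step 1: split a source into chunks of roughly 200-500 words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(question, chunks, embed, similarity, k=5):
    """Step 2: keep the k chunks whose embeddings are closest to the question."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(q_vec, embed(c)), reverse=True)
    return ranked[:k]

def answer(question, sources, embed, similarity, ask_gpt):
    """Step 3: send the top chunks plus the question to the model."""
    chunks = [c for doc in sources for c in chunk_text(doc)]
    context = "\n\n".join(retrieve(question, chunks, embed, similarity))
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return ask_gpt(prompt)
```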
g
nice, thank you very much for the clarification.
One more question: how do you decide which chunks are the most "relevant"?
a
A bunch of maths 😅
We use OpenAI's embeddings model to turn text into a tensor, basically a list of ~1,500 numbers where each number captures a bit of the phrase's meaning. We turn the user's question into a tensor the same way. Finally, formulas like cosine distance or Euclidean distance are used to find which tensors are closest to each other, and that's how we find the most relevant bits
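Here's a toy illustration of that idea. The 4-number vectors are made up (real OpenAI embeddings have ~1,500 dimensions), but the cosine and Euclidean formulas are the real ones:
```python
# Relevance as distance between embedding vectors: the vectors below are
# invented for the example, not real embeddings.
import numpy as np

question_vec = np.array([0.1, 0.8, 0.3, 0.0])
chunk_a_vec  = np.array([0.2, 0.7, 0.2, 0.1])   # similar meaning
chunk_b_vec  = np.array([0.9, 0.0, 0.1, 0.6])   # unrelated meaning

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

for name, vec in [("chunk A", chunk_a_vec), ("chunk B", chunk_b_vec)]:
    print(name,
          "cosine:", round(cosine_similarity(question_vec, vec), 3),
          "euclidean:", round(euclidean_distance(question_vec, vec), 3))
# Chunk A is both more similar by cosine and closer by Euclidean distance,
# so it would be selected as the more relevant chunk.
```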
It's a lot more complicated than that, but it's quite the rabbit hole to go down. Let me know if you'd like some papers or other further reading
g
Thank you very much!
s
I appreciate your help, Gordy. I'll work on cutting the .txt file into smaller files (or converting it to .pdf) with a .py script, then dividing it into pieces that upload without error. Thanks ✌️ 😎
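Something along these lines is the kind of splitter I have in mind (the filename and chunk size are just examples, I'll tune them to whatever uploads cleanly):
```python
# Split a big .txt into smaller numbered parts. SOURCE and LINES_PER_PART
# are example values, not anything prescribed by the server.
from pathlib import Path

SOURCE = Path("training_data.txt")   # the 99,880-line file (example name)
LINES_PER_PART = 10_000

lines = SOURCE.read_text(encoding="utf-8").splitlines(keepends=True)
for part, start in enumerate(range(0, len(lines), LINES_PER_PART), start=1):
    out = SOURCE.with_name(f"{SOURCE.stem}_part{part:02d}.txt")
    out.write_text("".join(lines[start:start + LINES_PER_PART]), encoding="utf-8")
    print(f"wrote {out} ({len(lines[start:start + LINES_PER_PART])} lines)")
```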