I have a .txt file (99,880 lines) of training data...
# 🌎general
s
I have a .txt file (99,880 lines) of training data between {{char}} and {{user}} to train the bot. The server returned an error when I tried to upload the 16.9 MB .txt. Any solution?
a
gonna add @gifted-hydrogen-5337 in here, as they have a similar question
We haven't come across any limits stemming from pure file size. Some folks have uploaded 700-page PDFs to the knowledge base without issue.
The issues that do come up from these huge sources are more about the LLM finding the right answers in a decent amount of time.
g
The LLM you use for this is GPT-3.5, right?
a
correct
Best practice for the KB is to chunk your info into relevant docs and then upload those docs as separate sources
g
how do you "give" the documents to the LLM (gpt) as context?
a
Make a KB with a good title and description, give the sources good titles and descriptions, too
g
but there is a limit to how big the "context" can get, right?
Also, I've had some problems when bulk-uploading my docs. I could only upload 20 of 200 and got a lot of "Too many requests" errors
a
ooof, I don't have any advice on the bulk uploading bit, sorry 😣 it's gonna take you a while
what do you mean by the "context"?
g
If I'm correct, there is a thing called the context window. ChatGPT-4: "In the context of language models like GPT-3, the "context window" refers to the amount of recent input text (in terms of tokens) that the model considers when generating a response. For GPT-3, this context window is 2048 tokens. This means that if the conversation exceeds 2048 tokens, the model will not be able to see or consider the text beyond that limit when generating a response." My question is: can all of the uploaded documents' data fit into GPT-3.5's "context window"? Will the model be able to answer from everything I upload, or could some parts be "forgotten"?
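For scale, I imagine something like this (assuming OpenAI's tiktoken package and a made-up filename) would show how far my file is past that 2048-token limit:
```python
# Rough token count for the big .txt. "training_data.txt" is an example
# path, and 2048 is the context-window size quoted above.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = open("training_data.txt", encoding="utf-8").read()
print(len(enc.encode(text)), "tokens vs a 2048-token window")
```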
a
We don't send whole, unadulterated KB sources to GPT. When the user asks a question, a few steps happen first (rough code sketch below):
1. When you upload a document (or text) as a KB source, it gets split into chunks of about 200-500 words. These chunks are then put into a fancy database.
2. When the user asks a question that gets sent to the knowledge base, the question is compared to all these chunks and the top 5-10 most relevant chunks are selected
3. These top 5-10 chunks are then sent, along with the user's question and some other prompt material, to GPT-3.5 Turbo to answer
4. If you have AI personality turned on, that answer is modified again by GPT to match the personality description or language
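In code terms, a rough sketch of steps 1-3 might look like this. It isn't our actual implementation, and `embed()`, `similarity()`, and `ask_gpt()` are stand-ins for the real embeddings, ranking, and chat-completion calls:
```python
# Minimal sketch of the KB flow described above (steps 1-3 only; the
# personality pass in step 4 is skipped). embed(), similarity(), and
# ask_gpt() are placeholders for the real services.

def chunk_text(text, max_words=300):
    """Step 1: split a source into chunks of roughly 200-500 words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(question, chunks, embed, similarity, k=5):
    """Step 2: keep the k chunks whose embeddings are closest to the question."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(q_vec, embed(c)), reverse=True)
    return ranked[:k]

def answer(question, sources, embed, similarity, ask_gpt):
    """Step 3: send the top chunks plus the question to the model."""
    chunks = [c for doc in sources for c in chunk_text(doc)]
    context = "\n\n".join(retrieve(question, chunks, embed, similarity))
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return ask_gpt(prompt)
```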
g
nice, thank you very much for the clarification.
One more question: how do you decide which chunks are the most "relevant"?
a
A bunch of maths 😅
We use OpenAI's embeddings model to turn text into a tensor, basically a list of ~1,500 numbers where each number captures a bit of the phrase's meaning. We turn the user's question into a tensor the same way. Finally, formulas like cosine distance or Euclidean distance are used to find which tensors are closest to each other, and that's how we find the most relevant bits
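Here's a toy illustration of that idea. The 4-number vectors are made up (real OpenAI embeddings have ~1,500 dimensions), but the cosine and Euclidean formulas are the real ones:
```python
# Relevance as distance between embedding vectors: the vectors below are
# invented for the example, not real embeddings.
import numpy as np

question_vec = np.array([0.1, 0.8, 0.3, 0.0])
chunk_a_vec  = np.array([0.2, 0.7, 0.2, 0.1])   # similar meaning
chunk_b_vec  = np.array([0.9, 0.0, 0.1, 0.6])   # unrelated meaning

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

for name, vec in [("chunk A", chunk_a_vec), ("chunk B", chunk_b_vec)]:
    print(name,
          "cosine:", round(cosine_similarity(question_vec, vec), 3),
          "euclidean:", round(euclidean_distance(question_vec, vec), 3))
# Chunk A is both more similar by cosine and closer by Euclidean distance,
# so it would be selected as the more relevant chunk.
```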
It's a lot more complicated than that, but it's quite the rabbit hole to go down. Let me know if you'd like some papers or other further reading
g
Thank you very much!
s
I appreciate your help, Gordy. I'll work on cutting the .txt file into smaller files (or converting it to .pdf) with a .py script, then dividing it into pieces that upload without error. Thanks ✌️ 😎
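Something along these lines is the kind of splitter I have in mind (the filename and chunk size are just examples, I'll tune them to whatever uploads cleanly):
```python
# Split a big .txt into smaller numbered parts. SOURCE and LINES_PER_PART
# are example values, not anything prescribed by the server.
from pathlib import Path

SOURCE = Path("training_data.txt")   # the 99,880-line file (example name)
LINES_PER_PART = 10_000

lines = SOURCE.read_text(encoding="utf-8").splitlines(keepends=True)
for part, start in enumerate(range(0, len(lines), LINES_PER_PART), start=1):
    out = SOURCE.with_name(f"{SOURCE.stem}_part{part:02d}.txt")
    out.write_text("".join(lines[start:start + LINES_PER_PART]), encoding="utf-8")
    print(f"wrote {out} ({len(lines[start:start + LINES_PER_PART])} lines)")
```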