
Running LLMs locally

Why locally?

I was interested in ChatGPT as I saw examples of interactions pour out onto social media, but I refused to even check out their site. Maybe you see me like a Native afraid of my soul being stolen.

privateGPT - Ask questions to your documents without an internet connection

By feeding my prompts and reactions into an LLM (Large Language Model), I am in a very real sense putting my spiritual imprint, my soul, into another being. I originally didn't realize that's what was going on with Google (basically founded by CIA money) and only slowly came to that realization over the years. I'm not going to get into that very much, but I did want to say that I was interested in the technology. Even GPT-3 was interesting, but I never played with it either, as I really didn't have the time and energy when it was the cutting edge.

I did finally take some time to delve into what is coming out these days, especially after LLaMA was leaked, which was the first step towards allowing these large models to run on individuals' own hardware. I am still pretty ignorant when it comes to the underlying technology and will be learning with baby steps as I experiment.

To finally answer the question "Why locally?" directly and succinctly: Privacy. I am paranoid about what could be considered "thought crime". The Overton window has shifted rapidly, and honestly that's all I feel comfortable saying online now. Why not start modeling personalities off the inputs you're giving, cross-referencing with your Gmail account, search and purchase histories? It's already happening and I just don't want to integrate with that system. I'm just a fuddy duddy I guess.

What I've got running so far, and how I did it

First I built llama.cpp from the tar.gz source code release. It was one of the easiest builds I've ever done: no errors, it just produced binaries!

The first model I tried was a 13 billion parameter Vicuna model, described as:

Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model, based on the transformer architecture.

I downloaded ggml-old-vic13b-uncensored-q5_1.bin and ran it with llama.cpp with this command:

./main --interactive-first -r "### Human:" --temp 0 -c 2048 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vic13b-uncensored-q5_1.bin
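For reference, here is a rough sketch that rebuilds the same command one option at a time, with a comment on what each flag does as I understand it from the llama.cpp help text. The `run_llama` function name and the `DRY_RUN` switch are my own inventions for illustration, not part of llama.cpp, and the model path is just whatever file you pass in.

```shell
#!/bin/sh
# Annotated version of the ./main command above.
# run_llama and DRY_RUN are hypothetical helpers, not llama.cpp features.

run_llama() {
    model=$1                            # path to the model file
    set --                              # start from an empty argument list
    set -- "$@" --interactive-first     # interactive mode, wait for my input first
    set -- "$@" -r "### Human:"         # reverse prompt: stop and hand control back at this string
    set -- "$@" --temp 0                # temperature 0: always take the most likely token
    set -- "$@" -c 2048                 # context size in tokens
    set -- "$@" --ignore-eos            # keep generating past the end-of-stream token
    set -- "$@" --repeat_penalty 1.2    # penalize repeated tokens
    set -- "$@" --instruct              # instruction mode (Alpaca-style prompting)
    set -- "$@" -m "$model"             # model file to load

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command, one word per line, for inspection.
        printf '%s\n' ./main "$@"
    else
        ./main "$@"
    fi
}

run_llama ggml-vic13b-uncensored-q5_1.bin
```

By default this only prints the command. Setting `DRY_RUN=0` in the environment would actually launch `./main`, assuming the script sits in the llama.cpp build directory.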

Today I tried an uncensored WizardLM model -- again I'm really just exploring like a kid in a candy shop right now and am blown away by the ease with which I'm able to spin up these models using just my CPU and 16GB of RAM.

This is the Hugging Face model card for the 13B Wizard Vicuna GGML model. If you want to try other models with llama.cpp, make sure they have GGML in the title! I wasted some bandwidth and time earlier downloading a WizardLM model that wouldn't load. I loaded the Wizard-Vicuna-13B-Uncensored.ggmlv3.q8_0.bin file with the same parameters as before and was greeted by a much more "personable" bot: lowercase 'i', hahas and ASCII smileys :)
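The "GGML in the title" rule of thumb can be turned into a quick check before spending bandwidth on a download. This is only a sketch of that name-based heuristic: the `looks_like_ggml` helper and the second file name are made up for illustration, and a matching name is no guarantee that the file will actually load.

```shell
#!/bin/sh
# Name-based heuristic only: does the file name mention GGML (any case)?
# looks_like_ggml is a hypothetical helper, not part of llama.cpp.

looks_like_ggml() {
    name=$(basename "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *ggml*) return 0 ;;
        *)      return 1 ;;
    esac
}

# The second name is an invented example of a non-GGML file.
for f in Wizard-Vicuna-13B-Uncensored.ggmlv3.q8_0.bin some-model.safetensors; do
    if looks_like_ggml "$f"; then
        echo "$f: name mentions GGML, worth trying"
    else
        echo "$f: no GGML in the name, probably skip"
    fi
done
```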


6/2/23: Just installed more RAM to load a larger model.
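A back-of-the-envelope way to see why a larger model needs more RAM: weights take roughly (parameters x bits per weight / 8) bytes. The sketch below assumes about 6.0 bits per weight for q5_1 and about 8.5 for q8_0, which is my reading of the ggml block layouts rather than something stated on this page, and it uses nominal parameter counts. It ignores the context buffer and other overhead, so treat the numbers as a floor.

```shell
#!/bin/sh
# Rough size of the weights: params (in billions) * bits per weight / 8 = GB.
# estimate_gb is a hypothetical helper; the bits-per-weight values are assumptions.

estimate_gb() {
    awk -v params="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", params * bpw / 8 }'
}

estimate_gb 13 6.0    # 13B at q5_1        -> 9.75
estimate_gb 13 8.5    # 13B at q8_0        -> 13.81
estimate_gb 30 6.0    # nominal 30B, q5_1  -> 22.50
```

On those rough numbers a 13B model squeezes into 16GB, while a 30B model at the same quantization does not.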

6/4/23: Trying to load this based model using GPU assistance this time:

./main -t 8 -ngl 16 -m ./models/based-30b.ggmlv3.q5_1.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 --interactive-first --ignore-eos --instruct

As far as I can tell, -ngl 16 offloads 16 of the model's layers to the GPU and leaves the rest on the CPU. I might try to offload more; I don't know what I can get away with yet.
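One crude way to guess a starting value for -ngl is to divide the model file size by its layer count and see how many of those slices fit in video memory. Everything in this sketch is an assumption for illustration: the `layers_that_fit` helper is made up, the sizes are placeholder numbers, and the 60-layer figure is what I believe the 30B LLaMA architecture uses (llama.cpp prints the real layer count when it loads a model). Actual usage per layer will differ, so this only gives a first guess to adjust by trial and error.

```shell
#!/bin/sh
# Crude estimate of how many layers fit in VRAM.
# Arguments: model size (MB), layer count, VRAM (MB), VRAM to keep free (MB).
# layers_that_fit is a hypothetical helper; all numbers below are placeholders.

layers_that_fit() {
    model_mb=$1
    n_layers=$2
    vram_mb=$3
    reserve_mb=$4

    per_layer=$(( model_mb / n_layers ))    # average size of one layer
    usable=$(( vram_mb - reserve_mb ))      # VRAM left after the reserve

    if [ "$usable" -le 0 ] || [ "$per_layer" -le 0 ]; then
        echo 0
        return 0
    fi

    fit=$(( usable / per_layer ))
    if [ "$fit" -gt "$n_layers" ]; then
        fit=$n_layers                       # cannot offload more layers than exist
    fi
    echo "$fit"
}

# Placeholder example: ~24000 MB model, 60 layers, 8192 MB card, 1024 MB kept free.
layers_that_fit 24000 60 8192 1024    # prints 17
```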

This "based" model is extremely conversational, more than any other model that I've tried so far. It gets me to type much more than other models do because it asks questions, like what are your favorite movies, books, etc., and carries a pretty natural conversation.


Page last modified on July 08, 2023, at 04:29 pm