Using LLaMA with M1 Mac

2023/03/12

#ml #llm #python #cpu

Table of contents

LLaMA

The large language models wars

With the increasing interest in artificial intelligence and its use in everyday life, numerous exemplary models such as Meta’s LLaMA, OpenAI’s GPT-3, and Microsoft’s Kosmos-1 are joining the group of large language models (LLMs). The only problem with such models is the you can’t run these locally. Up until now. Thanks to Georgi Gerganov and his llama.cpp project it is possible to run Meta’s LLaMA on a single computer without a dedicated GPU.

Running LLaMA

There are multiple steps involved in running LLaMA locally on a M1 Mac. I am not sure about other platforms or other OSes so in this article we are focusing only the aforementioned combination.

Step 1: Downloading the model

The official way is to request the model via this web form and download it afterward.

There is a PR open in the repository, that describes an alternative way (that is probably a violation of the terms of service).

https://github.com/facebookresearch/llama/pull/73

Anyways, after you downloaded the model (or more like models because there are a few different kinds of models in the folder) you should have something like this:

❯ exa --tree
.
├── 7B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  └── params.json
├── 13B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  ├── consolidated.01.pth
│  └── params.json
├── 30B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  ├── consolidated.01.pth
│  ├── consolidated.02.pth
│  ├── consolidated.03.pth
│  └── params.json
├── 65B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  ├── consolidated.01.pth
│  ├── consolidated.02.pth
│  ├── consolidated.03.pth
│  ├── consolidated.04.pth
│  ├── consolidated.05.pth
│  ├── consolidated.06.pth
│  ├── consolidated.07.pth
│  └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk

As you can see the different models are in a different folders. Each model has a params.json that contains details about the model.

For example:

{
  "dim": 4096,
  "multiple_of": 256,
  "n_heads": 32,
  "n_layers": 32,
  "norm_eps": 1e-06,
  "vocab_size": -1
}

Step 2: Installing dependencies

Xcode must be installed to compile the C++ project. If you don’t have it, please do the following:

xcode-select --install

These are the dependencies for building the C++ project (pkgconfigand cmake).

brew install pkgconfig cmake

Finally, we can install Torch.

I assume you have Python 3.11 installed so you can create a virtual env like this:

/opt/homebrew/bin/python3.11 -m venv venv

Activating the venv. I am using fish. For other shells just drop the .fish suffix.

. venv/bin/activate.fish

After activated the venv we can install Pytorch:

pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu

If you are interesting leveraging the new Metal Performance Shaders (MPS) backend for GPU training acceleration you can verify it by running the following. This is not required for running LLaMA on you M1 though:

❯ python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True

Now lets compile llama.cpp.

Step 3: Compile LLaMA CPP

Cloning the repo:

git clone git@github.com:ggerganov/llama.cpp.git

After installing all the dependencies you can run make:

❯ make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main  -framework Accelerate
./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/llama-7B/ggml-model.bin)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize  -framework Accelerate

Step 4: Converting the model

Assuming you placed the models under models/ in the llama.cpp repo.

python convert-pth-to-ggml.py models/7B 1

You should see an output like this:

{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts =  1
Processing part  0
Processing variable: tok_embeddings.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
Processing variable: norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
  Converting to float32
Processing variable: output.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wq.weight with shape:  torch.Size([4096, 4096])  and type:  torch.f
loat16
Processing variable: layers.0.attention.wk.weight with shape:  torch.Size([4096, 4096])  and type:  torch.f
loat16
Processing variable: layers.0.attention.wv.weight with shape:  torch.Size([4096, 4096])  and type:  torch.f
loat16
Processing variable: layers.0.attention.wo.weight with shape:  torch.Size([4096, 4096])  and type:  torch.f
loat16
Processing variable: layers.0.feed_forward.w1.weight with shape:  torch.Size([11008, 4096])  and type:  tor
ch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape:  torch.Size([4096, 11008])  and type:  tor
ch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape:  torch.Size([11008, 4096])  and type:  tor
ch.float16
Processing variable: layers.0.attention_norm.weight with shape:  torch.Size([4096])  and type:  torch.float
16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part  0 )

The next step would be to perform the quantizing:

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

Output:

llama_model_quantize: loading model from './models/7B/ggml-model-f16.bin'
llama_model_quantize: n_vocab = 32000
llama_model_quantize: n_ctx   = 512
llama_model_quantize: n_embd  = 4096
llama_model_quantize: n_mult  = 256
llama_model_quantize: n_head  = 32
llama_model_quantize: n_layer = 32
llama_model_quantize: f16     = 1
...
layers.31.attention_norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
layers.31.ffn_norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
llama_model_quantize: model size  = 25705.02 MB
llama_model_quantize: quant size  =  4017.27 MB
llama_model_quantize: hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.134 0.125 0.104 0.078 0.053 0.033 0.019 0.022

main: quantize time = 29389.45 ms
main:    total time = 29389.45 ms

Step5: Running the model

❯ ./main -m ./models/7B/ggml-model-q4_0.bin \
        -t 8 \
        -n 128 \
        -p 'The first president of the USA was '
main: seed = 1678615879
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'The first president of the USA was '
main: number of tokens in prompt = 9
     1 -> ''
  1576 -> 'The'
   937 -> ' first'
  6673 -> ' president'
   310 -> ' of'
   278 -> ' the'
  8278 -> ' USA'
   471 -> ' was'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he
 would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.

main: mem per token = 14434244 bytes
main:     load time =  1311.74 ms
main:   sample time =   278.96 ms
main:  predict time =  7375.89 ms / 54.23 ms per token
main:    total time =  9216.61 ms

Enjoy!