
Ollama Server Logs, What Does All That Mean…?

Credits – screenshot from Ollama’s git: https://github.com/ollama/ollama

While building a workflow with n8n using a local Ollama model, I came across a lot of verbose logging from the Ollama server.

Logging is critical: it helps with debugging code, apps, and pretty much anything else. So here I'm sharing some pieces of logging I came across, with the hope that it helps you better understand how the Ollama server behaves behind the curtains, or at least what it tells you it's doing.

Overall Structure of the Ollama Server Logs

Ollama writes a lot of log messages. Image created with Pollinations.ai

The whole log is included at the end of the post so you can see it in full. As you can see, there is a bunch of stuff in there that can tell you a lot about the model: how it works, what it does, and whether there is anything you could do to make it work better.

The log consists of several lines, each starting with a timestamp and a level (INFO, WARN) or a [GIN] prefix. These indicate different events occurring within the Ollama server. Here I will go through some of the main logging messages I received and what they mean. Hopefully, it will help you better understand what happens when you fire up the Ollama server.

The Golang Web Framework, or GIN

This is not the gin you drink, sorry, but rather a web framework for handling HTTP requests, written in Go (Golang). Even though it's hidden behind Ollama, it is fundamental to serving Ollama's API requests.

Image credits to the Gin project on Github: https://github.com/gin-gonic/gin

It uses a custom HTTP router (based on a radix tree) that is highly efficient in matching incoming requests to their corresponding handlers. This is crucial for an API like Ollama’s, which might handle a significant number of requests.

The [GIN] prefix on some log lines indicates that these specific log messages are being generated by Gin as it handles the incoming HTTP requests to Ollama’s API. You see it logging:

  • The HTTP method (GET, POST).
  • The HTTP status code (200).
  • The request path (/api/tags, /api/chat).
  • The time taken to process the request.
  • The client’s IP address.

Ollama uses the Gin Web Framework because it provides the necessary performance, features (like middleware and JSON handling), and ease of use for building a robust and efficient API to interact with its large language models.
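
To make that concrete, here is a minimal, hypothetical Gin server (not Ollama's actual code) whose default logger middleware prints lines in the same format as the [GIN] entries above:

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	// gin.Default() wires in the Logger and Recovery middleware,
	// which is what prints lines like:
	// [GIN] 2025/04/09 - 20:34:14 | 200 | 8.423565ms | 127.0.0.1 | GET "/api/tags"
	r := gin.Default()

	// A toy /api/tags handler, standing in for Ollama's real one.
	r.GET("/api/tags", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"models": []string{"gemma3:4b"}})
	})

	// Listen on a local port (11434 is Ollama's default API port).
	r.Run("127.0.0.1:11434")
}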

In the examples from my log:

Line 1: [GIN] 2025/04/09 - 20:34:14 | 200 |     8.423565ms |       127.0.0.1 | GET       "/api/tags"

  • [GIN]: Indicates this log line is from the Gin web framework used by Ollama’s API.
  • 2025/04/09 - 20:34:14: Timestamp of the event (Year/Month/Day – Hour:Minute:Second).
  • 200: HTTP status code. 200 OK means a successful request.
  • 8.423565ms: The request took 8.42 milliseconds to process…a lifetime…I know…
  • 127.0.0.1: The IP address of the client making the request (localhost because I’m hosting everything locally on my laptop).
  • GET: The HTTP method used for the request.
  • "/api/tags": The endpoint that was requested.

Ollama Memory Management in my Request

Since my computer is an old-ish MacBook Pro, I don't have the fancy memory swap that new Macs have, and this is captured by the section of the log below. It also tells you exactly how much memory you have available, how much memory the model needs, and how Ollama is going to manage and allocate it. Here you go:

  • msg="system memory" total="32.0 GiB" free="12.2 GiB" free_swap="0 B": Shows that my Mac has 32GB of total memory, 12.2GB is currently free, and there is no free swap space (told you, no fancy Nancy here…).
  • msg=offload library=cpu layers.requested=-1 layers.model=35 layers.offload=0 ...: This is technical information about how the model (gemma3:4b in this case) is being loaded:
    • library=cpu: The model will be processed on the CPU. layers.offload=0 confirms no layers are being offloaded to a GPU (which isn’t available or requested). While I do have a 4 GB GPU in my laptop, that wouldn’t be enough to load the model anyway…
    • layers.requested=-1: This usually means Ollama will try to load all layers it can within the available memory.
    • layers.model=35: The model has 35 layers. As a reference, the new Llama4 from Meta has 48 layers.
    • memory.available="[12.2 GiB]": The available system memory.
    • memory.required.full="5.3 GiB": The estimated memory required to load the entire model.
    • memory.required.partial="0 B": No partial loading is being done.
    • memory.required.kv="682.0 MiB": Memory required for key-value cache (used during generation).
    • memory.weights.total="2.3 GiB": Total size of the model weights (this is only the weights, not the whole model).
  • Other memory.* details: Provide a breakdown of memory usage for different components of the model.
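
As a rough illustration of where a number like memory.required.kv comes from, here is a hedged back-of-envelope estimate for a full-attention KV cache. It is only an upper bound: the figure Ollama actually reports is lower, partly because gemma3 uses sliding-window attention on most layers and Ollama's own accounting differs.

package main

import "fmt"

func main() {
	// Very rough full-attention KV-cache estimate:
	// 2 (K and V) * layers * context * kv_heads * head_dim * bytes per value.
	// Values below come from the metadata dump at the end of the post.
	const (
		layers       = 34   // gemma3.block_count
		ctx          = 8192 // --ctx-size passed to the runner
		kvHeads      = 4    // gemma3.attention.head_count_kv
		headDim      = 256  // gemma3.attention.key_length / value_length
		bytesPerElem = 2    // f16
	)
	bytes := 2 * layers * ctx * kvHeads * headDim * bytesPerElem
	fmt.Printf("~%.2f GiB upper bound\n", float64(bytes)/(1<<30))
	// Prints ~1.06 GiB; the 682 MiB in the log is smaller because most
	// gemma3 layers only attend over a 1024-token sliding window.
}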

Ollama Tokenizer Warning Message

Ollama warning message
Yes, lamas can give you warnings…they are quite nice… – Image generated with Pollinations.ai

The tokenizer is the component responsible for converting text into numerical tokens the model understands, and vice versa. Tokens are the chunks LLMs use to understand what you are asking…but you already knew that…

This specific warning just means that the model's metadata does not explicitly state whether an “end-of-text” token should be automatically added during tokenization. That is fine because the tokenizer likely has its own internal logic or default behavior for handling the end of sequences (my apologies if I butchered this). So, what does it look like?

Lines 4-9: time=... level=WARN source=ggml.go:152 msg="key not found" key=...

  • level=WARN: These are warnings, not critical errors, but they indicate that some expected configuration keys were not found in the model file.
  • source=ggml.go:152: Origin of the warning in the ggml code.
  • msg="key not found" key=... default=...: here is a message saying that it could not find information on the tokenizer in the metadata (e.g., tokenizer.ggml.add_eot_token, gemma3.attention.layer_norm_rms_epsilon) …It also shows the default value that will be used in its absence.
  • These warnings are common and often don’t indicate a serious problem with the model’s functionality. The model likely has sensible defaults for these missing keys.
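
Conceptually, the logic behind those warnings is just a lookup with a fallback. This is not Ollama's actual code, only a hypothetical sketch of the pattern:

package main

import (
	"fmt"
	"log"
)

// metadataBool looks up a boolean key in the model's metadata and falls back
// to a default when the key is missing, logging a warning much like ggml.go does.
func metadataBool(meta map[string]any, key string, def bool) bool {
	v, ok := meta[key]
	if !ok {
		log.Printf(`level=WARN msg="key not found" key=%s default=%v`, key, def)
		return def
	}
	return v.(bool)
}

func main() {
	meta := map[string]any{"tokenizer.ggml.add_bos_token": true}
	// This key exists, so the stored value is used.
	fmt.Println(metadataBool(meta, "tokenizer.ggml.add_bos_token", false))
	// This key is absent, so the default is used and a warning is logged.
	fmt.Println(metadataBool(meta, "tokenizer.ggml.add_eot_token", false))
}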

Ollama Server Logs

Here is where we get into the messages logged by the actual Ollama server during startup. In my case, it says that the model runner will listen on port 49221, that it will use 8 threads, and that one runner was loaded.

Lines 10-11: time=... level=INFO source=server.go:405 msg="starting llama server" ... and time=... level=INFO source=sched.go:451 msg="loaded runners" ...

These indicate the start of the actual language model process:

  • msg="starting llama server" cmd="...": Ollama is launching the runner process for the language model. The cmd shows the specific command being executed, including the path to the Ollama runner, the model file being loaded, context size, batch size, number of threads (8, indicating CPU usage), and the port the runner will listen on (49221).
  • msg="loaded runners" count=1: One runner process for the model has been loaded.

Error Messages Ollama Server

Errors do happen, and some error messages can pop up while spinning up an Ollama server. In my case, they were not something I had to worry about because the issue resolved itself. Technically, the message comes from line 614 of server.go in the Ollama server, and the status="llm server error" part indicates a temporary failure or timeout during one of the readiness checks while the main Ollama server was waiting for the runner process to become fully operational.

Since I am loading the model on my old MacBook, it could be that the “slowness” in loading the model triggered the error because of limited RAM or CPU. That said, the server loaded correctly a few lines down.

Lines 12-13: time=... level=INFO source=server.go:580 msg="waiting for llama runner to start responding" and time=... level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"

Ollama Runner …It’s Running…

Ollama server is running and listening on port 49221 – Image generated with Pollinations.ai

Lines 14-15: time=... level=INFO source=runner.go:816 msg="starting ollama engine" and time=... level=INFO source=runner.go:879 msg="Server listening on 127.0.0.1:49221"

This bit of the log gives us two messages originating from the runner.go file, from lines 816 and 879, respectively. They tell you that the Ollama engine is (finally) starting, and that it is ready and listening on your localhost on port 49221.

Gemma3 4b Coming Up

Lines 18-19: time=... level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=35 and load_backend: loaded CPU backend from ...

This section is quite straightforward: it offers information about the model, the number of tensors, the quantization, and whether it’s loaded on a CPU or GPU. In this case, gemma3:4b is a quantized model with 883 tensors (the num_key_values=35 counts metadata entries in the model file, not layers).

  • architecture=gemma3: Confirms the model architecture is Gemma3.
  • file_type=Q4_K_M: Indicates the quantization format of the model (a way to reduce its size and memory usage).
  • num_tensors=883 num_key_values=35: The number of tensors in the model and the number of metadata key/value pairs stored in the file.
  • load_backend: loaded CPU backend from ...: The CPU backend for processing the model has been loaded.
WHAT ARE TENSORS

Tensors are data structures, i.e., a defined way to store and represent data. Tensors can have 0 to many dimensions. A tensor with 0 dimensions is a single number, 1 dimension is a series/array of numbers, 2 dimensions is a matrix, and so on for n dimensions.
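
In Go terms, a minimal illustration of those dimensionalities could look like this (plain slices standing in for real tensor types):

package main

import "fmt"

func main() {
	// 0 dimensions: a single number (a scalar).
	scalar := 3.14

	// 1 dimension: a series/array of numbers (a vector).
	vector := []float64{1.0, 2.0, 3.0}

	// 2 dimensions: a matrix (rows and columns).
	matrix := [][]float64{
		{1.0, 2.0},
		{3.0, 4.0},
	}

	fmt.Println(scalar, vector, matrix)
	// Model weights like gemma3's 883 tensors are mostly 2-D (and higher-D)
	// blocks of numbers stored in the model file.
}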

Gemma3 4b Finally Loaded

At this point, things are going well, and the log gives me a bit more info about my (old) CPU and tells me that the model weights take up 3.6 GiB.

Lines 20-21: time=... level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 ... compiler=cgo(clang)

Line 22: time=... level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.6 GiB"

More specifically, this section tells me that my CPU supports something called SSE3, which stands for Streaming SIMD Extensions 3. SSE is a set of Single Instruction, Multiple Data (SIMD) extensions for the x86 instruction set architecture (used by both Intel and AMD CPUs). SIMD is what allows the CPU to run the same operation on multiple pieces of data at once, i.e., in parallel, with a single instruction. SSE3 is the third iteration of these extensions, introduced by Intel back in 2004.

And why would I care about it? Well, a big part of running LLMs and AI is being able to do a lot of math in parallel.
Ollama runs on the ggml tensor library, which detects the SSE3 support in my old CPU and says: “Hey, this doesn’t suck too much, I can actually run some (slow) inference here”.
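
If you are curious which of these SIMD extensions your own CPU exposes, Go's golang.org/x/sys/cpu package can report flags similar to the ones ggml prints. A small sketch, assuming an x86 machine:

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// These mirror some of the flags ggml reports in the log above.
	fmt.Println("SSE3: ", cpu.X86.HasSSE3)
	fmt.Println("SSSE3:", cpu.X86.HasSSSE3)
	fmt.Println("AVX:  ", cpu.X86.HasAVX)
	fmt.Println("AVX2: ", cpu.X86.HasAVX2)
	fmt.Println("FMA:  ", cpu.X86.HasFMA)
}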

Gemma3 Getting Ready to Run

Gemma3 ready to run on ollama server
Gemma3 4b is now (almost) ready to go! Image generated with OpenAI GPT-4o

Line 24: time=... level=INFO source=ggml.go:388 msg="compute graph" backend=CPU buffer_type=CPU

The compute graph is the computational structure that the ggml tensor library builds to represent how an LLM (Gemma3 for me) will process data. It’s like a series of instructions describing the mathematical operations and the flow of data (tensors) through the different layers of the neural network to generate an output. In my case, it has been prepared for my (slow) CPU, but it could just as well be built for execution on a GPU. That’s the “beauty” of a compute graph: it is very flexible and can run on multiple kinds of hardware.
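
As a purely conceptual picture (not ggml's real data structures), a compute graph is just a set of nodes describing operations and the tensors flowing between them:

package main

import "fmt"

// node is a toy stand-in for one operation in a compute graph:
// it names the op, the inputs it consumes, and the output it produces.
type node struct {
	op     string
	inputs []string
	output string
}

func main() {
	// A tiny, hypothetical slice of a transformer layer expressed as a graph.
	graph := []node{
		{op: "matmul", inputs: []string{"hidden", "w_q"}, output: "q"},
		{op: "matmul", inputs: []string{"hidden", "w_k"}, output: "k"},
		{op: "attention", inputs: []string{"q", "k", "v_cache"}, output: "attn_out"},
		{op: "matmul", inputs: []string{"attn_out", "w_o"}, output: "layer_out"},
	}
	// The backend (CPU here, GPU elsewhere) just walks the graph in order.
	for _, n := range graph {
		fmt.Printf("%s(%v) -> %s\n", n.op, n.inputs, n.output)
	}
}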

Line 31: time=... level=INFO source=server.go:619 msg="llama runner started in 5.66 seconds"

And now we are ready to go!

Test API Request to Ollama Server via n8n

Basic LLM chain in n8n using local Ollama with Gemma3 4b

Now that things seem to be set up, it’s time to test it out. In the full post about using Ollama locally with n8n you can see the setup, so here I’ll just show the backend log from the server when a call is made.
In this case, a call is just a simple message in the n8n chat, which triggers the LLM chain with Ollama’s Gemma3. I made two requests.
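
If you want to reproduce the same kind of request without n8n, a minimal Go call straight to Ollama's /api/chat endpoint looks roughly like this (a sketch, assuming the default port 11434 and the gemma3:4b tag):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Build the same kind of request n8n sends to the /api/chat endpoint.
	payload, _ := json.Marshal(map[string]any{
		"model":  "gemma3:4b",
		"stream": false, // return a single JSON response instead of a stream
		"messages": []map[string]string{
			{"role": "user", "content": "Say hello in one short sentence."},
		},
	})

	resp, err := http.Post("http://127.0.0.1:11434/api/chat", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The server logs this call as a [GIN] POST "/api/chat" line like the one below.
	var out struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Println(out.Message.Content)
}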

Lines 52-53: [GIN] 2025/04/09 - 20:34:57 | 200 | 26.354405708s |       127.0.0.1 | POST     "/api/chat"

  • POST     "/api/chat": This tells me that an HTTP POST request was made to the /api/chat endpoint. This is basically the request made to initiate the chat interation.
  • 26.354405708s : This request took approximately 26.35 seconds to process. This is the total time it took for the Ollama server to receive the request, process it (which includes the LLM generating a response), and send that response back to me.

In Summary:

This Ollama server log shows that the server started correctly, that it loaded the gemma3:4b model onto the CPU (as no GPU offloading was possible, sorry…), that some non-critical warnings about missing metadata keys in the model file came up, and finally, that the test call via the n8n nodes worked well.

Do you use Ollama server? How do you spin it up? Do you run it locally? Let me know, I would love to hear it.

Full Ollama Server Log

[GIN] 2025/04/09 - 20:34:14 | 200 |    8.423565ms |       127.0.0.1 | GET      "/api/tags"
time=2025-04-09T20:34:31.157-04:00 level=INFO source=server.go:105 msg="system memory" total="32.0 GiB" free="12.2 GiB" free_swap="0 B"
time=2025-04-09T20:34:31.159-04:00 level=INFO source=server.go:138 msg=offload library=cpu layers.requested=-1 layers.model=35 layers.offload=0 layers.split="" memory.available="[12.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.3 GiB" memory.required.partial="0 B" memory.required.kv="682.0 MiB" memory.required.allocations="[5.3 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="517.0 MiB" memory.graph.partial="1.0 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-04-09T20:34:31.313-04:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-09T20:34:31.323-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-09T20:34:31.323-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-09T20:34:31.323-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-09T20:34:31.323-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-09T20:34:31.323-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-09T20:34:31.324-04:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/seb/.ollama/models/blobs/sha256-377655e65351a68cddfbd69b7c8dc60c1890466254628c3e494661a52c2c5ada --ctx-size 8192 --batch-size 512 --threads 8 --no-mmap --parallel 4 --port 49221"
time=2025-04-09T20:34:31.328-04:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-09T20:34:31.328-04:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-09T20:34:31.329-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-09T20:34:31.369-04:00 level=INFO source=runner.go:816 msg="starting ollama engine"
time=2025-04-09T20:34:31.369-04:00 level=INFO source=runner.go:879 msg="Server listening on 127.0.0.1:49221"
time=2025-04-09T20:34:31.513-04:00 level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
time=2025-04-09T20:34:31.513-04:00 level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
time=2025-04-09T20:34:31.513-04:00 level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=35
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
time=2025-04-09T20:34:31.549-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2025-04-09T20:34:31.556-04:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="3.6 GiB"
time=2025-04-09T20:34:31.583-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
time=2025-04-09T20:34:36.811-04:00 level=INFO source=ggml.go:388 msg="compute graph" backend=CPU buffer_type=CPU
time=2025-04-09T20:34:36.819-04:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-04-09T20:34:36.838-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-04-09T20:34:36.838-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-04-09T20:34:36.838-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-04-09T20:34:36.838-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-04-09T20:34:36.838-04:00 level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-04-09T20:34:36.984-04:00 level=INFO source=server.go:619 msg="llama runner started in 5.66 seconds"
llama_model_loader: loaded meta data with 34 key-value pairs and 883 tensors from /Users/seb/.ollama/models/blobs/sha256-377655e65351a68cddfbd69b7c8dc60c1890466254628c3e494661a52c2c5ada (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv   1:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv   2:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv   3:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv   4:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv   5:                         gemma3.block_count u32              = 34
llama_model_loader: - kv   6:                      gemma3.context_length u32              = 8192
llama_model_loader: - kv   7:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv   8:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv   9:         gemma3.vision.attention.head_count u32              = 16
llama_model_loader: - kv  10: gemma3.vision.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                  gemma3.vision.block_count u32              = 27
llama_model_loader: - kv  12:             gemma3.vision.embedding_length u32              = 1152
llama_model_loader: - kv  13:          gemma3.vision.feed_forward_length u32              = 4304
llama_model_loader: - kv  14:                   gemma3.vision.image_size u32              = 896
llama_model_loader: - kv  15:                 gemma3.vision.num_channels u32              = 3
llama_model_loader: - kv  16:                   gemma3.vision.patch_size u32              = 14
llama_model_loader: - kv  17:                       general.architecture str              = gemma3
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:           tokenizer.ggml.add_padding_token bool             = false
llama_model_loader: - kv  21:           tokenizer.ggml.add_unknown_token bool             = false
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,514906]  = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262145]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262145]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,262145]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  31:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  479 tensors
llama_model_loader: - type  f16:  165 tensors
llama_model_loader: - type q4_K:  205 tensors
llama_model_loader: - type q6_K:   34 tensors
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 7
load: token to piece cache size = 1.9446 MB
[GIN] 2025/04/09 - 20:34:57 | 200 | 26.354405708s |       127.0.0.1 | POST     "/api/chat"

Seb

I love AI and automations, I enjoy seeing how it can make my life easier. I have a background in computational sciences and worked in academia, industry and as consultant. This is my journey about how I learn and use AI.
