This guide demonstrates how to launch an Intel ipex-llm Docker container, start a vLLM API server configured for the Qwen3-8B model with tool-calling capabilities, query the server using curl, and interpret a sample response. Then, we’ll develop a simple agent loop with tool calling using the amazing litellm library.
This setup enables running large language models (LLMs) on Intel XPUs with features like automatic tool choice and reasoning parsing. All commands assume a Linux environment with Docker installed and access to Intel hardware (e.g., via /dev/dri).
Before entering the container, download the Qwen3-8B model to your host’s Hugging Face cache directory using the huggingface-cli. This ensures the model is pre-fetched and available when the container mounts the cache volume, speeding up the server startup.
huggingface-cli download Qwen/Qwen3-8B
This command requires the huggingface-hub package installed on the host (install via pip install huggingface-hub if needed).
The model files will be saved to ~/.cache/huggingface/hub/ (or equivalent), which is mounted into the container in the next step.
If you encounter authentication issues, log in with huggingface-cli login using a Hugging Face token.
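If you prefer doing the pre-fetch from Python, the huggingface_hub package offers snapshot_download, which writes into the same cache directory as huggingface-cli. A minimal sketch:

# prefetch.py - download Qwen3-8B into the local Hugging Face cache
from huggingface_hub import snapshot_download

# downloads (or resumes) all model files into ~/.cache/huggingface/hub/
snapshot_download(repo_id="Qwen/Qwen3-8B")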
Once the download completes (it may take time depending on your connection), proceed to the next step.
Now, start an interactive Docker container using the intelanalytics/ipex-llm-serving-xpu:latest image. This mounts the necessary directories for model caching and data persistence, allocates memory and shared memory, and uses host networking for simplicity.
docker run -it \
-v ~/llm_root:/root \
-v ~/.cache/huggingface/:/root/.cache/huggingface \
--rm \
--net=host \
--device=/dev/dri \
--name ipex \
--memory=40g \
--shm-size 16g \
--entrypoint bash \
intelanalytics/ipex-llm-serving-xpu:latest
Key Flags Explained:
- -v: Mounts volumes for the Hugging Face cache (to reuse downloaded models) and a custom llm_root directory to keep a .bash_history.
- --rm: Automatically removes the container on exit.
- --net=host: Uses the host’s network stack (e.g., for accessing localhost:8000).
- --device=/dev/dri: Grants access to Intel GPU devices.
- --memory=40g and --shm-size 16g: Limits memory and expands shared memory for large models.

Once inside the container, proceed to the next step.
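Optionally, you can first sanity-check that the container actually sees the GPU. A minimal sketch, assuming the image ships PyTorch with the Intel XPU backend (which ipex-llm builds on):

# xpu_check.py - run inside the container to confirm the XPU is visible
import torch
import intel_extension_for_pytorch  # registers the torch.xpu backend

print('XPU available:', torch.xpu.is_available())
print('XPU device count:', torch.xpu.device_count())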
Inside the container, launch the vLLM OpenAI-compatible API server using the ipex-llm module. This loads the Qwen/Qwen3-8B model, enables tool calling with Hermes-style parsing, and configures it for XPU (Intel GPU) inference.
python3 -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--model Qwen/Qwen3-8B \
--served-model-name coder \
--gpu-memory-utilization 0.4 \
--device xpu \
--dtype float16 \
--max-model-len 4096 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--reasoning-parser deepseek_r1
Key Arguments Explained:
- --model: Specifies the Hugging Face model path (downloads if not cached).
- --served-model-name: Alias for the model in API requests (e.g., “coder”).
- --gpu-memory-utilization 0.4: Reserves 40% of XPU memory for the model.
- --device xpu: Targets Intel XPU hardware.
- --dtype float16: Uses half-precision for efficiency.
- --max-model-len 4096: Limits context length to 4096 tokens.
- --enable-auto-tool-choice: Allows the model to decide when to call tools.
- --tool-call-parser hermes and --reasoning-parser deepseek_r1: Custom parsers for structured tool calls and reasoning (e.g., <think> tags).

The server will start on http://localhost:8000 (due to host networking).
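Once the weights are loaded, you can confirm the server is answering by listing the served models through the standard OpenAI-compatible /v1/models endpoint. A minimal sketch using only the Python standard library:

# check_server.py - verify the API server is up and serving "coder"
import json
import urllib.request

# host networking means the container's port 8000 is the host's port 8000
with urllib.request.urlopen('http://localhost:8000/v1/models') as resp:
    models = json.load(resp)

# expect the --served-model-name alias ("coder") to appear here
print([m['id'] for m in models.get('data', [])])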
In a separate terminal (outside the container), send a chat completion request via curl. This example prompts the model to use a file_read tool to read a file named proc.py. Define the tool schema in the request payload.
curl http://localhost:8000/v1/chat/completions \
-H 'content-type: application/json' \
--data-raw '{
  "model": "coder",
  "messages": [
    {
      "type": "plain",
      "role": "user",
      "content": "use the file_read tool to read proc.py"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "file_read",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {
              "type": "string",
              "description": "path of the file to read"
            }
          }
        }
      }
    }
  ]
}'
Request Breakdown:
- "model": "coder" matches the --served-model-name alias configured on the server.
- "messages" contains a single user message asking the model to use the file_read tool on proc.py.
- "tools" declares the file_read function with a JSON Schema for its single path parameter; the model never executes anything itself, it only emits a structured call for your application to run.

The server responds with a JSON object indicating the model has decided to call the tool. It includes the model’s reasoning in <think> tags (via the reasoning parser) and a tool_calls array with the function invocation.
{
  "id": "chatcmpl-20d2b9b36c8a48769afa3826b89cbdf6",
  "object": "chat.completion",
  "created": 1763052703,
  "model": "coder",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<think>\nOkay, the user wants me to use the file_read tool to read the file proc.py. Let me check the tools provided. The available tool is file_read, which takes a path parameter. So I need to call file_read with the path set to \"proc.py\". I should make sure the arguments are correctly formatted as a JSON object. Alright, that's straightforward. Just specify the name as \"file_read\" and the arguments with \"path\": \"proc.py\".\n</think>\n\n",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-f35bcf52b63344078076df301cd8514b",
            "type": "function",
            "function": {
              "name": "file_read",
              "arguments": "{\"path\": \"proc.py\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 154,
    "total_tokens": 274,
    "completion_tokens": 120,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
Response Highlights:
{"path": "proc.py"} as JSON-stringified arguments.This setup allows iterative tool use—your application would then execute the tool (e.g., read the file) and feed the result back in a follow-up message for the model to continue. For production, consider adding authentication and error handling.
First, install litellm and cli2 with pip, then run this script:
import json
import litellm
# feel free to enable this to have debug output (requests/responses) from litellm
# litellm._turn_on_debug()
# Defining a couple of functions for the LLM
def file_read(path):
    try:
        with open(path, 'r') as f:
            return f.read()
    except FileNotFoundError as exc:
        return str(exc)

def file_write(path, content):
    with open(path, 'w') as f:
        f.write(content)
# Map function name strings to actual function callbacks, we need this because
# the LLM will reply with function names
tool_callbacks = {
    'file_read': file_read,
    'file_write': file_write,
}
# OpenAI Tools description protocol
tools = [
    {
        "type": "function",
        "function": {
            "name": "file_read",
            "description": "Reads the entire contents of a file at the given path and returns it as a string. Use for loading text, configs, or data files.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "path of the file to read"
                    },
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "file_write",
            "description": "Writes the provided content to a file at the given path, overwriting if it exists. Use for saving outputs, configs, or generated data.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "path of the file to write"
                    },
                    "content": {
                        "type": "string",
                        "description": "content to write to the file"
                    },
                }
            }
        }
    },
]
# Context we're starting off with
messages = [
    {
        "type": "plain",
        "role": "system",
        "content": """
You are called by a command line agent, you will be asked to read and write
files, which you must do by tool calling.
""".strip(),
    },
    {
        "type": "plain",
        "role": "user",
        "content": """
Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
a new line to those contents with `1.1.1.1 example`.
Once you have the new contents, write the new contents into /tmp/newhosts
""".strip(),
    }
]
# we'll keep going until the llm has no function to call
while True:
    print({'querying with messages': messages})
    # call the llm with messages and tools
    response = litellm.completion(
        # litellm provider_name/model_name
        model='hosted_vllm/coder',
        # your docker endpoint
        api_base='http://localhost:8000/v1',
        messages=messages,
        tools=tools,
        temperature=0.1,
    )
    # extract message from the response
    message = response.choices[0].message
    print({'LLM replied with message': message.to_dict()})
    # add the llm's response message to the context
    messages.append(message.to_dict())
    if not message.tool_calls:
        print('We are finished working')
        break
    for tool_call in message.tool_calls:
        # get the actual python function from the tool name using our map
        tool_callback = tool_callbacks[tool_call.function.name]
        # call the callback with the parsed arguments
        tool_result = tool_callback(**json.loads(tool_call.function.arguments))
        # add the tool result to the context
        messages.append({
            'role': 'tool',
            'tool_call_id': tool_call.id,
            'content': tool_result,
        })
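One tweak worth considering: file_write returns None, so its tool message ends up with content: null (you can spot that in the trace below). Returning a short confirmation string keeps every tool result a plain string for the model to read, for example:

def file_write(path, content):
    with open(path, 'w') as f:
        f.write(content)
    # a string result avoids a null tool message in the context
    return f'wrote {len(content)} characters to {path}'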
Then, you will see that /tmp/newhosts contains the contents of your original /etc/hosts plus the new 1.1.1.1 example line.
And this will output:
querying with messages:
- content: "You are called by a command line agent, you will be asked to read and write
files, which you must do by tool calling."
role: system
type: plain
- content: "Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
a new line to those contents with `1.1.1.1 example`.
Once you have the new contents, write the new contents into /tmp/newhosts"
role: user
type: plain
OK, so far we’ve sent our base messages. The LLM responds with:
LLM replied with message:
content: '<think>
Okay, let''s see. The user wants me to read the /etc/hosts file, add a new line with "1.1.1.1 example", and then write the modified content to /tmp/newhosts.
First, I need to use the file_read function to get the current contents of /etc/hosts. Once I have that, I can append the new line. Wait, but how exactly? The user says "add a new line to those contents", so I should take the existing content, add the new line at the end, and then write it to the new file.
So step by step: Read /etc/hosts, then modify the content by adding "1.1.1.1 example" as a new line. Then write that modified content to /tmp/newhosts using file_write.
I need to make sure that when I read the file, I get the entire content as a string. Then, I can append the new line. Let me check if there are any specific requirements for the format. The user didn''t mention any, so just adding the line at the end should be fine.
So the first tool call is file_read with path /etc/hosts. Then, take the output, add the new line, and call file_write with path /tmp/newhosts and the content being the original content plus the new line.
</think>'
function_call: null
role: assistant
tool_calls:
- function:
arguments: '{"path": "/etc/hosts"}'
name: file_read
id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
type: function
See how it’s calling the file_read function with path=/etc/hosts?! Our code proceeds to call that function and append its output to the messages, so the next loop iteration queries the LLM again with those messages:
querying with messages:
- content: "You are called by a command line agent, you will be asked to read and write
files, which you must do by tool calling."
role: system
type: plain
- content: "Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
a new line to those contents with `1.1.1.1 example`.
Once you have the new contents, write the new contents into /tmp/newhosts"
role: user
type: plain
- content: '<think>
[...snipped...]
</think>
'
function_call: null
role: assistant
tool_calls:
- function:
arguments: '{"path": "/etc/hosts"}'
name: file_read
id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
type: function
- content: "# Static table lookup for hostnames.\n# See hosts(5) for details.\n127.0.0.1 localhost ethlocal tzlocal tzkt-local\n104.26.10.87 www.ygg.re \n188.114.97.2 yggtorrent.top\n162.159.137.6 thepiratebay.org\n"
role: tool
tool_call_id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
So basically, the new query shows the LLM that it has already called the file_read tool and got its output. Let’s see what it replies now:
LLM replied with message:
content: "<think>
Okay, I need to take the contents provided from the /etc/hosts file and add a new line with \"1.1.1.1 example\". Then, write the modified content to /tmp/newhosts.
First, let me check the original content. The user provided the following lines:
127.0.0.1 localhost ethlocal tzlocal tzkt-local
104.26.10.87 www.ygg.re
188.114.97.2 yggtorrent.top
162.159.137.6 thepiratebay.org
I need to append \"1.1.1.1 example\" as a new line. So I'll add that line at the end. Then, use the file_write function to save this to /tmp/newhosts.
Wait, but how exactly should the new line be formatted? The original lines have varying numbers of spaces. For example, the first line has \"127.0.0.1 localhost ethlocal tzlocal tzkt-local\" with multiple entries. The new line should probably follow the same style, but the user specified \"1.1.1.1 example\" as the new line. So I'll add that as a new line at the end of the content.
So combining the original content with the new line, then writing that to /tmp/newhosts. Let me make sure I don't miss any steps. The user wants the new content written to /tmp/newhosts, so after reading the original, I modify it by adding the new line, then write the entire modified content to the new file path.
</think>"
function_call: null
role: assistant
tool_calls:
- function:
arguments: '{"path": "/tmp/newhosts", "content": "127.0.0.1 localhost ethlocal tzlocal tzkt-local\n104.26.10.87 www.ygg.re \n188.114.97.2 yggtorrent.top\n162.159.137.6 thepiratebay.org\n1.1.1.1 example"}'
name: file_write
id: chatcmpl-tool-400d7b6895f74a67871be31e0dbf258a
type: function
As you can see, it took the contents of /etc/hosts that it previously got from the file_read function, added the 1.1.1.1 example line, and called file_write to write that new content to /tmp/newhosts.
So, we’re basically done here, but the loop is still going on, querying again with the whole set of messages:
querying with messages:
- content: "You are called by a command line agent, you will be asked to read and write
files, which you must do by tool calling."
role: system
type: plain
- content: "Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
a new line to those contents with `1.1.1.1 example`.
Once you have the new contents, write the new contents into /tmp/newhosts"
- content: '<think>[...]</think>'
function_call: null
role: assistant
tool_calls:
- function:
arguments: '{"path": "/etc/hosts"}'
name: file_read
id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
type: function
- content: "# Static table lookup for hostnames.
# See hosts(5) for details.
127.0.0.1 localhost ethlocal tzlocal tzkt-local
104.26.10.87 www.ygg.re
188.114.97.2 yggtorrent.top
162.159.137.6 thepiratebay.org"
role: tool
tool_call_id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
- content: "<think>[...]</think>"
function_call: null
role: assistant
tool_calls:
- function:
arguments: '{"path": "/tmp/newhosts", "content": "127.0.0.1 localhost ethlocal tzlocal tzkt-local\n104.26.10.87 www.ygg.re \n188.114.97.2 yggtorrent.top\n162.159.137.6 thepiratebay.org\n1.1.1.1 example"}'
name: file_write
id: chatcmpl-tool-400d7b6895f74a67871be31e0dbf258a
type: function
- content: null
role: tool
tool_call_id: chatcmpl-tool-400d7b6895f74a67871be31e0dbf258a
Now, the LLM replies with:
LLM replied with message:
content: '<think>
Okay, let me see. The user asked to read /etc/hosts, add a line "1.1.1.1 example", and write to /tmp/newhosts. I first called file_read to get the hosts content. Then I appended the new line and used file_write to save it to /tmp/newhosts. The tool response for file_write was a success, so now I need to confirm the action.
Wait, the user might expect a confirmation that the file was written. But since the task is done, I should just indicate that the new hosts file has been created. Let me check if there''s anything else needed. No, the steps are complete. The final answer should state that the new hosts file is written to /tmp/newhosts.
</think>
The new hosts file with the added entry has been written to `/tmp/newhosts`.'
function_call: null
role: assistant
tool_calls: null
Because there is no tool call in that response, the program prints "We are finished working" and exits.
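In a real agent you would also want guardrails that this minimal loop skips, for example capping the number of iterations so a confused model cannot call tools forever. A sketch of that guard, reusing the same calls as above:

MAX_TURNS = 10  # arbitrary safety cap

for turn in range(MAX_TURNS):
    response = litellm.completion(
        model='hosted_vllm/coder',
        api_base='http://localhost:8000/v1',
        messages=messages,
        tools=tools,
        temperature=0.1,
    )
    message = response.choices[0].message
    messages.append(message.to_dict())
    if not message.tool_calls:
        break  # the model is done
    # ... execute tool calls and append their results, as in the loop above
else:
    print('Stopped after reaching MAX_TURNS without a final answer')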
Left as an exercise to the reader: connect a 3d printer, expose tools for the LLM to pilot it, and ask the LLM to print guns and robots to go and conquer the world, or something like that, hope for time travelling.