Running VibeThinker-1.5B on an Android Samsung Tablet — Edge AI in Action

#3 · opened by Javedalam

Model: VibeThinker-1.5B (Qwen 2.5 Math finetune)

Quantization: 4-bit GGUF

Inference engine: llama-server under Termux

Temperature: 0.2

System prompt:

“You are a concise solver. Always stop after giving a single line beginning with ‘Final Answer:’. Never explain or continue reasoning.”
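For anyone reproducing this, here is a minimal sketch of the kind of llama-server launch command used under Termux; the model file name, context size, thread count, and port below are illustrative examples, not the exact values from this run:

# assumes llama.cpp is already built inside Termux and the 4-bit GGUF has been downloaded
./llama-server \
  -m ~/models/VibeThinker-1.5B-Q4_K_M.gguf \
  -c 4096 \
  --temp 0.2 \
  -t 4 \
  --host 127.0.0.1 --port 8080
# the built-in web UI is then reachable at http://127.0.0.1:8080 in the tablet's browser,
# and the system prompt above can be pasted into the UI's settings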

With this setup, the model successfully solved the differential equation

y'' - y = e^x,\quad y(0)=0,\quad y'(0)=1,

obtaining the solution

y(x)=\tfrac14 e^{x}-\tfrac14 e^{-x}+\tfrac12 x e^{x}.
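For anyone checking the result by hand: the characteristic roots are ±1, so the homogeneous part is C_1 e^{x} + C_2 e^{-x}; since e^{x} already solves the homogeneous equation, the particular solution has the form A x e^{x}, and the initial conditions fix the constants:

y_p = A x e^{x} \;\Rightarrow\; y_p'' - y_p = 2A e^{x} = e^{x} \;\Rightarrow\; A = \tfrac12,

y(0) = C_1 + C_2 = 0, \qquad y'(0) = C_1 - C_2 + \tfrac12 = 1 \;\Rightarrow\; C_1 = \tfrac14,\; C_2 = -\tfrac14.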

At a generation rate of roughly 3 tokens per second, VibeThinker-1.5B handled both the mathematical reasoning and the logical structure of the solution smoothly. For a model of roughly 1.5 billion parameters, this performance is remarkable. It demonstrates that, with improved quantization and refined prompting, Edge AI on mobile devices has become a practical reality—bringing private, on-device reasoning to everyday hardware.

(Screenshots of the session in Chrome: Screenshot_20251112_171407, Screenshot_20251112_171526, Screenshot_20251112_171717.)


Hi, what chat UI do you use here?


The official llama.cpp web UI.

Here is the Dockerfile for it (built for a Hugging Face Space):

FROM archlinux:latest

# Debian-specific; has no effect on this Arch-based image
ENV DEBIAN_FRONTEND=noninteractive

# passed from space environment
ARG MODEL_ID="unsloth/gemma-3-270m-it-GGUF"
ARG QUANT="Q8_0"
ARG SERVED_NAME="Gemma 270m"
ARG PARALLEL=4
ARG CTX_SIZE="4096"
ARG EMBEDDING_ONLY=0
ARG RERANK_ONLY=0

# llama.cpp env configs
ENV LLAMA_ARG_HF_REPO="${MODEL_ID}"
ENV LLAMA_ARG_CTX_SIZE="${CTX_SIZE}"
ENV LLAMA_ARG_BATCH=512
ENV LLAMA_ARG_N_PARALLEL="${PARALLEL}"
ENV LLAMA_ARG_FLASH_ATTN=on
# ENV LLAMA_ARG_CACHE_TYPE_K="q8_0"
# ENV LLAMA_ARG_CACHE_TYPE_V="q4_1"
ENV LLAMA_ARG_MLOCK=1
ENV LLAMA_ARG_N_GPU_LAYERS=0
ENV LLAMA_ARG_HOST="0.0.0.0"
ENV LLAMA_ARG_PORT=7860
ENV LLAMA_ARG_ALIAS="${SERVED_NAME}"
ENV LLAMA_ARG_EMBEDDINGS=${EMBEDDING_ONLY}
ENV LLAMA_ARG_RERANKING=${RERANK_ONLY}
ENV LLAMA_ARG_ENDPOINT_METRICS=1

RUN pacman -Syu --noconfirm --overwrite '*'
RUN pacman -S base-devel git git-lfs cmake curl openblas openblas64 blas64-openblas python gcc-libs glibc --noconfirm --overwrite '*'

RUN mkdir -p /app && mkdir -p /.cache
# cache dir for llama.cpp to download models
RUN chmod -R 777 /.cache

WORKDIR /app
RUN git clone --depth 1 --single-branch --branch master https://github.com/ggml-org/llama.cpp.git
# RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git llama.cpp
WORKDIR /app/llama.cpp
RUN cmake -B build \
          -DGGML_LTO=ON \
          -DLLAMA_CURL=ON \
          -DLLAMA_BUILD_SERVER=ON \
          -DLLAMA_BUILD_EXAMPLES=ON \
          -DGGML_ALL_WARNINGS=OFF \
          -DGGML_ALL_WARNINGS_3RD_PARTY=OFF \
          -DGGML_BLAS=ON \
          -DGGML_BLAS_VENDOR=OpenBLAS \
          -DGGML_NATIVE=ON \
          -DGGML_LLAMAFILE=ON \
          -Wno-dev \
          -DCMAKE_BUILD_TYPE=Release
RUN cmake --build build --config Release --target llama-server -j $(nproc)

WORKDIR /app

EXPOSE 7860

CMD ["/app/llama.cpp/build/bin/llama-server", "--verbose-prompt", "--prio", "3"]

Really impressive work getting VibeThinker-1.5B running so smoothly on a Samsung tablet. Solving a differential equation correctly at ~3 tokens/sec with a 4-bit GGUF shows how far edge AI has come. This is a great example of practical, private on-device reasoning, and I'm excited to see where mobile-first inference goes next.
