## Downloads
To run Palimpsest locally you need the base model and at least one LoRA adapter.
| File | Description | Download |
|---|---|---|
| `saiga-nemo12b-base-q6k.gguf` (~9 GB) | Base model: Saiga Nemo 12B (Q6_K quantization) | [Download](/download/saiga-nemo12b-base-q6k.gguf) |
The base model is a GGUF quantization of IlyaGusev/saiga_nemo_12b. LoRA adapters were fine-tuned with QLoRA on author-specific corpora and converted to GGUF-LoRA format.
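As a rough sketch of that conversion step (not necessarily the exact command used for these files; all paths below are hypothetical), llama.cpp ships a `convert_lora_to_gguf.py` script for turning a PEFT/QLoRA adapter into GGUF-LoRA:

```sh
# Hypothetical paths: requires a llama.cpp checkout with its Python
# requirements installed, and the base model available locally.
python3 convert_lora_to_gguf.py \
  --base path/to/saiga_nemo_12b \
  --outfile pelevin.lora.gguf \
  path/to/pelevin-qlora-adapter
```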
## Setup guide
Palimpsest requires two components: llama.cpp (the inference server) and the Node.js app (the web UI).
### 1. Install llama.cpp

**macOS**

```sh
brew install llama.cpp
```

Requires Homebrew. Apple Silicon Macs get Metal GPU acceleration automatically.
**Linux**

```sh
# Build from source with CUDA (NVIDIA GPU)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# The server binary will be at:
# build/bin/llama-server
```
For CPU-only, omit `-DGGML_CUDA=ON`. For AMD GPUs, use `-DGGML_HIPBLAS=ON`.
**Windows**

```sh
# Option A: Pre-built binaries
# Download the latest release from:
# https://github.com/ggml-org/llama.cpp/releases
# Get the llama-*-bin-win-cuda-*.zip for NVIDIA GPU
# or llama-*-bin-win-avx2-*.zip for CPU-only

# Option B: Build with CMake + Visual Studio
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```
An NVIDIA GPU is recommended; GPU builds require the CUDA Toolkit. CPU-only inference works but is slow for a 12B model.
### 2. Download model files
Create a directory for the models and download the base model and adapters:
```sh
mkdir -p models
cd models

# Download from this server (or copy the files manually)
curl -O /download/saiga-nemo12b-base-q6k.gguf
curl -O /download/pelevin.lora.gguf
curl -O /download/lovecraft.lora.gguf
curl -O /download/marquez.lora.gguf
curl -O /download/sorokin.lora.gguf
```
### 3. Start llama-server
```sh
llama-server \
  -m models/saiga-nemo12b-base-q6k.gguf \
  --lora models/pelevin.lora.gguf \
  --lora models/lovecraft.lora.gguf \
  --lora models/marquez.lora.gguf \
  --lora models/sorokin.lora.gguf \
  --lora-init-without-apply \
  --port 8080 \
  --ctx-size 4096
```
The `--lora-init-without-apply` flag loads all adapters but applies none by default; adapter mixing is controlled per request via the chat UI.
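Beyond the chat UI, adapter scales can also be inspected and changed over HTTP. This is a sketch assuming llama.cpp's `/lora-adapters` server endpoint; adapter `id`s follow the `--lora` order given at startup:

```sh
# List loaded adapters and their current scales
# (|| true so the sketch doesn't fail if the server isn't up yet)
curl -s http://localhost:8080/lora-adapters || true

# Apply a 70/30 Pelevin/Lovecraft blend for subsequent requests
BLEND='[{"id":0,"scale":0.7},{"id":1,"scale":0.3},{"id":2,"scale":0.0},{"id":3,"scale":0.0}]'
curl -s -X POST http://localhost:8080/lora-adapters \
     -H 'Content-Type: application/json' -d "$BLEND" || true
```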
You need at least 16 GB of RAM (or VRAM) for the base model plus adapters.
### 4. Start the web app
```sh
git clone https://github.com/lambda-house/palimpsest.git
cd palimpsest
npm install
```
**macOS / Linux**

```sh
LLAMA_URL=http://localhost:8080 \
ADAPTERS="pelevin:Пелевин,lovecraft:Лавкрафт,marquez:Маркес,sorokin:Сорокин" \
node server.js
```
**Windows (cmd)**

```bat
set LLAMA_URL=http://localhost:8080
set ADAPTERS=pelevin:Пелевин,lovecraft:Лавкрафт,marquez:Маркес,sorokin:Сорокин
node server.js
```
In PowerShell, use the `$env:LLAMA_URL = "http://localhost:8080"` syntax instead.
### 5. Open the UI
Navigate to http://localhost:3000 in your browser. Use the Mixer sliders to blend author styles.
## Environment variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | Web server port |
| `LLAMA_URL` | `http://localhost:8080` | llama.cpp server URL |
| `ADAPTERS` | (none) | Comma-separated `id:Label` list matching `--lora` order |
| `MODELS_DIR` | (none) | Path to models directory (enables file downloads) |
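To make the `ADAPTERS` format concrete, here is how the string decomposes (illustration only; the actual parsing lives in `server.js`):

```sh
ADAPTERS="pelevin:Пелевин,lovecraft:Лавкрафт,marquez:Маркес,sorokin:Сорокин"
IFS=','
for pair in $ADAPTERS; do
  id=${pair%%:*}      # adapter id, matches a --lora file name
  label=${pair#*:}    # display label shown in the UI
  printf '%s -> %s\n' "$id" "$label"   # prints one "id -> label" line per adapter
done
```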