Palimpsest

Literary style mixing with LoRA adapters

Downloads

To run Palimpsest locally you need the base model and at least one LoRA adapter.

File                         Size    Description
saiga-nemo12b-base-q6k.gguf  ~9 GB   Base model: Saiga Nemo 12B (Q6_K quantization)

The base model is a GGUF quantization of IlyaGusev/saiga_nemo_12b. LoRA adapters were fine-tuned with QLoRA on author-specific corpora and converted to GGUF-LoRA format.

Setup guide

Palimpsest requires two components: llama.cpp (the inference server) and the Node.js app (the web UI).

1. Install llama.cpp

macOS (Homebrew):

brew install llama.cpp

Requires Homebrew. Apple Silicon Macs get Metal GPU acceleration automatically.

Linux (build from source):

# Build from source with CUDA (NVIDIA GPU)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# The server binary will be at:
# build/bin/llama-server

For CPU-only builds, omit -DGGML_CUDA=ON. For AMD GPUs, use -DGGML_HIPBLAS=ON (renamed to -DGGML_HIP=ON in newer llama.cpp versions).

Windows:

# Option A: Pre-built binaries
# Download the latest release from:
# https://github.com/ggml-org/llama.cpp/releases
# Get the llama-*-bin-win-cuda-*.zip for NVIDIA GPU
# or llama-*-bin-win-avx2-*.zip for CPU-only

# Option B: Build with CMake + Visual Studio
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

An NVIDIA GPU is recommended; GPU builds require the CUDA Toolkit. CPU-only inference works but is slow for a 12B model.

2. Download model files

Create a directory for the models and download the base model and adapters:

mkdir -p models
cd models

# Download from the Palimpsest server hosting these files
# (replace <server> with its address), or copy the files manually
curl -fLO http://<server>/download/saiga-nemo12b-base-q6k.gguf
curl -fLO http://<server>/download/pelevin.lora.gguf
curl -fLO http://<server>/download/lovecraft.lora.gguf
curl -fLO http://<server>/download/marquez.lora.gguf
curl -fLO http://<server>/download/sorokin.lora.gguf

3. Start llama-server

llama-server \
  -m models/saiga-nemo12b-base-q6k.gguf \
  --lora models/pelevin.lora.gguf \
  --lora models/lovecraft.lora.gguf \
  --lora models/marquez.lora.gguf \
  --lora models/sorokin.lora.gguf \
  --lora-init-without-apply \
  --port 8080 \
  --ctx-size 4096

The --lora-init-without-apply flag loads all adapters but doesn't apply any by default — adapter mixing is controlled per-request via the chat UI. You need at least 16 GB of RAM (or VRAM) for the base model plus adapters.
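Per-adapter scales are then set by POSTing to the server. A minimal sketch of building that payload, assuming llama.cpp's /lora-adapters endpoint (adapter ids follow the order of the --lora flags, starting at 0; the weights below are purely illustrative):

```javascript
// Build the payload llama-server expects on its /lora-adapters endpoint:
// an array of {id, scale} objects. Ids follow the --lora flag order,
// so 0 = pelevin, 1 = lovecraft, 2 = marquez, 3 = sorokin.
function buildScales(weights, order) {
  return order.map((name, id) => ({ id, scale: weights[name] ?? 0 }));
}

const order = ["pelevin", "lovecraft", "marquez", "sorokin"];
const payload = buildScales({ pelevin: 0.7, lovecraft: 0.3 }, order);
console.log(JSON.stringify(payload));
// → [{"id":0,"scale":0.7},{"id":1,"scale":0.3},{"id":2,"scale":0},{"id":3,"scale":0}]

// Applying the scales (requires a running llama-server):
// await fetch("http://localhost:8080/lora-adapters", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(payload),
// });
```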

4. Start the web app

git clone https://github.com/lambda-house/palimpsest.git
cd palimpsest
npm install
macOS / Linux:

LLAMA_URL=http://localhost:8080 \
ADAPTERS="pelevin:Пелевин,lovecraft:Лавкрафт,marquez:Маркес,sorokin:Сорокин" \
node server.js

Windows (cmd):

set LLAMA_URL=http://localhost:8080
set ADAPTERS=pelevin:Пелевин,lovecraft:Лавкрафт,marquez:Маркес,sorokin:Сорокин
node server.js

In PowerShell, use $env:LLAMA_URL = "http://localhost:8080" syntax instead.

5. Open the UI

Navigate to http://localhost:3000 in your browser. Use the Mixer sliders to blend author styles.
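Under the hood, each generation request can carry its own blend. A sketch of a per-request payload, assuming recent llama.cpp servers accept a lora field on /completion that overrides the globally applied scales for that request (the prompt, ids, and weights here are illustrative, not from the actual UI code):

```javascript
// Sketch: a /completion request body with per-request adapter scales.
// Ids follow the --lora flag order (0 = pelevin, 1 = lovecraft).
const body = {
  prompt: "Describe the fog rolling over the harbor.",
  n_predict: 256,
  lora: [
    { id: 0, scale: 0.5 },
    { id: 1, scale: 0.5 },
  ],
};
console.log(JSON.stringify(body));

// Sending it (requires a running llama-server):
// await fetch("http://localhost:8080/completion", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```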

Environment variables

Variable     Default                  Description
PORT         3000                     Web server port
LLAMA_URL    http://localhost:8080    llama.cpp server URL
ADAPTERS     (none)                   Comma-separated id:Label list matching --lora order
MODELS_DIR   (none)                   Path to models directory (enables file downloads)
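The ADAPTERS value maps adapter ids to display labels, and its position in the list must match the --lora flag order. A parsing sketch (the real server.js may do this differently):

```javascript
// Parse ADAPTERS ("id:Label,id:Label,...") into ordered entries.
// The list position gives the llama-server adapter index.
function parseAdapters(spec) {
  return spec.split(",").map((pair, index) => {
    const [id, label] = pair.split(":");
    return { index, id: id.trim(), label: label.trim() };
  });
}

const adapters = parseAdapters(
  "pelevin:Пелевин,lovecraft:Лавкрафт,marquez:Маркес,sorokin:Сорокин"
);
console.log(adapters[0]); // { index: 0, id: 'pelevin', label: 'Пелевин' }
```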