Installing KHOJ AI assistant using docker compose (local only)

posted on 2023-08-10

Introduction

Below I have written a reference guide on how to install a local-only (i.e. data won’t leave your environment) instance of the KHOJ AI assistant using docker compose, and how to set up Khoj’s Emacs plugin to use it as a local server instance.

Disk space requirements

There are some pretty hefty disk space requirements; I am noting them here for completeness. On my computer the installation took north of 11GB of disk space:

Disk space used:
% du -hs Applications/khoj/
5.4G    Applications/khoj/

plus the size of docker images:

% docker images
REPOSITORY             TAG       IMAGE ID       CREATED       SIZE
ghcr.io/khoj-ai/khoj   latest    5641a8f9b32d   9 hours ago   5.94GB

RAM usage starts at around 3.2GB but grows over time, and queries can take all of the CPU cores and max them out while an answer is being composed.
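
If you want to see what the running container actually consumes on your machine, a one-off docker stats snapshot is enough:

# snapshot of per-container CPU and memory usage (no live streaming)
docker stats --no-stream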

Configuration assumptions

Some assumptions for the instructions below:

The ~/Application/khoj/ directory will hold the docker compose project and all of KHOJ’s non-transient data.

I keep the org-mode files and PDF files that I want to have indexed in:

/home/adam/Cloud/GTD/Getting Things Done

and

/home/adam/Cloud/GTD/Indexed PDF files

directories respectively. Substitute the directories used below with ones matching your own files and directory structure.

I keep a pristine clone of the KHOJ source code repository in ~/Src/opensource/khoj/.

KHOJ docker compose guide

Clone the source repository:

cd ~/Src/opensource/
git clone https://github.com/khoj-ai/khoj
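
Optionally, note which revision you cloned; the docker image used below is tagged latest and moves over time, so this makes it easier to tell later which version you set up:

# print the short commit hash of the clone
git -C ~/Src/opensource/khoj rev-parse --short HEAD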

Make a directory for the modified sample docker compose project and its data:

mkdir -p ~/Application
cd ~/Application
mkdir khoj
cd khoj
cp ~/Src/opensource/khoj/docker-compose.yml .
cp ~/Src/opensource/khoj/config/khoj_docker.yml .
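
You may also want to pre-create the data directories that the docker compose file further below bind-mounts; Docker would create any missing ones automatically, but they would then be owned by root:

# run from ~/Application/khoj/
mkdir -p data/embeddings data/models data/cache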

Modify the copied KHOJ configuration file ~/Application/khoj/khoj_docker.yml to reflect the contents below.

The config also disables the on-by-default telemetry and changes the encoder to one better suited for multilingual documents:

app:
  should_log_telemetry: false
content-type:
  github: null
  notion: null
  org:
    compressed-jsonl: /data/embeddings/notes.jsonl.gz
    embeddings-file: /data/embeddings/note_embeddings.pt
    index_heading_entries: false
    input-files: null
    input-filter:
    - /data/org/**/*.org
  pdf:
    compressed-jsonl: /data/embeddings/pdf.jsonl.gz
    embeddings-file: /data/embeddings/pdf_embeddings.pt
    input-files: null
    input-filter:
    - /data/pdf/**/*.pdf
  plugins: null
processor:
  conversation:
    conversation-logfile: /data/embeddings/conversation_logs.json
    enable-offline-chat: true
    openai: null
search-type:
  asymmetric:
    cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
    encoder: paraphrase-multilingual-MiniLM-L12-v2
    model_directory: /data/models/asymmetric
  image:
    encoder: sentence-transformers/clip-ViT-B-32
    model_directory: /data/models/image_encoder
  symmetric: null
version: 0.0.0
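
The input-filter globs refer to paths inside the container; a quick way to make sure the host directories you are about to mount actually contain matching files is a plain find (substitute your own paths):

# counts of org and pdf files that will be visible to the indexer
find "/home/adam/Cloud/GTD/Getting Things Done" -name '*.org' | wc -l
find "/home/adam/Cloud/GTD/Indexed PDF files" -name '*.pdf' | wc -l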

Next, modify the sample docker compose project file ~/Application/khoj/docker-compose.yml to use the config and to mount the directories holding your notes and PDF documents for indexing. You definitely want to change the /home/adam/Cloud/GTD/... paths below to reflect yours.

version: "3.9"
services:
  server:
    image: ghcr.io/khoj-ai/khoj:latest
    # Uncomment the line below if you would like the server to keep running after a restart.
    #restart: unless-stopped
    ports:
      # Default Khoj port, changing it requires changing khoj-server-url Emacs package variable
      - "42110:42110"
    working_dir: /app
    volumes:
      - .:/app
      # Volumes pointing to org-mode (and other) files for indexing.
      # the provided khoj_docker.yml config expects data in:
      # /data/org and /data/pdf directories.
      # Please change the left side to point to your own directories
      # holding org-mode and pdf files for indexing.
      - /home/adam/Cloud/GTD/Getting Things Done:/data/org/
      - /home/adam/Cloud/GTD/Indexed PDF files:/data/pdf/
      # Embeddings and models are populated after the first run
      # You can set these volumes to point to empty directories on host
      - ./data/embeddings/:/data/embeddings/
      - ./data/models/:/root/.khoj/search/
      # Models and cache
      # (as we don't want the 3GB+ model files being downloaded each time we restart)
      - ./data/cache:/root/.cache/
      - ./data/models:/data/models/
    # Use 0.0.0.0 to explicitly set the host ip for the service on the container. https://pythonspeed.com/articles/docker-connection-refused/
    command: --host="0.0.0.0" --port=42110 -c=khoj_docker.yml -vv
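
Before launching anything, you can let docker compose validate and print the fully resolved project file, which catches YAML and indentation mistakes early:

# run from ~/Application/khoj/; prints the resolved project or an error
docker compose config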

The last step is to launch the KHOJ AI assistant using docker compose:

cd ~/Application/khoj/
docker compose up -d

After a while you should be able to access the Khoj instance at: http://localhost:42110/
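
The first start can take quite a while, as the embedding models are downloaded and your notes are indexed. You can follow the progress in the logs and then check that the server responds:

# follow the server logs during model download and indexing
docker compose logs -f server
# once the server is up, this should print 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:42110/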

Khoj Emacs guide

Now we can set up KHOJ’s Emacs plugin so we can interact with our local AI assistant from our Editor of Choice™ instead of a web browser 😉

The instructions below use straight.el:

(use-package khoj
  :after org
  :straight (khoj :type git :host github :repo "khoj-ai/khoj" :files (:defaults "src/interface/emacs/khoj.el"))
  ;; Not mentioned in the online quick start, but this prevents the Emacs package
  ;; from downloading and configuring a separate khoj instance and makes it use the
  ;; already running one instead.
  :config (setq khoj-auto-setup nil))

That’s all that’s needed to interact with the KHOJ AI assistant running locally from Emacs.

Conclusions

I won’t write any conclusions as I myself am still exploring how useful such a tool is for my own workflow. I hope these instructions are helpful in at least trying out AI on your own personal notes and drawing your own conclusions.

Happy hacking!