Note that modelPath is the only required parameter. For testing, you can supply it through the LLAMA_PATH environment variable.

interface LlamaCppInputs {
    modelPath: string;
    batchSize?: number;
    contextSize?: number;
    embedding?: boolean;
    f16Kv?: boolean;
    gbnf?: string;
    gpuLayers?: number;
    jsonSchema?: object;
    logitsAll?: boolean;
    maxTokens?: number;
    prependBos?: boolean;
    seed?: null | number;
    temperature?: number;
    threads?: number;
    topK?: number;
    topP?: number;
    trimWhitespaceSuffix?: boolean;
    useMlock?: boolean;
    useMmap?: boolean;
    vocabOnly?: boolean;
}
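As a rough sketch of the note above, the snippet below builds an inputs object whose modelPath comes from LLAMA_PATH. The type LlamaCppInputsSketch and the helper inputsFromEnv are hypothetical names introduced only for this illustration; they mirror a few fields of the interface and are not part of the library. The environment is passed in as a plain object so the example does not depend on any particular runtime.

```typescript
// Hypothetical local mirror of a few LlamaCppInputs fields (illustration only).
interface LlamaCppInputsSketch {
  modelPath: string;
  contextSize?: number;
  maxTokens?: number;
  temperature?: number;
}

// Hypothetical helper: reads the model path from an environment-like object
// and merges any optional settings on top of it.
function inputsFromEnv(
  env: Record<string, string | undefined>,
  overrides: Omit<LlamaCppInputsSketch, "modelPath"> = {}
): LlamaCppInputsSketch {
  const modelPath = env["LLAMA_PATH"];
  if (!modelPath) {
    // modelPath is the only required parameter, so fail early without it.
    throw new Error("LLAMA_PATH is not set");
  }
  return { ...overrides, modelPath };
}
```

In a Node.js test this might be called as inputsFromEnv(process.env, { temperature: 0.8 }), assuming LLAMA_PATH points at a model file on disk.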

Hierarchy

  • LlamaBaseCppInputs
  • Toolkit
    • LlamaCppInputs

Properties

modelPath: string

Path to the model on the filesystem.

batchSize?: number

Prompt processing batch size.

contextSize?: number

Text context size.

embedding?: boolean

Embedding mode only.

f16Kv?: boolean

Use fp16 for KV cache.

gbnf?: string

GBNF string to be used to format output. Also known as grammar.

gpuLayers?: number

Number of layers to store in VRAM.

jsonSchema?: object

JSON schema to be used to format output. Also known as grammar.
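To make the two output-format options concrete, here is a small sketch of the kind of value each one takes. OutputFormatSketch is a hypothetical local type that mirrors only the gbnf and jsonSchema fields; the grammar and schema contents are made-up examples. Which JSON Schema features are honoured, and whether both options may be set together, is not stated here, so the two are kept in separate objects.

```typescript
// Hypothetical local mirror of the two output-format fields (illustration only).
interface OutputFormatSketch {
  gbnf?: string;
  jsonSchema?: object;
}

// A GBNF grammar string: the root rule allows only the literal "yes" or "no".
const yesNoFormat: OutputFormatSketch = {
  gbnf: 'root ::= "yes" | "no"',
};

// A JSON schema object describing the desired shape of the output.
const personFormat: OutputFormatSketch = {
  jsonSchema: {
    type: "object",
    properties: {
      name: { type: "string" },
      age: { type: "number" },
    },
    required: ["name", "age"],
  },
};
```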

logitsAll?: boolean

The llama_eval() call computes all logits, not just the last one.

maxTokens?: number

prependBos?: boolean

Add the beginning-of-sentence token.

seed?: null | number

If null, a random seed will be used.

temperature?: number

Controls the randomness of the responses, e.g. 0.1 is close to deterministic, 0.8 is balanced, and 1.5 is creative; 0 disables.
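For intuition about how temperature affects randomness, the sketch below shows the textbook technique of dividing logits by the temperature before applying softmax. It is not taken from the library; softmaxWithTemperature is a hypothetical name, and treating a temperature of 0 as a greedy pick of the single most likely token is an assumption about what "disables" means.

```typescript
// Illustrative only: textbook temperature scaling, not the library's sampler.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (logits.length === 0) return [];
  const maxLogit = Math.max(...logits);
  if (temperature <= 0) {
    // Assumption: temperature 0 means greedy decoding, so all probability
    // mass goes to the first token with the highest logit.
    const argmax = logits.indexOf(maxLogit);
    return logits.map((_, i) => (i === argmax ? 1 : 0));
  }
  // Subtract the maximum logit before exponentiating for numerical stability.
  const exps = logits.map((x) => Math.exp((x - maxLogit) / temperature));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution and make less likely tokens more competitive.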

threads?: number

Number of threads to use to evaluate tokens.

topK?: number

Consider the n most likely tokens, where n is 1 to vocabulary size, 0 disables (uses full vocabulary). Note: only applies when temperature > 0.

topP?: number

Selects the smallest token set whose cumulative probability exceeds P, where P is between 0 and 1; 1 disables. Note: only applies when temperature > 0.
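The topK and topP descriptions can be illustrated with a minimal sketch of both filters over a small probability array. The function names keepTopK and keepTopP are hypothetical, and this is a generic illustration of the two techniques rather than the sampler the library actually runs. The top-p loop stops once the running total reaches or exceeds P, which is one common convention and an assumption here.

```typescript
// Returns token indices ordered from most to least likely.
function byDescendingProbability(probs: number[]): number[] {
  return probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
}

// Illustrative top-k: keep the k most likely tokens; k <= 0 keeps everything.
function keepTopK(probs: number[], k: number): number[] {
  const order = byDescendingProbability(probs);
  return k <= 0 ? order : order.slice(0, k);
}

// Illustrative top-p: keep the smallest prefix of the ranked tokens whose
// cumulative probability reaches or exceeds p; p >= 1 keeps everything.
function keepTopP(probs: number[], p: number): number[] {
  const order = byDescendingProbability(probs);
  if (p >= 1) return order;
  const kept: number[] = [];
  let cumulative = 0;
  for (const index of order) {
    kept.push(index);
    cumulative += probs[index];
    if (cumulative >= p) break;
  }
  return kept;
}
```

Both functions return the indices of the surviving tokens; a sampler would then renormalize the probabilities of those tokens before drawing one.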

trimWhitespaceSuffix?: boolean

Trim whitespace from the end of the generated text. Disabled by default.

useMlock?: boolean

Force system to keep model in RAM.

useMmap?: boolean

Use mmap if possible.

vocabOnly?: boolean

Only load the vocabulary, no weights.
