
Create evaluation

POST /workspaces/{workspace_id}/evaluations

Create a new evaluation

workspace_id (required): string, format: uuid. Workspace ID

object
name (required): string
description: string
evaluation_type (required): string. Allowed values: benchmark, task_specific, operational, safety, comparison

targets (required): Array<object>, 1 to 10 items
target_type: string. Allowed values: model, stack, scenario, comparison
target_id: string, format: uuid. ID of stack or scenario (if applicable)
model_provider: string. Model provider for direct model evaluation
model_id: string. Model ID for direct model evaluation
label: string. Display label for this target

benchmarks: Array<object>
benchmark_type: string. Allowed values: mmlu, human_eval, gsm8k, truthful_qa, hellaswag, winogrande, arc, custom
subset: string. Specific benchmark subset
num_samples: integer. Number of samples to evaluate

custom_eval: object
prompt_template: string. Prompt template for evaluation
criteria: Array<object>
name: string
weight: number
description: string

judge_config: object
provider: string
model: string

methodology: object
sample_size: integer, default: 100
confidence_level: number, default: 0.95
random_seed: integer
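
The request fields above can be assembled and checked on the client before sending. The following is a minimal sketch, not an official client: the helper names `build_evaluation_payload` and `build_create_request`, the host `https://api.example.com`, the bearer-token header, and the provider and model strings are all hypothetical placeholders. It also assumes that `methodology` sits at the top level of the body next to `targets`, which the field listing suggests but does not state.

```python
import json
import urllib.request
import uuid

# Allowed values copied from the field listing above.
EVALUATION_TYPES = {"benchmark", "task_specific", "operational", "safety", "comparison"}
TARGET_TYPES = {"model", "stack", "scenario", "comparison"}


def build_evaluation_payload(name, evaluation_type, targets,
                             description=None, methodology=None):
    """Check the documented constraints and return a JSON-ready body."""
    if not name:
        raise ValueError("name is required")
    if evaluation_type not in EVALUATION_TYPES:
        raise ValueError(f"unsupported evaluation_type: {evaluation_type!r}")
    if not 1 <= len(targets) <= 10:
        raise ValueError("targets must contain between 1 and 10 items")
    for target in targets:
        target_type = target.get("target_type")
        if target_type is not None and target_type not in TARGET_TYPES:
            raise ValueError(f"unsupported target_type: {target_type!r}")

    payload = {
        "name": name,
        "evaluation_type": evaluation_type,
        "targets": list(targets),
    }
    # Optional fields are only sent when the caller supplies them.
    if description is not None:
        payload["description"] = description
    if methodology is not None:
        payload["methodology"] = methodology
    return payload


def build_create_request(base_url, workspace_id, payload, token):
    """Build (but do not send) the POST request for this endpoint."""
    uuid.UUID(workspace_id)  # raises ValueError if workspace_id is not a UUID
    url = f"{base_url.rstrip('/')}/workspaces/{workspace_id}/evaluations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the auth scheme is not described on this page.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical usage: two directly evaluated models compared against each other.
example_payload = build_evaluation_payload(
    name="Model comparison",
    evaluation_type="comparison",
    targets=[
        {"target_type": "model", "model_provider": "example-provider",
         "model_id": "example-model-a", "label": "A"},
        {"target_type": "model", "model_provider": "example-provider",
         "model_id": "example-model-b", "label": "B"},
    ],
    methodology={"sample_size": 100, "confidence_level": 0.95, "random_seed": 42},
)
example_request = build_create_request(
    "https://api.example.com",               # placeholder host
    "00000000-0000-0000-0000-000000000000",  # placeholder workspace ID
    example_payload,
    token="YOUR_TOKEN",
)
# urllib.request.urlopen(example_request) would send it; that call is left out
# here because the host above is a placeholder.
```

Validating the `targets` length and the enumerated values locally catches the most common mistakes before a round trip, but the server remains the authority on what it accepts.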

Evaluation created

object
id: string, format: uuid
workspace_id: string, format: uuid
name: string
description: string
evaluation_type: string. Allowed values: benchmark, task_specific, operational, safety, comparison

targets: Array<object>
target_type: string. Allowed values: model, stack, scenario, comparison
target_id: string, format: uuid. ID of stack or scenario (if applicable)
model_provider: string. Model provider for direct model evaluation
model_id: string. Model ID for direct model evaluation
label: string. Display label for this target

benchmarks: Array<object>
benchmark_type: string. Allowed values: mmlu, human_eval, gsm8k, truthful_qa, hellaswag, winogrande, arc, custom
subset: string. Specific benchmark subset
num_samples: integer. Number of samples to evaluate

custom_eval: object
prompt_template: string. Prompt template for evaluation
criteria: Array<object>
name: string
weight: number
description: string

judge_config: object
provider: string
model: string

methodology: object
sample_size: integer, default: 100
confidence_level: number, default: 0.95
random_seed: integer

status: string. Allowed values: pending, running, completed, failed, cancelled
progress_percent: number
started_at: string, format: date-time
completed_at: string, format: date-time
error_message: string

results: object
target_results: Array<object>
target_label: string
metrics: object with additional properties (key: number)
benchmark_scores: object with additional properties (key: number)
comparisons: Array<object>
target_a: string
target_b: string
metric: string
delta: number
winner: string
overall_summary: string

created_at: string, format: date-time
updated_at: string, format: date-time
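
A client will usually check `status` first and read `results` only once the evaluation has finished. The sketch below operates on an already-decoded response dictionary. The helper names `is_finished` and `summarize` are hypothetical, the sample values are made up for illustration, and two things are assumptions rather than documented behavior: that `completed`, `failed`, and `cancelled` are terminal states (inferred from their names), and that `target_results` and `comparisons` are nested inside `results`.

```python
# Assumption: these three statuses mean the evaluation will not change again.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def is_finished(evaluation):
    """Return True once the evaluation has reached a terminal status."""
    return evaluation.get("status") in TERMINAL_STATUSES


def summarize(evaluation):
    """Produce a short human-readable summary of an evaluation response."""
    status = evaluation.get("status")
    if status == "failed":
        return f"failed: {evaluation.get('error_message') or 'no error message'}"
    if status != "completed":
        progress = evaluation.get("progress_percent") or 0
        return f"{status}: {progress}% done"

    results = evaluation.get("results") or {}
    lines = []
    # One line per target, listing its metrics in a stable (sorted) order.
    for target in results.get("target_results") or []:
        metrics = target.get("metrics") or {}
        formatted = ", ".join(f"{key}={value}" for key, value in sorted(metrics.items()))
        lines.append(f"{target.get('target_label')}: {formatted}")
    # One line per pairwise comparison.
    for comparison in results.get("comparisons") or []:
        lines.append(
            f"{comparison.get('metric')}: {comparison.get('winner')} wins "
            f"(delta {comparison.get('delta')})"
        )
    return "\n".join(lines)


# Illustrative response fragment; the numbers are invented, not real results.
sample = {
    "status": "completed",
    "results": {
        "target_results": [
            {"target_label": "A", "metrics": {"accuracy": 0.5}},
        ],
        "comparisons": [
            {"target_a": "A", "target_b": "B", "metric": "accuracy",
             "delta": 0.1, "winner": "A"},
        ],
    },
}
print(summarize(sample))
```

Using `.get()` throughout keeps the helper tolerant of fields such as `results` or `error_message` being absent while the evaluation is still `pending` or `running`.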