Rather than judging the quality of AI tooling solely by feel, this is a record of quantifying how much a prompt actually improved, using a dataset and automatic scoring.

prompt‑smith is a tool that checks your prompts and suggests points for improvement.
It provides a workflow for organizing requirements, checking for missing elements, and building an improvement plan.
I had been using prompt‑smith regularly in practice, but at some point a question occurred to me: “Does it really work?”
It felt like things were improving, but I needed numbers to explain it to the team and to decide the next direction for improvement.
So this time, I decided to check by measurement rather than by feeling.
We chose a task format that appears frequently in practice: the input is a query plus context, and the output is the JSON needed for routing.
```json
{
  "category": "...",
  "priority": "...",
  "needs_human": true,
  "language": "ko|en",
  "action": "...",
  "confidence": 0.0,
  "injection_detected": false
}
```
The domain is fintech, but this is just an example. The dataset has 50 items (ko:en = 7:3), including 3 prompt-injection cases, and contains no PII.
Note that since this is a synthetic dataset rather than actual user logs, its distribution may differ from real traffic.
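For orientation, each record pairs an input (query + context) with an expected routing JSON. The sketch below is hypothetical; the actual field names and values in ps_fintech_router_dataset_v1.json may differ.

```python
# Hypothetical dataset record; the real field names in
# ps_fintech_router_dataset_v1.json may differ.
example_record = {
    "id": "en_0001",
    "input": {
        "query": "I think my card was charged twice for the same payment.",
        "context": "Logged-in user, two payment events within one minute",
    },
    "expected": {
        "category": "billing_dispute",
        "priority": "high",
        "needs_human": True,
        "language": "en",
        "action": "route_to_billing",
        "confidence": 0.9,
        "injection_detected": False,
    },
}
```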
The goal is simple: to check whether a prompt improved with prompt‑smith actually performs better.
So we compared P0 (the baseline prompt) and P1 (the improved prompt) on the same dataset with the same model.
The evaluation metric was simplified to PASS/FAIL, and the reasons for failure were collected separately in the automatic scoring log.
In other words, the judgment was based not on “it looks good” but on “how many were correct.”
Model outputs were collected as CSV, and PASS/FAIL was computed with an automated scoring script.
This experiment used gpt‑4.1‑mini, chosen for its balance of cost and performance.
```bash
# Run the dataset through the model with prompt P0
python automation/scripts/ps_run_openai.py \
  --dataset ps_fintech_router_dataset_v1.json \
  --model gpt-4.1-mini \
  --prompt-file ps_fintech_router_prompt_p0.md \
  --out automation/outputs/ps_fintech_runs_filled_41mini_p0.csv

# Grade the model outputs against the expected answers
python automation/scripts/taskpack_cli.py ps-grade \
  --dataset ps_fintech_router_dataset_v1.json \
  --runs automation/outputs/ps_fintech_runs_filled_41mini_p0.csv \
  --out automation/outputs/ps_fintech_runs_graded_41mini_p0.csv
```
Automatic scoring compares the expected answer (JSON) with the model output (JSON); missing keys, format errors, and value mismatches are all treated as FAIL.
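Conceptually, the grading step boils down to a comparison like the sketch below. This is not the actual taskpack_cli.py implementation; the function name and the decision to skip `confidence` are assumptions for illustration.

```python
import json

def grade_record(expected: dict, raw_output: str) -> tuple[str, str]:
    """Return ("PASS", "") or ("FAIL", reason) for a single record."""
    # Format error: the output must be valid JSON.
    try:
        output = json.loads(raw_output)
    except json.JSONDecodeError:
        return "FAIL", "format_error: output is not valid JSON"

    # Missing keys: every expected key must be present in the output.
    missing = [k for k in expected if k not in output]
    if missing:
        return "FAIL", f"missing_keys: {missing}"

    # Value mismatch: compare field by field; confidence is skipped here
    # because an exact match on a float score is rarely meaningful.
    mismatched = [
        k for k in expected
        if k != "confidence" and output[k] != expected[k]
    ]
    if mismatched:
        return "FAIL", f"value_mismatch: {mismatched}"

    return "PASS", ""
```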
First, we created a baseline by running the entire dataset using the P0 prompt.
Looking at the results, we summarized the failure patterns and fed those criteria into the prompt‑smith improvement plan to create the P1 prompt.
We ran it again under the same conditions and compared the two results.
Although the design is simple, we focused on maintaining the principle of “comparing under the same conditions.”
P0 was written shallowly, at the level of “basic instructions + schema guidance.” Based on prompt‑smith, we then created P1, which spells out the decision rules and the routing criteria.
| Prompt | PASS | FAIL | Improvement |
|---|---|---|---|
| P0 | 12 | 38 | — |
| P1 | 47 | 3 | +35 |
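For reference, counts like the ones above can be recomputed from the graded CSVs with a short aggregation. The column name `result` and the P1 output path are assumptions; adjust them to whatever ps-grade actually writes.

```python
import csv
from collections import Counter

def count_results(path: str) -> Counter:
    # Assumes the graded CSV has a "result" column holding "PASS"/"FAIL".
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["result"] for row in csv.DictReader(f))

for label, path in [
    ("P0", "automation/outputs/ps_fintech_runs_graded_41mini_p0.csv"),
    ("P1", "automation/outputs/ps_fintech_runs_graded_41mini_p1.csv"),  # assumed name
]:
    counts = count_results(path)
    print(f"{label}: PASS={counts.get('PASS', 0)}, FAIL={counts.get('FAIL', 0)}")
```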
The large gain came not from simply polishing sentences but from explicitly including decision rules in the prompt.
Many of the P0 failures looked alike: under-flagging needs_human, under- or over-rating priority, and confusion between a few categories.
In P1 we stated these rules up front, and the results improved dramatically. The change is best read not as “the model got smarter” but as “the judgment criteria were finally made explicit.”
The table below shows the most common failure reasons from automatic scoring of P0.
| Cause | Cases |
|---|---|
| needs_human under-flagged | 13 |
| priority high → medium | 8 |
| priority urgent → high | 5 |
Because P1 adds more detailed rules, the prompt got significantly longer: in this experiment, input tokens grew by roughly 3.6 times and total tokens by roughly 3.2 times. Performance improved, but the drop in token efficiency remains the next thing to fix.
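To keep an eye on this cost, prompt lengths can be approximated locally, for example with the tiktoken library. The o200k_base encoding and the P1 filename are assumptions; the usage reported by the API is the authoritative number.

```python
import tiktoken

# o200k_base approximates the tokenizer of recent OpenAI models;
# billed usage from the API is the authoritative count.
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

p0 = count_tokens("ps_fintech_router_prompt_p0.md")
p1 = count_tokens("ps_fintech_router_prompt_p1.md")  # assumed filename
print(f"P0: {p0} tokens, P1: {p1} tokens ({p1 / p0:.1f}x)")
```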
At least in this experiment, it worked. The improvement was significant, the results were reproducible under the same conditions, and the failure pattern was noticeably reduced.
However, this conclusion is tied to a synthetic dataset, a specific domain, and a specific model. Applying it to real production logs will require holdout/replay testing.
Next, in P2 we plan to further strengthen the rules for the remaining failure cases (customer complaints / security priorities). We also plan to compare cost and performance against gpt‑4o‑mini and to re-verify on a more realistic, larger log-based dataset.
Switching from improving prompts purely by feel to improving them by measurement made the direction much clearer, because the numbers show exactly where the next improvement needs to happen.
I hope this article helps those who want to quantify the progress of improving their prompts.