# Script Adherence best practices

The accuracy of Script Adherence operator results depends heavily on how clearly you write your criteria descriptions. Vague criteria are the leading cause of misclassifications and inconsistent results, including labels that flip between `Succeeded` and `Failed` across repeated evaluations of the same transcript.

> [!NOTE]
>
> Improving criteria clarity reduces the chance of misclassifications. However, due to the non-deterministic nature of LLMs, consistent results cannot always be guaranteed.

## Writing clear criteria

**Core principle:** Each criterion should have an unambiguous yes/no answer based solely on the transcript. If two people could reasonably disagree, the model will too.

* **Be specific about what counts**: Spell out exactly what the agent must say or do. Don't leave it to interpretation.

  | Label               | Weak                                    | Strong                                                                                                                   |
  | ------------------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
  | `reservation_recap` | The agent should recap the reservation. | The agent should confirm the specific room type, check-in date, check-out date, nightly rate, and the guest's full name. |
* **Quantify when possible**: Replace words like "some", "relevant", or "appropriate" with specific quantities or items.

  | Label              | Weak                                                   | Strong                                                                                                                                                           |
  | ------------------ | ------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `loyalty_benefits` | The agent should mention the loyalty program benefits. | The agent should mention at least three loyalty program benefits, each with the tier name and specific perk (e.g., late checkout, room upgrade, free breakfast). |
* **One observable behavior per criterion**: Split combined requirements into separate criteria. When a single criterion bundles several requirements and the agent meets only some of them, the model has to make an ambiguous judgment call.

  | Label                       | Weak                                                           | Strong                                                                                                                                                                                                                  |
  | --------------------------- | -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `greeting_and_verification` | The agent should greet the guest and verify their reservation. | `greeting`: The agent should greet the caller and identify themselves by first name and hotel name. `reservation_verification`: The agent should ask for the guest's confirmation number, full name, and check-in date. |
* **Specify the agent's obligation, not the outcome**: Criteria should describe what the agent must do or say, not what should happen overall.

  | Label                | Weak                                        | Strong                                                                  |
  | -------------------- | ------------------------------------------- | ----------------------------------------------------------------------- |
  | `verify_reservation` | The guest's reservation should be verified. | The agent should ask for the guest's confirmation number and full name. |
* **Avoid subjective qualifiers**: Words like "clearly", "appropriately", and "properly" invite inconsistent interpretation. Replace them with specific, observable actions.

  | Label                 | Weak                                                      | Strong                                                                                                               |
  | --------------------- | --------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
  | `cancellation_policy` | The agent should clearly explain the cancellation policy. | The agent should state that cancellations made less than 24 hours before check-in are subject to a one-night charge. |
* **Define edge cases**: If a criterion only applies under certain conditions, state when it does and doesn't apply. Otherwise, the model will guess.

  ```text
  - early_checkin: The agent should offer early check-in if the guest
    mentions arriving before the standard check-in time. If the guest does
    not mention an early arrival, this criterion does not apply and should
    be marked as Succeeded.
  ```
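Some of the wording guidelines above can be checked mechanically before you run the operator. The following is a minimal sketch and not part of the Script Adherence operator: the `find_vague_terms` helper and the label-to-description dictionary are hypothetical, and the word list contains only the examples named in this section.

```python
import re

# Words this article calls out as vague or subjective. Extend as needed.
VAGUE_TERMS = ["some", "relevant", "appropriate", "appropriately", "clearly", "properly"]

# Match whole words only, case-insensitively, so "some" does not match "someone".
_PATTERN = re.compile(r"\b(" + "|".join(VAGUE_TERMS) + r")\b", re.IGNORECASE)


def find_vague_terms(criteria: dict[str, str]) -> dict[str, list[str]]:
    """Return, per criterion label, the vague terms found in its description."""
    flagged = {}
    for label, description in criteria.items():
        hits = [match.group(1).lower() for match in _PATTERN.finditer(description)]
        if hits:
            flagged[label] = hits
    return flagged


# Hypothetical criteria, taken from the tables in this section.
criteria = {
    "cancellation_policy": "The agent should clearly explain the cancellation policy.",
    "verify_reservation": "The agent should ask for the guest's confirmation number and full name.",
}
print(find_vague_terms(criteria))
```

A check like this only catches known words. It can't tell whether a criterion bundles several behaviors or leaves an edge case undefined, so it complements a manual review rather than replacing it.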

## Validating your criteria

After writing your script:

1. **Read each criterion and ask**: "Could I unambiguously determine pass/fail from the transcript text alone?" If not, tighten the description. Don't rely on domain-specific knowledge that the LLM may not have; state it explicitly in the description instead.
2. **Identify partial-match scenarios**: For each criterion, think of a case where the agent half-addresses it. Does your wording make the expected result clear?
3. **Run the evaluation multiple times**: If any labels flip between `Succeeded` and `Failed` across runs of the same transcript, the criterion is probably ambiguous and needs to be more specific.
4. **Compare against a golden output**: Manually label a transcript yourself, then compare the operator's output against your labels. Focus on the mismatches to find descriptions that need improvement.
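
The last two checks lend themselves to a small script. The sketch below assumes a hypothetical `evaluate` callable that takes a transcript and returns a dictionary mapping each criterion label to `Succeeded` or `Failed`. How you actually invoke the operator and parse its response isn't shown here, so `fake_evaluate` is a stand-in used only for demonstration.

```python
import itertools
from collections import defaultdict
from typing import Callable, Optional

Result = dict[str, str]  # criterion label -> "Succeeded" or "Failed"


def find_flipping_labels(
    evaluate: Callable[[str], Result], transcript: str, runs: int = 5
) -> dict[str, set[str]]:
    """Run the evaluation several times and return labels whose outcome varied."""
    outcomes = defaultdict(set)
    for _ in range(runs):
        for label, outcome in evaluate(transcript).items():
            outcomes[label].add(outcome)
    return {label: seen for label, seen in outcomes.items() if len(seen) > 1}


def find_golden_mismatches(
    golden: Result, actual: Result
) -> dict[str, tuple[str, Optional[str]]]:
    """Return labels where the operator's outcome differs from the manual label."""
    return {
        label: (expected, actual.get(label))
        for label, expected in golden.items()
        if actual.get(label) != expected
    }


# Stand-in for a real operator call (hypothetical): one label alternates.
_fake_outcomes = itertools.cycle(["Succeeded", "Failed"])


def fake_evaluate(transcript: str) -> Result:
    return {"greeting": "Succeeded", "loyalty_benefits": next(_fake_outcomes)}


print(find_flipping_labels(fake_evaluate, "...transcript text...", runs=4))
print(find_golden_mismatches(
    {"greeting": "Succeeded", "loyalty_benefits": "Succeeded"},
    {"greeting": "Succeeded", "loyalty_benefits": "Failed"},
))
```

Labels reported by either function are the ones whose descriptions to revisit first. Because LLM output is non-deterministic, an occasional flip can still occur even with a well-written criterion, so treat repeated flips as the stronger signal.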