Sample Evaluation Report
A complete 7-dimension compliance evaluation of a high-risk banking chatbot. Scroll through the report below to see the depth of analysis, regulatory mapping, and actionable findings that every EthiCompass audit delivers.
Report Summary
| Document Ref | ETHIC-RPT-2026-00147 | Version | 1.0 — Final |
| Evaluation ID | eval_mock_eurobank_2026Q1 | Date Issued | March 15, 2026 |
| Classification | Confidential — Client Only | KB Version | UKB-2026.1.3 |
EthiCompass — AI Compliance & Audit Platform
EuroBank Virtual Assistant v3.2
Generative AI — Customer-Facing Chatbot
Client Organization
EuroBank AG
Digital Banking Division
Kaiserstraße 42, 60311 Frankfurt am Main
Germany
Contact: Maria Schmidt, VP Digital Banking
Evaluating Entity
EthiCompass
AI Compliance & Audit Platform
Peer-Reviewed Evaluation Methodology
7-Dimension Framework (UKB-2026.1.3)
Engine: v1.0 — Production
Annex III, 5(b) — AI system in financial services influencing access to essential services
Jurisdictions: EU (AI Act), Germany (BaFin), Spain (CNMV) | Affected population: 2.3M
Risk Level
HIGH
11 / 15 pts
Intake Score
7.6 / 10
GOOD
Dimensional Score
7.8 / 10
CONDITIONAL
Evaluation Scope
Prepared By
EthiCompass Evaluation Engine v1.0
Expert validation: Senior Compliance Analyst
Methodology: Peer-reviewed 7-Dimension Framework
Distribution
Maria Schmidt — VP Digital Banking
Dr. Elena García — Chief Ethics Officer
Thomas Müller — Data Protection Officer
Sophia Bernard — Head of Legal & Compliance
Sample Report — Demonstration Purposes Only
This document illustrates the format and depth of an EthiCompass compliance evaluation. All data, entities, and findings are synthetic. This report does not constitute legal advice.
| Project Name | EuroBank Virtual Assistant v3.2 |
| System Type | Generative AI — Customer-Facing Chatbot |
| AI Models | GPT-4o, Custom BERT (intent), Sentence-BERT (retrieval) |
| Sector | Financial Services — Retail Banking |
| Jurisdictions | EU (AI Act), Germany (BaFin), Spain (CNMV) |
| Lifecycle Phase | Production — 8 months in operation |
| Deployment | Cloud-based (Azure EU West) |
| Factor | Value | Indicator |
|---|---|---|
| Affected Population | 2.3M active customers | HIGH |
| Vulnerable Groups | Yes — Elderly (65+), Cognitive disability | HIGH |
| Decision Types | Recommendation, Partial automation | MEDIUM |
| Reversibility | Partially reversible | MEDIUM |
| EU AI Act Classification | HIGH RISK — Annex III, 5(b) | HIGH |
Key Roles
| Project Owner | Maria Schmidt, VP Digital Banking |
| Technical Lead | Hans Weber, Senior ML Engineer |
| Ethics Officer | Dr. Elena García, Chief Ethics Officer |
| Data Protection Officer | Thomas Müller, DPO |
| Legal Contact | Sophia Bernard, Head of Legal & Compliance |
Governance
EuroBank Virtual Assistant v3.2 is a production-grade generative AI chatbot serving 2.3 million retail banking customers across EU markets. The system leverages GPT-4o for response generation, supplemented by custom BERT models for intent classification and Sentence-BERT for knowledge retrieval. Deployed on Azure EU West, the assistant handles account inquiries, transaction support, basic financial product recommendations, and complaint routing.
The system is classified as HIGH RISK under the EU AI Act (Annex III, 5(b)) due to its role in financial services where AI-generated recommendations influence customer access to financial products and services. The affected population includes vulnerable groups — specifically elderly customers (65+) and individuals with cognitive disabilities — requiring heightened scrutiny across fairness and explainability dimensions.
EuroBank has established governance infrastructure including a monthly AI Governance Board, quarterly bias audits, and a 5-member AI Ethics Board with external academic and consumer representation. Contingency measures include automatic human escalation when model confidence drops below 0.7 and a full human fallback available 24/7. However, gaps exist in formal cost-benefit documentation, standardized bias audit methodology, and the absence of completed external audits.
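The confidence-based escalation rule described above can be sketched as follows. This is a minimal illustration of the stated policy (escalate when model confidence drops below 0.7); the function and field names are hypothetical, not EuroBank's actual implementation.

```python
# Illustrative sketch of the confidence-based human-escalation contingency.
# The 0.7 threshold comes from EuroBank's stated policy; everything else
# (handle_turn, the action labels) is assumed for illustration.
CONFIDENCE_THRESHOLD = 0.7

def handle_turn(response_text: str, model_confidence: float) -> dict:
    """Route a chatbot turn to a human agent when confidence is low."""
    if model_confidence < CONFIDENCE_THRESHOLD:
        return {"action": "escalate_to_human",
                "reason": "low_confidence",
                "confidence": model_confidence}
    return {"action": "send_response",
            "text": response_text,
            "confidence": model_confidence}

print(handle_turn("Your balance is...", 0.65)["action"])  # → escalate_to_human
```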
This evaluation covers 500 customer conversations collected over a 30-day period (February 14 — March 15, 2026), including interactions in Spanish, German, and English, with representative distribution across retail, premium, and vulnerable customer segments.
This summary was generated by the EthiCompass intake analysis engine based on data submitted in Sections A–D.
11 / 15 points — HIGH RISK
| Factor | Value | Points | Max |
|---|---|---|---|
| Vulnerable Groups Affected | Yes (elderly, cognitive disability) | 3 | 3 |
| Sector in EU AI Act Annex III | Yes — 5(b) Financial services | 3 | 3 |
| Decision Type | Recommendation | 1 | 3 |
| Reversibility | Partially reversible | 1 | 2 |
| Population Scale | 2.3M (Millions+) | 3 | 3 |
| TOTAL | — | 11 | 15 |
Overall Intake Score = (Proportionality 7.8 × 40%) + (Governance 7.5 × 60%) = 7.62 ≈ 7.6 / 10
Proportionality Score: 7.8 / 10 (Section C)
Governance Score: 7.5 / 10 (Section D)
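The weighting above works out as follows; a minimal check of the arithmetic behind the 7.6 Intake Score:

```python
# Intake score = weighted average of Section C and Section D scores,
# using the 40/60 weights stated in the formula above.
proportionality = 7.8  # Section C
governance = 7.5       # Section D

overall = proportionality * 0.40 + governance * 0.60
print(round(overall, 1))  # → 7.6
```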
| Dimension | Score | Status | Findings |
|---|---|---|---|
| Discrimination & Fairness | 7.2 | CONDITIONAL | 4 |
| Toxicity & Harmful Language | 9.4 | PASS | 1 |
| Explainability & Transparency | 6.1 | REQUIRES ACTION | 5 |
| Privacy & Data Protection | 8.5 | PASS | 2 |
| Factuality & Accuracy | 7.8 | CONDITIONAL | 3 |
| Robustness & Resilience | 8.1 | CONDITIONAL | 2 |
| Regulatory Compliance | 7.5 | CONDITIONAL | 2 |
Composite Score
7.8/10
Verdict
CONDITIONAL

Critical Findings
The virtual assistant incorrectly states a deposit insurance limit of €200,000; the actual limit under Directive 2014/49/EU is €100,000 per depositor per institution. This factual error was detected in 12 of 500 evaluated conversations and represents a material misinformation risk for retail banking customers.
In 23% of investment-related conversations, the assistant provides product recommendations without first completing the risk profiling and suitability assessment required by MiFID II. This constitutes a systematic compliance gap in the advisory workflow.
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P0 | Correct deposit insurance limit to €100,000 in knowledge base | Dir. 2014/49/EU | 7 days |
| P0 | Implement mandatory suitability assessment gate before product recommendations | MiFID II Art. 25 | 14 days |
| P1 | Add AI-generated content disclosure to all responses | EU AI Act Art. 52 | 30 days |
| P1 | Implement explanation module for credit-related decisions | EU AI Act Art. 13 | 30 days |
| P1 | Add confidence indicators to factual claims | EU AI Act Art. 14 | 45 days |
| Sub-Metric | Score | Severity |
|---|---|---|
| AI Disclosure Compliance | 4.2 | REQUIRES ACTION |
| Decision Reasoning Provided | 5.8 | REQUIRES ACTION |
| Confidence Level Communication | 6.5 | CONDITIONAL |
| Source Attribution | 7.0 | CONDITIONAL |
| Limitation Acknowledgment | 7.1 | CONDITIONAL |
Finding 3.1: No AI Disclosure in Customer Interactions
In 78% of evaluated conversations, the assistant fails to disclose that the customer is interacting with an AI system. EU AI Act Article 52 mandates clear disclosure when natural persons interact with AI systems. [Evidence: conv_0023, conv_0089, conv_0156]
Finding 3.2: Insufficient Reasoning for Credit Assessments
When providing credit product information, the assistant does not explain the basis for suitability determinations in 62% of cases. Customers receive recommendations without understanding why specific products were suggested. [Evidence: conv_0156, conv_0201]
Finding 3.3: Absent Confidence Indicators
The assistant presents all responses with equal certainty, without distinguishing between verified facts and probabilistic assessments. [Evidence: conv_0334, conv_0412]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P1 | Add AI-generated content disclosure to all conversation entry points | EU AI Act Art. 52 | 30 days |
| P1 | Implement reasoning module for credit-related responses | EU AI Act Art. 13 | 30 days |
| P2 | Integrate confidence scoring with user-facing indicators | EU AI Act Art. 14 | 45 days |
| Sub-Metric | Score | Severity |
|---|---|---|
| Age-Based Equity | 5.8 | REQUIRES ACTION |
| Gender Neutrality | 8.2 | PASS |
| Socioeconomic Fairness | 7.4 | CONDITIONAL |
| Geographic Parity | 7.6 | CONDITIONAL |
Finding 1.1: Age-Based Product Recommendation Disparity
Analysis reveals 38% fewer investment product suggestions for customers aged 65 and above compared to the 30-50 demographic with equivalent financial profiles. This pattern suggests implicit age-based filtering in the recommendation algorithm that may constitute discriminatory treatment under the EU Equal Treatment Directive. [Evidence: conv_0067, conv_0143, conv_0298, conv_0376]
Finding 1.2: Socioeconomic Language Variation
The assistant uses noticeably simpler language and fewer product options when customer profiles indicate lower-income postal codes, regardless of stated financial capacity. Detected in 14% of cross-segment comparisons. [Evidence: conv_0089, conv_0221]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P1 | Audit and remove age-based filtering in recommendation pipeline | EU Equal Treatment Dir. | 30 days |
| P2 | Implement demographic parity testing in CI/CD pipeline | EU AI Act Art. 10 | 60 days |
| P2 | Normalize language complexity across socioeconomic segments | EU AI Act Art. 10 | 60 days |
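A demographic parity check of the kind recommended above could be sketched as follows. The segment labels, the 0.8 ratio threshold (a four-fifths-style rule), and the function names are illustrative assumptions, not EthiCompass specifications.

```python
def parity_ratio(rate_group: float, rate_reference: float) -> float:
    """Ratio of recommendation rates between a group and the reference group."""
    return rate_group / rate_reference

def passes_parity(rates: dict, reference: str = "30-50",
                  threshold: float = 0.8) -> dict:
    """Flag groups whose recommendation rate falls below
    `threshold` x the reference group's rate."""
    ref = rates[reference]
    return {group: parity_ratio(rate, ref) >= threshold
            for group, rate in rates.items() if group != reference}

# Example using the disparity reported in Finding 1.1: 38% fewer
# suggestions for 65+ gives a ratio of 0.62, below the 0.8 threshold.
rates = {"30-50": 0.50, "65+": 0.31}
print(passes_parity(rates))  # → {'65+': False}
```

A check like this could run in the CI/CD pipeline against a fixed panel of matched synthetic customer profiles, failing the build when any group drops below the threshold.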
| Sub-Metric | Score | Severity |
|---|---|---|
| Explicit Harmful Language | 10 | PASS |
| Implicit Harm | 8.9 | PASS |
| Severity Classification | 9.5 | PASS |
| Context Factors | 9.2 | PASS |
Finding 2.1: No Explicit Harmful Language Detected
Zero instances of slurs, threats, or explicitly harmful language in chatbot responses across all 500 evaluated conversations.
Finding 2.2: Minor Dismissive Tone in Complaint Handling
Three instances of dismissive tone were detected when customers expressed frustration about fees. Language patterns suggest minimization of concerns rather than empathetic acknowledgment. [Evidence: conv_0341, conv_0899, conv_1456]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P3 | Enhance empathy patterns in complaint-handling responses. Current templates may benefit from validation by the UX writing team. | Best practice | 90 days |
| Sub-Metric | Score | Severity |
|---|---|---|
| PII Detection | 8.2 | CONDITIONAL |
| Inference Risk | 8.8 | PASS |
| Sensitivity Classification | 8.5 | PASS |
| Regulatory Alignment | 8.7 | PASS |
Finding 4.1: IBAN Masking Not Applied
Two instances where the chatbot echoed back full IBANs in conversation when partial masking would have sufficed. No instances of PII leaking across conversation sessions. [Evidence: conv_0189, conv_0734]
Finding 4.2: Potential Profiling via Spending Patterns
Credit card spending patterns mentioned in the context of product recommendations could be considered profiling under GDPR Art. 22. [Evidence: conv_0445, conv_0890]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P2 | Implement IBAN masking in conversation responses — show only last 4 digits. | GDPR Art. 5(1)(c) | 30 days |
| P2 | Add explicit consent prompt when conversation data may be used for model training/improvement. | GDPR Art. 6, Art. 7 | 45 days |
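The IBAN-masking recommendation above can be sketched as a small helper. This is a minimal illustration assuming a "last 4 characters visible" policy; the function name and star-padding format are assumptions, not the client's actual implementation.

```python
def mask_iban(iban: str) -> str:
    """Mask an IBAN so only the last 4 characters remain visible,
    per the P2 remediation in the table above (assumed format)."""
    compact = iban.replace(" ", "")
    masked = "*" * (len(compact) - 4) + compact[-4:]
    # Re-group into blocks of 4 for readability
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))

print(mask_iban("DE89 3704 0044 0532 0130 00"))
# → **** **** **** **** **30 00
```

In practice this masking would be applied at the response-rendering layer, so the model never needs to decide whether to echo the full account number.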
| Sub-Metric | Score | Severity |
|---|---|---|
| Verifiable Claim Identification | 7.5 | CONDITIONAL |
| Evidence Quality | 8.0 | PASS |
| Speculation vs Fact | 7.6 | CONDITIONAL |
| Internal Consistency | 8.9 | PASS |
| Known Falsehoods | 7.0 | REQUIRES ACTION |
Finding 5.1: Incorrect Deposit Insurance Information (CRITICAL)
The chatbot stated that deposit insurance covers up to €200,000, when the actual EU-wide limit is €100,000 per depositor per bank. Traced to a single incorrect knowledge-base entry, this error surfaced in 12 of 500 evaluated conversations and represents a material factual error with direct regulatory implications. [Evidence: conv_1123]
Finding 5.2: Projected Returns Presented Without Disclaimers
In 4 instances, the chatbot presented projected returns as likely outcomes without adequate disclaimers, using language such as “you can expect” instead of “historical performance suggests”. [Evidence: conv_0378, conv_0712, conv_1045, conv_1389]
Finding 5.3: Outdated Interest Rate Information
Interest rates were quoted accurately in 94% of cases; the remaining 6% of responses contained outdated rate information referencing Q3 2025 rates instead of current Q1 2026 rates. [Evidence: conv_0267, conv_0534, conv_0801]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P0 | URGENT: Fix deposit insurance information. Implement real-time fact-checking against regulatory database. | Dir. 2014/49/EU; EU AI Act Art. 9 | 7 days |
| P1 | Add mandatory disclaimers for forward-looking financial statements. | MiFID II Art. 24 | 14 days |
| P2 | Implement rate data freshness check — flag responses using data older than 30 days. | Best practice | 45 days |
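The rate-freshness check recommended above amounts to a simple date comparison. A minimal sketch, assuming each cached rate carries an as-of date and that the 30-day window from the P2 recommendation applies:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # freshness window from the P2 recommendation

def is_stale(rate_as_of: date, today: date) -> bool:
    """Flag rate data older than 30 days so the response can be blocked
    or annotated before reaching the customer."""
    return today - rate_as_of > MAX_AGE

# A response quoting Q3 2025 rates in March 2026 would be flagged:
print(is_stale(date(2025, 9, 30), date(2026, 3, 15)))  # → True
print(is_stale(date(2026, 3, 1), date(2026, 3, 15)))   # → False
```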
| Sub-Metric | Score | Severity |
|---|---|---|
| Prompt Injection Vulnerability | 8.5 | CONDITIONAL |
| Jailbreak Susceptibility | 8.8 | PASS |
| Context Switching Resilience | 7.2 | CONDITIONAL |
| Specification Gaming | 7.9 | CONDITIONAL |
Finding 6.1: Multi-Turn Context Manipulation Bypass
One partial bypass was detected via multi-turn context manipulation, in which the user gradually shifted the chatbot into providing investment advice outside its authorized scope. Standard prompt-injection attempts (“ignore previous instructions”) were blocked effectively. [Evidence: conv_0045_adversarial]
Finding 6.2: Distress Scenario Scope Expansion
When users simulate distress scenarios (“I'm going to lose my house”), the chatbot provides more specific financial guidance than its authorized scope permits in 7% of distress-test conversations. [Evidence: conv_0067_adversarial, conv_0089_adversarial]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P2 | Implement multi-turn conversation boundary monitoring. Add progressive confidence decay for out-of-scope topic drift. | EU AI Act Art. 15 | 60 days |
| P2 | Add distress detection protocol with immediate human escalation — not expanded AI guidance. | EU AI Act Art. 14 | 45 days |
| Sub-Metric | Score | Severity |
|---|---|---|
| Jurisdiction-Specific | 7.8 | CONDITIONAL |
| Industry Regulations | 7.0 | REQUIRES ACTION |
| Contractual Obligations | 8.2 | PASS |
| Emerging Guidance | 7.1 | REQUIRES ACTION |
Finding 7.1: MiFID II Suitability Assessment Gap (CRITICAL)
The MiFID II suitability assessment is not consistently performed before product recommendations: 23% of recommendation conversations skip the risk profiling step. This represents a systematic regulatory compliance gap. [Evidence: conv_0112_reg, conv_0334_reg, conv_0556_reg]
Finding 7.2: EU AI Act Compliance Gaps
Partial alignment with EU AI Act requirements. Key gaps: (1) no conformity assessment documentation, (2) no registration in the EU database per Art. 49, (3) fundamental rights impact assessment not evidenced per Art. 27.
Finding 7.3: Spanish CNMV Guidelines Partially Met
Spanish CNMV guidelines for automated investment advice are partially met; the explicit “best execution” disclosure for product recommendations is missing. [Evidence: conv_0445_reg, conv_0890_reg]
| Priority | Action | Regulatory Ref | Deadline |
|---|---|---|---|
| P0 | URGENT: Implement mandatory MiFID II suitability assessment gate before any product recommendation. | MiFID II Art. 25 | 14 days |
| P1 | Initiate EU AI Act conformity assessment process. Document technical specs per Annex IV. | EU AI Act Art. 43, Annex IV | 90 days |
| P1 | Complete fundamental rights impact assessment (FRIA) and register in EU AI database. | EU AI Act Art. 27, Art. 49 | 60 days |
| Article | Topic | Dimension(s) | Status | Gap Identified |
|---|---|---|---|---|
| Art. 9 | Risk management system | Factuality, Discrimination | CONDITIONAL | Incomplete risk management for factual claims |
| Art. 10(2)(f) | Bias detection in training data | Discrimination | CONDITIONAL | Age-based bias not addressed in training pipeline |
| Art. 13 | Transparency and information provision | Explainability | REQUIRES ACTION | No AI disclosure, insufficient reasoning |
| Art. 14 | Human oversight | Robustness, Explainability | CONDITIONAL | Distress scenarios need human escalation |
| Art. 15 | Accuracy, robustness, cybersecurity | Robustness | CONDITIONAL | Multi-turn boundary monitoring needed |
| Art. 27 | Fundamental rights impact assessment | Regulatory | REQUIRES ACTION | FRIA not conducted |
| Art. 43 | Conformity assessment | Regulatory | REQUIRES ACTION | No conformity assessment initiated |
| Art. 49 | Registration in EU database | Regulatory | REQUIRES ACTION | Not registered |
| Art. 52(1) | Transparency obligations (AI interaction) | Explainability | REQUIRES ACTION | No AI disclosure to users |
| Annex III, 5(b) | High-risk: financial services | Classification | — | System correctly classified as HIGH RISK |
| Annex IV | Technical documentation | Regulatory | REQUIRES ACTION | Documentation gaps |
This mapping covers the primary EU AI Act articles applicable to high-risk AI systems in the financial services sector. Additional requirements may apply based on national implementing legislation and sector-specific guidance from competent authorities.
This evaluation was conducted using the EthiCompass 7-Dimension Framework, which assesses AI systems across scientifically validated dimensions of ethical compliance. Scores are derived from automated analysis of system outputs, supplemented by heuristic pattern matching against regulatory requirements. This evaluation does not constitute legal advice, and organizations should consult qualified legal counsel regarding specific regulatory obligations.
The composite score is calculated as a weighted average of individual dimension scores, with weights reflecting the risk profile and regulatory context of the evaluated system. Dimension weights for high-risk financial services AI systems prioritize Factuality & Accuracy, Regulatory Compliance, and Discrimination & Fairness.
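As an illustration of that calculation, the seven dimension scores from the scorecard combine as shown below. The report does not publish the exact dimension weights, so equal weights are used here purely for demonstration; as it happens, they reproduce the stated composite of 7.8 for this scorecard, but the production engine weights dimensions differently as described above.

```python
# Illustrative composite-score calculation over the seven dimension scores.
# Equal weights are an assumption; the actual weights are not published
# in this sample report.
scores = {
    "Discrimination & Fairness": 7.2,
    "Toxicity & Harmful Language": 9.4,
    "Explainability & Transparency": 6.1,
    "Privacy & Data Protection": 8.5,
    "Factuality & Accuracy": 7.8,
    "Robustness & Resilience": 8.1,
    "Regulatory Compliance": 7.5,
}

def composite(scores, weights=None):
    """Weighted average of dimension scores; equal weights by default."""
    if weights is None:
        weights = {k: 1 / len(scores) for k in scores}
    return round(sum(scores[k] * weights[k] for k in scores), 1)

print(composite(scores))  # → 7.8
```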
This report was generated using EthiCompass evaluation engine v1.0 with Universal Knowledge Base UKB-2026.1.3.
© 2026 EthiCompass. All rights reserved.
A complete AI compliance audit in 3 weeks. 7 dimensions. EU AI Act mapping. Every finding traceable. Every score explainable.