FinTech LLM Investment Evaluation Analysis

Nov 2025 - Dec 2025
Python / LLM Evaluation / NLP (FinBERT) / Portfolio Risk Metrics / Backtesting / Bias Analysis
LLM Portfolio Evaluation Framework

About Project

  • Type: Personal Applied AI + FinTech Research Project
  • Models Compared: ChatGPT vs. Gemini vs. FinGPT (a finance-specialized LLM)
  • Investor Personas: Conservative / Balanced / Aggressive (standardized prompts to remove prompt bias)
  • Deliverables: Evaluation notebook + results tables + summary report/presentation with findings and recommendations

Objective

Evaluate whether LLMs can generate reliable investment-style outputs by testing factual accuracy, sentiment bias, portfolio risk alignment, diversification quality, and backtested performance across different investor profiles.

Tools & Technologies

Python, Pandas, NumPy, FinBERT (Sentiment), Portfolio Risk Metrics, Z-Score Normalization, HHI Concentration Index, Backtesting vs SPY, Sharpe Ratio / Max Drawdown

Key Work & Findings

Designed an end-to-end evaluation pipeline that converts unstructured LLM responses into structured datasets (tickers, asset types, allocations, rationale) for apples-to-apples comparison across models.
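A minimal sketch of the extraction step, assuming a hypothetical response format like `TICKER (AssetType): NN% - rationale` (the actual prompt/response format used in the project may differ):

```python
import re

# Hypothetical line format assumed for illustration, e.g.:
#   "AAPL (Stock): 25% - large-cap tech exposure"
LINE_RE = re.compile(
    r"(?P<ticker>[A-Z]{1,5})\s*\((?P<asset_type>[^)]+)\):\s*"
    r"(?P<allocation>\d+(?:\.\d+)?)%\s*-\s*(?P<rationale>.+)"
)

def parse_llm_portfolio(text: str) -> list[dict]:
    """Convert a free-text LLM response into structured rows."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line.strip())
        if m:
            row = m.groupdict()
            row["allocation"] = float(row["allocation"]) / 100.0  # percent -> weight
            rows.append(row)
    return rows

sample = """Here is a balanced portfolio:
AAPL (Stock): 25% - large-cap tech exposure
BND (Bond ETF): 40% - core fixed income
VNQ (REIT ETF): 35% - real-estate diversification"""

portfolio = parse_llm_portfolio(sample)
```

Once every model's output is in this tabular shape (ticker, asset type, weight, rationale), the downstream metrics can be computed identically for each model.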

Built factual accuracy checks to validate ticker existence and asset description correctness, scoring reliability with an objective pass/fail approach.
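The pass/fail idea can be sketched as below; a small hard-coded universe stands in for a live ticker lookup (in practice a market-data API such as yfinance could populate it):

```python
# Reference universe standing in for a live market-data lookup (assumption).
KNOWN_TICKERS = {"AAPL", "MSFT", "SPY", "BND", "VNQ", "QQQ"}

def factual_accuracy_score(recommended: list[str]) -> float:
    """Objective pass/fail: fraction of recommended tickers that actually exist."""
    if not recommended:
        return 0.0
    passes = sum(1 for t in recommended if t.upper() in KNOWN_TICKERS)
    return passes / len(recommended)

# One hallucinated ticker out of three -> score of 2/3.
score = factual_accuracy_score(["AAPL", "BND", "FAKETKR"])
```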

Measured sentiment bias using FinBERT to quantify whether language skewed overly optimistic, overly cautious, or encouraged excessive risk-taking across personas.
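One way the per-sentence FinBERT labels can be aggregated into a single bias score is sketched below. FinBERT (e.g. the ProsusAI/finbert checkpoint via Hugging Face transformers) classifies each sentence as positive, negative, or neutral; the labels here are hard-coded so the sketch runs offline:

```python
def sentiment_bias(labels: list[str]) -> float:
    """Net tone in [-1, 1]: +1 = uniformly optimistic, -1 = uniformly cautious."""
    signed = {"positive": 1, "neutral": 0, "negative": -1}
    return sum(signed[label] for label in labels) / len(labels)

# Example per-sentence labels for one model's conservative-persona response:
labels = ["positive", "positive", "neutral", "negative"]
bias = sentiment_bias(labels)  # (1 + 1 + 0 - 1) / 4 = 0.25
```

A strongly positive score on a conservative-persona response would flag language that encourages more risk-taking than the persona warrants.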

Quantified portfolio risk alignment using volatility-based risk scoring and z-score normalization to compare risk levels consistently across asset classes and model outputs.
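A sketch of the normalization step, using illustrative (hypothetical) annualized volatilities:

```python
import numpy as np

def risk_z_scores(volatilities: dict[str, float]) -> dict[str, float]:
    """Z-score-normalize volatilities so risk is comparable across
    asset classes and across different models' outputs."""
    vals = np.array(list(volatilities.values()), dtype=float)
    z = (vals - vals.mean()) / vals.std(ddof=0)
    return dict(zip(volatilities, z))

# Illustrative annualized volatilities (hypothetical numbers):
vols = {"BND": 0.05, "SPY": 0.15, "TSLA": 0.55}
zscores = risk_z_scores(vols)
```

Averaging the z-scores weighted by allocation gives a single portfolio risk score that can be checked against the target persona (conservative, balanced, aggressive).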

Evaluated diversification quality using the Herfindahl–Hirschman Index (HHI) to detect concentration risk and compare asset mix quality between models.
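The HHI itself is just the sum of squared portfolio weights, which makes concentration easy to score:

```python
def hhi(weights: list[float]) -> float:
    """Herfindahl-Hirschman Index: sum of squared portfolio weights.
    Equals 1/N for an equal-weight portfolio of N assets and 1.0 for a
    single holding, so higher values signal concentration risk."""
    return sum(w * w for w in weights)

concentrated = hhi([1.0])        # single holding -> 1.0
diversified = hhi([0.25] * 4)    # equal-weight across 4 assets -> 0.25
```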

Backtested generated portfolios and benchmarked against SPY using total return, Sharpe ratio, and max drawdown to separate “high returns” from sustainable risk-adjusted performance.
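The two risk-adjusted metrics can be sketched as follows; a synthetic daily-return series stands in for the actual portfolio and SPY benchmark data:

```python
import numpy as np

def sharpe_ratio(daily_returns: np.ndarray, rf_annual: float = 0.0) -> float:
    """Annualized Sharpe ratio from daily returns (252 trading days/year)."""
    excess = daily_returns - rf_annual / 252
    return float(np.sqrt(252) * excess.mean() / excess.std(ddof=1))

def max_drawdown(daily_returns: np.ndarray) -> float:
    """Largest peak-to-trough decline of the cumulative equity curve (<= 0)."""
    equity = np.cumprod(1 + daily_returns)
    peaks = np.maximum.accumulate(equity)
    return float(((equity - peaks) / peaks).min())

# Synthetic daily returns stand in for real backtest data (assumption).
rng = np.random.default_rng(0)
portfolio_r = rng.normal(0.0005, 0.01, 252)
sr = sharpe_ratio(portfolio_r)
mdd = max_drawdown(portfolio_r)
```

Comparing these two numbers against the same metrics computed on SPY is what separates raw return from sustainable risk-adjusted performance: a portfolio can out-return SPY yet still lose on Sharpe ratio or suffer a deeper drawdown.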

Key finding: general-purpose LLMs produced more usable and reliable outputs than the finance-specialized model in this setup, reinforcing the need for validation layers + human oversight before real-world use.

External Links