DQA Profiler - Data Quality Assessment & Profiling System

A structured profiling workflow for auditing dataset quality and generating decision-ready quality reports for operations, analytics, and monitoring teams.

Overview

CIGMA Data Profiler (DQA Profiler) is a full-stack data quality assessment application that audits incoming datasets, detects quality risks early, and generates decision-ready outputs for analysts, data teams, and operations leads. Its core workflow covers:

  • File upload and automatic data nature detection
  • Column-level profiling with missingness, uniqueness, and anomaly flags
  • Data quality scoring, root cause analysis, and blast-radius summary
  • Auto-remediation plan generation for quality improvements

Core Capabilities

  • Data type inference: quantitative / qualitative profiling
  • File nature recognition: CSV, TSV, XLSX, JSON, XML, DOCX
  • Data grain analysis with row/column metadata
  • Outlier detection and numeric distribution labeling
  • Analysis-fit guidance for feasible statistical/ML approaches
  • Server-side metrics and async job-status monitoring
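
To make the column-level checks above concrete, here is a minimal standard-library sketch of profiling one column for missingness, uniqueness, quantitative/qualitative kind, and outliers. It is an illustration only, not the application's code: the function name `profile_column`, the set of missing markers, and the 1.5 × IQR outlier rule are all assumptions.

```python
import statistics

# Assumed markers for "missing"; the real profiler may use a different set.
MISSING = {None, "", "NA", "N/A", "null"}


def profile_column(values):
    """Summarise one column: missingness, uniqueness, kind, and outlier count."""
    present = [v for v in values if v not in MISSING]
    n = len(values)
    profile = {
        "missing_pct": round(100 * (n - len(present)) / n, 2) if n else 0.0,
        "unique_pct": round(100 * len(set(present)) / len(present), 2) if present else 0.0,
        "kind": "qualitative",
        "outliers": 0,
    }

    # Treat the column as quantitative only if every present value parses as a number.
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            break

    if present and len(numeric) == len(present):
        profile["kind"] = "quantitative"
        if len(numeric) >= 4:
            # Assumed rule: flag values outside the 1.5 * IQR fences.
            q1, _, q3 = statistics.quantiles(numeric, n=4)
            iqr = q3 - q1
            low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
            profile["outliers"] = sum(1 for x in numeric if x < low or x > high)
    return profile


if __name__ == "__main__":
    print(profile_column(["10", "12", "11", "13", "12", "", None, "500"]))
```

Running the sketch per column and collecting the resulting dictionaries gives a table that a quality score or report could be built from.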

Data Inputs & Transformations

The transformation layer supports practical clean-up workflows before downstream analytics. Users can run duplicate removal, missing-value treatment, text normalization, and outlier-capping operations directly through the app.

  • Upload + preview for rapid validation
  • Duplicate handling and missing-value strategy controls
  • Column-targeted transformation choices
  • Downloadable transformed dataset outputs
Main DQA interface for upload, profiling, and quality diagnostics.
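
The four clean-up operations described in this section can be sketched with the standard library on a list of row dictionaries. This is a hedged illustration rather than the app's transform code: every function name here is hypothetical, and the median fill and the 1.5 × IQR capping rule are assumed defaults.

```python
import statistics


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def fill_missing(rows, column, strategy="median"):
    """Replace None in `column` with the median or mode of the present values."""
    present = [r[column] for r in rows if r[column] is not None]
    if strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def normalize_text(rows, column):
    """Trim, collapse internal whitespace, and lowercase string values."""
    return [
        {**r, column: " ".join(r[column].split()).lower()}
        if isinstance(r[column], str) else r
        for r in rows
    ]


def cap_outliers(rows, column, k=1.5):
    """Clamp numeric values to the [Q1 - k*IQR, Q3 + k*IQR] fences (assumed rule)."""
    q1, _, q3 = statistics.quantiles([r[column] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**r, column: min(max(r[column], low), high)} for r in rows]


if __name__ == "__main__":
    rows = [
        {"city": " New  York ", "amount": 10},
        {"city": " New  York ", "amount": 10},
        {"city": "boston", "amount": None},
        {"city": "Boston", "amount": 12},
        {"city": "chicago", "amount": 11},
        {"city": "Denver", "amount": 13},
        {"city": "austin", "amount": 500},
    ]
    clean = drop_duplicates(rows)
    clean = fill_missing(clean, "amount")
    clean = normalize_text(clean, "city")
    clean = cap_outliers(clean, "amount")
    for row in clean:
        print(row)
```

Each step returns a new list instead of mutating its input, so the original upload stays available for preview and comparison with the transformed output.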

Reports & AI Insights

  • Exportable CSV and PDF quality reports
  • Missing-value summaries and outlier snapshots
  • AI insights (synchronous and asynchronous jobs) with a job-tracking endpoint
  • Chat-assistant mode for dataset-focused Q&A
  • Root-cause, blast-radius, and remediation recommendations

Architecture & API

  • Backend: FastAPI service (`backend/app.py`) for analysis, transformation, and reporting
  • Frontend: static HTML/CSS/JS interface served by the backend
  • Key endpoints: `/api/analyze`, `/api/transform`, `/api/report/csv`, `/api/report/pdf`
  • Async intelligence: `/api/ai-insights/async` + status endpoint
  • Operational metrics endpoint: `/api/metrics`
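
A client of the async route would typically submit a job and then poll the status endpoint until the job finishes. The sketch below shows only that polling loop, under stated assumptions: the status endpoint's path and response shape are not given here, so the HTTP call is injected as a `fetch_status` callable, and the `state` field with its `done` and `failed` values is hypothetical.

```python
import time


def wait_for_job(fetch_status, job_id, interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll `fetch_status(job_id)` until a terminal state or until `timeout` seconds pass.

    `fetch_status` is supplied by the caller and should return a dict. The
    "state" key and the "done"/"failed" values are assumptions for this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("state") in {"done", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Stand-in for the real HTTP call: reports "running" twice, then "done".
    responses = iter([{"state": "running"}, {"state": "running"}, {"state": "done"}])
    print(wait_for_job(lambda job_id: next(responses), "job-123", interval=0.01))
```

In real use, `fetch_status` would wrap an HTTP GET against the status endpoint and parse its JSON body. Injecting it keeps the loop testable without a running server.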

Deployment & Security

  • Production-ready deployment via Render blueprint or Docker container
  • Environment-driven configuration for file-size, CORS, and timeout controls
  • Optional API-key gate and bearer-session authentication
  • Rate limiting for analyze and AI routes
  • Safer export handling with spreadsheet formula neutralization
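
Spreadsheet formula neutralization guards against CSV injection, where a cell such as `=HYPERLINK(...)` is executed when the export is opened in a spreadsheet program. The sketch below shows one common mitigation, prefixing risky cells with a single quote. How the application itself neutralizes formulas is not stated here, so the function names and the prefix list are assumptions.

```python
import csv
import io

# Leading characters that spreadsheet programs may interpret as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize_cell(value):
    """Prefix formula-like strings with a single quote; leave other values untouched."""
    if isinstance(value, str) and value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def rows_to_safe_csv(rows):
    """Serialise rows to CSV text with every cell neutralized first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([neutralize_cell(cell) for cell in row])
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_safe_csv([["name", "note"], ["alice", "=SUM(A1:A2)"], ["bob", "ok"]]))
```

One trade-off of this approach is that legitimate strings starting with `-` or `+`, such as a negative number stored as text, are also prefixed. Values that are already numeric types pass through unchanged.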

Business Impact

  • Reduced reporting and model risk through early quality detection
  • Faster decision cycles with auto-generated quality diagnostics
  • Improved trust in analytics outputs and dashboard KPIs
  • Stronger governance posture for audit and compliance conversations

Tech Stack

  • Python, FastAPI, Pandas, NumPy
  • Scikit-learn pipelines and model utilities
  • ReportLab for PDF generation, plus CSV export utilities
  • HTML, CSS, JavaScript frontend delivered through backend static routes
  • Docker + Render deployment workflow