The Problem
Creating high-quality assessments is enormously time-consuming for educators. A single well-crafted exam, with varied question types, appropriate difficulty distribution, rubric-aligned marking criteria, and balanced topic coverage, can take 8-12 hours to produce. Grading essays and short answers is even more labour-intensive, and the feedback students receive is often minimal ("See me" or a bare percentage).
This was a portfolio project built for a continuing professional development (CPD) organisation running certificate programmes for working adults. Their assessment process was a bottleneck: exams took too long to produce and student feedback turnaround was 3-4 weeks.
What I Built
Syllabus-driven exam generation: educators upload course learning objectives, topic outlines, and example questions. GPT-4o generates a draft exam with customisable parameters: number of questions, question type mix (MCQ, short answer, essay, case study), and difficulty distribution. Each generated question is mapped to the specific learning objectives it targets, making it straightforward to verify coverage.
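A minimal sketch of what the generation call can look like, assuming the OpenAI Python SDK in JSON mode; the function name, parameter names (question_mix, difficulty_split), and prompt wording are illustrative, not the production prompt:

```python
import json

from openai import OpenAI

client = OpenAI()

def generate_exam(objectives: list[str], question_mix: dict[str, int],
                  difficulty_split: dict[str, float]) -> dict:
    """Ask GPT-4o for a draft exam; every question must name its objective."""
    prompt = (
        "Generate a draft exam as JSON with a 'questions' array. Each question "
        "needs: text, type, difficulty, and the learning objective it targets.\n"
        f"Learning objectives: {json.dumps(objectives)}\n"
        f"Question type mix (type -> count): {json.dumps(question_mix)}\n"
        f"Difficulty distribution (level -> share): {json.dumps(difficulty_split)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force parseable JSON
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# Example call with illustrative parameters:
exam = generate_exam(
    objectives=["Explain the causes of X", "Apply framework Y to a case study"],
    question_mix={"mcq": 10, "short_answer": 4, "essay": 2},
    difficulty_split={"easy": 0.3, "medium": 0.5, "hard": 0.2},
)
```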
Question bank: generated questions are stored in a bank, tagged by topic, difficulty, and learning objective. Future exams can draw from the bank alongside fresh generation, and educators can edit, rate, and reject individual questions. Rejected questions are used as negative examples for the generation prompts.
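A plausible shape for a bank entry and the negative-example prompt fragment; the field names and helper below are hypothetical, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class BankQuestion:
    text: str
    topic: str
    difficulty: str            # e.g. "easy" | "medium" | "hard"
    objective: str             # learning objective the question targets
    rating: int | None = None  # educator rating, if any
    rejected: bool = False     # rejected questions become negative examples

def negative_examples(bank: list[BankQuestion], topic: str, limit: int = 3) -> str:
    """Format rejected questions on a topic for inclusion in the generation prompt."""
    rejected = [q.text for q in bank if q.rejected and q.topic == topic][:limit]
    if not rejected:
        return ""
    return "Avoid questions like these rejected examples:\n- " + "\n- ".join(rejected)
```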
Automated grading: for short answers and essays, GPT-4o grades against the rubric defined by the educator. The model produces a score per rubric dimension, an overall grade, and a paragraph of specific feedback for each student. Feedback is grounded in the student's actual answer: "Your answer correctly identifies X but misses the key point about Y, which the rubric requires at distinction level."
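One way the rubric and grading request might be represented; the dimension names and distinction descriptors here are invented examples, not the organisation's actual rubric:

```python
import json

# Hypothetical rubric structure: each dimension carries a max score and a
# description of what a distinction-level answer looks like.
RUBRIC = {
    "dimensions": [
        {"name": "accuracy", "max_score": 10,
         "distinction": "identifies all key mechanisms with correct terminology"},
        {"name": "analysis", "max_score": 10,
         "distinction": "evaluates trade-offs rather than merely describing them"},
    ]
}

def grading_prompt(rubric: dict, question: str, answer: str) -> str:
    """Build the grading request; grounding is asked for explicitly so the
    feedback cites the student's own wording."""
    return (
        "Grade the student's answer against each rubric dimension, as JSON. "
        "Quote or reference the student's own wording when explaining what "
        "is present and what is missing at each level.\n"
        f"Rubric: {json.dumps(rubric)}\n"
        f"Question: {question}\n"
        f"Student answer: {answer}"
    )
```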
Educator review workflow: all AI-generated grades for essays are flagged as "AI-provisional" until an educator spot-checks them. The review interface shows AI grade, AI feedback, and the student's answer side-by-side. Educators can adjust grades and override feedback with one click. Grade adjustments feed back into a calibration dataset for ongoing prompt improvement.
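A sketch of the record that could be appended to the calibration dataset whenever an educator adjusts a grade; the JSONL destination and field names are assumptions:

```python
import datetime
import json

def log_calibration(path: str, question_id: str, ai_score: float,
                    educator_score: float, ai_feedback: str,
                    educator_feedback: str | None = None) -> None:
    """Append one educator adjustment to the calibration JSONL file."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "question_id": question_id,
        "ai_score": ai_score,
        "educator_score": educator_score,  # the delta is what drives prompt tuning
        "ai_feedback": ai_feedback,
        "educator_feedback": educator_feedback,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```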
Student report: after grading, each student receives a detailed feedback report covering their overall grade, per-question performance, a comparison against cohort percentiles, and identified gap areas for further study.
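The cohort percentile comparison admits a straightforward rank-based definition; this is an assumed implementation, not the project's exact formula:

```python
def cohort_percentile(score: float, cohort_scores: list[float]) -> float:
    """Percentage of the cohort scoring strictly below this student."""
    if not cohort_scores:
        return 0.0
    below = sum(1 for s in cohort_scores if s < score)
    return 100.0 * below / len(cohort_scores)
```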
Technical Highlights
The grading pipeline uses a structured output schema from GPT-4o (JSON mode) to ensure every grade response has the required fields (dimension scores, total, feedback, confidence). Low-confidence grades are automatically escalated to human review.
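A sketch of the grade schema and the escalation check, assuming Pydantic for validation of the JSON-mode output; the field names and the 0.7 confidence threshold are illustrative:

```python
from pydantic import BaseModel, Field, ValidationError

class DimensionScore(BaseModel):
    dimension: str
    score: float = Field(ge=0)

class GradeResponse(BaseModel):
    dimension_scores: list[DimensionScore]
    total: float
    feedback: str
    confidence: float = Field(ge=0.0, le=1.0)

CONFIDENCE_THRESHOLD = 0.7  # assumed value, tuned against calibration data

def parse_grade(raw_json: str) -> tuple[GradeResponse | None, bool]:
    """Return (grade, needs_human_review). Malformed output also escalates."""
    try:
        grade = GradeResponse.model_validate_json(raw_json)
    except ValidationError:
        return None, True  # schema violation -> straight to human review
    return grade, grade.confidence < CONFIDENCE_THRESHOLD
```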
Essay grading runs as a background task queue (FastAPI + Celery + Redis). A full cohort of 120 essays is graded in approximately 18 minutes, with concurrent workers rate-limited to respect the OpenAI API tier limits.
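A minimal Celery wiring sketch under assumed settings; the broker/backend URLs, the "3/s" rate limit, and the retry options are placeholders chosen to stay inside a typical API tier:

```python
from celery import Celery

app = Celery(
    "grading",
    broker="redis://localhost:6379/0",   # placeholder broker/backend URLs
    backend="redis://localhost:6379/1",
)

@app.task(rate_limit="3/s", autoretry_for=(Exception,),
          retry_backoff=True, max_retries=3)
def grade_essay(submission_id: str):
    """Load the submission, call GPT-4o with the rubric, store the
    provisional grade. Celery enforces rate_limit per worker process."""
    ...  # grading call elided; see the schema sketch above
```

A cohort is then fanned out with grade_essay.delay(submission_id) per submission, and retries with backoff absorb transient API errors.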
The platform includes a plagiarism detection pass before grading: a sliding-window text similarity check against all submissions in the same cohort, flagging submissions above a configurable similarity threshold.
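One plausible implementation of that check, using character n-gram shingles compared pairwise with Jaccard similarity; the window size and 0.35 threshold are illustrative defaults, not the configured values:

```python
def shingles(text: str, window: int = 8) -> set[str]:
    """Character n-gram shingles over whitespace-normalised, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + window] for i in range(max(1, len(text) - window + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def flag_similar(submissions: dict[str, str], threshold: float = 0.35):
    """Yield (student_a, student_b, similarity) pairs above the threshold."""
    ids = list(submissions)
    sets = {sid: shingles(submissions[sid]) for sid in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = jaccard(sets[a], sets[b])
            if sim >= threshold:
                yield a, b, sim
```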
Outcome
The platform launched with two certificate programmes (120 enrolled students each). Exam preparation time dropped from an average of 10 hours per exam to 1.8 hours (educator review and editing of AI-generated content). The CPD organisation reported an 82% reduction in total exam preparation labour.
Essay feedback turnaround went from 3-4 weeks to 3-5 business days (including the educator spot-check step). Student satisfaction with feedback quality improved significantly: students consistently noted in course evaluations that the written feedback was "specific" and "actually useful."