Parth Sharma - Portfolio

LOCAL/

10:50:45

CONTACT NOW

10:50:45

CONTACT NOW

Open AI - Red Teaming

Addressing the urgent need to identify potential harms and vulnerabilities in generative AI outputs through structured red teaming.

Client

Open AI + CMU

Year

2024

VIsion & Problem Statement

Vision:
Build safer AI systems by proactively identifying and classifying harmful or manipulative outputs using human-in-the-loop red teaming strategies.

Problem Statement:
Generative AI systems, like large language models, can be exploited to produce misinformation, bias, or unsafe content. There's a growing need for a structured red teaming process that is scalable, repeatable, and able to surface edge-case failures before public deployment.

Vision:
Build safer AI systems by proactively identifying and classifying harmful or manipulative outputs using human-in-the-loop red teaming strategies.

VIsion & Problem Statement

Vision:
Build safer AI systems by proactively identifying and classifying harmful or manipulative outputs using human-in-the-loop red teaming strategies.

Product Goal

Develop a structured red teaming framework that enables accurate classification and severity scoring of adversarial prompts—empowering safety researchers to uncover model vulnerabilities and strengthen generative AI systems against real-world misuse.

Product Goal

User Stories

Title	As a/an	I want to	So that
Classify Harmful Prompts	Red Teamer	Tag prompts into relevant harm categories	I can identify and document model vulnerabilities
Score Severity of Prompts	Safety Researcher	Assign impact/severity scores to prompts	I can prioritize the most critical issues for escalation
Track Red Teaming Effectiveness	Program Manager	Visualize prompt coverage and severity distribution	I can report on safety trends and team progress
Export Tagged Data	Analyst	Export categorized prompts and scores in a structured format	I can run further analysis or share insights with stakeholders
Submit Prompts in Bulk	Developer	Upload multiple prompts for batch classification	I can accelerate the red teaming process and reduce manual input effort

Title	As a/an	I want to	So that
Classify Harmful Prompts	Red Teamer	Tag prompts into relevant harm categories	I can identify and document model vulnerabilities
Score Severity of Prompts	Safety Researcher	Assign impact/severity scores to prompts	I can prioritize the most critical issues for escalation
Track Red Teaming Effectiveness	Program Manager	Visualize prompt coverage and severity distribution	I can report on safety trends and team progress
Export Tagged Data	Analyst	Export categorized prompts and scores in a structured format	I can run further analysis or share insights with stakeholders
Submit Prompts in Bulk	Developer	Upload multiple prompts for batch classification	I can accelerate the red teaming process and reduce manual input effort

User Stories

Title	As a/an	I want to	So that
Classify Harmful Prompts	Red Teamer	Tag prompts into relevant harm categories	I can identify and document model vulnerabilities
Score Severity of Prompts	Safety Researcher	Assign impact/severity scores to prompts	I can prioritize the most critical issues for escalation
Track Red Teaming Effectiveness	Program Manager	Visualize prompt coverage and severity distribution	I can report on safety trends and team progress
Export Tagged Data	Analyst	Export categorized prompts and scores in a structured format	I can run further analysis or share insights with stakeholders
Submit Prompts in Bulk	Developer	Upload multiple prompts for batch classification	I can accelerate the red teaming process and reduce manual input effort

Core Features

Feature	Description	Priority
Prompt Categorization	Classifies prompts into one of the predefined harm categories	P1
Severity Scoring	Allows scoring of prompts based on impact and potential real-world harm	P1
Red Teaming Dashboard	Tracks the number of prompts per category, severity levels, and overall coverage	P2
Bulk Prompt Submission	Enables batch uploading of red teaming prompts for tagging	P3
Data Export	Allows CSV/JSON export of scored and categorized data for analysis	P2

Feature	Description	Priority
Prompt Categorization	Classifies prompts into one of the predefined harm categories	P1
Severity Scoring	Allows scoring of prompts based on impact and potential real-world harm	P1
Red Teaming Dashboard	Tracks the number of prompts per category, severity levels, and overall coverage	P2
Bulk Prompt Submission	Enables batch uploading of red teaming prompts for tagging	P3
Data Export	Allows CSV/JSON export of scored and categorized data for analysis	P2

Core Features

Feature	Description	Priority
Prompt Categorization	Classifies prompts into one of the predefined harm categories	P1
Severity Scoring	Allows scoring of prompts based on impact and potential real-world harm	P1
Red Teaming Dashboard	Tracks the number of prompts per category, severity levels, and overall coverage	P2
Bulk Prompt Submission	Enables batch uploading of red teaming prompts for tagging	P3
Data Export	Allows CSV/JSON export of scored and categorized data for analysis	P2

Success Metrics

Metric	Description
Category Coverage %	Proportion of prompt categories covered by the team
Severity Score Distribution	Spread of low-to-high risk prompts across all categories
Inter-rater Agreement	Consistency of scores and tags across different team members
Time to Classify	Average time from prompt submission to final tagging
Prompt Volume per Category	Total number of prompts tagged within each harm type

Metric	Description
Category Coverage %	Proportion of prompt categories covered by the team
Severity Score Distribution	Spread of low-to-high risk prompts across all categories
Inter-rater Agreement	Consistency of scores and tags across different team members
Time to Classify	Average time from prompt submission to final tagging
Prompt Volume per Category	Total number of prompts tagged within each harm type

Success Metrics

Metric	Description
Category Coverage %	Proportion of prompt categories covered by the team
Severity Score Distribution	Spread of low-to-high risk prompts across all categories
Inter-rater Agreement	Consistency of scores and tags across different team members
Time to Classify	Average time from prompt submission to final tagging
Prompt Volume per Category	Total number of prompts tagged within each harm type

Technical Stack

Prompt Sources: OpenAI Red Teaming Dataset, Custom Adversarial Inputs

Tools: Microsoft Excel, Google Sheets, Notion

Collaboration: Zoom, Slack, GitHub

Outputs: Categorized JSON files, Severity matrices, Impact dashboards, Tagging logs

Prompt Sources: OpenAI Red Teaming Dataset, Custom Adversarial Inputs

Tools: Microsoft Excel, Google Sheets, Notion

Collaboration: Zoom, Slack, GitHub

Outputs: Categorized JSON files, Severity matrices, Impact dashboards, Tagging logs

Technical Stack

Prompt Sources: OpenAI Red Teaming Dataset, Custom Adversarial Inputs

Tools: Microsoft Excel, Google Sheets, Notion

Collaboration: Zoom, Slack, GitHub

Outputs: Categorized JSON files, Severity matrices, Impact dashboards, Tagging logs

Key Results

1000+ prompts categorized across 8 distinct harm categories
Achieved >85% inter-rater consistency through collaborative scoring sessions
Developed a severity-weighted red teaming matrix
Created a reporting dashboard summarizing prompt impact and spread
Delivered final dataset and recommendations to OpenAI’s Safety Review team

Constraints, Risks, and Mitigations

Issue / Constraint	Type	Mitigation / Notes
Subjectivity in category assignment	Risk	Use consensus-driven tagging and multiple reviewers
Evolving definitions of harm	Risk	Periodically revisit and revise harm taxonomy
Manual scoring effort is time-consuming	Constraint	Explore auto-tagging with GPT-4 for initial scoring
No UI/UX for tagging built (manual JSON editing)	Constraint	Recommend building a lightweight UI for future efforts
Dataset lacks multilingual prompts	Risk	Include multilingual cases in future red teaming rounds

Business Impact

Reinforces OpenAI’s commitment to safety and responsible AI use
Serves as a prototype framework for future internal or external red teaming teams
Enables better auditability and transparency around harmful output mitigation
Provides actionable insights to improve LLM alignment and policy filters
Strengthens trust with enterprise and government partners in high-risk domains

Future Roadmap

Short-Term

Automate prompt scoring using zero-shot GPT-4 API
Integrate a lightweight UI for tagging and exporting
Visualize scoring distribution using a simple web dashboard

Mid-Term

Expand the prompt dataset to cover multilingual and multimodal inputs
Incorporate tagging consistency metrics into live feedback loops
Collaborate with external safety teams for broader dataset creation

Long-Term

Deploy red teaming as a service internally at OpenAI
Open-source toolkit for researchers and universities
Align the tagging framework with global safety compliance standards

Key Results

1000+ prompts categorized across 8 distinct harm categories
Achieved >85% inter-rater consistency through collaborative scoring sessions
Developed a severity-weighted red teaming matrix
Created a reporting dashboard summarizing prompt impact and spread
Delivered final dataset and recommendations to OpenAI’s Safety Review team

Constraints, Risks, and Mitigations

Issue / Constraint	Type	Mitigation / Notes
Subjectivity in category assignment	Risk	Use consensus-driven tagging and multiple reviewers
Evolving definitions of harm	Risk	Periodically revisit and revise harm taxonomy
Manual scoring effort is time-consuming	Constraint	Explore auto-tagging with GPT-4 for initial scoring
No UI/UX for tagging built (manual JSON editing)	Constraint	Recommend building a lightweight UI for future efforts
Dataset lacks multilingual prompts	Risk	Include multilingual cases in future red teaming rounds

Business Impact

Reinforces OpenAI’s commitment to safety and responsible AI use
Serves as a prototype framework for future internal or external red teaming teams
Enables better auditability and transparency around harmful output mitigation
Provides actionable insights to improve LLM alignment and policy filters
Strengthens trust with enterprise and government partners in high-risk domains

Future Roadmap

Short-Term

Automate prompt scoring using zero-shot GPT-4 API
Integrate a lightweight UI for tagging and exporting
Visualize scoring distribution using a simple web dashboard

Mid-Term

Expand the prompt dataset to cover multilingual and multimodal inputs
Incorporate tagging consistency metrics into live feedback loops
Collaborate with external safety teams for broader dataset creation

Long-Term

Deploy red teaming as a service internally at OpenAI
Open-source toolkit for researchers and universities
Align the tagging framework with global safety compliance standards

Key Results

1000+ prompts categorized across 8 distinct harm categories
Achieved >85% inter-rater consistency through collaborative scoring sessions
Developed a severity-weighted red teaming matrix
Created a reporting dashboard summarizing prompt impact and spread
Delivered final dataset and recommendations to OpenAI’s Safety Review team

Constraints, Risks, and Mitigations

Issue / Constraint	Type	Mitigation / Notes
Subjectivity in category assignment	Risk	Use consensus-driven tagging and multiple reviewers
Evolving definitions of harm	Risk	Periodically revisit and revise harm taxonomy
Manual scoring effort is time-consuming	Constraint	Explore auto-tagging with GPT-4 for initial scoring
No UI/UX for tagging built (manual JSON editing)	Constraint	Recommend building a lightweight UI for future efforts
Dataset lacks multilingual prompts	Risk	Include multilingual cases in future red teaming rounds

Business Impact

Reinforces OpenAI’s commitment to safety and responsible AI use
Serves as a prototype framework for future internal or external red teaming teams
Enables better auditability and transparency around harmful output mitigation
Provides actionable insights to improve LLM alignment and policy filters
Strengthens trust with enterprise and government partners in high-risk domains

Future Roadmap

Short-Term

Automate prompt scoring using zero-shot GPT-4 API
Integrate a lightweight UI for tagging and exporting
Visualize scoring distribution using a simple web dashboard

Mid-Term

Expand the prompt dataset to cover multilingual and multimodal inputs
Incorporate tagging consistency metrics into live feedback loops
Collaborate with external safety teams for broader dataset creation

Long-Term

Deploy red teaming as a service internally at OpenAI
Open-source toolkit for researchers and universities
Align the tagging framework with global safety compliance standards

More Works More Works

More Works More Works

CMU - GEN AI LAB RAG

RAG SYSTEM + SEARCH + DB

2024

CMU - GEN AI LAB RAG

RAG SYSTEM + SEARCH + DB

2024

CMU - GEN AI LAB RAG

RAG SYSTEM + SEARCH + DB

2024

CMU - GEN AI LAB RAG

RAG SYSTEM + SEARCH + DB

2024

113 INDUSTRIES - CRM

HEALTHCARE + ML + ANALYTICS

2024

113 INDUSTRIES - CRM

HEALTHCARE + ML + ANALYTICS

2024

113 INDUSTRIES - CRM

HEALTHCARE + ML + ANALYTICS

2024

113 INDUSTRIES - CRM

HEALTHCARE + ML + ANALYTICS

2024

GO BACK TO TOP

Open AI - Red Teaming

Open AI - Red Teaming

VIsion & Problem Statement

VIsion & Problem Statement

VIsion & Problem Statement

Product Goal

Product Goal

Product Goal

User Stories

User Stories

User Stories

Core Features

Core Features

Core Features

Success Metrics

Success Metrics

Success Metrics

Technical Stack

Technical Stack

Technical Stack

Key Results

Constraints, Risks, and Mitigations

Business Impact

Future Roadmap

Key Results

Constraints, Risks, and Mitigations

Business Impact

Future Roadmap

Key Results

Constraints, Risks, and Mitigations

Business Impact

Future Roadmap

More Works More Works

More Works More Works