LLM-Assisted Content Moderation

The goal of this assignment is to understand how large language models can be used for automated content moderation and to critically analyze the strengths, weaknesses, and policy implications of such systems.

Background

Content moderation at scale is one of the most challenging problems facing online platforms. Platforms must balance free expression with safety, navigate cultural differences, and make millions of decisions per day about what content violates their policies. Increasingly, platforms are exploring the use of LLMs to augment or replace traditional keyword-based and machine learning approaches to content moderation.

In this assignment, you will use Anthropic’s Claude content moderation framework as your baseline implementation, then extend and evaluate it for real-world platform policies.

Required Reading

Before starting the assignment, review these resources:

Anthropic’s Content Moderation Guide: https://docs.claude.com/en/docs/about-claude/use-case-guides/content-moderation
Building a Moderation Filter (Cookbook): https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_moderation_filter.ipynb

These resources demonstrate Claude’s approach to content moderation, including prompt engineering techniques, risk-level classification, and best practices.

Task

You will implement Anthropic’s content moderation framework and evaluate its performance on a test dataset.

Part 1: Implementation

Following Anthropic’s content moderation guide:

Set up the Claude API (free tier or educational credits available)
Implement the basic moderation function from the guide
Choose a platform and policy focus:
- Select 1 platform (e.g., Reddit, Twitter/X, YouTube, Facebook)
- Pick 2-3 content categories to focus on (e.g., hate speech, harassment, spam)
- Document the platform’s policy for these categories from their Terms of Service
Create a test dataset of 10-15 examples including:
- Clear violations
- Clear non-violations
- Borderline/context-dependent cases
Source from public datasets, create synthetic examples, or use anonymized real content.

Part 2: Testing and Analysis

Test two prompting approaches:
- Basic classification (simple prompt)
- Chain-of-thought reasoning (using <thinking> tags)
Evaluate performance:
- Record your own judgment for each test case (human baseline)
- Run both prompting strategies on your test dataset
- Compare Claude’s decisions to your judgments
- Calculate basic accuracy for both approaches
Analyze limitations:
- Identify cases where Claude struggled (sarcasm, context, etc.)
- Test if Claude’s safety training overrides your instructions on any examples
- Note any policy conflicts between Claude’s values and platform policies

Part 3: Policy Analysis

Write a 2-page analysis addressing:

Performance: How accurate was Claude? Where did it succeed/fail?
Comparison of approaches: Did chain-of-thought reasoning help? Why or why not?
Context understanding: How well did Claude handle nuance and context?
Cost feasibility: Using Anthropic’s pricing, estimate costs for moderating at scale
Policy recommendations:
- When is LLM-based moderation appropriate vs. problematic?
- What safeguards are needed (human review, appeals)?
- How should platforms balance automation with transparency?

Submission

Your submission should include:

Code and prompts:
- Your implementation based on Anthropic’s framework
- Both prompt variations (basic and chain-of-thought)
- Test dataset with your human judgments
Results documentation (1 page):
- Platform policy summary
- Claude outputs for all test cases
- Accuracy comparison between prompting strategies
- Examples of successes and failures
Policy analysis (2 pages):
- Performance evaluation
- Limitations and edge cases
- Cost analysis
- Policy recommendations
Your name

You’re encouraged to use AI tools to help with coding and analysis, but you must understand and be able to explain all aspects of your implementation.

To submit, add everything you’d like to include to your repo, commit the changes, and push to GitHub. Please do not push a compressed version (i.e., a zip file) of your submission.

Tips

Start early! Set up your Claude API access immediately
Follow Anthropic’s guide closely—it includes working code examples
Use Claude Haiku for cost-effectiveness (recommended in guide)
Keep your test set small but diverse—10-15 examples is sufficient to learn the patterns
Focus on interesting edge cases rather than obvious violations
Document what you learn about the technology’s limitations

Resources

Anthropic Content Moderation Guide: https://docs.claude.com/en/docs/about-claude/use-case-guides/content-moderation
Anthropic Cookbook (Moderation Filter): https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_moderation_filter.ipynb
Claude API Documentation: https://docs.anthropic.com/
Claude Pricing: https://www.anthropic.com/pricing (Note: Haiku is ~$2,590 per billion posts vs Sonnet at ~$31,080)
Academic hate speech datasets: Search for datasets on HuggingFace, Kaggle, or academic repositories