Sentiment Analysis and Content Moderation
This activity explores how automated systems detect and moderate toxic or harmful content online. The activity was originally motivated by Google’s Perspective API, which aimed to help online communities detect and filter out toxic content using machine learning. While Perspective API pioneered this space, the rise of large language models has opened new possibilities for sentiment analysis and content moderation.
Part 1: Sentiment Analysis with Large Language Models
Large language models like ChatGPT (OpenAI), Claude (Anthropic), and others can perform sophisticated sentiment analysis and content evaluation. In this part, you’ll explore how well these models can detect toxicity, hate speech, and other problematic content.
Activity
- Choose one or more LLMs to experiment with (e.g., ChatGPT, Claude, or another model you have access to)
- Create test content that spans different categories:
- Clear violations (explicit hate speech, threats, harassment)
- Borderline cases (sarcasm, satire, context-dependent language)
- Words or phrases with multiple meanings
- Content in different languages
- Full sentences vs. isolated phrases
- Design prompts that ask the LLM to evaluate content (a small helper script for generating such prompts in bulk follows this list). For example:
Please evaluate the following content for toxicity, hate speech, threats, or harassment. Classify it as ALLOW or BLOCK and explain your reasoning.
Content: [INSERT TEXT HERE]
- Test and analyze:
- How do results compare across different LLMs?
- How do your results depend on how you phrase your prompt?
- Can you identify patterns in what gets flagged vs. what doesn’t?
- How does the LLM handle context-dependent content?
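If you want to generate these evaluation prompts systematically rather than retyping them for every test case, a short script can help. The sketch below is a minimal Python example: the category names and sample texts are placeholders you should replace with your own test content, and its output is meant to be pasted into whichever chat interface you are testing.

```python
# build_prompts.py - generate copy-and-paste evaluation prompts for any LLM chat interface

PROMPT_TEMPLATE = (
    "Please evaluate the following content for toxicity, hate speech, "
    "threats, or harassment. Classify it as ALLOW or BLOCK and explain "
    "your reasoning.\n\nContent: {text}"
)

# Placeholder test content -- replace with your own examples in each category.
TEST_CONTENT = {
    "clear_violation": ["<your explicit example here>"],
    "borderline_sarcasm": ["Oh great, another genius idea from management."],
    "multiple_meanings": ["That movie was sick."],
    "other_language": ["<a non-English example here>"],
}

def build_prompts(content_by_category):
    """Yield (category, prompt) pairs ready to paste into a chat window."""
    for category, texts in content_by_category.items():
        for text in texts:
            yield category, PROMPT_TEMPLATE.format(text=text)

if __name__ == "__main__":
    for category, prompt in build_prompts(TEST_CONTENT):
        print(f"--- {category} ---")
        print(prompt)
        print()
```

Keeping the prompt template in one place also makes it easy to try several phrasings and see how much the results depend on wording.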
Discussion Questions
- What are the limitations of using LLMs for content moderation?
- How might bad actors try to circumvent LLM-based moderation?
- What biases might be embedded in these models’ moderation decisions?
Part 2 (Optional): Claude Content Moderation
Anthropic provides specific guidance and examples for using Claude for content moderation. This part explores Claude’s capabilities in more depth.
Activity
You can approach this in two ways:
Option A: Web Interface (No API Key Required)
- Use Claude directly at claude.ai
- Create systematic tests for different moderation categories
- Compare Claude’s decisions to other moderation systems you’ve tested
Option B: API with Jupyter Notebook
If you have API access (Anthropic offers free credits for students):
- Install the required libraries:
pip install anthropic jupyter notebook
- Explore Anthropic’s content moderation cookbook:
git clone https://github.com/anthropics/anthropic-cookbook.git
cd anthropic-cookbook/misc
jupyter notebook building_moderation_filter.ipynb
- Review Anthropic’s content moderation use case documentation
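Once the anthropic package is installed and your API key is available in the ANTHROPIC_API_KEY environment variable, a single moderation request looks roughly like the sketch below. This is a minimal illustration rather than Anthropic's recommended moderation pipeline; the model name and the ALLOW/BLOCK response format are assumptions you can change.

```python
# moderate_one.py - minimal Claude moderation call (assumes ANTHROPIC_API_KEY is set)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODERATION_PROMPT = (
    "Please evaluate the following content for toxicity, hate speech, "
    "threats, or harassment. Respond with ALLOW or BLOCK on the first line, "
    "followed by a one-sentence explanation.\n\nContent: {text}"
)

def moderate(text, model="claude-3-5-sonnet-latest"):
    """Ask Claude to classify a single piece of content."""
    response = client.messages.create(
        model=model,  # assumed model alias; use whichever Claude model you have access to
        max_tokens=200,
        messages=[{"role": "user", "content": MODERATION_PROMPT.format(text=text)}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(moderate("You are all idiots and deserve what is coming to you."))
```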
Experiments to Try
- Test batch processing of content
- Customize moderation categories for specific platforms
- Compare automated vs. manual moderation results
- Analyze how Claude handles edge cases and ambiguous content
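For the batch-processing and custom-category experiments above, one simple approach is to loop over a list of texts and make the category list a parameter of the prompt, as in the sketch below. Note that this just sends one request per item rather than using any dedicated batch API, and the platform and category names are made up for illustration.

```python
# moderate_batch.py - batch moderation with platform-specific categories (sketch)
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def moderate_batch(texts, categories, model="claude-3-5-sonnet-latest"):
    """Classify each text against a caller-supplied list of disallowed categories."""
    prompt = (
        "Classify the content below as ALLOW or BLOCK. Block it only if it falls "
        "into one of these categories: {cats}. Explain briefly.\n\nContent: {text}"
    )
    results = []
    for text in texts:
        response = client.messages.create(
            model=model,
            max_tokens=150,
            messages=[{"role": "user",
                       "content": prompt.format(cats=", ".join(categories), text=text)}],
        )
        results.append((text, response.content[0].text))
    return results

if __name__ == "__main__":
    # Hypothetical categories for a children's gaming forum.
    categories = ["profanity", "personal insults", "sharing personal information"]
    for text, verdict in moderate_batch(["you're trash at this game", "gg everyone"], categories):
        print(text, "->", verdict)
```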
Part 3 (Optional): Audio Fingerprinting with Spectral Hashing
While text moderation focuses on analyzing words, audio and video content require different approaches. One common technique for audio is spectral hashing, which matches how audio “sounds” rather than comparing files bit-by-bit.
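To build some intuition for what a spectral fingerprint captures before installing Echoprint, the toy sketch below computes a very crude one: it takes a short-time Fourier transform, records the strongest frequency bin in each time slice, and hashes that sequence of peaks. This is only an illustration of the general idea, not Echoprint's actual algorithm, and it assumes the audio has already been decoded to a WAV file (the filename is a placeholder).

```python
# toy_fingerprint.py - a crude spectral "fingerprint" to illustrate the idea
# (NOT Echoprint's algorithm; expects a WAV file as input)
import hashlib
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def toy_fingerprint(path, n_slices=100):
    rate, samples = wavfile.read(path)
    samples = samples.astype(float)
    if samples.ndim > 1:                      # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = spectrogram(samples, fs=rate, nperseg=4096)
    # For each time slice, record which frequency bin carries the most energy.
    peak_bins = spec.argmax(axis=0)
    # Coarsen to a fixed number of slices so small timing differences matter less.
    step = max(1, len(peak_bins) // n_slices)
    coarse = peak_bins[::step][:n_slices]
    return hashlib.sha1(coarse.astype(np.int32).tobytes()).hexdigest()

if __name__ == "__main__":
    print(toy_fingerprint("song.wav"))        # hypothetical input file
```

Because it hashes the exact peak sequence, this toy version is brittle: almost any edit changes the digest completely. Real fingerprinting systems instead score partial matches between fingerprints, which is exactly what the robustness tests below probe.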
Activity
- Download and install Echoprint, an open-source audio fingerprinting system
- Select an MP3 file and compute its spectral fingerprint
- Test how robust the fingerprint is to various modifications (a sketch for creating some of these variants follows this list):
- Shorten the clip (e.g., take the first 30 seconds)
- Find a different version or remix of the same song
- Change the volume
- Change the speed
- Change the pitch
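One way to create the shortened and volume-changed test files is with pydub (which requires ffmpeg to be installed). The sketch below covers those two cases; the filenames are placeholders, and speed or pitch changes are easier to make in an audio editor or with ffmpeg filters directly.

```python
# make_variants.py - create modified copies of a track for robustness testing
# (a sketch using pydub; "original.mp3" is a placeholder filename)
from pydub import AudioSegment

song = AudioSegment.from_mp3("original.mp3")

# First 30 seconds (pydub slices are in milliseconds).
song[:30_000].export("clip_30s.mp3", format="mp3")

# Quieter and louder versions (gain adjustments are in dB).
(song - 10).export("quieter.mp3", format="mp3")
(song + 6).export("louder.mp3", format="mp3")
```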
Discussion Questions
- Why is spectral hashing useful for copyright enforcement?
- What are the limitations of this approach?
- How might bad actors try to evade fingerprint detection?
- How does this compare to text-based content matching?
Additional Resources