Internet Censorship Course / Book Workshop
This activity explores how automated systems detect and moderate toxic or harmful content online. It was originally motivated by Google’s Perspective API, which aimed to help online communities detect and filter toxic content using machine learning. While Perspective API pioneered this space, the rise of large language models has opened new possibilities for sentiment analysis and content moderation.
Large language models like ChatGPT (OpenAI), Claude (Anthropic), and others can perform sophisticated sentiment analysis and content evaluation. In this part, you’ll explore how well these models can detect toxicity, hate speech, and other problematic content.
Design prompts that ask the LLM to evaluate content. For example:
Please evaluate the following content for toxicity, hate speech, threats, or harassment. Classify it as ALLOW or BLOCK and explain your reasoning.
Content: [INSERT TEXT HERE]
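As a rough sketch, a prompt like this can be wrapped in a small template-and-parse helper. The `call_llm` parameter below is a hypothetical stand-in for whichever model API you end up using:

```python
# Minimal sketch of a moderation prompt template and verdict parser.
# `call_llm` is a hypothetical callable (prompt string -> reply string)
# standing in for whichever LLM API you choose.
PROMPT = (
    "Please evaluate the following content for toxicity, hate speech, "
    "threats, or harassment. Classify it as ALLOW or BLOCK and explain "
    "your reasoning.\n\nContent: {text}"
)

def moderate(text: str, call_llm) -> tuple[str, str]:
    reply = call_llm(PROMPT.format(text=text))
    # The prompt asks for an explicit ALLOW/BLOCK label; scan for it.
    verdict = "BLOCK" if "BLOCK" in reply.upper() else "ALLOW"
    return verdict, reply
```

Keeping the template separate from the API call makes it easy to run the same prompt against several models and compare their verdicts.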
Anthropic provides specific guidance and examples for using Claude for content moderation. This part explores Claude’s capabilities in more depth.
You can approach this in two ways:
Option A: Web Interface (No API Key Required)
Option B: API with Jupyter Notebook
If you have API access (Anthropic offers free credits for students):
```bash
pip install anthropic jupyter notebook
git clone https://github.com/anthropics/anthropic-cookbook.git
cd anthropic-cookbook/misc
jupyter notebook building_moderation_filter.ipynb
```
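If you'd rather start from a plain script than the notebook, the sketch below shows one way to send a moderation prompt through the Anthropic Python SDK. It assumes `ANTHROPIC_API_KEY` is set in your environment; the model name is an example and may need updating to one you have access to:

```python
# Sketch: classify one piece of content with Claude via the Anthropic SDK.
# Assumes ANTHROPIC_API_KEY is set; the model name is an example only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(text: str) -> str:
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # swap in whichever model you have access to
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Please evaluate the following content for toxicity, hate "
                "speech, threats, or harassment. Classify it as ALLOW or "
                "BLOCK and explain your reasoning.\n\nContent: " + text
            ),
        }],
    )
    return message.content[0].text

print(classify("You are all wonderful people!"))
```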
While text moderation focuses on analyzing words, audio and video content require different approaches. One common technique for audio is acoustic fingerprinting (a form of perceptual hashing over the signal’s spectrum), which matches how audio “sounds” rather than comparing files bit-by-bit.
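To make the idea concrete, here is a toy peak-based fingerprint in Python. It captures the general approach (hashing prominent spectrogram peaks so that perceptually similar clips share many hashes), not Echoprint’s actual algorithm; it assumes numpy and scipy are installed and takes a WAV file, so decode MP3s first:

```python
# Toy peak-based audio fingerprint: the general idea behind systems like
# Echoprint, NOT Echoprint's actual algorithm. Assumes numpy/scipy and a
# WAV input (decode MP3 to WAV first, e.g. with ffmpeg).
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def fingerprint(path, peaks_per_frame=3):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                    # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = spectrogram(samples, fs=rate, nperseg=1024)
    hashes = set()
    for t in range(spec.shape[1]):
        # keep only the loudest frequency bins in each time slice
        for f in np.argsort(spec[:, t])[-peaks_per_frame:]:
            # store (frequency bin, coarse time offset) pairs; clips that
            # "sound alike" share many pairs even if their bytes differ
            hashes.add((int(f), t // 4))
    return hashes

def similarity(a, b):
    # Jaccard overlap between two fingerprints, from 0.0 to 1.0
    return len(a & b) / max(len(a | b), 1)
```

Comparing `fingerprint(original)` against `fingerprint(modified)` with `similarity()` gives a quick sense of how robust such a scheme is to the modifications in the steps below.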
Download and install Echoprint, an open-source audio fingerprinting system
Select an MP3 file and compute its acoustic fingerprint
Test how robust the fingerprint is to various modifications: