Internet Censorship Course / Book Workshop
This activity explores how automated systems detect and moderate toxic or harmful content online. It was originally motivated by Google’s Perspective API, which aimed to help online communities detect and filter toxic content using machine learning. While Perspective API pioneered this space, the rise of large language models has opened new possibilities for sentiment analysis and content moderation.
Large language models like ChatGPT (OpenAI), Claude (Anthropic), and others can perform sophisticated sentiment analysis and content evaluation. In this part, you’ll explore how well these models can detect toxicity, hate speech, and other problematic content.
Design prompts that ask the LLM to evaluate content. For example:
Please evaluate the following content for toxicity, hate speech, threats, or harassment. Classify it as ALLOW or BLOCK and explain your reasoning.
Content: [INSERT TEXT HERE]
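As a rough sketch, a prompt like this can be wrapped in a small template-and-parse helper. The `call_llm` parameter below is a hypothetical stand-in for whichever model API you end up using:

```python
# Minimal sketch of a moderation prompt template and verdict parser.
# `call_llm` is a hypothetical callable (prompt string -> reply string)
# standing in for whichever LLM API you choose.
PROMPT = (
    "Please evaluate the following content for toxicity, hate speech, "
    "threats, or harassment. Classify it as ALLOW or BLOCK and explain "
    "your reasoning.\n\nContent: {text}"
)

def moderate(text: str, call_llm) -> tuple[str, str]:
    reply = call_llm(PROMPT.format(text=text))
    # The prompt asks for an explicit ALLOW/BLOCK label; scan for it.
    verdict = "BLOCK" if "BLOCK" in reply.upper() else "ALLOW"
    return verdict, reply
```

Keeping the template separate from the API call makes it easy to run the same prompt against several models and compare their verdicts.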
Anthropic provides specific guidance and examples for using Claude for content moderation. This part explores Claude’s capabilities in more depth.
You can approach this in two ways:
Option A: Web Interface (No API Key Required)
Option B: API with Jupyter Notebook
If you have API access (Anthropic offers free credits for students):
```bash
pip install anthropic jupyter notebook
git clone https://github.com/anthropics/anthropic-cookbook.git
cd anthropic-cookbook/misc
jupyter notebook building_moderation_filter.ipynb
```
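If you'd rather start from a plain script than the notebook, the sketch below shows one way to send a moderation prompt through the Anthropic Python SDK. It assumes `ANTHROPIC_API_KEY` is set in your environment; the model name is an example and may need updating to one you have access to:

```python
# Sketch: classify one piece of content with Claude via the Anthropic SDK.
# Assumes ANTHROPIC_API_KEY is set; the model name is an example only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(text: str) -> str:
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # swap in whichever model you have access to
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Please evaluate the following content for toxicity, hate "
                "speech, threats, or harassment. Classify it as ALLOW or "
                "BLOCK and explain your reasoning.\n\nContent: " + text
            ),
        }],
    )
    return message.content[0].text

print(classify("You are all wonderful people!"))
```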
While text moderation focuses on analyzing words, audio and video content require different approaches. One common technique for audio is acoustic fingerprinting (a form of perceptual hashing over the signal’s spectrum), which matches how audio “sounds” rather than comparing files bit-by-bit.
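To make the idea concrete, here is a toy peak-based fingerprint in Python. It captures the general approach (hashing prominent spectrogram peaks so that perceptually similar clips share many hashes), not Echoprint’s actual algorithm; it assumes numpy and scipy are installed and takes a WAV file, so decode MP3s first:

```python
# Toy peak-based audio fingerprint: the general idea behind systems like
# Echoprint, NOT Echoprint's actual algorithm. Assumes numpy/scipy and a
# WAV input (decode MP3 to WAV first, e.g. with ffmpeg).
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def fingerprint(path, peaks_per_frame=3):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                    # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = spectrogram(samples, fs=rate, nperseg=1024)
    hashes = set()
    for t in range(spec.shape[1]):
        # keep only the loudest frequency bins in each time slice
        for f in np.argsort(spec[:, t])[-peaks_per_frame:]:
            # store (frequency bin, coarse time offset) pairs; clips that
            # "sound alike" share many pairs even if their bytes differ
            hashes.add((int(f), t // 4))
    return hashes

def similarity(a, b):
    # Jaccard overlap between two fingerprints, from 0.0 to 1.0
    return len(a & b) / max(len(a | b), 1)
```

Comparing `fingerprint(original)` against `fingerprint(modified)` with `similarity()` gives a quick sense of how robust such a scheme is to the modifications in the steps below.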
Download and install Echoprint, an open-source audio fingerprinting system
Select an MP3 file and compute its acoustic fingerprint
Test how robust the fingerprint is to various modifications: