How to Analyze ACLED’s Notes Column using Natural Language Processing (NLP)

Beyond the structured variables in the dataset, ACLED also includes a ‘Notes’ column – a rich, qualitative narrative text that often includes additional or more specific event details not explicitly captured in other variables. This article explores how users can extract insights and granular event details from the ‘Notes’ column using Natural Language Processing (NLP) methods, covering two approaches: (1) keyword-based search and (2) model-based classification.

Understanding the ACLED ‘Notes’ Column

The ‘Notes’ column provides a brief summary of the main features of the event, typically two sentences long. While the notes reflect the variables coded in other ACLED columns, they can also provide additional information about events that is not captured in the structured variables, including:

Category	Information in the ‘Notes’ Column	What can be Extracted with Keyword Searches
*Who*	Actors involved (can sometimes provide more details beyond the coded actors, e.g. Civilian presence in ‘Battle’ events).	Identify additional actors involved, not captured in coded actor fields.
*What*	A narrative version of the event details describing actor activity and interaction, beyond just the coded event type (e.g., if there are several forms of violence as part of the same event, these will be found in the ‘Notes’).	Identify additional details on tactics or weapons used, or identify multiple forms of violence present in a coded event (e.g., use of grenades during ‘Armed Clash’ events).
*Where*	The most precise location available from sources if there is more granular information available beyond the coded village/town level	Extract granular location details to filter events by specific places (e.g., “market”, “school”).
*Why*	The immediate cause of the event (e.g., motive for an attack or protest, when this information is available).	Identify the immediate cause or motive behind an event (e.g., election-related violence, protest motivations).
*Fatalities & Injuries*	ACLED records a conservative fatality estimate as its own variable (see also: Fatalities). However, the ‘Notes’ may provide details on injuries, varying fatality estimates from multiple sources, and the identity or affiliation of those killed or injured, when this information is available.	Extract injury counts and ranges of reported fatalities, which are not recorded in structured columns.

Examples of ‘Notes’ Entries

Example 1 – Event Type: Explosions/Remote Violence
“On 28 May 2024, an Algerian army drone targeted Saharawi and Mauritanian gold miners operating near Dakhla refugee camp (Tindouf, Tindouf). Between 10 and 12 people were killed.“

Example 2 – Event Type: Battles
“On 30 May 2024, JNIM militants ambushed a patrol of gendarmes near the village of Gassel (Bani, Seno). Nine gendarmes were killed, 22 were injured, and one was reported missing.“
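
Details such as the fatality range in Example 1 or the injury count in Example 2 can be pulled out with regular expressions. The following is a minimal sketch rather than ACLED's own method: the function name `extract_casualties`, the small number-word table, and both patterns are assumptions made for illustration, and they only cover phrasings close to the two examples above.

```python
import re

# Assumed, deliberately small lookup for spelled-out numbers.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
NUM = r"(\d+|" + "|".join(NUMBER_WORDS) + r")"

# Matches phrasing like "Between 10 and 12 people were killed".
RANGE_KILLED = re.compile(
    rf"between {NUM} and {NUM}\b[^.]*?\bkilled", re.IGNORECASE)
# Matches phrasing like "22 were injured" within one clause (no comma or period).
INJURED = re.compile(rf"\b{NUM}\b[^.,]*?\binjured", re.IGNORECASE)


def to_int(token):
    """Convert a digit string or a known number word to an integer."""
    token = token.lower()
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


def extract_casualties(note):
    """Return a reported fatality range and an injury count, if present."""
    result = {"killed_range": None, "injured": None}
    match = RANGE_KILLED.search(note)
    if match:
        result["killed_range"] = (to_int(match.group(1)), to_int(match.group(2)))
    match = INJURED.search(note)
    if match:
        result["injured"] = to_int(match.group(1))
    return result


note_1 = ("On 28 May 2024, an Algerian army drone targeted Saharawi and "
          "Mauritanian gold miners operating near Dakhla refugee camp "
          "(Tindouf, Tindouf). Between 10 and 12 people were killed.")
note_2 = ("On 30 May 2024, JNIM militants ambushed a patrol of gendarmes near "
          "the village of Gassel (Bani, Seno). Nine gendarmes were killed, "
          "22 were injured, and one was reported missing.")

print(extract_casualties(note_1))
print(extract_casualties(note_2))
```

Patterns like these are brittle, so any extracted counts should be spot-checked against the original notes before they are used in analysis.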

Extracting Information from ‘Notes’ Using NLP

Because the ‘Notes’ column is free-text, systematic analysis requires keyword searches and Natural Language Processing (NLP). These methods help:

Filter events by keywords (e.g., “injured,” “landmine”).
Extract specific details (e.g., neighborhood locations, protest motivations).
Classify events automatically using machine learning.

The next section covers how to apply keyword searches in combination with basic text pre-processing techniques. For advanced users, we explore training a classification model to extract structured insights from the Notes. See section on Using Model-Based Approaches to Analyze ACLED’s ‘Notes’.

Extracting Insights from ACLED ‘Notes’ with Keyword Searches

The ‘Notes’ column contains additional details beyond the structured variables in the dataset, making it useful for targeted keyword searches. However, since most core event details (date, actors, location, event type) are already coded in separate columns, keyword searches are best suited for extracting information beyond the coded variables as noted in the above table.

Example Use Cases for Keyword Searches

Identifying Battle and Explosions/Remote Violence events targeting educational infrastructure or occurring near such infrastructure
- Question: How many ‘Battle’ and ‘Explosions/Remote Violence’ events in Ukraine between January 2022 and January 2025 targeted or occurred near educational infrastructure?
- Example Keywords: “kindergarten”, “school”, “classroom”, “college”, “university”, “education”, “academic”, “orphanage”.
Identifying Specific Weapons or Tactics
- Question: How many ‘Explosions/Remote Violence’ events in Yemen since 2023 involved missiles, and how many specifically used Katyusha missiles?
- Example Keywords: “drone”, “missile”, “rocket”, “Katyusha”.
Identifying Events Near Specific Landmarks
- Question: How many ‘Explosions/Remote Violence’ events occurred near a market in Yemen since 2023?
- Example Keywords: “market”
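
The first use case above can be sketched as a filter on the structured columns followed by a keyword match on the notes. This is an illustrative sketch only: the rows are invented and are not real ACLED events, the field names (`country`, `event_type`, `event_date`, `notes`) and ISO date format are assumptions about how the data was exported, and the event type strings should be checked against the values in your own download.

```python
import re
from datetime import date

KEYWORDS = ["kindergarten", "school", "classroom", "college",
            "university", "education", "academic", "orphanage"]
# \b ... \w* matches a keyword at the start of a word, so "schools" and
# "educational" are also caught.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\w*", re.IGNORECASE)


def count_matches(events, country, event_types, start, end):
    """Count keyword-matching events per event type within the given scope."""
    counts = {}
    for ev in events:
        in_scope = (ev["country"] == country
                    and ev["event_type"] in event_types
                    and start <= date.fromisoformat(ev["event_date"]) <= end)
        if in_scope and PATTERN.search(ev["notes"]):
            counts[ev["event_type"]] = counts.get(ev["event_type"], 0) + 1
    return counts


# Invented rows for illustration only.
events = [
    {"country": "Ukraine", "event_type": "Explosions/Remote Violence",
     "event_date": "2022-03-05", "notes": "Shelling damaged a school building."},
    {"country": "Ukraine", "event_type": "Battles",
     "event_date": "2023-06-10", "notes": "Forces clashed near a university campus."},
    {"country": "Ukraine", "event_type": "Battles",
     "event_date": "2023-07-01", "notes": "Forces clashed near a bridge."},
    {"country": "Ukraine", "event_type": "Protests",
     "event_date": "2023-07-02", "notes": "Teachers protested outside a school."},
    {"country": "Yemen", "event_type": "Explosions/Remote Violence",
     "event_date": "2023-02-11", "notes": "A missile hit a school."},
]

result = count_matches(
    events, "Ukraine", {"Battles", "Explosions/Remote Violence"},
    date(2022, 1, 1), date(2025, 1, 31))
print(result)
```

Filtering on the coded columns first keeps the keyword match focused on the question being asked and reduces the number of notes that need manual review.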

How to Build an Effective Keyword List

A strong keyword list should be carefully designed to capture relevant events while minimizing false positives.

Step 1: Review Sample Notes

Scan event Notes for regional phrasing variations.
Ensure synonyms and localized terms are included (e.g., ‘lathi’ for batons in India).

Step 2: Apply Standardization Techniques

Technique	Purpose	Example
*Lowercasing*	Standardizes capitalization differences.	“Election” → “election”
*Stemming*	Reduces words to their root form.	“Voting”, “voter” → “vot”
*Lemmatization*	Uses linguistic rules to find the base form of words.	“elected” → “elect”, “children” → “child”

Example Keyword Search and Results

Question: Which ‘Violence against Civilians’ events in Mexico in 2024 were election-related?

Keyword Lists:

Method	Keyword List
*Basic Matching*	“poll”, “vote”, “ballot”, “INE”, “election”, “candidate”, “electoral”, “campaign”, “rally”, “nominee”, “Instituto Nacional Electoral”
*With Stemming*	‘poll’, ‘vote’, ‘ballot’, ‘INE’, ‘elect’, ‘elector’, ‘candid’, ‘campaign’, ‘ralli’, ‘nomine’, ‘Instituto Nacional Electoral’
*With Lemmatization*	‘poll’, ‘vote’, ‘ballot’, ‘voting’, ‘INE’, ‘election’, ‘electoral’, ‘candidate’, ‘elected’, ‘campaign’, ‘rally’, ‘nominee’, ‘Instituto Nacional Electoral’
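
One practical detail when turning lists like these into search patterns is that ordinary words and acronyms need different treatment: a case-insensitive match on “ine” would also hit “line” or “marine”. The sketch below is one possible way to handle this, not the method used to produce the results in this article. The hand-written prefixes in `WORD_PREFIXES` are assumptions (they are not the output of a stemmer), as is the function name `matched_terms`.

```python
import re

# Hand-written word prefixes (an assumption, not stemmer output).
# "vot" covers vote/voting/voter; "elect" covers election/electoral/elected.
WORD_PREFIXES = ["poll", "vot", "ballot", "elect",
                 "candidat", "campaign", "rall", "nomin"]
# Acronyms are matched as whole words and case-sensitively.
ACRONYMS = ["INE"]

PREFIX_RE = re.compile(
    r"\b(" + "|".join(WORD_PREFIXES) + r")\w*", re.IGNORECASE)
ACRONYM_RE = re.compile(r"\b(" + "|".join(ACRONYMS) + r")\b")


def matched_terms(note):
    """Return the sorted list of prefixes and acronyms found in a note."""
    hits = {m.group(1).lower() for m in PREFIX_RE.finditer(note)}
    hits |= {m.group(1) for m in ACRONYM_RE.finditer(note)}
    return sorted(hits)


print(matched_terms("The victim intended to be a candidate in the next electoral process."))
print(matched_terms("The INE reported threats against staff."))
print(matched_terms("Residents waited in line near the marine base."))
```

Prefix matching trades precision for recall: “poll” would also match “pollution”, so short prefixes are worth reviewing against a sample of matched notes.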

Search Output: 206 election-related events found among 4,019 VAC events in Mexico (2024-01-01 to 2024-10-11).

Matched Events:

✅ Relevant Match:
“On 11 January 2024, in Jacona de Plancarte, Michoacán, a municipal commissioner of Citizen Movement (MC) was shot and killed by armed men in her business. The victim was a transgender woman and intended to be a candidate for councilor in the next electoral process. (1 fatality)”
❌ Irrelevant Match:
“On 30 September 2024, in Culiacán Rosales, Sinaloa, unidentified gunmen killed the leader of the Regional Livestock Union of Sinaloa. He was a former candidate in 2021.”
🚫 Irrelevant because the victim was a candidate in 2021, not in 2024.

Advantages and Limitations of Using Keyword Searches on ACLED ‘Notes’

Keyword searches are a quick and customizable way to analyze the ‘Notes’ column, but they also have limitations that require careful consideration.

✅ Advantages of Keyword Searches

Fast and Simple: Quickly identify trends, variations, and patterns in violence related to a specific topic.
Customizable Granularity: Users can broaden or narrow their search as needed.
- A broad search for election-related violence could include “election”, while a more specific search could focus on events involving electoral materials (e.g., “voting ballots,” “electoral documentation,” “electoral packages”).
Standardization: Since ACLED ‘Notes’ follow a structured style, granular searches can often return comprehensive results.

❌ Limitations of Keyword Searches

Missed Events Due to Variations in Phrasing:
- If the search is too narrow, it may fail to capture synonyms or alternative wording (e.g., searching for “electoral materials” but missing “ballots”).
Inconsistent Levels of Detail:
- Since multiple researchers contribute to the ‘Notes’ column, phrasing and terminology can vary. Some researchers may include extensive detail in ‘Notes’, while others may keep descriptions brief.
Regional Variability:
- The level of detail in ‘Notes’ may vary across different countries or regions due to variations in the information environments, making cross-regional comparisons challenging.
False Positives (Irrelevant Matches):
- Searches may return unrelated events.
- Example: Searching “election” could retrieve ‘Strategic Development’ events discussing election results rather than election-related violence, or events related to non-political elections, e.g., Labor Union Elections.
Requires Post-Processing:
- Refinements like negative keyword filtering or manual review may be needed to remove irrelevant events.
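
Negative keyword filtering can be sketched as a second pass over the keyword matches. The patterns and the three-way outcome below are hypothetical choices for illustration and would need tuning against manually reviewed notes; the first test note paraphrases the irrelevant match shown earlier.

```python
import re

POSITIVE = re.compile(
    r"\b(?:candidate|election|electoral|ballot)s?\b", re.IGNORECASE)

# Hypothetical negative patterns that signal a likely false positive.
NEGATIVE = [
    re.compile(r"\bformer candidate\b", re.IGNORECASE),
    re.compile(r"\b(?:labor|labour|trade) union election", re.IGNORECASE),
]


def classify(note):
    """Label a note as 'no_match', 'needs_review', or 'match'."""
    if not POSITIVE.search(note):
        return "no_match"
    if any(pattern.search(note) for pattern in NEGATIVE):
        return "needs_review"
    return "match"


print(classify("Unidentified gunmen killed a union leader. He was a former candidate in 2021."))
print(classify("She intended to be a candidate in the next electoral process."))
print(classify("Gunmen attacked a convoy on the highway."))
```

Routing negative hits to manual review instead of dropping them is a deliberate choice here: a former candidate may still be targeted for election-related reasons, so a person should make the final call.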

Using Model-Based Approaches to Analyze ACLED ‘Notes’

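Model-based approaches train a classifier on labeled ‘Notes’ so that events can be categorized automatically rather than by keyword match alone. ACLED has tested this approach using a Few-Shot model (SetFit), which requires only a small dataset to achieve strong results. Below is a simplified workflow example: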

Step 1: Create an Initial Keyword List

Perform a keyword search to collect a subset of events.
This will serve as a starting point for labeling training data.

Step 2: Label Training Data

Manually review and categorize the matched events.
Label types:
- Binary labels (e.g., Election Violence = 1, Other Events = 0).
- Categorical (e.g., Election/Polls) if a multi-label model is applied.
- Negative labels (e.g., irrelevant matches or ‘edge cases’) to help the model learn what NOT to classify.
It is advisable to create a training dataset that is balanced across classes (e.g., negative and positive events when using a binary model, or each class plus negative-labeled events in a multi-label model).
It is advisable to start with a relatively small training dataset that covers the data’s diversity, and to expand it iteratively after evaluating performance; this helps avoid model overfitting. For Few-Shot models, even a small, well-curated dataset (~100 examples per category) can yield high accuracy.
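
Drawing a class-balanced training set from manually labeled events could look like the sketch below. The function name `balanced_sample`, the `label` field, and the default of 100 examples per class (taken from the rough figure above) are assumptions for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key="label", per_class=100, seed=42):
    """Draw the same number of rows from every class.

    If a class has fewer than `per_class` rows, all classes are
    reduced to the size of the smallest class.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    n = min(per_class, min(len(group) for group in by_label.values()))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], n))
    rng.shuffle(sample)
    return sample


# Invented labeled rows: 30 positive and 200 negative examples.
labeled = ([{"notes": f"positive note {i}", "label": 1} for i in range(30)]
           + [{"notes": f"negative note {i}", "label": 0} for i in range(200)])

train_rows = balanced_sample(labeled)
print(len(train_rows))
```

Downsampling to the smallest class is the simplest way to balance; labeling more examples of the rare class is usually preferable when time allows.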

Step 3: Train and Evaluate the Model

Split the dataset: 60% training, 40% testing.
Train the model and evaluate performance.
- Model Confidence scores can be considered to identify uncertain classifications.
- Review misclassified cases and refine training data as needed.
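
The split and the confidence check in this step can be sketched as follows. The example assigns a `sample` field with the values "train" and "test", which is the convention the training function in the annex filters on. The function names and the 0.7 confidence threshold are arbitrary assumptions, not recommended values.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", train_frac=0.6, seed=0):
    """Assign each row to 'train' or 'test', keeping class shares equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    out = []
    for group in by_label.values():
        group = group[:]            # copy so the input is not reordered
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        for i, row in enumerate(group):
            out.append({**row, "sample": "train" if i < cut else "test"})
    return out


def flag_uncertain(predictions, threshold=0.7):
    """Return predictions whose confidence is below the threshold.

    `predictions` is a list of (event_id, predicted_label, confidence).
    """
    return [p for p in predictions if p[2] < threshold]


# Invented rows: 50 positive and 50 negative labeled notes.
rows = ([{"notes": f"pos {i}", "label": 1} for i in range(50)]
        + [{"notes": f"neg {i}", "label": 0} for i in range(50)])
split = stratified_split(rows)
print(sum(r["sample"] == "train" for r in split), "train rows")

print(flag_uncertain([("a", 1, 0.95), ("b", 0, 0.55)]))
```

Low-confidence predictions are good candidates for manual review and, once labeled, for inclusion in the next round of training data.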

Step 4: Apply and Iterate

Once the model has a satisfactory level of accuracy:
- Apply the model to a random sample of events and manually verify results.
- If useful events were missed in the initial keyword search, add them to the keyword list and repeat steps 1-4.
- Continue iterating until no new keywords are found and model performance is satisfactory.
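
The iteration loop above can be supported with two small helpers: one that draws a random sample for manual verification, and one that surfaces frequent words in events the model flagged but the keyword search missed. This is a rough sketch under stated assumptions: the `pred` and `keycheck` fields mirror the columns created by the annex code, while the function names, the minimum word length, and the invented rows are illustrative only.

```python
import random
import re
from collections import Counter


def review_sample(rows, n=50, seed=1):
    """Draw up to n random rows for manual verification."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def candidate_keywords(rows, known_keywords, top_n=10, min_len=4):
    """Count words in model-positive rows that had no keyword match."""
    known = {k.strip().lower() for k in known_keywords}
    counts = Counter()
    for row in rows:
        if row["pred"] == 1 and row["keycheck"] == 0:
            # Count each word once per note; [a-z]+ ignores accented letters.
            tokens = set(re.findall(r"[a-z]+", row["notes"].lower()))
            counts.update(t for t in tokens
                          if len(t) >= min_len and t not in known)
    return counts.most_common(top_n)


# Invented rows for illustration only.
rows = [
    {"notes": "Armed men burned electoral packages.", "pred": 1, "keycheck": 1},
    {"notes": "Gunmen attacked a precinct during the primaries.", "pred": 1, "keycheck": 0},
    {"notes": "A mob stormed a precinct.", "pred": 1, "keycheck": 0},
    {"notes": "Farmers clashed over land.", "pred": 0, "keycheck": 0},
]

print(candidate_keywords(rows, ["election", "electoral", "ballot"], top_n=1))
print(len(review_sample(rows, n=2)))
```

The word counts are only suggestions; each candidate still needs a human judgment on whether it belongs in the keyword list.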

Choosing the Right Classification Model

Model Type	Best Use Case	Challenges
BERT (Monolingual or Multilingual), RoBERTa, DistilBERT	High-accuracy classification for large datasets	Requires hundreds of training examples for high performance
Few-Shot Models (SetFit)	Works well with smaller datasets	Requires some manual labeling, but far less than traditional models. Provides a multilingual and multi-label version.
Zero-Shot Models	Classify events without training data	Often too inaccurate for use with ACLED data

✅ Best Practical Choice: Few-Shot models (e.g., SetFit) offer a balance between efficiency and accuracy, requiring only a small dataset while performing well even on complex cases. In ACLED’s own application of this model, the maximum training data size per class did not exceed 100 data points (in its final version after various iterations and refinements). Learn more about SetFit: Hugging Face SetFit Documentation.

Best Practices for Applying NLP to ACLED ‘Notes’

✅ Do’s

✔️Normalize text (lowercasing, stemming, lemmatization) before keyword searches.
✔️Use fuzzy matching to catch variations (e.g., “poll” should also match “polling”).
✔️Manually review a sample of the ‘Notes’ column before creating a keyword list (e.g., for the region and event type of interest) to understand the level of detail available and the keywords and phrasing used.
✔️Consider model-based classification when keyword searches produce too many false positives. Combining a keyword search with a classification model can lead to more accurate results.
✔️Include diverse training examples and cross-check results when training a classification model.

❌ Don’ts

🚫 Don’t list every variation as a separate exact match (“election”, “Election”, “elections”, “Elections”). Use normalization or fuzzy matching instead.
🚫 Don’t assume uniform phrasing in ‘Notes’: wording varies by region and researcher. Familiarize yourself with the ‘Notes’ column to ensure that what you are looking for is indeed captured there.
🚫 Don’t rely solely on keyword searches when event phrasing is highly diverse; consider classification models instead.
🚫 Don’t overcomplicate keyword searches with excessive regex, which is likely to miss events. NLP models can often provide better results.

Key Takeaways

Keyword searches are useful for quick insights but can miss nuances and events, and require manual filtering.
Classification models provide more precise and scalable results.
Few-Shot models (SetFit) offer high accuracy with minimal training data, making them an ideal choice for analysis on ACLED’s ‘Notes’ column.

Annex: Code snippets

Python

Fuzzy Matching

# import packages
import pandas as pd

# df is assumed to be an already loaded DataFrame of ACLED events with a 'notes' column

keywords = ["polling", "poll", "vote", "ballot", "voting", "voter", "voters", "election", "elections", "electoral", "candidate", "elected",  "campaign", "rally", "nominee", "instituto nacional electoral", " ine "]

# create one regular expression from the keyword list (alternatives separated by |)
pattern = '|'.join(keywords)

# perform fuzzy (substring) matching of event notes with the keyword list, e.g. "poll" also matches "polling"
# case=False makes the keyword search case-insensitive; na=False treats missing notes as non-matches

df['contains_election'] = df['notes'].str.contains(pattern, case=False, na=False)

# filter rows based on match
df_elections = df[df['contains_election']]

Exact Matching

# import packages
import pandas as pd

# df is assumed to be an already loaded DataFrame of ACLED events with a 'notes' column

# with word boundaries the acronym no longer needs surrounding spaces, so it is listed as "ine"
keywords = ["polling", "poll", "vote", "ballot", "voting", "voter", "voters", "election", "elections", "electoral", "candidate", "elected",  "campaign", "rally", "nominee", "instituto nacional electoral", "ine"]

# create a regular expression pattern with word boundaries
# (?: ... ) is a non-capturing group, which avoids the pandas warning about match groups
pattern = r'\b(?:' + '|'.join(keywords) + r')\b'

# perform exact (whole-word) matching of event notes with the keyword list
# case=False makes the keyword search case-insensitive; na=False treats missing notes as non-matches

df['contains_election'] = df['notes'].str.contains(pattern, case=False, na=False)

# filter rows based on match
df_elections = df[df['contains_election']]

Match after pre-processing

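import re
from nltk.stem import PorterStemmer
import pandas as pd

# Initialize the stemmer
stemmer = PorterStemmer()

# df is assumed to be an already loaded DataFrame of ACLED events with a 'notes' column

keywords = ["polling", "poll", "vote", "ballot", "voting", "voter", "voters", "election", "elections", "electoral", "candidate", "elected",  "campaign", "rally", "nominee", "instituto nacional electoral", "ine"]

def stem_text(text):
    # lowercase, split into word tokens (dropping punctuation) and stem each token
    return ' '.join(stemmer.stem(token) for token in re.findall(r'\w+', str(text).lower()))

# stem the keywords word by word (so multi-word phrases are handled) and drop duplicates
stemmed_keywords = sorted({stem_text(keyword) for keyword in keywords})

# Combine stemmed keywords into a whole-word pattern
# word boundaries keep "ine" from matching inside words such as "line"
pattern = r'\b(?:' + '|'.join(stemmed_keywords) + r')\b'

# Apply stemming to the 'notes' column (missing notes become empty strings)
df['stemmed_notes'] = df['notes'].fillna('').apply(stem_text)

# Search for keywords in the stemmed 'notes' column
df['contains_election'] = df['stemmed_notes'].str.contains(pattern, case=False)

# filter rows based on match
df_elections = df[df['contains_election']]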

R

Fuzzy Matching

# Load necessary library
library(dplyr)
library(stringr)

# Define keywords
keywords <- c("polling", "poll", "vote", "ballot", "voting", "voter", "voters", "election", "elections", "electoral","candidate", "elected", "campaign", "rally", "nominee", "instituto nacional electoral", " ine ")

# Create a regex pattern without word boundaries, so keywords also match inside longer words
pattern <- paste0("(", paste(keywords, collapse = "|"), ")")

# Add a binary column indicating whether any keyword matches (case-insensitive)
df <- df %>%
  mutate(contains_election = str_detect(notes, regex(pattern, ignore_case = TRUE)))

# Filter rows where the binary column is TRUE
df_elections <- df %>% filter(contains_election)

Exact Matching

# Load necessary library
library(dplyr)
library(stringr)

# Define keywords
# With word boundaries the acronym no longer needs surrounding spaces, so it is listed as "ine"
keywords <- c("polling", "poll", "vote", "ballot", "voting", "voter", "voters", "election", "elections", "electoral", "candidate", "elected", "campaign", "rally", "nominee", "instituto nacional electoral", "ine")

# Create a regex pattern with word boundaries for exact matches
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# Add a binary column indicating whether any keyword matches (case-insensitive)
df <- df %>%
  mutate(contains_election = str_detect(notes, regex(pattern, ignore_case = TRUE)))

# Filter rows where the binary column is TRUE
df_elections <- df %>% filter(contains_election)

Match after pre-processing

# Load necessary libraries
library(SnowballC)
library(dplyr)
library(stringr)

# Example dataframe
df <- data.frame(
  notes = c(
    "The voters are casting their votes.",
    "The electoral campaign rally was successful.",
    "Polling booths are set up for the elections.",
    "The poll results are out."
  ),
  stringsAsFactors = FALSE
)

# Define keywords
keywords <- c("polling", "poll", "vote", "ballot", "voting", "voter", "voters",
              "election", "elections", "electoral", "candidate", "elected", 
              "campaign", "rally", "nominee", "instituto nacional electoral", " ine ")

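# Helper: lowercase, split into word tokens (dropping punctuation), and stem each token
stem_text <- function(text) {
  sapply(str_extract_all(tolower(text), "[[:alnum:]]+"), function(words) {
    paste(SnowballC::wordStem(words, language = "porter"), collapse = " ")
  })
}

# Stem the keywords word by word (so multi-word phrases are handled) and drop duplicates
stemmed_keywords <- unique(stem_text(keywords))

# Combine stemmed keywords into a whole-word pattern
pattern <- paste0("\\b(", paste(stemmed_keywords, collapse = "|"), ")\\b")

# Apply stemming to the 'notes' column
df <- df %>%
  mutate(stemmed_notes = stem_text(notes))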

# Search for stemmed keywords in the stemmed 'notes' column
df <- df %>%
  mutate(contains_election = str_detect(stemmed_notes, regex(pattern, ignore_case = TRUE)))

# Filter rows based on match
df_elections <- df %>% filter(contains_election)

Classification Model (Python)

Model Training

import pandas as pd
# SetFitTrainer is the pre-1.0 setfit API; setfit 1.0 and later deprecate it in favor of Trainer
from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from sklearn.metrics import accuracy_score

def train_model_binary(df, label_col="label"):
    
    """
    Trains binary model with the selected training data.
    df needs a 'notes' column, a 0/1 label column (label_col; the default name is an assumption)
    and a 'sample' column whose values start with "train" or "test".
    It provides accuracy which is computed using labelled test data in the training dataset.
    The model is saved in the folder trained_model.
    The function returns trained model and test data with predictions and confidence scores.
    
    """
    
    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2") 
    
    # bring train and test data in right format to feed into model
    data_train = df[df['sample'].str.startswith("train")]
    data_test = df[df['sample'].str.startswith("test")]
    data_train = Dataset.from_pandas(data_train)
    data_test = Dataset.from_pandas(data_test)
    
    # accuracy score function
    def compute_metrics(y_pred, y_test):
        accuracy = accuracy_score(y_test, y_pred)
        return {"accuracy": accuracy}
    
    
    # create trainer
    trainer = SetFitTrainer(
        model=model,
        train_dataset=data_train,
        eval_dataset=data_test,
        loss_class=CosineSimilarityLoss,
        metric=compute_metrics,
        num_iterations=20,
        num_epochs=1,
        # map the DataFrame columns to the names the trainer expects
        column_mapping={
            "notes": "text",
            label_col: "label",
        },
    )
    # Train and evaluate
    trainer.train()
    metrics = trainer.evaluate()
    print(metrics)
    
    model.save_pretrained("trained_model")
    
    # test data with predicted tags and confidence scores
    out_test = data_test.to_pandas()
    results = model.predict_proba(list(data_test["notes"]))
    # highest class probability per event, and the index of that class (equal to the label for 0/1 labels)
    probs, preds = results.max(dim=1)
    out_test["pred_tag"] = preds.cpu().numpy()
    out_test["pred_prob"] = probs.cpu().numpy()
    
    return model, out_test

# train model with above function; df is assumed to hold the labelled training and test events

model, out_test = train_model_binary(df)

def apply_model(model, df_large, keywords_lst):
    
    """
    Applies trained model to the complete dataset.
    First, a keycheck column (binary) is created, identifying events with a keyword match.
    The model is then applied to the dataset, and a binary prediction and a continuous confidence
    score column are created.
    The dataset with the new columns is returned.
    """
    df_large['notes_lower'] = df_large['notes'].str.lower()
    
    def check_keywords(note, keywords_lst):
        # return 1 if any keyword occurs in the (already lowercased) note, else 0
        for keyword in keywords_lst:
            if keyword.lower() in note:
                return 1
        return 0
    
    # create keycheck column to stratify sample along given sector's keywords
    df_large['keycheck'] = df_large['notes_lower'].apply(lambda x: check_keywords(x, keywords_lst))
        
    results = model.predict_proba(df_large["notes"].to_list())
    # highest class probability per event, and the index of that class
    probs, preds = results.max(dim=1)
    df_large["pred_prob"] = probs.cpu().numpy()
    df_large["pred"] = preds.cpu().numpy()

    return df_large

# apply the trained model with above function; keywords is the keyword list from above
# df_complete is assumed to be the full DataFrame of events with a 'notes' column

df_preds = apply_model(model, df_complete, keywords)
