Inefficient Regular Expression Complexity in nltk/nltk
Dec 7th 2021
nltk is vulnerable to ReDoS attack because of
^-?[0-9]+(.[0-9]+)?$ regex. If attacker succeeds to use malicious payload against
RegexpTagger used in function get_pos_tagger and malt_regex_tagger, it will cause a nasty DoS.
Proof of Concept
// PoC.py import re, time pattern = re.compile("^-?[0-9]+(.[0-9]+)?$") s = "-" s += "0" * 50000 s += "q" t = time.time() print("searching...") re.search(pattern, s) print(time.time() - t)
On my new machine I needed only 50k characters to cause a 23+ seconds matching. For instance, in similar report to this project 160k characters were processed just in 3+ seconds.
The issue here is that in
[0-9]+(.[0-9]+) match each other, which causes a nasty backtracking in case of failure.
This vulnerability is capable of causing DoS due to CPU resources consumption.