Regex implementation that can handle machine-generated regexes: *non-backtracking*, O(n)?

Question

Jul 20, 2016, 10:22 PM


Edit 2: For a practical demonstration of why this remains important, look no further than Stack Overflow's own outage caused by a regular expression today (2016-07-20)!

Edit: This question has evolved considerably since I first asked it. See below for two fast and compatible, but not fully featured, implementations. If you know of more or better implementations, please mention them; there still isn't an ideal implementation here yet!

Where can I find a reliably fast Regex implementation?

Does anyone know of a normal non-backtracking (System.Text.RegularExpressions backtracks), linear-time regex implementation, either for .NET or native and reasonably usable from .NET? To be useful, it would need to:

- have a worst-case time complexity for regex evaluation of O(m*n), where m is the length of the regex and n the length of the input.
- have a normal time complexity of O(n), since almost no regular expressions actually trigger the exponential state space or, if they can, only do so on a minute subset of the input.
- have a reasonable construction speed (i.e. no potentially exponential DFAs).
- be intended for use by human beings, not mathematicians - e.g. I don't want to reimplement Unicode character classes: .NET or PCRE style character classes are a plus.

Bonus points:

- bonus points for practicality if it implements stack-based features which let it handle nesting at the expense of consuming O(n+m) memory rather than O(m) memory.
- bonus points for either capturing subexpressions or replacements (if there are an exponential number of possible subexpression matches, then enumerating all of them is inherently exponential - but enumerating the first few shouldn't be, and similarly for replacements). You can work around missing either feature by using the other, so having either one is sufficient.
- lotsa bonus points for treating regexes as first-class values (so you can take the union, intersection, concatenation, negation - in particular negation and intersection, as those are very hard to do by string manipulation of the regex definition).
- lazy matching, i.e. matching on unlimited streams without putting it all in memory, is a plus. If the streams don't support seeking, capturing subexpressions and/or replacements aren't (in general) possible in a single pass.

Backreferences are out; they are fundamentally unreliable, i.e. they can always exhibit exponential behavior given pathological input cases.
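To make the O(m*n) requirement concrete, here is a minimal sketch of the set-of-states NFA simulation that gives that bound. It is an illustration only, not taken from any of the libraries discussed here: the names `compile_nfa`, `closure`, and `full_match` are made up for this example, and it assumes a toy syntax with literals, `.`, `|`, `*`, and parentheses (no escapes, character classes, or captures).

```python
def compile_nfa(pattern):
    """Thompson-style construction; creates O(m) states for a pattern of length m."""
    eps = []    # eps[s]: states reachable from s without consuming input
    step = []   # step[s]: (char, target) if s consumes a character, else None
    pos = 0     # current parse position in pattern

    def new_state():
        eps.append([])
        step.append(None)
        return len(eps) - 1

    def alternation():
        nonlocal pos
        start, end = concatenation()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            s2, e2 = concatenation()
            s, e = new_state(), new_state()
            eps[s] += [start, s2]          # fork into both branches
            eps[end].append(e)
            eps[e2].append(e)              # both branches join at e
            start, end = s, e
        return start, end

    def concatenation():
        start = end = new_state()
        while pos < len(pattern) and pattern[pos] not in "|)":
            s, e = repetition()
            eps[end].append(s)
            end = e
        return start, end

    def repetition():
        nonlocal pos
        start, end = atom()
        while pos < len(pattern) and pattern[pos] == "*":
            pos += 1
            s, e = new_state(), new_state()
            eps[s] += [start, e]           # enter the loop or skip it
            eps[end] += [start, e]         # repeat the loop or leave it
            start, end = s, e
        return start, end

    def atom():
        nonlocal pos
        c = pattern[pos]
        pos += 1
        if c == "(":
            start, end = alternation()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return start, end
        s, e = new_state(), new_state()
        step[s] = (c, e)                   # "." is treated as "any character"
        return s, e

    start, accept = alternation()
    if pos != len(pattern):
        raise ValueError("unbalanced ')'")
    return eps, step, start, accept


def closure(eps, states):
    """All states reachable via epsilon edges; O(m) because each state is visited once."""
    stack, seen = list(states), set(states)
    while stack:
        for t in eps[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def full_match(pattern, text):
    """True if the whole text matches; O(m) work per input character, never backtracks."""
    eps, step, start, accept = compile_nfa(pattern)
    current = closure(eps, {start})
    for ch in text:
        moved = {step[s][1] for s in current
                 if step[s] is not None and step[s][0] in (ch, ".")}
        current = closure(eps, moved)
        if not current:
            return False
    return accept in current
```

Because the matcher tracks a set of at most O(m) live states instead of exploring one path at a time, a pattern such as `(a*)*b` fails quickly on a long run of `a` characters. Adding captures, Unicode classes, or streaming on top of this is where real implementations differ.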

Such algorithms exist (this is basic automata theory...), but are there any practically usable implementations accessible from .NET?

Background: (you can skip this)

I like using Regex for quick and dirty text cleanup, but I've repeatedly run into problems where the common backtracking NFA implementation used by perl/java/python/.NET exhibits exponential behavior. Unfortunately, these cases are fairly easy to trigger as soon as you start generating regular expressions automatically. Even non-exponential performance can become extremely poor when you alternate between regexes that match the same prefix. For instance, in a really basic example, if you take a dictionary and turn it into a regular expression, expect terrible performance.
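As an illustration of the dictionary case, the sketch below builds the alternation from a trie so that shared prefixes appear only once in the generated pattern. This is a commonly used mitigation rather than something the question proposes, and `trie_regex` is a hypothetical helper name. It still runs on Python's backtracking `re` engine, so it only reduces repeated prefix retries and gives no linear-time guarantee.

```python
import re


def trie_regex(words):
    """Build a pattern matching exactly the given words, factoring out shared prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}                      # "" marks that a word ends at this node

    def emit(node):
        # One branch per outgoing character, in sorted order for stable output.
        branches = [re.escape(ch) + emit(node[ch]) for ch in sorted(node) if ch != ""]
        if not branches:
            return ""
        if len(branches) == 1:
            body = branches[0]
        else:
            body = "(?:" + "|".join(branches) + ")"
        if "" in node:
            # A word ends here, so everything after this point is optional.
            body = "(?:" + body + ")?"
        return body

    return emit(trie)


words = ["car", "cart", "cat", "dog"]
pattern = trie_regex(words)
naive = "|".join(re.escape(w) for w in words)   # the "terrible performance" form
```

With the naive form, a backtracking engine may re-scan the same prefix once per alternative. The trie form makes the engine commit to each shared character once, which tends to help a lot for large word lists, though worst-case behavior still depends on the engine.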

For a quick overview of why better implementations exist and have since the 60s, see Regular Expression Matching Can Be Simple And Fast.

Not quite practical options:

- Almost ideal: FSA toolkit. Can compile regexes to fast C implementations of DFAs+NFAs, allows transducers(!) too, has first-class regexes (encapsulation yay!) including syntax for intersection and parametrization. But it's in Prolog... (why is something with this kind of practical features not available in a mainstream language???)
- Fast but impractical: a full parser, such as the excellent ANTLR, generally supports reliably fast regexes. However, ANTLR's syntax is far more verbose, and of course permits constructs that may not generate valid parsers, so you'd need to find some safe subset.

Good implementations:

- RE2 - a Google open source library aiming for reasonable PCRE compatibility minus backreferences. I think this is the successor to the Unix port of Plan 9's regex lib, given the author.
- TRE - also mostly compatible with PCRE and even does backreferences, although using those you lose speed guarantees. And it has a mega-nifty approximate matching mode!

Unfortunately, both implementations are C++ and require interop for use from .NET.
