Module regex::bytes 
                   
Match regular expressions on arbitrary bytes.
This module provides a nearly identical API to the one found in the top-level of this crate. There are two important differences:
- Matching is done on &[u8] instead of &str. Additionally, Vec<u8> is used where String would have been used.
- Regular expressions are compiled with Unicode support disabled by
default. This means that while Unicode regular expressions can only match valid
UTF-8, regular expressions in this module can match arbitrary bytes. Unicode
support can be selectively enabled via the u flag in regular expressions provided by this sub-module.
Example: match null-terminated strings
This shows how to find all null-terminated strings in a slice of bytes:
```rust
use regex::bytes::Regex;

let re = Regex::new(r"(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> = re.captures_iter(text)
                          .map(|c| c.name("cstr").unwrap())
                          .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
```
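For comparison, the same result can be reproduced with only the standard library, which may help when checking what the regex is expected to return. This is a minimal sketch and not part of this crate: the function name null_terminated is made up for illustration, and it mirrors the pattern [^\x00]+\x00 by keeping only non-empty runs that are actually followed by a null byte.

```rust
/// Collects every non-empty run of non-null bytes that is followed by a
/// null terminator. `null_terminated` is a hypothetical helper, not a
/// function provided by the regex crate.
fn null_terminated(text: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, &b) in text.iter().enumerate() {
        if b == 0 {
            // A terminator closes the current run; skip empty runs, just
            // as `[^\x00]+` requires at least one byte.
            if i > start {
                out.push(&text[start..i]);
            }
            start = i + 1;
        }
    }
    // Any trailing bytes without a terminator are dropped.
    out
}
```

Calling null_terminated on the text from the example yields the same three byte strings: foo, bar and baz.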
Example: selectively enable Unicode support
This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file):
```rust
use std::str;
use regex::bytes::Regex;

let re = Regex::new(
    r"\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
let caps = re.captures(text).unwrap();

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes.
assert_eq!((7, 10), caps.pos(1).unwrap());

// If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
let title = str::from_utf8(caps.at(1).unwrap()).unwrap();
assert_eq!("☃", title);
```
In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is guaranteed to be valid UTF-8.
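The guarantee can be checked by hand against the bytes from the example above. The sketch below uses only std::str::from_utf8 from the standard library. The helper name valid_utf8_prefix is hypothetical, and it only approximates what a greedy (?u:(.*)) captures for this particular input, since it ignores the rule that . stops at \n.

```rust
/// Returns the longest prefix of `bytes` that is valid UTF-8.
/// Hypothetical helper for illustration; not part of the regex crate.
fn valid_utf8_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        // `valid_up_to` is the length of the leading valid portion, so
        // re-validating that prefix cannot fail.
        Err(e) => std::str::from_utf8(&bytes[..e.valid_up_to()]).unwrap(),
    }
}
```

For the example input, valid_utf8_prefix(&text[7..]) returns "☃", which agrees with the capture span (7, 10). The byte \x80 at offset 10 is not valid UTF-8 on its own, so the capture cannot extend past it.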
Syntax
The supported syntax is pretty much the same as the syntax for Unicode regular expressions with a few changes that make sense for matching arbitrary bytes:
- The u flag is disabled by default, but can be selectively enabled. (The opposite is true for the main Regex type.) Disabling the u flag is said to invoke "ASCII compatible" mode.
- In ASCII compatible mode, neither Unicode codepoints nor Unicode character classes are allowed.
- In ASCII compatible mode, Perl character classes (\w, \d and \s) revert to their typical ASCII definition. \w maps to [[:word:]], \d maps to [[:digit:]] and \s maps to [[:space:]].
- In ASCII compatible mode, word boundaries use the ASCII compatible \w to determine whether a byte is a word byte or not.
- Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is a Unicode codepoint that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation.
- . matches any byte except for \n instead of any codepoint. When the s flag is enabled, . matches any byte.
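The ASCII compatible definitions listed above are small enough to spell out by hand. The following sketch uses only the standard library and is not how the crate implements matching. All function names are made up for illustration, and the whitespace set assumes the usual [[:space:]] members (space, \t, \n, \v, \f, \r).

```rust
/// ASCII `\w`, i.e. `[[:word:]]`: `[0-9A-Za-z_]`.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// ASCII `\d`, i.e. `[[:digit:]]`: `[0-9]`.
fn is_digit_byte(b: u8) -> bool {
    b.is_ascii_digit()
}

/// ASCII `\s`, i.e. `[[:space:]]`: space, \t, \n, \v, \f, \r.
/// Written out explicitly because `u8::is_ascii_whitespace` omits \v.
fn is_space_byte(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | 0x0B | 0x0C | b'\r')
}

/// ASCII word boundary at offset `at`: true when exactly one of the bytes
/// on either side of the offset is a word byte. Offsets before the start
/// or past the end count as non-word.
fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && haystack.get(at - 1).map_or(false, |&b| is_word_byte(b));
    let after = haystack.get(at).map_or(false, |&b| is_word_byte(b));
    before != after
}

/// The bytes that `\xFF` stands for in Unicode mode: the UTF-8 encoding of
/// the codepoint U+00FF.
fn utf8_of_u00ff() -> Vec<u8> {
    '\u{FF}'.to_string().into_bytes()
}
```

Note that the byte \xFF itself is not a word byte under these definitions, and that utf8_of_u00ff returns the two bytes \xC3\xBF mentioned in the hexadecimal notation item.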
Performance
In general, one should expect performance on &[u8] to be roughly similar to
performance on &str.
Structs
| CaptureNames | An iterator over the names of all possible captures. | 
| Captures | Captures represents a group of captured byte strings for a single match. | 
| FindCaptures | An iterator that yields all non-overlapping capture groups matching a particular regular expression. | 
| FindMatches | An iterator over all non-overlapping matches for a particular string. | 
| NoExpand | NoExpand indicates literal byte string replacement. | 
| Regex | A compiled regular expression for matching arbitrary bytes. | 
| RegexBuilder | A configurable builder for a regular expression. | 
| RegexSet | Match multiple (possibly overlapping) regular expressions in a single scan. | 
| SetMatches | A set of matches returned by a regex set. | 
| SetMatchesIntoIter | An owned iterator over the set of matches from a regex set. | 
| SetMatchesIter | A borrowed iterator over the set of matches from a regex set. | 
| Splits | Yields all substrings delimited by a regular expression match. | 
| SplitsN | Yields at most N substrings delimited by a regular expression match. | 
| SubCaptures | An iterator over capture groups for a particular match of a regular expression. | 
| SubCapturesNamed | An iterator over named capture groups as a tuple with the group name and the value. | 
| SubCapturesPos | An iterator over capture group positions for a particular match of a regular expression. | 
Traits
| Replacer | Replacer describes types that can be used to replace matches in a byte string. |