| 1 | +# DSL to express constraints |
| 2 | + |
| 3 | +This library provides a Domain-Specific Language (DSL) to construct regular expressions in a more intuitive and modular way. It allows you to create complex regexes using simple building blocks that represent literal strings, patterns, and various quantifiers. Additionally, these custom regex types can be used directly as types in [Pydantic](https://pydantic-docs.helpmanual.io/) schemas to enforce pattern constraints during text generation. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Why Use This DSL? |
| 8 | + |
| 9 | +1. **Modularity & Readability**: Instead of writing cryptic regular expression strings, you compose a regex as a tree of objects. |
| 10 | +2. **Enhanced Debugging**: Each expression can be visualized as an ASCII tree, making it easier to understand and debug complex regexes. |
| 11 | +3. **Pydantic Integration**: Use your DSL-defined regex as types in Pydantic models. The DSL seamlessly converts to JSON Schema with proper pattern constraints. |
| 12 | +4. **Extensibility**: Easily add or modify quantifiers and other regex components by extending the provided classes. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Building Blocks |
| 17 | + |
| 18 | + |
| 19 | +Every regex component in this DSL is a **Term**. The two primary term types are: |
| 20 | + |
| 21 | +- **`String`**: Represents a literal string. It escapes the characters that have a special meaning in regular expressions. |
| 22 | +- **`Regex`**: Represents an existing regex pattern string. |
| 23 | + |
| 24 | +```python |
| 25 | +from outlines.types import String, Regex |
| 26 | + |
| 27 | +# A literal string "hello" |
| 28 | +literal = String("hello") # Internally represents "hello" |
| 29 | + |
| 30 | +# A regex pattern to match one or more digits |
| 31 | +digit = Regex(r"[0-9]+") # Internally represents the pattern [0-9]+ |
| 32 | + |
| 33 | +# Converting to standard regex strings: |
| 34 | +from outlines.types.dsl import to_regex |
| 35 | + |
| 36 | +print(to_regex(literal)) # Output: hello |
| 37 | +print(to_regex(digit)) # Output: [0-9]+ |
| 38 | +``` |
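To see why `String` escapes special characters, here is a minimal sketch using only Python's standard `re` module. It assumes that `String`'s escaping behaves like `re.escape`, which this document does not state explicitly.

```python
import re

unescaped = "3.14"               # "." is a wildcard when used as a pattern
escaped = re.escape(unescaped)   # backslash-escapes the "."

print(escaped)                                        # 3\.14
print(re.fullmatch(unescaped, "3x14") is not None)    # True: "." matched "x"
print(re.fullmatch(escaped, "3x14") is not None)      # False
print(re.fullmatch(escaped, "3.14") is not None)      # True
```

Without escaping, the literal would silently accept text it was never meant to match.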
| 39 | + |
| 40 | +--- |
| 41 | + |
| 42 | +## Early Introduction to Quantifiers & Operators |
| 43 | + |
| 44 | +The DSL supports common regex quantifiers as methods on every `Term`. These methods allow you to specify how many times a pattern should be matched. They include: |
| 45 | + |
| 46 | +- **`times(count)`**: Matches the term exactly `count` times. |
| 47 | +- **`optional()`**: Matches the term zero or one time. |
| 48 | +- **`one_or_more()`**: Matches the term one or more times (Kleene Plus). |
| 49 | +- **`zero_or_more()`**: Matches the term zero or more times (Kleene Star). |
| 50 | +- **`repeat(min_count, max_count)`**: Matches the term between `min_count` and `max_count` times. Pass `None` for either bound to leave that side open-ended. |
| 51 | + |
| 52 | +Let’s see these quantifiers side by side with examples. |
| 53 | + |
| 54 | +### Quantifiers in Action |
| 55 | + |
| 56 | +#### `times(count)` |
| 57 | + |
| 58 | +This method restricts the term to appear exactly `count` times. |
| 59 | + |
| 60 | +```python |
| 61 | +# Example: exactly 5 digits |
| 62 | +five_digits = Regex(r"\d").times(5) |
| 63 | +print(to_regex(five_digits)) # Output: (\d){5} |
| 64 | +``` |
| 65 | + |
| 66 | +You can also use the `times` function: |
| 67 | + |
| 68 | +```python |
| 69 | +from outlines.types import times |
| 70 | + |
| 71 | +# Example: exactly 5 digits |
| 72 | +five_digits = times(Regex(r"\d"), 5) |
| 73 | +print(to_regex(five_digits)) # Output: (\d){5} |
| 74 | +``` |
| 75 | + |
| 76 | +#### `optional()` |
| 77 | + |
| 78 | +The `optional()` method makes a term optional, meaning it may occur zero or one time. |
| 79 | + |
| 80 | +```python |
| 81 | +# Example: an optional "s" at the end of a word |
| 82 | +maybe_s = String("s").optional() |
| 83 | +print(to_regex(maybe_s)) # Output: (s)? |
| 84 | +``` |
| 85 | + |
| 86 | +You can also use the `optional` function. A plain string argument is automatically converted to a `String` object: |
| 87 | + |
| 88 | +```python |
| 89 | +from outlines.types import optional |
| 90 | + |
| 91 | +# Example: an optional "s" at the end of a word |
| 92 | +maybe_s = optional("s") |
| 93 | +print(to_regex(maybe_s)) # Output: (s)? |
| 94 | +``` |
| 95 | + |
| 96 | +#### `one_or_more()` |
| 97 | + |
| 98 | +This method indicates that the term must appear at least once. |
| 99 | + |
| 100 | +```python |
| 101 | +# Example: one or more alphabetic characters |
| 102 | +letters = Regex(r"[A-Za-z]").one_or_more() |
| 103 | +print(to_regex(letters)) # Output: ([A-Za-z])+ |
| 104 | +``` |
| 105 | + |
| 106 | +You can also use the `one_or_more` function: |
| 107 | + |
| 108 | +```python |
| 109 | +from outlines.types import one_or_more |
| 110 | + |
| 111 | +# Example: one or more alphabetic characters |
| 112 | +letters = one_or_more(Regex(r"[A-Za-z]")) |
| 113 | +print(to_regex(letters)) # Output: ([A-Za-z])+ |
| 114 | + |
| 115 | +``` |
| 116 | + |
| 117 | +#### `zero_or_more()` |
| 118 | + |
| 119 | +This method means that the term can occur zero or more times. |
| 120 | + |
| 121 | +```python |
| 122 | +# Example: zero or more spaces |
| 123 | +spaces = String(" ").zero_or_more() |
| 124 | +print(to_regex(spaces)) # Output: (\ )* |
| 125 | +``` |
| 126 | + |
| 127 | +You can also use the `zero_or_more` function. A plain string argument is automatically converted to a `String` instance: |
| 128 | + |
| 129 | +```python |
| 130 | +from outlines.types import zero_or_more |
| 131 | + |
| 132 | +# Example: zero or more spaces |
| 133 | +spaces = zero_or_more(" ") |
| 134 | +print(to_regex(spaces)) # Output: (\ )* |
| 135 | +``` |
| 136 | + |
| 137 | +#### `repeat(min_count, max_count)` |
| 138 | + |
| 139 | +The `repeat` method provides flexibility to set a lower and/or upper bound on the number of occurrences. |
| 140 | + |
| 141 | +```python |
| 142 | +# Example: Between 2 and 4 word characters |
| 143 | +word_chars = Regex(r"\w").repeat(2, 4) |
| 144 | +print(to_regex(word_chars)) # Output: (\w){2,4} |
| 145 | + |
| 146 | +# Example: At least 3 digits (min specified, max left open) |
| 147 | +at_least_three = Regex(r"\d").repeat(3, None) |
| 148 | +print(to_regex(at_least_three)) # Output: (\d){3,} |
| 149 | + |
| 150 | +# Example: Up to 2 punctuation marks (max specified, min omitted) |
| 151 | +up_to_two = Regex(r"[,.]").repeat(None, 2) |
| 152 | +print(to_regex(up_to_two)) # Output: ([,.]){,2} |
| 153 | +``` |
| 154 | + |
| 155 | +You can also use the `repeat` function: |
| 156 | + |
| 157 | +```python |
| 158 | +from outlines.types import repeat |
| 159 | + |
| 160 | +# Example: Between 2 and 4 word characters |
| 161 | +word_chars = repeat(Regex(r"\w"), 2, 4) |
| 162 | +print(to_regex(word_chars)) # Output: (\w){2,4} |
| 163 | + |
| 164 | +# Example: At least 3 digits (min specified, max left open) |
| 165 | +at_least_three = repeat(Regex(r"\d"), 3, None) |
| 166 | +print(to_regex(at_least_three)) # Output: (\d){3,} |
| 167 | + |
| 168 | +# Example: Up to 2 punctuation marks (max specified, min omitted) |
| 169 | +up_to_two = repeat(Regex(r"[,.]"), None, 2) |
| 170 | +print(to_regex(up_to_two)) # Output: ([,.]){,2} |
| 171 | +``` |
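As a quick sanity check, the regex strings shown in the `# Output:` comments above can be exercised with Python's standard `re` module. This sketch copies those strings by hand rather than calling the DSL; the `patterns` dictionary and `matches` helper are illustrative names, not part of the library.

```python
import re

# Patterns copied from the "Output:" comments in the quantifier examples.
patterns = {
    "exactly_five_digits": r"(\d){5}",
    "two_to_four_word_chars": r"(\w){2,4}",
    "at_least_three_digits": r"(\d){3,}",
    "up_to_two_punctuation": r"([,.]){,2}",
}

def matches(name: str, text: str) -> bool:
    """Return True if the whole of `text` matches the named pattern."""
    return re.fullmatch(patterns[name], text) is not None

print(matches("exactly_five_digits", "12345"))      # True
print(matches("exactly_five_digits", "1234"))       # False
print(matches("two_to_four_word_chars", "abcde"))   # False
print(matches("at_least_three_digits", "1234567"))  # True
print(matches("up_to_two_punctuation", ""))         # True
print(matches("up_to_two_punctuation", ",.,"))      # False
```

In Python's `re`, an omitted lower bound such as `{,2}` means zero, which is why the empty string matches the last pattern.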
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## Combining Terms |
| 176 | + |
| 177 | +The DSL allows you to combine basic terms into more complex patterns using concatenation and alternation. |
| 178 | + |
| 179 | +### Concatenation (`+`) |
| 180 | + |
| 181 | +The `+` operator (and its reflected variant) concatenates terms, meaning that the terms are matched in sequence. |
| 182 | + |
| 183 | +```python |
| 184 | +# Example: Match "hello", a space, then one or more word characters |
| 185 | +pattern = String("hello") + " " + Regex(r"\w+") |
| 186 | +print(to_regex(pattern)) # Output: hello\ (\w+) |
| 187 | +``` |
| 188 | + |
| 189 | +### Alternation (`|`) |
| 190 | + |
| 191 | +The `|` operator creates alternatives, allowing a match for one of several patterns. |
| 192 | + |
| 193 | +```python |
| 194 | +# Example: Match either "cat" or "dog" |
| 195 | +animal = String("cat") | "dog" |
| 196 | +print(to_regex(animal)) # Output: (cat|dog) |
| 197 | +``` |
| 198 | + |
| 199 | +*Note:* When using operators with plain strings (such as `"dog"`), the DSL automatically wraps them in a `String` object and escapes the characters that have a special meaning in regular expressions. |
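The parentheses in the alternation output `(cat|dog)` matter once the alternation is combined with other terms. The following sketch uses plain `re` patterns written by hand (not DSL output) to show the difference between a grouped and an ungrouped alternation.

```python
import re

grouped = r"(cat|dog)s"   # alternation confined to the group, then a literal "s"
ungrouped = r"cat|dogs"   # alternation spans the whole pattern: "cat" or "dogs"

print(re.fullmatch(grouped, "cats") is not None)    # True
print(re.fullmatch(grouped, "dogs") is not None)    # True
print(re.fullmatch(ungrouped, "cats") is not None)  # False
print(re.fullmatch(ungrouped, "cat") is not None)   # True
```

Grouping each alternation keeps it from absorbing neighbouring terms when patterns are concatenated.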
| 200 | + |
| 201 | +--- |
| 202 | + |
| 203 | +## Custom types |
| 204 | + |
| 205 | +The DSL comes "batteries included" with types that represent common text constructs: |
| 206 | + |
| 207 | +- `integer` represents an integer number as recognized by `int` |
| 208 | +- `boolean` represents a boolean, "True" or "False" as recognized by `bool` |
| 209 | +- `number` represents a floating-point number as recognized by Python's `float` |
| 210 | +- `date` represents a date as understood by `datetime.date` |
| 211 | +- `time` represents a time as understood by `datetime.time` |
| 212 | +- `datetime` represents a date and time as understood by `datetime.datetime` |
| 213 | +- `digit` represents a single digit |
| 214 | +- `char` represents a single character |
| 215 | +- `newline` represents a new line character |
| 216 | +- `whitespace` represents a white space |
| 217 | +- `sentence` represents a sentence |
| 218 | +- `paragraph` represents a paragraph (one or more sentences separated by one or more line breaks) |
| 219 | + |
| 220 | + |
| 221 | +For instance, you can describe the answers in the GSM8K dataset using the following pattern: |
| 222 | + |
| 223 | +```python |
| 224 | +from outlines.types import sentence, digit |
| 225 | + |
| 226 | +answer = "A: " + sentence.repeat(2,4) + " So the answer is: " + digit.repeat(1,4) |
| 227 | +``` |
| 228 | + |
| 229 | +--- |
| 230 | + |
| 231 | +## Practical Examples |
| 232 | + |
| 233 | +### Example 1: Matching a Custom ID Format |
| 234 | + |
| 235 | +Suppose you want to create a regex that matches an ID format like "ID-12345", where: |
| 236 | +- The literal "ID-" must be at the start. |
| 237 | +- Followed by exactly 5 digits. |
| 238 | + |
| 239 | +```python |
| 240 | +id_pattern = "ID-" + Regex(r"\d").times(5) |
| 241 | +print(to_regex(id_pattern)) # Output: ID\-(\d){5} |
| 242 | +``` |
| 243 | + |
| 244 | +### Example 2: Email Validation with Pydantic |
| 245 | + |
| 246 | +You can define a regex for email validation and use it as a type in a Pydantic model. |
| 247 | + |
| 248 | +```python |
| 249 | +from pydantic import BaseModel, ValidationError |
| 250 | + |
| 251 | +# Define an email regex term (this is a simplified version) |
| 252 | +email_regex = Regex(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+") |
| 253 | + |
| 254 | +class User(BaseModel): |
| 255 | +    name: str |
| 256 | +    email: email_regex  # Use our DSL regex as a field type |
| 257 | + |
| 258 | +# Valid input |
| 259 | +user = User(name="Alice", email="alice@example.com") |
| 260 | +print(user) |
| 261 | + |
| 262 | +# Invalid input (raises a ValidationError) |
| 263 | +try: |
| 264 | +    User(name="Bob", email="not-an-email") |
| 265 | +except ValidationError as e: |
| 266 | +    print(e) |
| 267 | +``` |
| 268 | + |
| 269 | +When used in a Pydantic model, the email field is automatically validated against the regex pattern and its JSON Schema includes the `pattern` constraint. |
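To make the `pattern` constraint concrete, here is a rough sketch of what such a JSON Schema fragment could look like and how a value is checked against it. The `schema` dictionary is written by hand for illustration; the exact schema the library emits is an assumption and may differ. `email_is_valid` is a hypothetical helper.

```python
import json
import re

email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# Hand-written, hypothetical schema fragment with a `pattern` constraint.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string", "pattern": email_pattern},
    },
    "required": ["name", "email"],
}

def email_is_valid(value: str) -> bool:
    """Check a value against the schema's email pattern, anchored at both ends."""
    pattern = schema["properties"]["email"]["pattern"]
    return re.fullmatch(pattern, value) is not None

print(json.dumps(schema["properties"]["email"]))
print(email_is_valid("alice@example.com"))  # True
print(email_is_valid("not-an-email"))       # False
```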
| 270 | + |
| 271 | +### Example 3: Building a Complex Pattern |
| 272 | + |
| 273 | +Consider a pattern to match a simple date format: `YYYY-MM-DD`. |
| 274 | + |
| 275 | +```python |
| 276 | +year = Regex(r"\d").times(4) # Four digits for the year |
| 277 | +month = Regex(r"\d").times(2) # Two digits for the month |
| 278 | +day = Regex(r"\d").times(2) # Two digits for the day |
| 279 | + |
| 280 | +# Combine with literal hyphens |
| 281 | +date_pattern = year + "-" + month + "-" + day |
| 282 | +print(to_regex(date_pattern)) |
| 283 | +# Output: (\d){4}\-(\d){2}\-(\d){2} |
| 284 | +``` |
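Note that this pattern only constrains the shape of the text, not whether the date exists. The sketch below, which uses the regex string from the output comment together with the standard `datetime` module, shows one way to add that second check; `is_real_date` is a hypothetical helper, not part of the DSL.

```python
import re
from datetime import date

# Regex string copied from the output comment above.
DATE_SHAPE = r"(\d){4}\-(\d){2}\-(\d){2}"

def is_real_date(text: str) -> bool:
    """Return True if `text` has the YYYY-MM-DD shape and names a real calendar date."""
    if re.fullmatch(DATE_SHAPE, text) is None:
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True

print(re.fullmatch(DATE_SHAPE, "2024-02-30") is not None)  # True: the shape is fine
print(is_real_date("2024-02-30"))                          # False: February has no 30th
print(is_real_date("2024-02-29"))                          # True: 2024 is a leap year
```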
| 285 | + |
| 286 | +--- |
| 287 | + |
| 288 | +## Visualizing Your Pattern |
| 289 | + |
| 290 | +One of the unique features of this DSL is that each term can print its underlying structure as an ASCII tree. This visualization can be particularly helpful when dealing with complex expressions. |
| 291 | + |
| 292 | +```python |
| 293 | +# A composite pattern using concatenation and quantifiers |
| 294 | +pattern = "a" + String("b").one_or_more() + "c" |
| 295 | +print(pattern) |
| 296 | +``` |
| 297 | + |
| 298 | +*Expected Output:* |
| 299 | + |
| 300 | +``` |
| 301 | +└── Sequence |
| 302 | +    ├── String('a') |
| 303 | +    ├── KleenePlus(+) |
| 304 | +    │   └── String('b') |
| 305 | +    └── String('c') |
| 306 | +``` |
| 307 | + |
| 308 | +This tree representation makes it easy to see the hierarchy and order of operations in your regular expression. |
| 309 | + |
| 310 | +--- |
| 311 | + |
| 312 | +## Final Words |
| 313 | + |
| 314 | +This DSL is designed to simplify the creation and management of regular expressions—whether you're validating inputs in a web API, constraining the output of an LLM, or just experimenting with regex patterns. With intuitive methods for common quantifiers and operators, clear visual feedback, and built-in integration with Pydantic, you can build robust and maintainable regex-based validations with ease. |
| 315 | + |
| 316 | +Feel free to explore the library further and adapt the examples to your use cases. Happy regexing! |