Commit da1b0db

Add regex DSL and re-organize custom types
1 parent d8ccb84 commit da1b0db

22 files changed: 1233 additions and 228 deletions

docs/reference/functions.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/reference/regex_dsl.md

Lines changed: 316 additions & 0 deletions
@@ -0,0 +1,316 @@
1+
# DSL to express constraints
2+
3+
This library provides a Domain-Specific Language (DSL) to construct regular expressions in a more intuitive and modular way. It allows you to create complex regexes using simple building blocks that represent literal strings, patterns, and various quantifiers. Additionally, these custom regex types can be used directly as types in [Pydantic](https://pydantic-docs.helpmanual.io/) schemas to enforce pattern constraints during text generation.
4+
5+
---
6+
7+
## Why Use This DSL?
8+
9+
1. **Modularity & Readability**: Instead of writing cryptic regular expression strings, you compose a regex as a tree of objects.
10+
2. **Enhanced Debugging**: Each expression can be visualized as an ASCII tree, making it easier to understand and debug complex regexes.
11+
3. **Pydantic Integration**: Use your DSL-defined regex as types in Pydantic models. The DSL seamlessly converts to JSON Schema with proper pattern constraints.
12+
4. **Extensibility**: Easily add or modify quantifiers and other regex components by extending the provided classes.
13+
14+
---
15+
16+
## Building Blocks
17+
18+
19+
Every regex component in this DSL is a **Term**. Here are two primary types:
20+
21+
- **`String`**: Represents a literal string. It escapes the characters that have a special meaning in regular expressions.
22+
- **`Regex`**: Represents an existing regex pattern string.
23+
24+
```python
from outlines.types import String, Regex

# A literal string "hello"
literal = String("hello")   # Internally represents "hello"

# A regex pattern to match one or more digits
digit = Regex(r"[0-9]+")   # Internally represents the pattern [0-9]+

# Converting to standard regex strings:
from outlines.types.dsl import to_regex

print(to_regex(literal))  # Output: hello
print(to_regex(digit))    # Output: [0-9]+
```
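
To make the effect of escaping concrete, here is a small sketch that uses only Python's standard `re` module. It assumes that `String` escapes text in a way comparable to `re.escape`; that is an assumption made for illustration, not something this page confirms.

```python
import re

text = "1+1=2?"

# Assumption: String-style escaping behaves like re.escape.
escaped = re.escape(text)

# Once escaped, the pattern matches only the literal text.
print(re.fullmatch(escaped, "1+1=2?") is not None)  # True
print(re.fullmatch(escaped, "11=2") is not None)    # False

# Unescaped, "+" and "?" act as quantifiers, so the meaning changes.
print(re.fullmatch(text, "1+1=2?") is not None)     # False
print(re.fullmatch(text, "11=2") is not None)       # True
```

This is the difference between wrapping the same text in a literal term and in a raw pattern term.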
39+
40+
---
41+
42+
## Introduction to Quantifiers
43+
44+
The DSL supports common regex quantifiers as methods on every `Term`. These methods allow you to specify how many times a pattern should be matched. They include:
45+
46+
- **`times(count)`**: Matches the term exactly `count` times.
47+
- **`optional()`**: Matches the term zero or one time.
48+
- **`one_or_more()`**: Matches the term one or more times (Kleene Plus).
49+
- **`zero_or_more()`**: Matches the term zero or more times (Kleene Star).
50+
- **`repeat(min_count, max_count)`**: Matches the term between `min_count` and `max_count` times. Pass `None` for either bound to leave that side open-ended.
51+
52+
Let’s see these quantifiers side by side with examples.
53+
54+
### Quantifiers in Action
55+
56+
#### `times(count)`
57+
58+
This method restricts the term to appear exactly `count` times.
59+
60+
```python
# Example: exactly 5 digits
five_digits = Regex(r"\d").times(5)
print(to_regex(five_digits))  # Output: (\d){5}
```
65+
66+
You can also use the `times` function:
67+
68+
```python
from outlines.types import times

# Example: exactly 5 digits
five_digits = times(Regex(r"\d"), 5)
print(to_regex(five_digits))  # Output: (\d){5}
```
75+
76+
#### `optional()`
77+
78+
The `optional()` method makes a term optional, meaning it may occur zero or one time.
79+
80+
```python
# Example: an optional "s" at the end of a word
maybe_s = String("s").optional()
print(to_regex(maybe_s))  # Output: (s)?
```
85+
86+
You can also use the `optional` function. A plain string argument is automatically converted to a `String` object:
87+
88+
```python
from outlines.types import optional

# Example: an optional "s" at the end of a word
maybe_s = optional("s")
print(to_regex(maybe_s))  # Output: (s)?
```
95+
96+
#### `one_or_more()`
97+
98+
This method indicates that the term must appear at least once.
99+
100+
```python
# Example: one or more alphabetic characters
letters = Regex(r"[A-Za-z]").one_or_more()
print(to_regex(letters))  # Output: ([A-Za-z])+
```
105+
106+
You can also use the `one_or_more` function:
107+
108+
```python
from outlines.types import one_or_more

# Example: one or more alphabetic characters
letters = one_or_more(Regex(r"[A-Za-z])"[:-1] + ""))
```
116+
117+
#### `zero_or_more()`
118+
119+
This method means that the term can occur zero or more times.
120+
121+
```python
# Example: zero or more spaces
spaces = String(" ").zero_or_more()
print(to_regex(spaces))  # Output: (\ )*
```
126+
127+
You can also use the `zero_or_more` function. A plain string argument is automatically converted to a `String` instance:
128+
129+
```python
from outlines.types import zero_or_more

# Example: zero or more spaces
spaces = zero_or_more(" ")
print(to_regex(spaces))  # Output: (\ )*
```
136+
137+
#### `repeat(min_count, max_count)`
138+
139+
The `repeat` method provides flexibility to set a lower and/or upper bound on the number of occurrences.
140+
141+
```python
# Example: Between 2 and 4 word characters
word_chars = Regex(r"\w").repeat(2, 4)
print(to_regex(word_chars))  # Output: (\w){2,4}

# Example: At least 3 digits (min specified, max left open)
at_least_three = Regex(r"\d").repeat(3, None)
print(to_regex(at_least_three))  # Output: (\d){3,}

# Example: Up to 2 punctuation marks (max specified, min omitted)
up_to_two = Regex(r"[,.]").repeat(None, 2)
print(to_regex(up_to_two))  # Output: ([,.]){,2}
```
154+
155+
You can also use the `repeat` function:
156+
157+
```python
from outlines.types import repeat

# Example: Between 2 and 4 word characters
word_chars = repeat(Regex(r"\w"), 2, 4)
print(to_regex(word_chars))  # Output: (\w){2,4}

# Example: At least 3 digits (min specified, max left open)
at_least_three = repeat(Regex(r"\d"), 3, None)
print(to_regex(at_least_three))  # Output: (\d){3,}

# Example: Up to 2 punctuation marks (max specified, min omitted)
up_to_two = repeat(Regex(r"[,.]"), None, 2)
print(to_regex(up_to_two))  # Output: ([,.]){,2}
```
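
As a quick sanity check, the regex strings shown in the quantifier examples above can be exercised with Python's standard `re` module. The sketch below hard-codes those strings rather than calling the DSL, and the `matches` helper is a hypothetical convenience, not part of the library.

```python
import re

# Regex strings copied from the "Output" comments above.
exactly_five = r"(\d){5}"
two_to_four = r"(\w){2,4}"
at_least_three = r"(\d){3,}"
up_to_two = r"([,.]){,2}"


def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(exactly_five, "12345"))      # True
print(matches(exactly_five, "1234"))       # False
print(matches(two_to_four, "abcde"))       # False (five characters)
print(matches(at_least_three, "1234567"))  # True
print(matches(up_to_two, ""))              # True (lower bound is zero)
print(matches(up_to_two, ",.,"))           # False (three marks)
```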
172+
173+
---
174+
175+
## Combining Terms
176+
177+
The DSL allows you to combine basic terms into more complex patterns using concatenation and alternation.
178+
179+
### Concatenation (`+`)
180+
181+
The `+` operator (and its reflected variant) concatenates terms, meaning that the terms are matched in sequence.
182+
183+
```python
# Example: match "hello " followed by one or more word characters
pattern = String("hello") + " " + Regex(r"\w+")
print(to_regex(pattern))  # Output: hello\ (\w+)
```
188+
189+
### Alternation (`|`)
190+
191+
The `|` operator creates alternatives, allowing a match for one of several patterns.
192+
193+
```python
# Example: Match either "cat" or "dog"
animal = String("cat") | "dog"
print(to_regex(animal))  # Output: (cat|dog)
```
198+
199+
*Note:* When using operators with plain strings (such as `"dog"`), the DSL automatically wraps them in a `String` object and escapes the characters that have a special meaning in regular expressions.
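
For readers curious how operators like `+` and `|` can accept plain strings on either side, here is a minimal sketch built on Python's operator-overloading hooks. Every class and helper in it (`Term`, `Literal`, `Pattern`, `Sequence`, `Alternatives`, `_wrap`) is hypothetical and written for illustration only; it is not the library's implementation.

```python
import re
from dataclasses import dataclass


class Term:
    """Hypothetical base class; not the library's implementation."""

    def __add__(self, other):
        return Sequence([self, _wrap(other)])

    def __radd__(self, other):
        # Reflected variant: called for `"text" + term`.
        return Sequence([_wrap(other), self])

    def __or__(self, other):
        return Alternatives([self, _wrap(other)])

    def __ror__(self, other):
        # Reflected variant: called for `"text" | term`.
        return Alternatives([_wrap(other), self])


@dataclass
class Literal(Term):
    value: str

    def to_regex(self):
        return re.escape(self.value)  # escape special characters


@dataclass
class Pattern(Term):
    pattern: str

    def to_regex(self):
        return f"({self.pattern})"


@dataclass
class Sequence(Term):
    terms: list

    def to_regex(self):
        return "".join(term.to_regex() for term in self.terms)


@dataclass
class Alternatives(Term):
    terms: list

    def to_regex(self):
        return "(" + "|".join(term.to_regex() for term in self.terms) + ")"


def _wrap(value):
    """Turn plain strings into literal terms; leave terms untouched."""
    return Literal(value) if isinstance(value, str) else value


greeting = Literal("hello") + " " + Pattern(r"\w+")
animal = "cat" | Literal("dog")

print(re.fullmatch(greeting.to_regex(), "hello world") is not None)  # True
print(re.fullmatch(animal.to_regex(), "dog") is not None)            # True
print(re.fullmatch(animal.to_regex(), "bird") is not None)           # False
```

The reflected methods (`__radd__`, `__ror__`) are what let a plain string appear on the left-hand side of an operator.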
200+
201+
---
202+
203+
## Custom types
204+
205+
The DSL comes "batteries included" with types that represent common text constructs:
206+
207+
- `integer` represents an integer number as recognized by `int`
208+
- `boolean` represents a boolean, "True" or "False" as recognized by `bool`
209+
- `number` represents a floating-point number as recognized by Python's `float`
210+
- `date` represents a date as understood by `datetime.date`
211+
- `time` represents a time as understood by `datetime.time`
212+
- `datetime` represents a date and time as understood by `datetime.datetime`
213+
- `digit` represents a single digit
214+
- `char` represents a single character
215+
- `newline` represents a new line character
216+
- `whitespace` represents a white space
217+
- `sentence` represents a sentence
218+
- `paragraph` represents a paragraph (one or more sentences separated by one or more line breaks)
219+
220+
221+
For instance, you can describe the answers in the GSM8K dataset using the following pattern:
222+
223+
```python
from outlines.types import sentence, digit

answer = "A: " + sentence.repeat(2, 4) + " So the answer is: " + digit.repeat(1, 4)
```
228+
229+
---
230+
231+
## Practical Examples
232+
233+
### Example 1: Matching a Custom ID Format
234+
235+
Suppose you want to create a regex that matches an ID format like "ID-12345", where:
236+
- The literal "ID-" must be at the start.
237+
- Followed by exactly 5 digits.
238+
239+
```python
id_pattern = "ID-" + Regex(r"\d").times(5)
print(to_regex(id_pattern))  # Output: ID\-(\d){5}
```
243+
244+
### Example 2: Email Validation with Pydantic
245+
246+
You can define a regex for email validation and use it as a type in a Pydantic model.
247+
248+
```python
from pydantic import BaseModel, ValidationError
from outlines.types import Regex

# Define an email regex term (this is a simplified version)
email_regex = Regex(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

class User(BaseModel):
    name: str
    email: email_regex  # Use our DSL regex as a field type

# Valid input
user = User(name="Alice", email="alice@example.com")
print(user)

# Invalid input (raises a ValidationError)
try:
    User(name="Bob", email="not-an-email")
except ValidationError as e:
    print(e)
```
268+
269+
When used in a Pydantic model, the email field is automatically validated against the regex pattern and its JSON Schema includes the `pattern` constraint.
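
To see which inputs the simplified email pattern accepts, independently of Pydantic, the sketch below checks it with Python's standard `re` module. It assumes the field is validated as a full match of the pattern, which this page does not spell out; `is_valid_email` is a hypothetical helper.

```python
import re

# Same pattern string as in the example above.
EMAIL_PATTERN = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"


def is_valid_email(value):
    """Assumption: validation means the whole value matches the pattern."""
    return re.fullmatch(EMAIL_PATTERN, value) is not None


print(is_valid_email("alice@example.com"))  # True
print(is_valid_email("not-an-email"))       # False (no "@")
print(is_valid_email("alice@localhost"))    # False (domain needs a dot)
```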
270+
271+
### Example 3: Building a Complex Pattern
272+
273+
Consider a pattern to match a simple date format: `YYYY-MM-DD`.
274+
275+
```python
year = Regex(r"\d").times(4)   # Four digits for the year
month = Regex(r"\d").times(2)  # Two digits for the month
day = Regex(r"\d").times(2)    # Two digits for the day

# Combine with literal hyphens
date_pattern = year + "-" + month + "-" + day
print(to_regex(date_pattern))
# Output: (\d){4}\-(\d){2}\-(\d){2}
```
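
Note that this pattern constrains only the shape of the text, not whether it is a real calendar date. The sketch below illustrates the difference using the regex string from the output above and Python's standard library; `is_calendar_date` is a hypothetical helper added for illustration.

```python
import re
from datetime import date

# Regex string copied from the output above.
DATE_PATTERN = r"(\d){4}\-(\d){2}\-(\d){2}"


def is_calendar_date(text):
    """Check the YYYY-MM-DD shape first, then the calendar validity."""
    if re.fullmatch(DATE_PATTERN, text) is None:
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


print(re.fullmatch(DATE_PATTERN, "2024-02-30") is not None)  # True: shape is right
print(is_calendar_date("2024-02-30"))  # False: February has no 30th day
print(is_calendar_date("2024-02-29"))  # True: 2024 is a leap year
```

If calendar validity matters, a check of this kind can run after the pattern match.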
285+
286+
---
287+
288+
## Visualizing Your Pattern
289+
290+
One of the unique features of this DSL is that each term can print its underlying structure as an ASCII tree. This visualization can be particularly helpful when dealing with complex expressions.
291+
292+
```python
# A composite pattern using concatenation and quantifiers
pattern = "a" + String("b").one_or_more() + "c"
print(pattern)
```
297+
298+
*Expected Output:*
299+
300+
```
└── Sequence
    ├── String('a')
    ├── KleenePlus(+)
    │   └── String('b')
    └── String('c')
```
307+
308+
This tree representation makes it easy to see the hierarchy and order of operations in your regular expression.
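
For intuition about how such a tree can be drawn, here is a minimal sketch of a recursive renderer. It works on plain `(label, children)` tuples rather than on DSL terms, and the `render` function is hypothetical; the library's own rendering code may differ.

```python
def render(node, prefix="", is_last=True):
    """Return the lines of an ASCII tree for a (label, children) tuple."""
    label, children = node
    lines = [prefix + ("└── " if is_last else "├── ") + label]
    # Children of a last node are indented with spaces; otherwise a
    # vertical bar keeps the parent's branch visually connected.
    child_prefix = prefix + ("    " if is_last else "│   ")
    for index, child in enumerate(children):
        lines.extend(render(child, child_prefix, index == len(children) - 1))
    return lines


# Hand-built stand-in for the pattern "a" + String("b").one_or_more() + "c"
tree = ("Sequence", [
    ("String('a')", []),
    ("KleenePlus(+)", [("String('b')", [])]),
    ("String('c')", []),
])

print("\n".join(render(tree)))
```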
309+
310+
---
311+
312+
## Final Words
313+
314+
This DSL is designed to simplify the creation and management of regular expressions—whether you're validating inputs in a web API, constraining the output of an LLM, or just experimenting with regex patterns. With intuitive methods for common quantifiers and operators, clear visual feedback, and built-in integration with Pydantic, you can build robust and maintainable regex-based validations with ease.
315+
316+
Feel free to explore the library further and adapt the examples to your use cases. Happy regexing!

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ nav:
  - Generation:
  - Overview: reference/generation/generation.md
  - Chat templating: reference/chat_templating.md
+ - Regex DSL: reference/regex_dsl.md
  - Text: reference/text.md
  - Samplers: reference/samplers.md
  - Structured generation:
