145 changes: 145 additions & 0 deletions README.md
@@ -142,6 +142,151 @@ The library handles DNS resolution differently depending on the connection type:

**Important for Privacy**: When using `socks5h://` or other remote DNS proxies, your local DNS servers will not see any queries for the target domains, maintaining better privacy and anonymity.


### Metrics and Observability

`gowarc` provides a `StatsRegistry` interface that allows you to integrate your own metrics collection system (Prometheus, Datadog, etc.). The library tracks various metrics including data written, deduplication statistics, and proxy usage.

#### Using the StatsRegistry Interface

The `StatsRegistry` interface can be found in [`stats.go`](stats.go). To implement your own metrics collection:

```go
// Implement the StatsRegistry interface
type MyPrometheusRegistry struct {
	// Your Prometheus registry fields
}

func (r *MyPrometheusRegistry) RegisterCounter(name, help string, labelNames []string) warc.Counter {
	// Return a Counter that wraps your Prometheus counter
	// The Counter interface requires WithLabels() method for dimensional metrics
}

func (r *MyPrometheusRegistry) RegisterGauge(name, help string, labelNames []string) warc.Gauge {
	// Return a Gauge that wraps your Prometheus gauge
}

func (r *MyPrometheusRegistry) RegisterHistogram(name, help string, buckets []int64, labelNames []string) warc.Histogram {
	// Return a Histogram that wraps your Prometheus histogram
}

// Pass your registry to the HTTP client
clientSettings := warc.HTTPClientSettings{
	StatsRegistry: &MyPrometheusRegistry{},
	// ... other settings
}
```

#### Available Metrics

The library tracks the following metrics:

- **`total_data_written`**: Total bytes written to WARC files
- **`local_deduped_bytes_total`**: Bytes saved through local deduplication
- **`local_deduped_total`**: Number of records deduplicated locally
- **`doppelganger_deduped_bytes_total`**: Bytes saved through Doppelganger deduplication
- **`doppelganger_deduped_total`**: Number of records deduplicated via Doppelganger
- **`cdx_deduped_bytes_total`**: Bytes saved through CDX deduplication
- **`cdx_deduped_total`**: Number of records deduplicated via CDX
- **`proxy_requests_total`**: Total requests through each proxy (with `proxy` label)
- **`proxy_errors_total`**: Total errors for each proxy (with `proxy` label)
- **`proxy_last_used_nanoseconds`**: Last usage timestamp for each proxy (with `proxy` label)

#### Label Support

Metrics support Prometheus-style labels for dimensional data:

```go
// Register a counter with label dimensions
counter := registry.RegisterCounter("http_requests_total", "Total HTTP requests", []string{"method", "status"})

// Record metrics with specific label values
counter.WithLabels(warc.Labels{"method": "GET", "status": "200"}).Inc()
counter.WithLabels(warc.Labels{"method": "POST", "status": "201"}).Add(5)

// Each unique label combination creates a separate metric series
```

**Interface Details**: See the complete interface contract in [`stats.go`](stats.go) for full implementation requirements.

### Logging

`gowarc` provides a `LogBackend` interface that allows you to integrate your logging solution (slog, zap, logrus, etc.). The library logs key events including connection establishment, DNS resolution, proxy selection, TLS handshakes, WARC file operations, and errors.

#### Using the LogBackend Interface

The `LogBackend` interface can be found in [`logging.go`](logging.go). The interface matches `slog.Logger` method signatures for easy integration:

```go
// Example: Wrapping slog.Logger to implement LogBackend
type SlogAdapter struct {
	logger *slog.Logger
}

func (s *SlogAdapter) Debug(msg string, args ...any) {
	s.logger.Debug(msg, args...)
}

func (s *SlogAdapter) Info(msg string, args ...any) {
	s.logger.Info(msg, args...)
}

func (s *SlogAdapter) Warn(msg string, args ...any) {
	s.logger.Warn(msg, args...)
}

func (s *SlogAdapter) Error(msg string, args ...any) {
	s.logger.Error(msg, args...)
}

func (s *SlogAdapter) Log(ctx context.Context, level slog.Level, msg string, args ...any) {
	s.logger.Log(ctx, level, msg, args...)
}

// Configure with your logger
handler := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelInfo})
logger := slog.New(handler)

clientSettings := warc.HTTPClientSettings{
	LogBackend: &SlogAdapter{logger: logger},
	// ... other settings
}
```

#### Log Events

The library logs structured events with contextual key-value pairs:

**Connection & Network:**
- Proxy selection and connection status
- Direct connection establishment
- DNS resolution results and failures
- TLS handshake success and failures

**WARC Operations:**
- WARC record writing and file rotation
- Data written and compression events
- File creation and closure

**Errors:**
- Connection failures (proxy and direct)
- DNS resolution errors
- TLS handshake failures
- WARC write errors

#### Log Levels

- **Debug**: Verbose operational details (DNS lookups, successful connections, record writes)
- **Info**: Important state changes (file rotation, new WARC files)
- **Warn**: Recoverable issues and fallbacks
- **Error**: Failures and exceptions (connection errors, DNS failures, write errors)

Users control which log levels are recorded by configuring their logger implementation's level threshold.

**Note**: The `LogBackend` interface is intended to eventually replace the `ErrChan` error reporting mechanism. For now, both are maintained for backward compatibility.

**Interface Details**: See the complete interface contract in [`logging.go`](logging.go) for full implementation requirements.

## CLI Tools

In addition to the Go library, gowarc provides several command-line utilities for working with WARC files:
91 changes: 66 additions & 25 deletions client.go
@@ -4,7 +4,6 @@ import (
 	"net/http"
 	"os"
 	"sync"
-	"sync/atomic"
 	"time"
 )

@@ -13,9 +12,52 @@ type Error struct {
 	Func string
 }

+// ProxyNetwork defines the network layer (IPv4/IPv6) a proxy can support
+type ProxyNetwork int
+
+const (
+	// ProxyNetworkUnset is the zero value and must not be used - forces explicit selection
+	ProxyNetworkUnset ProxyNetwork = iota
+	// ProxyNetworkAny means the proxy can be used for both IPv4 and IPv6 connections
+	ProxyNetworkAny
+	// ProxyNetworkIPv4 means the proxy should only be used for IPv4 connections
+	ProxyNetworkIPv4
+	// ProxyNetworkIPv6 means the proxy should only be used for IPv6 connections
+	ProxyNetworkIPv6
+)
+
+// ProxyType defines the infrastructure type of a proxy
+type ProxyType int
+
+const (
+	// ProxyTypeAny means the proxy can be used for any type of request
+	ProxyTypeAny ProxyType = iota
+	// ProxyTypeMobile means the proxy uses mobile network infrastructure
+	ProxyTypeMobile
+	// ProxyTypeResidential means the proxy uses residential IP addresses
+	ProxyTypeResidential
+	// ProxyTypeDatacenter means the proxy uses datacenter infrastructure
+	ProxyTypeDatacenter
+)
+
+// ProxyConfig defines the configuration for a single proxy
+type ProxyConfig struct {
+	// URL is the proxy URL (e.g., "socks5://proxy.example.com:1080")
+	URL string
+	// Network specifies if this proxy supports IPv4, IPv6, or both
+	Network ProxyNetwork
+	// Type specifies the infrastructure type (Mobile, Residential, Datacenter, or Any)
+	Type ProxyType
+	// AllowedDomains is a list of glob patterns for domains this proxy should handle
+	// Examples: "*.example.com", "api.*.org"
+	// If empty, the proxy can be used for any domain
+	AllowedDomains []string
+}
+
 type HTTPClientSettings struct {
 	RotatorSettings *RotatorSettings
-	Proxy string
+	Proxies []ProxyConfig
+	AllowDirectFallback bool
 	TempDir string
 	DiscardHook DiscardHook
 	DNSServers []string
@@ -39,13 +81,14 @@ type HTTPClientSettings struct {
 	DisableIPv6 bool
 	IPv6AnyIP bool
 	DigestAlgorithm DigestAlgorithm
+	StatsRegistry StatsRegistry
+	LogBackend LogBackend
 }

 type CustomHTTPClient struct {
 	interfacesWatcherStop chan bool
 	WaitGroup *WaitGroupWithCount
 	dedupeHashTable *sync.Map
-	ErrChan chan *Error
 	WARCWriter chan *RecordBatch
 	interfacesWatcherStarted chan bool
 	http.Client
@@ -64,15 +107,9 @@ type CustomHTTPClient struct {
 	// If set to <= 0, the default value is DefaultMaxRAMUsageFraction.
 	MaxRAMUsageFraction float64
 	randomLocalIP bool
-	DataTotal *atomic.Int64

-	CDXDedupeTotalBytes *atomic.Int64
-	DoppelgangerDedupeTotalBytes *atomic.Int64
-	LocalDedupeTotalBytes *atomic.Int64
-
-	CDXDedupeTotal *atomic.Int64
-	DoppelgangerDedupeTotal *atomic.Int64
-	LocalDedupeTotal *atomic.Int64
+	statsRegistry StatsRegistry
+	logBackend LogBackend
 }

 func (c *CustomHTTPClient) Close() error {
@@ -91,7 +128,6 @@ func (c *CustomHTTPClient) Close() error {
 	}

 	wg.Wait()
-	close(c.ErrChan)

 	if c.randomLocalIP {
 		c.interfacesWatcherStop <- true
@@ -106,16 +142,24 @@ func (c *CustomHTTPClient) Close() error {
 func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error) {
 	httpClient = new(CustomHTTPClient)

-	// Initialize counters
-	httpClient.DataTotal = &DataTotal
-
-	httpClient.CDXDedupeTotalBytes = &CDXDedupeTotalBytes
-	httpClient.DoppelgangerDedupeTotalBytes = &DoppelgangerDedupeTotalBytes
-	httpClient.LocalDedupeTotalBytes = &LocalDedupeTotalBytes
+	// Initialize stats registry
+	if HTTPClientSettings.StatsRegistry != nil {
+		httpClient.statsRegistry = HTTPClientSettings.StatsRegistry
+		HTTPClientSettings.RotatorSettings.StatsRegistry = HTTPClientSettings.StatsRegistry
+	} else {
+		localStatsRegistry := newLocalRegistry()
+		httpClient.statsRegistry = localStatsRegistry
+		HTTPClientSettings.RotatorSettings.StatsRegistry = localStatsRegistry
+	}

-	httpClient.CDXDedupeTotal = &CDXDedupeTotal
-	httpClient.DoppelgangerDedupeTotal = &DoppelgangerDedupeTotal
-	httpClient.LocalDedupeTotal = &LocalDedupeTotal
+	// Initialize log backend
+	if HTTPClientSettings.LogBackend != nil {
+		httpClient.logBackend = HTTPClientSettings.LogBackend
+		HTTPClientSettings.RotatorSettings.LogBackend = HTTPClientSettings.LogBackend
+	} else {
+		httpClient.logBackend = &noopLogger{}
+		HTTPClientSettings.RotatorSettings.LogBackend = &noopLogger{}
+	}

 	// Configure random local IP
 	httpClient.randomLocalIP = HTTPClientSettings.RandomLocalIP
@@ -142,9 +186,6 @@ func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient
 	// Set a hook to determine if we should discard a response
 	httpClient.DiscardHook = HTTPClientSettings.DiscardHook

-	// Create an error channel for sending WARC errors through
-	httpClient.ErrChan = make(chan *Error)
-
 	// Toggle verification of certificates
 	// InsecureSkipVerify expects the opposite of the verifyCerts flag, as such we flip it.
 	httpClient.verifyCerts = !HTTPClientSettings.VerifyCerts
@@ -216,7 +257,7 @@ func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient
 	httpClient.ConnReadDeadline = HTTPClientSettings.ConnReadDeadline

 	// Configure custom dialer / transport
-	customDialer, err := newCustomDialer(httpClient, HTTPClientSettings.Proxy, HTTPClientSettings.DialTimeout, HTTPClientSettings.DNSRecordsTTL, HTTPClientSettings.DNSResolutionTimeout, HTTPClientSettings.DNSCacheSize, HTTPClientSettings.DNSServers, HTTPClientSettings.DNSConcurrency, HTTPClientSettings.DisableIPv4, HTTPClientSettings.DisableIPv6)
+	customDialer, err := newCustomDialer(httpClient, HTTPClientSettings.Proxies, HTTPClientSettings.AllowDirectFallback, HTTPClientSettings.DialTimeout, HTTPClientSettings.DNSRecordsTTL, HTTPClientSettings.DNSResolutionTimeout, HTTPClientSettings.DNSCacheSize, HTTPClientSettings.DNSServers, HTTPClientSettings.DNSConcurrency, HTTPClientSettings.DisableIPv4, HTTPClientSettings.DisableIPv6)
 	if err != nil {
 		return nil, err
 	}
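
The `AllowedDomains` comment in `ProxyConfig` describes glob patterns such as `"*.example.com"` and `"api.*.org"`, with an empty list meaning "any domain". How the dialer actually evaluates these patterns is not shown in this diff, so the following is only a sketch of one plausible reading, using the standard library's `path.Match`. The function `domainAllowed` and the case-insensitive comparison are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// domainAllowed (hypothetical) reports whether host matches at least one glob
// pattern. An empty pattern list allows every host, mirroring the
// "If empty, the proxy can be used for any domain" comment.
func domainAllowed(host string, patterns []string) bool {
	if len(patterns) == 0 {
		return true
	}
	host = strings.ToLower(host)
	for _, p := range patterns {
		// path.Match treats '*' as any run of non-'/' characters, which
		// includes dots, so "*.example.com" also matches "a.b.example.com".
		ok, err := path.Match(strings.ToLower(p), host)
		if err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"*.example.com", "api.*.org"}
	fmt.Println(domainAllowed("www.example.com", patterns)) // true
	fmt.Println(domainAllowed("api.foo.org", patterns))     // true
	fmt.Println(domainAllowed("example.net", patterns))     // false
	fmt.Println(domainAllowed("example.net", nil))          // true
}
```

Note that under this reading `"*.example.com"` does not match the bare `example.com`, because the pattern requires the literal dot; whether the library behaves the same way would need to be checked against its implementation.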