197 changes: 197 additions & 0 deletions docs/UNICODE_NOTES.md
@@ -0,0 +1,197 @@
# Unicode String Length Handling

## Overview

This document explains how string length validation works in `pubky-app-specs` and the important differences between JavaScript's native string length and Rust's character counting.

## The Problem

JavaScript and Rust count string length differently for certain Unicode characters:

| String | Type | Rust `.chars().count()` | JS `.length` |
|-----------|------|-------------------------|--------------|
| `"Hello"` | ASCII | 5 | 5 |
| `"中文"` | Chinese | 2 | 2 |
| `"café"` | Accented | 4 | 4 |
| `"🔥"` | Emoji | **1** | **2** |
| `"𒅃"` | Cuneiform | **1** | **2** |
| `"𓀀"` | Hieroglyph | **1** | **2** |

### Why the Difference?

- **JavaScript** uses **UTF-16** encoding internally. The `.length` property counts **UTF-16 code units**.
- **Rust** `.chars().count()` counts **Unicode code points** (scalar values).

Characters in the **Basic Multilingual Plane (BMP)** (U+0000 to U+FFFF) use 1 UTF-16 code unit.
Characters **outside the BMP** (U+10000 and above) require a **surrogate pair** (2 UTF-16 code units).
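
The surrogate-pair behaviour described above can be observed directly in any JavaScript runtime. The following is a minimal sketch using only standard `String` methods; the variable names are illustrative.

```javascript
// "🔥" is U+1F525, which lies outside the BMP.
const fire = "🔥";

// .length counts UTF-16 code units, so the surrogate pair counts as 2.
const codeUnits = fire.length;

// Spreading a string iterates by code point, like Rust's .chars().
const codePoints = [...fire].length;

// The two halves of the surrogate pair, as hex.
const high = fire.charCodeAt(0).toString(16);
const low = fire.charCodeAt(1).toString(16);

// codePointAt(0) reads the whole pair as a single code point.
const scalar = fire.codePointAt(0).toString(16);

console.log({ codeUnits, codePoints, high, low, scalar });
// { codeUnits: 2, codePoints: 1, high: 'd83d', low: 'dd25', scalar: '1f525' }
```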

### Characters Outside BMP (Affected by This Difference)

| Category | Examples | UTF-16 Units per Char |
|----------|----------|----------------------|
| Emoji | 🔥 🚀 😀 👋 🌍 | 2 |
| Cuneiform (Sumerian) | 𒅃 𒀀 𒁀 | 2 |
| Egyptian Hieroglyphs | 𓀀 𓆉 𓍄 | 2 |
| Musical Symbols | 𝄞 𝄢 | 2 |
| Mathematical Alphanumeric | 𝔸 𝕏 | 2 |
| Historic Scripts | Various | 2 |

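**Note**: Characters in the BMP (ASCII, commonly used Chinese, Japanese, and Korean characters, Arabic, Hebrew, Cyrillic, Greek, Thai, etc.) all use 1 UTF-16 unit and are **unaffected** by this difference. Rare CJK ideographs in the supplementary planes (for example, CJK Extension B) are the exception and take 2 units.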
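
To check whether a given input is affected, one option is to look for code points above U+FFFF. The helpers below are hypothetical (they are not part of `pubky-app-specs`) and are only a sketch of that check.

```javascript
// Hypothetical helper: returns the characters of `str` that lie outside the BMP.
function charsOutsideBmp(str) {
  return [...str].filter((ch) => ch.codePointAt(0) > 0xffff);
}

// Hypothetical helper: how many more UTF-16 code units than code points `str` has.
// A result of 0 means .length and the code point count agree.
function lengthGap(str) {
  return str.length - [...str].length;
}

console.log(charsOutsideBmp("Hi🔥𝄞")); // [ '🔥', '𝄞' ]
console.log(lengthGap("Hi🔥𝄞")); // 2
console.log(lengthGap("中文")); // 0
```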

## Our Solution: WASM-Based Validation

All validation in `pubky-app-specs` happens **inside the WASM module** (Rust), not in JavaScript.

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    JavaScript Client                    │
│                                                         │
│  const user = PubkyAppUser.fromJson({                   │
│    name: "Alice🔥",                                     │
│    bio: "Hello 𓀀"                                      │
│  });                                                    │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                   WASM Module (Rust)                    │
│                                                         │
│  1. Deserialize JSON                                    │
│  2. Sanitize (trim whitespace, normalize)               │
│  3. Validate (using .chars().count())  ◄── Single       │
│  4. Return Result                          Source       │
│                                            of Truth     │
└─────────────────────────────────────────────────────────┘
```

### Why This Works

1. **Single Source of Truth**: All validation uses Rust's `.chars().count()` (Unicode code points)
2. **No JS Validation Needed**: JavaScript delegates entirely to WASM
3. **Consistent Results**: Same behavior for emoji, Chinese, cuneiform, etc.

### Example: Username Validation

```rust
// In Rust (WASM)
const MAX_USERNAME_LENGTH: usize = 50;

fn validate(&self, _id: Option<&str>) -> Result<(), String> {
    let name_length = self.name.chars().count(); // Unicode code points
    if name_length > MAX_USERNAME_LENGTH {
        return Err("Validation Error: Invalid name length".into());
    }
    Ok(())
}
```

| Input | `.chars().count()` | Valid? (max 50) |
|-------|-------------------|-----------------|
| `"Alice"` | 5 | ✅ |
| `"🔥".repeat(50)` | 50 | ✅ |
| `"🔥".repeat(51)` | 51 | ❌ |
| `"𓀀".repeat(50)` | 50 | ✅ |
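
The rows of this table can be reproduced in JavaScript by counting code points. This is a sketch under stated assumptions: `codePointLength` and `isNameLengthValid` are hypothetical helpers, the limit of 50 is copied from the Rust example, and the sanitization step that the WASM module applies before validating is not modelled here.

```javascript
// Assumed limit, copied from the Rust example (MAX_USERNAME_LENGTH).
const MAX_USERNAME_LENGTH = 50;

// Counts Unicode code points, like Rust's .chars().count().
function codePointLength(str) {
  return [...str].length;
}

// Hypothetical mirror of the length check only (no trimming or other rules).
function isNameLengthValid(name) {
  return codePointLength(name) <= MAX_USERNAME_LENGTH;
}

const inputs = ["Alice", "🔥".repeat(50), "🔥".repeat(51), "𓀀".repeat(50)];
for (const input of inputs) {
  console.log(codePointLength(input), isNameLengthValid(input));
}
// 5 true
// 50 true
// 51 false
// 50 true
```

The WASM module remains the authority; a mirror like this is only useful for immediate UI feedback.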

## Client-Side Validation

For client-side validation (for UX feedback), we recommend relying on the existing pubky-app-specs validation in the WASM module.

### How to Validate in Your Application

The WASM module automatically validates all objects when you create them or parse them from JSON. Use these methods for validation:

```javascript
import { PubkySpecsBuilder, PubkyAppUser } from "pubky-app-specs";

// Method 1: Using builder
try {
  const builder = new PubkySpecsBuilder(userId);
  const { user } = builder.createUser(
    "Alice🔥", // Emoji counts as 1 character
    "Bio with 𓀀", // Hieroglyph counts as 1 character
    null, null, null
  );
  console.log("User is valid!");
} catch (error) {
  showError(error.message); // Validation failed
}

// Method 2: From JSON
try {
  const user = PubkyAppUser.fromJson({
    name: "Alice🔥",
    bio: "Bio with 𓀀",
    image: null,
    links: null,
    status: null
  });
  console.log("User is valid!");
} catch (error) {
  showError(error.message); // Validation failed
}

// Both methods throw on validation failure - no manual checks needed!
```

### JavaScript Length Methods Comparison

If you need client-side length validation for real-time input feedback (e.g., character counters) or custom validation, you should use methods that count Unicode code points to match Rust's `.chars().count()` behavior:

```javascript
const str = "Hi🔥";

// ❌ WRONG - counts UTF-16 code units, not Unicode code points
str.length // 4 (overcounts: 🔥 is 2 code units but 1 code point)
if (username.length > MAX_USERNAME_LENGTH) {
  showError("Username too long");
}
// With a limit of 50, this would incorrectly reject "🔥".repeat(26)
// because JS sees 52 code units, but Rust sees 26 code points (valid!)

// ✅ CORRECT - counts Unicode code points (matches Rust)
// These methods correctly handle characters outside BMP (emoji, etc.)
[...str].length // 3 (Unicode code points) - counts 🔥 as 1
Array.from(str).length // 3 (also works)
```
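
For a real-time character counter, a small helper that reports the remaining code points may be enough. The sketch below assumes a limit of 50 (taken from the username example); `remainingCharacters` is a hypothetical name, not an export of `pubky-app-specs`.

```javascript
// Assumed limit, matching the username example in this document.
const MAX_USERNAME_LENGTH = 50;

// Hypothetical helper: code points still available before the limit is hit.
// A negative result means the input is too long.
function remainingCharacters(value, max = MAX_USERNAME_LENGTH) {
  return max - Array.from(value).length;
}

const candidate = "🔥".repeat(26);
console.log(candidate.length); // 52 (UTF-16 code units, over the limit)
console.log(remainingCharacters(candidate)); // 24 (26 code points used)
```

A counter built this way agrees with the code point count used by the WASM validation, so the user is not told an input is too long when it would in fact be accepted.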

### When to Validate

- **On form submit**: Always - catch errors before network calls
- **Real-time feedback**: Optional - use `[...str].length` for input counters
- **On input change**: Usually not needed - can impact UX with emoji autocomplete

### Edge Cases: Grapheme Clusters (Advanced)

⚠️ **This is informational** - current validation doesn't handle grapheme clusters, and that's acceptable for most use cases.

Even `.chars().count()` doesn't handle complex **grapheme clusters** (what users perceive as single characters):

| String | Visual | Code Points | User Perception |
|--------|--------|-------------|----------------|
| `"👨‍👩‍👧‍👦"` | family emoji | 7 | 1 |
| `"🇺🇸"` | flag | 2 | 1 |
| `"é"` (e + ◌́) | accented e | 2 | 1 |

**Impact**: A username with 50 flag emojis would actually be 100 code points and fail validation.

**Decision**: For usernames, tags, and bios, code point counting is sufficient. True grapheme counting would add complexity and dependencies without significant benefit for this use case.
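
For illustration only, the gap between code points and user-perceived characters can be measured with the standard `Intl.Segmenter` API, where the runtime supports it. This is not what the WASM validation does; `graphemeLength` is a hypothetical helper, and the strings are written with escapes so the individual code points are visible.

```javascript
// Hypothetical helper: counts grapheme clusters (user-perceived characters).
function graphemeLength(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

const samples = {
  // man + ZWJ + woman + ZWJ + girl + ZWJ + boy
  family: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}",
  // regional indicators U + S
  flag: "\u{1F1FA}\u{1F1F8}",
  // "e" followed by a combining acute accent
  accentedE: "e\u0301",
};

for (const [label, text] of Object.entries(samples)) {
  console.log(label, [...text].length, graphemeLength(text));
}
// family 7 1
// flag 2 1
// accentedE 2 1
```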

## Summary

| Aspect | Approach |
|--------|----------|
| **Validation Location** | WASM (Rust) only |
| **Length Method** | `.chars().count()` (Unicode code points) |
| **JS Client** | Use `[...str].length` if local validation needed |
| **Affected Characters** | Emoji, ancient scripts, musical symbols |
| **Unaffected Characters** | ASCII, Chinese, Japanese, Arabic, etc. |
| **Performance** | <1ms for typical inputs |

## References

- [Unicode Standard](https://unicode.org/)
- [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16)
- [JavaScript String length](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length)
- [Rust chars() documentation](https://doc.rust-lang.org/std/primitive.str.html#method.chars)
77 changes: 71 additions & 6 deletions pkg/README.md
@@ -105,17 +105,16 @@ import { Client } from "@synonymdev/pubky";
import { PubkySpecsBuilder, PubkyAppPostKind } from "pubky-app-specs";

async function createPost(pubkyId, content) {
- // fileData can be a File (browser) or a raw Blob/Buffer (Node).
const client = new Client();
const specs = new PubkySpecsBuilder(pubkyId);

- // Create the Post object referencing your (optional) attachment
+ // Create the Post object
const {post, meta} = specs.createPost(
content,
PubkyAppPostKind.Short,
- null, // parent post
- null, // embed
- null // attachments list of urls
+ null, // parent post URI (for replies)
+ null, // embed object (for reposts)
+ null // attachments (array of file URLs, max 3)
);

// Store the post
@@ -130,7 +129,39 @@ async function createPost(pubkyId, content) {
}
```

- ### 3) Following a User
+ ### 3) Creating a Post with Attachments

```js
import { Client } from "@synonymdev/pubky";
import { PubkySpecsBuilder, PubkyAppPostKind } from "pubky-app-specs";

async function createPostWithAttachments(pubkyId, content, fileUrls) {
const client = new Client();
const specs = new PubkySpecsBuilder(pubkyId);

// Create post with attachments (max 3 allowed)
const {post, meta} = specs.createPost(
content,
PubkyAppPostKind.Image,
null, // parent
null, // embed
fileUrls // e.g. ["pubky://user/pub/pubky.app/files/abc123"]
);

const postJson = post.toJson();
console.log("Attachments:", postJson.attachments);

await client.fetch(meta.url, {
method: "PUT",
body: JSON.stringify(postJson),
});

console.log("Post with attachments stored at:", meta.url);
return {post, meta};
}
```

### 4) Following a User

```js
import { Client } from "@synonymdev/pubky";
@@ -167,6 +198,40 @@ This library supports many more domain objects beyond `User` and `Post`. Here ar

Each has a `meta` field for storing relevant IDs/paths and a typed data object.

## 🔗 URI Builder Utilities

These helper functions construct properly formatted Pubky URIs:

```js
import {
userUriBuilder,
postUriBuilder,
bookmarkUriBuilder,
followUriBuilder,
tagUriBuilder,
muteUriBuilder,
lastReadUriBuilder,
blobUriBuilder,
fileUriBuilder,
} from "pubky-app-specs";

const userId = "8kkppkmiubfq4pxn6f73nqrhhhgkb5xyfprntc9si3np9ydbotto";
const targetUserId = "dzswkfy7ek3bqnoc89jxuqqfbzhjrj6mi8qthgbxxcqkdugm3rio";

// Build URIs for different resources
userUriBuilder(userId); // pubky://{userId}/pub/pubky.app/profile.json
postUriBuilder(userId, "0033SSE3B1FQ0"); // pubky://{userId}/pub/pubky.app/posts/{postId}
bookmarkUriBuilder(userId, "ABC123"); // pubky://{userId}/pub/pubky.app/bookmarks/{bookmarkId}
followUriBuilder(userId, targetUserId); // pubky://{userId}/pub/pubky.app/follows/{targetUserId}
tagUriBuilder(userId, "XYZ789"); // pubky://{userId}/pub/pubky.app/tags/{tagId}
muteUriBuilder(userId, targetUserId); // pubky://{userId}/pub/pubky.app/mutes/{targetUserId}
lastReadUriBuilder(userId); // pubky://{userId}/pub/pubky.app/last_read
blobUriBuilder(userId, "BLOB123"); // pubky://{userId}/pub/pubky.app/blobs/{blobId}
fileUriBuilder(userId, "FILE456"); // pubky://{userId}/pub/pubky.app/files/{fileId}
```

---

## 📌 Parsing a Pubky URI

The `parse_uri()` function converts a Pubky URI string into a strongly typed object.
17 changes: 17 additions & 0 deletions pkg/example.js
@@ -75,6 +75,23 @@ console.log("Repost Post URL:", repostMeta.url);
console.log("Repost Data:", JSON.stringify(repost.toJson(), null, 2));
console.log("-".repeat(60));

console.log("📎 Creating Post with Attachments...");
const { post: postWithAttachments, meta: postWithAttachmentsMeta } = specsBuilder.createPost(
"Check out these photos from my trip!",
PubkyAppPostKind.Image,
null,
null,
[
`pubky://${OTTO}/pub/pubky.app/files/0034A0X7NJ52G`,
`pubky://${OTTO}/pub/pubky.app/files/0034A0X7NJ53H`,
]
);
console.log("Post ID:", postWithAttachmentsMeta.id);
console.log("Post URL:", postWithAttachmentsMeta.url);
console.log("Attachments:", postWithAttachments.toJson().attachments);
console.log("Post Data:", JSON.stringify(postWithAttachments.toJson(), null, 2));
console.log("-".repeat(60));

console.log("🔖 Creating Bookmark...");
let { bookmark, meta: bookmarkMeta } = specsBuilder.createBookmark(
`pubky://${RIO}/pub/pubky.app/posts/0033SREKPC4N0`
2 changes: 1 addition & 1 deletion pkg/package.json
@@ -1,7 +1,7 @@
{
"name": "pubky-app-specs",
"description": "Pubky.app Data Model Specifications",
- "version": "0.4.1",
+ "version": "0.4.2",
"license": "MIT",
"collaborators": [
"SHAcollision"