IPIP-462: Ipfs-Path-Affinity on Gateways #462

Open, wants to merge 5 commits into base: main
4 changes: 4 additions & 0 deletions src/http-gateways/path-gateway.md
@@ -195,6 +195,10 @@ Gateway should refuse attempts to register a service worker for entire
Requests to these paths with `Service-Worker: script` MUST be denied by
returning HTTP 400 Bad Request error.

### `Ipfs-Path-Affinity` (request header)

Optional content routing hint, see [`Ipfs-Path-Affinity`](https://specs.ipfs.tech/http-gateways/trustless-gateway/#ipfs-path-affinity-request-header) in :cite[trustless-gateway].

## Request Query Parameters

All query parameters are optional.
32 changes: 31 additions & 1 deletion src/http-gateways/trustless-gateway.md
@@ -4,15 +4,21 @@ description: >
The minimal subset of HTTP Gateway response types facilitates data retrieval
via CID and ensures integrity verification, all while eliminating the need to
trust the gateway itself.
date: 2023-06-20
date: 2024-03-22
maturity: reliable
editors:
- name: Marcin Rataj
github: lidel
url: https://lidel.org/
affiliation:
name: IP Shipyard
url: https://ipshipyard.com
- name: Henrique Dias
github: hacdias
url: https://hacdias.com/
affiliation:
name: IP Shipyard
url: https://ipshipyard.com
xref:
- url
- path-gateway
@@ -88,6 +94,30 @@ Below response types SHOULD be supported:
A Gateway SHOULD return HTTP 400 Bad Request when running in strict trustless
mode (no deserialized responses) and `Accept` header is missing.

### `Ipfs-Path-Affinity` (request header)

Optional content routing hint for the server. Indicates that the requested
resource is a subset of a bigger DAG.

A Client SHOULD use it to send a relevant parent content path when:
- fetching a big file block by block (`application/vnd.ipld.raw`)
- parallelizing DAG download by fetching each branch sub-DAG as a CAR (`application/vnd.ipld.car`)

The value of `Ipfs-Path-Affinity` header SHOULD be percent-encoded
([ECMA262: `encodeURIComponent`](https://tc39.es/ecma262/multipage/global-object.html#sec-encodeuricomponent-uricomponent))
unless it meets the following conditions:
- contains no path beyond the root identifier (`/ipfs/cid`)
- contains no whitespace characters
- contains no `:` characters
- contains no non-ASCII characters
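
The encoding rule above can be sketched on the client side as follows. This is only an illustration, not part of the specification: the function name `encodePathAffinity` is hypothetical, the sketch assumes that all four conditions must hold for a value to be sent unencoded, and it assumes that both `/ipfs/` and `/ipns/` roots count as a "root identifier".

```typescript
// Hypothetical helper (not defined by the spec): returns the value to place
// in the Ipfs-Path-Affinity request header for a given content path.
function encodePathAffinity(contentPath: string): string {
  // Assumption: "no path beyond the root identifier" means /ipfs/{cid} or
  // /ipns/{name}, optionally followed by a single trailing slash.
  const isRootOnly = /^\/ip[fn]s\/[^/]+\/?$/.test(contentPath)
  const hasWhitespace = /\s/.test(contentPath)
  const hasColon = contentPath.indexOf(':') !== -1
  const hasNonAscii = /[^\x00-\x7F]/.test(contentPath)

  // Assumption: the raw value is used only when every condition is met.
  if (isRootOnly && !hasWhitespace && !hasColon && !hasNonAscii) {
    return contentPath
  }
  // Otherwise percent-encode with the ECMA262 function referenced above.
  return encodeURIComponent(contentPath)
}

// A root-only path is sent as-is; a path with segments is percent-encoded.
const rootOnlyValue = encodePathAffinity('/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze')
const nestedValue = encodePathAffinity('/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/Books')
```

A client that prefers a single code path could always call `encodeURIComponent`, since the text above only describes when encoding may be skipped.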

A gateway backend SHOULD leverage this hint to improve retrieval by querying
providers of the hinted content paths in addition to the requested one.

A gateway implementation SHOULD support client requests in which the
`Ipfs-Path-Affinity` header is present more than once, but SHOULD also set a
hard limit on the number of hints to process (e.g. 3) to avoid abuse.
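
A minimal server-side sketch of the repeated-header handling and hard limit described above. Everything here is an assumption made for illustration: the names `MAX_AFFINITY_HINTS`, `safeDecode` and `parseAffinityHints` are hypothetical, the limit of 3 is the example value from the text, splitting on commas assumes an HTTP library or intermediary may have joined repeated field lines, and the `/ipfs/` or `/ipns/` prefix check is one possible validation.

```typescript
// Hypothetical constant: the example limit mentioned in the text above.
const MAX_AFFINITY_HINTS = 3

// Percent-decode a value, returning undefined when the encoding is malformed.
function safeDecode(value: string): string | undefined {
  try {
    return decodeURIComponent(value)
  } catch {
    return undefined
  }
}

// headerValues: one entry per Ipfs-Path-Affinity field line in the request.
function parseAffinityHints(headerValues: string[], limit: number = MAX_AFFINITY_HINTS): string[] {
  const hints: string[] = []
  let processed = 0
  for (const value of headerValues) {
    // Assumption: repeated field lines may arrive joined with commas.
    for (const part of value.split(',')) {
      const candidate = part.trim()
      if (candidate === '') continue
      // Stop once the hard limit of processed hints is reached.
      if (processed >= limit) return hints
      processed++
      const decoded = safeDecode(candidate)
      if (decoded === undefined) continue // ignore malformed hints
      // Assumption: only content paths under /ipfs/ or /ipns/ are useful.
      if (decoded.startsWith('/ipfs/') || decoded.startsWith('/ipns/')) {
        hints.push(decoded)
      }
    }
  }
  return hints
}
```

Because the header is only a hint, this sketch silently drops malformed or surplus values instead of rejecting the request.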

## Request Query Parameters

### :dfn[dag-scope] (request query parameter)
99 changes: 99 additions & 0 deletions src/ipips/ipip-0462.md
@@ -0,0 +1,99 @@
---
title: "IPIP-0462: Ipfs-Path-Affinity on Gateways"
date: 2024-02-16
ipip: proposal
editors:
- name: Marcin Rataj
github: lidel
url: https://lidel.org/
affiliation:
name: IP Shipyard
url: https://ipshipyard.com
relatedIssues:
- https://github.com/ipfs/kubo/issues/10251
- https://github.com/ipfs/kubo/issues/8676
order: 462
tags: ['ipips']
---

## Summary

This IPIP adds gateway support for the optional `Ipfs-Path-Affinity` HTTP request header.

## Motivation

Endpoints that implement :cite[trustless-gateway] may receive requests for a
single block, or a CAR request for a sub-DAG of a bigger tree.

While every piece of data that is meant to be independently accessible should
be advertised on the routing system, not every CID is advertised today.
Some providers limit announcements to top-level root CIDs due to time, cost, or
misconfiguration.

What does this mean for the ecosystem? It should adapt and ensure
implementations leverage all information provided by the end user.

Over time, both clients and servers should leverage the concept of "affinity".

The introduction of an optional `Ipfs-Path-Affinity` header aims to increase
the success rate of the gateway retrieving internal standalone blocks or byte
ranges, especially if the requested blocks are not announced on routing
systems, but belong to a bigger DAG, and only the root CID of that parent DAG
is announced.

## Detailed design
Member

I really feel like this is missing some guidance on what the values should or can be.

With existing wording, it's too vague and could result in significant client pain of having to implement different values for different providers.

Also: we should really speak to how to implement this on the client and server sides, or at least offer some best practices. E.g. how are we going to implement this in @helia/verified-fetch and rainbow?

Member Author

Clarified the format in trustless-gateway.md. The gist: you put in the header the content path that you are trying to load via the block or CAR request.

For example, when https://en-wikipedia--on--ipfs-org.ipns.inbrowser.dev/wiki/Books fetches the necessary blocks from https://trustless-gateway.link, it would send the encodeURIComponent('/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/Books') value:

Ipfs-Path-Affinity: %2Fipfs%2Fbafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze%2Fwiki%2FBooks

in every ?format=raw request.
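
As a rough illustration of the example above, a client could assemble each `?format=raw` block request like this. The `GatewayRequest` shape, the function name `rawBlockRequest`, and the `bafkreiexampleblock` placeholder CID are hypothetical and not taken from any existing client library.

```typescript
// Hypothetical shape describing the pieces of an HTTP request to a gateway.
interface GatewayRequest {
  url: string
  headers: Record<string, string>
}

// Build a raw block request that carries the parent content path as a hint.
function rawBlockRequest(gatewayOrigin: string, blockCid: string, parentContentPath: string): GatewayRequest {
  return {
    url: `${gatewayOrigin}/ipfs/${blockCid}?format=raw`,
    headers: {
      Accept: 'application/vnd.ipld.raw',
      // The same affinity value is repeated on every block request.
      'Ipfs-Path-Affinity': encodeURIComponent(parentContentPath)
    }
  }
}

// 'bafkreiexampleblock' is a made-up placeholder, not a real CID.
const exampleRequest = rawBlockRequest(
  'https://trustless-gateway.link',
  'bafkreiexampleblock',
  '/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/Books'
)
```

The resulting `url` and `headers` can be passed to whatever HTTP client the application already uses.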


Introduce the `Ipfs-Path-Affinity` HTTP request header to allow an HTTP client to
inform the gateway about the context of a block/CAR request.
Comment on lines +46 to +47
Contributor

What is the format of the data that goes here? Is it ipfs://<cid>/<some>/<path> is it /ipfs/... is it just a CID?

Member

IMO, it should be /ipfs/... (or ipfs://cid), not just a CID.

Member

I would much prefer an ipfs:// protocol instead of pathing, but I guess whatever the user is likely to have is better UX.

Member Author

@lidel, Mar 22, 2024

Went with content path, as we already talk about them all over the specs.

NOTE: because UnixFS can have whitespace, :, and arbitrary bytes in labels, we have to percent-encode the content path.

Clarified format in trustless-gateway.md (35a5eed)


A client asking a gateway for a block SHOULD provide a hint about the DAG the
block belongs to, if such information is available.
Comment on lines +49 to +50
Contributor

Is it only a strict IPLD DAG where we'd recommend this? It seems like you could plausibly do this for a set of related data that aren't explicitly linked via IPLD (e.g. a website that has HTML that loads jpegs from within the same or a different root DAG).

Member

This reminds me of some chats we've had about how a provider of "bafyFoo1" would likely also have "bafyFoo2".

Member Author

The way the field format is specified, these can be arbitrary content paths. It is up to the client to provide a meaningful hint.

Most of the time it will be the content path the client tries to load, but it could also be /ipns/other-website.


A gateway unable to find providers for an internal block should be
able to leverage affinity information sent by the client and use CIDs of parent
path segments as additional content routing lookup hints.
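
One way a gateway might turn affinity hints into extra routing lookups is sketched below. This is a simplified assumption, not the design of any existing gateway: `rootCidOf` and `routingLookupOrder` are hypothetical names, only the root CID of each `/ipfs/` hint is used (CIDs of deeper path segments would require resolving the path first), and `/ipns/` hints are skipped because they need name resolution.

```typescript
// Extract the root CID from an already-decoded /ipfs/ content path.
function rootCidOf(contentPath: string): string | undefined {
  const match = /^\/ipfs\/([^/]+)/.exec(contentPath)
  return match === null ? undefined : match[1]
}

// Order in which CIDs could be looked up on the routing system:
// the requested CID first, then root CIDs taken from the affinity hints.
function routingLookupOrder(requestedCid: string, hints: string[]): string[] {
  const order: string[] = [requestedCid]
  for (const hint of hints) {
    const root = rootCidOf(hint)
    // Skip hints without an /ipfs/ root and avoid duplicate lookups.
    if (root !== undefined && order.indexOf(root) === -1) {
      order.push(root)
    }
  }
  return order
}
```

Providers found for a hinted root can then be asked for the originally requested block, on the assumption that a provider of the parent DAG is likely to hold its internal blocks.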

## Design rationale

### User benefit

When supported by both client and server:

- Light clients are able to use trustless HTTP gateway endpoints more
efficiently and resume downloads faster.
- Gateway operators are able to leverage the hint and save resources related to
provider lookup.
- Content providers are able to implement smarter announcement mechanisms,
Member

What exactly do you mean by "smarter announcement mechanisms"?
Is it just a matter of whether only roots or all CIDs are announced?

Contributor

"roots" isn't really a word here. Root of what? At every layer in the DAG going up you could slice off the top of the tree and declare a new root.

Every piece of content that needs to be independently addressable should be advertised. See https://github.com/ipfs/specs/pull/462/files#r1492996318

So at the very least if you make a block-request for the middle of a tar.gz file (where no part of the file really needs to be addressed on its own) you should be able to find it even if the provider has only advertised the root of the file.

As mentioned in the linked comment I do think we need to be careful not to mislead people though.

Member

> Every piece of content that needs to be independently addressable should be advertised.

Yes, that makes sense in theory, but if we go to the tar.gz example:

> So at the very least if you make a block-request for the middle of a tar.gz file (where no part of the file really needs to be addressed on its own) you should be able to find it even if the provider has only advertised the root of the file.

Here, the only independently addressable content is the tar.gz file and for range block-requests in the middle, you'd need to pass the affinity header to be able to fetch that.

If we're on the same page thus far, what would be smarter announcement mechanisms/strategies? Is the idea for those to be codec-aware in the sense that you could tell the node to advertise all UnixFS Files and Directories?

Member Author

@lidel, Mar 22, 2024

@2color we have some ideas for smarter announcement mechanisms/strategies in ipfs/kubo#8676 and more actionable ipfs/kubo#10365 (which has wip implementation).

ps. we have a concept of entity, it was introduced in IPIP-402, we can use it in discussions like this, to say "only entity root CIDs (see IPIP-402)", making it more specific.

Member

@2color, Mar 22, 2024

It sounds like "smarter" in this case means "frugal but smart" in the sense that it involves potentially fewer announcements that are still effective enough for routing to work.

Member Author

Exactly, there are "frugal" things we could do on both client and server to announce less, but be as efficient at retrieval in real world usage patterns (website browsing, video streaming, download resume or parallel downloads etc).

without worrying that some internal blocks are not announced (intentionally or unintentionally).

### Compatibility

The HTTP header is optional, which makes it backward-compatible with the
existing ecosystem of HTTP clients and IPFS Gateways.

### Security

The client controls when affinity information is sent in the header,
and an implementation SHOULD allow an end user to disable it in contexts where
the parent content path is considered sensitive.

A gateway implementation that supports the `Ipfs-Path-Affinity` header being
present more than once MUST also set a limit (e.g. max 3) to avoid abuse.

### Alternatives

- Why not just an arbitrary identifier the user could use to establish a
  relationship between requests?
  - Requires the server to keep state, which breaks or complicates gateway
    deployments with horizontal scaling.
  - Does not help when the client is sending requests for different blocks/sub-DAGs
    to different trustless gateways, and none of them has the whole picture,
    and most of them do not know the parent content path.

## Test fixtures

N/A

### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).