Hash changes if we change our metadata #1152

Closed
@schomatis

Spawned from ipfs/kubo#8974.

tl;dr: The CID is not the hash of your file, so do not rely on it as one. The normal learning path can leave you with the wrong impression that there is a stable relation between user data and the CID/hash representing it.

Brief outline:

  • New users are introduced to the IPFS world through the content-based paradigm: forget where you store it, all that counts is the data itself, which we identify through its hash. Unlike its location, your (the user's) data doesn't change, so neither will its hash.
  • New users experiment with this paradigm by adding files to the IPFS system (CLI, HTTP, web, whatever) and get a CID/hash in return.
  • There is now a discrepancy in what "data" means:
    • In the theory/docs, users visualize a single block (a string of bits) of their data: what was contained in the FS file they're adding, nothing more.
    • In practice, through the UnixFS abstraction, the file is formatted as a DAG of many chunks (blocks) of the user's data. The DAG structure is supported by IPFS (not user) metadata, which is also part of the block of data that is being hashed and thus affects its CID.
  • The metadata leaks into the CID, whether the user cares about it or not. The same file added with different parameters (or even with the same parameters but on a newer IPFS version with different defaults) may be represented by different CIDs/hashes.
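To make the last two points concrete, here is a minimal sketch in Python. It is a toy model, not the real UnixFS/dag-pb encoding: the `toy_root_hash` function, its root-node layout, and the chunk sizes are all assumptions made up for illustration. It only shows the general effect, namely that the plain hash of the bytes is stable while the hash of a chunked structure built over those same bytes depends on how we chose to chunk them.

```python
import hashlib


def flat_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes: what users tend to picture as "the hash"."""
    return hashlib.sha256(data).hexdigest()


def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Hash of a toy two-level DAG (hypothetical layout, NOT real UnixFS).

    The data is split into fixed-size chunks, each chunk is hashed, and the
    root node is the list of (chunk length, chunk digest) pairs. The root
    hash therefore covers layout metadata as well as the user's bytes.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    root_node = b"".join(
        len(chunk).to_bytes(8, "big") + hashlib.sha256(chunk).digest()
        for chunk in chunks
    )
    return hashlib.sha256(root_node).hexdigest()


data = b"hello ipfs " * 1000  # same user data in every call below

small_chunks = toy_root_hash(data, chunk_size=256)
large_chunks = toy_root_hash(data, chunk_size=1024)

print("flat hash:        ", flat_hash(data))
print("root (256 B):     ", small_chunks)
print("root (1024 B):    ", large_chunks)
print("roots are equal?  ", small_chunks == large_chunks)
```

The same bytes give one flat hash but two different root hashes, because the chunk size changes the structure that is actually hashed. In kubo, `ipfs add` options such as `--chunker` and `--cid-version` play an analogous role for real CIDs.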

I think this happens to a lot of people (myself included). The simplest example of a "neutral" block of my data is what I first think of when immutability comes up. At some point we silently jump from that single block to a file without mentioning UnixFS, which is ugly, and I get why it is not in the foreground. But you normally translate that neutral/single/your block as your file, and therefore the immutability of the data as also the immutability of its tag (CID). Not sure when, but at some point we need to break it to you, maybe not even mentioning UnixFS but just generic metadata: we process your data and add some of our own to better organize and transmit it, and even if that is also immutable we may change our minds (very rarely) as to what the best organization is. And you'll see a different hash reflecting it. Kind of sucks, but that's life, and it's still much better than httping all the time. (We can omit this last remark 😬.)

Labels

  • P2: Medium: Good to have, but can wait until someone steps up
  • dif/medium: Prior experience is likely helpful
  • effort/hours: Estimated to take one or several hours
  • subtask: Issue w/ parent GH issue
