Thinking of experimenting with delta updates of serialised JSON (as opposed to full serialisation every time, which is expensive for large collections) and wondering if anyone knows of any existing libraries, experiments, etc., that use special object IDs to mark the start and end of objects to enable delta string substitution in serialised JSON. My search engine fu is not returning any results.

… But, of course, given that changes to the data structure in Node.js cannot conflict (they’re synchronous, in memory), what if we didn’t use JSON.stringify at all but simply streamed the history of all changes to persist them? At server start (where performance is not the bottleneck), the in-memory object graph would be recreated by playing back that history. And that should give us almost instant writes in addition to almost instant in-memory reads. Ooh, going to experiment with that now :)
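
A minimal sketch of that idea in Node/TypeScript – the operation shape and file name here are hypothetical illustrations, not anything WhatDB actually exposes:

import * as fs from 'node:fs';

// Hypothetical shape of a logged change.
type Op =
  | { kind: 'set'; path: string[]; value: unknown }
  | { kind: 'delete'; path: string[] };

// Append each change to the log as newline-delimited JSON: near-instant
// writes, no full JSON.stringify of the object graph.
function persist(op: Op, logPath = 'changes.ndjson'): void {
  fs.appendFileSync(logPath, JSON.stringify(op) + '\n');
}

// At server start (where performance is not the bottleneck), replay the
// log to recreate the in-memory object graph.
function replay(logPath = 'changes.ndjson'): Record<string, unknown> {
  const root: Record<string, unknown> = {};
  if (!fs.existsSync(logPath)) return root;
  for (const line of fs.readFileSync(logPath, 'utf8').split('\n')) {
    if (!line) continue;
    const op: Op = JSON.parse(line);
    // Walk (and create, if needed) the parent object of the target key.
    const parent = op.path
      .slice(0, -1)
      .reduce((node: any, key) => (node[key] ??= {}), root);
    const leaf = op.path[op.path.length - 1];
    if (op.kind === 'set') parent[leaf] = op.value;
    else delete parent[leaf];
  }
  return root;
}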

Thank you to everyone who just helped me think through this and also for letting me rubber duck with you – appreciate it :)

@cjd Thanks, Caleb. Interestingly, WhatDB (the thing I’m considering this for) already has a very similar structure. I can easily keep track of changes in the object graph if I want to. What I want to explore is reflecting those changes onto a serialised copy so I don’t have to do a full serialisation every time the object graph changes. This made @Moon think of CRDTs, and my start/end marker idea was from WOOT/Logoot, but I hadn’t actually considered using a CRDT as the data structure. I am now :)

@aral @Moon
That code might do full serialization and diffing – I'm not 100% sure. It's not designed to be more efficient than serialize/diff; it's designed to support operational transformation between collaborators, which, if I understand correctly, is not your use case. But it could still very easily be made more efficient.

On that note, if you just want to append patches to a list, you can serialize in full every time and then diff the text.

@cjd @Moon Yeah, no, OT is a world of pain. I’ll take a CRDT / kappa approach over that any day :)

And I don’t know if I’ll go this route. What I have now is performant enough for the sort of data set sizes I have in mind for this. I’m just worried about hogging CPU during the streaming JSON stringification on larger datasets… but I might be overthinking it for now.

Then again, all this is research for replicated data (which will be a first-class citizen on Site.js eventually) :)

@aral there’s two ietf standards for this: jsonpatch and… i don’t remember the name of the other one

@zensaiyuki Thanks. Seen those. Unless I’m mistaken, they’re for generating patches and merging two JSON objects – I need to update a stringified version without performing a full stringification (with string substitution).
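
For reference, a JSON Patch (RFC 6902) document is a list of operations applied to the parsed object graph, not to its serialised text – e.g.:

[
  { "op": "replace", "path": "/x/y/z", "value": 5 },
  { "op": "remove",  "path": "/x/y" },
  { "op": "add",     "path": "/a", "value": [1, 2, 3] }
]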

@aral ah, that – i haven’t heard of anything specific like that, but it does remind me of certain C json parsing libraries that leave the json string in place and just give you a data structure of pointers into the original json string. i imagine it would be possible to build off that, if it isn’t already an option in those libraries

@zensaiyuki @Moon actually made me think further along the lines of what originally got me thinking about this: why not use a CRDT for the data structure… I’m going to give that a think now :)

Thanks again for sharing your thoughts :)

@aral @Moon there are tradeoffs. Plenty of people have gone ahead and actually tried to implement CRDTs, and they don’t always come out happy at the other end – usually after 2-3 years of trying to make it work and reluctantly giving up

@aral @Moon that’s not a personal recommendation or caution. just noticing the trail of dead bodies on everest

@aral @Moon then other folk come along and see the dead bodies and say “they just weren’t doing it right”, which could mean either the speaker’s kung fu is stronger, or it’s a cult. I haven’t decided which i think it is yet.

@zensaiyuki @Moon Haha, yeah, it’s been an area I’ve had an interest in for quite a while now. I wouldn’t be rolling my own. I implemented WOOT in Swift back in the day (but Logoot is better) and I quite like causal trees (archagon.net/blog/2018/03/24/d). But also wondering if an append-only log wouldn’t work alongside a kappa architecture. Have lots more research to do now :)

@aral @Moon if you’re thinking along the lines of an append-only log, another good option could be a piece table. in essence you have your start and end markers, but also an index of which ranges in your buffer will make up the output text. when you need to insert some text in the middle, you just stick your inserted text right at the end of the buffer, and insert its range into the index at the right spot, splitting an existing chunk of text into two ranges, “split” at the insertion point.

@aral @Moon i am not explaining it well, but the piece table’s advantage is that it’s very efficient to write to, and slightly less efficient to read from if it gets too fragmented. mutations to a piece table would map well to being constructed from a log of operations.
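
A minimal piece table sketch along those lines, in TypeScript – the class and method names are made up for illustration:

// A piece table: an append-only text buffer plus an ordered index of
// ranges ("pieces") that, read in order, produce the output text.
type Piece = { start: number; length: number };

class PieceTable {
  private buffer = '';
  private pieces: Piece[] = [];

  // Appending to the underlying buffer is the only write to the text itself.
  private appendToBuffer(text: string): Piece {
    const piece = { start: this.buffer.length, length: text.length };
    this.buffer += text;
    return piece;
  }

  // Insert text at a logical position by splitting the piece it falls in.
  insert(position: number, text: string): void {
    const newPiece = this.appendToBuffer(text);
    let offset = 0;
    for (let i = 0; i <= this.pieces.length; i++) {
      if (i === this.pieces.length) { this.pieces.push(newPiece); return; }
      const piece = this.pieces[i];
      if (position <= offset + piece.length) {
        const head = position - offset;
        const replacement: Piece[] = [];
        if (head > 0) replacement.push({ start: piece.start, length: head });
        replacement.push(newPiece);
        if (head < piece.length) {
          replacement.push({ start: piece.start + head, length: piece.length - head });
        }
        this.pieces.splice(i, 1, ...replacement);
        return;
      }
      offset += piece.length;
    }
  }

  // Reading walks the index; fragmentation makes this slower, as noted above.
  read(): string {
    return this.pieces
      .map(p => this.buffer.slice(p.start, p.start + p.length))
      .join('');
  }
}

// Usage: insertions only ever append to the buffer; the index splits around them.
const table = new PieceTable();
table.insert(0, '[1, "foo", false]');
table.insert(16, ', 1'); // logical insert just before the closing ']'
table.read();            // '[1, "foo", false, 1]'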

CRDT might be overkill, if you don’t intend for edits to be interleaved from multiple sources simultaneously

@aral @Moon especially if you can guarantee order of operations anyway without a special data structure

@aral @Moon here’s a pretty good explanation:

darrenburns.net/posts/piece-ta

it’s for text editing, but with a bit of imagination i think you could see how it could apply here.

@zensaiyuki @Moon Haha, chips! Was just thinking about all this and I think I’ve reached a very similar conclusion. Forget stringification: let’s store and replay changes in an append-only log :)

@zensaiyuki @Moon (Sorry if there’s more to the article you linked to; only skimmed the intro. Bookmarked to read properly later.) :)

@aral @Moon it’s just a write-efficient data structure for append-only logs of edits to text. and it’s old school – designed to be efficient on 1980s floppy disks.

@aral @Moon (it’s what they used to make ms word save performance really good)

@zensaiyuki @Moon Right (no pun intended): so for my use case imagine an append-only log of object.x.y.z = 5; delete object.x.y; object.a = [1,2,3]; etc. // ;)

@aral @Moon right, and if you’re trying to make the json serialisation of that efficient, you’d translate those into text editing ops on the piece table – and you could make that very efficient if your pieces are the intervals around each text token.

so, if you have e.g.

[1, "foo", false]

you’d get exactly that text in a buffer, then your index:

start  end  length  type
0      0    1       arrayst
1      1    1       num
2      3    2       sp
4      8    5       str
9      10   2       sp
11     15   5       bool
16     16   1       arrayend

@aral @Moon
then you’d have a function to translate “root.push(1)” into this text buffer, just appending the inserted text onto the end

[1, "foo", false], 1

and then insert

17     18   2       sp
19     19   1       num

just before the last table entry for arrayend

@aral @Moon then most of the tricky work is on that index table; the actual text buffer only takes appends.

@aral @Moon and, you can enrich the table with whatever metadata you like. so for instance, if it were text, you could stash whether the chunk is bold or italic or whatever in a table column. in your use case it could be a cache of, like, a JSONPath-type lookup key.
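
Sketched as extra columns on the piece index – the field names here are hypothetical:

// A piece enriched with its token type and a cached object-graph location.
type AnnotatedPiece = {
  start: number;   // offset into the append-only buffer
  length: number;
  type: 'arrayst' | 'num' | 'sp' | 'str' | 'bool' | 'arrayend';
  path?: string;   // hypothetical JSONPath-style cache, e.g. '$.root[3]'
};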

@aral @zensaiyuki Oh! Now I’ve re-read and re-parsed your OP. You want the diff to be on the data structure but the patch to be on the rendered text?

@clacke @zensaiyuki Yeah, and while I was originally inspired by object demarcation in CRDTs like WOOT/Logoot, @Moon just made me realise: why not use a CRDT / some sort of kappa architecture to generate the object graph at initial read (which can be as slow as it likes since it happens at server launch)… going to give it a think :)

@aral Please ignore if this is not relevant, but the problem sounds like what CRDTs are used for. https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type they're used for things like etherpad or how Google does updates in google docs. I don't think this is a drop-in solution like you're looking for but maybe help you search.

@Moon Haha, funny you should mention it: I came up with the idea while thinking of the WOOT/Logoot family of CRDTs that use a virtual start/end marker (mentioned it in the Tweet but not here: twitter.com/aral/status/130875). What I’m trying to see is if I can’t remove the serialisation step from updates. And, now that you mention it, a CRDT may actually be the right thing for this. It will sacrifice disk space for write speed (good), and since the initial read is at server startup, that also works :) Hmm…

@aral I'm curious to know what are some use-cases for this technique?

@pizza_pal I’m just looking at cutting out the expensive serialisation (stringification) step in persisting an in-memory JS database. But it’s interesting how that use case is also similar to what you need for replication… :)

@aral Yup, I think there is some sort of logical operation that is common to both, probly some ol' Charles S. Peirce, abductive reasoning kind of thing... eventual consistency and all that.

@aral Oh, just went back through the thread again, read the wikipedia about CRDT, I can see what the deal is now. Interesting, I was thinking about trying to make a collaborative text editor once.

I'm working on this personal project that's a static-site/knowledge-base thing, but with an interface to a semantic reasoner and constraint-solving engine. Serializing inferred axioms from big ontologies can be kinda slow.

@aral This is what most databases do; it's called a Write-Ahead Log (WAL). Like your database, Redis works purely in memory. It also offers to write AOF (append-only file) persistence logs, which it can restore from when it fails.

redis.io/topics/persistence

@moonglum @aral A very important clever trick is mentioned on that page. All you have to do to get a snapshot of an in-memory db is to fork!

So when you compact, you just fork, and:

- Parent starts a new append log and goes about its business.
- Child can take however long it needs to create the new snapshot. When it's done, it atomically replaces the old snapshot and removes the old transaction log.
clarification: atomically replaces the old snapshot, then removes the old transaction log – the snapshot bit is what's atomic
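
The snapshot/log rotation part of that, sketched in Node – this deliberately leaves out the fork itself (Node's child_process.fork starts a fresh process rather than copy-on-write-cloning the heap, so the OS-level trick doesn't map directly), and the file names are hypothetical:

import * as fs from 'node:fs';

// Compaction: write a full snapshot, atomically swap it in, then drop the
// old transaction log (the parent would have started a new log already).
function compact(db: unknown, snapshotPath = 'db.json', logPath = 'changes.ndjson'): void {
  const tmpPath = snapshotPath + '.tmp';
  // 1. Write the new snapshot to a temporary file; can take as long as it needs.
  fs.writeFileSync(tmpPath, JSON.stringify(db));
  // 2. Atomically replace the old snapshot: rename is atomic on POSIX when
  //    source and destination are on the same filesystem.
  fs.renameSync(tmpPath, snapshotPath);
  // 3. Only now remove the old log; a crash between steps 2 and 3 just
  //    leaves a redundant log behind, never a torn snapshot.
  fs.rmSync(logPath, { force: true });
}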

@aral Yes, that's what mdata does, and after serialization of the delta it defers the disk write to a FIFO buffer. It seems we're doing the same ;)

@aral i'm a fan of json-ld's use of "@id" as the property name for identifiers.

but i haven’t seen any ndjson-like specs/standards around sending updates to existing objects. ought to be a thing. it might be so easy to do that no one has written it up as a practice/thing.

@aral hey i found an existing spec yesterday! HOCON from the Lightbend folks handles a stream of delta updates to objects.
github.com/lightbend/config/bl

it works by wrapping each object getting changed in a dictionary that names the object to be updated.
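
roughly like this – my own illustration of HOCON’s duplicate-key merge behaviour, not an excerpt from the linked spec:

// each later occurrence of a key is a delta: objects merge, scalars override
account { name : "Aral", city : "Cork" }
account { city : "Kilkenny" }
account.verified : true

// effective result: account = { name : "Aral", city : "Kilkenny", verified : true }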
