Thinking of experimenting with delta updates of serialised JSON (as opposed to full serialisation every time, which is expensive for large collections) and wondering if anyone knows of any existing libraries, experiments, etc., that use special object IDs to mark the start and end of objects to enable delta string substitution in serialised JSON. My search engine fu is not returning any results.
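To make that concrete, here’s the kind of thing I have in mind (a minimal sketch, all names hypothetical): keep a map of object ID → offsets into the cached serialised string, then splice in a re-serialisation of just the changed object and shift the offsets that follow it.

```js
// Sketch: patch a cached JSON string in place instead of re-serialising
// the whole collection. Assumes a flat map of object ID → [start, end)
// spans into the cached string; nested spans are left out for brevity.
const cache = {
  text: '{"a":{"id":1,"x":1},"b":{"id":2,"y":2}}',
  spans: new Map([
    [1, { start: 5, end: 19 }],  // {"id":1,"x":1}
    [2, { start: 24, end: 38 }]  // {"id":2,"y":2}
  ])
}

function patchSerialised (cache, id, updatedObject) {
  const span = cache.spans.get(id)
  const replacement = JSON.stringify(updatedObject)
  const delta = replacement.length - (span.end - span.start)
  cache.text = cache.text.slice(0, span.start) + replacement + cache.text.slice(span.end)
  const oldEnd = span.end
  span.end += delta
  // Shift every span that starts at or after the edited span’s old end.
  for (const other of cache.spans.values()) {
    if (other !== span && other.start >= oldEnd) {
      other.start += delta
      other.end += delta
    }
  }
}

patchSerialised(cache, 1, { id: 1, x: 42 })
console.log(cache.text) // {"a":{"id":1,"x":42},"b":{"id":2,"y":2}}
```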
@aral CryptPad does deltas on a json structure using this: https://github.com/xwiki-labs/chainpad-listmap
@cjd Thanks, Caleb. Interestingly, WhatDB (the thing I’m considering this for) already has a very similar structure. I can easily keep track of changes in the object graph if I want to. What I want to explore is reflecting those changes onto a serialised copy so I don’t have to do a full serialisation every time the object graph changes. This made @Moon think of CRDTs, and my start/end marker idea came from WOOT/Logoot, but I hadn’t actually considered using a CRDT as the data structure. I am now :)
@aral @Moon
That code might do full serialization and diffing; I'm not 100% sure. It's not designed to be more efficient than serialize/diff; it's designed to support operational transformation between collaborators, which, if I understand correctly, is not your use case. But it could still very easily be more efficient.
On that note, if you just want to append patches to a list, you can serialize in full every time and then diff the text.
@cjd @Moon Yeah, no, OT is a world of pain. I’ll take a CRDT / kappa approach over that any day :)
And I don’t know if I’ll go this route. What I have now is performant enough for the sort of data set sizes I have in mind for this. I’m just worried about hogging CPU during the streaming JSON stringification on larger datasets… but I might be overthinking it for now.
Then again, all this is research for replicated data (which will be a first-class citizen on Site.js eventually) :)
@aral there are two ietf standards for this. jsonpatch and… i don’t remember the name of the other one
@zensaiyuki Thanks. Seen those. Unless I’m mistaken, they’re for generating patches and merging two JSON objects – I need to update a stringified version without performing a full stringification (with string substitution).
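For reference, an RFC 6902 JSON Patch document looks something like this – it describes operations against the object graph, not edits to the serialised text, which is why it doesn’t quite fit (the ops shown are just illustrative):

```js
// An RFC 6902 (JSON Patch) document: ops against the object graph,
// not against the serialised string.
const patch = [
  { op: 'replace', path: '/x/y/z', value: 5 },
  { op: 'remove', path: '/x/y' },
  { op: 'add', path: '/a', value: [1, 2, 3] }
]
```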
@aral ah, that – i haven’t heard of anything specific like that, but it does remind me of certain C json parsing libraries that leave the json string in place and just give you a data structure of pointers into the original json string. i imagine it would be possible to build off that, if that isn’t already an option in those libraries
@zensaiyuki @Moon actually made me think further along the lines of what originally got me thinking about this: why not use a CRDT for the data structure… I’m going to give that a think now :)
Thanks again for sharing your thoughts :)
@zensaiyuki @Moon Haha, yeah, it’s been an area I’ve had an interest in for quite a while now. I wouldn’t be rolling my own. I implemented WOOT in Swift back in the day (but Logoot is better) and I quite like causal trees (http://archagon.net/blog/2018/03/24/data-laced-with-history/). But also wondering if an append-only log wouldn’t work alongside a kappa architecture. Have lots more research to do now :)
@aral @Moon if you’re thinking along the lines of an append-only log, another good option could be a Piece Table. in essence you have your start and end markers, but also an index of which ranges in your buffer make up the output text. when you need to insert some text in the middle, you just stick the inserted text right at the end of the buffer, and insert its range into the index at the right spot, splitting the existing chunk into two ranges, “split” at the insertion point.
@aral @Moon i am not explaining it well, but the piece table advantage is that it’s very efficient to write to, and only slightly less efficient to read from if it gets too fragmented. mutations to a piece table would map well to being constructed from a log of operations.
CRDT might be overkill if you don’t intend for edits to be interleaved from multiple sources simultaneously
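something like this, as a minimal sketch (names made up, just to show the shape of it):

```js
// Piece table sketch: the buffer is append-only; the index lists the
// ranges that, read in order, reconstruct the current text.
const table = {
  buffer: 'the quick fox',
  pieces: [{ start: 0, length: 13 }]
}

// Insert text at a logical position by appending to the buffer and
// splitting the piece that covers the insertion point.
function insert (table, position, text) {
  const added = { start: table.buffer.length, length: text.length }
  table.buffer += text
  let offset = 0
  for (let i = 0; i < table.pieces.length; i++) {
    const piece = table.pieces[i]
    if (position <= offset + piece.length) {
      const split = position - offset
      const before = { start: piece.start, length: split }
      const after = { start: piece.start + split, length: piece.length - split }
      table.pieces.splice(i, 1, before, added, after)
      return
    }
    offset += piece.length
  }
}

// Read path: concatenate the ranges in index order.
function read (table) {
  return table.pieces
    .map(p => table.buffer.slice(p.start, p.start + p.length))
    .join('')
}

insert(table, 10, 'brown ')
console.log(read(table)) // 'the quick brown fox'
```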
@aral @Moon here’s a pretty good explanation:
https://darrenburns.net/posts/piece-table/
it’s for text editing, but with a bit of imagination i think you could see how it could apply here.
@zensaiyuki @Moon Haha, chips! Was just thinking about all this and I think I’ve reached a very similar conclusion. Forget stringification: let’s store and replay changes in an append-only log :)
@zensaiyuki @Moon (Sorry if there’s more to the article you linked to; only skimmed the intro. Bookmarked to read properly later.) :)
@zensaiyuki @Moon Right (no pun intended): so for my use case imagine an append-only log of object.x.y.z = 5; delete object.x.y; object.a = [1,2,3]; etc. // ;)
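i.e., as a rough sketch (the op shapes are just an assumption of what I might record, nothing final):

```js
// Sketch: the kind of ops such an append-only log might hold.
const log = [
  { op: 'set', path: ['x', 'y', 'z'], value: 5 },
  { op: 'delete', path: ['x', 'y'] },
  { op: 'set', path: ['a'], value: [1, 2, 3] }
]

// Apply a single op to the in-memory object graph.
function apply (root, { op, path, value }) {
  const key = path[path.length - 1]
  let node = root
  for (const segment of path.slice(0, -1)) {
    if (typeof node[segment] !== 'object' || node[segment] === null) node[segment] = {}
    node = node[segment]
  }
  if (op === 'set') node[key] = value
  else delete node[key]
}

const root = {}
for (const op of log) apply(root, op)
console.log(root) // { x: {}, a: [ 1, 2, 3 ] }
```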
@aral @Moon right, and if you’re trying to make the json serialisation of that efficient, you’d translate those into text editing ops on the piece table – and you could make that very efficient if your pieces are the intervals around each text token.
so, if you have e.g.
[1, “foo”, false], you’d get exactly that text in a buffer, then your index

start  end  length  type
0      1    1       arrayst
1      2    1       num
2      4    2       sp
4      9    5       str
9      11   2       sp
11     16   5       bool
16     17   1       arrayend
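and then an update like changing the 1 to 42 only touches that token’s interval – a rough sketch of the shape of it (names made up):

```js
// Sketch: the serialised text plus a token index over it; each token is
// an interval into the buffer, so updating a value only touches its interval.
const doc = {
  buffer: '[1, "foo", false]',
  tokens: [
    { start: 0, length: 1, type: 'arrayst' },
    { start: 1, length: 1, type: 'num' },
    { start: 2, length: 2, type: 'sp' },
    { start: 4, length: 5, type: 'str' },
    { start: 9, length: 2, type: 'sp' },
    { start: 11, length: 5, type: 'bool' },
    { start: 16, length: 1, type: 'arrayend' }
  ]
}

// Replace a token’s text piece-table style: append the new text to the
// buffer and repoint the token’s interval at it; nothing else moves.
function updateToken (doc, index, text) {
  doc.tokens[index] = { ...doc.tokens[index], start: doc.buffer.length, length: text.length }
  doc.buffer += text
}

// Read path: concatenate the token intervals in index order.
const serialise = doc =>
  doc.tokens.map(t => doc.buffer.slice(t.start, t.start + t.length)).join('')

updateToken(doc, 1, '42')
console.log(serialise(doc)) // [42, "foo", false]
```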
@clacke @zensaiyuki Yeah, and while I was originally inspired by object demarcation in CRDTs like WOOT/Logoot, @Moon just made me realise: why not use a CRDT / some sort of Kappa architecture to generate the object graph at initial read (which can afford to be slow, as it happens at server launch)… going to give it a think :)
@Moon Haha, funny you should mention it: I came up with the idea while thinking of the WOOT/Logoot family of CRDTs that use a virtual start/end marker (mentioned it in the tweet but not here: https://twitter.com/aral/status/1308750041784487946). What I’m trying to see is whether I can remove the serialisation step from updates. And, now that you mention it, a CRDT may actually be the right thing for this. It would sacrifice disk space for write speed (good), and since the initial read is at server startup, that also works :) Hmm…
@aral I'm curious to know what are some use-cases for this technique?
@pizza_pal I’m just looking at cutting out the expensive serialisation (stringification) step in persisting an in-memory JS database. But it’s interesting how that use case is also similar to what you need for replication… :)
@aral Yup, I think there is some sort of logical operation that is common to both, probly some ol' Charles S. Peirce, abductive reasoning kind of thing... eventual consistency and all that.
@aral Oh, just went back through the thread again, read the wikipedia about CRDT, I can see what the deal is now. Interesting, I was thinking about trying to make a collaborative text editor once.
I'm working on this personal project that's a static-site/knowledge-base thing, but with an interface to a semantic reasoner and constraint-solving engine. Serializing inferred axioms from big ontologies can be kinda slow.
@aral This is object prevalence. See http://prevayler.org/ 😉
@vertigo Neat, I did not know that term. Reading up on it :) https://en.wikipedia.org/wiki/System_prevalence
@aral This is what most databases do; it’s called a Write-Ahead Log (WAL). Like your database, Redis works purely in memory. It also offers AOF (append-only file) persistence logs, which it can restore from after a crash.
@aral Yes, that is what mdata does: after serialization of the delta, it defers the disk write to a FIFO buffer. It seems we're doing the same ;)
@aral i'm a fan of json-ld's use of "@id" as the property name for identifiers.
but i haven't seen any ndjson-like specs/standards for sending updates to existing objects. ought to be a thing. it might be so easy to do that no one wrote it up as a practice/thing.
@aral hey i found an existing spec yesterday! HOCON from the Lightbend folks handles a stream of delta updates to objects.
https://github.com/lightbend/config/blob/master/HOCON.md
it works by wrapping each object getting changed in a dictionary that names the object to be updated.
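e.g. (my reading of the spec – duplicate definitions of the same object merge, with later values winning):

```hocon
// HOCON merges duplicate object keys, so each later block acts as a
// delta update to the named object.
server { host: "localhost", port: 8080 }
server { port: 9090 }
// effective config: server = { host: "localhost", port: 9090 }
```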
@jauntywunderkind420 Thanks! But I’ve moved from JSON to JS… ;) This is where I’m going with it: https://mastodon.ar.al/@aral/104950158495643324
… But, of course, given that changes to the data structure in Node.js cannot conflict (they’re synchronous, in memory), what if we didn’t use JSON.stringify at all but simply streamed the history of all changes to persist them? At server start (where performance is not the bottleneck), the in-memory object graph would be recreated by playing back that history. And that should give us almost-instant writes in addition to almost-instant in-memory reads. Ooh, going to experiment with that now :)
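A first sketch of what that could look like (file name and op shapes are placeholders; a real version would need batching, an fsync policy, log compaction, etc.):

```js
// Sketch: persist every mutation as a line of JSON in an append-only log,
// and rebuild the in-memory object graph by replaying the log at startup.
const fs = require('fs')

const LOG_PATH = './data.log' // hypothetical location

// Write path: append the op; the in-memory graph is already up to date.
const logStream = fs.createWriteStream(LOG_PATH, { flags: 'a' })
function persist (op) {
  logStream.write(JSON.stringify(op) + '\n')
}

// Read path (server start): replay the history to recreate the graph.
function load () {
  const root = {}
  if (!fs.existsSync(LOG_PATH)) return root
  for (const line of fs.readFileSync(LOG_PATH, 'utf8').split('\n')) {
    if (!line) continue
    const { op, path, value } = JSON.parse(line)
    const key = path[path.length - 1]
    let node = root
    for (const segment of path.slice(0, -1)) {
      if (typeof node[segment] !== 'object' || node[segment] === null) node[segment] = {}
      node = node[segment]
    }
    if (op === 'set') node[key] = value
    else delete node[key]
  }
  return root
}

persist({ op: 'set', path: ['x', 'y', 'z'], value: 5 })
persist({ op: 'delete', path: ['x', 'y'] })
logStream.end(() => console.log(load())) // { x: {} }
```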