Anant Jhingran: The freebase folks do not reveal much about their scaling. The scaleout models for google and wikipedia (where partitioning/replication strategies work quite well) do not quite work in such a networked graph (after all, a query on person="anant" with one or two pointer chases would end up pinging a few nodes under any partition model), so the question is, if we have billions of pieces of information in a dense graph, how does the query load on the system scale?
I, too, have found precious little about the internals of freebase, and likewise I’m interested in the question at the end of the above paragraph. But this post is about the stuff in the middle.
For starters: what’s this about how “the scaleout models for google ... do not quite work in such a networked graph”? To me, the web is the quintessential networked graph, one that is massively partitioned, and yet PageRank™ seems to scale just fine.
A similar approach could conceivably work for Freebase. Data would be organized into pages, and then relations would be either embedded in, or attached to, these documents. Mining this data can be done via MapReduce jobs.
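To make that guess concrete, here is a minimal sketch in Python of a map/reduce pass over documents that embed their relations. Everything in it is an assumption for illustration: the `DOCS` layout, the relation names, and the tiny in-process `map_reduce` driver are hypothetical and say nothing about how Freebase or Google actually store or process data. The job inverts the embedded relations, so that for each target you can see who points at it.

```python
from collections import defaultdict

# Hypothetical documents: a primary key plus relations embedded in the document.
DOCS = {
    "anant": {"relations": [("works_at", "ibm"), ("knows", "sam")]},
    "sam": {"relations": [("works_at", "ibm")]},
    "ibm": {"relations": []},
}


def map_doc(key, doc):
    """Emit one (target, (relation, source)) pair per embedded relation."""
    for relation, target in doc["relations"]:
        yield target, (relation, key)


def reduce_links(target, values):
    """Collect every inbound relation for one target, in a stable order."""
    return sorted(values)


def map_reduce(docs, mapper, reducer):
    """Single-process stand-in for a distributed job: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, doc in docs.items():
        for out_key, value in mapper(key, doc):
            grouped[out_key].append(value)
    return {k: reducer(k, v) for k, v in grouped.items()}


inbound = map_reduce(DOCS, map_doc, reduce_links)
```

Because the mapper only ever looks at one document at a time, the documents can live on any number of partitions; only the grouping step needs to move data between them.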
Whether this guess is right or wrong or someplace in between, I’m continuing to see a pattern. One that Amazon’s Dynamo reinforces. What I am seeing is that the interesting thing isn’t the first two columns in this table. Or in the next three columns, or in the next five columns after that. Nor even in the next two columns. The most interesting thing may very well be the last column: memcached.
What do dynamo, memcached, Berkeley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and that metadata is either attached to, or embedded within, that data.
So the future is key-value lookups or MapReduce batch jobs, nothing in between? App developers writing boilerplate code to maintain indices?
With CouchDB, the vision is that there will be both temporary and persistent views, and both are defined by map and optional reduce jobs.
For persistent views, the output of map jobs will be stored and indexed. This demo and basura explore portions of these ideas... and anything that could be built on BDB could certainly be built on top of Dynamo.
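A rough sketch of what "stored and indexed" map output could look like, written in Python purely for illustration. The `PersistentView` class, its method names, and the `by_tag` map function are all hypothetical; this is not CouchDB's API or its storage format, just the idea of keeping map output sorted by key so that lookups and reductions never rescan the documents.

```python
import bisect


class PersistentView:
    """Keeps map output sorted by key; a toy stand-in for an on-disk index."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = []  # sorted list of (key, doc_id, value)

    def update(self, doc_id, doc):
        """Re-index one document: drop its old rows, then store fresh map output."""
        self.rows = [row for row in self.rows if row[1] != doc_id]
        for key, value in self.map_fn(doc):
            bisect.insort(self.rows, (key, doc_id, value))

    def query(self, start, end):
        """Return rows with start <= key <= end, using the sorted order."""
        first = bisect.bisect_left(self.rows, (start,))
        result = []
        for key, doc_id, value in self.rows[first:]:
            if key > end:
                break
            result.append((key, doc_id, value))
        return result

    def reduce(self, start, end, reduce_fn):
        """Apply the optional reduce step to the values in a key range."""
        return reduce_fn([value for _, _, value in self.query(start, end)])


def by_tag(doc):
    """Hypothetical map function: emit (tag, 1) for each tag on a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


view = PersistentView(by_tag)
view.update("post1", {"tags": ["dynamo", "couchdb"]})
view.update("post2", {"tags": ["couchdb"]})
view.update("post1", {"tags": ["dynamo"]})  # an edit replaces post1's old rows

couchdb_rows = view.query("couchdb", "couchdb")
tag_count = view.reduce("a", "z", sum)
```

The point of the design is that the app developer writes only `by_tag`; maintaining the index is the store's job, which is one answer to the boilerplate worry above.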
I disagree about memcached. The first thing I noted was that LAMP is the platform of choice. The second, but far more interesting thing I noted was the Poisson distribution of languages. Programming languages become incidental.
Sam Ruby on Key + Data: What do dynamo, memcached, Berkeley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and...
If you care about distributed systems, you need to read the paper about Amazon’s Dynamo. Comments: Making node joining/leaving an administrative command is not something most academics consider, but it significantly reduces complexity. We made a...
Amazon reveals its secret key-data overlords from the planet Cloud
Only the barest of glances at Dynamo so far, and by far the most interesting pieces are going to be how they do the scalable high availability, and of course we’re talking about “Werner Vogels Scalability(tm)”, but I was immediately struck, as Sam...
Sam Ruby - Key + Data: "What do dynamo, memcached, Berkeley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and...
the web is the quintessential networked graph, one that is massively partitioned, and yet PageRank™ seems to scale just fine.
That argument would be more convincing if PageRank were being run against the web directly, instead of the copy (presumably normalized, denormalized, or otherwise transformed in various ways for performance) that lives in Google’s datacenters.
But perhaps you’re suggesting that two or more copies of the data optimized for different purposes (analogous to OLTP/OLAP) should now be assumed?
“What do dynamo, memcached, Berkeley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www?”...
But perhaps you’re suggesting that two or more copies of the data optimized for different purposes (analogous to OLTP/OLAP) should now be assumed?
I think I’m suggesting that and more.
Data is often naturally partitioned. Not just for performance and reliability reasons, but for control reasons. Much of the data you want to query, you can’t control. There are also the pesky fallacies of distributed computing to deal with.
The solution is often pull and subscribe. That’s how your feed reader works, how the web works, and how Google works. When a given site that planet intertwingly subscribes to goes down, the data from the previous successful fetch is used.
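The fallback behavior described here can be sketched in a few lines of Python. The `FeedCache` class and its injected `fetch` callable are hypothetical names for illustration, not the code that runs planet intertwingly; the only assumption is that a failed fetch raises `OSError` (which is what `urllib` network errors derive from).

```python
class FeedCache:
    """Pulls from sources and remembers the last successful result for each."""

    def __init__(self, fetch):
        self.fetch = fetch    # callable: url -> content; raises OSError on failure
        self.last_good = {}   # url -> content from the previous successful fetch

    def pull(self, url):
        try:
            content = self.fetch(url)
        except OSError:
            # Source is down: serve the previous successful fetch, or None
            # if this source has never been fetched successfully.
            return self.last_good.get(url)
        self.last_good[url] = content
        return content
```

A real subscriber would pass in something like a wrapper around `urllib.request.urlopen`; injecting the fetcher keeps the sketch free of network access and makes the stale-data path easy to exercise.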
I could even see this working in an enterprise setting. Different departments running their own private servers, with a few common map/reduce jobs that contribute to an overall read-only view of the data. Note: that’s different from OLTP/OLAP, and the inverse of what you were suggesting: one copy of the data, contributing to a distributed implementation of a view.
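As a sketch of that enterprise setting, assume each department keeps its own records and runs the same small job locally, and only the per-department summaries are combined. The department names, record fields, and function names below are invented for illustration.

```python
from collections import Counter
from types import MappingProxyType

# Hypothetical per-department data; each list lives on that department's own server.
DEPARTMENTS = {
    "sales": [{"customer": "acme"}, {"customer": "globex"}],
    "support": [{"customer": "acme"}],
}


def local_summary(records):
    """Common job run by each department: map records to (customer, 1), reduce by sum."""
    counts = Counter()
    for record in records:
        counts[record["customer"]] += 1
    return counts


def overall_view(summaries):
    """Merge the per-department summaries into one read-only mapping."""
    total = Counter()
    for summary in summaries:
        total.update(summary)  # Counter.update adds counts rather than replacing
    return MappingProxyType(dict(total))


view = overall_view(local_summary(r) for r in DEPARTMENTS.values())
```

The raw records never leave their department; there is still one copy of the data, and the view is what gets distributed.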
Jeremy Zawodny: Key + Data - Key + Data: “What do dynamo, memcached, Berkeley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www?” Tags: links...
Many people are buzzing about Amazon’s Dynamo, and for good reason. But the buzz is almost dual in nature, because not only is it very cool technology, but also because of the real and perceived impacts on other architectural designs. After all,...
Future of Web Startups If there were real money in startups then everybody would be doing them. Key + Data The thing is, these systems aren’t databases at all. They are big distributed caching systems. But it’s not realistic to offer MapReduce as...
I could even see this working in an enterprise setting. Different departments running their own private servers, with a few common map/reduce jobs that contribute to an overall read-only view of the data.
Hmm. With the right sort of commodity infrastructure available, and some common integration patterns, this approach could lead to a drastic lowering of coordination costs. That would affect the ROI of post-M&A integration efforts, and it would shift the transaction cost boundary that defines the Coase ‘Nature of the Firm’ in ways that both lift the upper boundary on the size of corporations and reduce the need for hierarchical command-and-control within them, to the point that the largest corporations may become federated networks rather than feudal ones.