When Amazon’s Greg Linden applied for a patent for “item-to-item collaborative filtering” in 1998, he couldn’t possibly have imagined the societal impact it would have. Previously, Amazon had recommended products based on an individual's own purchase history, an approach that was inefficient because it surfaced items nearly identical to what the customer had already bought. In fact, this first method was so ineffective that Amazon was almost better off having its team of book critics pick what to promote on the landing page. Linden’s technique was revolutionary because it made associations across products and customers: if there was a correlation between buying a bike and buying sneakers, then a customer who bought a bike would be shown sneakers as well. Today this seems obvious, but the technique is responsible for turning the collection, buying, and selling of data into an industry worth roughly $100 billion a year.
The practice of using ‘Big Data’ to inform business may have started with Amazon, but now almost every company with an online presence uses data in a variety of ways. Walmart uses correlations between products bought together (and even between the weather and what gets bought) to decide how to organize its stores and boost sales. Google records user search data and sells advertisers the ability to decide exactly what to show you based on your interests. At its most extreme, this practice is known as surveillance capitalism; Harvard professor and renowned author Shoshana Zuboff, one of the harshest critics of intrusive data collection, defines the web2 business model as the “unilateral claiming of private human experience as free raw material for translation into behavioral data.” New laws aimed at protecting consumers from data leaks and hacks, like the EU General Data Protection Regulation that took effect in 2018, are a good step toward improving privacy but do not address the root issue: companies are incentivized to collect data because it is profitable, and they will continue to do so unless it stops being profitable.
The importance and abundance of data will only continue to grow; future tech strategist Dr. Mark van Rijmenam predicts that by 2035 “the average person will interact with a connected device every 18 seconds”, meaning there will be a continuously increasing amount of data available for companies to monetize. However, as Dr. van Rijmenam also notes, there is growing demand from consumers to maintain their privacy. Breaches like the one Equifax suffered in 2017 (in which the credit data of 143 million Americans was exposed) hurt profits, tarnish a company's image, and increase consumer distrust. A solution that increases security and individual privacy during data transfers is needed, but it won’t see widespread adoption unless it still allows companies to monetize their data profitably.
Currently, the global distribution of data between companies is handled by specialized, centralized data marketplaces, like DataStreamX, which connect owners of data to buyers. Companies buy data because analyzing it helps them make better decisions and improve the quality of their services. For example, Facebook buys data from 150 companies so that its targeted advertisements can be more accurate; this provides revenue for the companies that sell the data and allows Facebook to charge more for better advertising, a net benefit to all parties. However, transferring data between companies and countries through centralized processes like this increases risk because it multiplies the points of failure and the number of people who get access to the data.
Different types of data are available on various platforms.
Data flows through many different marketplaces and between companies themselves; Acxiom, Nielsen, Experian, Equifax, and CoreLogic all gather data on millions of people, but each company specializes in the types of data it holds. Acxiom focuses on consumer behavior, and the data it sells is used in 12% of targeted marketing sales in the US each year. CoreLogic collects data on property transactions and sells it to real estate companies and banks so they can better determine mortgage rates. The average consumer has very little knowledge of where their information goes and even less control over it. Corporations that want this data for any reason can buy it directly from these big data brokers, with no transparency for the consumer and in an environment prone to hacks and data leaks.
Decentralized data marketplaces use blockchain to vastly improve security and privacy when connecting potential buyers and sellers of data. Entities such as companies, data unions, or individuals publish datasets on these marketplaces while maintaining complete control over who gets access to the data. Decentralized systems are uniquely powerful for facilitating data transactions because their inherent transparency increases user trust and their security model makes large-scale hacks and data leaks far less likely. The technology is still in its beta stage, and there is much work to be done to increase user traction, but startups like the non-profit Ocean Protocol have shown that it can work and benefit users.
The most well-known decentralized data marketplace, Ocean Protocol, uses ERC-20 datatokens and ERC-721 “data NFTs” to create an ecosystem in which users can grant access to datasets in exchange for tokens on Ethereum. The tokenization of data is important because it makes data more interoperable with the rest of the crypto and DeFi space. The NFTs represent unique assets and act as a copyright-like claim over a particular dataset, meaning ownership can never be compromised when sharing data. Each dataset is also given one or more fungible datatokens; holding a datatoken grants access to a particular service, so sellers grant access to their dataset by sending a datatoken to a buyer, as the toy model below illustrates.
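To make the mechanism concrete, here is a minimal sketch, in plain Python rather than Ocean's actual smart-contract code, of the rule datatokens enforce: holding at least one whole token for a dataset is what grants permission to consume it. The dataset IDs, addresses, and balances are hypothetical.

```python
# Toy model (not Ocean's code) of how datatokens gate dataset access:
# holding at least one whole token for a dataset grants permission to consume it.
datatoken_balances = {
    "dataset_A": {"0xBuyer1": 1.0, "0xBuyer2": 0.0},  # hypothetical balances
}

def has_access(dataset_id: str, buyer: str) -> bool:
    """Access is granted when the buyer holds >= 1 datatoken for the dataset."""
    return datatoken_balances.get(dataset_id, {}).get(buyer, 0.0) >= 1.0

print(has_access("dataset_A", "0xBuyer1"))  # True  -> can consume the dataset
print(has_access("dataset_A", "0xBuyer2"))  # False -> must first buy a datatoken
```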
When a seller publishes a dataset on Ocean, this amounts to an “Initial Data Offering”: Ocean automatically creates a unique ERC-20 datatoken and ERC-721 data NFT to correspond to the new data asset. The seller can fix the price for access or let Ocean Market’s Automated Market Maker (powered by Balancer) discover the price automatically. Balancer provides the proof-of-liquidity service, allowing anyone to stake OCEAN (Ocean’s native token) in a “datatoken AMM pool” corresponding to a specific dataset; the key aspect of proof-of-liquidity is that staking OCEAN increases the liquidity in the system because OCEAN is tied to the value of each dataset's unique token.
The more OCEAN that is staked in a specific AMM pool, the higher the price to access the dataset. Individuals are incentivized to stake OCEAN in good datasets because they earn a percentage every time a transaction involving the dataset occurs; the more OCEAN staked, the higher the fee per transaction and the more they earn. Therefore, stakers are also curators because “the amount of stake is a proxy to dataset quality.” Essentially, individuals will try to pick good datasets in which to stake because they are incentivized to make money, and in turn, the quality of datasets will be collectively determined as more OCEAN flows into better datasets.
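The link between staked OCEAN and price follows from the weighted-pool pricing formula that Balancer-style AMMs use, spot price = (B_in / W_in) / (B_out / W_out), ignoring swap fees. The sketch below uses made-up balances and weights purely to show the direction of the effect; it is not Ocean Market's actual pool configuration.

```python
# Simplified Balancer-style weighted-pool pricing (swap fees ignored).
# Quotes the datatoken's price in OCEAN; all figures below are illustrative.
def spot_price(ocean_balance, ocean_weight, datatoken_balance, datatoken_weight):
    """Price of one datatoken, denominated in OCEAN."""
    return (ocean_balance / ocean_weight) / (datatoken_balance / datatoken_weight)

# A pool holding 10,000 OCEAN against 100 datatokens (hypothetical 90/10 weights):
print(spot_price(10_000, 0.9, 100, 0.1))   # ~11.1 OCEAN per datatoken

# If curators stake more OCEAN into the same pool, the quoted price rises:
print(spot_price(50_000, 0.9, 100, 0.1))   # ~55.6 OCEAN per datatoken
```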
Using blockchain to power a fair marketplace for data is fascinating by itself, but the most interesting potential utility of Ocean lies in its compute-to-data technology. This integration with the existing marketplace allows potential buyers to train algorithms on datasets that remain on the owner's premises: users can train an AI or run an algorithm on a dataset but never gain access to the individual data points. The transparent nature of blockchain prevents buyers from running any algorithm the seller has not approved and ensures the data is never compromised. Ocean gives sellers the option to let potential buyers see sample data from the dataset, so buyers know how much ‘cleaning up’ of the data they will need to do. Sellers are also encouraged to clean the data themselves, because people will stop buying it if it is not useful. Additionally, most of the data that compute-to-data would first be used for is private and therefore well curated, so buyers generally don’t have to worry about cleaning it before they use it.
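Conceptually, the flow looks something like the sketch below. This is not Ocean's implementation, just an illustration of the principle: the raw records stay on the seller's premises, the buyer can only invoke algorithms on a seller-approved allow-list, and only the aggregate output leaves. The dataset, algorithm name, and allow-list are all hypothetical.

```python
# Conceptual sketch of compute-to-data: raw records never leave the seller's
# environment; the buyer only receives the output of an approved algorithm.
APPROVED_ALGORITHMS = {"mean_age"}  # hypothetical allow-list set by the seller

PRIVATE_DATASET = [  # stays on the seller's premises; made-up records
    {"age": 34, "diagnosis": "A"},
    {"age": 51, "diagnosis": "B"},
    {"age": 47, "diagnosis": "A"},
]

def mean_age(records):
    """An aggregate-only computation: returns a single statistic, not the rows."""
    return sum(r["age"] for r in records) / len(records)

def compute_to_data(algorithm_name: str):
    """Run a seller-approved algorithm where the data lives; return only the result."""
    if algorithm_name not in APPROVED_ALGORITHMS:
        raise PermissionError("Seller has not approved this algorithm")
    result = {"mean_age": mean_age}[algorithm_name](PRIVATE_DATASET)
    return result  # only the aggregate leaves the premises

print(compute_to_data("mean_age"))  # 44.0
```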
Compute-to-data is groundbreaking because it unlocks access to otherwise very private and sensitive data, such as health records. The existing healthcare system doesn’t allow hospitals to share patient records, but pooling that information could have massive benefits: if every hospital in the world shared its patient data, doctors would have a much better idea of the correlations between symptoms and early indicators of disease, potentially saving millions of lives. Unfortunately, this is currently impossible because sharing the data directly would jeopardize patient information and could lead to hacks in which millions of individuals' private health records are exposed.
Compute-to-data addresses the root issue - the fact that our data is being collected - because consumers can choose whether to make their data available in the first place. Therefore, it offers a path forward to a more secure and private society where individuals will benefit. But there is no easy solution for our data privacy problem; it is completely unrealistic to expect companies to stop recording and selling our data when this has become the dominant business model for web2.
Even if consumers continue to express their dislike of this system and governments increase regulation (like the European GDPR), companies will still be incentivized to collect data because it is so profitable. This is why the first layer of compute-to-data is important; it offers a chance for companies to continue to monetize their data without giving up any of their customers' personal information. Additionally, GDPR restricts personal data from being transferred across borders. This is an obstacle for many companies that want to buy user data, but Ocean allows access to data without moving it off-premises, helping buyers stay compliant with the regulation.
There’s still a ways to go to achieve this secure, decentralized method of sharing data, and the first step is major adoption of blockchain-based marketplaces by brokers and buyers of data. As of this writing, Ocean's transaction volume is relatively low. However, judging by Ocean's grants platform, there appear to be at least a dozen active teams building on Ocean, and perhaps many more. There was a spike in usage after V3, the latest iteration, was released, followed by a decline due to issues with the marketplace's staking feature and a rise in Ethereum gas fees. V4, which is still in beta, promises to guarantee safe staking and to integrate Ocean with other networks that have low gas fees, like Polygon and Binance Smart Chain.
Beyond perfecting the system, the founder of Ocean, Trent McConaghy, sees a few traction hypotheses that could lead to a massive increase in the number of users. The most important in the near term, and the go-to method for web3, is incentivization through liquidity mining, which Ocean will release shortly after V4. In the longer term, Ocean is working to catalyze people to create their own projects on the platform through its grant program and other initiatives. This has created a lively and passionate community, which is essential for the long-term success of the new data economy.
Trent recently outlined how Ocean’s technology can be used for applications beyond this first level as well: Ethereum DAOs can power data unions that pay royalties to individuals for the use of their data, and further along, compute-to-data could be integrated with decentralized social media platforms. Data unions, like Swash, are DAOs that allow individuals to monetize their own activity online using the principle of collective bargaining. Users download a browser extension that collects their cookies when they visit a website. This cookie data is then packaged with other DAO members' data and listed on Ocean as a dataset. When a buyer pays to access the dataset, all members of the data union receive royalties.
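As a rough illustration of that collective-bargaining payout, here is a toy sketch of how a payment for the pooled dataset might be split among members pro rata. It is not Swash's actual mechanism, and the member names, contribution counts, and payment amount are made up.

```python
# Toy sketch of a data union's revenue split (not Swash's actual mechanism):
# a buyer's payment is divided among members in proportion to how much data
# each contributed. All figures are made up.
contributions = {"alice": 120, "bob": 60, "carol": 20}  # e.g. records contributed

def distribute_royalties(payment: float, contributions: dict) -> dict:
    """Split a payment pro rata across union members."""
    total = sum(contributions.values())
    return {member: payment * amount / total for member, amount in contributions.items()}

print(distribute_royalties(100.0, contributions))
# {'alice': 60.0, 'bob': 30.0, 'carol': 10.0}
```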
Swash only collects cookie data, but it is possible that in the future, data unions will exist for all types of private user data. DataUnion acts as a kind of Shopify for the space, letting anyone build a new data union on its platform. Organizations like these are helping to build the new data economy, and if they succeed, it may not be too long before data is gathered ethically and users truly benefit.
There are many reasons both consumers and businesses should be drawn to decentralized data marketplaces, and several forces pushing them there: government pressure, hacks, and changing consumer habits driven by greater awareness of how data is abused. Realistically, the only way companies will abandon the existing business model is a combination of new laws that prevent them from using data the way they do now and consumers hurting their profits by choosing other services. As more and more data is collected, the existing centralized system will continue to be exposed for its lack of security and privacy, driving consumers to seek a new model.
Awareness continues to grow about Ocean, data unions, and the utility of blockchain in general. Consumers will ultimately decide whether the existing data business model gets replaced by decentralized data marketplaces, but for people to choose products like Ocean and Swash, the interface needs to be friendly and users need a decent understanding of, and trust in, blockchain technology. It may take some time to manifest, but the benefits of decentralization for the individual are undeniable: they offer the chance for a future where privacy still exists.
Alexander Liptak is a student of the History of Science and a web3 investor and enthusiast. He is interested in the societal impact of emerging technologies and how these will transform the human experience; his research explores the new definition of privacy and the importance of data in the global digital revolution.
Special thanks to Trent McConaghy and Roman Ugarte for their guidance and thoughtful feedback in the process of producing this piece.
Disclaimer: Harvard Blockchain and the author of this piece are not financial advisors. Nothing contained in this research piece should be construed as investment advice.