Blockchain technology and ZK proofs may be the last missing puzzle pieces for creating trustless and secure decentralized machine learning networks. This complex symphony of technologies opens the door to accessing knowledge from private databases, such as hospital or government databases, without leaking information while still providing verifiable models and predictions.
Machine learning has become one of the most popular buzzwords in modern tech, followed by Blockchain and Web 3.0. Strip away the hype and there is a good reason for this. Machine learning, mainly in the form of deep learning, has taken full advantage of advances in hardware and opened a whole new world of use cases previously unreachable with purely deterministic algorithms. Average desktop computers still cannot train large deep neural networks efficiently, so training is often delegated to cloud services powered by large server farms. This approach has a problem: if the data owners lack the resources to generate complex ML models themselves, they cannot check the supplied model and must simply trust the cloud service that it is correct. Another pain point is the large volume of privately owned data. Medical data stored in hospitals would be highly valuable for training ML models and providing predictions, but it is not accessible to external users for privacy and safety reasons. The solution to both problems may be closer than expected, thanks to ZK proofs and Blockchain technology.
Extracting verifiable pieces of information while hiding the unwanted parts is a textbook use case for zero-knowledge proofs. Using zero-knowledge magic to generate verifiable subsets of credentials is not a new idea. Still, the existing approaches lack a standard that also covers the Blockchain use cases, and that opens up a new research topic.
This research aims to give clues and directions toward achieving ML model generation in a trustless environment and deriving predictions from models generated using private data. Solving these issues is not straightforward, but we can follow the path from the fundamental assumptions and see how far we can get. Let's start with the basics.
Zero-knowledge proofs convince a verifier that a specific computation was executed correctly without revealing the input values. The proof is verified against a public value, called a commitment, announced before the computation. A good example of a commitment value is a hash of the inputs. The ZK proof can then testify that the computation was performed correctly on the inputs whose hash was previously submitted as the commitment value.
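To make the commitment idea concrete, here is a minimal Python sketch of committing to inputs with a hash. It is not part of any ZK library: the function names `commit` and `opens_to`, the JSON encoding, and the salt are all assumptions made for illustration, and SHA-256 stands in for whatever hash a real proof system would use.

```python
import hashlib
import json


def commit(inputs: list[float], salt: bytes) -> str:
    """Return a hex SHA-256 commitment to the inputs.

    The salt is an assumption of this sketch: without it, small or
    guessable inputs could be recovered by brute force.
    """
    payload = json.dumps(inputs).encode("utf-8")
    return hashlib.sha256(salt + payload).hexdigest()


def opens_to(commitment: str, inputs: list[float], salt: bytes) -> bool:
    """Check that (inputs, salt) is a valid opening of the commitment."""
    return commit(inputs, salt) == commitment


if __name__ == "__main__":
    salt = b"hypothetical-random-salt"
    data = [0.5, 1.25, 3.0]
    c = commit(data, salt)                        # published before the computation
    print(opens_to(c, data, salt))                # the committed data matches
    print(opens_to(c, [0.5, 1.25, 3.1], salt))    # altered data does not
```

In a real system the verifier never sees the inputs or the salt. The ZK proof shows that the prover knows an opening of the commitment and ran the computation on it.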
Any computation can be represented as an arithmetic circuit. In theory, we can generate ZK proofs for arbitrary computations, including training ML models and giving predictions based on the trained models. However, practical constraints stand in our way. The steps of ML algorithms, and how many of them are executed, depend heavily on the input data, so a circuit has to be complex enough to cover every possible execution path. The circuits may become so complex that the generation time for the zkSNARK proofs exceeds what any practical application can tolerate.
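As a small illustration of what "representing a computation as an arithmetic circuit" means, the sketch below flattens y = x³ + x + 5 into addition and multiplication gates over a prime field and checks that a set of wire values satisfies every gate. The field modulus, the wire names, and the helper functions are assumptions chosen for this example; real proof systems use their own fields and constraint formats.

```python
P = 2**61 - 1  # a Mersenne prime, used here only as an illustrative field modulus

# Each gate is (operation, left input, right input, output wire).
# Inputs are wire names or integer constants.
CONSTRAINTS = [
    ("mul", "x", "x", "v1"),   # v1 = x * x
    ("mul", "v1", "x", "v2"),  # v2 = v1 * x
    ("add", "v2", "x", "v3"),  # v3 = v2 + x
    ("add", "v3", 5, "y"),     # y  = v3 + 5
]


def build_witness(x: int) -> dict[str, int]:
    """Evaluate the circuit gate by gate, recording every wire value."""
    w = {"x": x % P}
    for op, a, b, out in CONSTRAINTS:
        left = a % P if isinstance(a, int) else w[a]
        right = b % P if isinstance(b, int) else w[b]
        w[out] = (left * right if op == "mul" else left + right) % P
    return w


def satisfies(w: dict[str, int]) -> bool:
    """Return True if the wire values satisfy every gate of the circuit."""
    for op, a, b, out in CONSTRAINTS:
        left = a % P if isinstance(a, int) else w[a]
        right = b % P if isinstance(b, int) else w[b]
        expected = (left * right if op == "mul" else left + right) % P
        if expected != w[out]:
            return False
    return True


if __name__ == "__main__":
    witness = build_witness(3)
    print(witness["y"], satisfies(witness))
```

This circuit has a fixed number of gates. An ML training loop whose iteration count depends on the data has no such fixed shape, which is the practical difficulty described above.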
Another severe limitation is the computational power of the Blockchain. EVM smart contracts can efficiently verify only zkSNARK proofs, specifically those generated with Groth16 and, in some cases, PLONK.
It may look like a dead end, but there is another way of looking at the problem. Even though our ML algorithms may be complex and highly dependent on inputs, they are executed on CPUs with a relatively small set of deterministic instructions. Instead of modeling the arithmetic circuit of one specific problem, we could simulate the program execution step by step. More generally, if we could prove that the state changes created by executing program instructions are correct, we could "easily" prove arbitrary computations.
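The idea of checking state changes rather than a problem-specific circuit can be sketched with a toy machine. Everything here is hypothetical: the three-instruction set, the `State` tuple, and the function names are invented for illustration and do not correspond to any real VM. The point is that `check_trace` only ever checks one small transition rule, however long the program runs.

```python
from typing import NamedTuple


class State(NamedTuple):
    pc: int   # program counter
    acc: int  # single accumulator register


# Hypothetical toy instruction set: (opcode, operand)
Program = list[tuple[str, int]]


def step(program: Program, s: State) -> State:
    """Apply the instruction at s.pc and return the next machine state."""
    op, arg = program[s.pc]
    if op == "ADD":
        return State(s.pc + 1, s.acc + arg)
    if op == "MUL":
        return State(s.pc + 1, s.acc * arg)
    if op == "JNZ":  # jump to address `arg` if the accumulator is non-zero
        return State(arg if s.acc != 0 else s.pc + 1, s.acc)
    raise ValueError(f"unknown opcode {op}")


def run(program: Program, start: State) -> list[State]:
    """Execute until the program counter leaves the program; return the trace."""
    trace = [start]
    while trace[-1].pc < len(program):
        trace.append(step(program, trace[-1]))
    return trace


def check_trace(program: Program, trace: list[State]) -> bool:
    """Verify that every consecutive pair of states is a valid transition."""
    return all(step(program, a) == b for a, b in zip(trace, trace[1:]))


if __name__ == "__main__":
    countdown = [("ADD", -1), ("JNZ", 0)]  # loop until acc reaches 0
    trace = run(countdown, State(0, 3))
    print(len(trace), check_trace(countdown, trace))
```

The trace length depends on the input (the starting accumulator value), yet the per-step check stays the same.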
This sounds like a job for Cairo or Noir, specialized programming languages used to write programs with verifiable execution traces. The execution trace can then be used as input for generating a proof of correct execution. This approach sounds promising, but there is a catch: the proofs of the two systems are not verifiable on EVM blockchains (at least not using smart contract methods).
There are, however, ways to generate STARK proofs of the program execution and then use those proofs as inputs for SNARK provers. The SNARK provers can prove that the STARK proof was correctly generated, and SNARK proofs can be verified on-chain. This approach is used in the design of Polygon zkEVM. To summarize what we have so far: we have the tools for building provers and verifiers for ML algorithms. Next stop: decentralization.
Let's assume that we have a decentralized network of nodes capable of generating and verifying ZK proofs for the execution of ML algorithms. If node A requests that node B train a new ML model over some data, the flow is the following:
Technically, we have everything we need, but the incentives are a bit off. Node A is incentivized to offload hard work to node B, but node B is not incentivized to perform the computations. Let's consider some monetary incentives and see how that plays out.
Node B only wants to work if it gets paid, so node A offers some money for the service. If paid up front, node B can take the money and disappear, which is certainly not what node A wants. To prevent that, node A needs to commit to a reward at the start, but node B should receive the reward only if the execution proof is correct.
A blockchain would make a good escrow where node A can lock some money and node B can unlock the payment by providing the execution proof. To enable this scheme, node A would first commit to the data by writing a data hash on the blockchain and then send the data to node B. Node B would check that the hash matches the provided data and refuse the job if it does not. If the hash matches, node B generates the model over the given data and submits a proof, linked to the committed data hash, to the blockchain.
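The escrow flow described above can be simulated off-chain in a few lines of Python. This is a sketch under stated assumptions, not a smart contract: the `Escrow` class, its method names, and the job bookkeeping are hypothetical, and `toy_prove` / `toy_verify` are placeholders that only mimic the interface of a prover and verifier. They provide no cryptographic guarantees.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


def data_hash(data: bytes) -> str:
    """Commitment that node A writes on-chain before sending the data."""
    return hashlib.sha256(data).hexdigest()


def toy_prove(commitment: str) -> bytes:
    """Placeholder for proof generation; NOT a real ZK proof."""
    return b"proof-for:" + commitment.encode("ascii")


def toy_verify(proof: bytes, commitment: str) -> bool:
    """Placeholder for on-chain verification; NOT a real ZK verifier."""
    return proof == toy_prove(commitment)


@dataclass
class Escrow:
    verify_proof: Callable[[bytes, str], bool]
    balances: dict[str, int] = field(default_factory=dict)
    jobs: dict[int, dict] = field(default_factory=dict)

    def open_job(self, job_id: int, requester: str, commitment: str, reward: int) -> None:
        """Node A locks the reward and publishes the data commitment."""
        if self.balances.get(requester, 0) < reward:
            raise ValueError("insufficient funds")
        self.balances[requester] -= reward
        self.jobs[job_id] = {"commitment": commitment, "reward": reward, "paid": False}

    def submit_proof(self, job_id: int, worker: str, proof: bytes) -> bool:
        """Node B is paid only if the proof verifies against the commitment."""
        job = self.jobs[job_id]
        if job["paid"] or not self.verify_proof(proof, job["commitment"]):
            return False
        job["paid"] = True
        self.balances[worker] = self.balances.get(worker, 0) + job["reward"]
        return True


if __name__ == "__main__":
    escrow = Escrow(verify_proof=toy_verify, balances={"A": 100})
    data = b"training data sent from A to B"
    commitment = data_hash(data)
    escrow.open_job(1, "A", commitment, reward=40)
    # Node B first checks that the received data matches the on-chain hash.
    if data_hash(data) == commitment:
        print(escrow.submit_proof(1, "B", toy_prove(commitment)), escrow.balances)
```

The sketch covers only the payment side. It does not deliver the model to node A.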
Even if the blockchain can verify the proof and transfer the money to node B, node B may then refuse to send the model to node A, so node A needs to receive the model before payment. But if node B sends the model parameters to node A first, node A may falsely claim that node B never sent them and cancel the payment. It may seem that we are hitting a wall, again.
From the perspective of an EVM blockchain, we are probably stuck. But what if we use a non-EVM blockchain where we can have custom transactions and custom block parameters? If we could put all model parameters on-chain, we would have a simple trustless scheme:
The main issue is the costly storage on EVM blockchains, but a blockchain that allows custom transactions, like Cosmos, can be used as a special-purpose blockchain that we need here.
What about predictions from models trained on private data? Fortunately, the situation is much easier here, because the nodes that generate the model belong to the same governance as the data and already have access to it. For example, a hospital node may generate a data commitment hash and store it on-chain without revealing any data. An external node may request a classification prediction for a patient with given symptoms, based on the model trained on the hospital's hidden data. The flow would be the following:
The proposed protocol requires us to think outside of EVM blockchains and construct specialized blockchains for the specific purpose of ML. This decision comes naturally, as we need the following:
EVM blockchains meet all of these requirements, but their limits on cost and transaction size leave us no choice but to look outside the EVM space and create custom chains that allow larger transactions.
Let's investigate the possibilities of such a decentralized ML system. All executions are verifiable, and the money reaches its target account only when all conditions are satisfied. This setup is fertile ground for commercial use cases. As we mentioned in the introduction, average users do not own hardware powerful enough to train huge neural models efficiently, but parallel training on multiple nodes, training simpler models on private data, and training ensemble models are all within their reach.
This practically means that average users can offer their available hardware resources to train ML models, run predictions on local data and earn passive income. Private data silos can provide verifiable predictions from the models trained on their private data without any data leaks, monetizing their data and making huge quantities of data available to research and the business community.
Got any questions or comments? Find us on Twitter @3327_io