Can we build a ChatGPT for human biology?

Imminent bioconvergent tools and AI will decode the black box of biology
The breakthroughs of large language models like ChatGPT were made possible by converging massive computational power with large databases of human language. If we could train a similar AI model on biology, the language of our bodies, we could enhance our understanding of diseases and engineer superior therapeutics. However, this requires complex, rich, and thus far inaccessible datasets. To dive deep enough into the complexity of biology, our biotech toolset needs an update. Multidisciplinary collaboration in a precompetitive space is the best and quickest way to build these tools.
Training ChatGPT was a matter of scale. It required a tremendous amount of human knowledge to be digitized, labeled, and supplied as data. When it comes to biology, though, our access to data is limited by many variables we cannot yet observe: biochemical reaction circuits, sub-cellular organization, cellular dynamics, cell-to-cell interactions, and so on. Until we can measure these variables directly, human biology remains a black box. To train AI models that are useful for biology, we need to digitize human biology, and we need to do it on a massive scale.
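To make the analogy concrete, here is a minimal, purely illustrative sketch of the self-supervised recipe behind models like ChatGPT, applied to protein sequences as if they were sentences. It assumes PyTorch; the TinyProteinLM class, the toy sequence, and every parameter are hypothetical stand-ins, not an actual biological model.

```python
# Illustrative sketch only: treating protein sequences as "sentences"
# in the language of biology and training a tiny next-token language
# model on them, the same self-supervised recipe behind ChatGPT-style
# models. All names and numbers here are hypothetical.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard residues
stoi = {ch: i for i, ch in enumerate(AMINO_ACIDS)}

def encode(seq: str) -> torch.Tensor:
    """Map a protein sequence to a tensor of integer tokens."""
    return torch.tensor([stoi[ch] for ch in seq], dtype=torch.long)

class TinyProteinLM(nn.Module):
    """Embedding -> GRU -> logits over the amino-acid vocabulary."""
    def __init__(self, vocab=20, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                     # next-residue logits

# Toy training loop on a single made-up sequence fragment.
model = TinyProteinLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").unsqueeze(0)
for _ in range(100):
    logits = model(seq[:, :-1])                 # predict residue t+1 from 1..t
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 20), seq[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is the data dependency: the architecture is trivial to write down, but without massive, high-quality biological measurements there is nothing meaningful to feed it.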
However, this is exponentially more challenging than building the next large language model. Measured in bits, the black box of life is incredibly vast, and we have only decoded a small fraction of it by adopting modern nanotechnologies such as sequencing and gene-editing methods like CRISPR/Cas9. Therefore, if we want to read and write biology with ever-increasing bandwidth, we need to upgrade our toolset. For instance, we need new molecular microscopes to gain a deep understanding of what is happening inside cells. We also need micro-physiological systems on a chip, mimicking the environment of (parts of) living organs, that allow us to pre-test personalized drugs and treatments. These tools are only now becoming possible thanks to major advances in deep tech and biology.
Soon, these tools will usher in a new era for healthcare. For instance, high-quality data fed into AI models enables better predictions of which drugs are most likely to work for a specific patient. All of this fits a broader trend towards more preventive, predictive, and personalized healthcare. There will be no more 'one size fits all' diagnostics and medicines, because there will be far more insight into each individual patient's context and (genetic) disposition.
Although this approach offers highly valuable benefits to patients, it is not scalable today. To make that efficiency leap, higher screening throughput is crucial: evaluating millions of candidates in very little time by running an unprecedented number of tests in parallel on a scalable device.
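To illustrate that throughput argument in software terms, here is a minimal, hypothetical sketch: scoring two million placeholder candidates in a single vectorized pass, with a toy linear model standing in for a real trained predictor or assay. It assumes NumPy; every name and number is invented for illustration.

```python
# Hypothetical sketch of the throughput argument: scoring millions of
# candidate compounds in one parallel pass instead of one by one. The
# feature vectors and scoring function are placeholders, not a real
# assay or docking model.
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_features = 2_000_000, 32

# Placeholder descriptors per candidate (in reality: measured or
# computed molecular/cellular features, e.g. from lab-on-chip assays).
features = rng.normal(size=(n_candidates, n_features)).astype(np.float32)

# A toy linear "activity" model standing in for a trained predictor.
weights = rng.normal(size=n_features).astype(np.float32)

scores = features @ weights             # one parallel pass over all candidates
top = np.argsort(scores)[-100:][::-1]   # shortlist the 100 best-scoring hits
print(f"screened {n_candidates:,} candidates; best score {scores[top[0]]:.2f}")
```

In practice, the bottleneck is not the scoring code but producing a trustworthy measurement for each candidate, which is exactly what scalable, massively parallel lab-on-chip devices would provide.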
There is a burning question, though: who is going to fund the development of these massively complex and expensive tools? This is fundamentally a multidisciplinary project: development requires deep knowledge of biology, deep tech, and software/AI. The extensive investment costs also come with an uncertain return; after all, we are talking about data we cannot yet access, so its exact value is hard to predict. On top of that, the end users, the pharmaceutical and biotech companies interested in the rich datasets these tools will generate, are not in the business of selling such tools. Hence, they are unwilling to carry the cost or the risks all by themselves. There is also a strong need for standardization in order to reach a consistent, high level of data quality.
This has led to a substantial gap between what could conceivably be achieved with new tools and the tools actually available to the industry. The leap is too bold for any single company to make alone, and so promising innovations are stuck when they could be changing people's lives.
The need for huge investments, complex design and manufacturing processes, standardization challenges, and the urge to innovate across the entire ecosystem to lower the risks: this is a strikingly familiar scenario for me, working in the technology sector. Exactly the same circumstances pushed the semiconductor industry to cooperate: building a shared roadmap to keep pace with Moore's law across a variety of disciplines, working precompetitively across the entire ecosystem, even among longstanding rivals, and building shared intellectual property. It has proven to be a winning formula. The proof? The smartphone in your pocket is roughly a million times more powerful than the NASA computer that put the first man on the moon, and it is small, affordable, and power-efficient. The spectacular advances we are witnessing in AI today derive from this collaborative approach. If the challenges for the health industry are similar, why wouldn't the solutions be?
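That "million times" claim is easy to sanity-check with rough, publicly cited figures; the numbers below are order-of-magnitude approximations, not official specifications.

```python
# Back-of-the-envelope check of the "million times more powerful" claim.
# Both figures are rough public estimates, not exact specs.
agc_ops_per_s = 8.5e4    # Apollo Guidance Computer, ~tens of thousands of instructions/s
phone_ops_per_s = 1e11   # modern smartphone SoC, ~hundreds of billions of ops/s
print(f"ratio ~ {phone_ops_per_s / agc_ops_per_s:,.0f}x")   # on the order of 1,000,000x
```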
The life sciences industry needs a similar multi-partner, interdisciplinary collaboration model in which the various stakeholders of the ecosystem come together to co-develop these life sciences tools, share the risks, and combine technological and biological knowledge far beyond traditional organizational boundaries.
R&D centers and academic institutions can bring validated biology, assay, and technology know-how; pharma and biotech can bring application insights; and system integrators or ventures can commercialize a viable, scalable solution while ensuring end-user adoption.
It is no coincidence that imec, at the R&D heart of the tech industry and experienced with this type of collaboration model, aims to assemble such a multi-partner model for life sciences, rallying global pharmaceutical and biotech players. IP sharing is much less common in the life sciences industry and is often associated with revenue loss. But today, given the great complexity of the accumulated challenges ahead, collaboration will undoubtedly translate into success, and corresponding revenues, for those who take part. Recent discussions with major industry players indicate a growing willingness to give this type of collaboration a chance.
With generative AI reshuffling the deck, we have a hint of what is to come for health: a personalized, data-driven approach that is both affordable and efficient. But at the end of the day, much will depend on the quality of the data that feed the AI models. Will AI succeed in cracking the code of human biology and designing new molecules to treat diseases, just as ChatGPT generates new pieces of human language?
Maybe, but it will require a new, disruptive way of working that begins with a change of perspective across the entire life sciences ecosystem.
Provided by imec