I hope you are reading this in good health, in the broadest sense of the word. The COVID-19 pandemic is challenging our society and economy in an unprecedented way. You may be sheltering in place with young kids who test the limits of your noise cancelling headphones. Or you may be all by yourself and relying on tools like Skype, WhatsApp, and the trusty old telephone to stay in touch with even your closest friends.
These complex times can encourage us to innovate – to find new ways to keep the kids busy, find nice places nearby for walks, and come up with new hobbies we can learn by reading articles and watching webinars. Many of us have become researchers, trying to uncover “stuff” that’s valuable to us, without knowing exactly what we’re looking for upfront.
Real researchers and other knowledge professionals do this almost every day, navigating available knowledge to find nuggets of information to drill into and finding inspiration for new experiments that eventually yield value.
About a month ago, I learned that the Allen Institute for AI had made a large dataset of scientific research articles related to COVID-19 and coronaviruses available free of charge. Their goal was to encourage and accelerate innovation to combat the pandemic by removing subscription and other license barriers that otherwise restricted access to this available knowledge.
The corpus gave us 44,000 scientific articles to read. That will, for sure, keep the kids busy until schools reopen (imagine how much fun your two-year-old would have with this and scissors!), and it will probably also encourage many to go for a walk to get away from such a dense pile of research. But it also presents a technology challenge, and that’s essentially why the Allen Institute published it. Putting this massive dataset online invites smart use of technology. And that’s exactly what we felt compelled to do.
For ten years now, I have worked with InterSystems Natural Language Processing technology, which takes a bottom-up approach to analyzing free text. What makes it unique is that it focuses on patterns of natural language; therefore, it doesn’t have to be an “expert” on any one subject or vocabulary. That means it’s unbiased and especially useful for looking at data you’re not familiar with to begin with, such as 44,000 scientific articles on coronaviruses.
So I set out and applied our Natural Language Processing tool, which is available as open source, to the Allen Institute’s corpus and published the result as a “content navigator” on Open Exchange, free to all.
What started as a quick sanity check proved to work quite well, and we decided to host our experiment as a resource for participants in MIT’s COVID-19 Challenge hackathon, where it was used by several teams. We’re now offering the code and a hosted version of this content navigator to anyone who would like to dig into this large knowledge repository. We are also actively looking for users who would like to take this a step further and embed this code into a solution, especially if it’s one that can be used to help end the pandemic and get us all out of the house again.