Ignacio Toledo, Data Analyst, Data Science Initiative Lead
Tomás Staig, Software Development Lead, Data Science Initiative Lead
Rosita Hormann, Software Engineer
Jorge García, Science Archive Content Manager
Jose Luís Ortiz, Technical Lead - Digital Systems
Mark Gallilee, Technical Lead - Mechanics
Sergio Pavez, Software Engineer
Takeshi Okuda, Senior Instrument Engineer
Gastón Velez, Systems Administrator
Maxs Simmonds, Technical Lead and Deputy - Archive and Pipeline Operations
Jorge Ibsen, Head of the Department of Computing
The Atacama Large Millimeter/submillimeter Array (ALMA) is an international partnership of the European Southern Observatory (ESO), the U.S. National Science Foundation (NSF) and the National Institutes of Natural Sciences (NINS) of Japan, together with NRC (Canada), NSC and ASIAA (Taiwan), and KASI (Republic of Korea), in cooperation with the Republic of Chile. ALMA - the largest astronomical project in existence - is a single telescope of revolutionary design, composed of 66 high-precision antennas located on the Chajnantor plateau, at 5,000 meters altitude in northern Chile.
As recently as 15 years ago, most ground-based observatories were small facilities producing data for astronomical research, bearing more resemblance to laboratories than to industries. Since the beginning of the 2000s, however, more complex and ambitious observatories have been built, with multi-million-dollar budgets.
A major issue emerged: these facilities could not be operated by a staff of 5 to 10 people, with one or two astronomers coming on-site to run their own experiments. As institutions, today's big astronomical observatories have become gigantic "data industries", producing terabytes (and soon petabytes) of data every year to power scientific research.
ALMA requires a staff of more than 300 people, and for every 4,300 hours of useful scientific data it delivers from our skies in a given year, roughly the same amount of time must be spent on maintenance and upgrades. That includes hardware components such as the "radio interferometer" (a virtual telescope made of 66 antennas and two giant computers that combine their signals) and the software systems used to collect and process the data, but also monitoring power supplies and weather conditions to ensure that observations are performed at a sufficient level of quality. In short, the volume of data from observations has increased, along with the number of variables to consider to operate an observatory correctly.
Yet, we didn’t have the proper tools and processes to make sense of this new data. While we asked ourselves questions, we did not have the ability to provide quick and efficient answers. For instance, we once received an avalanche of problem reports from a particular hardware component, which became of critical importance as it impacted the quality of the observations performed. We began analyzing the number of successful hours observed that month with this particular component - as it turned out, it was the most productive month ever for that component! This seemed contradictory, but the component registered more problems simply because it was used much more.
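The lesson generalizes: raw problem counts are misleading without usage context. A minimal sketch of the idea in pandas, with entirely made-up numbers, would normalize reports by hours of use:

```python
import pandas as pd

# Hypothetical monthly figures for one hardware component:
# raw problem reports alongside the hours the component was in use.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "problem_reports": [4, 6, 18],
    "usage_hours": [100, 140, 520],
})

# Reports per 100 hours of use: the fair comparison metric.
df["reports_per_100h"] = df["problem_reports"] / df["usage_hours"] * 100

print(df)
```

With these illustrative numbers, the month with the most raw reports (March) actually has the lowest failure rate per hour of use.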
Answering this only required a simple data analysis, but we did not get there sooner because we lacked the tools and infrastructure to query and parse the data, clean it, and enrich it with other data sources. This lack of efficient analytical tools for system diagnostics pushed us to look for them outside the organization. Enter Dataiku, and the Ikigai program, which gives free licenses to nonprofit organizations.
With Dataiku, we’re building an infrastructure that allows the observatory’s staff to take their analytical work to the next level through:
1. Giving everyone access to the relevant data sources
Our databases were previously only accessible by astronomers processing data for scientific research. As the central data science platform, Dataiku enables our whole organization to participate in the analytical process and find answers for their day-to-day work. For instance, engineers and data analysts can now access the CMMS*, Jira tickets, and log files from a data warehouse populated using the ETL and data preparation capabilities provided by Dataiku, and they can enrich their analysis by joining and correlating data that was previously difficult to access and analyze.
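As an illustration of that enrichment step - all table and column names here are hypothetical - correlating maintenance records with issue tickets can boil down to a join per component, the kind of operation a visual recipe or a short script expresses in a few lines:

```python
import pandas as pd

# Hypothetical extracts from the data warehouse (names are illustrative).
cmms = pd.DataFrame({
    "component_id": ["ANT-01", "ANT-02", "ANT-03"],
    "work_orders": [5, 2, 7],
})
jira = pd.DataFrame({
    "component_id": ["ANT-01", "ANT-03"],
    "open_tickets": [3, 1],
})

# Join maintenance work orders with issue tickets per component,
# keeping components that have no open tickets (left join).
enriched = cmms.merge(jira, on="component_id", how="left")
enriched["open_tickets"] = enriched["open_tickets"].fillna(0).astype(int)

print(enriched)
```

The left join preserves every component from the maintenance system, so gaps in the ticketing data show up as zeros rather than silently dropping rows.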
2. Enabling them to upskill through integration with a broad technology stack
Dataiku provides a visual interface to enable all technical levels to collaborate, while integrating with most current technologies to facilitate upskilling - for instance, learning a bit of SQL to query the data in various ways. The resources provided in the Dataiku Academy, as well as the Community platform where anyone can get quick answers from other users and experts (thanks to fellow Neurons!), are highly valuable for everyone to gain new knowledge.
3. Providing ways to leverage more advanced techniques, incl. machine learning
Dataiku also provides ways for even less technical staff to make a foray into machine learning, thanks to its user-friendly AutoML features and the visual interface showing (and explaining) the most relevant performance indicators of different models - also conveniently summarized in the models competition page!
4. Easily presenting insights with user-friendly data visualization capabilities
Anyone on our staff is able to perform exploratory data analysis, thanks to visual features and a drag-and-drop charting interface - and those willing to code can go deeper at their own pace. Presenting final results is also highly accessible, with dashboards composed of tiles that centralize insights from other parts of the project in just a few clicks.
5. Giving guidance and resources to onboard and enable everyone in the organization
Lastly, Dataiku has been key to easily onboarding new users and making them realize the value of data insights. We’ve developed a Working Group with members of the Software, Engineering, and Science teams, with the mission to train new users and propagate best practices. We’re leveraging content from the Dataiku Academy, and are highly involved with the Community platform where any user can go to ask questions and share knowledge.
We’re also currently leading a hands-on challenge in which volunteer users give their time and expertise to bring a valuable contribution to ALMA by seeking to automate quality assurance assessment. Ever more people, internally and externally, are collaborating in Dataiku to advance the ‘search for our cosmic origins’!
*A computerized maintenance management system (CMMS), also known as a computerized maintenance management information system (CMMIS), is a software package that maintains a database of information about an organization's maintenance operations.
Today, the ALMA Observatory is one of the first ground-based observatories, if not the first, to use data science, machine learning, and automation to improve its operations.
By bringing people together on a single platform, Dataiku helped grow general awareness of data analytics and of making decisions based on the information the data produces. The value of analytical work is now broadly recognized across the organization, triggering fruitful cross-functional collaborations between various profiles - astronomers, but also analysts, archive managers, software engineers, system engineers, etc.
This has led to many wins across the organization, where Dataiku replaces old processes and improves efficiency: saving time and resources in building and maintaining data projects, and optimizing operations through automation, machine learning, and easy monitoring, among other features.
For instance, the data management team needs to track observation times against those requested, and to create indicators that identify problems which might hinder the delivery of observation data to the scientific community. Building such a tracking tool formerly took years due to the effort and resources required; now it is a matter of months, because the approach moved from a software development perspective to a data science one. Dataiku supports every step, from accessing the data to presenting the results to the consumer, and the analysts' focus is no longer debugging code but understanding the data and extracting the information they need from it.
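To give a flavor of that data science perspective - with hypothetical project codes, numbers, and threshold - such a tracking indicator can start as simply as comparing requested and observed hours per project and flagging the ones at risk:

```python
import pandas as pd

# Hypothetical per-project observation accounting.
obs = pd.DataFrame({
    "project": ["2021.A.001", "2021.A.002", "2021.A.003"],
    "requested_hours": [10.0, 25.0, 8.0],
    "observed_hours": [9.5, 12.0, 8.0],
})

# Completion fraction, plus a simple at-risk flag (threshold is illustrative)
# to surface projects whose data delivery might be hindered.
obs["completion"] = obs["observed_hours"] / obs["requested_hours"]
obs["at_risk"] = obs["completion"] < 0.8

print(obs[["project", "completion", "at_risk"]])
```

In practice the real indicators are richer, but the point stands: once the data is accessible, the analyst's work is defining metrics like these, not building and debugging a bespoke application.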
Ultimately, the biggest value brought by Dataiku relates to powering scientific discoveries: not only are we producing scientific data, but we are starting to look into it to make our operations more efficient, so as to increase the number of hours on the sky by lowering the hours needed to keep everything working as expected, and to make the best possible use of those hours by improving the quality of the observations.