Ignacio Toledo, Data Science and Analytics Lead, with:
On the Chajnantor Plateau in the Atacama Desert, one of the highest and driest places on Earth, a gentle “rain” is falling. It is light from space, at millimeter and submillimeter wavelengths: a natural, scarce, and precious resource. These waves are known to be full of information about our cosmic origins, which is why people thirsty for this knowledge have gathered here to collect, channel, and analyze it.
This is what gives rise to the Atacama Large Millimeter/submillimeter Array (ALMA), currently the largest radio telescope in the world. This achievement is the result of an international association between Europe (ESO), North America (NRAO) and East Asia (NAOJ), in collaboration with the Republic of Chile, to build the observatory of the “Dark Universe”.
As one of the world's premier terrestrial observatories, ALMA stands at the forefront of astronomical exploration. Our legacy has been built upon cutting-edge software and hardware development, enabling us to peer into the vastness of space with unparalleled precision. However, a new challenge looms on our horizon. While our expertise in observatory technology is mature, our adoption of data science and analytics, particularly in managing intricate operations, has lagged.
The operational complexity of ALMA has surged, driven by evolving demands from the global astronomy community [Figure 1] and the difficulty of maintaining 66 telescopes and their instruments at 5,000 meters above sea level. These researchers, our primary clientele, exert growing pressure for more refined data, quicker observational cycles, and streamlined processes. They seek insights into the universe's deepest secrets, and their expectations of us have never been higher. Yet, ironically, our challenges are not just celestial but deeply terrestrial.
Our operational budgets, a mere fraction compared to the colossal investments made during the facility's construction phase, have placed us in a tight spot. How do we elevate our operational efficiency, cater to the intensified demands, and manage the vast influx of data, all while keeping expenditures in check? The answer, it seems, lies in harnessing the power of data science and analytics.
However, our journey into this realm is fraught with challenges. Diving into data science without strong foundational knowledge risks missteps. We face the dual challenge of determining which tools fit our unique needs and ensuring they integrate seamlessly with our existing systems, all within a context of limited capacity to recruit or to fund new initiatives. The stakes are high: a misstep in this domain could lead to inefficiencies, potentially compromising the quality of our observations and the trust of the astronomers we serve.
In summary, ALMA stands at a crossroads. The demands of the present call for a shift toward a more data-driven operational approach. But the journey to integrate data science and analytics into our workflow, with our budgetary constraints and the immense responsibility we hold to the astronomical community, is a challenge of cosmic proportions.
In 2018, our path converged with Dataiku's, a collaboration that marked the beginning of our data-driven metamorphosis. In recognition of our non-profit research mission of advancing scientific knowledge, Dataiku generously provided ALMA with a free license to its Data Science Studio (Dataiku DSS).
The immediate impact of Dataiku was transformative. It wasn't just a platform; it was a holistic environment where our team could dive deep into the data. With its robust suite of tools for data access, preparation, cleaning, and analysis, Dataiku streamlined our analytical process. What set Dataiku apart was its "enforced" data science and analytical workflow. This structured pathway helped our team grasp the intricacies of data science and showed us the kind of team dynamics and workflows essential for our evolution.
The versatility of Dataiku, particularly its adaptability to diverse data storage technologies and database preferences, coupled with the feasibility of on-premises deployment, was invaluable. Its ease of deployment and maintenance meant that, even with limited resources and manpower, we swiftly established a formidable data stack.
The impact was palpable. Our analysts, engineers, and scientists rapidly embraced Dataiku, and within a few weeks we had nearly 20 people (about 10% of the observatory staff) doing analytical work within the platform, and up to 100 consumers connecting. We swiftly progressed from rudimentary data analytics to crafting insightful dashboards and visualizations, giving us a real-time pulse of our operations. This was just the beginning. As we delved deeper, we defined critical KPIs, and soon machine learning was no longer a futuristic concept but a tangible tool in our arsenal. One of our pioneering endeavors involved harnessing ML for preemptive maintenance. Another milestone was deploying Natural Language Processing (NLP) to classify project proposals and assign them to reviewers experienced in each project's theme. This innovation drastically improved the efficiency of our review process.
Our journey with Dataiku wasn't just about technological advancement; it was a story of empowerment, growth, and maturity. The successes we achieved with our experimental deployment resonated profoundly, garnering the unequivocal support of ALMA's management. It paved the way for transitioning from an experimental to a production-level deployment, solidifying our commitment to data-driven excellence.
Business Area Enhanced: Internal Operations
Use Case Stage: In Production
Over the span of our collaboration, the tangible value that Dataiku has delivered transcends mere numbers. Here's a dive into the multifaceted benefits we've derived:
1. Cost-effective Scalability
Building a robust data stack often requires significant investment, in both money and human resources. With Dataiku's architecture, however, we achieved a feat many would deem impossible. For an expenditure of nearly $100,000 spread over five years and the involvement of no more than 1 Full-Time Equivalent (FTE) per month, we launched a production-ready data stack. That effort was split between a system administrator, a Dataiku administrator, and a software engineer specializing in data engineering, and this minimal team served 20 analytic users and over 100 data consumers. Dataiku’s inherent design means that scaling our operations is limited only by funding, not by the intricacies of deployment.
2. Standardized Reporting & Enhanced Decision-making
The transition to Dataiku catalyzed the creation and automation of over 50 reports and dashboards. Where once analysts toiled in isolation on their personal computers, leading to duplicated efforts and inconsistent findings, Dataiku offers a unified platform. This harmonization eliminates inconsistencies and fosters collaboration. Moreover, this centralized repository of insights has empowered decision-makers. With real-time metrics and a comprehensive overview of the observatory's operations, pinpointing areas for efficiency enhancement is now systematic and evidence-based.
3. A Paradigm Shift in Culture
Perhaps the most profound impact of Dataiku has been on ALMA's organizational culture. The introduction of a unified data stack, accessible through Dataiku, has instilled a newfound respect for data integrity. Staff members now recognize the indispensability of quality data. They understand that each report or data product is crafted with a specific intent, and thus, conclusions shouldn't be hastily drawn. Importantly, there’s a growing recognition that data science isn’t a siloed endeavor for the tech-savvy few. Instead, it’s a collective pursuit. The narrative has shifted from an outsourced task to a collaborative team sport, knitting together diverse professionals in a shared mission.
1. The 'Project' Paradigm & The Flow Advantage
One of Dataiku's most transformative features is its 'project' concept. By streamlining data sources, preparation sequences, analysis, and final output into cohesive units, our staff has accelerated their analytical processes, delivering valuable insights in significantly reduced time. The embedded 'flow' system takes transparency to the next level. It's a visual representation, laying out the crux of any project's methodology. This has fostered a collaborative ethos. Staff can effortlessly share the intricacies of their work, and with Dataiku’s integrated version control, collaboration is free from the constant dread of overwriting or losing vital work.
2. Seamless DataOps Implementation through Diverse Instance Types
Dataiku’s architecture is tailor-made for a seamless DataOps lifecycle, and our experience stands testament: