Comparing Data Science Platform Capabilities
UserBird
Dataiker, Alpha Tester Posts: 535 Dataiker
I want to understand and evaluate the cost of building a data science platform that has the capabilities listed below:
Data Ingestion:
- File uploads from filesystem (FTP, SFTP)
- Cloud (S3)
- HDFS
- Oracle
- Plugin support
Data Versioning:
- Ability to manage versions of data
File Format Support:
- CSV
- Text
- JSON
- Excel
- Automatic schema detection
Data Wrangling:
- Visual, interactive & collaborative data cleaning and data imputation
Data Preparation:
- Apply data transformations (visually)
- Variable type detection
- Encoding
- Data grouping and aggregation
Data Pipeline:
- Ability to visually create and manage data pipelines
- Automating & scheduling data pipelines
Machine Learning:
- Comparing models
- Feature engineering
- Model versioning
- Distributed processing
Data Mining:
- Interactive & collaborative notebooks for data exploration
Data Visualization:
- Many built-in charts
- Ability to integrate JavaScript libraries (d3, leaflet, etc.)
- Dashboards for executives
Design To Production:
- Expose your model as REST APIs
- Running multiple versions of the same model for testing
Can someone guide me on what tools/frameworks we would need to add on top of Apache Spark and Zeppelin to get these capabilities?
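To make one of the listed items concrete, here is a rough sketch of what "automatic schema detection" and "variable type detection" could look like for a CSV upload. It is only an illustration using the Python standard library. The function names `infer_type` and `infer_schema` and the sample data are hypothetical and not taken from any product. On Spark, the CSV reader's `inferSchema` option covers comparable ground.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name ("int", "float", "string") fitting all non-empty values."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "string"
    # Try the strictest cast first and fall back to wider ones.
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Read CSV text with a header row and map each column name to an inferred type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {
        col: infer_type([row[i] for row in data if i < len(row)])
        for i, col in enumerate(header)
    }


# Hypothetical sample: the empty price in the last row is treated as missing.
sample = "id,price,city\n1,9.5,Paris\n2,12,Lyon\n3,,Nice\n"
schema = infer_schema(sample)
print(schema)
```

A real platform would also need to handle dates, booleans, sampling of large files, and conflicting types across rows, which this sketch ignores.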
Answers
Two possible solutions here:
- Buying Dataiku DSS: if you are interested, you should contact our sales team. The main advantage is that the cost is limited to the price of our license, with no additional tools, frameworks, or development costs required.
- Not buying Dataiku DSS: you can refer to this year's Gartner Magic Quadrant for inspiration on which software you should try to copy. Once you have found inspiration, you will simply need a few years of work, a team of engineers, and a little bit of funding.