Appreciate some pointers...
Trying to find info on the following:
1) Where does DSS store features? Is there an internal feature store within DSS?
2) Can DSS work with the AWS SageMaker (SM) Feature Store, to push and re-use features that are in the SM Feature Store?
3) What is the best way to store and share features within a team?
4) Are there any other feature stores that DSS integrates with?
While DSS does not have a separate concept of a feature store, it natively provides, and always has provided, all the capabilities of a dedicated feature store, along with ample feature-building capabilities, both visual and code-based.
The two central concepts in DSS are the Dataset and the data-oriented Flow.
A Dataset can natively be used as a feature group. It is a set of records that can include both the lookup keys and the derived features.
The Flow in DSS natively provides the lineage and reproducibility capabilities that a feature store requires: the Flow guarantees that you can rebuild your features in exactly the same way they were initially designed, whether the Flow is entirely visual, entirely code-based, or hybrid (https://doc.dataiku.com/dss/latest/flow/index.html). The Flow checks which data or processing has changed in order to rebuild only what is needed. It can perform either a complete rebuild or an incremental rebuild, through the use of partitioning (for example, for daily features).
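For illustration, here is a minimal sketch of triggering such a partitioned rebuild through the DSS public Python API (dataikuapi). The URL, API key, project key, dataset name, and partition identifier are all placeholders:

```python
# Minimal sketch, assuming the dataikuapi public client; all names below
# (URL, API key, project key, dataset, partition) are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
project = client.get_project("MYPROJECT")

# Build one daily partition of a partitioned feature dataset; a recursive
# build lets DSS rebuild only the upstream pieces that have changed.
job = project.new_job("RECURSIVE_BUILD")
job.with_output("customer_features", partition="2023-06-01")
job.start_and_wait()
```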
Native integration with Git means that you can easily track the history of the Flow (and hence of your feature groups), revert to older versions, create branches of your work, and so on (https://doc.dataiku.com/dss/latest/collaboration/git.html)
Unlike many dedicated feature stores, DSS does not limit you to a single kind of storage: datasets can be stored in any of the numerous storage options (cloud file storage, SQL databases, NoSQL databases, Hadoop, ...) and can be built using any of the compute options (DSS builtin engine, Python, R, Kubernetes, Spark, ...) (https://doc.dataiku.com/dss/latest/connecting/index.html)
Batch serving of features is natively handled by the "Join" recipe in DSS, which allows you to visually join and enrich your master dataset with as many datasets representing feature groups as you want (https://knowledge.dataiku.com/latest/courses/lab-to-flow/join/join-summary.html).
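The same enrichment can of course also be done in code. Here is a minimal sketch of the code-based equivalent inside a DSS Python recipe; the dataset names and the join key are hypothetical:

```python
# Minimal sketch of a code-based join inside a DSS Python recipe.
# Dataset names ("orders", "customer_features", "orders_enriched")
# and the join key "customer_id" are hypothetical.
import dataiku

master = dataiku.Dataset("orders").get_dataframe()
features = dataiku.Dataset("customer_features").get_dataframe()

# Left join: keep every master record, enrich it with the feature columns
enriched = master.merge(features, on="customer_id", how="left")

dataiku.Dataset("orders_enriched").write_with_schema(enriched)
```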
Online serving is done through the "Dataset lookup" endpoint, which dynamically creates an easy-to-use API for querying features, including a Python client (https://doc.dataiku.com/dss/latest/apinode/endpoint-dataset-lookup.html)
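As a sketch of what querying such an endpoint looks like from the Python client (the API node URL, service id, endpoint id, and key column are placeholders):

```python
# Minimal sketch, assuming a "Dataset lookup" endpoint already deployed on
# an API node; URL, service id, endpoint id, and key column are placeholders.
import dataikuapi

client = dataikuapi.APINodeClient("https://apinode.example.com:12000", "feature-service")

# Look up the feature record matching a single key
record = client.lookup_record("customer-lookup", {"customer_id": "12345"})
print(record)
```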
Finally, Dataiku is natively a collaborative platform. Everything you do in DSS can be shared with your colleagues, either within a single project, or across multiple projects. This sharing is fully secure and you can control who can access what. DSS natively contains a data catalog, which allows you to have a shared library and understanding of your feature groups, including ample metadata and tagging of datasets (hence feature groups).
Dataiku does not natively integrate with third-party feature stores, but thanks to the coding capabilities that are native to DSS, you can easily query features in third-party feature stores using their Python or R APIs.
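To make question 2 concrete: since there is no built-in integration, reading from the SageMaker Feature Store would go through its own Python API, for example boto3, inside a DSS Python recipe or notebook. This is a rough sketch, not a Dataiku-provided integration; the region, feature group name, and record identifier are placeholders, and AWS credentials are assumed to be configured:

```python
# Rough sketch of querying the SageMaker Feature Store from Python code.
# Region, feature group name, and record identifier are placeholders;
# AWS credentials are assumed to be configured in the environment.
import boto3

runtime = boto3.client("sagemaker-featurestore-runtime", region_name="us-east-1")

response = runtime.get_record(
    FeatureGroupName="customer-features",
    RecordIdentifierValueAsString="12345",
)

# Each feature is returned as {"FeatureName": ..., "ValueAsString": ...}
features = {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}
print(features)
```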
If you have a moment, can you share a bit more about how you have set up your feature store?
How do users find features?
How do you maintain your features? (refresh, versioning, access control)
Are there other governance benefits that you gain from your use case?
Thanks for asking. Yes, I'd be happy to share more about our feature store.
Features in our feature store are all pre-calculated (vs. calculated on the fly). All processing runs in-database via SQL; in other words, we use SQL to build our features. With the amount of data we are working with, using Python would not be practical. Also, the source data for most of our features is already stored in a SQL database, so it is much faster to work with it there directly rather than pulling it out into memory.
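To give a concrete picture, this is roughly the pattern: a DSS recipe that pushes the whole computation down to the database, so no data flows through Python memory. A simplified sketch using SQLExecutor2 from a Python recipe, with made-up table, column, and dataset names:

```python
# Simplified sketch: the SELECT runs entirely in-database and its result is
# written straight to the output dataset. Table, column, and dataset names
# are made up for illustration.
import dataiku
from dataiku import SQLExecutor2

output = dataiku.Dataset("customer_daily_features")
executor = SQLExecutor2(dataset=dataiku.Dataset("orders"))

executor.exec_recipe_fragment(output, """
    SELECT customer_id,
           COUNT(*) AS orders_last_30d,
           CURRENT_DATE AS feature_date
    FROM orders
    WHERE order_date >= CURRENT_DATE - 30
    GROUP BY customer_id
""")
```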
Given the pre-calculated approach, the resulting feature tables are fairly large, since for many of our features we store daily values for our entire customer base. Most features are thus updated daily (that is, new values are calculated daily). Day-level feature values are sufficient for the vast majority of our use cases.
The overriding benefit of our feature store is of course how much more quickly we can develop ML models. Developing an initial model often takes just a few hours, whereas without our feature store that same model might have taken days, weeks, or even months. In some cases, our final ML models use only features from our feature store, although more commonly we supplement these features with features designed for the particular problem.
We deploy updates to our feature store using DSS automation instances. We develop and test on the Design instance and then deploy the updates to a Test and finally to a Production instance. We have incorporated a variety of design-time and run-time checks (via DSS Metrics and Checks) to assure data accuracy and reliability.
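The run-time side of those checks can also be driven programmatically. A simplified sketch via the public API, with the URL, API key, project key, and dataset name as placeholders:

```python
# Simplified sketch of running metrics and checks via the public API;
# URL, API key, project key, and dataset name are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
dataset = client.get_project("FEATURESTORE").get_dataset("customer_daily_features")

# Recompute the configured metrics (record counts, column statistics, ...),
# then evaluate the checks defined on them
dataset.compute_metrics()
results = dataset.run_checks()
print(results)
```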
We've experienced a bit of a virtuous cycle effect with our feature store. As the store expands and the value increases, it's easier to justify investing the resources to develop new features, test them thoroughly, assure that leakage is not occurring, etc. This in turn further increases the value of the store which makes it even easier to invest in further enhancements. And so on.
We've focused most of our efforts on building features and less on discoverability. In part, that's because use had initially been limited to a small team, and because our general approach is to try all store features in our models. We are building a fairly simple webapp in DSS to provide data discoverability and exploration in preparation for rolling out the feature store to more teams in our company.
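The discovery piece doesn't need much: as a simplified sketch, a webapp backend can list the feature datasets and their tags through the public API (the connection details and project key are placeholders):

```python
# Simplified sketch: listing feature datasets and their tags through the
# public API, as a discovery webapp backend might. Names are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
for ds in client.get_project("FEATURESTORE").list_datasets():
    print(ds["name"], ds["tags"])
```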
Access control is covered by our pre-existing data access policies (as implemented in our database). We have a plan for versioning of features but haven't gotten to the point yet of needing to do this.
I'm not sure I've precisely answered all of your questions. Feel free to follow up with any additional questions.
If you don't mind me asking, I would be interested to get some additional insight. If these questions are too specific please feel free to be more general.
Hi Grant (@GCase),
Sure I can share some more specifics.
We are using Netezza as our backend. Our data warehouse is hosted on Netezza and in general we do all of our data work there. Netezza works well for this purpose although any similar platform should be fine. In fact, we did some testing with Teradata and got similar performance.
We have 270 features in our store currently. We have tended to add new features in a couple of ways. One is through specific feature store enhancement projects, where we solicit ideas (as needed, we maintain a list of ideas), prioritize them, and then assign them out to the team for development. The other leverages feature development work we do for specific ML projects; in these cases we identify features needed for the current purpose but, recognizing their more general value, develop them as part of the feature store.
As a team of data scientists, we balance time spent on model development against feature store enhancement, so our pace of adding to the feature store is episodic rather than monthly. I'd expect that over the next year or two we'll be adding features at a fairly high rate. At some point, at least within the current subject area, we'll have pretty well covered the space of possible features, and the pace of additions will slow.
Our feature store includes multiple tables. We did think about trying to put all features in one table but decided multiple tables were a better choice. We have a core table and then several tables for specific types of features; the data in these other tables is of a particular type or source and becomes available on a particular schedule. This approach makes development easier (e.g., each table has its own DSS project), will scale better over time (we don't have to worry about column-count limits), and gives data scientists options regarding what data to include in their models.
Hope this is helpful.