Appreciate some pointers...
I'm trying to find information on the following:
1) Where does DSS store features? Is there an internal feature store within DSS?
2) Can DSS work with the AWS SageMaker (SM) Feature Store, to push features to it and reuse features that are already stored there?
3) What is the best way to store and share features within a team?
4) Does DSS integrate with any other feature stores?
While DSS does not have a separate concept of a feature store, DSS natively has, and has always had, all the capabilities of a dedicated feature store, while also including ample building capabilities, both visual and code-based.
The two central concepts in DSS are the Dataset and the data-oriented Flow.
A Dataset can natively be used as a feature group. It is a set of records that can include both the lookup keys and the derived features.
The Flow in DSS natively provides the lineage and reproducibility capabilities that are needed for a feature store: the Flow guarantees that you can rebuild your features in exactly the same way they were initially designed, whether said Flow is entirely visual, entirely code-based, or hybrid (https://doc.dataiku.com/dss/latest/flow/index.html). The Flow checks which data or processing has changed in order to rebuild only what is needed. It can perform either a complete rebuild or, through the use of partitioning, an incremental rebuild (for example, for daily features).
Native integration with Git means that you can easily track the history of the flow (and hence your feature groups), revert to older versions, create branches of your work, ... (https://doc.dataiku.com/dss/latest/collaboration/git.html)
Unlike many dedicated feature stores, DSS does not limit you to a single kind of storage: datasets can be stored in any of the numerous storage options (cloud file storage, SQL databases, NoSQL databases, Hadoop, ...) and can be built using any of the compute options (DSS builtin engine, Python, R, Kubernetes, Spark, ...) (https://doc.dataiku.com/dss/latest/connecting/index.html)
Batch serving of features is natively done by the "Join" recipe in DSS which allows you to visually join and enrich your master dataset with as many datasets representing feature groups as you want (https://knowledge.dataiku.com/latest/courses/lab-to-flow/join/join-summary.html).
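Outside DSS, the effect of this kind of enrichment can be sketched in plain Python. This is only an illustration of a left join on a lookup key, not how the Join recipe is implemented; the `enrich` helper, the dataset contents, and the column names are all hypothetical.

```python
def enrich(master, feature_groups, key):
    """Left-join each feature group onto the master records by `key`."""
    enriched = []
    for row in master:
        out = dict(row)
        for group in feature_groups:
            # Each feature group is indexed by the lookup key;
            # a key missing from a group simply adds no columns.
            out.update(group.get(row[key], {}))
        enriched.append(out)
    return enriched


# Hypothetical master dataset and two feature groups keyed by customer_id
master = [{"customer_id": 1, "label": 0}, {"customer_id": 2, "label": 1}]
spend_features = {1: {"lifetime_spend": 120.0}, 2: {"lifetime_spend": 45.5}}
visit_features = {1: {"visits_30d": 3}}

rows = enrich(master, [spend_features, visit_features], key="customer_id")
```

Every master record is kept, and each feature group contributes columns only where its lookup key matches.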
Online serving is done through the "Dataset lookup" feature that dynamically creates an easy to use API for querying features, including a Python client (https://doc.dataiku.com/dss/latest/apinode/endpoint-dataset-lookup.html)
Finally, Dataiku is natively a collaborative platform. Everything you do in DSS can be shared with your colleagues, either within a single project, or across multiple projects. This sharing is fully secure and you can control who can access what. DSS natively contains a data catalog, which allows you to have a shared library and understanding of your feature groups, including ample metadata and tagging of datasets (hence feature groups).
Dataiku does not natively integrate with third-party feature stores, but thanks to the coding capabilities that are native to DSS, you can easily query features in third-party feature stores using their Python or R APIs.
If you have a moment, can you share a bit more about how you have set up your feature store?
How do users find features?
How do you maintain your features? (Refresh, versioning, Access Control)
Are there other governance benefits that you gain from your use case?
Thanks for asking. Yes, I'd be happy to share more about our feature store.
Features in our feature store are all pre-calculated (vs. calculated on the fly). All processing runs in database via SQL. In other words, we used SQL to build our features. With the amount of data we are working with, using Python would not be practical. Also, the source data for most of our features is already stored in a SQL database, so it is much faster to use it there directly rather than pulling it out into memory.
Given the pre-calculated approach, the resulting feature tables are fairly large, since for many of our features we store daily values for our entire customer base. Most features are thus updated daily (that is, new values are calculated daily). Day-level feature values are sufficient for the vast majority of our use cases.
The overriding benefit of our feature store is of course how much more quickly we can develop ML models. Developing an initial model often takes just a few hours, whereas without our feature store that same model may have taken days, weeks, or even months. In some cases, our final ML models only use features from our feature store, although more commonly we supplement these features with features designed for the particular problem.
We deploy updates to our feature store using DSS automation instances. We develop and test on the Design instance and then deploy the updates to a Test and finally to a Production instance. We have incorporated a variety of design time and run time checks (via DSS Metrics and Checks) to assure data accuracy and reliability.
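As a rough illustration of the kind of run-time check described here, the sketch below validates a batch of feature rows in plain Python. It does not use the DSS Metrics and Checks API; the `run_checks` helper, the 5% null threshold, and the sample rows are assumptions made for the example.

```python
def run_checks(rows, key, max_null_ratio=0.05):
    """Return a list of failed-check messages for a batch of feature rows."""
    if not rows:
        return ["no rows produced"]
    failures = []
    # Each lookup key should appear exactly once per batch
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        failures.append("duplicate keys")
    # Flag any column whose share of missing values exceeds the threshold
    for column in rows[0]:
        null_ratio = sum(row[column] is None for row in rows) / len(rows)
        if null_ratio > max_null_ratio:
            failures.append(f"{column}: {null_ratio:.0%} nulls")
    return failures


good = [{"customer_id": 1, "spend": 10.0}, {"customer_id": 2, "spend": 5.0}]
bad = [{"customer_id": 1, "spend": None}, {"customer_id": 1, "spend": 5.0}]

good_failures = run_checks(good, "customer_id")
bad_failures = run_checks(bad, "customer_id")
```

An empty list means the batch passed; a non-empty list could be used to stop a deployment or raise an alert.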
We've experienced a bit of a virtuous cycle effect with our feature store. As the store expands and the value increases, it's easier to justify investing the resources to develop new features, test them thoroughly, assure that leakage is not occurring, etc. This in turn further increases the value of the store which makes it even easier to invest in further enhancements. And so on.
We've focused most of our efforts on building features and less on discoverability. In part, that's because use had initially been limited to a small team and because our general approach is to try all store features in our models. We are building a fairly simple webapp in DSS to provide data discoverability and exploration in preparation for rolling out the feature store to more teams in our company.
Access control is covered by our pre-existing data access policies (as implemented in our database). We have a plan for versioning of features but haven't gotten to the point yet of needing to do this.
I'm not sure I've precisely answered all of your questions. Feel free to follow up with any additional questions.
If you don't mind me asking, I would be interested to get some additional insight. If these questions are too specific please feel free to be more general.
Hi Grant (@GCase),
Sure I can share some more specifics.
We are using Netezza as our backend. Our data warehouse is hosted on Netezza and in general we do all of our data work there. Netezza works well for this purpose although any similar platform should be fine. In fact, we did some testing with Teradata and got similar performance.
We have 270 features in our store currently. We have tended to add new features in a couple of ways. One is through specific feature store enhancement projects where we solicit ideas (as needed, we maintain a list of ideas), prioritize them, and then assign them out to the team for development. The other leverages feature development work we do for specific ML projects; in these cases we identify features needed for the current purpose but, realizing their more general value, we develop them as part of the feature store.
As a team of data scientists, we balance taking time from model development with feature store enhancement. So our pace of adding to the feature store is more episodic than monthly. I'd expect that over the next year or two, we'll be adding features at a fairly high rate. At some point at least with the current subject area we'll have pretty well covered the space of possible features and thus the pace of additions would slow.
Our feature store includes multiple tables. We did think about trying to put all features in one table but decided multiple tables was a better choice. We have a core table and then several tables for specific types of features. The data in these other tables is of a particular type or source and is available in a particular timing. This approach results in easier development (e.g., each table has its own DSS project), will scale better over time (we don't have to worry about # of column limits), and gives data scientists options regarding what data to include in their models.
Hope this is helpful.
Welcome to the Dataiku Community. We are glad that you have chosen to join us.
First, I'm not doing much work with Feature Groups at this point. It is on my roadmap to explore and possibly implement.
However, I will point out that most things in a Dataiku DSS project are under git version control under the hood. You can see this by going to the version control in each project.
That said, since Feature Groups are sort of outside the scope of any one particular project, I'm not clear if that applies.
Also, in reading your post, I'm wondering if you're thinking about tagging certain feature groups as "Production" and others as, say, "legacy". I'm not clear whether that is a feature.
Looking forward to hearing what others know about this topic.
By versioning are you referring to adding features or changing the definitions of features?
If so, this hasn't been much of an issue for us so far.
One strategy we have used is to keep groups of features relatively small so these groups can be replaced (over time) as an entire unit by a new feature group.
Within feature groups, new features are pretty easy as long as existing projects specify columns (vs. SQL's select *) - just add new columns and new projects use them and old projects work without change (until modified to use the new features).
Changed features are treated as new features. The additional step here is to exclude previous versions of the feature when using the feature set. We use a macro to build the SQL that adds our feature sets to a SQL recipe (almost all of our work is in SQL), and so we would exclude the non-current feature versions in this macro. That said, we haven't versioned any specific features yet.
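To make the idea concrete, here is a hypothetical sketch of a helper that builds such a SELECT. It is not the macro described above; the function name, the `_v1`/`_v2` column naming, and the aliasing back to the unversioned feature name are all assumptions for illustration.

```python
def build_feature_select(table, key_columns, features):
    """Build a SELECT that lists columns explicitly (no SELECT *) and
    keeps only the current version of each feature.

    `features` maps a feature name to a list of (version, column_name) pairs.
    """
    columns = list(key_columns)
    for name in sorted(features):
        # Keep the column with the highest version number; older versions are skipped
        _, current_column = max(features[name])
        columns.append(f"{current_column} AS {name}")
    return f"SELECT {', '.join(columns)} FROM {table}"


sql = build_feature_select(
    "customer_features",
    ["customer_id", "as_of_date"],
    {
        "lifetime_spend": [(1, "lifetime_spend_v1"), (2, "lifetime_spend_v2")],
        "visits_30d": [(1, "visits_30d_v1")],
    },
)
```

Because columns are named explicitly, adding a new column or a new version to the table does not change what existing projects receive until they opt in.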
Changing our feature store is something we do periodically and most often we are focused on adding new features. After several years of use, we haven't yet felt compelled to drop or replace features with new versions. I expect that we will at some point but our experience is that there is less need for this than one might expect.
We need to build out features at a customer level to be used in both models in production, which require features to be up to date (e.g. what is a customer's lifetime spend now), but also for training models, which requires us to be able to use the same features but as of a point in time (e.g. what was the customer's lifetime spend as of a year ago, two years ago, etc.).
It seems to us like the simplest way to allow this would be to use python scripts to build features, stored in GitHub and used in Dataiku via the Git integration, with different modules for different feature groups. For models in production, these scripts could be built into a feature store flow and the output tables published into a Dataiku feature store to be imported by individual flows. For training models, these scripts could be imported via Dataiku libraries, and the time period filtered before they are used in the flows.
Can you see any issues with this method and/or would you be able to help with it?
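The requirement described in this question (one feature definition used both for current values and for values as of a past date) can be sketched as a function that takes the cutoff date as a parameter. This is a minimal illustration in plain Python with made-up transactions; the `lifetime_spend` function and the data layout are hypothetical.

```python
from datetime import date


def lifetime_spend(transactions, customer_id, as_of):
    """Sum a customer's spend over transactions dated on or before `as_of`."""
    return sum(
        amount
        for cust, tx_date, amount in transactions
        if cust == customer_id and tx_date <= as_of
    )


# Hypothetical (customer_id, transaction_date, amount) records
transactions = [
    (1, date(2021, 3, 1), 50.0),
    (1, date(2022, 6, 15), 30.0),
    (2, date(2022, 1, 10), 20.0),
    (1, date(2023, 2, 1), 40.0),
]

# Scoring: use every transaction up to the latest date
current = lifetime_spend(transactions, 1, date(2023, 12, 31))
# Training: the same logic, restricted to what was known a year earlier
year_ago = lifetime_spend(transactions, 1, date(2022, 12, 31))
```

Passing the cutoff explicitly keeps later transactions out of training features, which is one way to avoid leakage.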
When we first started work on our feature store development, we used this approach. We ended up changing to an approach where we store feature values for current and historical periods in a table. We changed because the specific method we were using was too hard to work with: we were using Python scripts to execute SQL scripts. Using a pure Python approach would be quite slow with the amounts of data we are dealing with (although the ability to run Python in database, e.g., in Snowflake, could change that assumption). We also had much more expertise in our team with SQL vs. Python.
Given all this, we decided to code our features in SQL and, given the limitations of SQL, couldn't come up with a way to apply SQL-based transformations to particular input data (e.g., historical for training or current for scoring). So instead we replaced the idea of applying transformations with storing the output of the transformations for all needed points in time. For us, that means one set of values per day. The resulting table is large but perfectly usable. Some additional advantages are that the table is easy to use, the same table is used for both training and scoring, joins to get feature values are quite fast (as everything is pre-calculated), and historical feature values are maintained even if input data is inappropriately updated.
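A small sketch of the stored-snapshot idea, using an in-memory SQLite database as a stand-in for the real warehouse. The table names, columns, and values are hypothetical; the point is only that training and scoring read the same pre-calculated table with different date conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_features (customer_id INTEGER, as_of_date TEXT, lifetime_spend REAL);
    INSERT INTO customer_features VALUES
        (1, '2022-12-31', 80.0), (1, '2023-12-31', 120.0),
        (2, '2022-12-31', 20.0), (2, '2023-12-31', 20.0);
    CREATE TABLE training_events (customer_id INTEGER, event_date TEXT, label INTEGER);
    INSERT INTO training_events VALUES (1, '2022-12-31', 1), (2, '2022-12-31', 0);
""")

# Training: join each labelled event to the feature values stored for its date
training = conn.execute("""
    SELECT e.customer_id, e.label, f.lifetime_spend
    FROM training_events e
    JOIN customer_features f
      ON f.customer_id = e.customer_id AND f.as_of_date = e.event_date
    ORDER BY e.customer_id
""").fetchall()

# Scoring: read the most recent day from the same table
scoring = conn.execute("""
    SELECT customer_id, lifetime_spend
    FROM customer_features
    WHERE as_of_date = (SELECT MAX(as_of_date) FROM customer_features)
    ORDER BY customer_id
""").fetchall()
```

Since the values are already materialized per day, both queries are plain equality joins or filters with no feature logic re-executed at query time.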
While we ended up going a different direction, I think what you propose is definitely reasonable and may very well be the best approach for your situation. If Python would provide acceptable performance and you have the skills to develop feature transformations in Python on your team then it would seem like a good way to go.
cc @fsergot (re feature transforms vs feature stores)