Partitioning one CSV file according to a category column value
Hi everyone,
After spending some time on Dataiku's website and forums, I can't find an answer to my question.
First of all, I should mention that I am using the free version of DSS.
I would like to partition a CSV file according to the value of a category column, and then pass each partition to an anomaly detection algorithm (Isolation Forest). To simplify, let's say that I want to partition by client category. The flow would be launched once a day, and the number of categories can differ from day to day; it is roughly 100.
You can find attached a diagram explaining my aim.
Could someone tell me if this is possible? If so, what are the steps to follow?
Thank you very much in advance.
Answers
-
Hi Valengo,
You can certainly run a partitioned model based on a partitioned dataset. You should be able to do this with the free edition of DSS. As for the steps, I think our academy can explain them better than I could in a forum post. We have a module on training a partitioned random forest model against a partitioned dataset here: https://academy.dataiku.com/partitioned-models/543579. The module explains the concepts and then walks you through a tutorial project for this use case. If you need a more in-depth tutorial on partitioning, we also have an academy module on that here: https://academy.dataiku.com/advanced-partitioning/657681. Both of these modules should help with your use case.
Feel free to reach out if you have any questions about the tutorials!
Hope this helps!
Andrew M
-
tgb417
First of all, I love using the Community edition of DSS.
@AndrewM, cool!
As a long-time user of the Community Edition of DSS, I have found that one of its limitations for me is the lack of partitioning. When I look at this page, I see that partitioning is not listed as a feature of the Community Edition. If you know a way to get partitions and partitioned models to work with the Community Edition, I'd love to learn more and share this with others.
@Valengo, one of the other limitations I've found with the Community Edition is the lack of what Dataiku calls scenario support: the ability to run jobs on a schedule or in response to some type of trigger condition.
Can you tell us a bit more about what you are doing in your project?
-
I overlooked the partitioning limitation in the edition comparison; apologies for the mistake. Unfortunately, you will need the Business or Enterprise edition for partitioning.
@tgb417, you are correct, thank you for catching my mistake. Unfortunately, there isn't a way to get partitioning on the free edition of DSS. I overlooked that limitation when reviewing the edition comparisons.
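For anyone landing here with the same goal, the split-then-detect step described in the question can also be sketched in plain Python, independently of DSS partitioning. This is only a minimal sketch under assumptions: the column names `client_category` and `amount`, the sample data, and all function names are hypothetical, and the z-score detector is a simple stand-in, not an Isolation Forest. The detector is passed in as a function so that a real model can be swapped in.

```python
import csv
import io
import statistics
from collections import defaultdict


def partition_rows(csv_file, category_column):
    """Group the rows of a CSV file by the value of one column."""
    partitions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        partitions[row[category_column]].append(row)
    return dict(partitions)


def zscore_outliers(values, threshold=2.0):
    """Stand-in detector: flag values far from the partition mean.

    This is NOT an Isolation Forest; it only keeps the sketch dependency-free.
    """
    if len(values) < 2:
        return [False] * len(values)
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [False] * len(values)
    return [abs(v - mean) / stdev > threshold for v in values]


def detect_per_partition(csv_file, category_column, value_column,
                         detector=zscore_outliers):
    """Run the detector independently on the rows of each category."""
    results = {}
    for category, rows in partition_rows(csv_file, category_column).items():
        values = [float(r[value_column]) for r in rows]
        results[category] = detector(values)
    return results


# Hypothetical sample data; real input would be the daily CSV file.
SAMPLE = """client_category,amount
A,10
A,11
A,9
A,10
A,12
A,10
A,500
B,100
B,101
B,99
"""

flags = detect_per_partition(io.StringIO(SAMPLE), "client_category", "amount")
for category, outliers in flags.items():
    print(category, outliers)
```

Because the categories are discovered from the data on each run, a different number of categories per day needs no configuration change. To use an actual Isolation Forest, one option (assuming scikit-learn is installed) is to pass a detector that wraps `sklearn.ensemble.IsolationForest`, whose `fit_predict` method returns -1 for rows it considers outliers.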