Partitioning one CSV file according to a category column value

Valengo
Level 1

Hi everyone,

After spending some time on Dataiku's website and forums, I still can't find an answer to my question.

First of all, I should mention that I am using the free edition of DSS.

My goal is to partition a CSV file by the value of a category column, and then run an anomaly detection algorithm (Isolation Forest) on each partition. To simplify, let's say that I want to partition by client category. The flow would be launched once a day, and the number of categories can differ from day to day; there are roughly 100 of them.

I have attached a diagram explaining what I am trying to do.

Could someone tell me if this is possible? If so, what are the steps to follow?

Thank you very much in advance.

 

3 Replies
AndrewM
Dataiker

Hi Valengo,

You can certainly run a partitioned model based on a partitioned dataset. You should be able to do this with the free edition of DSS. As for the steps, I think our Academy can explain them better than I could in a forum post. We have a module on training a partitioned random forest model against a partitioned dataset here: https://academy.dataiku.com/partitioned-models/543579. The module explains the concepts and then walks you through a tutorial project for this use case. If you need a more in-depth tutorial on partitioning, we also have an Academy module on that here: https://academy.dataiku.com/advanced-partitioning/657681. Both of these modules should help with your use case.

Feel free to reach out if you have any questions about the tutorials!

Hope this helps!

Andrew M

tgb417

@Valengo 

First of all, I love using the Community edition of DSS.

@AndrewM ,

Cool!

As a long-time user of the Community Edition of DSS, I have found that one of its limitations for me is the lack of partitioning. When I look at this page, I see that partitioning is not listed as a feature of the Community Edition. If you know a way to get partitions and partitioned models to work with the Community Edition, I'd love to learn more and share this with others.

[Image: Community Edition and Partitioning.jpg]

@Valengo ,

One of the other limitations I've found with the Community Edition is the lack of what Dataiku calls scenario support: the ability to run jobs on a schedule or in response to some type of trigger condition.

Can you tell us a bit more about the context in which you are doing your project?

--Tom
AndrewM
Dataiker

@Valengo,

I overlooked the partitioning limitation in the edition comparison; apologies for the mistake. Unfortunately, you will need the Business or Enterprise edition for partitioning.

@tgb417 ,

You are correct, thank you for catching my mistake. Unfortunately, there isn't a way to get partitioning on the free edition of DSS. I overlooked that limitation when reviewing the edition comparison.
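
For anyone landing on this thread with the same goal: the grouping step on its own can be done in code, anywhere you can run Python. Below is a minimal sketch using only the standard library that splits CSV rows by a category column, so that each group could then be passed to its own anomaly detection model (for example scikit-learn's `IsolationForest`). The function name `split_rows_by_category`, the column names, and the sample data are all hypothetical. This is not DSS partitioning and does not replace partitioned models; it only illustrates the idea.

```python
import csv
import io
from collections import defaultdict


def split_rows_by_category(csv_file, category_column):
    """Group the rows of a CSV file by the value found in one column.

    Returns a dict mapping each category value to the list of rows
    (as dicts) that carry that value. Categories are discovered from
    the data, so the number of groups can change between runs.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[category_column]].append(row)
    return dict(groups)


# Hypothetical sample data; the column names are made up for illustration.
sample = io.StringIO(
    "client_category,amount\n"
    "A,10.0\n"
    "B,250.0\n"
    "A,12.5\n"
)

groups = split_rows_by_category(sample, "client_category")
for category, rows in groups.items():
    # Each group could be handed to its own anomaly detection model here.
    print(category, len(rows))
```

Because the categories are read from the data itself, nothing needs to change when the number of categories differs from one day to the next.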
