For Prepare steps, we've run into trouble with the schema being auto-inferred from the first 10,000 rows of a dataset. To work around this, we changed the sample to one of the random sampling methods. This works, but some of these datasets are huge, and generating the random sample consumes a lot of resources.
Is there a way to have the first 10,000 rows already be a preloaded random sample, with the rest of the dataset loaded normally? That way Dataiku only needs to read the first 10,000 rows but still gets a representative sample, without having to recompute a sample every time the recipe is opened.
I don't know if this will work for you or not. Could you create a column called sample? Pre-place a true value on 10,000 random rows of your choice, then in the sampling screen use the filter option to select only the predefined "sample" records. I imagine this would remove a bunch of the calculation needed to find the "random" sample, and the compute would be pushed to the database engine, which may be more efficient at finding 10,000 useful records than DSS's two-pass, on-the-fly method of finding a new "random" sample every time you refresh. This would also leave you the option of going back to the standard method if you liked. You could even use different values in this column for different types of samples. For efficiency, I'd keep these values as integers, which should be the easiest for most data repositories to filter efficiently.
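To make the idea concrete, here's a minimal sketch of pre-tagging a fixed random subset with an integer flag, so a downstream tool can just filter on the flag instead of re-sampling. The dataset, column names, and sample size here are all made up for illustration; the real tagging could equally be done once in SQL on the source table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the real dataset (hypothetical column).
df = pd.DataFrame({"value": rng.normal(size=100_000)})

SAMPLE_SIZE = 10_000

# Tag a one-time random subset with an integer flag (cheap to filter on).
df["sample_flag"] = 0
chosen = rng.choice(df.index, size=SAMPLE_SIZE, replace=False)
df.loc[chosen, "sample_flag"] = 1

# Downstream, the "sample" is just a simple filter -- no re-sampling needed.
sample = df[df["sample_flag"] == 1]
```

Because the flag is persisted with the data, refreshing the sample is a plain equality filter that a database engine can push down, and you could use values 2, 3, ... for additional, differently-sized samples in the same column.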
Just my $0.02 for the evening. Let us know how you get on with this. Others, please jump in here if you have any smart ideas about this.
P.S. You might also consider partitioning your dataset, and engineer a partition as your "sample" data. (This feature is not available on the community edition of the tool.)