Using Dataiku

21 - 30 of 853
  • Hello, I'm using file-based partitioning, whereby the partitions are derived from the file name: Filename_YYYY_MM_DD/<files>. The partition pattern: /Filename_%Y-%M-%D/.* So far so good. Normally I r…
    Question
    Started by Henk
    1
  • I'm working with several partitioned datasets. I've run into a problem that data in one of the partitions is partially corrupt. (Lots of extra spaces added to a field.) Going forward, I can put steps …
    Answered ✓
    Started by tgb417
    Most recent by tgb417
    0
    5
    Solution by tgb417

    Looks like I found an answer.

    One can use the Enrich records with files info processor as a step in a Prepare recipe.

    However, this processor does not seem to work in a Lab visual analysis.


  • Dear all, The task is to export multiple datasets into Dataiku. The original data includes three different dataframes (20 each) within a folder. I need to have them all separately, 60 files. How …
    Answered ✓
    Started by SeherFazlioglu
    Most recent by SeherFazlioglu
    1
    2
    Solution by Álvaro Andrés

    Hello Seher,

    If I understand correctly, you have 20 different files in 3 different folders stored outside of DSS. If you would like to have these 60 files in one location, you can use a managed folder to upload them to DSS. After that, you can either create one dataset for multiple files or one dataset per file. The following link is a tutorial on how to do this:

    https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html#create-a-files-in-folder-dataset
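
    As an aside, if there are many files, here is a minimal sketch of pushing them into a managed folder with the dataiku Python API (the folder name and local paths below are hypothetical):

    ```python
    import dataiku

    # Hypothetical managed folder name/ID and local paths; replace with your own
    folder = dataiku.Folder("uploaded_files")

    local_files = [
        "/data/folder_a/df1.csv",
        "/data/folder_b/df2.csv",
    ]

    for path in local_files:
        # upload_stream() writes a file-like object into the managed folder
        with open(path, "rb") as f:
            folder.upload_stream(path.split("/")[-1], f)
    ```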

    Another option is creating a file-based dataset; in this case, you should activate partitioning and define a dimension identifier that matches your folder structure:

    https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html#partitioning-files-based-datasets

    I've uploaded 2 examples of file-based datasets using an S3 folder:

    • time_partition.png: Includes 3 files in 3 different folders and it's partitioned using a time dimension identifier
    • dimension_partition.png: Includes multiple files in 3 different folders using a discrete dimension identifier

    https://doc.dataiku.com/dss/latest/partitions/identifiers.html#partition-identifiers

    BR,

    Álvaro

  • We have been using partitions in order to archive data but we started noticing some weird behaviours, as if the partitioning configuration was being ignored when updating our flow. Here is a workflow …
    Answered ✓
    Started by Tanguy
    Most recent by CoreyS
    0
    3
    Solution by Tanguy

    Thanks to Dataiku's support team (thank you Clément Stenac), I finally understood why Dataiku was behaving this way.

    TL;DR:
    With a recursive build, Dataiku indeed ignores the partition configuration in the Sync recipe. Instead, Dataiku relies on its dependency management, which, in this case, leads to rebuilding all the partitions.

    In detail:
    The last recipe builds a non-partitioned (NP) dataset from a partitioned (P) dataset. In this configuration, Dataiku by default applies the "ALL AVAILABLE" dependency between the P dataset and the NP dataset (which, at least in this case, makes sense as I want to retrieve all the partitions).

    See the following screenshot to visualize the dependency function between the P dataset and the NP last dataset:
    14.jpg

    So when asking Dataiku to build the last dataset recursively (e.g. with a smart build), it will seek to rebuild all partitions in the previous dataset.
    15.jpg

    So when the build arrives at the Sync recipe, the partitions will be rebuilt using the current version of the first dataset (recall that the redispatch option is deactivated in the Sync recipe, so the NP input dataset will overwrite each partition of the P output dataset).

    Apart from using Dataiku's API (as I have in my other reply), one can prevent rebuilding the P dataset recursively by setting its build behaviour to "explicit". However, this will prevent Dataiku from updating the flow from the P dataset.

    Note that when resolving the flow dependencies, Dataiku restricts itself to existing partitions, so it will not try to build a new partition (as configured in the Sync recipe).
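
    For reference, a minimal sketch of the API approach mentioned above: building a single, explicitly named partition so that dependency management does not rebuild everything (dataset and partition names are hypothetical):

    ```python
    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Build only the named partition, non-recursively, so upstream
    # datasets are not rebuilt by dependency management
    dataset = project.get_dataset("P_dataset")
    dataset.build(job_type="NON_RECURSIVE_FORCED_BUILD",
                  partitions="2021-05-01")
    ```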

  • Hi, I am working on data in S3 which is partitioned by timestamp on filename. I need to repartition the data using a column value as the files contain unordered timestamp data. I tried redispatch part…
    Question
    Started by gt
    Most recent by gt
    0
    2
    Last answer by gt

    Hi @SarinaS,

    Thank you for your response.

    In this case, the smallest partitioned slice I can read each time is hourly data. Repartitioning each hour's data (~7M records) is also slow when using DSS, so I tried using PySpark and it performs better.
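
    For readers hitting the same issue, a minimal PySpark sketch of this kind of column-based repartitioning (the bucket, paths, and event_date column are hypothetical):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read one hourly slice of the filename-partitioned data
    df = spark.read.parquet("s3://bucket/raw/2021-05-01-00/")

    # Rewrite it laid out by a column value instead of the filename timestamp
    (df.write
       .partitionBy("event_date")
       .mode("append")
       .parquet("s3://bucket/repartitioned/"))
    ```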

    Best,

    Gowtham.

  • Hi, I am trying to read a partitioned dataset using Python. I got a list of partitions using the following code. But I do not know how to read those partitions one by one as dataframe. mydataset = dat…
    Question
    Started by ankitmat45
    Most recent by Skanda Gurunathan
    0
    3
    Last answer by Skanda Gurunathan

    Why is list_partitions() taking so much time? How does Dataiku handle this internally?

    Say I want to read only the latest partition; is there any way to get that quickly?
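
    For reference, a minimal sketch of reading partitions one at a time from a DSS notebook (the dataset name is hypothetical, and add_read_partitions() can only be called outside of recipes):

    ```python
    import dataiku

    partitions = dataiku.Dataset("mydataset").list_partitions()

    for p in partitions:
        ds = dataiku.Dataset("mydataset")  # fresh handle per partition
        ds.add_read_partitions(p)          # restrict reads to this partition
        df = ds.get_dataframe()
        print(p, df.shape)

    # For a single time dimension, identifiers sort lexicographically,
    # so the latest partition is simply max(partitions)
    ```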

  • Hello, I want to run the scenario every week for partitioning. My dataset consists of weeks. If I set the partitioning period to "2021-05-01/2021-07-01" like this, I get empty partitions. Can I not cr…
    Question
    Started by yunho
    0
  • Hi Team, What I am trying to accomplish is to run a couple of column updates in a PostgreSQL after receiving new data, however, every time I try to use a new recipe, it's trying to create a new datase…
    Question
    Started by Juan Carlos
    Most recent by Alexandru
    0
    1
    Last answer by Alexandru

    Hi @Aray,

    What changes are you looking to perform? Are you looking at using a SQL recipe or Python?

    You can create a Python recipe with a dummy output, e.g. a folder (to which you will not write anything), then read the input dataset and modify it in the code.

    You can also use Python steps or SQL steps in a scenario.
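
    If the updates are pure SQL, one possible sketch from Python uses SQLExecutor2, passing the UPDATE as a pre-query so no new dataset is created (the connection and table names are hypothetical):

    ```python
    import dataiku
    from dataiku import SQLExecutor2

    # Hypothetical PostgreSQL connection name
    executor = SQLExecutor2(connection="my_postgres")

    # DML goes in pre_queries; the trailing SELECT just returns a result set
    executor.query_to_df(
        "SELECT 1 AS done",
        pre_queries=[
            "UPDATE my_table SET status = 'processed' WHERE status IS NULL",
            "COMMIT",
        ],
    )
    ```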

  • Hi, I have a dataset containing hourly rain data that is discretely partitioned following IDs of pluviometers (1000 IDs). I'd like to sub-partition this dataset by month (i.e. : /%{ID}/%Y/%M/.*), but …
    Question
    Started by ACB
    Most recent by MikeG
    0
    1
    Last answer by MikeG

    Hello @ACB,

    It is possible to implement a partitioning scheme involving a discrete dimension followed by a time dimension.

    The initial state I’m envisioning is a dataset that was uploaded from a single csv file containing the pluviometer data.

    From this initial state here are the steps you would need to perform to implement DSS-based partitioning on this dataset using a discrete dimension (ID) followed by a time dimension (month).

    STEP 1 - Add a Prepare recipe (https://doc.dataiku.com/dss/latest/other_recipes/prepare.html) to the original dataset. In the prepare recipe do the following:

    1. Use parse date processor (https://doc.dataiku.com/dss/latest/preparation/dates.html?#parse-date-processor) to parse the date column you wish to use in the partition key (this will allow for time-based partitioning on the date column)
    2. Use the copy column processor (https://doc.dataiku.com/dss/latest/preparation/processors/column-copy.html) to create copies of the partition key columns (i.e. create copies of the ID column and the date column that will be used as the partition key). This step is necessary to preserve the partition key columns in the output dataset after partitioning is performed on the original dataset. Doing this will address your constraint about the columns being dropped from the output dataset.

    At end of STEP 1 the Flow looks like: `original_dataset` -> `prepare recipe` -> `original_dataset_prepared`

    STEP 2 - Add a Sync recipe (https://doc.dataiku.com/dss/latest/other_recipes/sync.html) to `original_dataset_prepared` to create a copy of `original_dataset_prepared`

    At the end of STEP 2 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy`

    STEP 3 - Define the desired partitioning scheme. Access `original_dataset_prepared_copy` in the Flow.
    Then go to `Settings > Partitioning` and implement the partitioning scheme of ID followed by MONTH:

    1. Click Activate Partitioning
    2. Add a discrete dimension and choose the name of the ID column
    3. Add a time dimension and choose the name of the date column
    4. Pattern field should look like: `%{ID_column_name}/%Y/%M/.*`

    partitioning.png

    At the end of STEP 3 the Flow is the same as at the end of STEP 2.

    STEP 4 - Go back to the Sync recipe. Now that you have defined a desired partitioning scheme directly downstream from the Sync recipe, you should see a checkbox labeled “Redispatch partitioning according to input columns” in the UI of the Sync recipe.

    Check the box labeled “Redispatch partitioning according to input columns” and click Run. This will kick off the job that takes input dataset `original_dataset_prepared` and updates the layout on disk of output dataset `original_dataset_prepared_copy` to conform to the partitioning scheme defined in STEP 3.

    redispatch.png

    At the end of STEP 4 the Flow looks the same as at the end of STEP 3, with the only difference being that `original_dataset_prepared_copy` is now a partitioned dataset (i.e. if you look at the dataset on disk you will notice a directory structure based on the partitioning scheme, and the partitioning columns have been dropped from the output dataset). A quick verification sketch follows at the end of this answer.

    At the end of steps 1-4 your Flow should look similar to this:

    flow.png

    Note:

    • We preserved the partitioning columns in the output dataset by making copies of them in the upstream Prepare recipe.
    • Even if you did not copy the partition key columns you would still be able to access the partition key columns for the purpose of performing operations on specific partitions of data.
    • I recommend this free course on DSS partitioning: https://academy.dataiku.com/path/advanced-designer/advanced-partitioning

    ---

    Specifically addressing your constraints:

    • You should be able to use the redispatch function in a Sync recipe when the dataset directly downstream from the Sync recipe has partitioning enabled.
    • You can prevent the partition key columns from disappearing from the final output dataset by creating copies of those columns via the Prepare recipe. Note: even if you did not copy the partition key columns, you would still be able to access them for the purpose of performing operations on specific partitions of data.
    • I’m not sure what levers (if any) you’ll have for improving performance.
      • What’s the schema of the original dataset and how many rows are in it?
      • Which storage are you using for the input dataset?
      • Which storage are you using for the output dataset?
      • How long does the redispatch job take to complete?
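
    As the quick verification sketch promised above, this lists the resulting partition identifiers from a notebook (dataset name taken from the steps above):

    ```python
    import dataiku

    ds = dataiku.Dataset("original_dataset_prepared_copy")
    # With a discrete ID dimension plus a month time dimension, identifiers
    # look like "<ID>|<YYYY-MM>", e.g. "42|2021-05"
    print(ds.list_partitions())
    ```
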
  • Hello Dataikuers, I'm trying to call a model that was deployed on my flow. To do so, I created a notebook where I generated new data and made a call to the model as follows: #df: is a dataframe that i…
    Question
    Started by saraa1
    Most recent by Alexandru
    0
    1
    Last answer by Alexandru

    Hi @saraa1,

    You can use a scoring recipe for this: https://knowledge.dataiku.com/latest/courses/scoring/scored-results/scored-results-summary.html

    Is there a particular reason you are looking at doing this from a Notebook instead?
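
    If the notebook route is still needed, a minimal sketch using the saved model's predictor from the dataiku package (the model name is hypothetical; df is the new-data dataframe from the question):

    ```python
    import dataiku

    # Look up the saved model deployed in the Flow (by name or ID)
    model = dataiku.Model("my_saved_model")
    predictor = model.get_predictor()

    # df: dataframe with the same feature columns the model was trained on
    predictions = predictor.predict(df)
    ```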

    Thanks,
