Using Dataiku
- Hello, I'm using file-based partitioning, whereby the partitions are derived from the file name: Filename_YYYY_MM_DD/<files>. The partition pattern: /Filename_%Y-%M-%D/.* So far so good. Normally I r…
When working with a partitioned dataset, is there a way to determine which partition a record is in?
I'm working with several partitioned datasets. I've run into a problem where the data in one of the partitions is partially corrupt (lots of extra spaces added to a field). Going forward, I can put steps …
Solution by tgb417
Looks like I found an answer.
One can use the "Enrich record with file info" processor as a step in a visual recipe.
However, this processor does not seem to work in a "lab" visual analysis.
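Where that processor is not available (e.g. in a Lab visual analysis), another approach is a partitioned Python recipe that stamps each record with the partition it is being built for. A minimal sketch, assuming hypothetical dataset names and a time dimension called "day" (the DKU_DST_* flow variables are only populated when the recipe runs on a partition):

```python
import dataiku

# Read the partition currently being built by this (partitioned) Python recipe.
input_ds = dataiku.Dataset("my_partitioned_input")        # hypothetical name
df = input_ds.get_dataframe()

# In a partitioned recipe, DSS exposes the target partition value for each
# dimension as a flow variable named DKU_DST_<dimension name> ("day" here).
partition_value = dataiku.dku_flow_variables["DKU_DST_day"]
df["partition"] = partition_value

output_ds = dataiku.Dataset("records_with_partition")     # hypothetical name
output_ds.write_with_schema(df)
```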
- Dear all, the task is to import multiple datasets into Dataiku. The original data includes three different dataframes (20 of each) within a folder. I need them all separately, 60 files in total. How …
Solution by Álvaro Andrés
Hello Seher,
If I understand correctly, you have 20 different files in 3 different folders stored outside of DSS. If you would like to have these 60 files in one location, you can use a managed folder to upload them to DSS; after this, you can either create one dataset from multiple files or one dataset per file. In the following link you can find a tutorial on how to do this:
https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html#create-a-files-in-folder-dataset
Another option is creating a file-based dataset; in this case, you should activate partitioning and define a dimension identifier that matches your folder structure:
https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html#partitioning-files-based-datasets
I've uploaded 2 examples of file-based datasets using an S3 folder:
- time_partition.png: Includes 3 files in 3 different folders and it's partitioned using a time dimension identifier
- dimension_partition.png: Includes multiple files in 3 different folders using a discrete dimension identifier
https://doc.dataiku.com/dss/latest/partitions/identifiers.html#partition-identifiers
BR,
Álvaro
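For reference, a rough code sketch of the managed-folder route described above, reading every uploaded file from a Python recipe or notebook (the folder name and the CSV format are assumptions):

```python
import dataiku
import pandas as pd

# Hypothetical managed folder name holding the uploaded files.
folder = dataiku.Folder("uploaded_files")

frames = []
for path in folder.list_paths_in_partition():
    # Files are assumed to be CSV here; adjust the parser to your real format.
    with folder.get_download_stream(path) as stream:
        df = pd.read_csv(stream)
    df["source_file"] = path          # keep track of which file each row came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
```

From there you can write `combined` to a single dataset, or keep the files separate as described in the tutorial above.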
- We have been using partitions in order to archive data but we started noticing some weird behaviours, as if the partitioning configuration was being ignored when updating our flow. Here is a workflow …
Solution by Tanguy
Thanks to Dataiku's support team (thank you Clément Stenac), I finally understood why Dataiku was behaving this way.
TL;DR:
With a recursive build, Dataiku indeed ignores the partition configuration in the Sync recipe. Instead, Dataiku relies on its dependency management which, in this case, leads to rebuilding all the partitions.
In detail:
The last recipe builds a non-partitioned (NP) dataset from a partitioned (P) dataset. In this configuration, Dataiku by default applies the "ALL AVAILABLE" dependency between the P dataset and the NP dataset (which, at least in this case, makes sense as I want to retrieve all the partitions).
See the following screenshot to visualize the dependency function between the P dataset and the last NP dataset. So when asking Dataiku to build the last dataset recursively (e.g. with a smart build), it will seek to rebuild all partitions in the previous dataset.
So when arriving at the Sync recipe, the partitions will be rebuilt using the version of the first dataset (recall that the redispatch option is deactivated in the Sync recipe, so the NP input dataset will overwrite each partition of the P output dataset).
Apart from using Dataiku's API (as I have in my other reply), one can prevent rebuilding the P dataset in a recursive fashion by setting its build behaviour to "explicit". However, this will prevent Dataiku from updating the flow from the P dataset.
Note that when solving the flow dependencies, Dataiku restricts itself to existing partitions, so it will not try to build a new partition (as configured in the Sync dataset).
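For completeness, a hedged sketch of the API route mentioned above: building a single partition non-recursively so the dependency solver never fans out to all partitions (the project key, dataset name and partition identifier below are made up):

```python
import dataiku

# Hypothetical project/dataset names -- adapt to your flow.
client = dataiku.api_client()
project = client.get_project("MY_PROJECT")
dataset = project.get_dataset("partitioned_archive")

# Build only the partition we care about, without recursing upstream, so the
# "ALL AVAILABLE" dependency of the downstream recipe is never triggered.
job = dataset.build(
    job_type="NON_RECURSIVE_FORCED_BUILD",
    partitions="2023-01-15",      # partition identifier to (re)build
    wait=True,
)
status = job.get_status()         # dict describing the finished job
```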
- Hi, I am working on data in S3 which is partitioned by timestamp on filename. I need to repartition the data using a column value as the files contain unordered timestamp data. I tried redispatch part…
Last answer by gt
Hi @SarinaS
,
Thank you for your response.
In this case, the smallest partitioned data I can read each time is hourly data. Repartitioning each hour of data (~7M records) is also slow when using DSS, so I tried using PySpark and it's performing better.
Best,
Gowtham.
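For readers hitting the same wall, here is a rough PySpark sketch of repartitioning by a column value (the bucket paths and column names are invented; within DSS you would normally go through the Spark integration rather than raw S3 paths):

```python
from pyspark.sql import SparkSession, functions as F

# Standalone sketch with made-up S3 paths and column names.
spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3a://my-bucket/raw/hourly/")

# Derive the partition value from the timestamp column, then write one folder
# per value so the data ends up partitioned by that column rather than by file.
df = df.withColumn("event_date", F.to_date("event_ts"))
(df.repartition("event_date")
   .write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3a://my-bucket/repartitioned/"))
```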
- Hi, I am trying to read a partitioned dataset using Python. I got a list of partitions using the following code, but I do not know how to read those partitions one by one as dataframes. mydataset = dat…
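The code in the thread is truncated, but a common pattern for this in a notebook is to select each partition before reading (a sketch; add_read_partitions cannot be used inside a recipe, where partitions are driven by the Flow instead):

```python
import dataiku

dataset_name = "mydataset"        # as in the question

# list_partitions() returns the partition identifiers of the dataset.
partitions = dataiku.Dataset(dataset_name).list_partitions()

for partition_id in partitions:
    # Re-instantiate the Dataset so each read targets exactly one partition.
    ds = dataiku.Dataset(dataset_name)
    ds.add_read_partitions(partition_id)   # notebook use only, not inside recipes
    df = ds.get_dataframe()
    print(partition_id, df.shape)
```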
- Hello, I want to run the scenario every week for partitioning. My dataset consists of weeks. If I set the partitioning period to "2021-05-01/2021-07-01", I get empty partitions. Can I not cr…
- Hi Team, what I am trying to accomplish is to run a couple of column updates in PostgreSQL after receiving new data; however, every time I try to use a new recipe, it's trying to create a new datase…
Last answer by Alexandru
Hi @Aray,
What changes are you looking to perform? Are you looking at using a SQL recipe or Python?
You can create a Python recipe with a dummy output, e.g. a folder (where you will not write anything), and read the input dataset and modify it in the code.
You can also use Python steps or SQL steps in a scenario.
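If you go the code route, here is a hedged sketch of running the UPDATEs from a Python recipe or Python scenario step with SQLExecutor2 (the connection, schema and column names are placeholders); a SQL step in a scenario achieves the same result without any Python:

```python
from dataiku import SQLExecutor2

# Hypothetical connection and table names -- adapt to your PostgreSQL setup.
executor = SQLExecutor2(connection="my_postgres_connection")

update_sql = """
    UPDATE my_schema.my_table
    SET status = 'processed'
    WHERE status IS NULL
"""

# SQLExecutor2 is select-oriented, so a common trick is to run the DML as a
# pre-query, finish with a trivial SELECT, and COMMIT afterwards.
executor.query_to_df(
    "SELECT 1 AS done",
    pre_queries=[update_sql],
    post_queries=["COMMIT"],
)
```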
- Hi, I have a dataset containing hourly rain data that is discretely partitioned following the IDs of pluviometers (1000 IDs). I'd like to sub-partition this dataset by month (i.e. /%{ID}/%Y/%M/.*), but …
Last answer by MikeG
Hello @ACB,
It is possible to implement a partitioning scheme involving a discrete dimension followed by a time dimension.
The initial state I'm envisioning is a dataset that was uploaded from a single CSV file containing the pluviometer data.
From this initial state, here are the steps you would need to perform to implement DSS-based partitioning on this dataset using a discrete dimension (ID) followed by a time dimension (month).
STEP 1 - Add a Prepare recipe (https://doc.dataiku.com/dss/latest/other_recipes/prepare.html) to the original dataset. In the prepare recipe do the following:
- Use the Parse date processor (https://doc.dataiku.com/dss/latest/preparation/dates.html?#parse-date-processor) to parse the date column you wish to use in the partition key (this will allow for time-based partitioning on that column)
- Use the copy column processor (https://doc.dataiku.com/dss/latest/preparation/processors/column-copy.html) to create copies of the partition key columns (i.e. create copies of the ID column and the date column that will be used as the partition key). This step is necessary to preserve the partition key columns in the output dataset after partitioning is performed on the original dataset. Doing this will address your constraint about the columns being dropped from the output dataset.
At the end of STEP 1, the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared`
STEP 2 - Add a Sync recipe (https://doc.dataiku.com/dss/latest/other_recipes/sync.html) to `original_dataset_prepared` to create a copy of `original_dataset_prepared`
At the end of STEP 2 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy`
STEP 3 - Define the desired partitioning scheme. Access `original_dataset_prepared_copy` in the Flow.
Then go to `Settings > Partitioning` and implement the partitioning scheme of ID followed by MONTH:
- Click Activate Partitioning
- Add a discrete dimension and choose the name of the ID column
- Add a time dimension and choose the name of the date column
- Pattern field should look like: `%{ID_column_name}/%Y/%M/.*`
At the end of STEP 3 the Flow is the same as at the end of STEP 2.
STEP 4 - Go back to the Sync recipe. Now that you have defined the desired partitioning scheme directly downstream from the Sync recipe, you should see a checkbox labeled “Redispatch partitioning according to input columns” in the UI of the Sync recipe.
Check the box labeled “Redispatch partitioning according to input columns” and click Run. This kicks off the job that takes input dataset `original_dataset_prepared` and updates the on-disk layout of output dataset `original_dataset_prepared_copy` to conform to the partitioning scheme defined in STEP 3.
At the end of STEP 4, the Flow looks the same as at the end of STEP 3, with the only difference being that `original_dataset_prepared_copy` is now a partitioned dataset (i.e. if you look at the dataset on disk you will notice a directory structure based on the partitioning scheme, and the partitioning columns have been dropped from the output dataset).
At the end of steps 1-4 your Flow should look similar to this:
Note:
- We preserved the partitioning columns in the output dataset by making copies of them in the upstream Prepare recipe.
- Even if you did not copy the partition key columns you would still be able to access the partition key columns for the purpose of performing operations on specific partitions of data.
- I recommend this free course on DSS partitioning: https://academy.dataiku.com/path/advanced-designer/advanced-partitioning
---
Specifically addressing your constraints:
- You should be able to use the redispatch function in a Sync recipe when the dataset directly downstream from the Sync recipe has partitioning enabled.
- You can prevent the partition key columns from disappearing from the final output dataset by creating copies of those columns via the Prepare recipe. Note: even if you did not copy the partition key columns, you would still be able to access them for the purpose of performing operations on specific partitions of data (a code sketch follows this list).
- I’m not sure what levers (if any) you’ll have for improving performance.
- What’s the schema of the original dataset and how many rows are in it?
- Which storage are you using for the input dataset?
- Which storage are you using for the output dataset?
- How long does the redispatch job take to complete?
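As a small illustration of the point about operating on specific partitions, here is a hedged notebook sketch for reading a single (ID, month) partition of the redispatched dataset (the partition values shown are invented):

```python
import dataiku

# Hypothetical names: the redispatched dataset and one (ID, month) partition.
ds = dataiku.Dataset("original_dataset_prepared_copy")
print(ds.list_partitions())              # e.g. ['pluvio_001|2020-01', ...]

# With several dimensions, a partition identifier joins the dimension values
# with '|', in the order they were defined (discrete ID first, then the month).
ds.add_read_partitions("pluvio_001|2020-01")   # notebook use; recipes get partitions from the Flow
df = ds.get_dataframe()
```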
- Hello Dataikuers, I'm trying to call a model that was deployed on my flow. To do so, I created a notebook where I generated new data and made a call to the model as follows: #df: is a dataframe that i…
Last answer by Alexandru
Hi @saraa1,
You can use a scoring recipe for this: https://knowledge.dataiku.com/latest/courses/scoring/scored-results/scored-results-summary.html
Is there a particular reason you are looking at doing this from a Notebook instead?
Thanks,
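If a notebook really is required (for example to experiment before setting up the scoring recipe), here is a minimal sketch using the saved-model Python API, with made-up identifiers:

```python
import dataiku

# Hypothetical identifiers -- replace with your saved model id and dataset name.
model = dataiku.Model("my_saved_model_id")
predictor = model.get_predictor()

# The dataframe must carry the same feature columns the model was trained on.
df = dataiku.Dataset("new_data_to_score").get_dataframe()

# predict() returns the predictions (plus probabilities for classification).
scored = predictor.predict(df)
print(scored.head())
```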