Using Dataiku

21 - 30 of 853
  • Hello, I'm using file-based partitioning, whereby the partitions are derived from the file name: Filename_YYYY_MM_DD/<files>. The partition pattern: /Filename_%Y-%M-%D/.* So far so good. Normally I r…
    Question
    Started by Henk
    1
  • I'm working with several partitioned datasets. I've run into a problem that data in one of the partitions is partially corrupt. (Lots of extra spaces added to a field.) Going forward, I can put steps …
    Answered ✓
    Started by tgb417
    Most recent by tgb417
    0
    5
    Solution by tgb417

    Looks like I found an answer.

    One can use the Enrich records with files info processor as a step in a Prepare recipe.

    However, this processor does not seem to work in a Lab visual analysis.


  • Dear all, The task is to export multiple datasets into Dataiku. The original data includes three different dataframes (20 each) within a folder. I need to have them all separately, 60 files. How …
    Answered ✓
    Started by SeherFazlioglu
    Most recent by SeherFazlioglu
    1
    2
    Solution by Álvaro Andrés

    Hello Seher,

    If I understand correctly, you have 20 different files in 3 different folders stored outside of DSS. If you would like to have these 60 files in one location, you can use a managed folder to upload them to DSS. After that, you can either create one dataset for multiple files or one dataset per file. The following link is a tutorial on how to do this:

    https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html#create-a-files-in-folder-dataset
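
    As an aside, if there are many files, here is a minimal sketch of pushing them into a managed folder with the dataiku Python API (the folder name and local paths below are hypothetical):

    ```python
    import dataiku

    # Hypothetical managed folder name/ID and local paths; replace with your own
    folder = dataiku.Folder("uploaded_files")

    local_files = [
        "/data/folder_a/df1.csv",
        "/data/folder_b/df2.csv",
    ]

    for path in local_files:
        # upload_stream() writes a file-like object into the managed folder
        with open(path, "rb") as f:
            folder.upload_stream(path.split("/")[-1], f)
    ```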

    Another option is creating a file-based dataset; in this case, you should activate partitioning and define a dimension identifier that matches your folder structure:

    https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html#partitioning-files-based-datasets

    I've uploaded 2 examples of file-based datasets using an S3 folder:

    • time_partition.png: Includes 3 files in 3 different folders and it's partitioned using a time dimension identifier
    • dimension_partition.png: Includes multiple files in 3 different folders using a discrete dimension identifier

    https://doc.dataiku.com/dss/latest/partitions/identifiers.html#partition-identifiers

    BR,

    Álvaro

  • We have been using partitions in order to archive data but we started noticing some weird behaviours, as if the partitioning configuration was being ignored when updating our flow. Here is a workflow …
    Answered ✓
    Started by Tanguy
    Most recent by CoreyS
    0
    3
    Solution by Tanguy

    Thanks to Dataiku's support team (thank you Clément Stenac), I finally understood why Dataiku was behaving this way.

    TL;DR:
    With a recursive build, Dataiku indeed ignores the partition configuration in the Sync recipe. Instead, Dataiku relies on its dependency management, which, in this case, leads to rebuilding all the partitions.

    In detail:
    The last recipe builds a non-partitioned (NP) dataset from a partitioned (P) dataset. In this configuration, Dataiku by default applies the "ALL AVAILABLE" dependency between the P dataset and the NP dataset (which, at least in this case, makes sense as I want to retrieve all the partitions).

    See the following screenshot to visualize the dependency function between the P dataset and the NP last dataset:
    14.jpg

    So when asking Dataiku to build the last dataset recursively (e.g. with a smart build), it will seek to rebuild all partitions in the previous dataset.
    15.jpg

    So when the build arrives at the Sync recipe, the partitions will be rebuilt using the current version of the first dataset (recall that the redispatch option is deactivated in the Sync recipe, so the NP input dataset will overwrite each partition of the P output dataset).

    Apart from using Dataiku's API (as I have in my other reply), one can prevent rebuilding the P dataset recursively by setting its build behaviour to "explicit". However, this will prevent Dataiku from updating the flow from the P dataset.

    Note that when resolving the flow dependencies, Dataiku restricts itself to existing partitions, so it will not try to build a new partition (as configured in the Sync recipe).
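
    For reference, a minimal sketch of the API approach mentioned above: building a single, explicitly named partition so that dependency management does not rebuild everything (dataset and partition names are hypothetical):

    ```python
    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Build only the named partition, non-recursively, so upstream
    # datasets are not rebuilt by dependency management
    dataset = project.get_dataset("P_dataset")
    dataset.build(job_type="NON_RECURSIVE_FORCED_BUILD",
                  partitions="2021-05-01")
    ```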

  • Hi, I am working on data in S3 which is partitioned by timestamp on filename. I need to repartition the data using a column value as the files contain unordered timestamp data. I tried redispatch part…
    Question
    Started by gt
    Most recent by gt
    0
    2
    Last answer by gt

    Hi @SarinaS,

    Thank you for your response.

    In this case, the smallest partitioned slice I can read each time is hourly data. Repartitioning each hour's data (~7M records) is also slow when using DSS, so I tried using PySpark and it performs better.
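
    For readers hitting the same issue, a minimal PySpark sketch of this kind of column-based repartitioning (the bucket, paths, and event_date column are hypothetical):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read one hourly slice of the filename-partitioned data
    df = spark.read.parquet("s3://bucket/raw/2021-05-01-00/")

    # Rewrite it laid out by a column value instead of the filename timestamp
    (df.write
       .partitionBy("event_date")
       .mode("append")
       .parquet("s3://bucket/repartitioned/"))
    ```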

    Best,

    Gowtham.

  • Hi, I am trying to read a partitioned dataset using Python. I got a list of partitions using the following code. But I do not know how to read those partitions one by one as dataframe. mydataset = dat…
    Question
    Started by ankitmat45
    Most recent by Skanda Gurunathan
    0
    3
    Last answer by Skanda Gurunathan

    Why is list_partitions() taking so much time? How does Dataiku handle this internally?

    Say I want to read only the latest partition; is there any way to get that quickly?
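
    For reference, a minimal sketch of reading partitions one at a time from a DSS notebook (the dataset name is hypothetical, and add_read_partitions() can only be called outside of recipes):

    ```python
    import dataiku

    partitions = dataiku.Dataset("mydataset").list_partitions()

    for p in partitions:
        ds = dataiku.Dataset("mydataset")  # fresh handle per partition
        ds.add_read_partitions(p)          # restrict reads to this partition
        df = ds.get_dataframe()
        print(p, df.shape)

    # For a single time dimension, identifiers sort lexicographically,
    # so the latest partition is simply max(partitions)
    ```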

  • Hello, I want to run the scenario every week for partitioning. My dataset consists of weeks. If I set the partitioning period to "2021-05-01/2021-07-01" like this, I get empty partitions. Can I not cr…
    Question
    Started by yunho
    0
  • Hi Team, What I am trying to accomplish is to run a couple of column updates in a PostgreSQL after receiving new data, however, every time I try to use a new recipe, it's trying to create a new datase…
    Question
    Started by Juan Carlos
    Most recent by Alexandru
    0
    1
    Last answer by Alexandru

    Hi @Aray,

    What changes are you looking to perform? Are you looking at using a SQL recipe or Python?

    You can create a Python recipe with a dummy output, e.g. a folder (to which you will not write anything), then read the input dataset and modify it in the code.

    You can also use Python steps or SQL steps in a scenario.
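
    If the updates are pure SQL, one possible sketch from Python uses SQLExecutor2, passing the UPDATE as a pre-query so no new dataset is created (the connection and table names are hypothetical):

    ```python
    import dataiku
    from dataiku import SQLExecutor2

    # Hypothetical PostgreSQL connection name
    executor = SQLExecutor2(connection="my_postgres")

    # DML goes in pre_queries; the trailing SELECT just returns a result set
    executor.query_to_df(
        "SELECT 1 AS done",
        pre_queries=[
            "UPDATE my_table SET status = 'processed' WHERE status IS NULL",
            "COMMIT",
        ],
    )
    ```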

  • Hi, I have a dataset containing hourly rain data that is discretely partitioned following IDs of pluviometers (1000 IDs). I'd like to sub-partition this dataset by month (i.e. : /%{ID}/%Y/%M/.*), but …
    Question
    Started by ACB
    Most recent by MikeG
    0
    1
    Last answer by MikeG

    Hello @ACB,

    It is possible to implement a partitioning scheme involving a discrete dimension followed by a time dimension.

    The initial state I’m envisioning is a dataset that was uploaded from a single csv file containing the pluviometer data.

    From this initial state here are the steps you would need to perform to implement DSS-based partitioning on this dataset using a discrete dimension (ID) followed by a time dimension (month).

    STEP 1 - Add a Prepare recipe (https://doc.dataiku.com/dss/latest/other_recipes/prepare.html) to the original dataset. In the prepare recipe do the following:

    1. Use parse date processor (https://doc.dataiku.com/dss/latest/preparation/dates.html?#parse-date-processor) to parse the date column you wish to use in the partition key (this will allow for time-based partitioning on the date column)
    2. Use the copy column processor (https://doc.dataiku.com/dss/latest/preparation/processors/column-copy.html) to create copies of the partition key columns (i.e. create copies of the ID column and the date column that will be used as the partition key). This step is necessary to preserve the partition key columns in the output dataset after partitioning is performed on the original dataset. Doing this will address your constraint about the columns being dropped from the output dataset.

    At end of STEP 1 the Flow looks like: `original_dataset` -> `prepare recipe` -> `original_dataset_prepared`

    STEP 2 - Add a Sync recipe (https://doc.dataiku.com/dss/latest/other_recipes/sync.html) to `original_dataset_prepared` to create a copy of `original_dataset_prepared`

    At the end of STEP 2 the Flow looks like: `original_dataset` -> `Prepare recipe` -> `original_dataset_prepared` -> `Sync recipe` -> `original_dataset_prepared_copy`

    STEP 3 - Define the desired partitioning scheme. Access `original_dataset_prepared_copy` in the Flow.
    Then go to `Settings > Partitioning` and implement the partitioning scheme of ID followed by MONTH:

    1. Click Activate Partitioning
    2. Add a discrete dimension and choose the name of the ID column
    3. Add a time dimension and choose the name of the date column
    4. Pattern field should look like: `%{ID_column_name}/%Y/%M/.*`

    partitioning.png

    At the end of STEP 3 the Flow is the same as at the end of STEP 2.

    STEP 4 - Go back to the Sync recipe. Now that you have defined a desired partitioning scheme directly downstream from the Sync recipe, you should see a checkbox labeled “Redispatch partitioning according to input columns” in the UI of the Sync recipe.

    Check the box labeled “Redispatch partitioning according to input columns” and click Run. This will kick off the job that takes input dataset `original_dataset_prepared` and updates the layout on disk of output dataset `original_dataset_prepared_copy` to conform to the partitioning scheme defined in STEP 3.

    redispatch.png

    At the end of STEP 4 the Flow looks the same as at the end of STEP 3, with the only difference being that `original_dataset_prepared_copy` is now a partitioned dataset (i.e. if you look at the dataset on disk you will notice a directory structure based on the partitioning scheme, and the partitioning columns have been dropped from the output dataset). A quick verification sketch follows at the end of this answer.

    At the end of steps 1-4 your Flow should look similar to this:

    flow.png

    Note:

    • We preserved the partitioning columns in the output dataset by making copies of them in the upstream Prepare recipe.
    • Even if you did not copy the partition key columns you would still be able to access the partition key columns for the purpose of performing operations on specific partitions of data.
    • I recommend this free course on DSS partitioning: https://academy.dataiku.com/path/advanced-designer/advanced-partitioning

    ---

    Specifically addressing your constraints:

    • You should be able to use the redispatch function in a Sync recipe when the dataset directly downstream from the Sync recipe has partitioning enabled.
    • You can prevent the partition key columns from disappearing from the final output dataset by creating copies of those columns via the Prepare recipe. Note: even if you did not copy the partition key columns, you would still be able to access them for the purpose of performing operations on specific partitions of data.
    • I’m not sure what levers (if any) you’ll have for improving performance.
      • What’s the schema of the original dataset and how many rows are in it?
      • Which storage are you using for the input dataset?
      • Which storage are you using for the output dataset?
      • How long does the redispatch job take to complete?
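
    As the quick verification sketch promised above, this lists the resulting partition identifiers from a notebook (dataset name taken from the steps above):

    ```python
    import dataiku

    ds = dataiku.Dataset("original_dataset_prepared_copy")
    # With a discrete ID dimension plus a month time dimension, identifiers
    # look like "<ID>|<YYYY-MM>", e.g. "42|2021-05"
    print(ds.list_partitions())
    ```
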
  • Hello Dataikuers, I'm trying to call a model that was deployed on my flow. To do so, I created a notebook where I generated new data and made a call to the model as follows: #df: is a dataframe that i…
    Question
    Started by saraa1
    Most recent by Alexandru
    0
    1
    Last answer by Alexandru

    Hi @saraa1,

    You can use a scoring recipe for this: https://knowledge.dataiku.com/latest/courses/scoring/scored-results/scored-results-summary.html

    Is there a particular reason you are looking at doing this from a Notebook instead?
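
    If the notebook route is still needed, a minimal sketch using the saved model's predictor from the dataiku package (the model name is hypothetical; df is the new-data dataframe from the question):

    ```python
    import dataiku

    # Look up the saved model deployed in the Flow (by name or ID)
    model = dataiku.Model("my_saved_model")
    predictor = model.get_predictor()

    # df: dataframe with the same feature columns the model was trained on
    predictions = predictor.predict(df)
    ```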

    Thanks,
