Using Dataiku
- Suppose I have just put a monthly batch pipeline in production. I want to run the pipeline as if it were triggered in January 2019 (i.e., so it uses January 2019 data to make predictions for February). How…

Last answer by jereze: Hi, did you try using partitions?
https://doc.dataiku.com/dss/latest/partitions/index.html
https://blog.dataiku.com/partition-data-easily

- Hi all, the partition variables such as CURRENT_DAY are apparently available, but it is impossible to find out how to get 1_DAYS_BEFORE to work in SQL recipes for manual partition specification. The log sh…

Last answer by AdrienL: Please download a scenario diagnostic (from the scenario page, under the "Last runs" tab, via the "download diagnostic" link at the top of the page) and open a support ticket (from the "?" top-right menu > Get Help), including your diagnostic and a link to this page.
- Hi, I had a reporter that worked marvels until I started keeping historical records through partitioning. I cannot find a way to dynamically parse the ever-changing JSON. What was once: ${(filter(pa…

Last answer by Alex_Combessie: Hello,
For advanced usage of metrics, such as with partitioning, I would not advise using the ${} magic you were using before partitioning.
Instead, you can add a Python step to your scenario which retrieves the value of your metric for your CURRENT_DAY partition and stores it in a project variable called "myvar". You can then use ${myvar} in your reporter. That way you have the full flexibility and simplicity of the Dataiku public API in your Python step. See https://doc.dataiku.com/dss/latest/python-api/rest-api-client/metrics-and-checks.html for reference. You can easily prototype it first in a Python notebook, and then paste it into the Python scenario step.
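As a minimal sketch of this pattern: the dataiku calls in the comment below are recalled from the public API and should be verified against the docs linked above; the runnable part only illustrates the ${myvar} substitution a reporter performs once the variable is set (metric id and variable name are illustrative).

```python
# Inside a DSS Python scenario step, the pattern would look roughly like
# (recalled from the public API -- verify against the linked docs):
#
#   import dataiku
#   client = dataiku.api_client()
#   project = client.get_default_project()
#   metrics = project.get_dataset("my_dataset").get_last_metric_values()
#   value = metrics.get_metric_by_id("records:COUNT_RECORDS")  # example metric id
#   variables = project.get_variables()
#   variables["standard"]["myvar"] = value
#   project.set_variables(variables)
#
# Outside DSS, we can still illustrate the ${myvar} expansion the reporter
# performs once the project variable is set:
import re

def expand_variables(template: str, variables: dict) -> str:
    """Replace ${name} placeholders the way a DSS reporter message would;
    unknown names are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )
```

A quick check in a notebook: `expand_variables("Row count: ${myvar}", {"myvar": 42})` yields `"Row count: 42"`.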
Hope it helps,
Alex

- Hi, we have added, via the Python API, a new dataset to the project, pointing it to an existing location in HDFS where partition folders are stored (this location is managed by another DSS instance).…
- I am looking for a bit of advice here. Technically, I'm handling data sources between 5 and 20 GB. Let's say you have 10 customers, each of which has about 5 data pipelines that run every 10 minutes. Eac…

Last answer by Alex_Combessie: In order to scale beyond the Dataiku machine, you will need an external Kubernetes cluster. This will allow you to push all Python jobs (including Dataiku visual ML with a Python backend) to containers running in your Kubernetes cluster. The details of the integration are explained here: https://doc.dataiku.com/dss/latest/apinode/kubernetes/index.html
- Hi, I am using HDFS datasets in my workflow which are updated on a daily basis, and I would like to find out whether these daily changes can be tracked by DSS and saved in a separate "delta" file through a…

Last answer by Clément_Stenac: Hi,
This needs some work but can be achieved using scenarios and partitioning.
You would have your "stock" dataset (not partitioned) and a changes dataset, partitioned by day. You will need to create a code recipe that takes the changes dataset as both input and output, but with a partition dependency that says:
"to compute day N of the changes dataset, I use the stock dataset and day N-1 of the changes dataset" (use the "Time range" dependency).
Then your recipe does the actual computation.
An important point is that you should not run this recipe in "recursive" mode, because this would recurse until the big bang (since to compute day N-1, you need day N-2, which needs day N-3, ...).
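The core of such a recipe can be sketched outside DSS as a plain snapshot diff. Everything below is illustrative (record shapes and names are made up); in DSS, the partition wiring is handled by the recipe's dependency settings, and the recipe body only has to compute the delta:

```python
# Illustrative sketch: derive the day-N "changes" partition by diffing the
# current stock snapshot against the state known as of day N-1.
# Keys identify records; values are the record contents.

def diff_snapshots(previous: dict, current: dict) -> dict:
    """Return added / removed / updated records between two keyed snapshots."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = {k: previous[k] for k in previous if k not in current}
    updated = {k: current[k] for k in current
               if k in previous and previous[k] != current[k]}
    return {"added": added, "removed": removed, "updated": updated}
```

For example, diffing `{"a": 1, "b": 2}` against `{"b": 3, "c": 4}` reports `c` as added, `a` as removed, and `b` as updated.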
This can then be automated using a time-based trigger, since you expect your files to change daily (note that this requires a professional edition of DSS).

- Hi, so I want to create a scenario with a partitioned dataset and I want to select all partitions. However, this is not possible: the OK button is greyed out, even though the description says that I can leav…
- Hi, I have a partitioned dataset in Dataiku and I want to export it to a set of CSVs, with one CSV file per partition. Is this possible in Dataiku? Best, John
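The accepted solution is not shown in this excerpt. As a generic illustration only (not Dataiku-specific, all names made up), splitting a table into one CSV per partition value with the standard library might look like:

```python
# Generic sketch: write one CSV file per distinct value of a partition column.
import csv
import os
from collections import defaultdict

def export_per_partition(rows, partition_key, out_dir, fieldnames):
    """Write one CSV per distinct value of `partition_key`; return {value: path}."""
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[partition_key]].append(row)
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for partition, part_rows in by_partition.items():
        path = os.path.join(out_dir, f"{partition}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part_rows)
        paths[partition] = path
    return paths
```

Within DSS itself, the native export and partition-identifier variables are worth checking first before scripting anything like this.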
- Hi, I've got a partitioned dataset with IDs in one column. The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions. I then want to group this datase…

Solution by Alex_Combessie:
Hello,
All visual recipes in DSS operate on the full data, without sampling. The sampling applied when you visualize a dataset or work in a visual "Prepare" recipe is only there so that you can prototype and understand your data quickly; when you actually run a recipe, it is applied to the full data.
Hence, if you create a "Group" recipe taking the sum of "transactions" by "ID", it will compute what you want. From your example, I understand that you want the output of this group recipe to be non-partitioned; make sure you select this option when creating the output dataset.
Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.
Note that if you wanted to perform the sum of transactions by ID separately on each partition, you could instead use a partitioned output and the "Equals" partition dependency setting in the same tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html
Hope it helps,
Alex
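What the Group recipe computes here can be mimicked in plain Python. The column names "ID" and "transactions" come from the question; the rest is illustrative:

```python
# Sketch of a "sum of transactions by ID" aggregation across all rows,
# regardless of which partition each row came from (this is what the
# "All available" dependency achieves: every partition feeds the recipe).
from collections import defaultdict

def group_sum(rows, key="ID", value="transactions"):
    """Sum `value` per `key`; IDs missing from some partitions simply
    contribute fewer rows to their total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)
```

For example, rows `[{"ID": "a", "transactions": 1.0}, {"ID": "b", "transactions": 2.0}, {"ID": "a", "transactions": 3.0}]` group to `{"a": 4.0, "b": 2.0}`.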
- Hi, I get this message from a Hive recipe on a partitioned dataset stored on HDFS: validation failed: Cannot insert into target table because number/types are different "2018-02": Table inclause-0 has…