Using Dataiku
- Suppose I have just put a monthly batch pipeline in production. I want to run the pipeline as if it were triggered in January 2019 (i.e., so it uses January 2019 data to make predictions for February). How…

Last answer by jereze: Hi, did you try using partitions?
https://doc.dataiku.com/dss/latest/partitions/index.html
https://blog.dataiku.com/partition-data-easily

- Hi all, the partition variables such as CURRENT_DAY are apparently available, but it is impossible to find out how to get 1_DAYS_BEFORE to work in SQL recipes for manual partition specification. The log sh…

Last answer by AdrienL: Please download a scenario diagnostic (from the scenario page, under the "Last runs" tab, via the "download diagnostic" link at the top of the page) and open a support ticket (from the "?" top-right menu > Get Help), including your diagnostic and a link to this page.
- Hi, I had a reporter that worked marvels until I started keeping historical records through partitioning. I cannot find a way to dynamically parse the ever-changing JSON. What was once: ${(filter(pa…

Last answer by Alex_Combessie: Hello,
For advanced usage of metrics, such as with partitioning, I would not advise using the ${} magic you were using before partitioning.
Instead, you can add a Python step to your scenario which retrieves the value of your metric for your CURRENT_DAY partition and stores it in a project variable called "myvar". You can then use ${myvar} in your reporter. That way you have the full flexibility and simplicity of the Dataiku public API in your Python step. See https://doc.dataiku.com/dss/latest/python-api/rest-api-client/metrics-and-checks.html for reference. You can easily prototype it first in a Python notebook, and then paste it into the Python scenario step.
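As a minimal sketch of this pattern: the dataiku calls in the comment below are recalled from the public API and should be verified against the docs linked above; the runnable part only illustrates the ${myvar} substitution a reporter performs once the variable is set (metric id and variable name are illustrative).

```python
# Inside a DSS Python scenario step, the pattern would look roughly like
# (recalled from the public API -- verify against the linked docs):
#
#   import dataiku
#   client = dataiku.api_client()
#   project = client.get_default_project()
#   metrics = project.get_dataset("my_dataset").get_last_metric_values()
#   value = metrics.get_metric_by_id("records:COUNT_RECORDS")  # example metric id
#   variables = project.get_variables()
#   variables["standard"]["myvar"] = value
#   project.set_variables(variables)
#
# Outside DSS, we can still illustrate the ${myvar} expansion the reporter
# performs once the project variable is set:
import re

def expand_variables(template: str, variables: dict) -> str:
    """Replace ${name} placeholders the way a DSS reporter message would;
    unknown names are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )
```

A quick check in a notebook: `expand_variables("Row count: ${myvar}", {"myvar": 42})` yields `"Row count: 42"`.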
Hope it helps,
Alex

- Hi, we have added, via the Python API, a new dataset to the project, pointing it to an existing location in HDFS where partition folders are stored (this location is managed by another DSS instance).…
- I am looking for a bit of advice here. Technically, I'm handling data sources between 5 and 20 GB. Let's say you have 10 customers, each of which has about 5 data pipelines that run every 10 minutes. Eac…

Last answer by Alex_Combessie: In order to scale beyond the Dataiku machine, you will need an external Kubernetes cluster. This will allow you to push all Python jobs (including Dataiku visual ML with a Python backend) to containers running in your Kubernetes cluster. The details of the integration are explained here: https://doc.dataiku.com/dss/latest/apinode/kubernetes/index.html
- Hi, I am using HDFS datasets in my workflow which are updated on a daily basis, and I would like to find out whether these daily changes can be tracked by DSS and saved in a separate "delta" file through a…

Last answer by Clément_Stenac: Hi,
This needs some work but can be achieved using scenarios and partitioning.
You would have your "stock" dataset (not partitioned) and a changes dataset, partitioned by day. You will need to create a code recipe that takes the changes dataset as both input and output, but with a partition dependency that says:
"to compute day N of the changes dataset, I use the stock dataset and day N-1 of the changes dataset" (use the "Time range" dependency).
Then your recipe does the actual computation.
An important point is that you should not run this recipe in "recursive" mode, because this would recurse until the big bang (since to compute day N-1, you need day N-2, which needs day N-3, ...).
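The core of such a recipe can be sketched outside DSS as a plain snapshot diff. Everything below is illustrative (record shapes and names are made up); in DSS, the partition wiring is handled by the recipe's dependency settings, and the recipe body only has to compute the delta:

```python
# Illustrative sketch: derive the day-N "changes" partition by diffing the
# current stock snapshot against the state known as of day N-1.
# Keys identify records; values are the record contents.

def diff_snapshots(previous: dict, current: dict) -> dict:
    """Return added / removed / updated records between two keyed snapshots."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = {k: previous[k] for k in previous if k not in current}
    updated = {k: current[k] for k in current
               if k in previous and previous[k] != current[k]}
    return {"added": added, "removed": removed, "updated": updated}
```

For example, diffing `{"a": 1, "b": 2}` against `{"b": 3, "c": 4}` reports `c` as added, `a` as removed, and `b` as updated.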
This can then be automated using a time-based trigger, since you expect your files to change daily (note that this requires a professional edition of DSS).

- Hi, so I want to create a scenario with a partitioned dataset and I want to select all partitions. However, this is not possible: the OK button is greyed out, even though the description says that I can leav…
- Hi, I have a partitioned dataset in Dataiku and I want to export it to a set of CSVs, with one CSV file per partition. Is this possible in Dataiku? Best, John
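The accepted solution is not shown in this excerpt. As a generic illustration only (not Dataiku-specific, all names made up), splitting a table into one CSV per partition value with the standard library might look like:

```python
# Generic sketch: write one CSV file per distinct value of a partition column.
import csv
import os
from collections import defaultdict

def export_per_partition(rows, partition_key, out_dir, fieldnames):
    """Write one CSV per distinct value of `partition_key`; return {value: path}."""
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[partition_key]].append(row)
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for partition, part_rows in by_partition.items():
        path = os.path.join(out_dir, f"{partition}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part_rows)
        paths[partition] = path
    return paths
```

Within DSS itself, the native export and partition-identifier variables are worth checking first before scripting anything like this.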
- Hi, I've got a partitioned dataset with IDs in one column. The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions. I then want to group this datase…

Solution by Alex_Combessie:
Hello,
All visual recipes in DSS operate on the full data, without sampling. The sampling applied when you visualize a dataset or work in a visual "Prepare" recipe is only there so that you can prototype and understand your data quickly; when you actually run a recipe, it is applied to the full data.
Hence, if you create a "Group" recipe taking the sum of "transactions" by "ID", it will compute what you want. From your example, I understand that you want the output of this group recipe to be non-partitioned; make sure you select this option when creating the output dataset.
Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.
Note that if you wanted to perform the sum of transactions by ID separately on each partition, you could instead use a partitioned output and the "Equals" partition dependency setting in the same tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html
Hope it helps,
Alex
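What the Group recipe computes here can be mimicked in plain Python. The column names "ID" and "transactions" come from the question; the rest is illustrative:

```python
# Sketch of a "sum of transactions by ID" aggregation across all rows,
# regardless of which partition each row came from (this is what the
# "All available" dependency achieves: every partition feeds the recipe).
from collections import defaultdict

def group_sum(rows, key="ID", value="transactions"):
    """Sum `value` per `key`; IDs missing from some partitions simply
    contribute fewer rows to their total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)
```

For example, rows `[{"ID": "a", "transactions": 1.0}, {"ID": "b", "transactions": 2.0}, {"ID": "a", "transactions": 3.0}]` group to `{"a": 4.0, "b": 2.0}`.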
- Hi, I get this message from a Hive recipe on a partitioned dataset stored on HDFS: validation failed: Cannot insert into target table because number/types are different "2018-02": Table inclause-0 has…