Set up a scenario that runs whenever a data change is detected by a Hive query on a Hive database.
I have a Hive server connection. Whenever the data team adds data to a table on the Hive server, the query result in Dataiku changes. Whenever data is added, I want every dataset to be updated and rebuilt with the missing (newly added) values.
Please share a visual walkthrough if possible, rather than a conceptual paragraph.
Thanks in advance
Answers
-
Hi,
There is a "trigger on SQL query change" trigger type for starting such scenarios. Typically, a query like select count(*) from ... is used to detect when rows are added to a table. However, your wording seems to imply that you want DSS to process only the new rows rather than the full table, which is not really possible without partitioning. Is that the case?
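To illustrate the idea behind a query-change trigger, here is a minimal Python sketch. It is not how DSS implements the trigger: it uses an in-memory SQLite table as a stand-in for the Hive table, and the names row_count and QueryChangeDetector are hypothetical. It also assumes that the first poll only records a baseline and does not count as a change.

```python
import sqlite3


def row_count(conn, table):
    """Run the detection query and return the single COUNT(*) value."""
    # The table name is interpolated directly, so pass trusted identifiers only.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


class QueryChangeDetector:
    """Remembers the last query result and reports when it differs."""

    def __init__(self):
        self.last_result = None

    def has_changed(self, result):
        # Assumption: the first observation only sets the baseline.
        changed = self.last_result is not None and result != self.last_result
        self.last_result = result
        return changed


# Demo with an in-memory SQLite table standing in for the Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_in_hive (id INTEGER, value TEXT)")
detector = QueryChangeDetector()

first_poll = detector.has_changed(row_count(conn, "data_in_hive"))   # baseline
conn.execute("INSERT INTO data_in_hive VALUES (1, 'a')")
second_poll = detector.has_changed(row_count(conn, "data_in_hive"))  # row added
third_poll = detector.has_changed(row_count(conn, "data_in_hive"))   # no change
print(first_poll, second_poll, third_poll)
```

Note that a plain row count only notices a change in the number of rows. An update in place, or an equal number of inserts and deletes between two polls, would go undetected.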
-
Yes, we are not partitioning the dataset, and we need to add only the new data. Dropping and reloading the whole dataset takes a long time to execute.
-
I've attached a project with an example of a scenario that rebuilds a flow.
The Hive table is named data_in_hive and is fed by the editable dataset + sync recipe (this is just a test harness, to make it easier to update the Hive table's contents).
The Hive table is read as the Hive dataset data_from_hive, which is then used by the flow.
The scenario watches the row count in data_in_hive and, whenever it changes:
- force-rebuilds the first dataset after the Hive dataset
- smart-rebuilds the rest of the flow, i.e. the scenario asks DSS to rebuild the output dataset and any intermediate datasets that are needed
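The two steps above can also be sketched in Python. This is only a sketch under assumptions: the dataset names first_prepared and final_output are hypothetical (the attached project's names are not given here), RecordingScenario is a made-up stand-in so the code runs outside DSS, and the build-mode strings are the ones I recall from the DSS scenario Python API, so check them against your DSS version.

```python
class RecordingScenario:
    """Hypothetical stand-in that records build requests outside DSS."""

    def __init__(self):
        self.calls = []

    def build_dataset(self, dataset_name, build_mode="RECURSIVE_BUILD"):
        self.calls.append((dataset_name, build_mode))


def rebuild_flow(scenario, first_dataset, output_dataset):
    """Mirror the two scenario steps: forced build, then smart rebuild."""
    # Step 1: force-rebuild only the first dataset after the Hive dataset,
    # so the new rows are picked up even if DSS considers it up to date.
    scenario.build_dataset(first_dataset, build_mode="NON_RECURSIVE_FORCED_BUILD")
    # Step 2: smart rebuild of the output; the scenario object decides which
    # intermediate datasets are out of date and rebuilds those too.
    scenario.build_dataset(output_dataset, build_mode="RECURSIVE_BUILD")


# Demo run with the stand-in object and hypothetical dataset names.
demo = RecordingScenario()
rebuild_flow(demo, "first_prepared", "final_output")
print(demo.calls)
```

Inside DSS, a custom Python scenario step would presumably pass Scenario() from dataiku.scenario instead of the stand-in. The same two steps can be configured visually as two "Build" steps in the scenario, which is what the attached project does.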