Refresh partitions in DSS via API
Tomas
Hi,
We added a new dataset to the project via the Python API and pointed it at an existing HDFS location where partition folders are stored (that location is managed by another DSS instance). This kind of "import" of a read-only dataset works, but I did not find a way to "refresh" the list of partitions: when a new folder is created from outside Dataiku, I would like to trigger the partition refresh/discovery, similar to the button in the UI. I would also like to compute metrics.
Thanks
# copy the relevant params from the source dataset and point at the existing HDFS path
ds_params['hiveTableName'] = d['params'].get('hiveTableName').replace('${projectKey}', ip['projectKey'])
ds_params['path'] = import_prj.project_key + '/' + d['name'] + '/data/'
ds_params['metastoreSynchronizationEnabled'] = False
newds_name = d['name'] + '_prod'
...
logging.info(' > creating new dataset ' + newds_name)
# create the read-only HDFS dataset in the target project
newds = prj.create_dataset(newds_name, 'HDFS', params=ds_params, formatType=d['formatType'], formatParams=ds_formatparams)
# copy partitioning, schema and tags from the source dataset definition
newdef = newds.get_definition()
newdef['partitioning'] = ds_partitioning
newdef['schema'] = dict(d['schema'])
newdef['tags'] = set_src_project_key(newdef['tags'], import_prj.project_key)
newds.set_definition(newdef)
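For context, ds_partitioning is not shown above. A minimal sketch of what a time-based partitioning definition for an HDFS dataset could look like; the dimension name 'day' and the filePathPattern value are illustrative assumptions, not taken from the original post:

# hypothetical partitioning definition for a daily-partitioned HDFS dataset
# (dimension name and filePathPattern are assumptions for illustration only)
ds_partitioning = {
    'dimensions': [
        {'name': 'day', 'type': 'time', 'params': {'period': 'DAY'}}
    ],
    'filePathPattern': '%Y-%M-%D/.*'
}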
Answers
Hi,
There is no notion of refreshing a list of partitions in DSS. It is not like the metastore, where the partitions must be declared: DSS scans the source to find the partitions whenever needed.
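In practice that means no explicit "refresh" call is needed: once the partitioning is set on the dataset definition, the public API scans the HDFS path when you ask for partitions. A minimal sketch using the public Python client; the host URL, API key, project key and dataset name are placeholders:

import dataikuapi

# placeholders - replace with your instance URL, API key, project and dataset
client = dataikuapi.DSSClient('https://dss.example.com:11200', 'YOUR_API_KEY')
prj = client.get_project('MYPROJECT')
ds = prj.get_dataset('mydataset_prod')

# DSS scans the HDFS location on demand, so this returns the partitions currently on disk
partitions = ds.list_partitions()
print(partitions)

# compute the configured metrics per partition (use partition='' for a non-partitioned run)
for p in partitions:
    ds.compute_metrics(partition=p)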