Starting a project with a Python script that generates data: is it possible?

Solved!
gto
Level 1

Hello everyone, I'm new to Dataiku.

I developed a Python script that collects data from different sources (web scraping, local files...) and generates a pandas DataFrame, on which I then perform my analysis.

I would like to move this project into Dataiku. But when I start a project, I need a dataset, and I don't have one yet.

Question 1: Is it possible to start a flow with my Python script so that it generates a dataset?
Question 2: If not, can I start my project with an empty dataset, then add Python code that fills the dataset, then reload the dataset?

Thank you for your help!


Operating system used: Ubuntu 22.04

1 Solution
PaulK
Dataiker

Hello @gto,

Yes, it is possible to start a flow with a Python script.
In your project, at the top right of the Flow view, select +RECIPE > CODE > Python to create a new Python recipe. The recipe can be created with no input and with one output dataset (or more, if you want several output datasets).

Once your code recipe is created, it will contain sample Python code, which should end with something like this:

# Write recipe outputs
outputDataset = dataiku.Dataset("outputDataset")
outputDataset.write_with_schema(outputDataset_df)

You will need to assign the pandas DataFrame produced by your script to outputDataset_df.
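To make this more concrete, here is a minimal sketch of what a recipe with no input might look like. The helper functions, the sample rows, and the column names are all hypothetical stand-ins for your own collection code. Only the last three calls (pd.DataFrame, dataiku.Dataset and write_with_schema) mirror the generated sample above. The dataiku import is guarded so the collection part can also be tried outside of DSS.

```python
import csv
import io

try:
    import dataiku  # only available inside a Dataiku DSS code environment
except ImportError:
    dataiku = None  # lets the collection logic run outside of DSS


def collect_local_rows(csv_text):
    """Parse CSV text into a list of dicts (stand-in for reading a local file)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def collect_scraped_rows():
    """Hypothetical placeholder for the web-scraping step of the script."""
    return [{"source": "web", "value": "3"}]


def build_records():
    """Combine every source into one list of rows with the same columns."""
    local_rows = collect_local_rows("source,value\nfile,1\nfile,2\n")
    return local_rows + collect_scraped_rows()


records = build_records()

if dataiku is not None:
    import pandas as pd

    # Build the DataFrame and write it to the recipe's output dataset.
    # "outputDataset" is assumed to match the output name chosen when
    # the recipe was created.
    outputDataset_df = pd.DataFrame(records)
    outputDataset = dataiku.Dataset("outputDataset")
    outputDataset.write_with_schema(outputDataset_df)
```

In your case, build_records would be replaced by the existing code that produces your DataFrame, and only the final write to the output dataset needs to be added.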

If you need more information on the Python API, please have a look at the Dataiku documentation, in particular the section on how to write the output schema.

Please let me know if that works for you.

Best regards,

Paul



2 Replies
gto
Level 1
Author

Thank you!
