Improve performance on new recipe modal dataset selector

The performance of the new recipe modal currently appears to scale with the number of datasets in a project. The slow method seems to be listDatasetsUsabilityInAndOut, and the underlying slow function seems to be copyRecurse. In a small project with 386 datasets, it took more than 11 seconds to execute. In a medium project with 1,231 datasets, it took 65 seconds. In a large project with 33,119 datasets, it runs for more than ten minutes before running out of memory.
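To put a rough number on how the load time grows, here is a small back-of-the-envelope calculation using only the two timings above. It assumes the time follows a simple power law in the number of datasets, which is my assumption and not something I have verified, and the 11-second figure is a lower bound, so treat the result as indicative only.

```python
import math

# Measurements from this post: (dataset count, seconds to open the modal).
# The small-project timing is a lower bound ("more than 11 seconds"),
# so the estimate below is only indicative.
small = (386, 11.0)
medium = (1231, 65.0)


def scaling_exponent(a, b):
    """Estimate k in time ~ n**k from two (n, seconds) measurements."""
    (n1, t1), (n2, t2) = a, b
    return math.log(t2 / t1) / math.log(n2 / n1)


exponent = scaling_exponent(small, medium)
print(f"estimated exponent: {exponent:.2f}")
```

The exponent comes out at roughly 1.5, so under that assumption the load time grows faster than linearly with the dataset count, which fits what I see on the largest project.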

I measured these times in Firefox 78.2.0 on Windows 1909 against Dataiku 7.0.2 on a Dell Precision 7520. I measured them by selecting two datasets and clicking the join recipe button in the right-side flyout. Performance doesn't seem to change much across recipe types: any time a recipe needs to populate the dataset selector, I have a similar experience. I also have this issue the first time I load a flow.

Since a huge part of my workflow is creating new recipes, it would be great if the performance could be improved, and also if the results could be cached somehow and updated incrementally so that the full load only happens once. Even waiting just a few seconds each time I want to create a recipe breaks my mental flow, and on medium-sized projects, I usually end up waiting so long that I lose track of what I was doing or get distracted before the load completes. For large projects, it's much better to initialize recipes through the API and then navigate directly to the recipe's route than to wait for the modal to finish loading. On large projects where it doesn't run out of memory, I've waited more than 30 minutes for the modal to finish loading.
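To illustrate what I mean by caching with incremental updates, here is a minimal sketch. The class and method names (`DatasetListCache`, `dataset_created`, `dataset_deleted`) are hypothetical and are not part of Dataiku; I don't know how the real listing is implemented, so the expensive call is just passed in as a function.

```python
class DatasetListCache:
    """Hypothetical per-project cache: one full load, then small deltas."""

    def __init__(self, load_all):
        self._load_all = load_all  # expensive full listing, called at most once
        self._names = None         # None means "not loaded yet"
        self.full_loads = 0

    def names(self):
        """Return the sorted dataset names, doing the full load only once."""
        if self._names is None:
            self._names = set(self._load_all())
            self.full_loads += 1
        return sorted(self._names)

    def dataset_created(self, name):
        # Before the first load there is nothing to patch: the full load
        # will pick the new dataset up when it eventually runs.
        if self._names is not None:
            self._names.add(name)

    def dataset_deleted(self, name):
        if self._names is not None:
            self._names.discard(name)


# Example with made-up dataset names.
cache = DatasetListCache(lambda: ["orders", "customers"])
cache.names()                          # triggers the single full load
cache.dataset_created("orders_joined")
cache.dataset_deleted("customers")
print(cache.names(), cache.full_loads)
```

After the first call, creating or deleting a dataset only patches the cached set, so opening the modal again would not repeat the full listing.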

In general, I don't need a list of every dataset in this modal. I've usually already selected the datasets I want before I click the new recipe button, so 90% of the time I won't click the dataset selector at all, and ideally it wouldn't need to be initialized with anything but the datasets I pre-selected. But in the rare cases when I do use the dataset selector to search for a dataset, it would be nice if it worked like a server-side autocomplete search box, loading only 10-20 results into the UI component at a time as I search (and even better, with some enhanced context functionality, like letting me hover over each search result to see a preview of where in the flow that dataset is created and the dataset's full name).
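As a rough sketch of the server-side search I have in mind, something like the following would do. The function name `search_datasets`, its parameters, and the response shape are all hypothetical, not an existing Dataiku endpoint; the point is only that the browser receives one small page of results at a time.

```python
def search_datasets(all_names, query, limit=20, offset=0):
    """Return one page of names containing `query` (case-insensitive)."""
    needle = query.strip().lower()
    matches = sorted(n for n in all_names if needle in n.lower())
    page = matches[offset:offset + limit]
    return {
        "results": page,                # at most `limit` names
        "total": len(matches),          # how many names matched overall
        "has_more": offset + limit < len(matches),
    }


# Example with made-up dataset names.
names = [f"sales_{i:03d}" for i in range(50)] + ["customers"]
first_page = search_datasets(names, "SALES", limit=10)
print(first_page["results"][:3], first_page["total"], first_page["has_more"])
```

The selector would call this as I type and request the next page (a larger `offset`) only if I scroll past the first results.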

Another issue to consider is that dataset names usually overflow the width of the dataset selector's results, so the results typically can't be told apart anyway. While the left-to-right visual flow of this modal definitely keeps the input-output UX clear, it constrains the available space well below the amount needed to display useful information. Since the immediate next step after the recipe is created is to be routed to that recipe's view, maybe just taking me directly to that recipe's view, where I can then define the output dataset (and make changes to the input datasets if needed), would speed up my flow the most. In any case, with the current UI, if for some reason I do need to use the dataset selector in the recipe creation modal, I have to open another tab, find the specific dataset I want in the flow, and copy its exact name to paste back into the dataset selector, because the truncated results can't be distinguished.

I'd love to see a performance improvement on this! For most of my projects, a performance fix here would make the biggest contribution to overall project development speed.