Decision Tree and Random Forest Model Settings and Optimization

saatcsi
Level 2

I am in a course that is teaching Dataiku as an add-on curriculum feature. I would like to know more about techniques to improve model performance for a classification problem using decision trees and random forest. 

Also, I see that we are only able to see the test results when running the models. Is there a way to see both the training and test results to compare and evaluate if the model is overfitting at the training stage? 

Best,

Saa


Operating system used: macOS (Apple M1)

AdrienL
Dataiker

Hi,

In the latest DSS 12.4 release, we added Learning curves, which let you compare test vs. train set metrics as the model is trained on growing portions of the train set.

You can also export the model's train & test sets: from a model's result page, under Actions (button on the top-right corner) > Export train & test sets, to run further tests should you need to.
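If you go the export route, the overfitting check itself is just a comparison of the same metric on both sets. A minimal sketch with scikit-learn (synthetic data stands in for the exported train & test sets; the data and parameters below are illustrative, not from your project):

```python
# Overfitting check: compare the same metric (AUC here) on train and test sets.
# Synthetic data stands in for the exported train & test sets; in DSS you would
# load them from the exported dataset instead.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AUC: {auc_train:.3f}  test AUC: {auc_test:.3f}")
# A large gap (train near 1.00, test well below) is the classic overfitting signal.
```

With an unrestricted random forest the train AUC is typically near 1.0, so the test AUC (and the size of the gap) is the number to watch.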

saatcsi
Level 2
Author

Thanks for your reply, AdrienL. 

I attached a screenshot of what I'm working on.

I've exported the train/test set, and that's not exactly what I am looking for. I am searching for the results on both the training and the test sets, so I can compare the confusion matrices, look at the trees, see if the training is overfitted, etc. 

Also, there are two screenshots of my algorithm parameters. I am trying to optimize the model's performance and am at a point where I'm dancing around the same results: decision tree, AUC 0.92; random forest, AUC 0.93.
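For context, this kind of parameter tuning can be sketched outside Dataiku with scikit-learn's grid search (synthetic data; the grid values below are illustrative, not my actual settings):

```python
# Grid search over random-forest hyperparameters: one common way to probe
# whether a performance plateau can be broken by different settings.
# Data and parameter grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_depth": [5, 10, None],
        "min_samples_leaf": [1, 5],
    },
    scoring="roc_auc",  # same metric as in the DSS results screen
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Dataiku's visual ML exposes a similar grid over these hyperparameters in the algorithm settings, which is what the screenshots show.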

Thanks again. 

Saa

AdrienL
Dataiker

I've exported the train/test set and that's not exactly what I am looking for. I am searching for the results on the training and the testing.

Yes, I'm just mentioning these as a more flexible starting point. With those you could, for instance, run an evaluation recipe on a model (you first need to deploy said model to the flow), or use code to check specific metrics.

You should also check out the learning curves, if you can install DSS 12.4.
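To illustrate what the learning curves compute, here is a rough scikit-learn equivalent (synthetic data; the model and the grid of training sizes are illustrative): train vs. validation AUC as the model is trained on growing portions of the data.

```python
# Rough equivalent of DSS's Learning curves: train vs. validation AUC
# as the model is fitted on growing portions of the train set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% ... 100% of the train set
    cv=5,
    scoring="roc_auc",
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train AUC={tr:.3f}  val AUC={va:.3f}")
# Curves that stay far apart as n grows point to overfitting;
# curves that converge but plateau at a low score point to underfitting.
```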

 

saatcsi
Level 2
Author

Thanks again, Adrien!

We haven't covered using the evaluation recipe nor deploying the model(s) to the flow. I'm assuming it is as easy as selecting the deploy button for each of the models I'd like to evaluate, then connecting them to the exported dataset. 

So:
1. Export the dataset after running the training.

2. Deploy the models that I'd like to evaluate

3. Connect the models I'd like to evaluate to the exported training sets

4. Open the evaluation recipe? 

Please excuse my ignorance. We were shown the basics of cleaning up the data and setting up and running the algorithm, but that's about as far as the instructions went. 

I'm assuming in the evaluation recipe you can then see the results of the training and testing. But, can you make adjustments from there? 

Best,

Saa

P.S. I am on the free version and have to locate where I can see what version I am on. I updated about a week or two ago so I am assuming I'm on the latest version. Green as a leprechaun dancing on grass in the middle of spring eating a salad. 

AdrienL
Dataiker
  1. Export the datasets. This creates one dataset with a row_origin column saying whether each row comes from the test or the train set.
    1. You can use a filter/sample recipe to keep only the train set (if you only want this one, as the test set results are the ones you already see in the model's results).
    2. Or, if you really want both, you can use a split recipe to split the dataset in two: train vs. test.
  2. Deploy the model to the flow (that's from the model's Actions menu on the top right).
  3. Create an evaluation recipe, with the train set and the model as inputs (this recipe can take 2 inputs) and a model evaluation store as output.
  4. Run the recipe and explore the output evaluation.
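Step 1's filter/split boils down to partitioning on the row_origin column. A minimal pandas sketch (the tiny DataFrame stands in for the exported dataset; in DSS you could also load it with `dataiku.Dataset(...).get_dataframe()` in a code recipe):

```python
# Sketch of step 1: partition the exported dataset on its row_origin column.
# The DataFrame below is a stand-in for the real exported dataset.
import pandas as pd

df = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4],
    "target": [0, 1, 0, 1],
    "row_origin": ["train", "train", "test", "test"],
})

train_df = df[df["row_origin"] == "train"]  # what the filter recipe would keep
test_df = df[df["row_origin"] == "test"]    # the split recipe gives you both

print(len(train_df), len(test_df))  # → 2 2
```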
saatcsi
Level 2
Author

I'm assuming it should look like it does in the attached screenshot. 

So, for each training session I need to export the generated dataset. 

Then, create a split recipe so I can split the training and test sets using the row_origin column. 

Deploy the model to the flow. 

Create the evaluation recipe for each. 

Is there a way to see both results so I can compare them? 

If I export several sessions can I see all of the results of each session in a comparison? 

I am almost there. I am so grateful for you doing this. I like to learn. 

Is there a resource that I can tap into for best practices? 

Thanks,

Saa

AdrienL
Dataiker

You can select models from multiple sessions and click Compare to create a new model comparison. You can also add model evaluations to this comparison (such as the one you did with the train set).

Indeed, there are two areas with a lot to learn:

  • Machine learning practices
  • Dataiku's software and how it helps with ML / MLOps

Dataiku's Learning center offers guided paths, some of which are about ML & MLOps.

saatcsi
Level 2
Author

Hello AdrienL

I am extremely grateful for your help. Yesterday I completed the Intro to ML course because it was helpful for the project I am currently working on. I also did a few of the core design modules and am going to finish those up today. I intend to go through most, if not all, of the modules. 

Already, my mind is spinning with all of the possibilities. 

 

If I have any other questions I'll reach out. 

Thank you!

Saa

saatcsi
Level 2
Author

I am starting to get the strong feeling that we haven't been shown the full process of training a decision system and evaluating the results. I am guessing that what I've shared with you in the screenshots is just the preliminary results from the training/test, and that going deeper and retraining involves plenty more steps. 

saatcsi
Level 2
Author

I would like to learn more about the correct pipelines and workflows. Please direct me to tutorials, or to someone who would be willing to spend an hour or two going over the software. I am a fast learner. 
