Fields in Scoring Dataset that weren't in Training Dataset

ccecil
Level 3
Hi there, 

Two questions: 

1) I'm receiving the error message below and I'm wondering whether the case of the field names is causing it. I have 'Number of Rooms' in my training set and 'NUMBER_OF_ROOMS' in my scoring set.

- Will I need to go into my scoring set and match the case of the field names with the training set?

An invalid argument has been encountered : in act.score_Model_Score_NP: Cannot apply the model with the output of preparation on this input (Missing column: Number_of_Rooms)

2) I have some extra fields in my scoring set that were not in my training set. Is there a way for my model to ignore those additional fields when scoring?

 

Thank you in advance, I really appreciate it.


Operating system used: Windows


AlexT
Dataiker

Hi @ccecil ,
You need to either match the schema of the scoring dataset to that of the training dataset, or use a preparation script in the visual analysis of the model you've trained.
For example, add a rename column step that renames "Number of Rooms" to "Number_of_Rooms". You may get a warning if the column doesn't exist in the training dataset.

The same script steps are applied when the scoring recipe runs, which should avoid the error you see.
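
To work out which rename steps are needed, it can help to compare the two column lists first. The following is a minimal sketch in plain Python, not part of Dataiku: `normalize` and `build_rename_map` are hypothetical helpers, and the `Price` and `Listing_ID` columns are made up for illustration. It assumes that names differing only in case or in spaces versus underscores refer to the same field.

```python
def normalize(name):
    """Return a key that ignores case and treats spaces like underscores."""
    return name.strip().lower().replace(" ", "_")


def build_rename_map(training_columns, scoring_columns):
    """Compare scoring column names against the training schema.

    Returns (rename, extras, missing):
      rename  - scoring name -> training name, for names that differ only in form
      extras  - scoring columns with no counterpart in the training schema
      missing - training columns that no scoring column matches
    """
    wanted = {normalize(c): c for c in training_columns}
    rename, extras = {}, []
    for col in scoring_columns:
        target = wanted.pop(normalize(col), None)
        if target is None:
            extras.append(col)
        elif target != col:
            rename[col] = target
    missing = list(wanted.values())
    return rename, extras, missing


# Hypothetical schemas for illustration only.
training = ["Number_of_Bedrooms", "Price"]
scoring = ["NUMBER_OF_BEDROOMS", "Price", "Listing_ID"]
rename, extras, missing = build_rename_map(training, scoring)
print("rename steps needed:", rename)
print("extra scoring columns:", extras)
print("training columns still missing:", missing)
```

Each entry in `rename` corresponds to one rename column step in the preparation script. Anything listed in `missing` would still trigger the "Missing column" error.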

Thanks

ccecil
Level 3
Author

Hi @AlexT ,

I made a small typo in my original question, the field in my training dataset is 'Number_of_Bedrooms' and in my scoring dataset it is 'NUMBER_OF_BEDROOMS'.

Does the same solution still apply?

AlexT
Dataiker

Hi,

That would still apply. Rename any columns you expect to have in your scoring dataset. 
Thanks
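
If the scoring data is easier to fix before it reaches DSS, the same renaming can be done upstream. This is a rough sketch in plain Python under that assumption: `align_rows` is a hypothetical helper, and `Price` and `Listing_ID` are made-up columns. It renames keys to the training names and drops any field the training schema does not contain.

```python
def align_rows(rows, rename, training_columns):
    """Rename keys per `rename`, then keep only columns in the training schema."""
    keep = set(training_columns)
    aligned = []
    for row in rows:
        renamed = {rename.get(key, key): value for key, value in row.items()}
        aligned.append({k: v for k, v in renamed.items() if k in keep})
    return aligned


# Hypothetical data for illustration only.
training_columns = ["Number_of_Bedrooms", "Price"]
rename = {"NUMBER_OF_BEDROOMS": "Number_of_Bedrooms"}
scoring_rows = [
    {"NUMBER_OF_BEDROOMS": 3, "Price": 250000, "Listing_ID": "A17"},
]
print(align_rows(scoring_rows, rename, training_columns))
```

Dropping the extra fields here is a choice made by this sketch, not a statement about how the scoring recipe treats them.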

ccecil
Level 3
Author

Okay. 

I did that and then hit Deploy Script, which produced a new dataset with my renamed columns. If I use that new dataset to train my model, it throws an error saying that one of the columns I renamed is now empty. Did I go wrong somewhere?

 

@AlexT 

AlexT
Dataiker

If you deploy the script, it creates a prepare recipe. You would then need to change the input of the newly created recipe to the dataset you are scoring.

https://knowledge.dataiku.com/latest/data-preparation/lab-visual-analyses/tutorial-lab.html

 
