How to (un)fold data for prediction? Specific case
Hello,
I'm a Dataiku and ML beginner, so please excuse my (maybe) simple question.
I have a dataset with data on internet companies. Originally it came with ">" and "," separated info on target markets (column: markets). There are some extra columns to the right, e.g. number of employees, financing, etc.
My goal is to create a model with "activity" as the target variable (it has 3 values: operating, acquired and non-operating). For example, I would like to identify the markets where companies are most likely to "survive", or the most dangerous ones (those leading to "non-operating").
My original file had 1 record per company (approx. 1,000 companies), with only one "markets" column. I started by splitting it, first with ">" and then with "," as separators. Finally (after some cleaning and merging) I got a dataset with many records per company, as displayed below, with distinct "market__" features.
My questions:
1. Is it OK for an ML model to keep the data on a single company in the form of many records (see picture below)?
2. Is there any other data preparation procedure (folding, splitting, transformation, etc.) you would recommend?
I would greatly appreciate your help,
Many thanks in advance,
Andy
Answers
-
Hello,
Applying machine learning requires understanding the business problem behind your prediction task, so you need to adapt your methodology to the problem at hand. In your specific case, I would recommend clarifying what your goal is:
1. Predicting whether companies will be acquired in the future, or will still be operating? In this case, you need to define a time window and shift your target accordingly. You may have to aggregate data to reduce the number of observations per company.
2. Attributing a current status to companies for which you have business information, but for which you do not know whether they are acquired, operating or non-operating?
If you are in the VC industry, I guess goal 1 could be of interest to you. In this case, be careful with how you handle your temporal features.
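As a rough illustration of the aggregation step mentioned above, here is a minimal pandas sketch that collapses several rows per company into a single row with one 0/1 indicator per market. The column names ("name", "market", "activity") and the sample values are hypothetical, since I cannot see your actual schema; it also assumes the target is the same on every row of a given company.

```python
import pandas as pd

# Hypothetical long-format data: several rows per company, one market per row.
long_df = pd.DataFrame({
    "name": ["Acme", "Acme", "Beta", "Gamma", "Gamma"],
    "market": ["software", "analytics", "e-commerce", "software", "fashion"],
    "activity": ["operating", "operating", "acquired",
                 "non-operating", "non-operating"],
})

# Count company/market pairs, then cap at 1 to get 0/1 indicator columns.
indicators = (
    pd.crosstab(long_df["name"], long_df["market"])
    .clip(upper=1)
    .add_prefix("market__")
)

# Assumption: the target is constant within a company, so the first value is enough.
target = long_df.groupby("name")["activity"].first()

# One row per company: identifier, market indicators, target.
wide_df = indicators.join(target).reset_index()
print(wide_df)
```

The result has one row per company, so each company carries the same weight in training regardless of how many markets it is listed in.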
Good luck with this interesting project!
-
Dear Alexandre,
Thank you very much for the answer. In my case I have no temporal data, so my goal is 2 (attributing a current status): given some characteristics of a company, I would like to predict its most probable status.
My main concern is data preparation for that scenario. In the original file I had 1 line per company, with all its characteristics in the "markets" column (those ">" and "," separated values). To prepare the data I applied several splits and ended up with the state shown in the picture.
My question, as an ML beginner: is it OK (for modelling) to have multiple records per entity? Especially since in the model options I choose only the target variable (in my case: "activity") and have no option for an "entity" variable (for me, the Id or name). Or would some other data preparation method be more appropriate?
I would really appreciate your advice,
Many thanks in advance
Andy
-
It is OK to have multiple lines in your training set for a given entity, but avoid using an identifier column as a feature in your model. It could also be helpful to compute derived features based on the previous history of the given entity, provided you are able to define a notion of temporal order.
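To make the point about identifier columns concrete, here is a small pandas sketch of keeping the identifier out of the feature set. The column names ("id", "name", "employees", "market__software") and the values are made up for illustration; only "activity" comes from your description.

```python
import pandas as pd

# Hypothetical training data with several rows for the same company.
df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "name": ["Acme", "Acme", "Beta", "Gamma"],
    "employees": [50, 50, 200, 12],
    "market__software": [1, 0, 0, 1],
    "activity": ["operating", "operating", "acquired", "non-operating"],
})

ID_COLUMNS = ["id", "name"]   # identify the entity, carry no predictive signal
TARGET = "activity"

# Features are everything except the identifiers and the target.
feature_columns = [c for c in df.columns if c not in ID_COLUMNS + [TARGET]]

X = df[feature_columns]
y = df[TARGET]
print(feature_columns)
```

A model given the identifier as a feature can simply memorise which company has which status, which looks good on the training data but does not help on companies it has never seen.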
-
Thank you