How to (un)fold data for prediction? Specific case

wodecki Registered Posts: 3 ✭✭✭✭

Hello,

I'm a Dataiku and ML beginner, so please excuse what may be a simple question.

I have a dataset on internet companies. Originally, it came with target-market information separated by ">" and "," (column: markets). There are some extra columns to the right, e.g. number of employees, financing, etc.

My goal is to build a model with "activity" as the target variable (it has 3 values: operating, acquired and non-operating), e.g. to identify the markets where companies are most likely to "survive", or the most dangerous ones (leading to "non-operating").

My original file had 1 record per company (approx. 1,000 companies), with a single "markets" column. I started by splitting it, first on ">" and then on ",". Finally (after some cleaning and merging) I got a dataset with many records per company, as displayed below, with distinct "market__" features.

My questions:

1. Is it OK for an ML model to keep the data on a single company as many records (see picture below)?

2. Is there any other data preparation procedure (folding, splitting, transformation, etc.) you would recommend?

I would greatly appreciate your help,

Many thanks in advance,

Andy

Answers

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Hello,

    Applying Machine Learning requires understanding the business problem behind your prediction task, so you need to adapt your methodology to the problem at hand. In your specific case, I would recommend clarifying what your goal is:

    1. Detecting whether companies will be acquired in the future, or will still be operating? In this case, you need to define a time window and shift your target. You may have to aggregate data to reduce the number of observations per company.

    2. Attributing a current status to companies about which you have business information, but whose status (acquired/operating/not) you do not know?

    If you are in the VC industry, I guess goal 1 could be of interest to you. In this case, be careful in the way you handle your temporal features.
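The aggregation step mentioned under goal 1 (reducing the number of observations per company) could be sketched roughly as follows with pandas. The column names (`company`, `market`, `employees`, `activity`) and the sample values are made up for illustration, and this is only one possible way to fold the data back to one row per company, not a Dataiku-specific recipe.

```python
import pandas as pd

# Hypothetical long-format data: one row per (company, market) pair.
long_df = pd.DataFrame({
    "company": ["A", "A", "B", "C", "C", "C"],
    "market": ["games", "mobile", "games", "ads", "mobile", "games"],
    "employees": [10, 10, 250, 40, 40, 40],
    "activity": ["operating", "operating", "acquired",
                 "non-operating", "non-operating", "non-operating"],
})

# One 0/1 indicator column per market (1 if the company is tagged with it).
market_flags = (
    pd.crosstab(long_df["company"], long_df["market"])
    .clip(upper=1)
    .add_prefix("market_")
)

# Company-level attributes are repeated on every row, so keep the first value.
company_info = long_df.groupby("company")[["employees", "activity"]].first()

# Final table: one row per company, with the markets as indicator features.
wide_df = company_info.join(market_flags).reset_index()
print(wide_df)
```

With this layout each company contributes exactly one observation, and every market becomes its own binary feature.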

    Good luck with this interesting project!
  • wodecki
    wodecki Registered Posts: 3 ✭✭✭✭
    Dear Alexandre,

    Thank you very much for the answer. In my case I have no temporal data, so my goal is 2 (Attribute...): given some characteristics of a company, I would like to predict its most probable status.

    My main concern is data preparation for that scenario. In the original file I had 1 line per company, with all its characteristics in the "markets" column (these ">" and "," separated values). To prepare the data I used different splits and ended up with the state shown in the picture.

    My question, as an ML beginner: is it OK (for modelling) to have multiple records per entity? Especially since in the model options I choose only the target variable (in my case: "activity") and have no option for an "entity" variable (for me, Id or name). Or would some other data preparation method be more appropriate?

    I would really appreciate your advice,

    Many thanks in advance :)

    Andy
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    It is OK to have multiple lines in your training set for a given entity, but avoid using an identifier column as a feature in your model. It could also be helpful to compute derived features based on the previous history of the given entity, provided you are able to define a notion of temporal order.
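As a rough illustration of the advice above, the sketch below keeps several rows per company but leaves the identifier out of the feature set. It also holds out whole companies for testing; that is an extra precaution assumed here, not something stated in the thread, so that rows from the same company do not end up on both sides of the split. All column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with several rows per company.
df = pd.DataFrame({
    "company": ["A", "A", "B", "C", "C", "D", "D", "D"],
    "market": ["games", "mobile", "games", "ads", "mobile", "ads", "games", "mobile"],
    "employees": [10, 10, 250, 40, 40, 5, 5, 5],
    "activity": ["operating", "operating", "acquired", "non-operating",
                 "non-operating", "operating", "operating", "operating"],
})

# Pick whole companies for the test set (assumed precaution, see lead-in).
rng = np.random.default_rng(0)
companies = df["company"].unique()
n_test = max(1, len(companies) // 4)
test_companies = rng.choice(companies, size=n_test, replace=False)
is_test = df["company"].isin(test_companies)

# The identifier is used only for splitting, never as a model feature.
feature_cols = [c for c in df.columns if c not in ("company", "activity")]
X_train, y_train = df.loc[~is_test, feature_cols], df.loc[~is_test, "activity"]
X_test, y_test = df.loc[is_test, feature_cols], df.loc[is_test, "activity"]
```

The resulting `X_train` and `X_test` contain only `market` and `employees`, so a model trained on them cannot simply memorise company names.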