For our tenth Conundrum we would like you to take a crack at a classic modeling problem: The Titanic.
The Titanic presents a fascinating dataset to work with as we have some good information about various characteristics of those aboard. Combine this with knowing if each passenger would be a member of the lucky 32% who survived the wreck and we have a perfect puzzle!
Can you use the attached data to build a model that identifies the three most important factors in a passenger's survival chances?
Good luck and, as always, feel free to share results and discuss your models below.
Here is a key to make the data a little more readable for you all:
PassengerId - ID
Survived - The target (1 = survived, 0 = died)
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name of the passenger
Sex - Sex of the passenger
Age - Age of the passenger
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare
Cabin - Cabin number
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
This, of course, would vary depending on the model I build.
The feature in this data set that is most valuable for me is the Name, from which I can derive a number of other features. For example, I can pull the Name_Title (Mr., Mrs., Miss, Master) out of the Name. In fact, in the models I'm playing with right now, the existence of "Mr." in the Name field is typically the most useful feature.
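As a rough illustration of the title idea, here is a minimal sketch using only Python's standard library. It assumes the Name values follow the "Surname, Title. Given names" pattern; the function name `extract_title` and the `is_mr` flag are my own illustrative choices, not names from the models described above.

```python
import re

# Assumes names look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")

def extract_title(name):
    """Return the text between the first comma and the next period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
]

titles = [extract_title(n) for n in sample_names]
# Hypothetical boolean feature: does the title equal "Mr"?
is_mr = [t == "Mr" for t in titles]
```

The same function could be applied to a whole Name column, for instance inside a Python recipe or a custom preparation step.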
I built some models using feature reduction and came up with these features.
Another interesting feature is the absence of a name in parentheses "( )" in the Name field. The "Name2_Found is False" feature above gets at this aspect. I believe the presence or absence of such values encodes something about social structure. The existence of a nickname in the Name also appears to have predictive power: someone cared enough to get the preferred nickname onto the manifest.
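A minimal sketch of how these two flags might be computed, assuming the second name appears in parentheses and a nickname appears in double quotes. The key `Name2_Found` mirrors the feature mentioned above; `Nickname_Found` is a hypothetical name I made up for illustration.

```python
import re

PAREN_NAME = re.compile(r"\([^)]+\)")   # e.g. "(Florence Briggs Thayer)"
QUOTED_NICKNAME = re.compile(r'"[^"]+"')  # e.g. '"Annie"'

def name_flags(name):
    """Return boolean flags for a parenthesised name and a quoted nickname."""
    return {
        "Name2_Found": bool(PAREN_NAME.search(name)),
        "Nickname_Found": bool(QUOTED_NICKNAME.search(name)),
    }

flags_a = name_flags("Cumings, Mrs. John Bradley (Florence Briggs Thayer)")
flags_b = name_flags('McGowan, Miss. Anna "Annie"')
flags_c = name_flags("Braund, Mr. Owen Harris")
```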
I could derive gender from the title and first name, so I might not need the Sex column.
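To make that concrete, here is a small sketch of a title-to-sex lookup. The mapping is my own assumption and is deliberately incomplete: titles such as "Dr" or "Rev" do not settle the question on their own, which is where the first name would have to come in.

```python
# Assumed mapping; only titles that unambiguously imply a sex are listed.
TITLE_TO_SEX = {
    "Mr": "male",
    "Master": "male",
    "Mrs": "female",
    "Miss": "female",
    "Ms": "female",
}

def sex_from_title(title):
    """Return 'male' or 'female' when the title decides it, otherwise None."""
    return TITLE_TO_SEX.get(title)

guesses = [sex_from_title(t) for t in ["Mr", "Miss", "Dr"]]
```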
However, passenger class (Pclass) or Fare is then important in this data.
Currently, as a model builder, I would likely take the following features
What do others think?
Tom, I like your approach! For fun, and just to see what would happen, I removed every row with an empty Cabin or Age value, which left me with only 185 of the original 891 rows. I ran a correlation test to see whether anything would pop up as highly correlated (nothing did). Then, in the model design, I told DSS to go ahead and generate pairwise features. What I ended up with is a Random Forest model with an ROC AUC of 0.943, showing as top variables: Sex is Male; Sex is Female; Age - SibSp; and Age * Parch. I don't think my model is very good. The sizes of the train and test sets are below, and they are so small that my ROC is not even a curve:
| Train set rows | 149 |
| Test set rows | 36 |
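A sketch of why a small test set gives a staircase rather than a curve: an ROC curve gets at most one new point per distinct score, so 36 test rows can produce no more than 37 points. The code below is a plain standard-library illustration with made-up labels and scores, not the computation DSS performs.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points,
    one per distinct score threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up example: 6 rows give at most 7 points.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_points = roc_points(toy_labels, toy_scores)
toy_auc = auc(toy_points)
```

With so few points, one or two rows changing rank can move the area noticeably, which is one reason to treat a high score on a tiny test set with caution.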
From an explainability point of view, what do you conceptually make of these derived features?
I usually like to at least try to explain to myself the idea behind any derived feature. What, if anything, are you making of these?
On its face, the ROC number of .943 looks fairly impressive, but I agree that, given the final sizes of your training and test sets (once you remove the rows with missing Age and Cabin), it's not clear to me that your model is in any way representative of the population of folks on the Titanic.
In the past, when looking at this dataset, I've tended to think of empty Cabin values as either "missing" or as meaning that folks did not have a cabin. However, when I looked this up at https://www.encyclopedia-titanica.org/cabins.html, I noted that:
"The allocation of cabins on the Titanic is a source of continuing interest and endless speculation. Apart from the recollections of survivors and a few tickets and boarding cards, the only authoritative source of cabin data is the incomplete first-class passenger list recovered with the body of steward Herbert Cave. The list below includes this data and includes the likely occupants of some other cabins determined by other means.
The difficulty in determining, with any degree of accuracy, the occupancy of cabins on the Titanic indicates the need for further research in this area."
Because of this historical situation, the presence of a Cabin value apparently just happens to correlate with folks who were in first class and on this partially recovered list. This may mean that the Cabin data is, to some extent, a duplication of the Pclass field. I know this is beyond the typical rules of the conundrums here in the Dataiku community, but I've been wondering about including this additional data table. Would Cabin become significantly more predictive (if maybe a somewhat noisy feature)? For now, in the models I'm playing with, I'm using the presence or absence of a value. I believe there might also be some value in pulling the first letter off the cabin numbers, because it has something to do with a location on the boat.
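Both Cabin ideas (a presence flag and the leading letter) can be sketched in a few lines. This assumes an empty string or None marks a missing Cabin, and that the first character of a non-empty value is the letter of interest; `cabin_features` is a hypothetical helper name.

```python
def cabin_features(cabin):
    """Return (has_cabin, first_letter) for one Cabin value.

    Missing values (None or blank strings) yield (False, None).
    """
    value = (cabin or "").strip()
    has_cabin = bool(value)
    first_letter = value[0].upper() if has_cabin else None
    return has_cabin, first_letter

examples = [cabin_features(c) for c in ["C85", "B57 B59 B63 B66", "", None]]
```

Values listing several cabins, such as "B57 B59 B63 B66", only contribute their first letter here; whether that is the right call is a modeling choice.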
@taraku where would you take your model?
What do others think?
Hi Tom! Endless speculation indeed! This dataset is great fun! Just guessing about the derived features, but perhaps having parents and siblings on board meant that a person was more likely to survive? That might be my next hypothesis to test. I agree Cabin is an area needing further research. Happy modeling!
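One simple way to start on that hypothesis would be to compare survival rates for passengers with and without family aboard. The sketch below assumes each row is a dict with SibSp, Parch and Survived keys; the toy rows are invented placeholders to show the shape of the input, not values from the dataset, and they say nothing about the real answer.

```python
def survival_rate_by_family(rows):
    """Split rows on whether SibSp + Parch > 0 and return the survival
    rate for each group (None if a group is empty)."""
    totals = {True: [0, 0], False: [0, 0]}  # has_family -> [survivors, count]
    for row in rows:
        has_family = (row["SibSp"] + row["Parch"]) > 0
        totals[has_family][0] += row["Survived"]
        totals[has_family][1] += 1
    return {
        group: (survivors / count if count else None)
        for group, (survivors, count) in totals.items()
    }

# Invented placeholder rows, only to demonstrate the call.
toy_rows = [
    {"SibSp": 1, "Parch": 0, "Survived": 1},
    {"SibSp": 0, "Parch": 2, "Survived": 0},
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 0, "Parch": 0, "Survived": 1},
]
toy_rates = survival_rate_by_family(toy_rows)
```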