An easier Conundrum for you this week - but with a twist! For this one you will need the handy plugin ‘Zipcode geocoding’ which can be downloaded from the Dataiku plugin store. If you need any help downloading plugins check out our handy reference documentation for the details!
Now that you have the plugin installed, take a look at the attached dataset with the locations of various Brewpubs and Microbreweries around the United States. Using the new ‘zipcode geolocate’ prepare step added by the plugin, can you represent the location of these establishments around the US on a map?
The clearer it is which type is which, and the clearer the relative density of locations, the better.
Thank brew very much for posting this... made me very hoppy 🍺. To create this chart, I installed the plugin, combined the two microbrewery columns, and removed the closed types and any with very small numbers. This left me with brewpub, microbrewery, contract brewery, and regional brewery. I then created a type_count column using the Window recipe, and finally many maps, of which this is just one.
So it looks like a bit of a data cleanup problem you have set for us.
With a 5 line preparation recipe, I can get a zipcode for all but 2% of the records (n=56). And of those zipcodes, I can use the zipcode geolocate plugin to locate all but 75 of these addresses.
I know that there is at least one more address for which I have not dug out the zipcode. It might take as many as 3 more prepare recipe steps to recover just that one address. (And I'm not clear how generalizable those steps are going to be.)
Can anyone get a higher success rate with the same or fewer steps? And what approach are you using?
P.S. If I'm going to do any more refinement, I might have to dig out a regular expression.
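For anyone curious what that regular expression could look like, here is a minimal Python sketch. It is not the prepare recipe used above. It assumes a US zip code is a five-digit token (optionally followed by a ZIP+4 suffix) and that the zip code is the last such token in the address. The function name and the sample addresses are made up for illustration.

```python
import re

# Assumption: five digits, optionally followed by "-" and four more digits.
ZIP_PATTERN = re.compile(r"\b(\d{5})(?:-\d{4})?\b")

def extract_zip(address):
    """Return the last five-digit ZIP-like token in the address, or None."""
    matches = ZIP_PATTERN.findall(address or "")
    # Take the last match so a five-digit street number is not mistaken for the zip.
    return matches[-1] if matches else None

# Made-up addresses, for illustration only
print(extract_zip("12345 Example Rd, Springfield, IL 62701"))
print(extract_zip("456 Sample Ave, Amherst, MA 01002-1234"))
print(extract_zip("no postal code here"))
```

Keeping the result as text (not a number) matters here, since a zip code such as 01002 would otherwise lose its leading zero.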
Here are two zoom levels of my final visualization:
This is the continental US. Note that red dots are brewpubs and blue dots are microbreweries. The size of the dots reflects the number of either type within a single zip code. I had not realized that Colorado was so into its crafty beers.
Not shown here are the Alaskan and Hawaiian locations in the dataset.
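The dot sizes come down to a count per zip code and brewery type. In Dataiku this can be done with visual recipes, but as a rough illustration of the same idea, here is a small Python sketch using only the standard library. The rows and the function name are made up and are not taken from the dataset.

```python
from collections import Counter

# Made-up sample rows: (zip code as text, brewery type)
rows = [
    ("80202", "Brewpub"),
    ("80202", "Brewpub"),
    ("80202", "Microbrewery"),
    ("08901", "Brewpub"),
]

def count_by_zip_and_type(records):
    """Count how many establishments of each type fall in each zip code."""
    return Counter((zip_code, brewery_type) for zip_code, brewery_type in records)

counts = count_by_zip_and_type(rows)
for (zip_code, brewery_type), n in sorted(counts.items()):
    print(zip_code, brewery_type, n)
```

Each count would then drive the size of one dot, with the brewery type driving its color.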
Here is a somewhat zoomed-in view of the northeastern United States. The list was fairly good. It even contained the brewpub in New Brunswick, NJ.
Note that, due to data issues in the original data file, I was not able to plot 75 of the records in the original dataset. With some more work, I suspect that someone might be able to do a little better. Let me know how it goes.
Mine is a little more straightforward, since I decided not to use the default Zip codes provided. Instead, I created a new Zip code column from the Address column.
- In the preparation recipe, I used two methods to create the new column and named it "Zip".
- The result wasn't that bad, as Dataiku was able to retrieve all of the Zip codes from the column, except for 22 of them and one empty record. That's less than 1% (about 0.009 as a fraction) of the total records in the data frame. Awesome job, Dataiku!
- And here are the visualizations for some of the states from the data frame.
- US Map
- New York
- New Orleans
Nice map. Thanks for sharing!
Your images have left me with a few questions.
What do the color and size of the dots mean in your images?
- For this particular visualization, it's just the total number of breweries for each state in the US. As you may have guessed, the bigger the bubble, the greater the total number of breweries, and vice versa. But don't pay much attention to the color scheme, as the colors come from the Zip code, which is being treated as an integer (total?), not as text (as this is required by the plugin).
- And this particular visualization merely shows the brewery locations, color-coded by brewery type.
Were you able to geocode the 0#### and 00### style zip codes in the northeastern USA? The zip codes in my area are kind of hard to deal with because the 0s often get dropped when a zip code gets converted to an integer.
I see your point; I noticed them too in the following screen capture. Although Dataiku's formula recipe was able to generate most of the Zip codes from the address column (as long as they're not empty, as indicated by its green bar), the plugin somehow didn't manage to catch them all, as indicated by the following:
- 1002, which is an actual Zip Code found here.
- 1007, which is an actual Zip Code found here.
And so forth. That would be doable to fix, if the current conundrum didn't limit you to fewer than 5 preparation steps to generate the whole visualization 😏. Perhaps this processor might do the trick.
Did you end up using the geocode plugin? I did not see this in your prepare recipe. Or did you find some other cool way to parse the zip codes and get them into your map?
Yeah, I did use the plugin, as indicated by my response above. One of the crucial aspects of data mining that Dataiku doesn't yet provide by default is probably scraping. I guess you could do that through the built-in Jupyter Notebook in Dataiku, by first installing the "beautifulsoup" library through pip, but that's another story for our next Conundrum 😉.
Looking at the colors in the example you are showing: as you say, it looks like the color is the sum of the zip codes, with over 26 million for the number. The use of this value for the color is a little unexpected to me. I'm wondering about alternatives.
What is being required by the plugin? (My understanding is that the plugin needs a text value of postal code and another string for the country.)
The way I dealt with the northeastern US was to make sure that the schema treats the column as a string. Then the formula is a set of if statements that put these 0s back into place based on the length of the postal code: if the length is 4, I'm prepending one "0"; if the length is 3, I'm prepending two "0"s.
It is my understanding that there is no limit on the number of prepare recipe steps that you can use. I just said that I was able to do this in 5.
For my visualization, I did use the geocoding plugin. My flow looked like this.
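The same zero-padding idea can be sketched in a few lines of Python. This is not the Dataiku formula itself, just an equivalent illustration; the function name `pad_zip` is made up, and it assumes every valid zip code should end up as five digits of text.

```python
def pad_zip(value):
    """Restore leading zeros on a zip code that was stored as a number."""
    text = str(value).strip()
    # Only pad plain digit strings shorter than five characters;
    # anything else (empty, ZIP+4, non-numeric) is returned unchanged.
    if text.isdigit() and len(text) < 5:
        return text.zfill(5)
    return text

print(pad_zip(1002))     # 01002
print(pad_zip("501"))    # 00501
print(pad_zip("62701"))  # 62701
```

Here `str.zfill` replaces the chain of if statements, since it pads to a fixed width regardless of how many zeros were dropped.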
I think that you could do this without resorting to a coding recipe that uses beautifulsoup.
Actually, Tom, I like the idea of having a limit on the number of preparation steps better. That way, your creativity can really kick in, in contrast with having no limit, where you have the luxury of using as many steps as you need to generate the required visualization. By raising the bar, we find ourselves somewhat more creative, carefully picking the recipes and not hitting the limit while still achieving our objectives.
So I think our possibilities for overcoming the previous obstacles are endless when we don't need to pay attention to a limit. Well, that's just my two cents. Anyway, I like your version of the visualization better, since it conveys more information with both the dots and the color scheme.
Catch you at next week's conundrum, Tom.