My name is Christy L. Wentz and I am an analyst at a liberal arts college in central NY. I am looking to connect with others who have experience using Dataiku. I would like to know how you are using higher ed data in this platform.
Welcome to the Dataiku community. I work in the nonprofit sector and have taught in higher ed in the past. On which side of the organization are you interested in using Dataiku DSS? On the academic teaching side, helping students build data science skills? Or more on the business side? Or on the fundraising side? Or on the marketing and student retention side?
In my work in the nonprofit sector, I've done things like churn modeling on memberships, some attendance forecasting in a museum context, and some audience member clustering. I would think that all of these and more would be useful in a college situation.
Tom asked a very good question. It depends on the level of your students and the applications you're interested in. I teach financial data analysis with Dataiku and have found that data cleaning/manipulation, classification, and textual analysis can be helpful. It was challenging for the finance students, so you may want to slow it down a bit.
I am on all sides, as I work for the President and all the VPs of the institution. I am currently gathering all the data for a 5-year retention analysis. Projects in the backlog are an evaluation of a student program in Career Services, a Student Success Model, and a Prospect Giving model. The Student Success Model is a huge project that will use data from common features such as demographics, GPAs, etc., but will also incorporate student experiences and other features that are only a name so far, as we have not defined them or determined how to measure them. I will also be working with a professor on a persistence project to predict at-risk students in STEM (what are the key features that affect persistence?) that will have effects across the entire college in any department or division. I'm new to analyzing data in this platform; most of my experience is in Python, R, SPSS, and Excel.
I plan to extend this to the classroom but that is a later project. I have been using the free version as I wait for IT to get the full version on our server.
I am very interested in Churn modeling, as I think it would be perfect for retention and persistence.
All of those sound like a whole lot of fun. 😀
Sounds like you might have a local "design" node on your personal computer already. I've done the same. One of the things that I've done is to add a PostgreSQL database to my computer as well. (I just turn it on when I'm actively using DSS.) This gives me a somewhat more robust data store, and also allows me to manipulate the data with SQL directly if I need to do so.
From your descriptions, it sounds like you will be creating a number of ETL projects to prepare "master data" on several subjects, likely gathered from a number of data systems, and then creating a series of analytic projects.
Once you have your production nodes in place, you will also have security in place to grant access to different datasets, which should be of some help in an academic environment.
The web UI will also be helpful in sharing access to insights and granting access to other researchers to the data and environment.
You will find that DSS nicely supports coding in Python and R. Typically I do this from within Jupyter Notebooks, although you can also use IDEs like VS Code or RStudio with the fully licensed version of Dataiku DSS. In early April the NYC Dataiku User Group will be doing a community gathering on the use of VS Code and Dataiku DSS. You are welcome to attend.
When it comes to integration with MS Excel, DSS does a fairly nice job of reading spreadsheets as data input and exporting to MS Excel Spreadsheets.
You can also get a plugin for DSS https://www.dataiku.com/product/plugins/spss-format/ that will read SPSS files.
Regarding Churn Modeling, you might also want to look into Survival analysis as another approach to the same kind of research question.
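To make the survival analysis idea a bit more concrete, here is a minimal sketch of a Kaplan-Meier estimate in plain Python. The data are made up, and the names `kaplan_meier`, `terms`, and `left` are hypothetical, chosen only for illustration. It assumes each student has a number of terms observed and a flag for whether they actually left or were still enrolled when observation ended (censored). In practice you would more likely use an established survival analysis library.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with at least one event.

    durations: how long each subject was followed (e.g. terms enrolled)
    observed:  True if the event (leaving) happened at that time,
               False if the subject was censored (still enrolled when last seen)
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone drops out of the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of the at-risk group that did not leave at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data, not real student records.
terms = [1, 2, 2, 3, 4, 4, 6, 8, 8, 8]
left = [True, True, False, True, True, False, True, False, False, False]

curve = kaplan_meier(terms, left)
for t, s in curve:
    print(f"term {t}: estimated share still enrolled = {s:.3f}")
```

Unlike a plain churn classifier, this keeps students who have not left yet in the calculation for as long as they were observed, which is often the situation in retention data.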
I'd love to have a further chat if you are so inclined. Direct message me here in the DSS community if that is of interest to you.
Hi Tom @tgb417,
Thank you for the information. All the data I am using comes from Snowflake. We have multiple systems on campus, and we are working to connect them all to the warehouse. It has been a long project as we are governing all the data that comes in and that takes time.
Should I still add PostgreSQL?
I am a member of the NYC User Group even though I am in CNY, and I will be attending all those events. I would also like to connect with you. I have been hoping to find someone who understands this data and its complexities to connect with and learn from.
I just recently graduated with an MSDS, and my final project was an analysis of STEM persistence using Dataiku as the platform. I ran a random forest and a logistic regression to determine the features that impacted persistence. My paper, "Machine Learning Classification for predicting STEM persistence of women and Underrepresented Minorities", was published on ProQuest last month. However, being new to the platform, I feel that I missed a lot in terms of the models' performance. I had, and actually still have, so many questions regarding the results of the models. Women and Underrepresented Minorities in STEM and Math is a passion of mine.
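As an illustration of the kind of performance question that comes up with persistence classifiers, here is a minimal sketch in plain Python, assuming binary labels where 1 means "persisted" and 0 means "did not persist". The helper `binary_metrics` and the toy labels are hypothetical and are not taken from the paper or from Dataiku's own model reports.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for the chosen positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels: 8 students persisted (1), 2 did not (0).
actual = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Treat "did not persist" as the class of interest.
metrics = binary_metrics(actual, predicted, positive=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the accuracy looks high while recall on the "did not persist" class is only one half, which is why it can help to look at per-class metrics when the at-risk group is small.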
There are other organizations in the community doing similar things with Snowflake. (I've not had an opportunity to work with Snowflake yet.)
Regarding PostgreSQL: if you have the right to download intermediate datasets for your analytics, and you are not pushing all of the analysis back to your Snowflake infrastructure because of data security concerns, then yes, I would want a PostgreSQL server on my local machine. I prefer this to the file system-based method that DSS will fall back to by default. It offers more sophistication and possibly a bit more speed for intermediate-sized datasets, from hundreds of thousands of rows to a few million rows of data.
During the COVID-19 period, you are completely welcome to join our NYC-based Dataiku meet-up meetings. When we eventually get back to meeting face to face, this may be a bit more difficult for you. Glad to hear that you will be joining us.
So glad to hear about the recent publication in ProQuest. Is there a public link to the paper?
I'm wondering if there is a place to connect you with some of the data scientists at Dataiku to better understand the modeling that you have done so far.