Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Please forgive me if this has been asked a million times, but I was unable to find what I was looking for with the searches I tried.
In my last job, our data specialist used DSS to migrate data from a legacy database (in Paradox, I believe) to MySql. There were a huge number of differences between the two schemas, but he was able to accomplish everything that was needed with this marvelous tool!
I now have a similar task—migrating between an old and new MySql db—but I'm having a hard time figuring out the next step(s).
I have connections to both databases, and I've added the tables from the source database to my project. At this point, I have no idea how to add the second connection (to the destination db).
Edited to add: I've been able to add datasets from both the source and destination databases but I need some guidance on what kind of recipes will allow me to modify and export data from one schema to the other. I've included a screen shot showing part of two schemas.
Any help you can offer would be greatly appreciated!
Is the image the schema of the "Source" database or the destination database?
What schema are you looking for in your destination database? Is it the same as the existing database? Or are you intending to make a bunch of changes?
The left-hand table is the source, and the right-hand, normalized tables are the destination. Thanks!
In the image, I see an Entity Relationship Diagram ERD and one copy of a series of related tables on the right. This image does not appear to be all of the relationships given that a number of lines point outside the frame of the green window.
On the left-hand side of the image, you have sent I see what appears to be the schema of maybe just one of the memberships table on the right the members table. However, I don't know if I'm seeing the whole description of all of the elements in tblMembers.
Is there just one table tblMembers today and you are trying to break this out into a relational table schema?
Yes, that’s the essence of it. That horrible tblMembers needs to be broken out into a normalized structure. The image only shows part of it—all of the columns in tblMembers beyond the first handful need to be moved into separate tables with constraints and some pivot tables.
My hope is to be able to say (for example) “tblMembers.Active” maps to “members.active” and “tblMembers.Address1” maps to “addresses.address_1”, which has a foreign key relationship to “members”, via “members_addresses”, etc. Does that make sense? (I have to admit that I’m primarily a coder and not a database guy)
Yeah, that makes some sense.
Most DBA folks would likely do this by using native Database Table Create commands. They would implement an appropriate relational database design (normalized) with appropriate key and foreign key relationships; maybe add a few indexes to keep performance up. Maybe add a few table triggers, to include information about when records were created who created the records. And then use a series of insert SQL commands to actually copy the content from your denormalized table (on the left) into the new relational tables schema (on the right).
Although you might use DSS visual recipes to create the datasets for the needed tables. DSS tends to produce tables for analytic purposes. And analysts tend to use denormalized datasets to do analytics. (Very much like what you have in your left single monolithic table.)
When it comes to implementing a normalized database schema, I don't know what DSS is doing under the hood related to key relationships between tables. I'm fairly sure that it does not put triggers on the tables it creates. You could do all of these things from a SQL recipes inside DSS. However, in that case, you are basically doing the steps of a DBA, just using DSS as a front-end Database tool. Although I like DSS a lot for data analytics and machine learning. There are a number of tools that are better suited to relational database design and administration.
If this is an important project, I'd suggest that if you don't have the existing skills, that you consider a short term consultant to implement a relational database schema. You appear to have most of the design work done. So, I would think that a good DBA could work through your specification fairly quickly.
Is there anyone else out there who is using DSS and a primary front-end application for building and managing a normalized relational database?
Just my $0.02 do with this as you see fit.
Thanks @tgb417!. I'm actually using something called DbShema to build out the ERD. It sync's with the database as I go along. I'm swimming in some unfamiliar waters, but folks here are helping stay afloat.
Hi Tom ( @tgb417 ),
Related to the use of DSS in these kind of scenarios, there is post in the community where they talk about how DSS can take care of the process:
Can we leverage Dataiku for performing ETL operations?
Is not a capability too often highlighted, and so there are not many examples or tutorials. But we are using DSS exactly for that in at least 4 or 5 projects.
Hi @mpemburn, welcome to the community!
I think I fully understand your use case and it looks like you will need to create an ETL pipeline (a one time pipeline since you are migrating), a thing that DSS makes wonderfully. You are not only migrating to a new database but also creating a new data model.
You already have a data model for your new database tables (the ERD that @tgb417 mentions). What you need now is to start creating the "flow" that will get you from your source to your new data model.
Some questions before making suggestion or providing hints: is the tblMembers the only source table? I ask because in your screenshot it is not clear from where do you get the 'leaders' information or 'leadership_roles'.
This seems like a really nice use case were one can learn a lot about ETL and data modeling, so I'll be happy to help. I've been working on a similar problem, where we start from a data model with 4 tables (instead of one) and we are ending up with the diagram shown on the attached file.
Excellent—thank you! I knew this could be done because I watched my coworker doing something far more complex than this in DSS.
tblMembers is not the only table in the source, just the one that requires the most normalization. There are also a bunch more tables on the destination side that I’m not showing. I still have a bit of work to do on the latter, but I wanted to get a start on understanding DSS first.
Thank you for the offer of help! I look forward to the next step!
Hi @mpemburn. No need to thanks, I just hope I can give you some hints on what to do.
What I could do is to create a small video showing a similar use case we are working within our organization, that could give you hints on how to move forward with the normalization. Would that work for you?
Also, I think the input from Tom (@tgb417) is relevant, and once you are done with the normalization steps, questions about declaring the indexes and relationships should be addressed. Because you saw that one colleague of yours was actually able to do it within DSS, and the availability of SQL recipes / scripts, I think you will be able to do all the things within the platform, with some extra tweaking. Disclaimer: in that second step, declaring indexes, triggers, constraints, I'm not an expert, but I think you should be able to find the needed help at that point, from a DBA like Tom's say.
That is most generous of you! I've attached the source (sacredwh_roster) and destination (ASW Roster Schema) ERD's. The latter has most of the foreign keys defined, but I haven't yet added any indexes besides the primary. All of this is to the best of my current knowledge which, I'll admit, has a lot holes in it.
Hi @mpemburn. Thanks for the information, that is useful to have a better idea of the sources and targets. What I'd do first in your place, is try to "draft" the relations or connections between the original columns and the target columns.
I'll be first working on a small hands-on video with the datasets we are working at ALMA (non-sensitive data), to give you an idea of how we use the Dataiku recipes to go from one model to the other. I wouldn't be able to create a hands-on or a project with your particular tables because I don't know their particularities, like: do we need to deal with duplicate entries? is there any information missing for some users or members? Are all members also users or vice versa? Where does the "prefix" column in the output data model comes from? (I can't find it in that column in the original source, sacredwh).
Hopefully I will have something by Monday or Tuesday, if that is OK with you.
Excellent—thank you! I think that if I can get over the hurdle of connecting a dataset from the source to that of the destination, I can probably figure out what needs to be done from there. As I mentioned, I’m a coder by trade, and if the solution requires some code, I’m equal to the task, even if it’s not in a language I’m accustomed to.
There probably won’t be any duplicates to worry about—this is a very small database representing fewer than 200 members. There will, however, be things like having to break out an email field that is supposed to contain only a single address and, in some cases, contains three (oy).
As an aside: I was not familiar with ETL before you mentioned it, and I can see that there are a lot of tools out there that support it (most of them paid, and I have a zero-dollar budget 😉). If DSS is capable of doing ETL, this fact seems to little known. Nothing in the docs, and the one question I found in the community has gone unanswered. Fertile ground for some new information, I’d say!
Just wanted to keep you updated that I've not forgotten about this topic. Is just that I've been under more workload than I expected at my working place so I haven't been able to move as quick as I wished.
I hope you have been able to keep moving forward anyway! I'll be back soon with some tips.
Thanks @Ignacio_Toledo —I figured this was probably the case. I've definitely got my hands full with developing the code that will support this database, so no worries!
Hi @mpemburn. I'm working in some tutorial material using a sample MySQL database called sakila.
I'm attaching here a test or draft Project, where I'm developing an ETL exercise. However, to import it, you need to have your own mysql installation, connected to dataiku, and with the example data loaded. The full instructions to install sakila are here: https://dev.mysql.com/doc/sakila/en/sakila-installation.html, and it takes about 5 minutes to install.
If installing this database and example data goes beyond your expertise or time available (installing MySQL requires some available time), let me know and then I'll adapt the project to start from csv files. But I think there might some value added by starting from a SQL database.
Question: the project will eventually write the outputs into a postgreSQL database. Do you have one you can access with dataiku?
I have a local MySQL setup, so it's no problem adding the sakila db to it. I haven't worked with postgres, but I installed it on my Mac today along with a copy of pgAdmin to manage it. I've also been able to connect it to dataiku, so we should be ready to roll!
Let me know when you are ready, and if you need help importing the project in DSS.
(I hope that this project and thread can later be transformed perhaps in a tutorial that might be useful for other users too)
I guess I'll need some help. After adding the sakila database to my local MySQL, I created a DSS connection called sakila, and then tried importing the project .zip file. It seemed to be working, but then threw a lot of errors:
Failed loading data for a dataset city: Table 'city' already exists (repeat for sever tables)
Bundle was created on a design node with plugins github, text_monitor_prepare, confluence, googledrive, time-series-forecast, excel-sheet-importer, text_monitor_retrieve, timeseries-preparation, clear-intermediate-datasets, waterfall-chart, googlesheets, join-and-keep-unmatched, dataset-audit, graph-analytics, sendmail, project-deployer but they're not installed
The project itself does exist, though, so maybe it just needs some tweaks.
I'm attaching a new version, could you try now? I think that previously I included not only the dataset connections but also the data, and that might create that error related to the already existing tables.
The warning about the plugins you can ignore safely in this case, as non of the plugins installed in my node are being used in this project.
Let's see if that works!