Parsing XML file
I have an XML file as shown below which I am loading as a dataset in Dataiku. The XML file shows the membership of each user in in each room. In the Format/Preview dataset screen I can either select /FullReport/rooms/room to get all the rooms or /FullReport/rooms/users to get all the users. But what I really want is to have the combined output as follows:
roomID | roomTitle | fullName |
1 | Room 1 | JOHN DOE |
1 | Room 1 | TOM DOE |
2 | Room 2 | JOHN DOE |
2 | Room 2 | JOHN SMITH |
3 | Room 3 | SARAH SMITH |
Any ideas how to do that in DSS?
<FullReport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <rooms> <room> <roomID>1</roomID> <roomTitle>Room 1</roomTitle> </room> <users> <fullName>JOHN DOE</fullName> </users> <users> <fullName>TOM DOE</fullName> </users> </users> </rooms> <rooms> <room> <roomID>2</roomID> <roomTitle>Room 2</roomTitle> </room> <users> <fullName>JOHN DOE</fullName> </users> <users> <fullName>JOHN SMITH</fullName> </users> </users> </rooms> <rooms> <room> <roomID>3</roomID> <roomTitle>Room 3</roomTitle> </room> <users> <fullName>SARAH SMITH</fullName> </users> </users> </rooms> </FullReport>
Best Answer
-
You can do this in 2 steps:
- import the XML with format params:
- Root element XPath -> /FullReport/rooms
- Data extraction -> Explicit XPath
- extracted elements :
- ./room/roomTitle/text() -> roomTitle. (unique)
- ./room/roomID/text() -> roomID (unique)
- ./users/fullName/text() -> users
- use a Prepare recipe to fold the users array with the Fold processor
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron
Hi, I played with it last night a bit more and I ended up doing the same thing you suggested so thanks for validating the road I took. It's a pity you can't extract the values into rows as part of the load as this can speed up the file load. But for my use case the file wasn't that big so the extra recipe didn't take that long. Thanks!