Join us on Wednesday, June 3rd for a deep dive into Customer Predictive Analytics Learn more

Parsing XML file

Level 2
Parsing XML file

I have an XML file as shown below which I am loading as a dataset in Dataiku. The XML file shows the membership of each user in in each room. In the Format/Preview dataset screen I can either select /FullReport/rooms/room to get all the rooms or /FullReport/rooms/users to get all the users. But what I really want is to have the combined output as follows:

roomIDroomTitlefullName
1Room 1JOHN DOE
1Room 1TOM DOE
2Room 2JOHN DOE
2Room 2JOHN SMITH
3Room 3SARAH SMITH

 

Any ideas how to do that in DSS?

<FullReport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <rooms>
        <room>
            <roomID>1</roomID>
            <roomTitle>Room 1</roomTitle>
        </room>
        <users>
            <fullName>JOHN DOE</fullName>
        </users>
        <users>
            <fullName>TOM DOE</fullName>
        </users>
        </users>
    </rooms>
    <rooms>
        <room>
            <roomID>2</roomID>
            <roomTitle>Room 2</roomTitle>
        </room>
        <users>
            <fullName>JOHN DOE</fullName>
        </users>
        <users>
            <fullName>JOHN SMITH</fullName>
        </users>
        </users>
    </rooms>
    <rooms>
        <room>
            <roomID>3</roomID>
            <roomTitle>Room 3</roomTitle>
        </room>
        <users>
            <fullName>SARAH SMITH</fullName>
        </users>
        </users>
    </rooms>
</FullReport>

 

2 Replies
Dataiker
Dataiker

You can do this in 2 steps:

- import the XML with format params:

  - Root element XPath -> /FullReport/rooms

  - Data extraction -> Explicit XPath

  - extracted elements : 

    - ./room/roomTitle/text() -> roomTitle. (unique)

    - ./room/roomID/text() -> roomID  (unique)

    - ./users/fullName/text() -> users

- use a Prepare recipe to fold the users array with the Fold processor

Level 2
Author

Hi, I played with it last night a bit more and I ended up doing the same thing you suggested so thanks for validating the road I took. It's a pity you can't extract the values into rows as part of the load as this can speed up the file load. But for my use case the file wasn't that big so the extra recipe didn't take that long. Thanks!

Labels (3)