read selected columns

davidmakovoz
davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron

I have a csv files with 9 columns, but I only need 7. Is there a way to read only the columns I need while creating a dataset?

I tried to modify the schema and remove those columns, but they were read anyway, and a warning message was shown that the number of columns in the data doesn't match the schema.

Thanks,

David

Tagged:

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    Answer ✓

    When you say "removed". To clarify I think that you are meaning Removed from the Schema on the Schema Tap.

    But if you were to look at the data, directly in the data file. Both columns remain in the source data file.

    It is these columns that you wished for DSS not to import.

    As I'm explaining this, I'm realizing that DSS keeps a Schema to understand what to expect in a data file. Not to control the contents of the results. When we delete the columns from the Schema and there are actually columns it by default gets confused, and it's telling you so.

    However, there is a way to deal with this. Set the following value on the preview screen and Save.

    Schema column mismatch.jpg

    The system will complain. But, at least for me it is doing what you want.

    That said DSS is working very hard to help you get all of the data.

    Hope this helps.

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @davidmakovoz
    ,

    As you probably know, there are a bunch of different ways to grab a CSV file with Dataiku DSS. You don't mention which approach you are using.

    Different way to upload CSV Files.jpg

    Here are a few that come to mind, some which can make changes prior to loading into DSS

    • R Scripts that can make changes before saving into a DSS managed datastore
    • Python Script that can make changes before saving to DSS.

    Some that I don't know how to use to make changes to CSV files prior to loading into at least one DSS data store.

    • Download Recipe
    • Upload file system recipes

    Because a CSV file is a character-oriented text file something has to read the entire file, to parse out the columns you do not want, before loading the columns that you do want into a DSS managed data repository.

    There are lots of other ways to load datasets into DSS. Some have an active process that might be able to do this processing.

    Data Import Methods.jpg

    If I have to use processing within DSS to adjust my initial set of columns. (Which I have to say I almost never do.) I guess that I would typically have two data sets. One with the original data. The other with the subset of data that I need to share. With a process node of some type between the two copies of the data.

    Can you share a bit more about your concerns about downloading all 9 columns and then cutting down to 7 columns.

    Does, anyone else know a way to do this?

  • davidmakovoz
    davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron

    I read the csv file from my filesystem.

    Yes, I can later in the prepare step remove those columns, but the warning messages continue about mismatches.

    Also, why I am even allowed to remove those columns from the schema, if it doesn't affect the way the data is read?

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    Can you clarify more about "Mismatches?"

    And Yeah. The interface to remove the columns from the Schema does seem kind of strange. I guess if I manually added reference in the Schema, then I might want to delete a Schema row that is not needed. I did a quick look at the documentation and did not see anything obvious.

    We will see if anyone comes back with any more useful insights.

  • davidmakovoz
    davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron
    edited July 17

    Here is the message, geopoint and col_8 are the columns removed from the csv files.

    INPUT_DATA_EXTRA_COLUMNS occurred 2 times
    file=/data/dataiku/dss_data/uploads/BIPDEMO/datasets/us_zip_code_latitude_and_longitude/us-zip-code-latitude-and-longitude.csv line=1 Unexpected column found in header:geopoint
    file=/data/dataiku/dss_data/uploads/BIPDEMO/datasets/us_zip_code_latitude_and_longitude/us-zip-code-latitude-and-longitude.csv line=1 Unexpected column found in header:col_8

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @davidmakovoz

    Have you tried this later approach? Has this worked for you?

  • davidmakovoz
    davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron

    Yes, it works, thank you

  • davidmakovoz
    davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron

    Btw, even after I selected 'Discard (silent)' I still see the column names in the created dataset, see the attachment, which is kind of annoying.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 412 Neuron

    Hi @davidmakovoz
    . I feel your frustration, but I think these kind of barriers that forces us to think or do things in a different way are good in the long term!

    I can't say that this will work for everybody, but it has been working for me and my group: Whenever there is something in the DSS that seems to be missing or not working as we expected, we ask ourselves the question "is there a reason that something that seems so obvious to us is not supported?"

    In 90% of the cases we have learnt something new. For example, in this case that you mention, my take is that by uploading a raw file coming from an external DSS data source, it is a good practice to use a "Prepare recipe" to make explicit the filtering and selections I did over the original raw data, so whenever other person uses the same raw file and tries to repeats my analysis it will be obvious to the other person that I did a selection and some filtering first.

    Hope this helps!

  • NickPedersen
    NickPedersen Registered, Frontrunner 2022 Finalist, Frontrunner 2022 Participant Posts: 6 ✭✭✭✭

    Hi,

    I actually have a similar requirement. I am trying to read a delta file, but the table has an array type column in it and SparkSQL is not supporting that column type. So I can´t read that table as a delta. A workaround could be to not read that column from the source. So is there any way to omit reading columns from the source (in this case its ADLS through HDFS)?

Setup Info
    Tags
      Help me…