read selected columns

davidmakovoz · ‎10-01-2020

I have a csv files with 9 columns, but I only need 7. Is there a way to read only the columns I need while creating a dataset?

I tried to modify the schema and remove those columns, but they were read anyway, and a warning message was shown that the number of columns in the data doesn't match the schema.

Thanks,

David

tgb417 · ‎10-01-2020

When you say "removed". To clarify I think that you are meaning Removed from the Schema on the Schema Tap.

But if you were to look at the data, directly in the data file. Both columns remain in the source data file.

It is these columns that you wished for DSS not to import.

As I'm explaining this, I'm realizing that DSS keeps a Schema to understand what to expect in a data file. Not to control the contents of the results. When we delete the columns from the Schema and there are actually columns it by default gets confused, and it's telling you so.

However, there is a way to deal with this. Set the following value on the preview screen and Save.

The system will complain. But, at least for me it is doing what you want.

That said DSS is working very hard to help you get all of the data.

Hope this helps.

--Tom

View solution in original post

tgb417 · ‎10-01-2020

@davidmakovoz ,

As you probably know, there are a bunch of different ways to grab a CSV file with Dataiku DSS. You don't mention which approach you are using.

Here are a few that come to mind, some which can make changes prior to loading into DSS

R Scripts that can make changes before saving into a DSS managed datastore
Python Script that can make changes before saving to DSS.

Some that I don't know how to use to make changes to CSV files prior to loading into at least one DSS data store.

Download Recipe
Upload file system recipes

Because a CSV file is a character-oriented text file something has to read the entire file, to parse out the columns you do not want, before loading the columns that you do want into a DSS managed data repository.

There are lots of other ways to load datasets into DSS. Some have an active process that might be able to do this processing.

If I have to use processing within DSS to adjust my initial set of columns. (Which I have to say I almost never do.) I guess that I would typically have two data sets. One with the original data. The other with the subset of data that I need to share. With a process node of some type between the two copies of the data.

Can you share a bit more about your concerns about downloading all 9 columns and then cutting down to 7 columns.

Does, anyone else know a way to do this?

--Tom

davidmakovoz · ‎10-01-2020

I read the csv file from my filesystem.

Yes, I can later in the prepare step remove those columns, but the warning messages continue about mismatches.

Also, why I am even allowed to remove those columns from the schema, if it doesn't affect the way the data is read?

tgb417 · ‎10-01-2020

Can you clarify more about "Mismatches?"

And Yeah. The interface to remove the columns from the Schema does seem kind of strange. I guess if I manually added reference in the Schema, then I might want to delete a Schema row that is not needed. I did a quick look at the documentation and did not see anything obvious.

We will see if anyone comes back with any more useful insights.

--Tom

davidmakovoz · ‎10-01-2020

Here is the message, geopoint and col_8 are the columns removed from the csv files.

INPUT_DATA_EXTRA_COLUMNS occurred 2 times
file=/data/dataiku/dss_data/uploads/BIPDEMO/datasets/us_zip_code_latitude_and_longitude/us-zip-code-latitude-and-longitude.csv line=1 Unexpected column found in header:geopoint
file=/data/dataiku/dss_data/uploads/BIPDEMO/datasets/us_zip_code_latitude_and_longitude/us-zip-code-latitude-and-longitude.csv line=1 Unexpected column found in header:col_8

tgb417 · ‎10-01-2020

When you say "removed". To clarify I think that you are meaning Removed from the Schema on the Schema Tap.

But if you were to look at the data, directly in the data file. Both columns remain in the source data file.

It is these columns that you wished for DSS not to import.

As I'm explaining this, I'm realizing that DSS keeps a Schema to understand what to expect in a data file. Not to control the contents of the results. When we delete the columns from the Schema and there are actually columns it by default gets confused, and it's telling you so.

However, there is a way to deal with this. Set the following value on the preview screen and Save.

The system will complain. But, at least for me it is doing what you want.

That said DSS is working very hard to help you get all of the data.

Hope this helps.

--Tom

tgb417 · ‎10-02-2020

@davidmakovoz

Have you tried this later approach? Has this worked for you?

--Tom

davidmakovoz · ‎10-05-2020

Yes, it works, thank you

davidmakovoz · ‎10-05-2020

Btw, even after I selected 'Discard (silent)' I still see the column names in the created dataset, see the attachment, which is kind of annoying.

Ignacio_Toledo · ‎10-05-2020

Hi @davidmakovoz. I feel your frustration, but I think these kind of barriers that forces us to think or do things in a different way are good in the long term!

I can't say that this will work for everybody, but it has been working for me and my group: Whenever there is something in the DSS that seems to be missing or not working as we expected, we ask ourselves the question "is there a reason that something that seems so obvious to us is not supported?"

In 90% of the cases we have learnt something new. For example, in this case that you mention, my take is that by uploading a raw file coming from an external DSS data source, it is a good practice to use a "Prepare recipe" to make explicit the filtering and selections I did over the original raw data, so whenever other person uses the same raw file and tries to repeats my analysis it will be obvious to the other person that I did a selection and some filtering first.

Hope this helps!

NickPedersen · ‎06-07-2021

Hi,

I actually have a similar requirement. I am trying to read a delta file, but the table has an array type column in it and SparkSQL is not supporting that column type. So I can´t read that table as a delta. A workaround could be to not read that column from the source. So is there any way to omit reading columns from the source (in this case its ADLS through HDFS)?

read selected columns

read selected columns

Here is the message, geopoint and col_8 are the columns removed from the csv files.

Labels

Datasets

Sign up to take part

read selected columns

read selected columns

Here is the message, geopoint and col_8 are the columns removed from the csv files.

Labels

Datasets