I have a CSV file with 9 columns, but I only need 7. Is there a way to read only the columns I need while creating a dataset?
I tried to modify the schema and remove those columns, but they were read anyway, and a warning was shown saying the number of columns in the data doesn't match the schema.
As you probably know, there are a bunch of different ways to grab a CSV file with Dataiku DSS. You don't mention which approach you are using.
Here are a few that come to mind: some of which can make changes prior to loading into DSS, and some that I don't know how to use to make changes to CSV files before loading them into at least one DSS data store.
Because a CSV file is a character-oriented text file, something has to read the entire file and parse out the columns you do not want before loading the columns you do want into a DSS-managed data repository.
There are lots of other ways to load datasets into DSS. Some have an active process that might be able to do this processing.
If I had to use processing within DSS to adjust my initial set of columns (which, I have to say, I almost never do), I would typically keep two datasets: one with the original data, and one with the subset of data I need to share, with a processing recipe of some type between the two copies of the data.
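Outside of DSS itself, that two-dataset pattern boils down to: read the original file, keep only the wanted columns, and write a trimmed copy. Here is a minimal sketch using Python's standard csv module (the file paths and column names are hypothetical, not from your actual data):

```python
# Hypothetical sketch of the "keep the original, write a trimmed copy" pattern.
# KEEP and the file paths are invented for illustration.
import csv

KEEP = ["zip", "city", "state"]  # the columns you actually need

def write_subset(src_path, dst_path, keep=KEEP):
    """Copy src_path to dst_path, keeping only the columns listed in keep."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=keep)
        writer.writeheader()
        for row in reader:
            # Copy only the wanted columns; the extra ones are dropped here.
            writer.writerow({col: row[col] for col in keep})
```

The original file stays untouched, which mirrors having one dataset with the raw data and a second with the shared subset.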
Can you share a bit more about your concerns about loading all 9 columns and then cutting down to 7?
Does anyone else know a way to do this?
I read the CSV file from my filesystem.
Yes, I can remove those columns later in the Prepare step, but the warning messages about mismatches continue.
Also, why am I even allowed to remove those columns from the schema if it doesn't affect the way the data is read?
Can you clarify more about "Mismatches?"
And yeah, the interface for removing columns from the Schema does seem kind of strange. I guess if I had manually added a reference in the Schema, I might want to delete a Schema row that is not needed. I did a quick look at the documentation and did not see anything obvious.
We will see if anyone comes back with any more useful insights.
INPUT_DATA_EXTRA_COLUMNS occurred 2 times
file=/data/dataiku/dss_data/uploads/BIPDEMO/datasets/us_zip_code_latitude_and_longitude/us-zip-code-latitude-and-longitude.csv line=1 Unexpected column found in header:geopoint
file=/data/dataiku/dss_data/uploads/BIPDEMO/datasets/us_zip_code_latitude_and_longitude/us-zip-code-latitude-and-longitude.csv line=1 Unexpected column found in header:col_8
When you say "removed", to clarify, I think you mean removed from the Schema on the Schema tab.
But if you were to look at the data directly in the data file, both columns remain in the source file.
These are the columns you wished DSS would not import.
As I'm explaining this, I'm realizing that DSS keeps a schema to describe what to expect in a data file, not to control the contents of the result. When we delete columns from the schema that are actually present in the data, DSS by default gets confused, and it's telling you so.
However, there is a way to deal with this. Set the following value on the preview screen and save.
The system will complain, but, at least for me, it does what you want.
That said, DSS is working very hard to help you get all of the data.
Hope this helps.
Hi @davidmakovoz. I feel your frustration, but I think these kinds of barriers that force us to think or do things in a different way are good in the long term!
I can't say that this will work for everybody, but it has been working for me and my group: whenever there is something in DSS that seems to be missing or not working as we expected, we ask ourselves, "is there a reason that something that seems so obvious to us is not supported?"
In 90% of the cases we have learned something new. In this case, my take is that when uploading a raw file coming from an external data source, it is good practice to use a Prepare recipe to make explicit the filtering and selection done on the original raw data, so that whenever another person uses the same raw file and tries to repeat my analysis, it will be obvious to them that I did a selection and some filtering first.
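As a side note, if the column selection ever does have to happen at read time, for example inside a Python recipe, pandas can skip columns while parsing rather than reading them and dropping them afterwards. A minimal sketch, assuming the file has a header row (the column names below are invented for illustration):

```python
# Minimal sketch: let pandas parse only the columns we care about.
# csv_text stands in for a real file path; the columns are examples.
import io

import pandas as pd

csv_text = """zip,city,state,lat,lon
00601,Adjuntas,PR,18.18,-66.75
00602,Aguada,PR,18.36,-67.18
"""

# usecols tells read_csv to parse only the listed columns; the others
# are skipped during parsing instead of being loaded and then dropped.
df = pd.read_csv(io.StringIO(csv_text), usecols=["zip", "city", "state"])
print(list(df.columns))  # prints ['zip', 'city', 'state']
```

This doesn't change DSS's schema behavior, but it keeps the "only 7 of 9 columns" decision explicit in code, in the same spirit as an explicit Prepare recipe.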
Hope this helps!