Empty exported file from not empty dataset

Eduard, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3

Hi,

I've been trying to export some datasets with thousands or millions of rows to CSV or XLSX files.

Even though the datasets are not empty in the DSS flow, the exported files contain the expected number of rows each time, but all the columns are empty, as in the picture below:

[Screenshot: Capture d’écran 2022-03-15 114744.png]

The datasets are stored as Parquet files in Hadoop, and the recipe engine is Hive. Partitioning is not activated.

I've noticed that this is project dependent, it does not happen in other projects in the same DSS environment.

Do you have any idea how to deal with this issue?

Thanks in advance.

EDIT:
I've noticed that the dataset exported as an empty file has no storage types on its columns, only meanings... I cannot understand why...

Before prep recipe (export OK):

[Screenshot: Capture d’écran 2022-03-15 133845.png]

After prep recipe (exported as an empty file):

[Screenshot: Capture d’écran 2022-03-15 133842.png]

UPDATE:
The types are lost when the dataset is stored as Parquet. When it is stored as CSV (Hive compatible), the types are kept and the export is successful.

This solves my issue, but I'd like to understand why this is happening... Any ideas?
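
For reference, here is a minimal sketch of how the column storage types could be inspected and forced before re-exporting, assuming the public `dataikuapi` client; the instance URL, API key, project key, and dataset name below are hypothetical:

    import dataikuapi

    # Sketch only: force a storage type on any column that has none.
    # The URL, API key, project key, and dataset name are placeholders.
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "my-api-key")
    dataset = client.get_project("MYPROJECT").get_dataset("my_prepared_dataset")

    schema = dataset.get_schema()  # {"columns": [{"name": ..., "type": ...}, ...]}
    changed = False
    for col in schema["columns"]:
        if not col.get("type"):
            # A column with only a meaning and no storage type: default it to string
            col["type"] = "string"
            changed = True

    if changed:
        dataset.set_schema(schema)
        print("Schema updated; rebuild the dataset before exporting again.")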


Operating system used: Windows 10

Answers

  • MikeG, Dataiker, Registered Posts: 15
    edited July 17

    Hi @Pasavento,

    I’m taking a look at this. I haven’t been able to reproduce the behavior you’re describing, where exporting an HDFS Parquet dataset to .csv in DSS results in a .csv file with empty columns.

    In my lab setup I have a table called `orders_prepared_copy_parquet` in Hive, stored on HDFS using the Parquet file format:

    0: jdbc:hive2://hdp3.local:2181/default> describe formatted orders_prepared_copy_parquet;
    [...]
    +-------------------------------+----------------------------------------------------+----------------------------------------------------+
    |           col_name            |                     data_type                      |                      comment                       |
    +-------------------------------+----------------------------------------------------+----------------------------------------------------+
    | # col_name                    | data_type                                          | comment                                            |
    | order_date                    | string                                             |                                                    |
    | pages_visited                 | bigint                                             |                                                    |
    | order_id                      | string                                             |                                                    |
    | customer_id                   | string                                             |                                                    |
    | tshirt_category               | string                                             |                                                    |
    | tshirt_price                  | double                                             |                                                    |
    | tshirt_quantity               | bigint                                             |                                                    |
    |                               | NULL                                               | NULL                                               |
    | # Detailed Table Information  | NULL                                               | NULL                                               |
    | Database:                     | dataiku                                            | NULL                                               |
    | OwnerType:                    | USER                                               | NULL                                               |
    | Owner:                        | dataiku                                            | NULL                                               |
    | CreateTime:                   | Thu May 05 04:55:52 UTC 2022                       | NULL                                               |
    | LastAccessTime:               | UNKNOWN                                            | NULL                                               |
    | Retention:                    | 0                                                  | NULL                                               |
    | Location:                     | hdfs://hdp3.local:8020/user/dataiku/dss_managed_datasets/COMMUNITY_23816/orders_prepared_copy_parquet | NULL                                               |
    | Table Type:                   | EXTERNAL_TABLE                                     | NULL                                               |
    | Table Parameters:             | NULL                                               | NULL                                               |
    |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"customer_id\":\"true\",\"order_date\":\"true\",\"order_id\":\"true\",\"pages_visited\":\"true\",\"tshirt_category\":\"true\",\"tshirt_price\":\"true\",\"tshirt_quantity\":\"true\"}} |
    |                               | DSS_DEFAUT_HIVE_TABLE_TAG                          | eyJtZ3QiOnRydWUsImZtdCI6IlgwTi9HTiIsInBhcnQiOiIxQjJNMlkiLCJsb2MiOiI3OXB2RWIifQ== |
    |                               | EXTERNAL                                           | TRUE                                               |
    |                               | bucketing_version                                  | 2                                                  |
    |                               | discover.partitions                                | true                                               |
    |                               | numFiles                                           | 1                                                  |
    |                               | numRows                                            | 26552                                              |
    |                               | parquet.compression                                | SNAPPY                                             |
    |                               | rawDataSize                                        | 185864                                             |
    |                               | totalSize                                          | 465555                                             |
    |                               | transient_lastDdlTime                              | 1651726566                                         |
    |                               | NULL                                               | NULL                                               |
    | # Storage Information         | NULL                                               | NULL                                               |
    | SerDe Library:                | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL                                               |
    | InputFormat:                  | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL                                               |
    | OutputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL                                               |
    | Compressed:                   | No                                                 | NULL                                               |
    | Num Buckets:                  | -1                                                 | NULL                                               |
    | Bucket Columns:               | []                                                 | NULL                                               |
    | Sort Columns:                 | []                                                 | NULL                                               |
    | Storage Desc Params:          | NULL                                               | NULL                                               |
    |                               | serialization.format                               | 1                                                  |
    +-------------------------------+----------------------------------------------------+----------------------------------------------------+
    41 rows selected (0.045 seconds)

    I am using Hive 3.1.0.3.1.5.0-152 and Hadoop 3.1.1.3.1.5.0-152 (and DSS 10.0.5):

    [centos@hdp3 ~]$ hive --version
    [...]
    Hive 3.1.0.3.1.5.0-152
    [...]
    From source with checksum d17a651ba7b1cd4feb847dc38547bf07
    [centos@hdp3 ~]$ hdfs version
    Hadoop 3.1.1.3.1.5.0-152
    [...]

    In DSS I add the HDFS dataset `orders_prepared_copy_parquet` to the Flow, then I export it to .csv:

    [Screenshot: Screen Shot 2022-05-04 at 10.09.34 PM.png]

    In the resulting .csv file I can see the expected values in each column (i.e., from DSS I am able to successfully export an HDFS dataset stored in the Parquet file format to a .csv file):

    mgallegos@MacBookPro 22-05-04 22:10  ~/Downloads$ head orders_prepared_copy_parquet.csv
    order_date,pages_visited,order_id,customer_id,tshirt_category,tshirt_price,tshirt_quantity
    2016/09/04,9,HTS-038040-0002,038040,White T-Shirt M,20.0,1
    2014/11/14,11,HTS-801797-0001,801797,White T-Shirt M,20.0,1
    2017/02/26,10,HTS-vft1eu-0003,vft1eu,White T-Shirt F,18.0,3
    2013/12/01,10,HTS-914324-0001,914324,Wh Tshirt F,18.0,1
    2015/10/22,12,HTS-88ua9r-0001,88ua9r,White T-Shirt M,20.0,1
    2016/01/15,9,HTS-061311-0003,061311,Black T-Shirt F,17.5,4
    2014/11/25,6,HTS-479441-0001,479441,Hoodie,23.0,19
    2014/11/01,10,HTS-352809-0001,352809,Black T-Shirt M,19.0,2
    2013/11/03,10,HTS-494332-0001,494332,Wh Tshirt F,18.0,2
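
    As a further check on the DSS side, here is a small sketch that reads a sample of the Parquet-backed dataset into pandas and writes it out as a .csv; it uses the internal `dataiku` package from a Python notebook inside DSS, and the dataset name matches my lab table, so adjust it for your project:

    import dataiku

    # Sketch: read a small sample of the Parquet-backed dataset inside DSS.
    # The dataset name matches my lab setup and is only illustrative.
    ds = dataiku.Dataset("orders_prepared_copy_parquet")
    df = ds.get_dataframe(limit=1000)

    print(df.dtypes)   # storage types as seen by DSS
    print(df.head())   # values should not be empty
    df.to_csv("/tmp/orders_prepared_copy_parquet_check.csv", index=False)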

    ---

    Next actions

    Can you post the output of `describe formatted` from Hive for the Parquet table of interest? (i.e., the table that, when exported to .csv from DSS, produces empty column values)

    It may be worth trying this workflow on the latest DSS (10.0.5) to see if it reproduces there for you. If it does, we’ll need to look more closely at an affected project’s configuration to see what differs between your setup and mine.

    Please let me know if I have misunderstood the issue you're encountering.

    Thank you,
    Mike
