Spark dataframe with illegal characters in column names

thunt Registered Posts: 2 ✭✭✭✭
When I try to run a recipe that uses a dataframe with a column whose name contains a space (like 'Number of Entries'), the recipe crashes with an org.apache.spark.sql.AnalysisException saying the column name contains invalid characters.

Is there a way to change this in the Dataiku settings page for the dataset? I tried editing the override variables under Advanced with something like 'schema.columns[0].name', but this did not appear to have any effect. What is the best way to deal with these kinds of problems?

Answers

  • Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753
    Hi,

    The Parquet writer in Spark cannot handle special characters in column names at all; this is simply unsupported.

    If you are in a code recipe, you'll need to rename the column in your code using select, alias, or withColumnRenamed (see the sketch below).
    If you are in a visual recipe, you'll need to rename your column prior to this recipe, for example with a prepare recipe.
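
    For the code-recipe case, a minimal PySpark sketch; the dataframe contents, column names, and output path are illustrative, and the SparkSession setup is assumed:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col

        spark = SparkSession.builder.getOrCreate()

        # Illustrative dataframe whose column name would break the Parquet writer
        df = spark.createDataFrame([(1,), (2,)], ["Number of Entries"])

        # Option 1: rename the offending column in place
        renamed = df.withColumnRenamed("Number of Entries", "number_of_entries")

        # Option 2: select with an alias (handy when reshaping anyway)
        renamed = df.select(col("Number of Entries").alias("number_of_entries"))

        # With a Parquet-safe name, the write no longer raises AnalysisException
        renamed.write.mode("overwrite").parquet("/tmp/entries.parquet")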

    Other options include using CSV instead of Parquet.

    Generally speaking, given the many idiosyncrasies and behavioral differences between engines, we strongly recommend that as soon as your data enters the Hadoop/Spark world, you use only lowercase column names without any special characters, just_like_that.
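
    One way to enforce that convention at the boundary is to normalize every column name as the data enters Spark. A minimal sketch, where the regex and the naming rule are just one possible convention and df is assumed to be the dataframe to clean:

        import re

        def normalize(name):
            # Lowercase and collapse runs of non-alphanumeric characters into a
            # single underscore, e.g. "Number of Entries" -> "number_of_entries"
            return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

        df = df.toDF(*[normalize(c) for c in df.columns])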
  • jmccartin Registered Posts: 19 ✭✭✭✭
    Actually, the 'withColumnRenamed' trick doesn't appear to work, at least on the data we have. I used to get around this problem by manually specifying the schema when calling 'spark.read.parquet()' (sketched below). Since Dataiku is now doing the reading for us, will you add support for schema overrides in the future?
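
    For reference, that schema-on-read workaround looks roughly like this; a sketch only, where the schema fields and the file path are illustrative, a SparkSession named spark is assumed, and whether it helps depends on the Spark version and the data:

        from pyspark.sql.types import StructType, StructField, LongType

        # Illustrative schema with Parquet-safe field names; adapt to your data
        schema = StructType([
            StructField("number_of_entries", LongType(), True),
        ])

        # Supply the schema explicitly instead of letting Spark infer it
        df = spark.read.schema(schema).parquet("/path/to/data.parquet")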

    I agree in general though: column names with whitespace are a bad thing and are best avoided altogether.