Is there a maximum length for the dynamic SQL generated for moving data into a SQL-based table?

pyousefi Registered Posts: 10 ✭✭✭✭
I am writing a Python recipe to join two CSV datasets and write the result to a Postgres table.

Here is the first error:

An invalid argument has been encountered : Column name not matching between query and schema (Total Randomized or Receiving Treatment-United States of Americ, expected Total Randomized or Receiving Treatment-United States of America) SELECT * FROM "CLINICALTRIALCOSTPROJECTION_sitespatientsvisitsbyregion_t" LIMIT 10000. Please check dataset schema.

The length of the column name ("Total Randomized or Receiving Treatment-United States of America") is 64 characters.



Here is another error. After the script runs, when I try to explore the data I get a similar error to the one above, and then when I go to the created table's settings I see:

Connection OK. Table exists, but schema does not match : Name mismatch for column 39 : '1st Patient Recruited _FPI_' in dataset, '1st Patient Recruited (FPI)' in table

Answers

  • kgdku Dataiker Alumni, Registered Posts: 3 ✭✭✭✭
    Hello,

    The problem very likely comes from your database.

    Assuming you are using Postgres, there is a default maximum length for identifiers such as column names.

    You can write a command with a longer column name, but by default Postgres will truncate it, as per the documentation:

    https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

    DSS wrote its schema with the full column name you provided when creating the output dataset, while Postgres truncated it. That is why, after the recipe completes, DSS fails when it tries to synchronize its schema (with the full name) against the Postgres schema (with the truncated name).

    As per the Postgres documentation, you can raise this limit in the Postgres configuration:

    The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.
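
    If raising NAMEDATALEN is not an option, another workaround is to shorten the column names yourself before writing, so they fit within the default 63-byte limit. Here is a minimal sketch (the helper name is mine, and you would adapt it to your dataframe):

    # Truncate column names to Postgres' default 63-byte identifier limit,
    # keeping them unique, so DSS and Postgres end up with the same schema.
    def shorten_identifiers(df, max_bytes=63):
        seen = set()
        new_columns = []
        for name in df.columns:
            short = name.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
            candidate, suffix = short, 1
            while candidate in seen:
                # Resolve collisions created by the truncation with a numeric suffix
                candidate = "%s_%d" % (short[:max_bytes - 4], suffix)
                suffix += 1
            seen.add(candidate)
            new_columns.append(candidate)
        df.columns = new_columns
        return df

    # e.g. call shorten_identifiers(your_df) just before write_with_schema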

    Cheers from DKU
  • pyousefi Registered Posts: 10 ✭✭✭✭
    Thanks @kgdku. What about my follow-up comment?
  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    What method do you use to write the SQL table?
  • pyousefi Registered Posts: 10 ✭✭✭✭
    Hi @Alex. I had two uploaded CSV files and was using a Python recipe to join them and write them to a Postgres table. I used pgAdmin to check the structure of the table in Postgres and it was fine, but I was not able to import the table as a new dataset.
  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Can you paste the part of the python code where you write the data back?
  • pyousefi Registered Posts: 10 ✭✭✭✭
    Here you go:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu

    # Read recipe inputs
    globalCombined = dataiku.Dataset("GlobalCombined")
    globalCombined_df = globalCombined.get_dataframe()
    sitesPatientsVisitsbyRegion = dataiku.Dataset("SitesPatientsVisitsbyRegion")
    sitesPatientsVisitsbyRegion_df = sitesPatientsVisitsbyRegion.get_dataframe()

    # Left-join the two dataframes on OPPORTUNITY_NUMBER__C (drop=False keeps the key as a column)
    joined_python_df = globalCombined_df.set_index('OPPORTUNITY_NUMBER__C', drop=False).join(
        sitesPatientsVisitsbyRegion_df.set_index("OPPORTUNITY_NUMBER__C"), how='left')

    # Write recipe outputs
    joined_python = dataiku.Dataset("joined_python")
    joined_python.write_with_schema(joined_python_df)
  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Hi Pouya, try using dropAndCreate=True as a parameter of https://doc.dataiku.com/dss/latest/api/python/datasets.html?highlight=write_with_schema#dataiku.Dataset.write_with_schema
    Note that doing a join in-memory using Python is not very efficient. Can you use a Join recipe instead to leverage in-database computation?
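
    For example, a minimal sketch based on your write step above (not tested against your datasets):

    # Drop and recreate the output table so its schema is rebuilt from the
    # dataframe, instead of being synchronized with the existing table.
    joined_python = dataiku.Dataset("joined_python")
    joined_python.write_with_schema(joined_python_df, dropAndCreate=True)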
  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Hi, did you solve your issue?