When Exploring a dataset please can you show more row information

Peter_R_Knight
Peter_R_Knight Registered Posts: 32 ✭✭✭✭

Currently it shows the number of rows and cols, but it would be useful to split this into:

  • rows in source e.g. 25684
  • rows in sample e.g. 10000
  • rows with filter (if filter applied) e.g. 58

Similarly with cols: when you display only a subset of columns, this could also be shown (e.g. showing 23 of 45 cols).

So the final displayed text would change from: 10000 rows, 58 cols

to: 58 filtered rows from sample of 10000 from 25684 source rows, 23 of 45 cols
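
To make the proposed wording concrete, here is a minimal Python sketch of how such a status line could be composed. The function name `describe_view` and its parameters are hypothetical, chosen only for illustration; this is not a Dataiku API.

```python
from typing import Optional


def describe_view(source_rows: Optional[int], sample_rows: int,
                  filtered_rows: Optional[int],
                  shown_cols: int, total_cols: int) -> str:
    """Compose the proposed status text (hypothetical helper, not a Dataiku API)."""
    # The sample size is always known; the source count may not have been computed.
    if source_rows is not None:
        rows = f"sample of {sample_rows} from {source_rows} source rows"
    else:
        rows = f"sample of {sample_rows} rows"
    # Only mention filtering when a filter is actually applied.
    if filtered_rows is not None:
        rows = f"{filtered_rows} filtered rows from {rows}"
    # Only mention a column subset when some columns are hidden.
    if shown_cols == total_cols:
        cols = f"{total_cols} cols"
    else:
        cols = f"{shown_cols} of {total_cols} cols"
    return f"{rows}, {cols}"


print(describe_view(25684, 10000, 58, 23, 45))
```

Passing `None` for `source_rows` or `filtered_rows` drops the corresponding part of the text, so the line stays short when the source count has not been computed or no filter is applied.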

Answers

  • Mattsco
    Mattsco Dataiker, Registered Posts: 125 Dataiker

    Hi Peter,

    Actually, you have access to this information!

    Capture d’écran 2020-10-06 à 14.59.22.png

    Here on the left, I can see that I'm working on a sample of 10000 rows and 22 columns.
    After editing the data in a visual prepare step, I can see that I now have only 6042 records left and 21 columns.
    More specifically: those 6042 records are being edited by one of the steps, and 3958 are removed.

    As for the total number of records, we don't necessarily have access to this information. Don't forget that a dataset can hold billions of records, so in some cases it would take a long time to compute.
    That's why you need to go to the "Status" tab and click on compute to run this (possibly long) process.

  • Peter_R_Knight
    Peter_R_Knight Registered Posts: 32 ✭✭✭✭

    Hi Mattsco,

    Thanks for your reply. It is great that the prepare recipes show some of this information. However, I meant just within the Explore window.

    dataiku explroe dataset.png

    I just spotted that it shows most of what I asked for on the right-hand side (blue circle). It just wasn't very obvious, as I expected all that information to be together where the red circle is.

    The issue that caused me to raise this is that I spent about an hour trying to work out why a query wasn't returning as many rows as I expected. I had set my row sample to 100,000 rows and it only brought back 30,000 rows, so I assumed that was the full dataset, but actually it was being limited by memory usage. Usually in databases, doing a count(*) on a table is a pretty quick operation to get the total rows, even if it has billions of records. So I just think making that clear would be helpful.
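
The check described above, comparing the rows actually loaded against a count(*) on the table, can be sketched in Python with the standard `sqlite3` module. The table name `records` and the helper `sample_covers_table` are hypothetical, and the row counts reuse the example figures from the first post; how fast count(*) runs depends on the database engine.

```python
import sqlite3


def sample_covers_table(conn: sqlite3.Connection, table: str, loaded_rows: int):
    """Return (total_rows, True if the loaded sample already holds every row)."""
    # The table name is quoted as an identifier; it is assumed to be trusted input.
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return total, loaded_rows >= total


# Build a small in-memory table standing in for the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records (id) VALUES (?)",
                 ((i,) for i in range(25684)))

# Load a capped sample, as an Explore view would.
sample = conn.execute("SELECT id FROM records LIMIT 10000").fetchall()

total, complete = sample_covers_table(conn, "records", len(sample))
print(f"{len(sample)} rows loaded of {total} total; complete sample: {complete}")
```

If the loaded row count is below the total, the view is showing a truncated sample rather than the full dataset, which is the distinction that was easy to miss here.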

    Just an idea that I think would make it easier to understand your dataset as you work with it.
