How to stack columns from one dataset

Theo_from_EPSI Registered Posts: 2 ✭✭

Hi,
Here is a simplified schema of a basic dataset structure I need to reshape:

| firstname | name | vote | col4 | col5 | col6 | col7 | col8 | col9 | etc.. |
|---|---|---|---|---|---|---|---|---|---|
| ARTHAUD | Nathalie | 5 | ARMAND | Thierry | 9 | ARNAUD | Bernard | 6 | etc.. |
| ARTHAUD | Nathalie | 7 | ARMAND | Thierry | 3 | ARNAUD | Bernard | 8 | etc.. |

The number of columns is variable, but it will always be a multiple of 3.

The expected output:

| firstname | name | vote |
|---|---|---|
| ARTHAUD | Nathalie | 5 |
| ARTHAUD | Nathalie | 7 |
| ARMAND | Thierry | 9 |
| ARMAND | Thierry | 3 |
| ARNAUD | Bernard | 6 |
| ARNAUD | Bernard | 8 |

Feel free to suggest a solution in Python, but I would prefer visual recipes.

Dataiku version used: 14.3.3

Best Answer

  • ÁngelÁlvarez Registered Posts: 5 ✭✭
    edited February 4 Answer ✓

    Hi!

    Interesting challenge. I'm afraid I haven't been able to solve it with visual recipes. However, with a Python recipe it was child's play.

    Using Visual Recipes

    I initially tried using Prepare Recipe with the Fold processor, but I couldn’t find a way to preserve the original column order. Even when switching to a Python processor, each row was received as a dictionary whose keys and values didn’t follow the input schema, causing the triplets to become scrambled.
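
    For reference, the same positional fold can also be written in plain pandas, slicing the columns in groups of three. This is a minimal sketch with a hypothetical hard-coded input matching Theo's schema (in a real recipe the DataFrame would come from the input dataset); a nice side effect is that stacking triplet by triplet yields exactly the per-person ordering shown in the expected output:

    ```python
    import pandas as pd

    # Hypothetical wide input matching the question's schema: 3 people per row.
    wide = pd.DataFrame(
        [["ARTHAUD", "Nathalie", 5, "ARMAND", "Thierry", 9, "ARNAUD", "Bernard", 6],
         ["ARTHAUD", "Nathalie", 7, "ARMAND", "Thierry", 3, "ARNAUD", "Bernard", 8]],
        columns=["firstname", "name", "vote",
                 "col4", "col5", "col6", "col7", "col8", "col9"],
    )

    # Target headers are the first 3 column names
    target = wide.columns[:3].tolist()

    # Slice the columns in groups of 3 by position (iloc), rename each slice
    # to the target headers, and stack them. Positional slicing ignores the
    # column names entirely, so the triplets cannot get scrambled.
    triplets = [
        wide.iloc[:, i:i + 3].set_axis(target, axis=1)
        for i in range(0, wide.shape[1], 3)
    ]
    long = pd.concat(triplets, ignore_index=True)
    ```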

    Using Python Recipes

    To guarantee 1:1 positional integrity, the most robust approach is a Python recipe using NumPy. By treating the table as a positional block of values and "reshaping" it, we bypass name-based (dictionary) lookups entirely.


    Step 1: The Synthetic Data Generator

    Use this script to create a "Wide" dataset to test the logic. It generates a schema where every three columns represent a group.

    import pandas as pd
    import numpy as np
    import dataiku
    
    # --- PARAMETERS ---
    num_rows = 100        # How many rows of data to create
    num_triplets = 3      # How many sets of (firstname, name, vote)
    # ------------------
    
    def generate_synthetic_data(rows, triplets):
        firstnames = ["ARTHAUD", "ARMAND", "ARNAUD"]
        names = ["Nathalie", "Thierry", "Bernard"]
        
        data = {}
        
        for i in range(triplets):
            # We name the first set normally, and subsequent sets col4, col5... 
            # to match Theo's request schema.
            if i == 0:
                data['firstname'] = np.random.choice(firstnames, rows)
                data['name'] = np.random.choice(names, rows)
                data['vote'] = np.random.randint(0, 11, rows)
            else:
                base_idx = (i * 3) + 1
                data[f'col{base_idx}'] = np.random.choice(firstnames, rows)
                data[f'col{base_idx + 1}'] = np.random.choice(names, rows)
                data[f'col{base_idx + 2}'] = np.random.randint(0, 11, rows)
                
        return pd.DataFrame(data)
    
    # Generate the dataframe
    df = generate_synthetic_data(num_rows, num_triplets)
    
    # Write to Dataiku dataset
    output_dataset = dataiku.Dataset("synthetic_stacked_data")
    output_dataset.write_with_schema(df)
    

    Step 2: The Positional Fold Script

    This recipe captures the first 3 column names dynamically and "folds" the entire dataset mathematically.

    import pandas as pd
    import numpy as np
    import dataiku
    
    # 1. Read input
    input_dataset = dataiku.Dataset("synthetic_stacked_data")
    input_df = input_dataset.get_dataframe()
    
    # 2. Capture target headers (first 3 column names)
    target_headers = input_df.columns[:3].tolist()
    
    # 3. High-performance reshape
    # Convert to a NumPy array (positional access, no column names involved)
    data_array = input_df.values
    
    # Sanity check: the wide schema must hold complete triplets
    num_rows, num_cols = data_array.shape
    assert num_cols % 3 == 0, "Expected a multiple of 3 columns"
    
    # Reshape: flatten the data row by row and force it into shape (-1, 3);
    # -1 tells NumPy to calculate the number of rows automatically
    reshaped_data = data_array.reshape(-1, 3)
    
    # 4. Rebuild the DataFrame in a single allocation
    final_df = pd.DataFrame(reshaped_data, columns=target_headers)
    
    # 5. Write result
    output_dataset = dataiku.Dataset("final_stacked_results_coding")
    output_dataset.write_with_schema(final_df)
    

    Why this works:

    • Precision: by converting the DataFrame to a NumPy array, we rely on column position rather than column names, so the triplets can never be scrambled.
    • Speed: reshape is significantly faster than looping over columns or rows because it usually doesn't copy the data in memory; it just changes the "view" of it.
    • Scalability: while get_dataframe() pulls the entire dataset into RAM, the logic can easily be adapted to iter_dataframes(chunksize=...) for massive datasets.
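
    To see the row-major order that reshape produces, here is a tiny standalone check (pure NumPy, no Dataiku needed). Note that it stacks all triplets of the first wide row before those of the second, so the rows come out in input-row order rather than grouped per person as in the question's expected output:

    ```python
    import numpy as np

    # Two wide rows of 6 values = 2 triplets per row.
    wide = np.array([[0, 1, 2, 3, 4, 5],
                     [6, 7, 8, 9, 10, 11]])

    # reshape(-1, 3) unrolls the data row by row: both triplets of
    # row 1 first, then both triplets of row 2.
    long = wide.reshape(-1, 3)
    print(long.tolist())  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
    ```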

    I hope this helps anyone else struggling with non-standard unpivoting tasks!
