How to stack columns from one dataset
Hi,
Here is a simplified schema of a basic dataset structure I need to reshape:
| firstname | name | vote | col4 | col5 | col6 | col7 | col8 | col9 | etc. |
|---|---|---|---|---|---|---|---|---|---|
| ARTHAUD | Nathalie | 5 | ARMAND | Thierry | 9 | ARNAUD | Bernard | 6 | etc. |
| ARTHAUD | Nathalie | 7 | ARMAND | Thierry | 3 | ARNAUD | Bernard | 8 | etc. |
The number of columns is variable, but it will always be a multiple of 3.
The expected output:
| firstname | name | vote |
|---|---|---|
| ARTHAUD | Nathalie | 5 |
| ARTHAUD | Nathalie | 7 |
| ARMAND | Thierry | 9 |
| ARMAND | Thierry | 3 |
| ARNAUD | Bernard | 6 |
| ARNAUD | Bernard | 8 |
Feel free to suggest a solution in Python, but I would prefer visual recipes.
Dataiku version used: 14.3.3
Best Answer
-
Hi!
Interesting challenge. I'm afraid I wasn't able to do it with visual recipes. With a Python recipe, however, it was child's play.
Using Visual Recipes
I initially tried using Prepare Recipe with the Fold processor, but I couldn’t find a way to preserve the original column order. Even when switching to a Python processor, each row was received as a dictionary whose keys and values didn’t follow the input schema, causing the triplets to become scrambled.
Using Python Recipes
To ensure 1:1 positional integrity, the most robust approach is a Python recipe using NumPy. By treating the table as a raw memory block and "reshaping" it, we bypass dictionary hashing entirely.
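To see the core idea in isolation, here is a standalone sketch (independent of Dataiku): a 2×6 array whose rows each hold two (firstname, name, vote) triplets reshapes into a 4×3 array, one triplet per row. Note that NumPy flattens row by row, so the output is ordered "row 1 triplet 1, row 1 triplet 2, row 2 triplet 1, …" rather than grouped by person; a sort afterwards can regroup if needed.

```python
import numpy as np

# A tiny "wide" table: 2 rows, each holding two (firstname, name, vote) triplets
wide = np.array([
    ["ARTHAUD", "Nathalie", "5", "ARMAND", "Thierry", "9"],
    ["ARTHAUD", "Nathalie", "7", "ARMAND", "Thierry", "3"],
])

# Row-major reshape: each consecutive group of 3 values becomes one new row
long = wide.reshape(-1, 3)
print(long.shape)  # (4, 3)
```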
Step 1: The Synthetic Data Generator
Use this script to create a "Wide" dataset to test the logic. It generates a schema where every three columns represent a group.
```python
import pandas as pd
import numpy as np
import dataiku

# --- PARAMETERS ---
num_rows = 100     # How many rows of data to create
num_triplets = 3   # How many sets of (firstname, name, vote)
# ------------------

def generate_synthetic_data(rows, triplets):
    firstnames = ["ARTHAUD", "ARMAND", "ARNAUD"]
    names = ["Nathalie", "Thierry", "Bernard"]
    data = {}
    for i in range(triplets):
        # We name the first set normally, and subsequent sets col4, col5...
        # to match Theo's request schema.
        if i == 0:
            data['firstname'] = np.random.choice(firstnames, rows)
            data['name'] = np.random.choice(names, rows)
            data['vote'] = np.random.randint(0, 11, rows)
        else:
            base_idx = (i * 3) + 1
            data[f'col{base_idx}'] = np.random.choice(firstnames, rows)
            data[f'col{base_idx + 1}'] = np.random.choice(names, rows)
            data[f'col{base_idx + 2}'] = np.random.randint(0, 11, rows)
    return pd.DataFrame(data)

# Generate the dataframe
df = generate_synthetic_data(num_rows, num_triplets)

# Write to Dataiku dataset
output_dataset = dataiku.Dataset("synthetic_stacked_data")
output_dataset.write_with_schema(df)
```
Step 2: The Positional Fold Script
This recipe captures the first 3 column names dynamically and "folds" the entire dataset mathematically.
```python
import pandas as pd
import numpy as np
import dataiku

# 1. Read input
input_dataset = dataiku.Dataset("synthetic_stacked_data")
input_df = input_dataset.get_dataframe()

# 2. Capture target headers (first 3 column names)
target_headers = input_df.columns[:3].tolist()

# 3. HIGH PERFORMANCE RESHAPE
# Convert to NumPy array
data_array = input_df.values

# Calculate the number of triplets per row
# (Total columns / 3) * Total Rows = Total new rows
num_rows, num_cols = data_array.shape
num_triplets = num_cols // 3

# Reshape: We take the flat data and force it into (-1, 3)
# -1 tells NumPy to calculate the number of rows automatically
reshaped_data = data_array.reshape(-1, 3)

# 4. Rebuild the DataFrame
# This happens in one single memory allocation
final_df = pd.DataFrame(reshaped_data, columns=target_headers)

# 5. Write result
output_dataset = dataiku.Dataset("final_stacked_results_coding")
output_dataset.write_with_schema(final_df)
```
Why this works:
- Precision: by relying on positional indexing (`columns[:3]` and the underlying NumPy array) rather than dictionary keys, the data never scrambles.
- Speed: `reshape` is significantly faster than looping through columns or rows because it doesn't copy the data in memory; it just changes the "view" of it.
- Scalability: while `get_dataframe()` pulls the entire dataset into RAM, the logic can easily be adapted to `iter_dataframes(chunksize=...)` for massive datasets.
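For the chunked variant, here is a minimal sketch of the same per-chunk fold. The chunks are simulated with plain DataFrames; in a real recipe they would come from `input_dataset.iter_dataframes(chunksize=...)`, and each result would go to the output dataset's writer.

```python
import pandas as pd

def fold_chunk(chunk: pd.DataFrame, target_headers) -> pd.DataFrame:
    """Reshape one wide chunk into stacked 3-column rows."""
    return pd.DataFrame(chunk.values.reshape(-1, 3), columns=target_headers)

# Simulated chunks; in Dataiku: for chunk in input_dataset.iter_dataframes(chunksize=100000):
cols = ["firstname", "name", "vote", "col4", "col5", "col6"]
chunks = [
    pd.DataFrame([["ARTHAUD", "Nathalie", 5, "ARMAND", "Thierry", 9]], columns=cols),
    pd.DataFrame([["ARTHAUD", "Nathalie", 7, "ARMAND", "Thierry", 3]], columns=cols),
]

target_headers = chunks[0].columns[:3].tolist()
folded = pd.concat([fold_chunk(c, target_headers) for c in chunks], ignore_index=True)
print(folded.shape)  # (4, 3)
```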
I hope this helps anyone else struggling with non-standard unpivoting tasks!
Answers
-
Thanks for the reply
Oh yeah I can tell it is faster than for loops 😊, thank you.
These lines seem unused in your script:
```python
num_rows, num_cols = data_array.shape
num_triplets = num_cols // 3
```
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,664 Neuron
This is probably there to check that num_triplets is a whole number, i.e. that you aren't missing any data for each of the new column values.
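A minimal sketch of what such a guard could look like (variable names match the script above; the explicit check is an assumption, not part of the original recipe). Failing fast with a clear message beats the more cryptic error `reshape` would raise on a malformed column count.

```python
import numpy as np

# Example wide row: 2 triplets -> 6 columns
data_array = np.array([["ARTHAUD", "Nathalie", "5", "ARMAND", "Thierry", "9"]])
num_rows, num_cols = data_array.shape

# Guard: the column count must be a multiple of 3 for the fold to be valid
if num_cols % 3 != 0:
    raise ValueError(f"Expected a multiple of 3 columns, got {num_cols}")

num_triplets = num_cols // 3
reshaped = data_array.reshape(-1, 3)
print(num_triplets, reshaped.shape)  # 2 (2, 3)
```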