How to stack columns from one dataset
Hi,
Here is a simplified schema of a basic dataset structure I need to reshape:
| firstname | name | vote | col4 | col5 | col6 | col7 | col8 | col9 | etc. |
|---|---|---|---|---|---|---|---|---|---|
| ARTHAUD | Nathalie | 5 | ARMAND | Thierry | 9 | ARNAUD | Bernard | 6 | etc. |
| ARTHAUD | Nathalie | 7 | ARMAND | Thierry | 3 | ARNAUD | Bernard | 8 | etc. |
The number of columns is variable, but it will always be a multiple of 3.
The expected output:
| firstname | name | vote |
|---|---|---|
| ARTHAUD | Nathalie | 5 |
| ARTHAUD | Nathalie | 7 |
| ARMAND | Thierry | 9 |
| ARMAND | Thierry | 3 |
| ARNAUD | Bernard | 6 |
| ARNAUD | Bernard | 8 |
Feel free to suggest a solution in Python, but I would prefer visual recipes.
Dataiku version used: 14.3.3
Best Answer
-
Hi!
Interesting challenge. I'm afraid I wasn't able to do it with visual recipes. With a Python recipe, however, it was child's play.
Using Visual Recipes
I initially tried using Prepare Recipe with the Fold processor, but I couldn’t find a way to preserve the original column order. Even when switching to a Python processor, each row was received as a dictionary whose keys and values didn’t follow the input schema, causing the triplets to become scrambled.
Using Python Recipes
To ensure 1:1 positional integrity, the most robust approach is a Python recipe using NumPy. By treating the table as a raw memory block and "reshaping" it, we bypass dictionary hashing entirely.
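To see the core idea in isolation, here is a standalone sketch (independent of Dataiku): a 2×6 array whose rows each hold two (firstname, name, vote) triplets reshapes into a 4×3 array, one triplet per row. Note that NumPy flattens row by row, so the output is ordered "row 1 triplet 1, row 1 triplet 2, row 2 triplet 1, …" rather than grouped by person; a sort afterwards can regroup if needed.

```python
import numpy as np

# A tiny "wide" table: 2 rows, each holding two (firstname, name, vote) triplets
wide = np.array([
    ["ARTHAUD", "Nathalie", "5", "ARMAND", "Thierry", "9"],
    ["ARTHAUD", "Nathalie", "7", "ARMAND", "Thierry", "3"],
])

# Row-major reshape: each consecutive group of 3 values becomes one new row
long = wide.reshape(-1, 3)
print(long.shape)  # (4, 3)
```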
Step 1: The Synthetic Data Generator
Use this script to create a "Wide" dataset to test the logic. It generates a schema where every three columns represent a group.
```python
import pandas as pd
import numpy as np
import dataiku

# --- PARAMETERS ---
num_rows = 100     # How many rows of data to create
num_triplets = 3   # How many sets of (firstname, name, vote)
# ------------------

def generate_synthetic_data(rows, triplets):
    firstnames = ["ARTHAUD", "ARMAND", "ARNAUD"]
    names = ["Nathalie", "Thierry", "Bernard"]
    data = {}
    for i in range(triplets):
        # We name the first set normally, and subsequent sets col4, col5...
        # to match Theo's request schema.
        if i == 0:
            data['firstname'] = np.random.choice(firstnames, rows)
            data['name'] = np.random.choice(names, rows)
            data['vote'] = np.random.randint(0, 11, rows)
        else:
            base_idx = (i * 3) + 1
            data[f'col{base_idx}'] = np.random.choice(firstnames, rows)
            data[f'col{base_idx + 1}'] = np.random.choice(names, rows)
            data[f'col{base_idx + 2}'] = np.random.randint(0, 11, rows)
    return pd.DataFrame(data)

# Generate the dataframe
df = generate_synthetic_data(num_rows, num_triplets)

# Write to Dataiku dataset
output_dataset = dataiku.Dataset("synthetic_stacked_data")
output_dataset.write_with_schema(df)
```
Step 2: The Positional Fold Script
This recipe captures the first 3 column names dynamically and "folds" the entire dataset mathematically.
```python
import pandas as pd
import numpy as np
import dataiku

# 1. Read input
input_dataset = dataiku.Dataset("synthetic_stacked_data")
input_df = input_dataset.get_dataframe()

# 2. Capture target headers (first 3 column names)
target_headers = input_df.columns[:3].tolist()

# 3. HIGH PERFORMANCE RESHAPE
# Convert to NumPy array
data_array = input_df.values

# Calculate the number of triplets per row
# (Total columns / 3) * Total Rows = Total new rows
num_rows, num_cols = data_array.shape
num_triplets = num_cols // 3

# Reshape: We take the flat data and force it into (-1, 3)
# -1 tells NumPy to calculate the number of rows automatically
reshaped_data = data_array.reshape(-1, 3)

# 4. Rebuild the DataFrame
# This happens in one single memory allocation
final_df = pd.DataFrame(reshaped_data, columns=target_headers)

# 5. Write result
output_dataset = dataiku.Dataset("final_stacked_results_coding")
output_dataset.write_with_schema(final_df)
```
Why this works:
- Precision: by relying on positional indexing (`columns[:3]` and the underlying NumPy array) rather than dictionary keys, the data never scrambles.
- Speed: `reshape` is significantly faster than looping through columns or rows because it doesn't copy the data in memory; it just changes the "view" of it.
- Scalability: while `get_dataframe()` pulls the entire dataset into RAM, the logic can easily be adapted to `iter_dataframes(chunksize=...)` for massive datasets.
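For the chunked variant, here is a minimal sketch of the same per-chunk fold. The chunks are simulated with plain DataFrames; in a real recipe they would come from `input_dataset.iter_dataframes(chunksize=...)`, and each result would go to the output dataset's writer.

```python
import pandas as pd

def fold_chunk(chunk: pd.DataFrame, target_headers) -> pd.DataFrame:
    """Reshape one wide chunk into stacked 3-column rows."""
    return pd.DataFrame(chunk.values.reshape(-1, 3), columns=target_headers)

# Simulated chunks; in Dataiku: for chunk in input_dataset.iter_dataframes(chunksize=100000):
cols = ["firstname", "name", "vote", "col4", "col5", "col6"]
chunks = [
    pd.DataFrame([["ARTHAUD", "Nathalie", 5, "ARMAND", "Thierry", 9]], columns=cols),
    pd.DataFrame([["ARTHAUD", "Nathalie", 7, "ARMAND", "Thierry", 3]], columns=cols),
]

target_headers = chunks[0].columns[:3].tolist()
folded = pd.concat([fold_chunk(c, target_headers) for c in chunks], ignore_index=True)
print(folded.shape)  # (4, 3)
```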
I hope this helps anyone else struggling with non-standard unpivoting tasks!
Answers
-
Thanks for the reply
Oh yeah I can tell it is faster than for loops 😊, thank you.
These lines seem unused in your script:
```python
num_rows, num_cols = data_array.shape
num_triplets = num_cols // 3
```
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,664 Neuron
This is probably there to check that num_triplets is a whole number, i.e. that you aren't missing any data for each of the new column values.
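A minimal sketch of what such a guard could look like (variable names match the script above; the explicit check is an assumption, not part of the original recipe). Failing fast with a clear message beats the more cryptic error `reshape` would raise on a malformed column count.

```python
import numpy as np

# Example wide row: 2 triplets -> 6 columns
data_array = np.array([["ARTHAUD", "Nathalie", "5", "ARMAND", "Thierry", "9"]])
num_rows, num_cols = data_array.shape

# Guard: the column count must be a multiple of 3 for the fold to be valid
if num_cols % 3 != 0:
    raise ValueError(f"Expected a multiple of 3 columns, got {num_cols}")

num_triplets = num_cols // 3
reshaped = data_array.reshape(-1, 3)
print(num_triplets, reshaped.shape)  # 2 (2, 3)
```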