Given datasets are not sufficient for the assessment questions of the Advanced Designer Certification

Yan · August 2021

Hi,

I am working on the Advanced Designer assessment for the certification.

I can't find any value in the "stock_code" column starting with "TEST" in the two given datasets, Retail_I and Retail_II. According to the problem statement, I should write a regular expression for column values starting with "TEST".

I would like to solve this problem in the Advanced Designer Certification assessment, but the two given datasets are not sufficient: the values are missing from the given column.

https://academy.dataiku.com/advanced-designer-certificate/677454

Given Dataset:

Online Retail Data Set and Online Retail II Data Set.

2. Formulas and Regular Expressions:

This dataset includes some stock codes that follow a particular naming pattern: namely, one that starts with the word “TEST” and has any number of digits that follow -- for example, "TEST001".
- Using a "Filter rows" processing step with the match mode set to "Regular expression", define a regular expression that will match all stock_code so that you can efficiently remove rows that start with the word "TEST" and have any number of digits that follow.
- Note: This is tricky to confirm whether your result is working. One strategy to confirm the changes is to alter the Design Sample to use one of the Stratified sampling techniques on the original column name: StockCode.
- Note: If you are unable to implement this step, you can still continue the exercise as it will not impact other results downstream.
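As a rough illustration of the kind of pattern described above, here is a minimal Python sketch. It uses Python's `re` module rather than the Dataiku processor, the stock codes are made up, and it assumes "any number of digits" means one or more digits with nothing after them. Treat it as a way to reason about the pattern, not as the assessment's expected answer.

```python
import re

# Hypothetical stock codes for illustration only (not taken from the dataset).
stock_codes = ["TEST001", "TEST2", "85123A", "TESTING", "test001", "POST"]

# Assumption: the word TEST followed by one or more digits and nothing else.
TEST_PATTERN = re.compile(r"TEST\d+")

def is_test_code(code: str) -> bool:
    """Return True when the whole value is TEST followed only by digits."""
    return TEST_PATTERN.fullmatch(code) is not None

removed = [c for c in stock_codes if is_test_code(c)]
kept = [c for c in stock_codes if not is_test_code(c)]

print(removed)
print(kept)
```

Note that the match is case-sensitive and anchored to the whole value, so "TESTING" and "test001" are kept in this sketch.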

Sean · August 2021

Hi @Yan, thanks for reaching out to us!

First, can you confirm that you are using the starter project created by clicking +New Project > DSS Tutorials > Advanced Designer > Advanced Designer Assessment?

The initial datasets from the UCI Machine Learning Repository are the same, but you will require the rest of the Flow.

Second, you should see in the Prepare recipe, there is a "Rename column" step that renames the "StockCode" column to "stock_code". "stock_code" is the column to which you should apply a regular expression according to the instructions.

Screen Shot 2021-08-25 at 9.46.54 AM.png

Please let me know if this helps clarify things!

Yan · August 2021

Hi, Sean

Thanks for the quick reply.

I don't have an issue with the rename. The issue is that I can't apply a regex to the column, since I can't find any value starting with "TEST" in it.

Could you check whether the data in that column contains any value starting with "TEST"?

I am looking forward to your reply so that I can verify my work and continue to the next step.

Thanks

Yan

Sean · August 2021

Hi @Yan, now I understand your issue. Thanks for the explanation.

I can confirm that there are values that begin with "TEST" in this column. However, these values are likely not included in the sample shown in the Explore tab or Prepare recipe. That's why we have the first note about how this step is a little tricky.

You could confirm you have written the correct regex in a number of ways. One is just checking that the output has the correct number of records. A more satisfying way to really see these rows is to adjust the sample method as suggested and then temporarily apply a filter.

The key learning point here is that what is in your sample is not necessarily representative of what's in your dataset.
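One way to make the row-count check concrete is the sketch below, written in plain Python rather than Dataiku. The rows, the `filter_out` helper, and the pattern are all hypothetical; the point is only that the difference between the input and output counts tells you how many rows a filter removed, even when none of those rows are visible in a sample.

```python
import re

def filter_out(rows, pattern):
    """Drop rows whose stock_code fully matches the given regular expression."""
    regex = re.compile(pattern)
    return [row for row in rows if not regex.fullmatch(row["stock_code"])]

# Hypothetical input rows standing in for the full dataset.
rows = [
    {"stock_code": code}
    for code in ["85123A", "TEST001", "22423", "TEST002", "POST"]
]

output = filter_out(rows, r"TEST\d+")

# Comparing counts confirms how many rows the filter removed.
removed_count = len(rows) - len(output)
print(f"input={len(rows)} output={len(output)} removed={removed_count}")
```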

I hope that helps!

Yan · August 2021

Hi, Sean

Do you mean that there is no "TEST" value in the StockCode column of the raw data, in either Online_Retail_I or Online_Retail_II, and that the data only populates "TEST" in the column after the tricky step in the Scenario?

I fixed the scenario error and also added a project variable to the project, but both before and after that I can't find "TEST" values in the datasets.

I can't find the "TEST" value in that column in the raw data. Could you please confirm whether you can find "TEST" in the raw data source?

Thanks

Yan

Sean · August 2021

Hi @Yan, the entire value isn't "TEST". The values you're looking for in the stock_code column start with "TEST", such as "TEST001".

These values are definitely in the raw data source (Online_Retail_II to be exact). But you don't necessarily need to go looking for them there. They'll also be present in the stacked dataset.

Using a certain regular expression on the stock_code column will help you confirm these rows do in fact exist.

Yan · August 2021

Hi, Sean

I downloaded the raw data again and explored it in Excel and Dataiku, but I still can't find any StockCode value starting with "TEST" or "TEST***".

Please see the attached file for my exploration of the data. I explored the raw data of Online_Retail_II in both Excel and Dataiku, and I could not find "TEST" or "TEST001" in the StockCode column.

1. Explore the raw data in Excel

I downloaded the raw data, opened it in Excel, and applied a filter on StockCode.

The screenshot shows all distinct values that start with a letter; I can't find any value starting with "TEST**".

2. Explore the raw data in Dataiku

Using the raw Online_Retail_II data downloaded from the URL:

I applied a regular expression filter to Online_Retail_II and didn't find any value.

But if I change the filter to another value, such as "gift_0001_##" (starting with "gift_0001"), I can extract the values,

as shown below.

Thanks

Yan

Sean · August 2021

Hi Yan,

For your searches in Dataiku, note how it says 10000 rows and "Viewing dataset sample". So the interactive filter you are using is only searching the present sample. (That's how it's able to return results so quickly). The rows you're looking for are NOT included in the sample, but they are in the dataset.

When you run a recipe, such as the Prepare recipe, the recipe is applied to the entire dataset (not just the sample) and produces a new output dataset.
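The difference between searching a sample and processing the whole dataset can be sketched in a few lines of Python. This is synthetic data, and it assumes the sample is simply the first 10,000 rows; the actual sampling method in a given project may differ.

```python
# Synthetic data: 50,000 ordinary codes, with two TEST codes placed late.
dataset = [f"{10000 + i}" for i in range(50_000)]
for position, code in [(30_000, "TEST001"), (42_000, "TEST002")]:
    dataset[position] = code

# Assumption: the interactive view only sees the first 10,000 rows.
SAMPLE_SIZE = 10_000
sample = dataset[:SAMPLE_SIZE]

def count_test(values):
    """Count values that start with the string TEST."""
    return sum(1 for value in values if value.startswith("TEST"))

in_sample = count_test(sample)    # what an interactive filter would see
in_dataset = count_test(dataset)  # what a recipe run over all rows would see

print(f"sample: {in_sample}, full dataset: {in_dataset}")
```

A filter over `sample` finds nothing here, while the same filter over `dataset` finds both rows, which mirrors the behavior described above.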

Also looking at the titles of your dataset (Retail II), it looks like you are not using the starter project, which might create other issues.

I'm less familiar with Excel, so it's difficult for me to see what's been done in your screenshot. But I can assure you that a small number of rows begin with the string TEST. The original datasets (as I believe you have seen) are .xlsx files, so you should be able to verify the presence of these rows with something like COUNTIF.

To reduce your doubts, here's a screenshot of some of the rows in question found by reading the original .xlsx file into RStudio and filtering for rows that have a StockCode starting with "TEST".

Screen Shot 2021-08-25 at 2.29.30 PM.png

Yan · August 2021

Hi, Sean

I figured out what was wrong with the raw dataset. In Online_Retail_II, there are two worksheets containing raw data: Year 2009-2010 and Year 2010-2011. The 2009 worksheet has values starting with "TEST" in the StockCode column; the 2010 worksheet has no "TEST" data. I had been working on the 2010 worksheet, which is why I could not find the right data for solving the assessment problem.
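A per-worksheet check like the one that resolved this can be sketched as follows. The workbook here is a hypothetical in-memory stand-in (a dict mapping sheet names to lists of stock codes), and the sheet names and values are made up for illustration.

```python
def count_test_codes_by_sheet(sheets):
    """Map each sheet name to the number of codes starting with TEST."""
    return {
        name: sum(1 for code in codes if str(code).startswith("TEST"))
        for name, codes in sheets.items()
    }

# Hypothetical stand-in for a two-sheet workbook; names and values are invented.
workbook = {
    "sheet_a": ["TEST001", "85048", "TEST002"],
    "sheet_b": ["85123A", "22423"],
}

counts = count_test_codes_by_sheet(workbook)
print(counts)
```

With a real .xlsx file, something like pandas `read_excel(path, sheet_name=None)` returns a dict of DataFrames keyed by sheet name, which could feed a similar per-sheet count.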

Thanks so much for sticking with me through the troubleshooting.

Have a great day!

Yan

Sean · August 2021

Great news! I'm glad you were able to work it out. Good luck on the rest of the assessment!

sdkayb · November 2023

Hello,

I am trying to solve this step but have not succeeded. I have tried multiple approaches with formulas, but I am not able to solve it. Can you please give me a hint, or point me to the right part of the documentation, so I can better understand what needs to be done?

Thank you in advance

Sean · November 2023

Hi @sdkayb, you wouldn't need to use a formula here. For this step, consider that you need to filter out rows that have a certain value. Look in the processor library for a step that is built for exactly that and lets you do so using a regular expression pattern.

faith · December 2023

@Yan
@SeanA

Thanks for your post on this - I'm having the same issue. For me the confusing part was that when I expanded the sample to 1,100,000 rows (to make sure it reflected everything), I still didn't get any "TEST***" values in the preview, and I got a really funky row count because the preview was actually limited to only ~360,000 rows due to the memory limitation field.

After increasing the preview and memory limitation I was able to see the effects of removing 12 rows on the sample.

Sean · December 2023

Hi @faith, I'm glad you were able to solve it! But it's also important to recognize that there are better ways to solve this kind of problem. This is just a training example of course, but you can imagine having to work with even larger datasets in a real-world situation. You won't always be able to increase the sample size to include everything.
