Hi,
I am working on the Advanced Designer assessment for the certification.
I can't find any value in the "Stock_code" column starting with "TEST" in the two given datasets, Retail_I and Retail_II. The problem asks me to write a regular expression for column values starting with "TEST".
I am trying to solve this problem in the Advanced Designer Certification assessment, but the two given datasets don't seem to be sufficient: the values are missing from the given column.
https://academy.dataiku.com/advanced-designer-certificate/677454
Given datasets:
Online Retail Data Set and Online Retail II Data Set.
2. Formulas and Regular Expressions:
Hi @Yan , thanks for reaching out to us!
First, can you confirm that you are using the starter project created by clicking +New Project > DSS Tutorials > Advanced Designer > Advanced Designer Assessment?
The initial datasets from the UCI Machine Learning Repository are the same, but you will require the rest of the Flow.
Second, you should see in the Prepare recipe, there is a "Rename column" step that renames the "StockCode" column to "stock_code". "stock_code" is the column to which you should apply a regular expression according to the instructions.
Please let me know if this helps clarify things!
Hi Sean,
Thanks for the quick reply.
I don't have an issue with the rename. The issue is that I can't apply a regex to the column, since I can't find any value starting with "TEST" in it.
Could you check whether the column contains any value starting with "TEST"?
I look forward to your reply so that I can verify my work and continue.
Thanks
Yan
Hi @Yan , now I understand your issue. Thanks for the explanation.
I can confirm that there are values that begin with "TEST" in this column. However, these values are likely not included in the sample shown in the Explore tab or Prepare recipe. That's why we have the first note about how this step is a little tricky.
You could confirm you have written the correct regex in a number of ways. One is just checking that the output has the correct number of records. A more satisfying way to really see these rows is to adjust the sample method as suggested and then temporarily apply a filter.
The key learning point here is that what is in your sample is not necessarily representative of what's in your dataset.
I hope that helps!
Hi Sean,
Do you mean that there is no "TEST" value in the stock code column in the raw data of either Online_Retail_I or Online_Retail_II, and that the data is populated with "TEST" values only after the tricky step in the scenario?
I fixed the scenario error and also added a project variable, but both before and after I can't find "TEST" values in the datasets.
I can't find the "TEST" value in the column in the raw data. Could you please confirm that you can find "TEST" in the raw data source?
Thanks
Yan
Hi @Yan , the entire value isn't "TEST". The values you're looking for in the stock_code column start with "TEST", such as "TEST001".
These values are definitely in the raw data source (Online_Retail_II to be exact). But you don't necessarily need to go looking for them there. They'll also be present in the stacked dataset.
Using a certain regular expression on the stock_code column will help you confirm these rows do in fact exist.
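As a rough illustration of what "starts with" means for a regular expression, here is a minimal Python sketch. The pattern `^TEST` and the list of stock codes are assumptions made up for this example; they are not taken from the assessment or from the real dataset.

```python
import re

# "^" anchors the match to the start of the value, so only codes that
# begin with "TEST" qualify. The match is case-sensitive.
STARTS_WITH_TEST = re.compile(r"^TEST")

# Hypothetical stock codes, for illustration only.
stock_codes = ["85123A", "TEST001", "gift_0001_20", "TEST002", "MYTEST", "test003"]

matches = [code for code in stock_codes if STARTS_WITH_TEST.search(code)]
print(matches)  # ['TEST001', 'TEST002']
```

Without the `^` anchor, "MYTEST" would also match, which is why a plain "contains" search is not equivalent to "starts with".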
Hi Sean,
I downloaded the raw data again and explored it in Excel and Dataiku, but I still can't find any StockCode value starting with "TEST" or "TEST***".
Please see the attached file for my exploration of the data. I explored the raw Online_Retail_II data in both Excel and Dataiku, and could not find "TEST" or "TEST001" in the StockCode column.
1. Explore the raw data in Excel
Download and explore the raw data in Excel; apply a filter on StockCode in Excel.
Screenshot of all distinct values that start with a letter; I can't find any value starting with "TEST**".
2. Explore the raw data in Dataiku
In the raw Online_Retail_II data downloaded from the URL:
I applied a regular expression filter to Online_Retail_II and didn't find any value.
But if I change the filter value to "gift_0001_##" (starting with "gift_0001"), I can extract the values,
as below.
Thanks
Yan
Hi Yan,
For your searches in Dataiku, note how it says 10000 rows and "Viewing dataset sample". So the interactive filter you are using is only searching the present sample. (That's how it's able to return results so quickly). The rows you're looking for are NOT included in the sample, but they are in the dataset.
When you run a recipe, such as the Prepare recipe, the recipe is applied to the entire dataset (not just the sample) and produces a new output dataset.
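The difference between searching a sample and processing the whole dataset can be sketched in plain Python. This is only an illustration under stated assumptions: the data is synthetic, the sample is taken as the first 10,000 rows, and the positions of the TEST rows are made up.

```python
import re

SAMPLE_SIZE = 10_000  # the "10000 rows" sample size mentioned above

# Synthetic dataset (not the real retail file): 50,000 ordinary codes,
# with a few TEST codes placed after the first 10,000 rows.
stock_codes = [str(10_000 + i) for i in range(50_000)]
for position in (20_000, 30_500, 41_000):
    stock_codes[position] = f"TEST{position:05d}"

pattern = re.compile(r"^TEST")

def count_matches(codes):
    """Count the codes that start with 'TEST'."""
    return sum(1 for code in codes if pattern.search(code))

sample = stock_codes[:SAMPLE_SIZE]      # what an interactive filter would see
in_sample = count_matches(sample)       # searches the sample only
in_full = count_matches(stock_codes)    # what a full run would process

print(in_sample, in_full)  # 0 3
```

The filter finds nothing in the sample, yet the same pattern matches three rows once every row is scanned.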
Also looking at the titles of your dataset (Retail II), it looks like you are not using the starter project, which might create other issues.
I'm less familiar with Excel, so it's difficult for me to see what's been done in your screenshot. But I can assure you that a small number of rows begin with the string TEST. The original datasets (as I believe you have seen) are .xlsx files, so you should be able to verify the presence of these rows with something like COUNTIF.
To reduce your doubts, here's a screenshot of some of the rows in question found by reading the original .xlsx file into RStudio and filtering for rows that have a StockCode starting with "TEST".
Hi Sean,
I figured out what was wrong with the raw dataset. In Online_Retail_II, there are two worksheets containing raw data: Year 2009-2010 and Year 2010-2011. In the 2009 sheet, there are values starting with "TEST" in the StockCode column; in the 2010 sheet, there is no "TEST" data. I had been working on the 2010 sheet, which is why I could not find the right data for solving the assessment problem.
Thanks so much for staying with me through the troubleshooting.
Have a great day!
Yan
Great news! I'm glad you were able to work it out. Good luck on the rest of the assessment!
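For anyone who hits the same two-worksheet problem, counting matches per sheet makes it obvious which sheet holds the rows. The sketch below uses a plain dictionary as a stand-in for the workbook; the sheet labels and stock codes are hypothetical and only mirror the situation described above.

```python
import re

pattern = re.compile(r"^TEST")

# Hypothetical stand-in for a workbook: sheet label -> StockCode values.
workbook = {
    "2009": ["85048", "TEST001", "79323P", "TEST002"],
    "2010": ["85123A", "71053", "84406B"],
}

# Count the TEST-prefixed codes in each sheet separately.
counts = {
    sheet: sum(1 for code in codes if pattern.search(code))
    for sheet, codes in workbook.items()
}
print(counts)  # {'2009': 2, '2010': 0}
```

If pandas is available, `pandas.read_excel(path, sheet_name=None)` returns a dict of DataFrames keyed by sheet name, so the same per-sheet check could be applied to the real .xlsx file.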