Error on loading excel file
I'm facing an issue while processing an Excel file, and the error log is very difficult to understand.
I hope someone has faced this issue before and can help me solve it.
[15:20:26] [DEBUG] [com.monitorjbl.xlsx.impl.StreamingWorkbookReader] - Deleting tmp file [/home/dataiku/dataiku/tmp/tmp-15330357160647418901.xlsx]
[15:20:26] [ERROR] [dku.input.push] - Push failed, cleanup resources
java.lang.NumberFormatException: For input string: "1e6"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:118)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:148)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:177)
at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:234)
at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[15:20:26] [INFO] [dku.output.sql.pglike] - Aborting transaction
[15:20:26] [INFO] [dip.connection.share] - Give connection refCount=1
[15:20:26] [INFO] [dip.connection.share] - > closing connection with failure
[15:20:26] [DEBUG] [dku.connections.sql.provider] - Rollback conn=Dataiku_DB-bLHcHzb
[15:20:26] [DEBUG] [dku.connections.sql.provider] - Close conn=Dataiku_DB-bLHcHzb
[15:20:26] [DEBUG] [dku.resourceusage] - Reporting completion of CRU:{"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"POLICYDATA","jobId":"Build_OP01_New_prepared__NP__2021-12-10T08-17-38.256","activityId":"compute_OP01_New_prepared_NP","activityType":"recipe","recipeType":"shaker","recipeName":"compute_OP01_New_prepared"},"type":"SQL_CONNECTION","id":"sJbVBA5JxXDiClCt","startTime":1639124261558,"sqlConnection":{"connection":"Dataiku_DB"}}
[15:20:26] [INFO] [dku.flow.activity] - Run thread failed for activity compute_OP01_New_prepared_NP
java.lang.NumberFormatException: For input string: "1e6"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:118)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:148)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:177)
at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:234)
at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[15:20:26] [INFO] [dku.flow.activity] running compute_OP01_New_prepared_NP - activity is finished
[15:20:26] [ERROR] [dku.flow.activity] running compute_OP01_New_prepared_NP - Activity failed
java.lang.NumberFormatException: For input string: "1e6"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:118)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:148)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:177)
at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:234)
at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[15:20:26] [INFO] [dku.flow.activity] running compute_OP01_New_prepared_NP - Executing default post-activity lifecycle hook
[15:20:26] [INFO] [dku.flow.activity] running compute_OP01_New_prepared_NP - Removing samples for POLICYDATA.OP01_New_prepared
[15:20:26] [INFO] [dku.flow.activity] running compute_OP01_New_prepared_NP - Done post-activity tasks
Operating system used: Ubuntu
Answers
-
tgb417
In looking at your error message I see you are using an MS Excel file.
I noticed the following line.
[15:20:26] [ERROR] [dku.flow.activity] running compute_OP01_New_prepared_NP - Activity failed java.lang.NumberFormatException: For input string: "1e6"
I'm not fully clear about what is going on here or exactly what your use case might be. Are you exporting the file out of Dataiku or importing it into Dataiku? The image seems to suggest export, so I'm thinking that something in your data is confusing DSS as it tries to export it.
- Could you try exporting to .CSV rather than .XLS? Does that work any better?
- It looks like a column may contain scientific notation: 1e6 would mean 1,000,000. Given your data, does that make any sense? What is the storage type shown at the top of that column? Does it match the type of data you have in the column?
Let us here in the community know how you are getting on.
--Tom
-
Jurre
Hi!
I have not experienced difficulties with mixed scientific/non-scientific number formats when importing sheets, Tom (@tgb417). Scientific notation in Excel normally behaves as a number format that does not alter the actual cell content (as stated by Microsoft, and as can be seen in the formula bar when that cell is selected). Occasional datatype misses (4.25ee9 for example, which is text because of the second e) are handled well in DSS, with no errors whatsoever. So I think something else is happening here. @nuvitu9999, could you elaborate a little on what you are trying to accomplish? Thanks!
-
Thank you all for your suggestions. I just imported an Excel file and tried to read it (both reading and exporting hit the same issue).
I will try again today and provide more detailed information.
-
Jurre
When you create a new dataset from an uploaded file, you can run a test and a preview. See the attached screenshot for what should normally happen.
-
@Jurre
Test and preview of the file are OK. The error only happens when I try to load the full data.
-
Jurre
OK, thank you for your feedback. Could this have something to do with the engine you are using to run your recipe? See the attached screenshot: below the green Run button, next to "local stream", there is a gear icon you can click to select an engine.
As I use the community edition of DSS, this is rapidly moving out of my comfort zone, so hopefully someone with broader experience can step in here...
-
@Jurre
I used Local stream (as in your picture). Do you think there is a limitation with big Excel files, so that Dataiku cannot process them?
I am posting the error log again:
[23:59:31] [INFO] [dku.output.sql.pglike] - Written rows=10000 rowsWithFailedCells=0
.....
[00:01:32] [INFO] [dku.output.sql.pglike] - Written rows=900000 rowsWithFailedCells=0
[00:01:44] [DEBUG] [com.monitorjbl.xlsx.impl.StreamingWorkbookReader] - Deleting tmp file [/home/dataiku/dataiku/tmp/tmp-1692972438535469037.xlsx]
[00:01:44] [ERROR] [dku.input.push] - Push failed, cleanup resources
java.lang.NumberFormatException: For input string: "1e6"
- The issue always happens after row 900000.
- I opened the file in Excel, deleted the first 800,000 rows, and saved the result as a new file. Dataiku processed that new file without problems. So I think the "1e6" error is not related to a data issue (data type, data format, ...).
I hope someone can help me solve this issue soon.
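One way to check whether the failure is tied to the structure of the file rather than to cell values: an .xlsx file is a ZIP archive of XML parts, and each row element in a worksheet part normally carries its row number in an r attribute. The sketch below uses only the Python standard library and lists row numbers that are not written as plain digits. Note that a row number stored as "1e6" being what the reader fails to parse is only an assumption, not something confirmed in this thread, and the function name find_odd_row_refs is hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

PLAIN_INT = re.compile(r"\d+")


def find_odd_row_refs(xlsx_path, limit=20):
    """Return (worksheet_part, r_value) pairs whose row 'r' attribute is not plain digits.

    Assumption: worksheet parts live under xl/worksheets/ inside the archive,
    which is the usual layout but is not verified here.
    """
    odd = []
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            with archive.open(name) as part:
                # Stream the XML so a very large sheet is not held in memory as text.
                for _event, elem in ET.iterparse(part, events=("end",)):
                    if elem.tag == "row" or elem.tag.endswith("}row"):
                        ref = elem.get("r")
                        if ref is not None and not PLAIN_INT.fullmatch(ref):
                            odd.append((name, ref))
                            if len(odd) >= limit:
                                return odd
                        # Drop the row's cells once inspected to limit memory use.
                        elem.clear()
    return odd
```

Calling find_odd_row_refs on the original workbook path and getting an empty list would rule this hypothesis out; any hit shows which worksheet part and which row reference to look at.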
-
Jurre
Interesting (and a bit worrisome) feedback, @nuvitu9999. I'll try to reproduce it tomorrow, as this might be something I could encounter in future projects. The result will be posted here. In the meantime, a CSV version of your dataset might pass; Excel files seem to use quite a bit of memory when loading. So, in Excel, export the file (all of it, not reduced) to CSV and try to import and process that.
-
The issue happens when I try to load the full data.
Example:
1. Create a new dataset and test it => OK (I set all columns to String)
2. Try to move the data to a new dataset (just an action that reads the full data) => errors appear
-
Adding more logs. They contain a new error: missing data for column "Coverage ID".
But I checked row 899989 carefully and there is no problem with the data (Coverage ID=4); all values have the same type as in other rows. I think this is a big problem when using Dataiku to process big Excel files.
[2021/12/17-00:01:32.095] [FRT-39-FlowRunnable] [INFO] [dku.output.sql.pglike] - Written rows=900000 rowsWithFailedCells=0
[2021/12/17-00:01:44.315] [FRT-39-FlowRunnable] [DEBUG] [com.monitorjbl.xlsx.impl.StreamingWorkbookReader] - Deleting tmp file [/home/dataiku/dataiku/tmp/tmp-1692972438535469037.xlsx]
[2021/12/17-00:01:44.359] [FRT-39-FlowRunnable] [ERROR] [dku.input.push] - Push failed, cleanup resources
java.lang.NumberFormatException: For input string: "1e6"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:118)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:148)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:177)
at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:234)
at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[2021/12/17-00:01:44.360] [FRT-39-FlowRunnable] [INFO] [dku.output.sql.pglike] - Aborting transaction
[2021/12/17-00:01:44.360] [FRT-39-FlowRunnable] [INFO] [dip.connection.share] - Give connection refCount=1
[2021/12/17-00:01:44.361] [FRT-39-FlowRunnable] [INFO] [dip.connection.share] - > closing connection with failure
[2021/12/17-00:01:44.361] [FRT-39-FlowRunnable] [DEBUG] [dku.connections.sql.provider] - Rollback conn=Dataiku_DB-IuhRyfw
[2021/12/17-00:01:44.364] [Thread-17] [ERROR] [dku.output.sql.pglike] - Copy thread failed
java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.dataiku.dip.datasets.sql.PGCopySQLTableOutput$PGCopySQLTableOutputWriter$1.run(PGCopySQLTableOutput.java:174)
Caused by: org.postgresql.util.PSQLException: ERROR: missing data for column "Coverage ID"
Where: COPY TEST_op01_new_123, line 899989: "899989,190000081916,Bancas,2210011648,Tùy Tầ Thị,BCSAGT,,,,BB Tay Son,,,MB00023324,Phạm Du..."
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2553)
at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1212)
at org.postgresql.core.v3.QueryExecutorImpl.endCopy(QueryExecutorImpl.java:1017)
at org.postgresql.core.v3.CopyInImpl.endCopy(CopyInImpl.java:49)
at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:227)
at org.postgresql.copy.CopyManager.copyIn(CopyManager.java:203)
... 5 more
[2021/12/17-00:01:44.372] [FRT-39-FlowRunnable] [DEBUG] [dku.connections.sql.provider] - Close conn=Dataiku_DB-IuhRyfw
[2021/12/17-00:01:44.374] [FRT-39-FlowRunnable] [DEBUG] [dku.resourceusage] - Reporting completion of CRU:{"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"TEST","jobId":"Build_OP01_New_123__NP__2021-12-16T16-58-59.128","activityId":"compute_OP01_New_123_NP","activityType":"recipe","recipeType":"shaker","recipeName":"compute_OP01_New_123"},"type":"SQL_CONNECTION","id":"dOH012rKr9BfkXzY","startTime":1639673939995,"sqlConnection":{"connection":"Dataiku_DB"}}
[2021/12/17-00:01:44.375] [FRT-39-FlowRunnable] [INFO] [dku.flow.activity] - Run thread failed for activity compute_OP01_New_123_NP
java.lang.NumberFormatException: For input string: "1e6"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:118)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:148)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:177)
at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:234)
at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[2021/12/17-00:01:44.510] [ActivityExecutor-32] [INFO] [dku.flow.activity] running compute_OP01_New_123_NP - activity is finished
[2021/12/17-00:01:44.512] [ActivityExecutor-32] [ERROR] [dku.flow.activity] running compute_OP01_New_123_NP - Activity failed
java.lang.NumberFormatException: For input string: "1e6"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:118)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at com.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at com.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:148)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:177)
at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:234)
at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[2021/12/17-00:01:44.513] [ActivityExecutor-32] [INFO] [dku.flow.activity] running compute_OP01_New_123_NP - Executing default post-activity lifecycle hook
[2021/12/17-00:01:44.519] [ActivityExecutor-32] [INFO] [dku.flow.activity] running compute_OP01_New_123_NP - Removing samples for TEST.OP01_New_123
[2021/12/17-00:01:44.528] [ActivityExecutor-32] [INFO] [dku.flow.activity] running compute_OP01_New_123_NP - Done post-activity tasks
-
tgb417
Would you be willing to share a bit more about how big your MS Excel file is? It looks like you are close to the 1,048,576-row limit set by MS Excel.
How many columns do you have?
What is the overall size of the MS Excel file you are working with?
To distinguish a file-size problem from corruption of the MS Excel file: I see you tried the split-file test. Can you try recombining the two separate files, File A and File B, without using the original file? Does the recombined file work?
Let us know how you are getting on. Also, see below about placing a support ticket.
--Tom
-
tgb417
If you are using a paid-for edition of DSS you might also put in a support ticket. The support team members at Dataiku are very helpful.
--Tom
-
tgb417
@Jurre, the post you are referencing is quite old, from 2016. That would likely have been DSS V2 or V3. We are now on Dataiku DSS V10, almost half a decade later.
I know that I've successfully done 900,000+ row spreadsheets without problem on an 8GB RAM Intel Mac. However, the width of those tables was quite narrow. Just a few columns wide.
I can't think of a time where I've done a single MS Excel Spreadsheet pushing up against the Excel Size Limits on both rows and columns.
--Tom
-
I have more than 50 columns and about 1,035,561 rows. The file size is 400 MB.
-
Update: I tried a new method, using a Python recipe to load the data into the DB. It ran without errors, but data is missing (only 999971 rows were loaded instead of 1035561).
-
Jurre
Hi,
Importing and processing big Excel sheets works fine on my side (size: 1,048,575 records, 15 columns). Earlier work on extremely wide sets (3000+ columns) did not raise issues either. Still, I would probably check a CSV version, as splitting up sets beforehand is not very convenient.
Could it be that the result of your recipe still holds some strange values that slip out of view because of sample size settings, and cause trouble when you try to load it?
-
I raised a ticket and provided a sample file to the support team. I will report the result later.
-
Jurre
This "NumberFormatException: For input string x" is intriguing. I ran a small test with an uploaded CSV to check how different values are recognized; see the attached screenshot. I'm not sure this is a valid test, but it seems that a value of the form "<number>e<number>" qualifies as numeric even when its datatype is recognised as Text/string by DSS.
Other possible causes of that exception are spaces in the value, and possibly trying to parse a value as an Integer when it actually is a double or something else. Maybe someone with a Java background could comment on this.
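As a rough illustration of those possibilities: the stack trace shows the failure inside Integer.parseInt, which accepts only an optional sign followed by decimal digits that fit in 32 bits. The Python sketch below imitates that strict rule and compares it with floating-point parsing. The helper names parse_int_strict and describe are hypothetical, and Python's own int() is more lenient about surrounding whitespace, which is why the rule is spelled out by hand here.

```python
import re

_STRICT_INT = re.compile(r"[+-]?\d+")
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def parse_int_strict(text):
    """Parse text under a strict 32-bit integer rule (an imitation, for illustration only)."""
    if not _STRICT_INT.fullmatch(text):
        raise ValueError(f'For input string: "{text}"')
    value = int(text)
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError(f'For input string: "{text}"')
    return value


def describe(text):
    """Return (strict integer result, float result); None marks a parse failure."""
    try:
        as_int = parse_int_strict(text)
    except ValueError:
        as_int = None
    try:
        as_float = float(text)
    except ValueError:
        as_float = None
    return as_int, as_float


# Compare a plain integer, scientific notation, a leading space, and a malformed value.
for sample in ["1000000", "1e6", " 42", "4.25ee9"]:
    print(repr(sample), describe(sample))
```

Under this rule "1e6" is a valid floating-point literal but not a valid integer literal, which matches the exception message in the log.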
-
tgb417
That is a substantial MS Excel file.
I'm now starting to wonder about the available resources on the computer that is running Dataiku DSS.
I know there are some processes I can attempt on my small design node that just won't have enough RAM. Depending on how swap is set up on your node, you may actually be running out of RAM. If that happens, your process will die.
If you have administrative access to the node running your instance of Dataiku DSS, you might try running the "top" or "htop" command and looking at the available RAM. (I'm not necessarily asking about your local computer, but about the computer that is actually running Dataiku DSS.) The top command on Ubuntu Linux looks like this (see screenshot); the area highlighted in yellow shows available RAM. If this gets to 0, any process that wants more memory will crash.
I know that in a few cases on the tiny DSS design node I've worked with, I can run out of memory, and that kills my process. This is particularly because Linux swap is not being used on that node. Macs, for example, are less prone to this because by default they are set up to use disk swap when memory is running low.
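If watching top by hand is inconvenient, the same figure can be read from /proc/meminfo on a Linux host. This is a minimal sketch, assuming the standard "Name:   value kB" layout of that file; the function names parse_meminfo and available_ram_gib are hypothetical, and the sketch has to be run on the machine hosting DSS, not on your local computer.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_ram_gib(path="/proc/meminfo"):
    """Return the MemAvailable value in GiB, or None if the field is not reported."""
    with open(path, encoding="ascii") as handle:
        fields = parse_meminfo(handle.read())
    kib = fields.get("MemAvailable")
    return None if kib is None else kib / (1024 ** 2)
```

Calling available_ram_gib() periodically while the recipe runs would show whether available memory approaches zero before the job fails.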
-
Thank you for your idea, but I'm sure I have enough memory to process this file. I have a 64 GB computer dedicated to running Dataiku.
-
Data quality is not the problem. I divided the file into 2 parts and it ran well.