How much memory can a Python recipe use?

UserBird (Dataiker, Alpha Tester, Posts: 535)

I have been experiencing OutOfMemory errors when running a Python recipe that tries to load a JSON dataset from the filesystem. The files are about 2.5 GB on disk, the host has 64 GB of memory, and plenty was available when the error occurred. Does DSS limit the memory available to Python recipes? Can I change it?

Here is the error stack:


[15:02:21] [INFO] [dku.utils] - *************** Recipe code failed **************
[15:02:21] [INFO] [dku.utils] - Begin Python stack
[15:02:21] [INFO] [dku.utils] - Traceback (most recent call last):
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dss/jobs/WORKFUSIONDP/Build_Generated_Invoices_Flattened_2017-06-29T15-01-44.657/compute_Generated-InvoiceFlattenedFeature_NP/pyrecipenHw7EQsGvQh7/python-exec-wrapper.py", line 3, in <module>
[15:02:21] [INFO] [dku.utils] - execfile(sys.argv[1])
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dss/jobs/WORKFUSIONDP/Build_Generated_Invoices_Flattened_2017-06-29T15-01-44.657/compute_Generated-InvoiceFlattenedFeature_NP/pyrecipenHw7EQsGvQh7/script.py", line 49, in <module>
[15:02:21] [INFO] [dku.utils] - vendorNameRawFeatures_df = vendorNameRawFeatures.get_dataframe()
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/core/dataset.py", line 412, in get_dataframe
[15:02:21] [INFO] [dku.utils] - parse_dates=parse_date_columns)
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dataiku-dss-4.0.3/python.packages/pandas/io/parsers.py", line 562, in parser_f
[15:02:21] [INFO] [dku.utils] - return _read(filepath_or_buffer, kwds)
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dataiku-dss-4.0.3/python.packages/pandas/io/parsers.py", line 325, in _read
[15:02:21] [INFO] [dku.utils] - return parser.read()
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dataiku-dss-4.0.3/python.packages/pandas/io/parsers.py", line 815, in read
[15:02:21] [INFO] [dku.utils] - ret = self._engine.read(nrows)
[15:02:21] [INFO] [dku.utils] - File "/home/dataiku/dataiku-dss-4.0.3/python.packages/pandas/io/parsers.py", line 1314, in read
[15:02:21] [INFO] [dku.utils] - data = self._reader.read(nrows)
[15:02:21] [INFO] [dku.utils] - File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
[15:02:21] [INFO] [dku.utils] - File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
[15:02:21] [INFO] [dku.utils] - File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
[15:02:21] [INFO] [dku.utils] - File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
[15:02:21] [INFO] [dku.utils] - File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
[15:02:21] [INFO] [dku.utils] - CParserError: Error tokenizing data. C error: out of memory

Answers

  • Clément_Stenac (Dataiker, Dataiku DSS Core Designer, Registered, Posts: 753)
    DSS does not limit the memory available to Python recipes. Don't forget that a large expansion factor from CSV on disk to the in-memory pandas representation is possible, especially if the input files are gzipped: a 2.5 GB gzipped file can need many times that once decompressed and parsed into DataFrame columns. A chunked-reading sketch follows these answers.
  • quincybatten (Registered, Posts: 1)

    In most cases, this tokenizing error is an issue with:

    • the delimiters in your data, or
    • the parser being confused by the headers/columns of the file.

    The "Error tokenizing data" error can arise when you use a separator (for example a comma ',') as the delimiter and an error row contains more separators than expected, i.e. more fields than the header defines. You then need to either remove the additional field or remove the extra separator if it is there by mistake. The better solution is to investigate the offending file and fix it manually so you do not need to skip the error lines; a sketch for locating such rows follows below.
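
As a follow-up to the first answer: when the full dataset will not fit comfortably in memory, the dataiku API can stream it in chunks instead of materializing everything at once with get_dataframe(). A minimal sketch, assuming the dataset name from the traceback above; process() is a hypothetical placeholder for your per-chunk logic:

    import dataiku

    # Dataset name taken from the question's traceback; adjust to your project
    vendorNameRawFeatures = dataiku.Dataset("vendorNameRawFeatures")

    # iter_dataframes() yields fixed-size pandas DataFrames, so peak memory is
    # bounded by the chunk size rather than the full (possibly gzip-expanded) data
    for chunk_df in vendorNameRawFeatures.iter_dataframes(chunksize=100000):
        process(chunk_df)  # hypothetical per-chunk processing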
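For the second answer, a quick way to locate rows whose field count disagrees with the header before fixing the file by hand. A small sketch using only the standard library; the path is a placeholder:

    import csv

    path = "offending_file.csv"  # placeholder: the file the recipe was reading

    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=",")
        header = next(reader)
        expected = len(header)
        for lineno, row in enumerate(reader, start=2):
            # report any row with more (or fewer) fields than the header defines
            if len(row) != expected:
                print("line %d: %d fields (expected %d)" % (lineno, len(row), expected))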
