Opened 4 years ago

Closed 9 months ago

#2071 closed enhancement (done)

Refactor TableFileParser

Reported by: abeham
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.14
Component: Problems.Instances
Version: 3.3.8
Keywords:
Cc:

Description (last modified by gkronber)

Currently, the table file parser uses a heuristic to determine the separator character; this works for many CSV files but fails for some. Recently, we improved the CSV import dialog to allow the separator character to be specified, so the heuristic is no longer needed.

Additionally, the table file parser is implemented very sloppily: bugs that are hard to track down appear from time to time, and the performance of the importer is not ideal.

These issues should be resolved by refactoring or, better, rewriting the table file parser.

Change History (37)

comment:1 Changed 4 years ago by gkronber

  • Milestone changed from HeuristicLab 3.3.9 to HeuristicLab 3.3.x Backlog

comment:2 Changed 4 years ago by gkronber

  • Description modified (diff)
  • Owner changed from abeham to gkronber
  • Status changed from new to assigned

comment:3 Changed 4 years ago by gkronber

We should probably use the TextFieldParser class from the .NET Framework.
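A minimal sketch of how TextFieldParser could be used here (file name and delimiter are illustrative):

using Microsoft.VisualBasic.FileIO; // requires a reference to Microsoft.VisualBasic.dll

using (var parser = new TextFieldParser("data.csv")) {
  parser.TextFieldType = FieldType.Delimited;
  parser.SetDelimiters(";");
  parser.HasFieldsEnclosedInQuotes = true;
  while (!parser.EndOfData) {
    string[] fields = parser.ReadFields(); // one row of the file
    // convert and store the fields ...
  }
}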

comment:4 Changed 4 years ago by abeham

Meh... it's implemented in Microsoft.VisualBasic.dll. There seems to be Mono support for it, though.

comment:5 Changed 16 months ago by gkronber

  • Status changed from assigned to accepted

The table file parser is slow as hell. I did a quick profile and found the feared list.RemoveAt(0) pattern: 90% of the runtime is spent in RemoveAt()!
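For illustration, a minimal sketch of why this pattern is so expensive and the usual fix (token contents are placeholders):

using System.Collections.Generic;
using System.Linq;

static void Demo() {
  var allTokens = Enumerable.Range(0, 100000).Select(i => i.ToString()).ToList();

  // anti-pattern: List<T>.RemoveAt(0) shifts every remaining element left,
  // so draining n tokens from the front costs O(n^2) overall
  var list = new List<string>(allTokens);
  while (list.Count > 0) {
    string token = list[0];
    list.RemoveAt(0); // O(n) per call
    // ... process token ...
  }

  // Queue<T> dequeues from the front in O(1), so the drain is O(n) total
  var queue = new Queue<string>(allTokens);
  while (queue.Count > 0) {
    string token = queue.Dequeue(); // O(1) per call
    // ... process token ...
  }
}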

comment:6 Changed 16 months ago by gkronber

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14

comment:7 Changed 16 months ago by gkronber

The CSV import dialog also reads the whole file to display the preview, reads everything again whenever a setting is changed (e.g., a different separator character), and reads it once more when the user clicks OK to import the data.

comment:8 Changed 16 months ago by gkronber

r13411: minor refactoring of TableFileParser (let's see if the unit tests pass... I'm feeling daring today...)

comment:9 Changed 16 months ago by gkronber

r13413: only preview first 500 lines of data in CSV import dialog.
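A sketch of the idea behind this change (method name and default are illustrative): stop reading as soon as enough lines for the preview have been collected, instead of loading the whole file.

using System.Collections.Generic;
using System.IO;

// read at most previewLineCount lines instead of the whole file
static List<string> ReadPreview(string path, int previewLineCount = 500) {
  var lines = new List<string>();
  using (var reader = new StreamReader(path)) {
    string line;
    while (lines.Count < previewLineCount && (line = reader.ReadLine()) != null)
      lines.Add(line);
  }
  return lines;
}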

comment:10 Changed 16 months ago by gkronber

r13414: added progress reporting when importing regression problem data from CSV files.

... and btw. the ProblemInstanceProviders and corresponding views are a mess (for the data analysis stack).

Progress reporting only works for regression problem instances (not for classification, time series, or clustering) because the import code is duplicated in the specific ProblemInstanceProviderViews.

comment:11 Changed 16 months ago by gkronber

It's not necessary to clone the variable values (in the constructor of Dataset) when the values have just been imported by the TableFileParser.
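A sketch of the idea with hypothetical names (the actual HeuristicLab Dataset API may differ): keep the defensive copy in the public constructor, and add an internal overload that takes ownership of lists the parser built itself.

using System.Collections.Generic;
using System.Linq;

public class Dataset {
  private readonly List<IList<double>> variableValues;

  // public constructor: defensive copy, since callers may still mutate their lists
  public Dataset(IEnumerable<IList<double>> values) {
    variableValues = values.Select(v => (IList<double>)new List<double>(v)).ToList();
  }

  // internal constructor: the TableFileParser hands over lists it built itself
  // and never touches again, so cloning them is unnecessary
  internal Dataset(List<IList<double>> values) {
    variableValues = values;
  }
}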

comment:12 Changed 16 months ago by gkronber

r13415: added plugin dependency for HeuristicLab.Common

comment:13 Changed 16 months ago by gkronber

r13419: removed cloning of values in dataset (TODO: review data preprocessing)

comment:14 Changed 16 months ago by gkronber

r13440:

  • improved memory efficiency in TableFileParser by removing the duplicate storage of all columns,
  • added a heuristic to estimate the necessary capacity of the column lists (sketched below)
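A sketch of how such a capacity heuristic might look (method and variable names are illustrative, not the actual HeuristicLab code): sample the first lines, estimate the bytes per row, and pre-allocate the column lists so they are not repeatedly re-grown.

using System.Collections.Generic;
using System.IO;
using System.Linq;

// estimate the total row count from the file size and the average
// length of the first sample lines, then pre-allocate the columns
static List<double>[] CreateColumns(string path, int columnCount) {
  var sample = File.ReadLines(path).Take(100).ToList();
  double avgBytesPerRow = sample.Average(l => l.Length + 2); // +2 for CR/LF
  long fileSize = new FileInfo(path).Length;
  int estimatedRows = (int)(fileSize / avgBytesPerRow) + 1;
  return Enumerable.Range(0, columnCount)
                   .Select(_ => new List<double>(estimatedRows))
                   .ToArray();
}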

comment:15 Changed 16 months ago by gkronber

r13441:

  • re-added the statement that sets the number of parsed rows, which was lost in one of the previous changes
  • added a necessary Invoke call in RegressionInstanceProviderView

comment:16 follow-up: Changed 16 months ago by gkronber

Tests:

  • file size: 41 MB, 130,000 rows × 66 columns (~5 bytes per value): loading takes 6 s, Working Set 450 MB (HL total, ~55 bytes per value)
  • file size: 220 MB, 41,000 rows × 2631 columns (~2.1 bytes per value): loading takes 34 s, Working Set 2.6 GB (HL total, ~25 bytes per value)

Why so much additional space? I would expect only slightly more than 8 bytes per value: 130,000 rows × 66 columns ≈ 8.6 million values, which is only about 69 MB at 8 bytes each. Probably because we need to allocate very large arrays?

comment:17 Changed 16 months ago by gkronber

r13442: better estimation of the number of rows and an explicit call to GC.Collect()

comment:18 in reply to: ↑ 16 ; follow-up: Changed 16 months ago by gkronber

Replying to gkronber: measured again.

File size 41 MB, 130,000 rows × 66 columns (~5 bytes per value): previously loading took 6 s with a Working Set of 450 MB (HL total, ~55 bytes per value); now loading takes 6 s, LOH size 236 MB (~55 bytes per value), Working Set 550 MB (HL total). Explanation: the dataset contains three string columns and one DateTime column.

File size 220 MB, 41,000 rows × 2631 columns (~2.1 bytes per value): previously loading took 34 s with a Working Set of 2.6 GB (HL total, ~25 bytes per value); now loading takes 33 s, LOH size 966 MB (~9.4 bytes per value), Working Set 1.2 GB (HL total).

comment:19 Changed 16 months ago by gkronber

r13445: corrected disposal of StreamReader

comment:20 Changed 16 months ago by gkronber

  • Status changed from accepted to reviewing

comment:21 Changed 16 months ago by gkronber

r13447: more refactoring of TableFileParser to remove boxing of double values and to remove production of separator tokens
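A sketch of the boxing issue (token types are illustrative): if a tokenizer returns values as object, every parsed double is boxed onto the heap; keeping the double in a dedicated struct field avoids that, and not emitting a token per separator also halves the token stream.

// boxing: assigning a double to object allocates a separate heap object per value
object boxedToken = 3.14;

// avoiding it: a struct token keeps the double inline
enum TokenType { Double, String, NewLine }

struct Token {
  public TokenType Type;
  public double DoubleValue; // stored inline, no heap allocation
  public string StringValue;
}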

comment:22 in reply to: ↑ 18 Changed 16 months ago by gkronber

  • Owner changed from gkronber to mkommend

Replying to gkronber (re-quoting the earlier measurements):

File size 220 MB, 41,000 rows × 2631 columns (~2.1 bytes per value): loading took 34 s, then 33 s, LOH size 966 MB (~9.4 bytes per value), Working Set 1.2 GB (HL total).

Measured again: loading now takes 23 s, LOH size 966 MB. Most of the effort is spent in string.Split and double.TryParse(); neither can easily be removed.
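For reference, a sketch of the kind of inner loop being profiled (names are illustrative): string.Split allocates a string array plus one substring per field on every line, and double.TryParse runs once per value, which is why these two dominate.

using System.Collections.Generic;
using System.Globalization;

// per line: split into fields, then try to parse each field as a double
static void ParseLine(string line, char separator,
                      List<double>[] doubleColumns, List<string>[] stringColumns) {
  string[] fields = line.Split(separator);
  for (int i = 0; i < fields.Length; i++) {
    double value;
    if (double.TryParse(fields[i], NumberStyles.Float,
                        NumberFormatInfo.InvariantInfo, out value))
      doubleColumns[i].Add(value);
    else
      stringColumns[i].Add(fields[i]); // fall back to the string column
  }
}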

comment:23 Changed 15 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

comment:24 Changed 15 months ago by gkronber

r13525: added unit test for type conversion of columns

comment:25 Changed 15 months ago by gkronber

r13526: added code for type conversion of columns to the table file parser and made some other minor changes

comment:26 Changed 15 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:27 Changed 15 months ago by gkronber

In r13525 and r13526, two new unit tests were introduced to test the conversion of double columns to string columns. This should work correctly even if the column contains missing values, NaN, or infinity values (for both the de-DE and the invariant culture). Both unit tests pass on my Win7 machine; however, the de-DE unit test fails on our build server.

Reason: Apparently, Microsoft decided that the de-DE string representation of double.NaN is now also "NaN" instead of "n. def." (also cf. http://stackoverflow.com/questions/28159572/localization-of-nan-in-dede-returns-nan-rather-than-n-def-in-windows-8-ser).

The builder runs on Windows Server 2012, so the unit test fails there.

Simple test case:

using System.Linq;
foreach (var s in new double[] { 1.0, double.NaN, double.PositiveInfinity, double.NegativeInfinity }
                    .Select(d => d.ToString(System.Globalization.CultureInfo.GetCultureInfo("de-DE"))))
  System.Console.WriteLine(s);

On Windows Server 2012:

1
NaN 
+unendlich 
-unendlich 

On Windows 7:

1
n. def. 
+unendlich 
-unendlich 

comment:28 Changed 15 months ago by gkronber

r13529: fixed the unit test failure on the builder

comment:29 Changed 14 months ago by mkommend

r13584: Added the possibility to specify the encoding when importing files with the table file parser, and changed the export functionality to use the system's default encoding for creating CSV files.
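A sketch of the two sides of this change (method names are illustrative): the parser accepts a caller-supplied Encoding for reading, while the exporter writes with Encoding.Default, the system's ANSI code page.

using System.IO;
using System.Text;

// import: let the caller pick the encoding instead of guessing
static StreamReader OpenForImport(string path, Encoding encoding) {
  return new StreamReader(path, encoding);
}

// export: use the system's default (ANSI) encoding
static StreamWriter OpenForExport(string path) {
  return new StreamWriter(path, false, Encoding.Default);
}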

comment:30 follow-up: Changed 14 months ago by pfleck

On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.

This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.

comment:31 Changed 9 months ago by mkommend

r13901: Added replacement fallback for exporting undefined characters (e.g., Infinity symbol in ASCII).
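A sketch of the fix (the "*" placeholder matches what comment:33 later observes; the exact parameters are assumptions): constructing the ANSI encoding with an explicit EncoderReplacementFallback replaces characters the code page cannot represent with a defined placeholder, instead of relying on the best-fit mapping that turns '∞' into '8'.

using System.IO;
using System.Text;

// same ANSI code page, but with an explicit replacement fallback so that
// characters it cannot encode become "*" instead of a best-fit character
var encoding = Encoding.GetEncoding(
    Encoding.Default.CodePage,
    new EncoderReplacementFallback("*"),
    new DecoderReplacementFallback("*"));

using (var writer = new StreamWriter("export.csv", false, encoding))
  // on locales that render infinity as '∞', this is written as "*", not "8"
  writer.WriteLine(double.PositiveInfinity);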

comment:32 in reply to: ↑ 30 ; follow-up: Changed 9 months ago by mkommend

  • Owner changed from mkommend to pfleck
  • Status changed from reviewing to assigned

Replying to pfleck:

On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.

This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.

Please test if the conversion to the digit eight "8" still occurs.

comment:33 in reply to: ↑ 32 Changed 9 months ago by pfleck

  • Owner changed from pfleck to mkommend
  • Status changed from assigned to reviewing

Replying to mkommend:

Replying to pfleck:

On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.

This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.

Please test if the conversion to the digit eight "8" still occurs.

Tested it with the locale settings of my machine: the infinity symbol now becomes "*" after exporting. Consequently, the column type will be string when the file is imported again.

comment:34 Changed 9 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to readytorelease

comment:35 Changed 9 months ago by gkronber

r13925: TableFileParser: check whether it is possible to seek in the file before trying to access the stream position (reduces the number of exceptions thrown)
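A sketch of that check (method name is illustrative):

using System.IO;

static long GetPositionSafe(Stream stream) {
  // Stream.Position throws NotSupportedException on non-seekable streams,
  // so test CanSeek instead of catching the exception
  return stream.CanSeek ? stream.Position : -1;
}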

comment:37 Changed 9 months ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed