Opened 4 years ago

Closed 9 months ago

#2071 closed enhancement (done)

Refactor TableFileParser

Reported by: abeham
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.14
Component: Problems.Instances
Version: 3.3.8
Keywords:
Cc:

Description (last modified by gkronber)

Currently, the table file parser uses a heuristic to determine the separator character; this works for many CSV files but fails for some. Recently, we improved the CSV import dialog to allow the separator character to be specified, so the heuristic is no longer needed.

Additionally, the table file parser is implemented very sloppily: bugs that are hard to track down appear from time to time, and the performance of the importer is not ideal.

These issues should be resolved by refactoring or, better, rewriting the table file parser.

Change History (37)

comment:1 Changed 4 years ago by gkronber

  • Milestone changed from HeuristicLab 3.3.9 to HeuristicLab 3.3.x Backlog

comment:2 Changed 4 years ago by gkronber

  • Description modified (diff)
  • Owner changed from abeham to gkronber
  • Status changed from new to assigned

comment:3 Changed 4 years ago by gkronber

We should probably use the TextFieldParser class from the .NET Framework.
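A minimal sketch of how TextFieldParser could be used here (file name and delimiter are illustrative):

using Microsoft.VisualBasic.FileIO; // requires a reference to Microsoft.VisualBasic.dll

using (var parser = new TextFieldParser("data.csv")) {
  parser.TextFieldType = FieldType.Delimited;
  parser.SetDelimiters(";");
  parser.HasFieldsEnclosedInQuotes = true;
  while (!parser.EndOfData) {
    string[] fields = parser.ReadFields(); // one row of the file
    // convert and store the fields ...
  }
}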

comment:4 Changed 4 years ago by abeham

Meh... it's implemented in Microsoft.VisualBasic.dll. There seems to be Mono support for it, though.

comment:5 Changed 16 months ago by gkronber

  • Status changed from assigned to accepted

The table file parser is slow as hell. I did a quick profile and found the feared list.RemoveAt(0) pattern: 90% of the runtime is spent in RemoveAt()!
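For illustration, a minimal sketch of why this pattern is so expensive and the usual fix (token contents are placeholders):

using System.Collections.Generic;
using System.Linq;

static void Demo() {
  var allTokens = Enumerable.Range(0, 100000).Select(i => i.ToString()).ToList();

  // anti-pattern: List<T>.RemoveAt(0) shifts every remaining element left,
  // so draining n tokens from the front costs O(n^2) overall
  var list = new List<string>(allTokens);
  while (list.Count > 0) {
    string token = list[0];
    list.RemoveAt(0); // O(n) per call
    // ... process token ...
  }

  // Queue<T> dequeues from the front in O(1), so the drain is O(n) total
  var queue = new Queue<string>(allTokens);
  while (queue.Count > 0) {
    string token = queue.Dequeue(); // O(1) per call
    // ... process token ...
  }
}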

comment:6 Changed 16 months ago by gkronber

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14

comment:7 Changed 16 months ago by gkronber

The CSV import dialog also reads the whole file to display the preview, reads everything again whenever a setting is changed (e.g., a different separator character), and reads it once more when the user clicks OK to import the data.

comment:8 Changed 16 months ago by gkronber

r13411: minor refactoring of TableFileParser (let's see if the unit tests pass... I'm feeling daring today...)

comment:9 Changed 16 months ago by gkronber

r13413: only preview first 500 lines of data in CSV import dialog.
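A sketch of the idea behind this change (method name and default are illustrative): stop reading as soon as enough lines for the preview have been collected, instead of loading the whole file.

using System.Collections.Generic;
using System.IO;

// read at most previewLineCount lines instead of the whole file
static List<string> ReadPreview(string path, int previewLineCount = 500) {
  var lines = new List<string>();
  using (var reader = new StreamReader(path)) {
    string line;
    while (lines.Count < previewLineCount && (line = reader.ReadLine()) != null)
      lines.Add(line);
  }
  return lines;
}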

comment:10 Changed 16 months ago by gkronber

r13414: added progress reporting when importing regression problem data from CSV files.

... and btw. the ProblemInstanceProviders and corresponding views are a mess (for the data analysis stack).

Progress reporting only works for regression problem instances (not for classification, time series, or clustering) because the import code is duplicated in the specific ProblemInstanceProviderViews.

comment:11 Changed 16 months ago by gkronber

It's not necessary to clone the variable values (in the constructor of Dataset) when the values have just been imported by the TableFileParser.
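A sketch of the idea with hypothetical names (the actual HeuristicLab Dataset API may differ): keep the defensive copy in the public constructor, and add an internal overload that takes ownership of lists the parser built itself.

using System.Collections.Generic;
using System.Linq;

public class Dataset {
  private readonly List<IList<double>> variableValues;

  // public constructor: defensive copy, since callers may still mutate their lists
  public Dataset(IEnumerable<IList<double>> values) {
    variableValues = values.Select(v => (IList<double>)new List<double>(v)).ToList();
  }

  // internal constructor: the TableFileParser hands over lists it built itself
  // and never touches again, so cloning them is unnecessary
  internal Dataset(List<IList<double>> values) {
    variableValues = values;
  }
}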

comment:12 Changed 16 months ago by gkronber

r13415: added plugin dependency for HeuristicLab.Common

comment:13 Changed 16 months ago by gkronber

r13419: removed cloning of values in dataset (TODO: review data preprocessing)

comment:14 Changed 16 months ago by gkronber

r13440:

  • improved memory efficiency in TableFileParser by removing the duplicate storage of all columns,
  • added a heuristic to estimate the necessary capacity of the column lists (sketched below)
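A sketch of how such a capacity heuristic might look (method and variable names are illustrative, not the actual HeuristicLab code): sample the first lines, estimate the bytes per row, and pre-allocate the column lists so they are not repeatedly re-grown.

using System.Collections.Generic;
using System.IO;
using System.Linq;

// estimate the total row count from the file size and the average
// length of the first sample lines, then pre-allocate the columns
static List<double>[] CreateColumns(string path, int columnCount) {
  var sample = File.ReadLines(path).Take(100).ToList();
  double avgBytesPerRow = sample.Average(l => l.Length + 2); // +2 for CR/LF
  long fileSize = new FileInfo(path).Length;
  int estimatedRows = (int)(fileSize / avgBytesPerRow) + 1;
  return Enumerable.Range(0, columnCount)
                   .Select(_ => new List<double>(estimatedRows))
                   .ToArray();
}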

comment:15 Changed 16 months ago by gkronber

r13441:

  • re-added the statement that sets the number of parsed rows, which was lost in one of the previous changes
  • added a necessary Invoke call in RegressionInstanceProviderView

comment:16 follow-up: Changed 16 months ago by gkronber

Tests:

  • file size: 41 MB, 130,000 rows × 66 columns (~5 bytes per value): loading takes 6 s, Working Set 450 MB (HL total, ~55 bytes per value)
  • file size: 220 MB, 41,000 rows × 2631 columns (~2.1 bytes per value): loading takes 34 s, Working Set 2.6 GB (HL total, ~25 bytes per value)

Why so much additional space? I would expect only slightly more than 8 bytes per value: 130,000 rows × 66 columns ≈ 8.6 million values, which is only about 69 MB at 8 bytes each. Probably because we need to allocate very large arrays?

comment:17 Changed 16 months ago by gkronber

r13442: better estimation of the number of rows and an explicit call to GC.Collect()

comment:18 in reply to: ↑ 16 ; follow-up: Changed 16 months ago by gkronber

Replying to gkronber: measured again.

File size 41 MB, 130,000 rows × 66 columns (~5 bytes per value): previously loading took 6 s with a Working Set of 450 MB (HL total, ~55 bytes per value); now loading takes 6 s, LOH size 236 MB (~55 bytes per value), Working Set 550 MB (HL total). Explanation: the dataset contains three string columns and one DateTime column.

File size 220 MB, 41,000 rows × 2631 columns (~2.1 bytes per value): previously loading took 34 s with a Working Set of 2.6 GB (HL total, ~25 bytes per value); now loading takes 33 s, LOH size 966 MB (~9.4 bytes per value), Working Set 1.2 GB (HL total).

comment:19 Changed 16 months ago by gkronber

r13445: corrected disposal of StreamReader

comment:20 Changed 16 months ago by gkronber

  • Status changed from accepted to reviewing

comment:21 Changed 16 months ago by gkronber

r13447: more refactoring of TableFileParser to remove boxing of double values and to remove production of separator tokens
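A sketch of the boxing issue (token types are illustrative): if a tokenizer returns values as object, every parsed double is boxed onto the heap; keeping the double in a dedicated struct field avoids that, and not emitting a token per separator also halves the token stream.

// boxing: assigning a double to object allocates a separate heap object per value
object boxedToken = 3.14;

// avoiding it: a struct token keeps the double inline
enum TokenType { Double, String, NewLine }

struct Token {
  public TokenType Type;
  public double DoubleValue; // stored inline, no heap allocation
  public string StringValue;
}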

comment:22 in reply to: ↑ 18 Changed 16 months ago by gkronber

  • Owner changed from gkronber to mkommend

Replying to gkronber (re-quoting the earlier measurements):

File size 220 MB, 41,000 rows × 2631 columns (~2.1 bytes per value): loading took 34 s, then 33 s, LOH size 966 MB (~9.4 bytes per value), Working Set 1.2 GB (HL total).

Measured again: loading now takes 23 s, LOH size 966 MB. Most of the effort is spent in string.Split and double.TryParse(); neither can easily be removed.
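For reference, a sketch of the kind of inner loop being profiled (names are illustrative): string.Split allocates a string array plus one substring per field on every line, and double.TryParse runs once per value, which is why these two dominate.

using System.Collections.Generic;
using System.Globalization;

// per line: split into fields, then try to parse each field as a double
static void ParseLine(string line, char separator,
                      List<double>[] doubleColumns, List<string>[] stringColumns) {
  string[] fields = line.Split(separator);
  for (int i = 0; i < fields.Length; i++) {
    double value;
    if (double.TryParse(fields[i], NumberStyles.Float,
                        NumberFormatInfo.InvariantInfo, out value))
      doubleColumns[i].Add(value);
    else
      stringColumns[i].Add(fields[i]); // fall back to the string column
  }
}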

comment:23 Changed 15 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

comment:24 Changed 15 months ago by gkronber

r13525: added unit test for type conversion of columns

comment:25 Changed 15 months ago by gkronber

r13526: added code for type conversion of columns to the table file parser and made some other minor changes

comment:26 Changed 15 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:27 Changed 15 months ago by gkronber

In r13525 and r13526, two new unit tests were introduced to test the conversion of double columns to string columns. This should work correctly even if the column contains missing values, NaN, or infinity values (for both the de-DE and the invariant culture). Both unit tests pass on my Win7 machine; however, the de-DE unit test fails on our build server.

Reason: Apparently, Microsoft decided that the de-DE string representation of double.NaN is now also "NaN" instead of "n. def." (also cf. http://stackoverflow.com/questions/28159572/localization-of-nan-in-dede-returns-nan-rather-than-n-def-in-windows-8-ser).

The builder runs on Windows Server 2012, so the unit test fails there.

Simple test case:

using System.Linq;
foreach (var s in new double[] { 1.0, double.NaN, double.PositiveInfinity, double.NegativeInfinity }
                    .Select(d => d.ToString(System.Globalization.CultureInfo.GetCultureInfo("de-DE"))))
  System.Console.WriteLine(s);

On Windows Server 2012:

1
NaN 
+unendlich 
-unendlich 

On Windows 7:

1
n. def. 
+unendlich 
-unendlich 

comment:28 Changed 15 months ago by gkronber

r13529: fixed the unit test failure on the builder

comment:29 Changed 14 months ago by mkommend

r13584: Added the possibility to specify the encoding when importing files with the table file parser, and changed the export functionality to use the system's default encoding for creating CSV files.
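A sketch of the two sides of this change (method names are illustrative): the parser accepts a caller-supplied Encoding for reading, while the exporter writes with Encoding.Default, the system's ANSI code page.

using System.IO;
using System.Text;

// import: let the caller pick the encoding instead of guessing
static StreamReader OpenForImport(string path, Encoding encoding) {
  return new StreamReader(path, encoding);
}

// export: use the system's default (ANSI) encoding
static StreamWriter OpenForExport(string path) {
  return new StreamWriter(path, false, Encoding.Default);
}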

comment:30 follow-up: Changed 14 months ago by pfleck

On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.

This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.

comment:31 Changed 9 months ago by mkommend

r13901: Added replacement fallback for exporting undefined characters (e.g., Infinity symbol in ASCII).
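A sketch of the fix (the "*" placeholder matches what comment:33 later observes; the exact parameters are assumptions): constructing the ANSI encoding with an explicit EncoderReplacementFallback replaces characters the code page cannot represent with a defined placeholder, instead of relying on the best-fit mapping that turns '∞' into '8'.

using System.IO;
using System.Text;

// same ANSI code page, but with an explicit replacement fallback so that
// characters it cannot encode become "*" instead of a best-fit character
var encoding = Encoding.GetEncoding(
    Encoding.Default.CodePage,
    new EncoderReplacementFallback("*"),
    new DecoderReplacementFallback("*"));

using (var writer = new StreamWriter("export.csv", false, encoding))
  // on locales that render infinity as '∞', this is written as "*", not "8"
  writer.WriteLine(double.PositiveInfinity);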

comment:32 in reply to: ↑ 30 ; follow-up: Changed 9 months ago by mkommend

  • Owner changed from mkommend to pfleck
  • Status changed from reviewing to assigned

Replying to pfleck:

On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.

This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.

Please test if the conversion to the digit eight "8" still occurs.

comment:33 in reply to: ↑ 32 Changed 9 months ago by pfleck

  • Owner changed from pfleck to mkommend
  • Status changed from assigned to reviewing

Replying to mkommend:

Replying to pfleck:

On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.

This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.

Please test if the conversion to the digit eight "8" still occurs.

Tested it with the locale settings of my machine: the infinity symbol now becomes "*" after exporting. Consequently, the column type will be string when the file is imported again.

comment:34 Changed 9 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to readytorelease

comment:35 Changed 9 months ago by gkronber

r13925: TableFileParser: check whether it is possible to seek in the file before trying to access the stream position (reduces the number of exceptions thrown)
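A sketch of that check (method name is illustrative):

using System.IO;

static long GetPositionSafe(Stream stream) {
  // Stream.Position throws NotSupportedException on non-seekable streams,
  // so test CanSeek instead of catching the exception
  return stream.CanSeek ? stream.Position : -1;
}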

comment:37 Changed 9 months ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed