Opened 11 years ago
Closed 8 years ago
#2071 closed enhancement (done)
Refactor TableFileParser
Reported by: | abeham | Owned by: | gkronber |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.14 |
Component: | Problems.Instances | Version: | 3.3.8 |
Keywords: | | Cc: |
Description (last modified by gkronber)
Currently, the table file parser uses a heuristic to determine the separator character which works for many cases of CSV files but fails for some files. Recently, we improved the dialog for the CSV import to allow specification of the separator character, so the heuristic is not needed anymore.
Additionally, the table file parser is implemented rather sloppily; from time to time, bugs appear that are hard to track down, and the importer's performance is not ideal.
These issues should be resolved by refactoring, or better, rewriting the table file parser.
Change History (37)
comment:1 Changed 11 years ago by gkronber
- Milestone changed from HeuristicLab 3.3.9 to HeuristicLab 3.3.x Backlog
comment:2 Changed 11 years ago by gkronber
- Description modified (diff)
- Owner changed from abeham to gkronber
- Status changed from new to assigned
comment:3 Changed 11 years ago by gkronber
Probably we should use TextFieldParser from the framework.
comment:4 Changed 11 years ago by abeham
Meh... TextFieldParser is implemented in Microsoft.VisualBasic.dll. There seems to be Mono support for it, though.
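For reference, a minimal sketch of reading a delimited file with TextFieldParser; the class name, file path, and separator here are placeholders, not HeuristicLab code:
{{{
using Microsoft.VisualBasic.FileIO; // requires a reference to Microsoft.VisualBasic.dll

class TextFieldParserDemo {
  static void ReadCsv(string path, string separator) {
    using (var parser = new TextFieldParser(path)) {
      parser.TextFieldType = FieldType.Delimited;
      parser.SetDelimiters(separator);         // e.g. ";" as chosen in the import dialog
      parser.HasFieldsEnclosedInQuotes = true;
      while (!parser.EndOfData) {
        string[] fields = parser.ReadFields(); // one record of the file
        // ... convert fields to typed column values here
      }
    }
  }
}
}}}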
comment:5 Changed 9 years ago by gkronber
- Status changed from assigned to accepted
The table-file parser is slow as hell. I did a quick profile and found the feared list.RemoveAt(0) pattern: 90% of the runtime is spent in RemoveAt()!
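For context, a self-contained illustration of why this pattern hurts (not the parser's actual code): List<T>.RemoveAt(0) shifts every remaining element one slot to the left, so draining a list of n tokens this way costs O(n^2), while Queue<T>.Dequeue() is O(1).
{{{
using System.Collections.Generic;
using System.Linq;

class RemoveAtDemo {
  static void Main() {
    var tokens = Enumerable.Range(0, 100000).Select(i => i.ToString()).ToList();

    // Anti-pattern: each RemoveAt(0) copies all remaining elements,
    // so the whole loop is O(n^2).
    var list = new List<string>(tokens);
    while (list.Count > 0) { var t = list[0]; list.RemoveAt(0); /* consume t */ }

    // O(n) alternative: Dequeue removes from the front in O(1).
    var queue = new Queue<string>(tokens);
    while (queue.Count > 0) { var t = queue.Dequeue(); /* consume t */ }
  }
}
}}}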
comment:6 Changed 9 years ago by gkronber
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14
comment:7 Changed 9 years ago by gkronber
Also, the CSV import dialog reads the whole file to display the preview, reads everything again when the settings change (e.g., a different separator character), and reads it once more when the user clicks OK to import the data.
comment:8 Changed 9 years ago by gkronber
r13411: minor refactoring of TableFileParser (let's see if the unit tests pass..., today I'm daring...)
comment:9 Changed 9 years ago by gkronber
r13413: only preview first 500 lines of data in CSV import dialog.
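A plausible shape of that change (names are illustrative, not the committed code): read at most the first 500 lines for the preview grid instead of parsing the entire file.
{{{
using System.Collections.Generic;
using System.IO;

class CsvPreview {
  static IList<string> ReadPreviewLines(string path, int maxLines = 500) {
    var lines = new List<string>();
    using (var reader = new StreamReader(path)) {
      string line;
      while (lines.Count < maxLines && (line = reader.ReadLine()) != null)
        lines.Add(line); // only the preview is read; the full parse happens on OK
    }
    return lines;
  }
}
}}}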
comment:10 Changed 9 years ago by gkronber
r13414: added progress reporting when importing regression problem data from csv files.
... and by the way, the ProblemInstanceProviders and corresponding views are a mess (for the data analysis stack).
The progress reporting only works for regression problem instances (not for classification, time series, or clustering) because the import code is duplicated in the specific ProblemInstanceProviderViews.
comment:11 Changed 9 years ago by gkronber
It's not necessary to clone the variable values (in the constructor of Dataset) when the values have been imported using the TableFileParser
comment:12 Changed 9 years ago by gkronber
r13415: added plugin dependency for HeuristicLab.Common
comment:13 Changed 9 years ago by gkronber
r13419: removed cloning of values in dataset (TODO: review data preprocessing)
comment:14 Changed 9 years ago by gkronber
- improved memory efficiency in TableFileParser by removing duplicate storage of all columns,
- added a heuristic to estimate the necessary capacity of columns (see the sketch below)
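A sketch of how such a capacity heuristic could look (an assumption about its shape, not the committed code): derive an estimated row count from the file size and the average length of the first few lines, then pre-size the column lists so they are not repeatedly reallocated while parsing.
{{{
using System.IO;

class CapacityHeuristic {
  static int EstimateRowCount(string path, int sampleLines = 100) {
    long fileSize = new FileInfo(path).Length;
    long sampledChars = 0;
    int sampled = 0;
    using (var reader = new StreamReader(path)) {
      string line;
      while (sampled < sampleLines && (line = reader.ReadLine()) != null) {
        sampledChars += line.Length + 2; // + 2 for CR/LF; assumes ~1 byte per char
        sampled++;
      }
    }
    if (sampled == 0) return 0;
    return (int)(fileSize / ((double)sampledChars / sampled));
  }
  // usage: var column = new List<double>(capacity: EstimateRowCount(path));
}
}}}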
comment:15 Changed 9 years ago by gkronber
- added the statement that sets the number of parsed rows, which was lost in one of the previous changes
- added a necessary Invoke call in RegressionInstanceProviderView
comment:16 follow-up: ↓ 18 Changed 9 years ago by gkronber
Tests:
- file size: 41MB, 130,000 rows x 66 columns (~5 bytes per value): loading takes 6s, Working Set 450MB (HL total) (~55 bytes per value)
- file size: 220MB, 41,000 rows x 2631 columns (~2.1 bytes per value): loading takes 34s, Working Set 2.6GB (HL total) (~25 bytes per value)
Why so much additional space? I would expect slightly more than 8 bytes per value. Probably because we need to allocate very large arrays?
comment:17 Changed 9 years ago by gkronber
r13442: better estimation of number of rows and explicit call of GC.Collect
comment:18 in reply to: ↑ 16 ; follow-up: ↓ 22 Changed 9 years ago by gkronber
Replying to gkronber: Measured again.
Old: file size: 41MB, 130,000 rows x 66 columns (~5 bytes per value), loading takes 6s, Working Set 450MB (HL total) (~55 bytes per value)
New: loading takes 6s, LOH size 236MB (~55 bytes per value), Working Set 550MB (HL total). Explanation: the dataset contains three string columns and one DateTime column.
Old: file size: 220MB, 41,000 rows x 2631 columns (~2.1 bytes per value), loading takes 34s, Working Set 2.6GB (HL total) (~25 bytes per value)
New: loading takes 33s, LOH size 966MB (~9.4 bytes per value), Working Set 1.2GB (HL total).
comment:19 Changed 9 years ago by gkronber
r13445: corrected disposal of StreamReader
comment:20 Changed 9 years ago by gkronber
- Status changed from accepted to reviewing
comment:21 Changed 9 years ago by gkronber
r13447: more refactoring of TableFileParser to remove boxing of double values and to remove production of separator tokens
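The boxing issue, illustrated (an assumed shape of the problem, not the actual diff): storing doubles as object allocates a heap box per value, while a typed List<double> stores the 8-byte values inline in its backing array.
{{{
using System.Collections.Generic;

class BoxingDemo {
  static void Main() {
    var boxed = new List<object>();
    boxed.Add(1.5);    // allocates a heap box for every double value

    var unboxed = new List<double>();
    unboxed.Add(1.5);  // stored inline, no per-value allocation
  }
}
}}}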
comment:22 in reply to: ↑ 18 Changed 9 years ago by gkronber
- Owner changed from gkronber to mkommend
Replying to gkronber:
Replying to gkronber: Measured again
file size: 220MB, 41,000 rows x 2631 columns (~2.1 bytes per value), loading takes 34s, Working Set 2.6GB (HL total) (~25 bytes per value)
loading takes 33s, LOH size 966MB (~9.4 bytes per value), Working Set 1.2GB (HL total)
Measured again: loading now takes 23s, LOH size 966MB. Most of the effort is spent in string.Split and double.TryParse(); neither can easily be removed.
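Since string.Split allocates a fresh string[] for every line, one conceivable mitigation (a sketch under the assumption of a single-character separator, not anything committed here) is to scan for the separator manually; the per-field substrings for double.TryParse remain, as noted above.
{{{
using System.Collections.Generic;
using System.Globalization;

class LineParser {
  static void ParseLine(string line, char separator, List<double> target) {
    int start = 0;
    while (start <= line.Length) {
      int end = line.IndexOf(separator, start);
      if (end < 0) end = line.Length;
      string field = line.Substring(start, end - start); // still one substring per field
      double value;
      if (double.TryParse(field, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
        target.Add(value);
      start = end + 1;
    }
  }
}
}}}
double.TryParse gained a ReadOnlySpan<char> overload only on .NET Core 2.1 and later, so the per-field substrings could not be avoided on the .NET Framework versions targeted at the time.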
comment:23 Changed 9 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
comment:24 Changed 9 years ago by gkronber
r13525: added unit test for type conversion of columns
comment:25 Changed 9 years ago by gkronber
r13526: added code for type conversion of columns to the table file parser and made some other minor changes
comment:26 Changed 9 years ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
comment:27 Changed 9 years ago by gkronber
In r13525 and r13526 two new unit tests were introduced for testing the conversion of double columns to string columns. This should work correctly even if the column contains missing values, NaN, or infinity values (for de-DE and invariant culture). Both unit tests pass on my Win7 machine; however, the de-DE unit test fails on our build server.
Reason: Apparently, Microsoft decided that the de-DE string representation of double.NaN is now also "NaN" instead of "n. def." (cf. http://stackoverflow.com/questions/28159572/localization-of-nan-in-dede-returns-nan-rather-than-n-def-in-windows-8-ser).
The builder runs on Windows Server 2012, so the unit test fails there.
Simple test case:
new double[] { 1.0, double.NaN, double.PositiveInfinity, double.NegativeInfinity }.Select(d => d.ToString(System.Globalization.CultureInfo.GetCultureInfo("de-DE")))
On Windows Server 2012:
1 NaN +unendlich -unendlich
On Windows 7:
1 n. def. +unendlich -unendlich
comment:28 Changed 9 years ago by gkronber
r13529: fix for the unit test failure on the builder
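One way to keep such a test portable across Windows versions (an assumption about the fix, not a quote of r13529) is to derive the expected strings from the runtime instead of hard-coding "n. def.":
{{{
using System.Globalization;

class PortableNaNExpectation {
  static string ExpectedNaN() {
    var deDE = CultureInfo.GetCultureInfo("de-DE");
    // "n. def." on Windows 7, "NaN" on Windows Server 2012 and later;
    // comparing parser output against this keeps the test OS-independent.
    return double.NaN.ToString(deDE);
  }
}
}}}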
comment:29 Changed 9 years ago by mkommend
r13584: Added possibility to specify the encoding when importing files with the table file parser and changed export functionality to use the system's default encoding for creating CSV files.
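Assumed usage (a sketch, not the committed API surface): the reader is opened with the encoding the user selected instead of relying on auto-detection.
{{{
using System.IO;
using System.Text;

class EncodedImport {
  static void ImportWithEncoding(string path, Encoding userSelectedEncoding) {
    using (var reader = new StreamReader(path, userSelectedEncoding)) {
      string line;
      while ((line = reader.ReadLine()) != null) {
        // ... tokenize and convert exactly as before
      }
    }
  }
}
}}}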
comment:30 follow-up: ↓ 32 Changed 9 years ago by pfleck
On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.
This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.
comment:31 Changed 8 years ago by mkommend
r13901: Added replacement fallback for exporting undefined characters (e.g., Infinity symbol in ASCII).
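A sketch of such a fallback (the replacement character "*" is taken from comment 33 below; the exact r13901 code may differ): an encoding created with an EncoderReplacementFallback substitutes characters it cannot represent instead of mis-mapping them to similar-looking bytes.
{{{
using System.IO;
using System.Text;

class CsvExport {
  static StreamWriter CreateExportWriter(string path) {
    // Characters that ASCII cannot represent (e.g. the infinity symbol)
    // become "*" instead of being silently mapped to another byte.
    Encoding enc = Encoding.GetEncoding(
      "us-ascii",
      new EncoderReplacementFallback("*"),
      DecoderFallback.ReplacementFallback);
    return new StreamWriter(path, false, enc);
  }
}
}}}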
comment:32 in reply to: ↑ 30 ; follow-up: ↓ 33 Changed 8 years ago by mkommend
- Owner changed from mkommend to pfleck
- Status changed from reviewing to assigned
Replying to pfleck:
On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.
This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.
Please test if the conversion to the digit eight "8" still occurs.
comment:33 in reply to: ↑ 32 Changed 8 years ago by pfleck
- Owner changed from pfleck to mkommend
- Status changed from assigned to reviewing
Replying to mkommend:
Replying to pfleck:
On systems where infinity is displayed with the infinity symbol "∞", infinity is converted to "8" when exporting a dataset.
This is caused by using the default encoding (ANSI) in the DataAnalysisInstanceProvider.
Please test if the conversion to the digit eight "8" still occurs.
Tested it with the locale settings of my machine: the infinity symbol now becomes "*" after exporting. Thus, the column type will be string when the file is imported again.
comment:34 Changed 8 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to readytorelease
comment:35 Changed 8 years ago by gkronber
r13925: TableFileParser: check if it is possible to seek in file before trying to access position (=> reduce number of exceptions)
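The guard presumably looks something like this (an assumption about the change's shape, not the committed code):
{{{
using System.IO;

class SeekGuard {
  static long GetPositionOrUnknown(Stream stream) {
    // Stream.Position throws NotSupportedException on non-seekable streams,
    // so check CanSeek first instead of provoking and catching the exception.
    return stream.CanSeek ? stream.Position : -1; // -1 = position unknown
  }
}
}}}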
comment:36 Changed 8 years ago by gkronber
comment:37 Changed 8 years ago by gkronber
- Resolution set to done
- Status changed from readytorelease to closed