Opened 10 months ago

Last modified 7 months ago

#2661 accepted defect

Remove bugs in TableFileParser

Reported by: mkommend Owned by: gkronber
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Problems.Instances Version: 3.3.14
Keywords: Cc:

Description (last modified by mkommend)

During modeling and importing CSV files in HeuristicLab I discovered three bugs in the current TableFileParser:

  • Converting the data types of columns could change the values (test_csv_importer_conversion). The first missing values are replaced with double.NaN while the last ones with string.Empty.
  • Errors in files after the look a head limit (per default 500 lines) are not reported, the progress dialog hangs and occurring exceptions are swallowed.
    • test_csv_importer_conversion2: column y contains a character after line 500 -> Exception when setting y as the target variable.
    • test_csv_importer_columns: row 500 has one additional value -> Exception while importing

I am unsure whether these bugs were introduced with #2071 or if they have already been present.

Attachments (3)

csv_importer_test_columns.csv (3.5 KB) - added by mkommend 10 months ago.
csv_importer_test_conversion.csv (117 bytes) - added by mkommend 10 months ago.
csv_importer_test_conversion2.csv (3.5 KB) - added by mkommend 10 months ago.

Download all attachments as: .zip

Change History (13)

Changed 10 months ago by mkommend

Changed 10 months ago by mkommend

Changed 10 months ago by mkommend

comment:1 Changed 10 months ago by mkommend

  • Description modified (diff)

comment:2 Changed 9 months ago by gkronber

r14284: added unit tests to reproduce the described bugs

comment:3 Changed 9 months ago by gkronber

r14285:14286:

  • make sure the progressbar is removed when the TableFileParser throws an exception
  • TableFileParser throws IOException instead of DataFormatException
Last edited 9 months ago by gkronber (previous) (diff)

comment:4 Changed 9 months ago by gkronber

r14288: fixed build fail because of deletion of DataFormatException class

comment:5 Changed 9 months ago by gkronber

r14296: implemented fixes for several problems in the TableFileParser.

We now also store the original string representation of all tokens and use those when we detect that a column cannot be read as DateTime / double.

Unfortunately, more memory is required for this solution. However, I found no better way because we cannot restart reading the input stream. Converting the parsed values back to a string representation involves so many edge cases that it would be hard to guarantee that we always produce an output which matches the original input string.

comment:6 Changed 9 months ago by gkronber

  • Status changed from new to accepted

comment:7 Changed 9 months ago by gkronber

r14298: catch and handle all Exceptions because otherwise they are swallowed by the TPL

comment:8 Changed 9 months ago by gkronber

The current implementation is problematic because exceptions are handled directly by showing a message box. However, if the problem is loaded programmatically, e.g. from the 'Generate Experiment'-dialog this completely blocks the UI.

This must be refactored completely.

Last edited 9 months ago by gkronber (previous) (diff)

comment:9 follow-up: Changed 7 months ago by mkommend

During my work with new datasets I discovered an additional bug in the TableFileParser. DateTime values contained in a CSV file can only be imported as string values. r14408 fixes the bug that occurred due to wrong logical junction in an if clause (and instead of or) that was introduced with r14298. A result of that bug was that no DateTime columns containing a date in the first row, which day is the first, or a date in January or any date in 0001, could be parsed correctly.

r14408: Fixed a bug in the TableFileparser that omits the correct detection of DateTime columns.

Last edited 7 months ago by mkommend (previous) (diff)

comment:10 in reply to: ↑ 9 Changed 7 months ago by gkronber

Replying to mkommend:

r14408: Fixed a bug in the TableFileparser that omits the correct detection of DateTime columns.

Reviewed fix. Thanks for correcting this.

Note: See TracTickets for help on using tickets.