Opened 18 months ago

Closed 9 months ago

#2950 closed feature request (done)

Support hash-based simplification of symbolic expressions

Reported by: bburlacu Owned by: gkronber
Priority: medium Milestone: HeuristicLab 3.3.16
Component: Problems.DataAnalysis.Symbolic Version: trunk
Keywords: merged Cc:

Description

Hashing of symbolic expression trees consists of assigning each node with a unique integer hash value, such that arithmetically-equivalent nodes get the same value.

This approach would have the advantage of allowing additional simplification rules, as well as identifying equivalent trees by bottom-up calculation of hash values. For example, a simple tree represented as

Addition
    ├──Multiplication
    │   ├──x
    │   └──Multiplication
    │       ├──y
    │       └──z
    └──Multiplication
        ├──Multiplication
        │   ├──x
        │   └──y
        └──z

cannot be fully simplified by the existing tree simplifier, resulting in:

Addition
    ├──Multiplication
    │   ├──y
    │   ├──z
    │   └──x
    └──Multiplication
        ├──x
        ├──y
        └──z

In this case hash-based simplification detects that the two multiplication terms are identical and is able to further simplify the tree:

Multiplication
    ├──z
    ├──y
    └──x

Additionally, hash values could serve a similar purpose as genetic markers (1), enabling the development of additional diversity-preserving measures and genetic operators.

(1) Burks and Punch, "An analysis of the genetic marker diversity algorithm for genetic programming" https://link.springer.com/content/pdf/10.1007%2Fs10710-016-9281-9.pdf

Change History (41)

comment:1 Changed 18 months ago by bburlacu

  • Status changed from new to accepted

comment:2 Changed 18 months ago by bburlacu

r16218: Initial commit of hashing functionality as well as simplification rules for symbolic expression trees. As this is still in development the public api is not yet established (all methods public for now).

comment:3 Changed 18 months ago by bburlacu

r16252: Minor refactor of HashExtensions.cs to allow method chaining. Minor refactor in SymbolicExpressionTreeHash.cs.

comment:4 Changed 18 months ago by bburlacu

r16255:

  • Implement first version of hash-based building blocks analyzer.
  • Minor performance improvement in HashExtensions.cs.
  • Fix bug in SymbolicExpressionTreeHash.cs with simplification for Multiplication nodes inadvertently altering constant values.

comment:5 Changed 18 months ago by bburlacu

r16258: Simplify code in SymbolicDataAnalysisBuildingBlockAnalyzer and fix build error.

comment:6 Changed 18 months ago by bburlacu

r16259: Add storable constructor.

comment:7 Changed 18 months ago by bburlacu

r16260: Refactor HashExtensions: simplify Reduce method.

comment:8 Changed 18 months ago by bburlacu

r16261: Fix bug in HashUtil.ToByteArray(). Improve hashing performance (10-15% gain) by avoiding array allocations for child node indices.

comment:9 Changed 17 months ago by bburlacu

r16263: Refactor hashing to use unsigned long for hashes. Implement new DiversityPreservingCrossover which prevents subtrees with the same hash value from being swapped.

comment:10 Changed 17 months ago by bburlacu

r16267: Rename HashNode.IsChild property to IsLeaf

r16270: Fix compilation error in SymbolicDataAnalysisExpressionDiversityPreservingCrossover

r16271: Fix SymbolicDataAnalysisBuildingBlockAnalyzer compilation error.

r16272: Refactor hash extensions and utility methods

  • hashes are now computed from byte[] arrays
  • Simplify() now accepts an argument specifying which hash function to use.
  • Update SymbolicDataAnalysisBuildingBlockAnalyzer and SymbolicDataAnalysisExpressionDiversityPreservingCrossover.

r16273: Improve hashing performance.

Last edited 17 months ago by bburlacu (previous) (diff)

comment:11 Changed 17 months ago by bburlacu

r16284: Add the ability to compute the structural similarity between symbolic expression trees.

comment:12 Changed 17 months ago by gkronber

r16290: adjusted scaling code for SymbolicDataAnalysisModels because with the new hashing code and simplification we cannot assume that scale and offset nodes are at the same locations

comment:13 Changed 17 months ago by bburlacu

r16291: Fix typo in ComputeAverageSimilarity

comment:14 Changed 17 months ago by bburlacu

r16302: Add support for strict hashing (taking constants and variable weights into account)

comment:15 Changed 17 months ago by gkronber

  • Is "strict hashing" still "hashing"?
  • Please, try to limit changes to the trunk. I have the feeling that this feature expands more and more. Larger features should be implemented in a branch and then merged back. The ticket concern was initially "Support hash-based simplification of symbolic expressions", but the more recent changes are concerned with similarities of symbolic expressions. These are in my point of view separate concerns.

comment:16 Changed 17 months ago by bburlacu

  • Yes, "strict" is just an extra flag to take the coefficients of leaf nodes into account when we assign their initial hash value. No other changes involved.
  • Agreed, the additional operators (crossover and analyzer) should probably be moved to a branch.

comment:17 Changed 17 months ago by gkronber

Ok, since you still call this "hashing" I assume you create a random bitvector for each different real-valued constant

  • How do you determine whether two real-valued constants are equal?
  • How do you detect that e.g. 2*1.0 should have the same hash-value as 2.0?
  • At which point is this "hashing function" quasi equivalent to evaluating the expression for a number of different random inputs?
Last edited 17 months ago by gkronber (previous) (diff)

comment:18 Changed 17 months ago by bburlacu

  • a double is 8 bytes, using one of the hash functions that takes a byte[] as input will determine that
  • of course, hashing would not return the same hash value (regardless of "strict"), unless the constants are folded in the simplification step.
  • why would evaluation be preferable? it would be much more work and not as reliable. my idea was that we already have scenarios where hashing should not return the same value (because there are some coefficients involved). i thought "strict" could be useful in some of those cases.

comment:19 Changed 17 months ago by bburlacu

r16305: Change Simplify inside HashNode to a delegate (instead of an Action) so that the nodes array can be passed as ref. This enables us to resize/alter the nodes array during simplification (eg, by performing term expansion or similar operations)

comment:20 Changed 16 months ago by bburlacu

r16382: Change signature of ComputeSimilarity methods to accept a generic list of trees. This enables us to directly pass HL ItemAray or ItemList without overhead.

Last edited 16 months ago by bburlacu (previous) (diff)

comment:21 Changed 16 months ago by gkronber

Please try to complete your changes on the trunk until the end of the year, so that we can prepare for the next release.

comment:22 Changed 15 months ago by bburlacu

r16478: Reorganize code in SymbolicExpressionTreeHash.cs.

comment:23 Changed 12 months ago by gkronber

Is this ready for review?

comment:24 Changed 12 months ago by gkronber

Please make the required changes and move to review phase.

comment:25 Changed 11 months ago by bburlacu

r16979: Simplify symbol comparison (use only calculated hash value). Run simplification in a loop (successive simplification steps until no more changes).

comment:26 Changed 11 months ago by bburlacu

r16980: Remove building block analyzer (does not belong here), minor refactor in DiversityCrossover.

comment:27 Changed 11 months ago by bburlacu

r16983: Remove obsolete Comparer for T in HashNode<T>

comment:28 Changed 11 months ago by bburlacu

Current status

  • this ticket implements the functionality for expression hashing (trees or symreg sentences) which makes up the foundation for hash-based simplification
  • the SymReg algorithm depends on this functionality but implements its own simplification rules
  • so far we have basic support for simplification of GP trees (more detailed rules should be developed)
  • a DiversityCrossover was added that prevents swapping of subtrees with the same hash value
  • Hash-based diversity and building block analyzer are removed (this topic will be developed in a branch instead)
Last edited 11 months ago by bburlacu (previous) (diff)

comment:29 Changed 11 months ago by bburlacu

  • Owner changed from bburlacu to gkronber
  • Status changed from accepted to reviewing

comment:30 Changed 9 months ago by mkommend

This ticket blocks the reintegration of the persistence into stable and should be prioritized.

comment:31 Changed 9 months ago by gkronber

Reviewed changesets connected to this ticket and made some changes in r17076.

comment:32 Changed 9 months ago by gkronber

  • Status changed from reviewing to readytorelease

comment:33 Changed 9 months ago by abeham

r17090: merged revisions 16218, 16252, 16255, 16258, 16259, 16260, 16261, 16263, 16267, 16270, 16271, 16272, 16273, 16284, 16290, 16291, 16302, 16305, 16382, 16478

Last edited 9 months ago by mkommend (previous) (diff)

comment:34 Changed 9 months ago by abeham

  • Keywords depends-2520 added

comment:35 Changed 9 months ago by mkommend

  • Keywords merged added; depends-2520 removed

r17099: Merged 16979, 16980, 16983 into stable.

comment:36 follow-up: Changed 9 months ago by bburlacu

r17132: Small refactor in HashExtensions.cs (better comparison, consistent use of the same hash function). Fix bug and improve robustness of SimplifyMultiplication in SymbolicExpressionHashExtensions.cs

Last edited 9 months ago by abeham (previous) (diff)

comment:37 in reply to: ↑ 36 Changed 9 months ago by abeham

Replying to bburlacu:

r17132: Small refactor in HashExtensions.cs (better comparison, consistent use of the same hash function). Fix bug and improve robustness of SimplifyMultiplication in SymbolicExpressionHashExtensions.cs

This ticket has been merged to stable and is finished, please make further changes to #3015. We have code reviews for all changes. If these bugs are critical please talk to either Gabriel or Michael to review them.

comment:38 Changed 9 months ago by abeham

r17076 has been forgotton to merge. This changeset inolves DiversitySelector.cs which was introduced in #2991 and which has not yet been merged to stable.

comment:39 Changed 9 months ago by abeham

  • Keywords depends-2991 partially-merged added; merged removed

comment:40 Changed 9 months ago by abeham

  • Keywords merged added; depends-2991 partially-merged removed

r17141: merged 17076 to stable

comment:41 Changed 9 months ago by abeham

  • Resolution set to done
  • Status changed from readytorelease to closed
Note: See TracTickets for help on using tickets.