Opened 9 months ago

Last modified 5 weeks ago

#2950 reviewing feature request

Support hash-based simplification of symbolic expressions

Reported by: bburlacu
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.16
Component: Problems.DataAnalysis.Symbolic
Version: trunk
Keywords:
Cc:

Description

Hashing of symbolic expression trees consists of assigning each node an integer hash value such that arithmetically equivalent nodes (and the subtrees rooted at them) get the same value.

This approach would have the advantage of allowing additional simplification rules, as well as identifying equivalent trees by bottom-up calculation of hash values. For example, a simple tree represented as

Addition
    ├──Multiplication
    │   ├──x
    │   └──Multiplication
    │       ├──y
    │       └──z
    └──Multiplication
        ├──Multiplication
        │   ├──x
        │   └──y
        └──z

cannot be fully simplified by the existing tree simplifier, resulting in:

Addition
    ├──Multiplication
    │   ├──y
    │   ├──z
    │   └──x
    └──Multiplication
        ├──x
        ├──y
        └──z

In this case, hash-based simplification detects that the two multiplication terms are identical and can simplify the tree further:

Multiplication
    ├──z
    ├──y
    └──x
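
The bottom-up hash calculation described above can be sketched as follows. This is a minimal illustration, not the HeuristicLab implementation: the tuple-based tree encoding, the names `h64`, `flatten` and `tree_hash`, the choice of BLAKE2b as hash function, and the set of symbols treated as commutative are all assumptions made for the example. Nested commutative nodes are flattened and their child hashes sorted, so both multiplication terms of the example tree receive the same value.

```python
import hashlib
import struct

# Assumption: these symbols are treated as commutative and associative.
COMMUTATIVE = {"Addition", "Multiplication"}


def h64(data: bytes) -> int:
    """Hash a byte string to an unsigned 64-bit integer."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def flatten(symbol, children):
    """Merge nested nodes of the same symbol: x*(y*z) becomes x*y*z."""
    out = []
    for child in children:
        if not isinstance(child, str) and child[0] == symbol:
            out.extend(flatten(symbol, child[1]))
        else:
            out.append(child)
    return out


def tree_hash(node) -> int:
    """Hash a tree bottom-up. A node is a leaf name (str) or (symbol, [children])."""
    if isinstance(node, str):
        return h64(node.encode("utf-8"))
    symbol, children = node
    if symbol in COMMUTATIVE:
        children = flatten(symbol, children)
    child_hashes = [tree_hash(child) for child in children]
    if symbol in COMMUTATIVE:
        child_hashes.sort()  # argument order must not influence the hash
    payload = symbol.encode("utf-8") + b"".join(struct.pack("<Q", h) for h in child_hashes)
    return h64(payload)


# The two multiplication terms from the example tree.
left = ("Multiplication", ["x", ("Multiplication", ["y", "z"])])
right = ("Multiplication", [("Multiplication", ["x", "y"]), "z"])
terms_match = tree_hash(left) == tree_hash(right)
```

Once the two terms are known to be equal, a simplification rule on the parent Addition node can merge them. Non-commutative symbols keep their child order, so swapping their arguments changes the hash.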

Additionally, hash values could serve a purpose similar to that of genetic markers (1), enabling the development of additional diversity-preserving measures and genetic operators.

(1) Burks and Punch, "An analysis of the genetic marker diversity algorithm for genetic programming" https://link.springer.com/content/pdf/10.1007%2Fs10710-016-9281-9.pdf

Change History (29)

comment:1 Changed 9 months ago by bburlacu

  • Status changed from new to accepted

comment:2 Changed 9 months ago by bburlacu

r16218: Initial commit of hashing functionality as well as simplification rules for symbolic expression trees. As this is still in development, the public API is not yet established (all methods are public for now).

comment:3 Changed 8 months ago by bburlacu

r16252: Minor refactor of HashExtensions.cs to allow method chaining. Minor refactor in SymbolicExpressionTreeHash.cs.

comment:4 Changed 8 months ago by bburlacu

r16255:

  • Implement first version of hash-based building blocks analyzer.
  • Minor performance improvement in HashExtensions.cs.
  • Fix bug in SymbolicExpressionTreeHash.cs with simplification for Multiplication nodes inadvertently altering constant values.

comment:5 Changed 8 months ago by bburlacu

r16258: Simplify code in SymbolicDataAnalysisBuildingBlockAnalyzer and fix build error.

comment:6 Changed 8 months ago by bburlacu

r16259: Add storable constructor.

comment:7 Changed 8 months ago by bburlacu

r16260: Refactor HashExtensions: simplify Reduce method.

comment:8 Changed 8 months ago by bburlacu

r16261: Fix bug in HashUtil.ToByteArray(). Improve hashing performance (10-15% gain) by avoiding array allocations for child node indices.

comment:9 Changed 8 months ago by bburlacu

r16263: Refactor hashing to use unsigned long for hashes. Implement new DiversityPreservingCrossover which prevents subtrees with the same hash value from being swapped.
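
The idea of not swapping subtrees with equal hash values can be sketched as below. This is only an illustration under stated assumptions: the function name `pick_crossover_points`, the flat lists of precomputed subtree hashes, and the retry-based sampling are hypothetical and not taken from the committed operator.

```python
import random


def pick_crossover_points(hashes_a, hashes_b, rng, max_tries=10):
    """Pick indices (i, j) of subtrees to swap such that their hashes differ.

    hashes_a and hashes_b hold one precomputed hash per subtree of each parent.
    Returns None if every sampled pair had equal hashes.
    """
    for _ in range(max_tries):
        i = rng.randrange(len(hashes_a))
        j = rng.randrange(len(hashes_b))
        if hashes_a[i] != hashes_b[j]:
            return i, j  # swapping these subtrees changes the offspring
    return None  # caller may fall back to returning the parent unchanged


rng = random.Random(42)
points = pick_crossover_points([11, 12, 13], [21, 22], rng)
```

Rejecting pairs with equal hashes avoids crossover events that would produce an offspring equivalent to its parent.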

comment:10 Changed 8 months ago by bburlacu

r16267: Rename HashNode.IsChild property to IsLeaf

r16270: Fix compilation error in SymbolicDataAnalysisExpressionDiversityPreservingCrossover

r16271: Fix SymbolicDataAnalysisBuildingBlockAnalyzer compilation error.

r16272: Refactor hash extensions and utility methods

  • hashes are now computed from byte[] arrays
  • Simplify() now accepts an argument specifying which hash function to use.
  • Update SymbolicDataAnalysisBuildingBlockAnalyzer and SymbolicDataAnalysisExpressionDiversityPreservingCrossover.

r16273: Improve hashing performance.

Last edited 8 months ago by bburlacu

comment:11 Changed 8 months ago by bburlacu

r16284: Add the ability to compute the structural similarity between symbolic expression trees.
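
One way to derive a structural similarity from subtree hashes is sketched below. The ticket does not state which measure is used, so the Sorensen-Dice coefficient over multisets of subtree hashes is an assumption made for illustration, as are the function names.

```python
from collections import Counter


def dice_similarity(hashes_a, hashes_b):
    """Sorensen-Dice coefficient over two multisets of subtree hashes."""
    if not hashes_a and not hashes_b:
        return 1.0
    common = sum((Counter(hashes_a) & Counter(hashes_b)).values())
    return 2.0 * common / (len(hashes_a) + len(hashes_b))


def average_similarity(population):
    """Mean pairwise similarity over all distinct pairs of trees."""
    n = len(population)
    if n < 2:
        return 0.0
    total = sum(
        dice_similarity(population[i], population[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


# Each tree is represented by the list of hashes of all its subtrees.
half = dice_similarity([1, 2, 3, 4], [1, 2, 5, 6])  # two of four hashes shared
```

With this measure, two trees that share half of their subtree hashes have a similarity of 0.5, and identical hash multisets give 1.0.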

comment:12 Changed 8 months ago by gkronber

r16290: Adjusted the scaling code for SymbolicDataAnalysisModels because, with the new hashing code and simplification, we cannot assume that the scale and offset nodes are at the same locations.

comment:13 Changed 8 months ago by bburlacu

r16291: Fix typo in ComputeAverageSimilarity

comment:14 Changed 7 months ago by bburlacu

r16302: Add support for strict hashing (taking constants and variable weights into account)

comment:15 Changed 7 months ago by gkronber

  • Is "strict hashing" still "hashing"?
  • Please try to limit changes to the trunk. I have the feeling that this feature keeps expanding. Larger features should be implemented in a branch and then merged back. The ticket was initially concerned with "Support hash-based simplification of symbolic expressions", but the more recent changes are concerned with similarities of symbolic expressions. In my view, these are separate concerns.

comment:16 Changed 7 months ago by bburlacu

  • Yes, "strict" is just an extra flag to take the coefficients of leaf nodes into account when we assign their initial hash value. No other changes involved.
  • Agreed, the additional operators (crossover and analyzer) should probably be moved to a branch.
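
The "strict" flag described above can be sketched as follows. The function name `leaf_hash`, its signature, and the use of BLAKE2b are assumptions for illustration only; the point is that the coefficient enters the initial hash of a leaf only when strict hashing is requested.

```python
import hashlib
import struct


def h64(data: bytes) -> int:
    """Hash a byte string to an unsigned 64-bit integer."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def leaf_hash(name: str, weight: float, strict: bool = False) -> int:
    """Initial hash of a leaf node; the weight is included only in strict mode."""
    payload = name.encode("utf-8")
    if strict:
        payload += struct.pack("<d", weight)  # 8 bytes of the IEEE 754 double
    return h64(payload)


relaxed_equal = leaf_hash("x", 1.5) == leaf_hash("x", 2.5)
strict_equal = leaf_hash("x", 1.5, strict=True) == leaf_hash("x", 2.5, strict=True)
```

In this sketch, `relaxed_equal` is true because the weights are ignored, while `strict_equal` is false because the two weights differ.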

comment:17 Changed 7 months ago by gkronber

Ok, since you still call this "hashing", I assume you create a random bit vector for each distinct real-valued constant.

  • How do you determine whether two real-valued constants are equal?
  • How do you detect that e.g. 2*1.0 should have the same hash value as 2.0?
  • At which point is this "hashing function" quasi equivalent to evaluating the expression for a number of different random inputs?
Last edited 7 months ago by gkronber

comment:18 Changed 7 months ago by bburlacu

  • A double is 8 bytes; hashing those bytes with one of the hash functions that take a byte[] as input determines whether two constants are equal.
  • Of course, hashing would not return the same hash value (regardless of "strict") unless the constants are folded in the simplification step.
  • Why would evaluation be preferable? It would be much more work and not as reliable. My idea was that we already have scenarios where hashing should not return the same value (because there are some coefficients involved). I thought "strict" could be useful in some of those cases.
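
The first two answers above can be illustrated with a short sketch. It is not the ticket's code: `const_hash`, `fold_constants`, the tuple-based tree encoding and the choice of BLAKE2b are assumptions. Constants are compared through the hash of their 8-byte representation, and 2*1.0 only receives the hash of 2.0 after the product has been folded.

```python
import hashlib
import math
import struct


def h64(data: bytes) -> int:
    """Hash a byte string to an unsigned 64-bit integer."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def const_hash(value: float) -> int:
    """Hash a constant from the 8 bytes of its IEEE 754 representation."""
    return h64(b"Constant" + struct.pack("<d", value))


def fold_constants(node):
    """Fold sums and products whose children are all constants.

    A node is a float (constant), a str (variable) or (symbol, [children]).
    Mixed nodes such as x * 2.0 * 1.0 are left untouched in this sketch.
    """
    if not isinstance(node, tuple):
        return node
    symbol, children = node
    children = [fold_constants(child) for child in children]
    if children and all(isinstance(child, float) for child in children):
        if symbol == "Multiplication":
            return math.prod(children)
        if symbol == "Addition":
            return math.fsum(children)
    return (symbol, children)


folded = fold_constants(("Multiplication", [2.0, 1.0]))
same_hash = const_hash(folded) == const_hash(2.0)
```

Because the comparison is byte-wise, constants that differ only in the last bit get different hashes, and so do 0.0 and -0.0.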

comment:19 Changed 7 months ago by bburlacu

r16305: Change Simplify inside HashNode to a delegate (instead of an Action) so that the nodes array can be passed as ref. This enables us to resize or alter the nodes array during simplification (e.g., by performing term expansion or similar operations).

comment:20 Changed 6 months ago by bburlacu

r16382: Change signature of ComputeSimilarity methods to accept a generic list of trees. This enables us to directly pass an HL ItemArray or ItemList without overhead.

Last edited 6 months ago by bburlacu

comment:21 Changed 6 months ago by gkronber

Please try to complete your changes on the trunk until the end of the year, so that we can prepare for the next release.

comment:22 Changed 6 months ago by bburlacu

r16478: Reorganize code in SymbolicExpressionTreeHash.cs.

comment:23 Changed 3 months ago by gkronber

Is this ready for review?

comment:24 Changed 2 months ago by gkronber

Please make the required changes and move to review phase.

comment:25 Changed 5 weeks ago by bburlacu

r16979: Simplify symbol comparison (use only calculated hash value). Run simplification in a loop (successive simplification steps until no more changes).
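
Running simplification until nothing changes amounts to a fixed-point loop, sketched below. How the commit detects "no more changes" is not stated in the ticket; comparing the tree hash before and after each step is an assumption, as are the names `simplify_to_fixed_point` and `unwrap` and the iteration cap.

```python
def simplify_to_fixed_point(tree, step, tree_hash, max_iterations=100):
    """Apply `step` repeatedly until the hash of the tree stops changing."""
    current = tree_hash(tree)
    for _ in range(max_iterations):
        tree = step(tree)
        new = tree_hash(tree)
        if new == current:  # no rule fired in this pass
            break
        current = new
    return tree


def unwrap(tree):
    """Toy rule: a node with a single child is replaced by that child."""
    if isinstance(tree, tuple) and len(tree[1]) == 1:
        return tree[1][0]
    return tree


# Three nested single-child nodes need three passes to collapse to "x".
nested = ("Multiplication", (("Multiplication", (("Multiplication", ("x",)),)),))
result = simplify_to_fixed_point(nested, unwrap, hash)
```

A single pass of `unwrap` removes only one level of nesting, which is why successive steps are needed before the tree reaches its final form.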

comment:26 Changed 5 weeks ago by bburlacu

r16980: Remove building block analyzer (does not belong here), minor refactor in DiversityCrossover.

comment:27 Changed 5 weeks ago by bburlacu

r16983: Remove obsolete Comparer for T in HashNode<T>

comment:28 Changed 5 weeks ago by bburlacu

Current status

  • this ticket implements the functionality for expression hashing (trees or symreg sentences) which makes up the foundation for hash-based simplification
  • the SymReg algorithm depends on this functionality but implements its own simplification rules
  • so far we have basic support for simplification of GP trees (more detailed rules should be developed)
  • a DiversityCrossover was added that prevents swapping of subtrees with the same hash value
  • Hash-based diversity and building block analyzer are removed (this topic will be developed in a branch instead)
Last edited 5 weeks ago by bburlacu

comment:29 Changed 5 weeks ago by bburlacu

  • Owner changed from bburlacu to gkronber
  • Status changed from accepted to reviewing