Opened 2 months ago

Last modified 40 hours ago

#2950 accepted feature request

Support hash-based simplification of symbolic expressions

Reported by: bburlacu
Owned by: bburlacu
Priority: medium
Milestone: HeuristicLab 3.3.16
Component: Problems.DataAnalysis.Symbolic
Version: trunk
Keywords:
Cc:

Description

Hashing of symbolic expression trees consists of assigning each node an integer hash value such that arithmetically equivalent subtrees receive the same value, while non-equivalent subtrees receive different values with high probability.

This approach would have the advantage of allowing additional simplification rules, as well as identifying equivalent trees by bottom-up calculation of hash values. For example, a simple tree represented as

Addition
    ├──Multiplication
    │   ├──x
    │   └──Multiplication
    │       ├──y
    │       └──z
    └──Multiplication
        ├──Multiplication
        │   ├──x
        │   └──y
        └──z

cannot be fully simplified by the existing tree simplifier, resulting in:

Addition
    ├──Multiplication
    │   ├──y
    │   ├──z
    │   └──x
    └──Multiplication
        ├──x
        ├──y
        └──z

In this case hash-based simplification detects that the two multiplication terms are identical and is able to further simplify the tree:

Multiplication
    ├──z
    ├──y
    └──x
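
The detection step in this example can be sketched as follows. This is a minimal illustration in Python, not the HeuristicLab implementation: the `Node` class, the `flatten` and `tree_hash` helpers, and the use of SHA-256 truncated to 64 bits are all assumptions made for the sketch. Nested commutative nodes are flattened, child hashes are sorted under commutative symbols, and hashes are then computed bottom-up, so both multiplication terms end up with the same value.

```python
import hashlib
from dataclasses import dataclass, field

# Symbols whose argument order does not matter (assumption for this sketch).
COMMUTATIVE = {"Addition", "Multiplication"}


@dataclass
class Node:
    """Hypothetical tree node: a symbol name plus child nodes."""
    symbol: str
    children: list = field(default_factory=list)


def flatten(node):
    """Merge nested commutative nodes of the same symbol: (x*y)*z -> x*y*z."""
    kids = [flatten(c) for c in node.children]
    if node.symbol in COMMUTATIVE:
        merged = []
        for k in kids:
            if k.symbol == node.symbol:
                merged.extend(k.children)
            else:
                merged.append(k)
        kids = merged
    return Node(node.symbol, kids)


def tree_hash(node):
    """Bottom-up 64-bit hash; child hashes are sorted under commutative symbols."""
    child_hashes = [tree_hash(c) for c in node.children]
    if node.symbol in COMMUTATIVE:
        child_hashes.sort()
    h = hashlib.sha256(node.symbol.encode("utf-8"))
    for ch in child_hashes:
        h.update(ch.to_bytes(8, "little"))
    return int.from_bytes(h.digest()[:8], "little")


x, y, z = Node("x"), Node("y"), Node("z")
left = Node("Multiplication", [x, Node("Multiplication", [y, z])])
right = Node("Multiplication", [Node("Multiplication", [x, y]), z])
root = flatten(Node("Addition", [left, right]))

# Both terms of the addition now carry the same hash and can be merged.
term_hashes = [tree_hash(t) for t in root.children]
terms_identical = term_hashes[0] == term_hashes[1]
```

The sketch only detects the duplicate terms; merging them into a single term is left to the simplification rules.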

Additionally, hash values could serve a purpose similar to that of genetic markers (1), enabling the development of additional diversity-preserving measures and genetic operators.

(1) Burks and Punch, "An analysis of the genetic marker diversity algorithm for genetic programming" https://link.springer.com/content/pdf/10.1007%2Fs10710-016-9281-9.pdf

Change History (20)

comment:1 Changed 2 months ago by bburlacu

  • Status changed from new to accepted

comment:2 Changed 2 months ago by bburlacu

r16218: Initial commit of hashing functionality as well as simplification rules for symbolic expression trees. As this is still in development, the public API is not yet established (all methods are public for now).

comment:3 Changed 8 weeks ago by bburlacu

r16252: Minor refactor of HashExtensions.cs to allow method chaining. Minor refactor in SymbolicExpressionTreeHash.cs.

comment:4 Changed 8 weeks ago by bburlacu

r16255:

  • Implement first version of hash-based building blocks analyzer.
  • Minor performance improvement in HashExtensions.cs.
  • Fix bug in SymbolicExpressionTreeHash.cs with simplification for Multiplication nodes inadvertently altering constant values.

comment:5 Changed 7 weeks ago by bburlacu

r16258: Simplify code in SymbolicDataAnalysisBuildingBlockAnalyzer and fix build error.

comment:6 Changed 7 weeks ago by bburlacu

r16259: Add storable constructor.

comment:7 Changed 7 weeks ago by bburlacu

r16260: Refactor HashExtensions: simplify Reduce method.

comment:8 Changed 7 weeks ago by bburlacu

r16261: Fix bug in HashUtil.ToByteArray(). Improve hashing performance (10-15% gain) by avoiding array allocations for child node indices.

comment:9 Changed 7 weeks ago by bburlacu

r16263: Refactor hashing to use unsigned long for hashes. Implement new DiversityPreservingCrossover which prevents subtrees with the same hash value from being swapped.
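
The idea behind the crossover can be sketched roughly as below. This is a hedged Python illustration, not the committed code: `pick_swap` is a hypothetical helper, and it assumes the subtree hashes of both parents have already been computed. Only pairs of crossover points whose hash values differ are eligible, so a swap never exchanges two subtrees with the same hash.

```python
import random


def pick_swap(hashes0, hashes1, rng):
    """Pick crossover points (i, j) with hashes0[i] != hashes1[j].

    hashes0 / hashes1 hold one hash per subtree of each parent.
    Returns None when every pair of subtrees has the same hash.
    """
    candidates = [
        (i, j)
        for i, h0 in enumerate(hashes0)
        for j, h1 in enumerate(hashes1)
        if h0 != h1
    ]
    return rng.choice(candidates) if candidates else None


rng = random.Random(0)
only_choice = pick_swap([1, 2], [1], rng)   # subtree 1 of parent 0 is the only one that differs
no_choice = pick_swap([1, 1], [1], rng)     # all subtrees share one hash
```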

comment:10 Changed 6 weeks ago by bburlacu

r16267: Rename HashNode.IsChild property to IsLeaf

r16270: Fix compilation error in SymbolicDataAnalysisExpressionDiversityPreservingCrossover

r16271: Fix SymbolicDataAnalysisBuildingBlockAnalyzer compilation error.

r16272: Refactor hash extensions and utility methods

  • hashes are now computed from byte[] arrays
  • Simplify() now accepts an argument specifying which hash function to use.
  • Update SymbolicDataAnalysisBuildingBlockAnalyzer and SymbolicDataAnalysisExpressionDiversityPreservingCrossover.

r16273: Improve hashing performance.

Last edited 6 weeks ago by bburlacu

comment:11 Changed 6 weeks ago by bburlacu

r16284: Add the ability to compute the structural similarity between symbolic expression trees.
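
One way such a similarity could be defined on top of the hashes is sketched below. This is an assumption for illustration, not necessarily the measure used in r16284: the hypothetical `similarity` function computes a Sørensen-Dice coefficient over the multisets of subtree hashes of two trees.

```python
from collections import Counter


def similarity(hashes_a, hashes_b):
    """Dice coefficient over two multisets of subtree hashes, in [0, 1]."""
    a, b = Counter(hashes_a), Counter(hashes_b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # two empty trees are treated as identical
    shared = sum((a & b).values())  # multiset intersection
    return 2.0 * shared / total


identical = similarity([1, 2, 3], [1, 2, 3])
disjoint = similarity([1, 2], [3, 4])
half = similarity([1, 2, 3, 4], [1, 2, 5, 6])
```

A value of 1.0 means both trees contain exactly the same subtree hashes; 0.0 means they share none.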

comment:12 Changed 5 weeks ago by gkronber

r16290: Adjusted the scaling code for SymbolicDataAnalysisModels because, with the new hashing code and simplification, we cannot assume that scale and offset nodes are at the same locations.

comment:13 Changed 5 weeks ago by bburlacu

r16291: Fix typo in ComputeAverageSimilarity

comment:14 Changed 5 weeks ago by bburlacu

r16302: Add support for strict hashing (taking constants and variable weights into account)

comment:15 Changed 4 weeks ago by gkronber

  • Is "strict hashing" still "hashing"?
  • Please try to limit changes to the trunk. I have the feeling that this feature keeps expanding. Larger features should be implemented in a branch and then merged back. The ticket was initially concerned with "Support hash-based simplification of symbolic expressions", but the more recent changes are concerned with similarities of symbolic expressions. In my view, these are separate concerns.

comment:16 Changed 4 weeks ago by bburlacu

  • Yes, "strict" is just an extra flag to take the coefficients of leaf nodes into account when we assign their initial hash value. No other changes involved.
  • Agreed, the additional operators (crossover and analyzer) should probably be moved to a branch.

comment:17 Changed 4 weeks ago by gkronber

Ok, since you still call this "hashing", I assume you create a random bit vector for each distinct real-valued constant.

  • How do you determine whether two real-valued constants are equal?
  • How do you detect that e.g. 2*1.0 should have the same hash-value as 2.0?
  • At which point is this "hashing function" quasi equivalent to evaluating the expression for a number of different random inputs?
Last edited 4 weeks ago by gkronber

comment:18 Changed 4 weeks ago by bburlacu

  • a double is 8 bytes, using one of the hash functions that takes a byte[] as input will determine that
  • of course, hashing would not return the same hash value (regardless of "strict"), unless the constants are folded in the simplification step.
  • why would evaluation be preferable? it would be much more work and not as reliable. my idea was that we already have scenarios where hashing should not return the same value (because there are some coefficients involved). i thought "strict" could be useful in some of those cases.
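
The points above can be illustrated with a small sketch. It is written in Python for brevity and is an assumption-laden illustration rather than the trunk code: `leaf_hash` and its `strict` parameter are hypothetical names, and SHA-256 stands in for whichever byte-array hash function is used. In strict mode the 8 bytes of the double are appended to the hash input, so two constants match only if they are bitwise equal.

```python
import hashlib
import struct


def leaf_hash(symbol, value=None, strict=False):
    """Hash a leaf; in strict mode the coefficient's 8 bytes are included."""
    data = symbol.encode("utf-8")
    if strict and value is not None:
        data += struct.pack("<d", value)  # IEEE 754 double, 8 bytes
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")


double_size = len(struct.pack("<d", 2.0))

# Once 2*1.0 has been folded into a single constant, it is bitwise equal to 2.0.
folded_equal = (leaf_hash("Constant", 2 * 1.0, strict=True)
                == leaf_hash("Constant", 2.0, strict=True))

# Values that differ only by rounding error are different byte strings.
rounding_differs = (leaf_hash("Constant", 0.1 + 0.2, strict=True)
                    != leaf_hash("Constant", 0.3, strict=True))

# Without strict, the coefficient is ignored entirely.
non_strict_equal = leaf_hash("Constant", 1.5) == leaf_hash("Constant", 7.0)
```

Note that bitwise comparison treats values such as 0.1 + 0.2 and 0.3 as different constants, which is the behaviour described for "strict".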

comment:19 Changed 4 weeks ago by bburlacu

r16305: Change Simplify inside HashNode to a delegate (instead of an Action) so that the nodes array can be passed as ref. This enables us to resize or alter the nodes array during simplification (e.g., by performing term expansion or similar operations).

comment:20 Changed 40 hours ago by bburlacu

r16382: Change signature of ComputeSimilarity methods to accept a generic list of trees. This enables us to directly pass HL ItemArray or ItemList without overhead.

Last edited 40 hours ago by bburlacu