Dhrusham Patel ([email protected])

The objective of tablediff is to compare two tables:
an old table and a new table, where the new table is assumed
to have been derived from the old table by way of row-edits,
insertions and deletions. The result should
therefore identify and distinguish between these types of
modification. To achieve this, the function compares two
tables and returns a pair of aligned tables. In
addition to aligning the matched rows, the aligned tables
contain empty rows corresponding to insertions and
deletions.
The process of matching rows is based on solving the Longest
Common Subsequence (LCS) problem[1] for
rows of the tables.

Here is an example of the function in operation:

old new2  (old tablediff new)
  A  A  A    v  v  v     A  A  A
  B  B  B    w  w  w                v  v  v
  C  C  C    -  B  B                w  w  w
  D  D  D    C  -  C     B  B  B    -  B  B
  E  E  E    -  D  -     C  C  C    C  -  C
  F  F  F    x  x  x     D  D  D    -  D  -
  G  G  G    y  y  y     E  E  E
  H  H  H    H  H  H     F  F  F
  I  I  I    D  I  D     G  G  G
  J  J  J    M  M  M                x  x  x
  K  K  K    z  z  z                y  y  y
  L  L  L                H  H  H    H  H  H
  M  M  M                I  I  I    D  I  D
  N  N  N                J  J  J
  O  O  O                K  K  K
                         L  L  L
                         M  M  M    M  M  M
                         N  N  N
                         O  O  O
                                    z  z  z

As can be seen above: rows align according to where there is
a match, exact or partial. Empty rows in the old and new tables represent row
insertions and row deletions respectively. Note that partial
matches have also been aligned; representing where rows have
been edited. In particular, notice the instances where
partial matches have been aligned despite better matches
being available, as doing so would yields a better alignment
for the tables as a whole.

The cells contain strings, i.e. character vectors.
In this example all the strings have length 1,
merely for convenience in presentation.

The key variation in this function from the standard LCS problem
is that it tolerates partial matches, thereby allowing the function
to trace minor row edits. By calculating degree of match
in the range 0 – 1, we find the highest-scoring common subsequence.

The strategy is to match rows in new to their originals in
old; then expand the two tables to align them.

The function has four steps:

  1. Find row matches: both partial and exact.
  2. Generate and select candidate solutions to evaluate.
  3. Find the best match: i.e. the highest-scoring common subsequence.
  4. Align matching rows by creating a pair of boolean expansion vectors.

Step 1: Tabulate all row matches

      matches←(↓old)∘.≡↓new

Split the tables and use an outer-product match to find
which rows in new match which rows in
old. But we want to honour partial matches,
where a row has been edited.

      matches←(↓old)∘.(≡¨)↓new

Now each cell of the result is a 3-element boolean. Sum each
cell and divide by the number of columns to get a score for
each match in the range 0 – 1.

      rnew cnew←⍴new
      ms←(+/¨(↓old)∘.(≡¨)↓new)÷cnew ⍝ match scores

But this can be written more simply[2]:

      ms←(old+.≡⍉new)÷cnew ⍝ match scores

The resulting table of row match scores ms represents the
likeness of each row from the old table compared with each
row of the new table.

Step 2: Generate and select candidate solutions to evaluate

Figure 1: Two solution paths through the match scores table.

The above two figures show two possible match paths.
Origin 0, the first matches rows 2 3 4 7 8 9 of
new to rows 1 2 3 7 8 12 of old.
The second matches rows 2 3 7 8 of new to 1 2 3
12 of old.

Because rows are not moved (only inserted, deleted or
edited) a match path must specify progressively rising
indexes of old and new.

The challenge is to identify all possible match paths and
find the highest-scoring.

We start by tabulating all possible selections of rows of
new. In the figures above, the selections from
new correspond to columns with matches: in both
cases the selection is 0 0 1 1 1 0 0 1 1 1 0.

The expression (rnew/2)⊤⍳2*rnew gives all
possible selections from new:

      disp←{'.⎕'[⍵]}
      disp 60↑[1] (rnew/2)⊤⍳2*rnew ⍝ first 60 of 2048 cols
............................................................
............................................................
............................................................
............................................................
............................................................
...............................⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
...............⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕................⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
.......⎕⎕⎕⎕⎕⎕⎕⎕........⎕⎕⎕⎕⎕⎕⎕⎕........⎕⎕⎕⎕⎕⎕⎕⎕........⎕⎕⎕⎕⎕
...⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕
.⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕.
⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.

We can eliminate selections that select rows of
new that have no matches at all.

      disp {⍵/⍨∧⌿~(~∨⌿×ms)⌿⍵} (rnew/2)⊤⍳2*rnew
................................................................
................................................................
................................⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
................⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕................⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
........⎕⎕⎕⎕⎕⎕⎕⎕........⎕⎕⎕⎕⎕⎕⎕⎕........⎕⎕⎕⎕⎕⎕⎕⎕........⎕⎕⎕⎕⎕⎕⎕⎕
................................................................
................................................................
....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕....⎕⎕⎕⎕
..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕..⎕⎕
.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕.⎕
................................................................

That reduces the number of selections to test from
2048 to 64.

The more matches the better. So we’ll try the most promising
selections first.

      disp sstt←{⍵[;⍒+⌿⍵]} {⍵/⍨∧⌿~(~∨⌿×ms)⌿⍵} (rnew/2)⊤⍳2*rnew
................................................................
................................................................
⎕.⎕⎕⎕⎕⎕.....⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕..........⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕..........⎕⎕⎕⎕⎕.....⎕.
⎕⎕.⎕⎕⎕⎕.⎕⎕⎕⎕....⎕⎕⎕⎕⎕⎕....⎕⎕⎕⎕⎕⎕......⎕⎕⎕⎕......⎕⎕⎕⎕....⎕....⎕..
⎕⎕⎕.⎕⎕⎕⎕.⎕⎕⎕.⎕⎕⎕...⎕⎕⎕.⎕⎕⎕...⎕⎕⎕...⎕⎕⎕...⎕...⎕⎕⎕...⎕...⎕....⎕...
................................................................
................................................................
⎕⎕⎕⎕.⎕⎕⎕⎕.⎕⎕⎕.⎕⎕.⎕⎕..⎕⎕.⎕⎕.⎕⎕..⎕.⎕⎕..⎕..⎕..⎕⎕..⎕..⎕...⎕....⎕....
⎕⎕⎕⎕⎕.⎕⎕⎕⎕.⎕⎕⎕.⎕⎕.⎕.⎕.⎕⎕.⎕⎕.⎕.⎕.⎕.⎕.⎕..⎕..⎕.⎕.⎕..⎕...⎕....⎕.....
⎕⎕⎕⎕⎕⎕.⎕⎕⎕⎕.⎕⎕⎕.⎕⎕.⎕..⎕⎕⎕.⎕⎕.⎕..⎕⎕.⎕..⎕...⎕⎕.⎕..⎕...⎕....⎕......
................................................................

From 2048 possible selections of rows of new,
we’ve found 64 to evaluate, and sorted the most promising to
the left.

Step 3: Find best solution

A candidate solution is a selection of rows from
new, all of which match rows from
old.
Because table rows have not been reordered, the solution
must pick out successive rows of old.

Each selection of new will produce zero or more
match paths.
The criterion that row matches are monotonically rising
(i.e. terms of solution must be consecutive, but not
necessarily contiguous, terms of the original sequences)
means that it is possible for there to be zero match paths.
The longest match path corresponds to the longest common
subsequence.

Scoring the matches introduces an additional variable for
consideration: we now want the highest-scoring common
subsequence.
Note that if there are partial row matches the longest
common subsequence is not necessarily the highest-scoring
common subsequence.

Each selection of rows of new gives a corresponding
selection of the match-scores table, in which to look for
match paths. The soln function returns the highest-scoring
match path, and its score.

soln←{
    ⍝ best solution (origin-0) and score for match scores ⍵ (matrix)
     ⎕IO←0
     where←{⍵/⍳⍴⍵}
     cmbn←{↑,⊃∘.,/⍵,⊂⊂⍬}                   ⍝ combine lists
     rr←{∧/↑>/1 ¯1↓[1]¨⊂⍵}                 ⍝ rising rows
     mrr←{⍵⌿⍨(rr ⍵)∧∧/⍵=⌈\⍵}               ⍝ monotonically rising rows
     rows←mrr cmbn where¨↓[0]×⍵            ⍝ solns are sequences of rows
     0∊⍴rows:⍬ 0                           ⍝ no solution, zero score
     nc←⊃⌽⍴⍵                               ⍝ count cols in ⍵
     scores←+/(,⍵)[(rows×nc)+[1]⍳nc]       ⍝ score by soln
     (scores⍳⌈/scores)∘⊃¨(↓rows)(scores)
 }

To find the highest-scoring match path we apply
soln to the selections in turn, starting with
the most promising.
But we don’t need to evaluate every selection. We can stop
when the next selection could not produce a higher-scoring
match path. We don’t need to apply soln to see that. If the
next selection has only four flags and the highest score so
far is 4.66, then we need look no further, because the
maximum score for a selection with four flags is 4.

i←0 ⋄ end←↑⌽⍴sstt ⋄ (sc mo mn)←0 ⍬ ⍬      ⍝ score maskold masknew
:while sc<+/sstt[;i]                      ⍝ scope to improve?
    ∆←(soln ∆/ms),⊂∆←sstt[;i]             ⍝ evaluate next
    (sc mo mn)←(sc<↑∆) ⌽ (sc mo mn) ∆
:until end=i←i+1

After the loop, mo and mn contain
a pair of boolean vectors which represent the positions of
matching row numbers for each of the old and new tables.

      mo mn    ⍝  match booleans
┌→────────────────────────────────────────────────────────┐
│ ┌→────────────────────────────┐ ┌→────────────────────┐ │
│ │0 1 1 1 0 0 0 1 1 0 0 0 1 0 0│ │0 0 1 1 1 0 0 1 1 1 0│ │
│ └~────────────────────────────┘ └~────────────────────┘ │
└∊────────────────────────────────────────────────────────┘

Step 4: Align matching rows

Once the best solution is found, our final step is to create
two boolean expansion vectors that will align the two tables.

Three ways to do this:

Looping function
This function recognises patterns in the
two booleans mo and mn and
assembles (column-wise) a 2-row table representing a
pair of expansion vectors.

Z←2 0⍴⍬
:Repeat
  :Select ⊃¨mo mn
     :Case 0 0 ⋄ x←1 0   ⍝ Row deleted
     :Case 0 1 ⋄ x←1 0   ⍝ Row deleted
     :Case 1 0 ⋄ x←0 1   ⍝ Row inserted
     :Case 1 1 ⋄ x←1 1   ⍝ match
  :EndSelect
 Z,←x
 (mo mn)↓⍨¨←x
:Until ∨/⊃¨0=⍴¨ mo mn
Z,←↑~ mo mn
expansion←↓Z

The select structure above
represents a mapping that can be expressed as a
short D function:

x←{0 0≡⍵:1 0 ⋄ ⌽⍵}⊃¨ mo mn
Recursive function
The loop can be rewritten concisely as a D function with tail recursion.

stephen←{
         0∊≢¨⍵:↑~⍵
         x←(0 1)(1 1)(1 0)⊃⍨(1 0)(1 1)⍳⊂⊃¨⍵  ⍝ inserted, preserved, deleted
         (⍪x),∇ x↓¨⍵
     }
expansion←↓stephen mo mn
Morten’s non-looping function
morten←{
     rn← ≢ ¨⍵
     ord←⍋(2×∊+\¨⍵)-(∊⍵)
     O m←(⊂⊂ord)⌷¨(rn/0 1)(∊⍵)
     {((~m∧O≠⍵)/O=⍵)}¨0 1
 }
expansion←morten mo mn

This function takes the booleans as its right
argument. It “orders the enlisted booleans so
matching rows are adjacent.” Morten has discussed
his function in more detail over on a blog post
[3].

Finally, the resulting expansion vectors are used to align the tables.

      ,¨/expansion (expansion⍀¨old new)
  1  A  A  A   0
  0            1  v  v  v
  0            1  w  w  w
  1  B  B  B   1  -  B  B
  1  C  C  C   1  C  -  C
  1  D  D  D   1  -  D  -
  1  E  E  E   0
  1  F  F  F   0
  1  G  G  G   0
  0            1  x  x  x
  0            1  y  y  y
  1  H  H  H   1  H  H  H
  1  I  I  I   1  D  I  D
  1  J  J  J   0
  1  K  K  K   0
  1  L  L  L   0
  1  M  M  M   1  M  M  M
  1  N  N  N   0
  1  O  O  O   0
  0            1  z  z  z

Listing

Putting the four steps together:

∇ Z←old tablediff new;⎕IO;rnew;cnew;ms;aps;wps;sstt;i;end;sc;mo;mn;∆
 ⎕IO←0
 rnew cnew←⍴new

 ⍝ 1. Tabulate match scores
 ms←cnew÷⍨old+.≡⍉new

 ⍝ 2. Generate subsequences to test
 aps←{(⍵/2)⊤⍳2*⍵}                                ⍝ all possible selections
 wps←{⍵/⍨∧⌿~(~∨⌿×ms)⌿⍵}                          ⍝ with possible solutions
 sstt←{⍵[;⍒+⌿⍵]} wps aps rnew                    ⍝ subsequences to test

 ⍝ 3. Select subsequence with highest score
 i←0 ⋄ end←↑⌽⍴sstt ⋄ (sc mo mn)←0 ⍬ ⍬            ⍝ score, maskold, masknew
 :while sc<+/sstt[;i]                            ⍝ scope to improve?
    ∆←(soln ∆/ms),⊂∆←sstt[;i]                    ⍝ evaluate next
    (sc mo mn)←(sc<↑∆) ⌽ (sc mo mn) ∆
 :until end=i←i+1

 ⍝ 4. Align the tables vertically
 Z←old new⍀⍨¨morten mo mn
∇

Scope for improvement

  1. The problem is symmetrical. That is:
    old tablediff new ←→ ⌽ new tablediff old
    

    The longest common subsequence cannot be longer than
    ⌊/⊃¨⍴¨old new. So only the shorter table should
    be searched for subsequences. In this example new
    is the shorter table.
    tablediff can be improved by switching
    old and new if the latter
    is longer then correspondingly reversing the
    2-element result.

  2. cmbn←{↑,⊃∘.,/⍵,⊂⊂⍬} is space hungry and
    inefficient for large tables. Instead, Morten recommends
    a depth-first search.[4].
  3. The match scores might be calculated faster if the
    strings were first hashed to integers. Or, if there are
    many repeated elements, to indexes to a list of the
    unique elements.

Acknowledgements

I am indebted to Morten Kromberg of Dyalog, my colleague Stephen Taylor,
Mike Thomas of Bedarra and Arthur Whitney of Kx
for help with this work.
Any errors are of course mine.

References


  1. Longest Common Subsequence
    en.wikipedia.org/wiki/Longest_common_subsequence_problem

  2. I have Morten Kromberg to thank for spotting this equivalence.
    Watch out: this inner product returns wrong results
    in both APL+Win 12.0 and APLX 4.1.6.
    In these interpreters use the longer outer-product expression,
    which returns a correct result.

  3. Morten’s blog post:
    dyalog.com/blog/2014/07/aligning-diff-output-2/

  4. John Scholes on Depth-first searching in D
    youtube.com/watch?v=DsZdfnlh_d0