Class: Ast::Merge::ContentMatchRefiner

Inherits:
MatchRefinerBase
Defined in:
lib/ast/merge/content_match_refiner.rb

Overview

Match refiner for text content-based fuzzy matching.

This refiner uses Levenshtein distance to pair nodes that have similar
but not identical text content. It’s useful for matching nodes where
the content has been slightly modified (typos, rewording, etc.).

Unlike signature-based matching, which requires exact content hashes,
this refiner pairs nodes by text similarity, so small edits do not
prevent a match. This is particularly useful for:

  • Paragraphs with minor edits
  • Headings with slight rewording
  • Comments with updated text
  • Any text-based node type
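
To make the idea of Levenshtein-based similarity concrete, here is a minimal sketch. The helper names `levenshtein_distance` and `text_similarity` are hypothetical and are not taken from the library. The normalization (one minus distance divided by the longer length) is an assumption about how a distance might be turned into a 0..1 score.

```ruby
# Hypothetical helpers for illustration; not the library's implementation.

# Classic dynamic-programming edit distance, keeping only two rows.
def levenshtein_distance(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  # previous[j] is the distance between the processed prefix of a and b[0...j]
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution
      ].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
def text_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end
```

Under this normalization, a paragraph with one typo scores close to 1.0, while two unrelated paragraphs score near 0.0, which is what a threshold such as 0.7 is meant to separate.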

Examples:

Basic usage

refiner = ContentMatchRefiner.new(threshold: 0.7)
matches = refiner.call(template_nodes, dest_nodes)

With specific node types

# Only match paragraphs and headings
refiner = ContentMatchRefiner.new(
  threshold: 0.6,
  node_types: [:paragraph, :heading]
)

With custom content extractor

refiner = ContentMatchRefiner.new(
  threshold: 0.7,
  content_extractor: ->(node) { node.text_content.downcase.strip }
)

Combined with other refiners

merger = SmartMerger.new(
  template,
  destination,
  match_refiner: [
    ContentMatchRefiner.new(threshold: 0.7, node_types: [:paragraph]),
    TableMatchRefiner.new(threshold: 0.5)
  ]
)

Constant Summary

DEFAULT_WEIGHTS =

Default weights for content similarity scoring

{
  content: 0.7,   # Text content similarity (Levenshtein)
  length: 0.15,   # Length similarity
  position: 0.15, # Position similarity in document
}.freeze
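
The page does not show how these weights are applied. The sketch below assumes the final score is a weighted sum of three components that each lie in 0..1. The helpers `length_similarity`, `position_similarity`, and `weighted_score` are hypothetical names, and the formulas are one plausible choice rather than the library's own.

```ruby
# Illustrative only; the component formulas are assumptions.
WEIGHTS = { content: 0.7, length: 0.15, position: 0.15 }.freeze

# Ratio of the shorter text length to the longer one.
def length_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  [a.length, b.length].min.to_f / longest
end

# Compares relative positions (0.0 = first node, 1.0 = last node).
def position_similarity(t_idx, d_idx, total_t, total_d)
  t_rel = total_t > 1 ? t_idx.to_f / (total_t - 1) : 0.0
  d_rel = total_d > 1 ? d_idx.to_f / (total_d - 1) : 0.0
  1.0 - (t_rel - d_rel).abs
end

# Weighted sum over the component scores.
def weighted_score(components, weights = WEIGHTS)
  weights.sum { |key, weight| weight * components.fetch(key) }
end
```

With the default weights, text similarity dominates: two nodes with identical text but very different lengths and positions cannot drop below 0.7 under this assumed formula.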

Constants inherited from MatchRefinerBase

MatchRefinerBase::DEFAULT_THRESHOLD

Instance Attribute Summary

Attributes inherited from MatchRefinerBase

#node_types, #threshold

Instance Method Summary

Methods inherited from MatchRefinerBase

#handles_type?

Constructor Details

#initialize(threshold: DEFAULT_THRESHOLD, node_types: [], weights: {}, content_extractor: nil, **options) ⇒ ContentMatchRefiner

Initialize a content match refiner.

Parameters:

  • threshold (Float) (defaults to: DEFAULT_THRESHOLD)

    Minimum score to accept a match (default: 0.5)

  • node_types (Array<Symbol>) (defaults to: [])

    Node types to process (empty = all)

  • weights (Hash) (defaults to: {})

    Custom scoring weights

  • content_extractor (Proc, nil) (defaults to: nil)

    Custom function to extract text from nodes
    Should accept a node and return a String

  • options (Hash)

    Additional options for forward compatibility



# File 'lib/ast/merge/content_match_refiner.rb', line 69

def initialize(
  threshold: DEFAULT_THRESHOLD,
  node_types: [],
  weights: {},
  content_extractor: nil,
  **options
)
  super(threshold: threshold, node_types: node_types, **options)
  @weights = DEFAULT_WEIGHTS.merge(weights)
  @content_extractor = content_extractor
end
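
Because the constructor combines the weights with `Hash#merge`, a partial `weights:` hash overrides only the keys it names and the result is not renormalized at this point. The standalone sketch below reproduces that merge with plain hashes. `effective_weights` is a hypothetical helper, and whether scoring later rescales the weights is not shown in this section.

```ruby
# Standalone reproduction of the merge performed in #initialize.
DEFAULT_WEIGHTS = { content: 0.7, length: 0.15, position: 0.15 }.freeze

# Hypothetical helper mirroring `DEFAULT_WEIGHTS.merge(weights)`.
def effective_weights(custom = {})
  DEFAULT_WEIGHTS.merge(custom)
end

weights = effective_weights(content: 0.9)
# Only :content changes; :length and :position keep their defaults,
# so the weights now total about 1.2 instead of 1.0.
puts weights.inspect
puts weights.values.sum.round(2)
```

If the weights are meant to total 1.0, pass all three keys explicitly rather than relying on the defaults for the rest.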

Instance Attribute Details

#content_extractor ⇒ Proc? (readonly)

Returns the custom content extraction function, or nil if none was given.

Returns:

  • (Proc, nil)

    Custom content extraction function



# File 'lib/ast/merge/content_match_refiner.rb', line 59

def content_extractor
  @content_extractor
end

#weights ⇒ Hash (readonly)

Returns the scoring weights (DEFAULT_WEIGHTS merged with any custom weights).

Returns:

  • (Hash)

    Scoring weights



# File 'lib/ast/merge/content_match_refiner.rb', line 56

def weights
  @weights
end

Instance Method Details

#call(template_nodes, dest_nodes, context = {}) ⇒ Array<MatchResult>

Find matches between unmatched nodes based on content similarity.

Parameters:

  • template_nodes (Array)

    Unmatched nodes from template

  • dest_nodes (Array)

    Unmatched nodes from destination

  • context (Hash) (defaults to: {})

    Additional context (may contain :template_analysis, :dest_analysis)

Returns:

  • (Array<MatchResult>)

    Array of content-based matches



# File 'lib/ast/merge/content_match_refiner.rb', line 87

def call(template_nodes, dest_nodes, context = {})
  template_filtered = filter_nodes(template_nodes)
  dest_filtered = filter_nodes(dest_nodes)

  return [] if template_filtered.empty? || dest_filtered.empty?

  # Build position information for scoring
  total_template = template_filtered.size
  total_dest = dest_filtered.size

  greedy_match(template_filtered, dest_filtered) do |t_node, d_node|
    t_idx = template_filtered.index(t_node) || 0
    d_idx = dest_filtered.index(d_node) || 0

    compute_content_similarity(
      t_node,
      d_node,
      t_idx,
      d_idx,
      total_template,
      total_dest,
    )
  end
end
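
`greedy_match` is inherited from MatchRefinerBase and its source is not shown here. As a rough mental model only, a greedy matcher can score every pair with the block, sort the candidates by score, and accept the best pairs whose nodes are still unused. The method `greedy_pairs` below is a hypothetical stand-in that returns plain arrays rather than MatchResult objects.

```ruby
# Hypothetical sketch of greedy pairing; not MatchRefinerBase#greedy_match.
def greedy_pairs(left, right, threshold: 0.5)
  # Score every pair once and keep those at or above the threshold.
  candidates = []
  left.each_with_index do |l_item, li|
    right.each_with_index do |r_item, ri|
      score = yield(l_item, r_item)
      candidates << [score, li, ri] if score >= threshold
    end
  end

  used_left = {}
  used_right = {}
  pairs = []
  # Highest score first; ties are broken by original order.
  candidates.sort_by { |score, li, ri| [-score, li, ri] }.each do |score, li, ri|
    next if used_left[li] || used_right[ri]

    used_left[li] = true
    used_right[ri] = true
    pairs << [left[li], right[ri], score]
  end
  pairs
end
```

A usage example with a toy scoring block:

```ruby
toy_score = lambda do |a, b|
  next 1.0 if a == b

  a[0] == b[0] ? 0.6 : 0.0
end

result = greedy_pairs(%w[apple banana], %w[berry apple]) { |a, b| toy_score.call(a, b) }
puts result.inspect
```

Each node appears in at most one pair, and a lower-scoring candidate is accepted only if both of its nodes are still free.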