How to Compare Text Differences? A Complete Guide to Diff Algorithms and Text Comparison Tools

Every time you save a document, commit code, or send a contract for review, one question inevitably follows: "What exactly changed between this version and the last?" Text diffing — commonly called "diff" — is the technical answer to that question. From the Unix diff command to Git commit history to the Track Changes feature in word processors, the same core logic runs underneath them all. This guide takes you from algorithm fundamentals to practical real-world applications.

1. What Is a Diff?

A diff (short for "difference") is a computation that finds the minimal set of changes between two pieces of text. Given an "old version" (A) and a "new version" (B), a diff algorithm identifies:

  • Which lines (or characters) were added in B
  • Which lines (or characters) were deleted from A
  • Which lines (or characters) remain unchanged

The goal is to find the smallest such change set. This is a form of the classic "Minimum Edit Distance" problem, restricted to insertions and deletions, and it is the optimization target for most diff algorithms.
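As a rough illustration of the three categories above, the sketch below sorts lines into added, deleted, and unchanged using Python's standard difflib module. The function name classify_lines is made up for this example, and difflib's matcher does not guarantee a minimal change set, so treat it as a demonstration of the output shape rather than of any specific diff algorithm.

```python
from difflib import SequenceMatcher


def classify_lines(old, new):
    """Return (added, deleted, unchanged) lists for two lists of lines."""
    added, deleted, unchanged = [], [], []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            unchanged.extend(old[i1:i2])
        else:
            # "delete" has an empty new range, "insert" an empty old range,
            # and "replace" contributes to both lists.
            deleted.extend(old[i1:i2])
            added.extend(new[j1:j2])
    return added, deleted, unchanged


added, deleted, unchanged = classify_lines(
    ["alpha", "beta", "gamma"],
    ["alpha", "gamma", "delta"],
)
print("added:", added)
print("deleted:", deleted)
print("unchanged:", unchanged)
```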

Terminology note
"Diff" can refer to both the technique itself and its output (a report describing changes). As a verb: "Let me diff these two files." As a noun: "This diff has three changes."

2. Core Algorithms: LCS and Myers Diff

2.1 Longest Common Subsequence (LCS)

Most diff algorithms are founded on the Longest Common Subsequence (LCS) problem. An LCS is the longest sequence that appears in both inputs while preserving relative order.

Example:

  • Sequence A: cat, dog, fish, bird
  • Sequence B: cat, fish, rabbit, bird
  • LCS: cat, fish, bird (length 3)

Once the LCS is found, elements in A that are not in the LCS are "deletions," and elements in B that are not in the LCS are "insertions." The standard dynamic-programming solution runs in O(mn) time and uses O(mn) memory, where m and n are the lengths of the two inputs.
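The dynamic-programming approach can be sketched as follows, using the animal example above. The function name lcs_diff and the (" ", "-", "+") markers are choices made for this illustration; when several equally long subsequences exist, a different tie-breaking rule would give a different but equally valid result.

```python
def lcs_diff(a, b):
    """Return a list of (marker, item) pairs derived from an LCS table."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start, emitting common items, deletions,
    # and insertions in order.
    ops, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", item) for item in a[i:])
    ops.extend(("+", item) for item in b[j:])
    return ops


for marker, item in lcs_diff(["cat", "dog", "fish", "bird"],
                             ["cat", "fish", "rabbit", "bird"]):
    print(marker, item)
```

For the two sequences above, the common items are cat, fish, and bird, with dog reported as a deletion and rabbit as an insertion.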

2.2 The Myers Diff Algorithm

In 1986, Eugene Myers published a diff algorithm significantly more efficient than a pure LCS approach — now known as Myers diff. Its key properties:

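  • Worst-case time complexity of O(ND), where N is the combined length of the two inputs and D is the number of differences (the size of the shortest edit script); a refinement described in the same paper runs in O(N + D²) expected time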
  • When differences are few (the most common real-world case), it dramatically outperforms O(mn)
  • Tends to produce diffs that "preserve the original structure," making them more readable

Myers diff is the default algorithm in Git, GNU diff, and most mainstream diff tools.
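To show why the running time depends on the number of differences, here is a minimal sketch of the greedy forward search described in Myers' paper. It returns only D, the number of insertions plus deletions, and does not reconstruct the edit script. The function name is an assumption for this example, and production tools add heuristics and a linear-space refinement that are not shown here.

```python
def myers_edit_distance(a, b):
    """Return the length of the shortest insert/delete script from a to b."""
    n, m = len(a), len(b)
    # furthest[k] = largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # move down: take an item from b
            else:
                x = furthest[k - 1] + 1    # move right: drop an item from a
            y = x - k
            # Follow the "snake": consume matching items at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: d = n + m always suffices


print(myers_edit_distance(["cat", "dog", "fish", "bird"],
                          ["cat", "fish", "rabbit", "bird"]))  # prints 2
```

The outer loop stops as soon as a path with d edits reaches the end of both inputs, so nearly identical files finish after only a few rounds.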

2.3 Patience Diff

Patience diff, designed by Bram Cohen (creator of BitTorrent), is available in Git via --diff-algorithm=patience. Its core idea:

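  1. Identify "unique" lines, those that appear exactly once in each version, as candidate anchor points
  2. Keep the longest run of anchors that appear in the same order in both versions, and split the files into segments around them
  3. Recursively apply the same procedure to each segment, falling back to an ordinary diff when a segment contains no unique lines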

In practice, Patience diff often produces more readable output than Myers diff when blocks of code have been inserted or reordered, because it anchors on distinctive lines rather than on incidental matches such as blank lines or closing braces.
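The anchor-selection part of this idea can be sketched as below. The function names are invented for this example, and only the anchor step is shown: a full implementation would go on to diff the segments between anchors recursively. The longest in-order run is found with the binary-search technique that gives patience sorting its name.

```python
from bisect import bisect_left
from collections import Counter


def unique_common_lines(a, b):
    """Pairs (i, j) where a[i] == b[j] and the line occurs once in each input."""
    count_a, count_b = Counter(a), Counter(b)
    index_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    return [(i, index_b[line]) for i, line in enumerate(a)
            if count_a[line] == 1 and line in index_b]


def patience_anchors(a, b):
    """Longest run of unique-line matches that is in order in both inputs."""
    pairs = unique_common_lines(a, b)  # already sorted by position in a
    tails = []      # tails[k] = smallest j that ends an increasing run of length k + 1
    tail_pos = []   # index into pairs of the entry stored in tails[k]
    prev = [None] * len(pairs)
    for p, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tail_pos.append(p)
        else:
            tails[k] = j
            tail_pos[k] = p
        prev[p] = tail_pos[k - 1] if k > 0 else None

    # Follow the back-pointers from the end of the longest run.
    anchors = []
    p = tail_pos[-1] if tail_pos else None
    while p is not None:
        anchors.append(pairs[p])
        p = prev[p]
    return anchors[::-1]


old = ["A", "x", "B", "x", "C"]
new = ["C", "A", "x", "x", "B"]
print(patience_anchors(old, new))  # prints [(0, 1), (2, 4)]
```

In this example "x" is ignored because it repeats, and "C" is dropped as an anchor because keeping it would break the ordering of "A" and "B".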

Algorithm                  Time Complexity            Best For                             Default In
LCS (dynamic programming)  O(mn)                      Theoretical foundation, small files  -
Myers diff                 O(ND); O(N + D²) expected  General code comparison              Git, GNU diff
Patience diff              O(n log n)                 Code refactoring, function moves     Bazaar; Git (optional)
Histogram diff             O(n log n)                 Extension of patience diff           Git (optional)

3. The Unified Diff Format

Regardless of which algorithm is used, diff output is most commonly presented in the Unified diff format — the industry standard and the default for git diff.

A typical Unified diff looks like this:

--- a/hello.txt
+++ b/hello.txt
@@ -1,5 +1,5 @@
 Line one unchanged
-Old line two
+New line two
 Line three unchanged
-Deleted line four
 Line five unchanged
+New line six

Format breakdown:

  • ---: The old version (a)
  • +++: The new version (b)
  • @@: Hunk header; -1,5 means the old version starts at line 1 and spans 5 lines; +1,5 for the new version
  • Lines starting with a space: Unchanged context lines (3 shown by default)
  • Lines starting with -: Deleted lines
  • Lines starting with +: Inserted lines
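The breakdown above can be checked mechanically. The following sketch parses a hunk header and counts the body lines that belong to each side; the function names and the regular expression are written for this example and cover only the header fields discussed here (an omitted line count is read as 1, as the format allows).

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for a hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


def count_hunk_lines(body):
    """Count the lines of a hunk body that belong to the old and new versions."""
    old = sum(1 for line in body if line[:1] in (" ", "-"))
    new = sum(1 for line in body if line[:1] in (" ", "+"))
    return old, new


body = [
    " Line one unchanged",
    "-Old line two",
    "+New line two",
    " Line three unchanged",
    "-Deleted line four",
    " Line five unchanged",
    "+New line six",
]
print(parse_hunk_header("@@ -1,5 +1,5 @@"))  # prints (1, 5, 1, 5)
print(count_hunk_lines(body))                # prints (5, 5)
```

Context lines count toward both sides, which is why the example hunk spans 5 lines in each version even though it contains two deletions and two insertions.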

4. Character-Level vs. Line-Level Comparison

Standard diff works at the line level, but many tools offer finer granularity:

Mode            Unit              Best For
Line diff       Entire line       Code, config files, general documents
Word diff       Word              Natural language text, Markdown
Character diff  Single character  Minor spelling corrections, legal contracts

Git supports word-level diffing with git diff --word-diff. Online text comparison tools typically highlight both line-level and character-level changes simultaneously, so you can see exactly what was modified at a glance.
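A word-level comparison can be approximated by splitting each text into words before matching. The sketch below does this with Python's difflib and marks removals as [-...-] and additions as {+...+}, in the style of Git's plain word-diff output. The function name word_diff is an assumption, whitespace is collapsed by the split, and the matching is difflib's rather than Git's, so results may differ from git diff --word-diff on harder inputs.

```python
from difflib import SequenceMatcher


def word_diff(old, new):
    """Return one string marking removed and added words between two texts."""
    a, b = old.split(), new.split()
    out = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 < i2:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("the quick brown fox", "the slow brown fox"))
# prints: the [-quick-] {+slow+} brown fox
```

Passing list(old) and list(new) instead of the split words would give a character-level variant of the same idea.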

5. Practical Use Cases

5.1 Version Control (Git)

Git stores each commit as a snapshot of the project and computes diffs on demand by comparing snapshots. Commands like git diff, git show, and git log -p let you inspect those changes. Pull request review interfaces on GitHub and GitLab are essentially visual diff viewers with collaboration features layered on top.

5.2 Document Revision (Track Changes)

Microsoft Word's "Track Changes" and Google Docs' "Suggesting" mode present edits in a diff-like form: they record who changed what and when, letting reviewers accept or reject each modification individually.

5.3 Contracts and Legal Documents

Contract revisions require attorneys to verify every word change. Character-level diffing makes subtle phrasing changes like "originally" → "previously" immediately visible, reducing the risk that a modification is overlooked.

5.4 Config Files and Infrastructure as Code (IaC)

Tools like Terraform (terraform plan) and Ansible (check mode with --diff) display the pending changes before applying them, letting engineers confirm which resources will be created, modified, or destroyed and catch accidental operations early.

5.5 Data Comparison

Comparing CSV or JSON exports across time periods using diff quickly surfaces which fields were modified and which records were added or removed — ideal for data auditing and debugging.
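For structured exports, comparing records by a key is often more useful than comparing raw lines, because row order may change between exports. The sketch below is one such key-based comparison, not a line diff; the function name diff_records and the "id" key are assumptions made for this example, and the input is expected to be a list of dictionaries such as json.load or csv.DictReader would produce.

```python
def diff_records(old, new, key="id"):
    """Return (added_keys, removed_keys, changed) for two lists of dict records.

    changed maps each shared key to the sorted field names whose values differ.
    """
    old_by_key = {record[key]: record for record in old}
    new_by_key = {record[key]: record for record in new}

    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())

    changed = {}
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        before, after = old_by_key[k], new_by_key[k]
        fields = sorted(f for f in before.keys() | after.keys()
                        if before.get(f) != after.get(f))
        if fields:
            changed[k] = fields
    return added, removed, changed


january = [{"id": 1, "name": "Ann", "plan": "free"},
           {"id": 2, "name": "Bob", "plan": "pro"}]
february = [{"id": 2, "name": "Bob", "plan": "team"},
            {"id": 3, "name": "Cy", "plan": "free"}]
print(diff_records(january, february))
# prints ([3], [1], {2: ['plan']})
```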

6. How to Use an Online Text Diff Tool

Online text diff tools (like the one on this site) require no installation and are perfect for quick plain-text comparisons:

  1. Paste your text: Put the old version in the left panel and the new version in the right panel
  2. Choose comparison mode: Line-level or character-level (depending on the tool)
  3. Review the results: Deleted content is highlighted in red, added content in green, unchanged text stays neutral
  4. Navigate changes: Use "Previous" / "Next" buttons to jump between each difference

Online tools are especially useful for non-technical users who need document comparison without learning command-line tools.

7. Common Questions

7.1 Can diff compare images or PDFs?

Standard diff is a plain-text tool and cannot directly compare binary files such as images or PDFs. PDFs must first be converted to plain text for text comparison. Image comparison requires specialized visual diff tools that compare pixel values — a fundamentally different concept from text diff.

7.2 Why does the diff output sometimes look unexpected?

The most common causes:

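  • Whitespace differences: Indentation style (tabs vs. spaces) or line endings (Windows CRLF vs. Unix LF) can generate false positives. Use -w (ignore all whitespace) or -b (ignore changes in the amount of whitespace) to filter them out. For line endings alone, GNU diff offers --strip-trailing-cr and git diff offers --ignore-cr-at-eol.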
  • Encoding mismatch: Files using different character encodings (e.g., UTF-8 vs. Latin-1) should be normalized to the same encoding before comparing.
  • Algorithm choice: Different algorithms "cut" the same difference differently. Switching algorithms sometimes yields a more intuitive result.
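When the comparison is done in a script rather than with command-line flags, the first cause can be handled by normalizing both inputs first. The sketch below is one way to do that; the function name and the collapse_whitespace parameter are assumptions for this example, and the whitespace handling is only a rough approximation that does not reproduce the exact behavior of any diff flag.

```python
def normalize(text, collapse_whitespace=False):
    """Split text into lines after unifying line endings.

    With collapse_whitespace=True, runs of spaces and tabs become a single
    space and leading/trailing whitespace is dropped from every line.
    """
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if collapse_whitespace:
        lines = [" ".join(line.split()) for line in lines]
    return lines


windows_text = "x = 1\r\ny = 2\r\n"
unix_text = "x = 1\ny = 2\n"
print(normalize(windows_text) == normalize(unix_text))  # prints True
```

For the encoding case, decoding both files to Python strings with their correct encodings before comparing serves the same purpose.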

7.3 How are diff and merge related?

Merge is diff taken one step further: when two people independently modify the same original file, a three-way merge tool diffs each modified version against the common original, then attempts to combine both change sets. When both sets of modifications touch the same or adjacent lines, the result is a "merge conflict" that requires manual resolution.
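The conflict check at the heart of this process can be sketched in a few lines. The example below only detects whether two sets of changes collide; it does not produce a merged file. The function names are assumptions, the matching comes from Python's difflib rather than from any particular merge tool, and changes that merely touch each other are treated as colliding, which is a deliberately conservative choice.

```python
from difflib import SequenceMatcher


def changed_ranges(base, other):
    """Ranges (start, end) of base lines that differ in other."""
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _, _ in matcher.get_opcodes()
            if tag != "equal"]


def has_conflict(base, ours, theirs):
    """True if both sides changed overlapping or touching regions of base."""
    for start1, end1 in changed_ranges(base, ours):
        for start2, end2 in changed_ranges(base, theirs):
            # Pure insertions have start == end, so "<=" also catches two
            # insertions at the same position.
            if start1 <= end2 and start2 <= end1:
                return True
    return False


base = ["a", "b", "c", "d", "e"]
print(has_conflict(base, ["a", "B", "c", "d", "e"], ["a", "b", "c", "d", "E"]))  # prints False
print(has_conflict(base, ["a", "B", "c", "d", "e"], ["a", "X", "c", "d", "e"]))  # prints True
```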

8. Summary

Text diffing looks deceptively simple, but it rests on classic algorithm problems. From LCS to Myers diff, from line-level to character-level granularity, different comparison strategies suit different situations. Understanding these fundamentals will help you use Git and other version control tools more effectively — and track every change more precisely in document collaboration, contract review, and everyday editing workflows.