Every time you save a document, commit code, or send a contract for review, one question inevitably follows: "What exactly changed between this version and the last?" Text diffing — commonly called "diff" — is the technical answer to that question. From the Unix diff command to Git commit history to the Track Changes feature in word processors, the same core logic runs underneath them all. This guide takes you from algorithm fundamentals to practical real-world applications.
1. What Is a Diff?
A diff (short for "difference") is a computation that finds the minimal set of changes between two pieces of text. Given an "old version" (A) and a "new version" (B), a diff algorithm identifies:
- Which lines (or characters) were added in B
- Which lines (or characters) were deleted from A
- Which lines (or characters) remain unchanged
The goal is to find the smallest such change set — this is the classic "Minimum Edit Distance" problem and is the optimization target for most diff algorithms.
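The three categories above can be sketched in a few lines with Python's standard difflib module. This is only an illustration: the function name classify_lines and the sample lines are made up for this example, and difflib uses its own matching heuristic, so it does not guarantee the smallest possible change set.

```python
from difflib import SequenceMatcher

def classify_lines(old, new):
    """Split two lists of lines into added, deleted, and unchanged lines."""
    added, deleted, unchanged = [], [], []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            unchanged.extend(old[i1:i2])
        else:
            # "delete", "insert", and "replace" all reduce to removing
            # old[i1:i2] and adding new[j1:j2] (either slice may be empty).
            deleted.extend(old[i1:i2])
            added.extend(new[j1:j2])
    return added, deleted, unchanged

# Hypothetical sample input
old = ["alpha", "beta", "gamma"]
new = ["alpha", "gamma", "delta"]
added, deleted, unchanged = classify_lines(old, new)
print("added:", added)
print("deleted:", deleted)
print("unchanged:", unchanged)
```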
"Diff" can refer to both the technique itself and its output (a report describing changes). As a verb: "Let me diff these two files." As a noun: "This diff has three changes."
2. Core Algorithms: LCS and Myers Diff
2.1 Longest Common Subsequence (LCS)
Most diff algorithms are founded on the Longest Common Subsequence (LCS) problem. An LCS is the longest sequence that appears in both inputs while preserving relative order.
Example:
- Sequence A: cat, dog, fish, bird
- Sequence B: cat, fish, rabbit, bird
- LCS: cat, fish, bird (length 3)
Once the LCS is found, elements in A that are not in the LCS are "deletions," and elements in B not in the LCS are "insertions." In the example above, dog is deleted and rabbit is inserted. The classic dynamic-programming solution to LCS runs in O(mn) time, where m and n are the lengths of the two inputs.
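As a minimal sketch of the dynamic-programming approach, the function below fills an (m+1) by (n+1) table and then walks it to recover one LCS. The function name lcs is chosen for this example, and when several subsequences share the maximum length it returns just one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table, collecting matched elements.
    result = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # a[i] is not part of this LCS: a deletion
        else:
            j += 1  # b[j] is not part of this LCS: an insertion
    return result

seq_a = ["cat", "dog", "fish", "bird"]
seq_b = ["cat", "fish", "rabbit", "bird"]
common = lcs(seq_a, seq_b)
print(common)
```

The two nested loops over m and n are where the O(mn) cost comes from, which is why this approach is usually reserved for small inputs.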
2.2 The Myers Diff Algorithm
In 1986, Eugene Myers published a diff algorithm significantly more efficient than a pure LCS approach — now known as Myers diff. Its key properties:
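- Worst-case time complexity of O(nd) and expected-case complexity of O(n + d²), where n is the combined length of the two inputs and d is the number of differences (the size of the shortest edit script)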
- When differences are few (the most common real-world case), it dramatically outperforms O(mn)
- Tends to produce diffs that "preserve the original structure," making them more readable
Myers diff is the default algorithm in Git, GNU diff, and most mainstream diff tools.
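To give a feel for how the algorithm works, here is a simplified sketch of the greedy forward search from Myers' paper. It only computes d, the number of insertions plus deletions in a shortest edit script, and does not reconstruct the script itself. The function name is an assumption for this example, and production tools add refinements (such as a linear-space variant and heuristics) that are left out here.

```python
def myers_edit_distance(a, b):
    """Return the minimum number of insertions plus deletions turning a into b."""
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]      # move down: take an element from b (insertion)
            else:
                x = v[k - 1] + 1  # move right: drop an element from a (deletion)
            y = x - k
            # Follow the diagonal for free while elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m

print(myers_edit_distance("ABCABBA", "CBABAC"))
print(myers_edit_distance(["cat", "dog", "fish", "bird"],
                          ["cat", "fish", "rabbit", "bird"]))
```

The outer loop stops as soon as the end of both inputs is reached, so the work grows with the number of differences rather than with the full m by n grid.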
2.3 Patience Diff
Patience diff, designed by Bram Cohen (creator of BitTorrent), is available in Git via --diff-algorithm=patience. Its core idea:
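- Identify "unique" lines, those that appear exactly once in each version, as candidate anchor points
- Keep the longest run of anchors that appear in the same order in both versions (found with patience sorting, which gives the algorithm its name)
- Split the files into segments around those anchors and recursively diff each segment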
In practice, Patience diff often produces more readable output than Myers diff for code refactors where functions have been moved or reordered.
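The anchor-selection step can be sketched as follows. This is a partial illustration, not a full Patience diff: it finds lines that are unique in both versions and keeps the longest set that appears in the same order on both sides, but it leaves out the recursion into the segments between anchors. The function name and the sample input are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs usable as anchor points."""
    count_a, count_b = Counter(a), Counter(b)
    pos_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Lines unique in both versions, ordered by their position in a.
    pairs = [(i, pos_in_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_in_b]

    # Longest increasing subsequence of the b positions (patience sorting).
    tail_vals = []  # smallest b position ending an increasing run of each length
    tail_idx = []   # index into pairs for each entry of tail_vals
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_vals, j)
        if k == len(tail_vals):
            tail_vals.append(j)
            tail_idx.append(idx)
        else:
            tail_vals[k] = j
            tail_idx[k] = idx
        prev[idx] = tail_idx[k - 1] if k > 0 else None

    # Follow the back-pointers from the end of the longest run.
    anchors = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        anchors.append(pairs[idx])
        idx = prev[idx]
    return anchors[::-1]

# Hypothetical input: two functions swapped between versions
old = ["import os", "def f():", "    pass", "def g():", "    pass"]
new = ["import os", "def g():", "    pass", "def f():", "    pass"]
anchors = patience_anchors(old, new)
print(anchors)
```

Note that the repeated "    pass" lines are never considered as anchors, which is the property that keeps Patience diff from lining up unrelated boilerplate lines.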
| Algorithm | Time Complexity | Best For | Default In |
|---|---|---|---|
| LCS (dynamic programming) | O(mn) | Theoretical foundation, small files | — |
| Myers diff | O(nd) worst case, O(n + d²) expected | General code comparison | Git, GNU diff |
| Patience diff | O(n log n) | Code refactoring, function moves | Bazaar; Git (optional) |
| Histogram diff | O(n log n) | Improved Patience variant | Git (optional) |
3. The Unified Diff Format
Regardless of which algorithm is used, diff output is most commonly presented in the Unified diff format — the industry standard and the default for git diff.
A typical Unified diff looks like this:
--- a/hello.txt
+++ b/hello.txt
@@ -1,5 +1,5 @@
 Line one unchanged
-Old line two
+New line two
 Line three unchanged
-Deleted line four
 Line five unchanged
+New line six
Format breakdown:
- ---: The old version (a)
- +++: The new version (b)
- @@: Hunk header; -1,5 means the hunk starts at line 1 of the old version and spans 5 lines; +1,5 gives the same information for the new version
- Lines starting with a space: Unchanged context lines (3 shown by default)
- Lines starting with -: Deleted lines
- Lines starting with +: Inserted lines
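To make the hunk header concrete, the sketch below parses it with a regular expression and checks that its line counts agree with the hunk body from the example above. The helper names are invented for this illustration, and the parser covers only the header fields described here.

```python
import re

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for a hunk header."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    # When a count is omitted, the hunk side covers exactly one line.
    return (int(old_start), int(old_count or 1),
            int(new_start), int(new_count or 1))

def count_hunk_lines(body):
    """Count how many lines the hunk body covers in the old and new versions."""
    old = sum(1 for line in body if line[:1] in (" ", "-"))
    new = sum(1 for line in body if line[:1] in (" ", "+"))
    return old, new

hunk = [
    "@@ -1,5 +1,5 @@",
    " Line one unchanged",
    "-Old line two",
    "+New line two",
    " Line three unchanged",
    "-Deleted line four",
    " Line five unchanged",
    "+New line six",
]
header = parse_hunk_header(hunk[0])
counts = count_hunk_lines(hunk[1:])
print(header, counts)
```

Context lines count toward both sides, which is why a hunk with two deletions and two insertions still reports 5 lines for each version.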
4. Character-Level vs. Line-Level Comparison
Standard diff works at the line level, but many tools offer finer granularity:
| Mode | Unit | Best For |
|---|---|---|
| Line diff | Entire line | Code, config files, general documents |
| Word diff | Word | Natural language text, Markdown |
| Character diff | Single character | Minor spelling corrections, legal contracts |
Git supports word-level diffing with git diff --word-diff. Online text comparison tools typically highlight both line-level and character-level changes simultaneously, so you can see exactly what was modified at a glance.
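A word-level comparison can be approximated by splitting each text on whitespace and diffing the resulting word lists. The sketch below uses Python's difflib and marks removed words as [-...-] and added words as {+...+}, similar to the plain output style of git diff --word-diff. The function name word_diff is an assumption, and real tools treat punctuation and whitespace more carefully than this.

```python
from difflib import SequenceMatcher

def word_diff(old, new):
    """Return a single string with word-level deletions and insertions marked."""
    a, b = old.split(), new.split()
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 < i2:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

marked = word_diff("the terms originally agreed", "the terms previously agreed")
print(marked)
```

The same function works at character level if you pass list(old) and list(new) to the matcher instead of the split words.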
5. Practical Use Cases
5.1 Version Control (Git)
Git stores each commit as a snapshot of the project and computes the diff between snapshots on demand. Commands like git diff, git show, and git log -p let you inspect those changes. Pull request review interfaces on GitHub and GitLab are essentially visual diff viewers with collaboration features layered on top.
5.2 Document Revision (Track Changes)
Microsoft Word's "Track Changes" and Google Docs' "Suggesting" mode present edits much as a diff does: they record who changed what and when, letting reviewers accept or reject each modification individually.
5.3 Contracts and Legal Documents
Contract revisions require attorneys to verify every word change. Character-level diffing makes subtle phrasing changes like "originally" → "previously" immediately visible, so a modification is far less likely to be overlooked.
5.4 Config Files and Infrastructure as Code (IaC)
Tools like Terraform (terraform plan) and Ansible (check mode with --diff) display a preview of pending changes before applying them, letting engineers confirm which cloud resources or files will be created, modified, or destroyed and reducing the risk of accidental operations.
5.5 Data Comparison
Comparing CSV or JSON exports across time periods using diff quickly surfaces which fields were modified and which records were added or removed — ideal for data auditing and debugging.
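For structured exports, a line-by-line diff can be noisy when rows are reordered, so it often helps to compare records by key instead. The sketch below is one way to do that, assuming each record has a unique id column. The sample data, the column names, and the function name diff_records are all invented for this example.

```python
import csv
import io

def diff_records(old_rows, new_rows, key):
    """Compare two lists of dict records that share a unique key field."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        # Map each modified field to its (old value, new value) pair.
        fields = {f: (old[k].get(f), new[k].get(f))
                  for f in old[k].keys() | new[k].keys()
                  if old[k].get(f) != new[k].get(f)}
        if fields:
            changed[k] = fields
    return added, removed, changed

# Hypothetical exports from two points in time
OLD_CSV = "id,name,plan\n1,Ada,free\n2,Grace,pro\n3,Linus,free\n"
NEW_CSV = "id,name,plan\n1,Ada,pro\n3,Linus,free\n4,Edsger,free\n"

old_rows = list(csv.DictReader(io.StringIO(OLD_CSV)))
new_rows = list(csv.DictReader(io.StringIO(NEW_CSV)))
added, removed, changed = diff_records(old_rows, new_rows, key="id")
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

The same function works for JSON exports once they are loaded into a list of dictionaries with json.load.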
6. How to Use an Online Text Diff Tool
Online text diff tools (like the one on this site) require no installation and are perfect for quick plain-text comparisons:
- Paste your text: Put the old version in the left panel and the new version in the right panel
- Choose comparison mode: Line-level or character-level (depending on the tool)
- Review the results: Deleted content is highlighted in red, added content in green, unchanged text stays neutral
- Navigate changes: Use "Previous" / "Next" buttons to jump between each difference
Online tools are especially useful for non-technical users who need document comparison without learning command-line tools.
7. Common Questions
7.1 Can diff compare images or PDFs?
Standard diff is a plain-text tool and cannot directly compare binary files such as images or PDFs. PDFs must first be converted to plain text for text comparison. Image comparison requires specialized visual diff tools that compare pixel values — a fundamentally different concept from text diff.
7.2 Why does the diff output sometimes look unexpected?
The most common causes:
- Whitespace differences: Indentation style (tabs vs. spaces) or line endings (Windows CRLF vs. Unix LF) can generate false positives. Use -w (ignore all whitespace) or -b (ignore changes in the amount of whitespace) to filter them out.
- Encoding mismatch: Files using different character encodings (e.g., UTF-8 vs. Latin-1) should be normalized to the same encoding before comparing.
- Algorithm choice: Different algorithms "cut" the same difference differently. Switching algorithms sometimes yields a more intuitive result.
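When you control the comparison yourself, the first two causes can be handled by normalizing both inputs before diffing. The sketch below is one possible approach: the normalize function and its tab size of 4 are assumptions for this example, and expanding tabs is only appropriate when indentation style is not something you want the diff to report.

```python
def normalize(text, tabsize=4):
    """Return lines with uniform line endings, expanded tabs, no trailing spaces."""
    # splitlines() treats CRLF, LF, and bare CR as line boundaries.
    return [line.expandtabs(tabsize).rstrip() for line in text.splitlines()]

# The same two-line file saved on Windows (tabs, CRLF) and on Unix (spaces, LF)
windows_text = "def f():\r\n\treturn 1\r\n"
unix_text = "def f():\n    return 1   \n"

raw_equal = windows_text.split("\n") == unix_text.split("\n")
normalized_equal = normalize(windows_text) == normalize(unix_text)
print(raw_equal, normalized_equal)

# Encoding mismatch: identical text, different bytes until decoded correctly.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")
bytes_equal = latin1_bytes == utf8_bytes
text_equal = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")
print(bytes_equal, text_equal)
```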
7.3 How are diff and merge related?
Merge is diff taken one step further: when two people independently modify the same original file, a merge tool diffs each branch against their common ancestor (a "three-way merge"), then attempts to combine both change sets. When both branches change the same lines in different ways, the result is a "merge conflict" that requires manual resolution.
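A simplified three-way merge can be sketched with difflib. The idea is to find the regions of the original that both branches left untouched, then decide what to do with each gap between them: take whichever side changed, or report a conflict if both changed differently. This is an illustrative sketch under those assumptions, not how Git implements merging, and the names merge3, sync_regions, ours, and theirs are chosen for this example.

```python
from difflib import SequenceMatcher

def sync_regions(base, ours, theirs):
    """Return (base_lo, base_hi, ours_lo, theirs_lo) for runs unchanged in both branches."""
    blocks_o = SequenceMatcher(None, base, ours, autojunk=False).get_matching_blocks()
    blocks_t = SequenceMatcher(None, base, theirs, autojunk=False).get_matching_blocks()
    regions, i, j = [], 0, 0
    while i < len(blocks_o) and j < len(blocks_t):
        o_base, o_pos, o_len = blocks_o[i]
        t_base, t_pos, t_len = blocks_t[j]
        lo = max(o_base, t_base)
        hi = min(o_base + o_len, t_base + t_len)
        if lo < hi:  # this stretch of base survives in both branches
            regions.append((lo, hi, o_pos + lo - o_base, t_pos + lo - t_base))
        if o_base + o_len < t_base + t_len:
            i += 1
        else:
            j += 1
    # Sentinel so the text after the last shared region is also processed.
    regions.append((len(base), len(base), len(ours), len(theirs)))
    return regions

def merge3(base, ours, theirs):
    """Merge two edited line lists against their common ancestor."""
    merged, conflicts = [], 0
    pb = po = pt = 0
    for lo, hi, o_lo, t_lo in sync_regions(base, ours, theirs):
        base_gap, our_gap, their_gap = base[pb:lo], ours[po:o_lo], theirs[pt:t_lo]
        if our_gap == their_gap:      # same change (or no change) on both sides
            merged += our_gap
        elif our_gap == base_gap:     # only theirs changed this gap
            merged += their_gap
        elif their_gap == base_gap:   # only ours changed this gap
            merged += our_gap
        else:                         # both changed it differently
            conflicts += 1
            merged += ["<<<<<<< ours", *our_gap, "=======", *their_gap, ">>>>>>> theirs"]
        merged += base[lo:hi]
        pb, po, pt = hi, o_lo + (hi - lo), t_lo + (hi - lo)
    return merged, conflicts

base = ["a", "b", "c", "d", "e"]
clean, clean_conflicts = merge3(base, ["a", "B", "c", "d", "e"], ["a", "b", "c", "D", "e"])
clash, clash_conflicts = merge3(["a", "b", "c"], ["a", "X", "c"], ["a", "Y", "c"])
print(clean, clean_conflicts)
print(clash, clash_conflicts)
```

In the first call the two edits touch different lines, so both are kept. In the second, both branches rewrite the same line, so the output contains conflict markers for a person to resolve.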
8. Summary
Text diffing looks deceptively simple, but it rests on classic algorithm problems. From LCS to Myers diff, from line-level to character-level granularity, different comparison strategies suit different situations. Understanding these fundamentals will help you use Git and other version control tools more effectively — and track every change more precisely in document collaboration, contract review, and everyday editing workflows.