Lemmings.world

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:
var_dump(levenshtein('göthe', 'gothe'));
There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:
$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";
Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?
var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));

New in PHP 8.5: Levenshtein Comparison for UTF-8 Strings (chrastecky.dev)

submitted 3 months ago by rikudou to c/php@programming.dev

0 comments fedilink

cross-posted from: https://chrastecky.dev/post/15

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:
var_dump(levenshtein('göthe', 'gothe'));
There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:
$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";
Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?
var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));

New in PHP 8.5: Levenshtein Comparison for UTF-8 Strings (chrastecky.dev)

submitted 3 months ago* (last edited 3 months ago) by dominik@chrastecky.dev to c/programming@chrastecky.dev

0 comments fedilink

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:

var_dump(levenshtein('göthe', 'gothe'));

There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:

$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";

Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?

var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));