Lemmings.world

4,225 readers
36 users here now

General

A general-purpose Lemmy server that anyone can use.

Read the Code of Conduct and follow the rules. There's also the new user's guide.

We have a bot that travels the Fediverse and subscribes to the most popular communities, so that close to all Lemmy content gets synced here.

You can also go chat with others on our Matrix.

We're part of the Fediseer chain of trust:

Fediseer badge showing that we're guaranteed on the Fediseer network

A badge showing the uptime as a percentage

Donations

This instance is funded out of my pocket, if you wish to donate (or just see how much it costs), visit the donations page.

Other

Other Lemmy-related things hosted on Lemmings.world:

founded 2 years ago
ADMINS
1
 
 

cross-posted from: https://chrastecky.dev/post/15

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:

var_dump(levenshtein('göthe', 'gothe'));

There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:

$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";

Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?

var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));
2
 
 

cross-posted from: https://chrastecky.dev/post/15

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:

var_dump(levenshtein('göthe', 'gothe'));

There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:

$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";

Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?

var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));
3
 
 

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:

var_dump(levenshtein('göthe', 'gothe'));

There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:

$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";

Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?

var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));
view more: next ›