Niko's Project Corner

A good presentation about hyphenation in HTML documents can be seen here, but it is oriented towards the client side (JavaScript). Basically you shouldn't use justified text unless it is hyphenated, because long words leave huge spaces between the other words when a line is stretched to the full width of the element. I found a few PHP scripts such as phpHyphenator 1.5, but typically they weren't implemented as a single stand-alone PHP class. Since the underlying algorithm is fairly simple, I decided to write it from scratch.

The algorithm itself has been known since the 1980s, and it is based on pre-calculated patterns which indicate how fragments of words can be hyphenated. The patterns are fairly simple to understand and debug, and they will most likely hyphenate previously unseen words correctly as well. The algorithm scans the input word to check which patterns can be found in it, as illustrated in Table 1. Here the word to be hyphenated is ''algorithm'', in which the following patterns can be found: 4l1g4, lgo3, 1go, 2ith and 4h1m. The numbers are ignored during pattern matching, and for each match the pattern's numbers are written into the gaps between the corresponding characters of the word. Once this is done, the highest value is determined for each gap (each column of the table). If the highest number is odd, the gap is a possible position for hyphenation. Typically TeX enforces some additional rules, such as a minimum allowed length for a syllable. In this example the odd values would give ''al-go-rith-m'', and the hyphenation ''algorith-m'' doesn't make much sense.
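The matching described above can be tried out with a minimal sketch like the one below. It is not the code of my Hyphenator class: the function names and the $minLeft / $minRight parameters are made up for this illustration, it only handles single-byte (ASCII) words, and it ignores the word boundary markers that real TeX pattern files use.

```php
<?php
// The five patterns quoted in the text for the word "algorithm".
$patterns = ['4l1g4', 'lgo3', '1go', '2ith', '4h1m'];

/**
 * Split a pattern such as "4l1g4" into its letters ("lg") and one number
 * per gap: before each letter and after the last one ([4, 1, 4]).
 */
function parsePattern(string $pattern): array
{
    $letters = '';
    $digits  = [0];
    foreach (str_split($pattern) as $ch) {
        if (ctype_digit($ch)) {
            $digits[count($digits) - 1] = (int) $ch;
        } else {
            $letters .= $ch;
            $digits[] = 0;
        }
    }
    return [$letters, $digits];
}

/** Highest pattern number for every gap; $gaps[$i] is the gap before character $i. */
function gapValues(string $word, array $patterns): array
{
    $gaps = array_fill(0, strlen($word) + 1, 0);
    foreach ($patterns as $pattern) {
        [$letters, $digits] = parsePattern($pattern);
        $offset = 0;
        // Naive search: look for every occurrence of the pattern's letters.
        while (($pos = strpos($word, $letters, $offset)) !== false) {
            foreach ($digits as $i => $d) {
                $gaps[$pos + $i] = max($gaps[$pos + $i], $d);
            }
            $offset = $pos + 1;
        }
    }
    return $gaps;
}

/** Insert $hyphen at odd gaps, keeping at least $minLeft / $minRight characters at the ends. */
function hyphenateWord(string $word, array $patterns, string $hyphen = '-',
                       int $minLeft = 1, int $minRight = 1): string
{
    $gaps   = gapValues($word, $patterns);
    $length = strlen($word);
    $out    = '';
    foreach (str_split($word) as $i => $ch) {
        if ($i >= $minLeft && $length - $i >= $minRight && $gaps[$i] % 2 === 1) {
            $out .= $hyphen;
        }
        $out .= $ch;
    }
    return $out;
}

echo hyphenateWord('algorithm', $patterns), "\n";         // al-go-rith-m
echo hyphenateWord('algorithm', $patterns, '-', 2, 2), "\n"; // al-go-rithm
```

The second call shows how a minimum length rule removes the odd-looking last hyphenation. The values 2 and 2 are only an example, not the limits used by TeX or by my class.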

Table 1: Example result of the hyphenation algorithm on the word ''algorithm''.


The open source project phpHyphenator has very good support for multi-byte characters because of its extensive use of mb_-prefixed functions such as mb_substr. On the negative side it is not object oriented, it uses $GLOBALS for configuration, its pattern searching is somewhat inefficient and its handling of HTML tags is a bit gimmicky. Many of these are most likely due to the usage of mb_substr instead of regular expressions.

In contrast, my Hyphenator is a stand-alone class with two methods: __construct, which takes the pattern list and the hyphen symbol as parameters, and hyphenate, which returns the hyphenated string. The constructor uses the provided patterns to build a trie, a tree-like data structure that enables very efficient string searching, especially in an incremental case. It also makes the pattern searching code cleaner, since we aren't forced into calculating odd substring offsets and guarding against indexing beyond the string length.
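To give an idea of how a trie helps here, below is a rough sketch and not the actual class. The nested-array layout, the reserved key 'digits' and the function names are assumptions made for this example, and it again assumes single-byte characters.

```php
<?php
/**
 * Build a trie from the patterns. Each node is an array keyed by a single
 * character. The multi-character key 'digits' cannot collide with those and
 * holds the gap numbers of a pattern that ends at that node.
 */
function buildTrie(array $patterns): array
{
    $trie = [];
    foreach ($patterns as $pattern) {
        $letters = preg_replace('/[0-9]/', '', $pattern);
        // One number per gap; a missing digit becomes '' and intval('') is 0.
        $digits = array_map('intval', preg_split('/[^0-9]/', $pattern));

        $node = &$trie;
        foreach (str_split($letters) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['digits'] = $digits;
        unset($node); // drop the reference before handling the next pattern
    }
    return $trie;
}

/** Walk the trie from every start position and keep the highest number per gap. */
function trieGapValues(string $word, array $trie): array
{
    $length = strlen($word);
    $gaps   = array_fill(0, $length + 1, 0);
    for ($start = 0; $start < $length; $start++) {
        $node = $trie;
        // Extend the match one character at a time, stop when no child exists.
        for ($i = $start; $i < $length && isset($node[$word[$i]]); $i++) {
            $node = $node[$word[$i]];
            foreach ($node['digits'] ?? [] as $k => $d) {
                $gaps[$start + $k] = max($gaps[$start + $k], $d);
            }
        }
    }
    return $gaps;
}

$trie = buildTrie(['4l1g4', 'lgo3', '1go', '2ith', '4h1m']);
print_r(trieGapValues('algorithm', $trie));
```

Each start position is extended only as long as some pattern continues with the next character, so there is no need to extract substrings of every possible length or to check offsets against the string length by hand.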

Text hyphenation begins by splitting the string into chunks, using sequences of non-alphabetical characters as delimiters. The chunks are then iterated through, and a counter keeps track of the number of open HTML tags. This ensures that we don't add hyphenations into, for example, CSS class names. When a hyphenation pattern is observed in a word at some position, its ''hyphenation numbers'' are stored into a separate table at the corresponding ''gap indexes''. After all maximum numbers have been found, the word is iterated character by character and hyphens are inserted into the output where possible.
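The chunk splitting and the tag counter could look roughly like the sketch below. This is a simplified illustration under my own assumptions rather than the class itself: the function name hyphenateText and the callback parameter are hypothetical, the character class only covers ASCII letters (a Unicode-aware version would probably use \p{L} with the /u modifier), and strtoupper is used only as a visible stand-in for a real word hyphenator.

```php
<?php
/**
 * Apply $hyphenateWord to every alphabetical chunk that is outside HTML tags.
 */
function hyphenateText(string $text, callable $hyphenateWord): string
{
    // Capturing the delimiters keeps them in the result, so the text can be
    // reassembled. Even indexes are words, odd indexes are delimiters.
    $chunks   = preg_split('/([^a-zA-Z]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $openTags = 0;
    $out      = '';
    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 1) {
            // Delimiter: '<' opens a tag and '>' closes it. Clamp at zero so a
            // stray '>' in the text does not break the counter.
            $delta    = substr_count($chunk, '<') - substr_count($chunk, '>');
            $openTags = max(0, $openTags + $delta);
            $out     .= $chunk;
        } elseif ($openTags > 0 || $chunk === '') {
            // Tag names, attribute names and values are copied unchanged.
            $out .= $chunk;
        } else {
            $out .= $hyphenateWord($chunk);
        }
    }
    return $out;
}

// Only the visible text is passed to the callback.
echo hyphenateText('<p class="algorithm">An algorithm</p>', 'strtoupper'), "\n";
// <p class="algorithm">AN ALGORITHM</p>
```

Counting brackets like this is much cruder than real HTML parsing, but it is enough to show why the class name ''algorithm'' inside the tag is left alone while the same word in the text gets processed.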

English hyphenation algorithm in PHP