background

Niko's Project Corner

English hyphenation algorithm in PHP

Description Implementing English hyphenation algorithm in PHP.
Languages PHP
Tags Hy­phen­ation
Blog
GitHub
Data Struc­tures
Reg­ular Ex­pres­sions
Duration Summer 2013
Modified 10th July 2013
GitHub nikonyrh/hyphenator-php
thumbnail

A good pre­sen­ta­tion about hy­phen­ation in HTML doc­uments can be seen here, but it is client side (JavaScript) ori­ented. Ba­si­cally you shouldn't use jus­ti­fied text un­less it is hy­phen­ated, be­cause long words will cause huge spaces be­tween words to make the line stretch out the whole width of the el­ement. I found a few PHP scripts such as ph­pHy­phen­ator 1.5, but typ­ically they weren't im­ple­mented as a sin­gle stand-alone PHP class. Since the un­der­ly­ing al­go­rithm is fairly sim­ple, I de­cided to write it from scratch.

The al­go­rithm it­self has been known since 1980s, and it is based on pre-cal­cu­lated pat­terns, which in­di­cate how frag­ments of words could be hy­phen­ated. These are fairly sim­ple to un­der­stand and de­bug, and can most likely cor­rectly hy­phen­ate any new pre­vi­ously un­seen words as well. The al­go­rithm scans the in­putted word to check which pat­terns can be found, and this is il­lus­trated in Table 1. Here the word to be hy­phen­ated is ''al­go­rithm'', from which the fol­low­ing pat­terns can be found: 4l1g4, lgo3, 1go, 2ith and 4h1m. During pattern matching these numbers are ignored, and when matches are found then all existing numbers are injected between corresponding characters. Once this is done, the highest numeric value is defined for each column. If the highest number is odd, it indicates that this is a possible position for hyphenation. Typically in TeX some additional rules are enforced, such as the minimum allowed length for a syllable. In this example, the hyphenation ''algorith-m'' doesn't make much sense.

Table 1: Ex­am­ple re­sult of hy­phen­ation al­go­rithm on the word ''al­go­rithm''.
algorithm
4l1g4
lgo3
1go
2ith
4h1m
a4l1g4o3r2i0t4h1m
algorithm

The open source pro­ject ph­pHy­phen­ator has a very good sup­port for multi-byte char­ac­ters be­cause of the ex­ten­sive use of mb_-pre­fix func­tions such as mb_sub­str. On a neg­ative side it was not ob­ject ori­ented, uses $GLOB­ALS for con­fig­ura­tion, some­what in­ef­fi­cient pat­tern search­ing and a bit gim­micky han­dling of HTML tags. Many of these are most likely due to the us­age of mb_sub­str in­stead of reg­ular ex­pres­sions.

In con­trast my Hy­phen­ator is a stand-alone class with two meth­ods, the __con­struct which takes the pat­tern list and the hy­phen sym­bol as a pa­ram­eter, and hy­phen­ate which re­turns the hy­phen­ated string. The con­struc­tor uses pro­vided pat­terns to build a trie data struc­ture, which is a tree-like data struc­ture that en­ables a very ef­fi­cient string search­ing, es­pe­cially in an in­cre­men­tal case. It also makes pat­tern search­ing code more clean, when we aren't forced into cal­cu­lat­ing odd sub­string off­sets and avoid­ing in­dex­ing be­yond the string length.

Text hy­phen­ation be­gins by split­ting the string into chunks us­ing se­quences of non-al­pha­bet­ical char­ac­ters as de­lim­iters. Then these are it­er­ated through, and a coun­ter is used to keep track of the num­ber of open HTML tags. This en­sures that we don't add hy­phen­ations into for ex­am­ple CSS class names. When a hy­phen­ation pat­tern is ob­served in the string at some po­si­tion, ''hy­phen­ation num­bers'' are stored into a sep­arate table at cor­re­spond­ing ''gap in­dexes''. Af­ter all max­imum num­bers are found, the word is it­er­ated char­ac­ter-by-char­ac­ter and hy­phen­ations are in­serted into the out­put when pos­si­ble.


Related blog posts:

CljHyphenation
CljMustache
Agadmator
BlogPlatform
NoKnowledgeNotes