Niko's Project Corner

English hyphenation algorithm in Clojure

Description Implementing English hyphenation algorithm in Clojure.
Languages Clojure
Tags Hy­phen­ation
Duration Summer 2016
Modified 17th August 2016
GitHub nikonyrh/hyphenator-clj

This is noth­ing that spec­tac­ular (as if any­thing on my blog is), but I still wanted to de­scribe the out­line of the pro­ject of port­ing the hy­phen­ation al­go­rithm from PHP to Clo­jure. The im­ple­men­ta­tion is only about 80 lines of code + com­ments + 20 lines of unit tests. For com­par­ison the orig­inal PHP abom­ina­tion is about is about 160 LoCs, al­though it is a bit bloated by im­ple­ment­ing the pat­terns search via a trie data struc­ture in­stead of us­ing the str­pos func­tion.

This is the ini­tial step to­wards re-writ­ing and port­ing this whole blog plat­form to Clo­jure, which was mo­ti­vated by want­ing to learn new lan­guage and mov­ing away from PHP pro­jects. I man­aged to tol­er­ate it for 10 years and got my ca­reer started but af­ter work­ing with Python 2 and 3 for three years it seems ob­vi­ous that PHP will al­ways lack that ex­pres­sive­ness, well-thought de­sign and vi­sion. The only pos­itive side of PHP is the ease of de­ploy­ment via php-fpm, it took me a while to un­der­stand dif­fer­ences be­tween Fast­GCI and WSGI, and how Ng­inx + Python API com­bi­na­tion had to be set up. Clo­jure runs on the JVM, which is also widely used in large-scale pro­jects such as Elas­tic­search, Neo4j, Apache Hadoop and Spark.

The code starts by read­ing pat­terns of en­glish.txt one-by-one, trans­form­ing lines like "_gen3t4" into {:str "_gent", :dig­its {4 3, 5 4}} where :dig­its is a hash-map map­ping po­si­tions in the string to cor­re­spond­ing in­te­ger val­ues. As with the pre­vi­ous im­ple­men­ta­tion, words are pre- and post­fixed by un­der­scores and these are used in searched pat­terns as well.

match-pat­tern func­tion takes a word and a pat­tern as its ar­gu­ments and finds all in­dexes in which the pat­tern oc­curs in the word. It then ac­cu­mu­lates the max­imum ob­served nu­mer­ical value for each "slot" (see the ar­ti­cle of orig­inal im­ple­men­ta­tion for more de­tails). It is im­ple­mented by tail-re­curs­ing an in­ner anony­mous func­tion un­til the pat­tern is not found from the word any­more, at which point it re­turns the fi­nal value. hy­phen­ate-word func­tion takes the hy­phen and a word, calls match-pat­tern to find all oc­cur­rences of pat­terns, finds in­dexes of odd val­ues and in­jects hy­phens on po­si­tions which don't vi­olate the min­imum syl­la­ble length of two.

To split sen­tences into words a pat­tern-chars set is de­fined, which con­tains all up­per- and low­er­case char­ac­ters which oc­curred in the pat­terns. An other util­ity is the count-chars func­tion which takes two strings as its ar­gu­ments and for each char­ac­ter in the first string it cal­cu­lates the num­ber of its oc­cur­rences in the sec­ond string. This is used to count the cu­mu­la­tive num­ber of < and > char­ac­ters to know if a word is oc­cur­ring in­side <a html tag>or not</a>.

The ul­ti­mate func­tion which brings all this to­gether is the hy­phen­ate. It starts by split­ting the sen­tence into "par­ti­tions" (words and word-sep­ara­tors) by us­ing (par­tial con­tains? pat­tern-chars), checks whether odd or even par­ti­tion in­dexes are the ones which con­tain words to-be hy­phen­ated, cal­cu­lates the cu­mu­la­tive XHTML tag-bal­ance, and merges all these into a should-hy­phen­ate? func­tion. Then the par­ti­tions are ei­ther hy­phen­ated or left as-is and joined back to­gether into a string which is re­turned.

Writ­ing unit tests was cru­cial but also ex­tremely in­ter­est­ing dis­cov­ery pro­cess, as clo­jure.test comes with macros which greatly re­duce rep­eti­tion in test case def­ini­tions. With the help of self-writ­ten my-are and my-deftest writ­ing tests for func­tions is-digit? and elem-max was just two lines of code: (my-deftest test-is-digit is-digit? \textback­slash a false \textback­slash _ false \textback­slash Z false \textback­slash 0 true \textback­slash 5 true \textback­slash 9 true) and (my-deftest test-elem-max elem-max [[-3 1 3] [1 2 -3]] [1 2 3]).

The con­ven­tion of this macro is that 1st ar­gu­ment is the test name, 2nd is the tested func­tion, re­main­ing ar­gu­ments of odd in­dex are in­put val­ues and at even in­dexes are ex­pected out­comes. Other im­por­tant func­tions are tested in a straight-for­ward man­ner as well. I hope this wall of text with­out any fig­ures was at least a bit in­ter­est­ing doc­umen­ta­tion, I'm re­ally look­ing for­ward to use Clo­jure in fu­ture pro­jects.

Related blog posts: