Niko's Project Corner

This is nothing that spectacular (as if anything on my blog is), but I still wanted to describe the outline of the project of porting the hyphenation algorithm from PHP to Clojure. The implementation is only about 80 lines of code + comments + 20 lines of unit tests. For comparison, the original PHP abomination is about 160 lines of code, although it is a bit bloated by implementing the pattern search via a trie data structure instead of using the strpos function.

This is the initial step towards rewriting and porting this whole blog platform to Clojure, which was motivated by wanting to learn a new language and to move away from PHP projects. I managed to tolerate PHP for 10 years and it got my career started, but after working with Python 2 and 3 for three years it seems obvious that PHP will always lack that expressiveness, well-thought-out design and vision. The only positive side of PHP is the ease of deployment via php-fpm: it took me a while to understand the differences between FastCGI and WSGI, and how the Nginx + Python API combination had to be set up. Clojure runs on the JVM, which is also widely used in large-scale projects such as Elasticsearch, Neo4j, Apache Hadoop and Spark.

The code starts by reading the patterns from english.txt one by one, transforming lines like "_gen3t4" into {:str "_gent", :digits {4 3, 5 4}}, where :digits is a hash-map mapping positions in the string to the corresponding integer values. In this example the digit 3 follows four non-digit characters and the digit 4 follows five, hence the keys 4 and 5. As with the previous implementation, words are prefixed and suffixed with underscores, and these are used in the searched patterns as well.
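The transformation can be sketched roughly as below. This is not the code from the project: parse-pattern is a hypothetical name, and the sketch assumes that the key of each digit is the number of non-digit characters before it, which is what the "_gen3t4" example suggests.

```clojure
(defn is-digit?
  "True when c is an ASCII digit character."
  [c]
  (<= (int \0) (int c) (int \9)))

(defn parse-pattern
  "Hypothetical helper: splits a raw pattern line into its non-digit
  characters and a map from position (characters seen so far) to digit."
  [line]
  (loop [cs (seq line), pos 0, letters [], digits {}]
    (if-let [c (first cs)]
      (if (is-digit? c)
        ;; A digit does not advance the position, it is stored at it.
        (recur (rest cs) pos letters
               (assoc digits pos (- (int c) (int \0))))
        (recur (rest cs) (inc pos) (conj letters c) digits))
      {:str (apply str letters), :digits digits})))

(parse-pattern "_gen3t4")
;; => {:str "_gent", :digits {4 3, 5 4}}
```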

The match-pattern function takes a word and a pattern as its arguments and finds all indexes at which the pattern occurs in the word. It then accumulates the maximum observed numerical value for each "slot" (see the article on the original PHP implementation for more details). It is implemented by tail-recursing an inner anonymous function until the pattern is no longer found in the word, at which point it returns the final value. The hyphenate-word function takes the hyphen and a word, calls match-pattern to find all occurrences of the patterns, finds the indexes of odd values and injects hyphens at positions which don't violate the minimum syllable length of two.
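Here is how I would sketch these two functions from the description alone. The function names come from the text, but the argument order, the return values and the use of loop/recur (instead of a tail-recursive anonymous function) are my assumptions, and the pattern in the usage example is made up rather than taken from english.txt.

```clojure
(require '[clojure.string :as string])

(defn match-pattern
  "Sketch: returns a map from slot index in word to the largest digit
  that any occurrence of the pattern puts on that slot."
  [word {pattern :str, digits :digits}]
  (loop [from 0, slots {}]
    (if-let [idx (string/index-of word pattern from)]
      ;; Shift the pattern's digit positions by idx and keep the maximum.
      (recur (inc idx)
             (reduce-kv (fn [acc pos value]
                          (update acc (+ idx pos) (fnil max 0) value))
                        slots
                        digits))
      slots)))

(defn hyphenate-word
  "Sketch: pads the word with underscores, merges the slots of all
  patterns and injects the hyphen at odd-valued slots that leave at
  least two characters on both sides."
  [hyphen patterns word]
  (let [padded (str "_" word "_")
        slots  (apply merge-with max {}
                      (map #(match-pattern padded %) patterns))
        n      (count word)
        breaks (set (for [[slot value] slots
                          :let [pos (dec slot)] ; undo the leading underscore
                          :when (and (odd? value) (<= 2 pos (- n 2)))]
                      pos))]
    (apply str (map-indexed (fn [i c] (if (breaks i) (str hyphen c) c))
                            word))))

;; Made-up pattern "c1d", already parsed into the map form.
(hyphenate-word "-" [{:str "cd", :digits {1 1}}] "abcdef")
;; => "abc-def"
```

An even value, or an odd value closer than two characters to either end of the word, leaves the word unchanged in this sketch.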

To split sentences into words, a pattern-chars set is defined, which contains all upper- and lowercase characters which occur in the patterns. Another utility is the count-chars function, which takes two strings as its arguments and, for each character in the first string, calculates the number of its occurrences in the second string. This is used to count the cumulative number of < and > characters, to know whether a word occurs inside <a html tag>or not</a>.
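A small sketch of the counting idea follows. count-chars is named in the text, but returning a vector of counts is an assumption on my part, and tag-balance is a hypothetical helper that only illustrates the cumulative bookkeeping.

```clojure
(defn count-chars
  "For each character of cs, the number of times it occurs in s.
  Returning a vector is an assumption; the original may differ."
  [cs s]
  (let [freqs (frequencies s)]
    (mapv #(get freqs % 0) cs)))

(defn tag-balance
  "Hypothetical helper: running count of unclosed < characters after
  each part of a sentence."
  [parts]
  (reductions + (map (fn [part]
                       (let [[opened closed] (count-chars "<>" part)]
                         (- opened closed)))
                     parts)))

(count-chars "<>" "<a href>text</a>")
;; => [2 2]

(tag-balance ["<a" " " "href" ">" "text" "</" "a" ">"])
;; => (1 1 1 0 0 1 1 0)
```

In this sketch a positive balance means the part sits inside a tag, so "href" and the closing "a" would be skipped while "text" would be hyphenated.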

The function which brings all this together is hyphenate. It starts by splitting the sentence into "partitions" (words and word separators) by using (partial contains? pattern-chars), checks whether the odd or the even partition indexes are the ones which contain the words to be hyphenated, calculates the cumulative XHTML tag balance, and merges all these into a should-hyphenate? function. Then the partitions are either hyphenated or left as-is and joined back together into a string, which is returned.
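The splitting step can be sketched with partition-by. This is an illustration under assumptions: pattern-chars is replaced by plain ASCII letters instead of being derived from the pattern file, split-partitions and map-words are hypothetical names, and the tag-balance check is left out.

```clojure
(require '[clojure.string :as string])

;; Assumption: ASCII letters stand in for the set that is built from
;; the characters seen in the patterns.
(def pattern-chars
  (set "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))

(defn split-partitions
  "Splits a sentence into alternating runs of word and non-word characters."
  [sentence]
  (map #(apply str %)
       (partition-by (partial contains? pattern-chars) sentence)))

(defn map-words
  "Applies f to the word partitions only and joins everything back.
  Words are at even indexes exactly when the sentence starts with one."
  [f sentence]
  (let [words-even? (contains? pattern-chars (first sentence))]
    (apply str
           (map-indexed (fn [i part]
                          (if (= (even? i) words-even?) (f part) part))
                        (split-partitions sentence)))))

(split-partitions "Hello, world!")
;; => ("Hello" ", " "world" "!")

(map-words string/upper-case " hi there")
;; => " HI THERE"
```

Passing a word-level hyphenation function as f instead of string/upper-case would give the overall shape of hyphenate, minus the handling of tags.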

Writing unit tests was crucial, but also an extremely interesting discovery process, as clojure.test comes with macros which greatly reduce repetition in test case definitions. With the help of the self-written my-are and my-deftest macros, the tests for the functions is-digit? and elem-max took just two lines of code: (my-deftest test-is-digit is-digit? \a false \_ false \Z false \0 true \5 true \9 true) and (my-deftest test-elem-max elem-max [[-3 1 3] [1 2 -3]] [1 2 3]).
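Reading those two calls as a test name, the tested function and then alternating input and expected values, a macro of that shape could be sketched as follows. This is a guess, not the macro from the project: it expands directly to clojure.test's is forms and skips my-are entirely, and the is-digit? and elem-max definitions are stand-ins written only to make the example runnable.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

;; Stand-in implementations, assumed from the test cases in the text.
(defn is-digit? [c]
  (<= (int \0) (int c) (int \9)))

(defn elem-max
  "Element-wise maximum over a collection of equally long vectors."
  [vectors]
  (apply mapv max vectors))

(defmacro my-deftest
  "Sketch: defines a test with one assertion per input/expected pair."
  [test-name f & cases]
  `(deftest ~test-name
     ~@(for [[input expected] (partition 2 cases)]
         `(is (= ~expected (~f ~input))))))

(my-deftest test-is-digit is-digit?
  \a false  \_ false  \Z false  \0 true  \5 true  \9 true)

(my-deftest test-elem-max elem-max
  [[-3 1 3] [1 2 -3]] [1 2 3])

;; Runs both tests in the current namespace and prints a summary.
(run-tests)
```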

The convention of this macro is that the 1st argument is the test name, the 2nd is the tested function, and the remaining arguments alternate between input values and their expected outcomes. Other important functions are tested in a straightforward manner as well. I hope this wall of text without any figures was at least somewhat interesting documentation. I'm really looking forward to using Clojure in future projects.

Related blog posts:

Mustache templates in Clojure, 2017 Jan (Matching: Blog, Clojure, GitHub, JVM)
Analyzing NYC Taxi dataset with Elasticsearch and Kibana, 2017 Mar (Matching: Clojure, GitHub, JVM)
English hyphenation algorithm in PHP, 2013 Jul (Matching: Blog, GitHub, Hyphenation)
Scalable analytics with Docker, Spark and Python, 2015 Dec (Matching: GitHub, JVM)
Benchmarking Elasticsearch and MS SQL on NYC Taxis, 2017 May (Matching: Clojure, GitHub)

English hyphenation algorithm in Clojure
