
Niko's Project Corner

Helsinki Deblur Challenge 2021

Description: Deblurring images with supervised learning
Languages: Python, Keras
Tags: Computer Vision
Duration: Summer 2021
Modified: 15th December 2021

The Finnish Inverse Problems Society (FIPS) organized the Helsinki Deblur Challenge 2021 during the summer and fall of 2021. The challenge is to "deblur" (deconvolve) images with a known amount of blur, and run the resulting image through an OCR algorithm. Deblurring results are scored based on how well the pytesseract OCR algorithm is able to read the text. The organizers also kindly provided unblurred versions of the pictures, so we can train neural networks using any supervised learning methods at hand. The network described in this article isn't officially registered to the contest, but since the evaluation dataset is also public we can run the statistics ourselves. Hyperparameter tuning got a lot more difficult once it started taking 12 - 24 hours to train the model. I might revisit this project later, but here its status is described as of December 2021. Had the current best network been submitted to the challenge, it would have ranked 7th out of the 10 (nine plus this one) participants. There is already a long list of known possible improvements at the end of this article, so stay tuned for future articles.

In total 17 teams registered to the contest, all of which seem to be from a university. I expected more individual participants, as is the case in most Kaggle competitions. The dataset is of very high quality, and clearly lots of thought went into its design. It consists of 20 different levels of blur, each of which has 200 pictures. Each photograph shows unique strings, photographed by two cameras. One camera is always kept in focus, while the other is progressively adjusted to be out of focus. More details can be found in their paper. A cropped example of the input, the target and a network's output is shown in Figure 1.

Figure 1: Example crops from the dataset at blur step 8: the blurred image, the image in proper focus and the output from a neural network. The output isn't very easy to read, but it is much better than the input.

In total there are 20 × 200 × 2 = 8000 images, each of which is 2360 × 1460 = 3.45 megapixels. In the TIFF format they take 57 GB of space (grayscale, 16 bits of precision), and the files seem to use lossless compression. To actually feed them to the neural network they need to be converted into a "raw" numpy array, typically in memory. We can save a significant amount of space by keeping just 8 bits of precision, so that each image takes 3.45 MB of RAM. Multiplied by 8000 images this still pushes the RAM requirement to 28 GB, which might not be feasible on a typical desktop. Luckily there are many ways around this.
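The arithmetic above is easy to double-check. The snippet below is only a back-of-the-envelope sketch using the numbers quoted in the text; the function name is made up for illustration.

```python
# Dataset dimensions quoted in the text.
n_levels, n_per_level, n_cameras = 20, 200, 2
width, height = 2360, 1460

n_images = n_levels * n_per_level * n_cameras
pixels_per_image = width * height

def dataset_bytes(bytes_per_pixel):
    """Size of the whole dataset as one uncompressed array."""
    return n_images * pixels_per_image * bytes_per_pixel

print(n_images)                      # 8000
print(pixels_per_image / 1e6)        # 3.4456 megapixels
print(dataset_bytes(1) / 1e9)        # roughly 27.6 GB at 8 bits per pixel
print(dataset_bytes(2) / 1e9)        # roughly 55.1 GB at 16 bits per pixel
```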

First we can reduce the image resolution, since each of the e-ink display's pixels seems to correspond to about 8 pixels in the photo. So each image can be scaled down by a factor between 1 : 4 and 1 : 8, reducing the memory requirement by 94 - 98%! A high-res dataset is nice, but we don't need that many pixels for the OCR algorithm to work. Actually even the official evaluation script scales images down by 50% to improve OCR results.
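The saving follows from the pixel count shrinking with the square of the scale factor. A tiny sketch (the helper name is hypothetical):

```python
def memory_saving(scale):
    """Fraction of pixels removed when both axes shrink by 1 : scale."""
    return 1.0 - 1.0 / scale**2

for scale in (4, 7, 8):
    print(f"1:{scale} -> {100 * memory_saving(scale):.1f} % fewer pixels")
```

A factor of 1 : 4 removes 15/16 of the pixels and 1 : 8 removes 63/64 of them, which is where the 94 - 98% range comes from.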

The other op­tion is to crop the im­ages fur­ther, leav­ing out the bor­ders which don't con­tain any text. This was orig­inally im­ple­mented, but it was later found out that the al­go­rithm con­fused blurred dark cor­ners (vi­gnetting, see Fig­ure 2) with heav­ily blurred text, and pro­duced dark arte­facts around the in­tended text area. This wouldn't be that much of an is­sue for a hu­man, but the OCR mis­took these for char­ac­ters and gen­er­ated more than the in­tended three lines of text as the out­put. This caused the score to plum­met to zero, since no char­ac­ters were cor­rectly matched with the known cor­rect out­put. This is ap­par­ent in some ex­am­ples of mod­els' out­puts, as shown in Fig­ure 3.

Figure 2: The mean and standard deviation of the dataset's pixels (scaled to between zero and one). Strong vignetting is apparent in the mean, which caused problems later. All images have three lines of text, but there are two variations of the font size and style.
Figure 3: Example outputs of different models, except the ones at the bottom right, which are the sharp and blurred versions of the dataset before pre-processing. The currently best model's output is at the top right corner. Arguably it has the sharpest letters and no black dots near the border.

From this point on the discovery process is not narrated in chronological order; rather, the article describes how the final version of the data processing pipeline and the neural network work.

Each image is scaled down by a factor of 1 : 7, dropping the resolution from 2360 × 1460 to 337 × 208. This leaves characters' details at a width of just 0.5 - 4 pixels, which might actually be a bit too small. To mitigate the issue with vignetting, the background brightness is estimated at each pixel of the input image and then subtracted. This is achieved by fitting a low-degree polynomial f(x, x^2, y, y^2, d), where x and y are the image coordinates and d is the distance from the image's center. Only pixel values near the border are used to fit this function, because they are known not to contain any text. A robust least-squares estimate is obtained by first fitting the model to all the data and then discarding the 10% of the data which fits the model worst. These steps are shown in Figure 4. Not all of the 30,000 border pixels (about 40% of the image area) are needed to fit such a simple function; a sufficiently large sample of 400 - 450 pixels was used.
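The background estimation can be sketched with plain numpy as follows. This is not the original implementation: the border width of 30 pixels, the coordinate normalization, the random sampling and the function names are assumptions made for illustration, and the trimming is done in a single pass (fit, drop the worst 10%, fit again).

```python
import numpy as np

def design_matrix(x, y, cx, cy):
    """Columns 1, x, x^2, y, y^2, d; d is the distance from the centre."""
    d = np.hypot(x - cx, y - cy)
    return np.stack([np.ones_like(x), x, x**2, y, y**2, d], axis=1)

def estimate_background(img, border=30, n_samples=450, drop=0.10, seed=0):
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    s = max(h, w)                       # common scale keeps d isotropic
    x, y = (xx / s).ravel(), (yy / s).ravel()
    cx, cy = (w - 1) / 2 / s, (h - 1) / 2 / s
    # Assumed border region: everything within `border` pixels of an edge.
    is_border = ((xx < border) | (xx >= w - border) |
                 (yy < border) | (yy >= h - border)).ravel()
    rng = np.random.default_rng(seed)
    idx = rng.choice(np.flatnonzero(is_border), size=n_samples, replace=False)

    A, v = design_matrix(x[idx], y[idx], cx, cy), img.ravel()[idx]
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    # Discard the worst-fitting samples and fit once more.
    resid = np.abs(A @ coef - v)
    keep = resid <= np.quantile(resid, 1.0 - drop)
    coef, *_ = np.linalg.lstsq(A[keep], v[keep], rcond=None)
    return (design_matrix(x, y, cx, cy) @ coef).reshape(h, w)

# Usage on a synthetic image: radial vignetting plus a dark "text" block.
h, w = 208, 337
yy, xx = np.mgrid[0:h, 0:w]
s = max(h, w)
true_bg = 0.8 - 0.3 * np.hypot(xx / s - (w - 1) / 2 / s, yy / s - (h - 1) / 2 / s)
img = true_bg.copy()
img[90:120, 100:240] -= 0.5
flattened = img - estimate_background(img)
```

Because only border pixels are sampled, the dark block in the middle does not influence the fit, and subtracting the estimate leaves a flat background.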

Figure 4: Steps of the background normalization (aka. vignetting elimination) are: the input image (top left), border area sampling (top right), vignetting estimation (bottom left, 6x exaggerated) and the resulting image (bottom right).

If the input is just a grayscale image, the first few convolution layers wouldn't have that many interesting features to extract. This network utilizes non-linear median filters with widths of 3, 5, 7 and 9 pixels to augment the input image. Not much hyperparameter tuning was done on this, but overall it improved results. Examples of these median-filtered images are shown in Figure 5. Maybe this many small steps is overkill, and widths of just 3 and 7 pixels would be sufficient. In hindsight it seems obvious that these filters have the most effect on the least blurred images, and those are the easiest to deblur anyway.
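A minimal numpy-only sketch of this augmentation is shown below. The article doesn't say which library or border handling was used, so the edge padding and the function names here are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter(img, width):
    """Square median filter with edge padding (width must be odd)."""
    pad = width // 2
    padded = np.pad(img, pad, mode="edge")
    windows = sliding_window_view(padded, (width, width))
    return np.median(windows, axis=(-2, -1))

def add_median_channels(img, widths=(3, 5, 7, 9)):
    """Stack the image and its median-filtered versions along the last axis."""
    channels = [img] + [median_filter(img, w) for w in widths]
    return np.stack(channels, axis=-1)

# A 2D grayscale image becomes a 5-channel network input.
stack = add_median_channels(np.random.default_rng(0).random((208, 337)))
print(stack.shape)  # (208, 337, 5)
```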

Figure 5: Blurred images from steps 0 to 9 are shown in the top row, and their corresponding median-filtered versions (kernel widths of 3, 5, 7 and 9 pixels) are shown below.

There is also a significant amount of preprocessing done to the target "ground truth" (sharp) images. Their background is not completely white, but rather at 80 - 85% brightness. However the model's last activation is sigmoid, which scales the output to a number between zero and one. Also, in previous projects (at least on a face VAE, not in this blog yet) it was discovered that binary_crossentropy produces sharper images than MSE. So the target images were re-scaled so that 75% of the pixels are at 100% brightness. Again mild vignetting started causing issues, but in this context it was mitigated in a very different manner than with the blurred images.
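One way to implement such a re-scaling is to divide by the 25th percentile of brightness and clip, so that the brightest 75% of the pixels saturate at 1.0. This is a guess at the approach rather than the original code; it assumes black is already at zero, and the function name is hypothetical.

```python
import numpy as np

def rescale_target(img, white_fraction=0.75):
    """Scale brightness so that `white_fraction` of the pixels saturate at 1.0."""
    white_level = np.quantile(img, 1.0 - white_fraction)
    return np.clip(img / white_level, 0.0, 1.0)

# Synthetic example with a grayish background around 80 - 85 % brightness.
rng = np.random.default_rng(0)
target = rng.uniform(0.80, 0.85, size=(100, 100))
target[40:60, 20:80] = 0.0           # dark "text" covering 12 % of the pixels
scaled = rescale_target(target)
```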

On the target image, non-white pixels (e.g. black or grayscale) are allowed only near completely black pixels. Vignetting isn't so extreme that the corners would have even a single black pixel, only gray ones. These are still suboptimal for the network's loss function and the OCR step, so it is best to fix them. This simple heuristic works very well, as is shown in Figure 6.
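The heuristic can be sketched as a dilation of the "black" mask: any pixel that has no (nearly) black pixel within some radius is forced to white. The threshold, the radius and the square neighbourhood below are assumptions, since the article doesn't give the exact values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def whiten_far_from_black(img, black_below=0.05, radius=10):
    """Force pixels to 1.0 unless a (nearly) black pixel lies within `radius`."""
    black = img < black_below
    padded = np.pad(black, radius, mode="constant", constant_values=False)
    size = 2 * radius + 1
    # True wherever the (size x size) neighbourhood contains a black pixel.
    near_black = sliding_window_view(padded, (size, size)).any(axis=(-2, -1))
    return np.where(near_black, img, 1.0)

# Gray "vignetted" image with one black stroke pixel and a gray edge pixel.
demo = np.full((40, 60), 0.9)
demo[20, 30] = 0.0
demo[20, 33] = 0.5
fixed = whiten_far_from_black(demo, radius=5)
```

In this toy example the gray corners become white, while the gray pixel three columns from the black one keeps its value.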

Figure 6: Target image preprocessing steps: the original (left), brightness-adjusted (middle) and vignetting-corrected (right). The gray borders of the right-most image are there just to visualize the region in which grayscale pixels are allowed; outside them (most importantly in the corners) all pixels must be at 100% brightness. The resulting image consists of mostly 100% black or 100% white pixels, which is the best target for the binary_crossentropy loss.
Figure 7: 3 x 5 cropped areas of the input image. The shown rectangles have a random jitter in their position so that they don't all overlap.

Training the network puts a very heavy load on the GPU, and caused UI programs to lag. This isn't ideal when the computer is also used for other activities, so each training step was made lighter by splitting each image into 3 × 5 = 15 overlapping parts. These are shown in Figure 7. Overlap is very important since the Conv2D layers don't use any padding, so each subsequent layer reduces the output resolution. Overlapping the regions ensures that each available pixel contributes to the loss function, although some contribute more than others since they are part of multiple cropped samples. Each cropped sample has a resolution of 110 × 110 pixels. Since the dataset has 4000 input images and we are adding four median-filtered versions of each image, the input data to the model has dimensions of 60e3 × 110 × 110 × 5. But actually the model had trouble converging when the most blurred images were also included in the dataset, so at this stage the model is only trained with blur levels up to 9, and there are 30e3 samples in total. 10% of the data is used for validation (which is distinct from the competition's validation dataset).
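The cropping can be sketched as below. The evenly spaced crop origins are an assumption, and the random jitter shown in Figure 7 is left out for brevity; the function names are hypothetical.

```python
import numpy as np

def crop_origins(length, crop, count):
    """Evenly spaced start offsets so that the crops span the full axis."""
    return np.round(np.linspace(0, length - crop, count)).astype(int)

def split_into_crops(img, crop=110, rows=3, cols=5):
    """Cut a (H, W, C) image into rows x cols overlapping square crops."""
    h, w = img.shape[:2]
    return np.stack([img[y:y + crop, x:x + crop]
                     for y in crop_origins(h, crop, rows)
                     for x in crop_origins(w, crop, cols)])

crops = split_into_crops(np.zeros((208, 337, 5), dtype=np.uint8))
print(crops.shape)   # (15, 110, 110, 5)
print(4000 * 15)     # 60000 samples when all 20 blur levels are used
print(2000 * 15)     # 30000 samples for blur levels 0 - 9
```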

The network consists of three distinct stages: "preprocessing", "iterative refinement" and "postprocessing". The only novel(?) part is the refinement step; the other two are just basic Conv2D and BatchNormalization steps. One twist is that the network takes two inputs: the image (along with median-blurred versions of it), and the blur-step number. The model uses Dense layers at the preprocessing stage, which use the step to tune the magnitude of extracted features. Each Dense layer's activation is sigmoid so its output is between 0 and 1, and this output is multiplied with the Conv2D's output.

The re­fine­ment stage uses a sin­gle Conv2D with a tanh ac­ti­va­tion. The net­work is kind of a mix be­tween a ResNet and RNN. Each it­er­ation step's out­put is a weighted mean be­tween the net­work's pre­vi­ous out­put and the shared (re­cur­rent) layer's out­put. (edit: as I was writ­ing this I learned about High­way net­works which have the same idea.) The weight is ad­justed by Dense lay­ers, which again adapt the net­work to the blur level. This way the net­work has only rel­atively few pa­ram­eters to fit, and it shouldn't over­fit as eas­ily as a nor­mal deeply nested con­vo­lu­tional net­work.

The post­pro­cess­ing step is a sim­ple con­vo­lu­tional net­work with a sin­gle sig­moid ac­ti­va­tion at the out­put, since we are pro­duc­ing black & white im­ages.

The net­work has four im­por­tant hy­per­pa­ram­eters to tune: the num­ber of ker­nels (di­men­sions) at the re­fine­ment stage, the num­ber of in­ter­me­di­ate ker­nels, the num­ber of re­fine­ment it­er­ations and the width of the shared Conv2D layer. These pa­ram­eters haven't been ex­ten­sively tuned, since it takes at least 12 hours to train the model with a GTX 1080 Ti GPU.

    import tensorflow
    from tensorflow.keras import Input, Model
    from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense

    # Xs[0] is the training input array of shape (N, 110, 110, 5). It comes
    # from the data pipeline and is not defined in this snippet.
    BN = BatchNormalization
    dinit = lambda shape, dtype=None: 0.01 * tensorflow.random.normal(shape, dtype=dtype)
    im, step = Input(Xs[0].shape[1:]), Input((1,))
    dim, subdim, n_refinements, shared_conv_w = 32, 32, 12, 5
    # Each unpadded pass of the shared layer trims this many pixels per side.
    crop = (shared_conv_w - 1) // 2

    # Step-dependent gate in the range 0 - 1, broadcast over the image axes.
    D = lambda i=dim: Dense(i, activation='sigmoid',
                            kernel_initializer=dinit)(step)[:,None,None,:]

    # This is the most important layer, it is used at the refinement stage.
    shared = Conv2D(dim, shared_conv_w, activation='tanh')

    # Preprocessing
    x = im
    x = BN()(Conv2D(subdim, 3, activation='elu')(x))
    x = x * D(x.shape[-1])
    # These weighted geometric means make more sense if dim != subdim.
    x = BN()(Conv2D(int((dim**1 * subdim**3)**(1/4)), 3, activation='elu')(x))
    x = x * D(x.shape[-1])
    x = BN()(Conv2D(int((dim**2 * subdim**2)**(1/4)), 3, activation='elu')(x))
    x = x * D(x.shape[-1])
    x = BN()(Conv2D(int((dim**3 * subdim**1)**(1/4)), 3, activation='elu')(x))
    x = x * D(x.shape[-1])
    x = Conv2D(dim, 3, activation='tanh')(x)
    x = x * D(x.shape[-1])

    # Refinement: weighted mean of the previous output and the shared layer.
    for _ in range(n_refinements):
        d = D()
        x = x[:,crop:-crop,crop:-crop,:] * (1 - d) + shared(x) * d

    # Postprocessing
    x = BN()(Conv2D(int((subdim * dim)**0.5), 1, activation='elu')(x))
    x = BN()(Conv2D(8, 1, activation='elu')(x))
    x = Conv2D(1, 1, activation='sigmoid')(x)

    model = Model([im, step], x)

The model's summary is shown below, with some uninteresting layers omitted, such as BatchNormalization:

    ___________________________________________________________________________________
    Layer (type)            Output Shape            Params   Connected to
    ===================================================================================
    # Input image
    input_7 (InputLayer)    [(None, 110, 110, 5)]   0
    # Input step number
    input_8 (InputLayer)    [(None, 1)]             0
    # Preprocessing
    conv2d_28 (Conv2D)      (None, 108, 108, 32)    1472     input_7[0][0]
    dense_45 (Dense)        (None, 32)              64       input_8[0][0]
    conv2d_29 (Conv2D)      (None, 106, 106, 32)    9248     tf.math.multiply_75[0][0]
    dense_46 (Dense)        (None, 32)              64       input_8[0][0]
    conv2d_30 (Conv2D)      (None, 104, 104, 32)    9248     tf.math.multiply_76[0][0]
    dense_47 (Dense)        (None, 32)              64       input_8[0][0]
    conv2d_31 (Conv2D)      (None, 102, 102, 32)    9248     tf.math.multiply_77[0][0]
    dense_48 (Dense)        (None, 32)              64       input_8[0][0]
    conv2d_32 (Conv2D)      (None, 100, 100, 32)    9248     tf.math.multiply_78[0][0]
    dense_49 (Dense)        (None, 32)              64       input_8[0][0]
    # Refinement
    conv2d_27 (Conv2D)      multiple                25632    tf.math.multiply_79[0][0]
    dense_51 (Dense)        (None, 32)              64       input_8[0][0]  # repeated 12 times
    # Postprocessing
    conv2d_33 (Conv2D)      (None, 52, 52, 32)      1056     tf.__operators__.add_41[0][0]
    conv2d_34 (Conv2D)      (None, 52, 52, 8)       264      batch_normalization_22[0][0]
    conv2d_35 (Conv2D)      (None, 52, 52, 1)       9        batch_normalization_23[0][0]
    ===================================================================================
    Total params: 67,185
    Trainable params: 66,849
    Non-trainable params: 336
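The output resolution and the per-layer parameter counts can be verified with a few lines of arithmetic. This is just a sanity-check sketch with made-up helper names; it relies only on the standard formulas for unpadded convolutions and Dense layers.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k Conv2D layer."""
    return k * k * c_in * c_out + c_out

def valid_output(size, kernel, layers=1):
    """Spatial size after `layers` unpadded ('valid') convolutions."""
    return size - layers * (kernel - 1)

size = valid_output(110, 3, layers=5)      # preprocessing: five 3x3 convs
size = valid_output(size, 5, layers=12)    # refinement: twelve 5x5 passes
print(size)                                # 52

print(conv_params(3, 5, 32))    # 1472, first preprocessing layer
print(conv_params(3, 32, 32))   # 9248, the other 3x3 preprocessing layers
print(conv_params(5, 32, 32))   # 25632, shared refinement layer
print(1 * 32 + 32)              # 64, each Dense layer fed by the step number
```

Since the shared layer is applied twelve times without padding, each 110 × 110 input crop yields only a 52 × 52 output, which is why the crops need to overlap.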

After training with the Adam optimizer (learning rate = 10^-2.5 ≈ 0.0032) for 12 hours the binary_crossentropy error was 0.0833169 for the training set and 0.1285431 for the validation set. The model was still converging, and better results would be obtained by letting the training run for longer. It is a bit concerning, though, that the validation error is so noticeably higher than the training error.

An example of the model's output is shown in Figure 8, for varying blur levels (aka. steps). Figures 9 - 16 show how the "step" parameter affects the output. It seems very important that this parameter is set correctly. If this network were used to deblur images with an unknown blur level, the whole range would need to be tested and the best one picked manually or based on some heuristic.

Figure 8: Examples of the cropped input image, the target sharp image and the model's output. Inputs have a blur step between 3 and 9.
Figure 9: Example outputs with a blur level 2. The correct deblur level is indicated by a gray background in this and the other examples.
Figure 10: Example outputs with a blur level 3.
Figure 11: Example outputs with a blur level 4.
Figure 12: Example outputs with a blur level 5. This illustrates nicely what happens if the network is told to deblur too small or too large details. With the correct deblur level the output is very easy to read, although it isn't perfectly crisp either.
Figure 13: Example outputs with a blur level 6.
Figure 14: Example outputs with a blur level 7.
Figure 15: Example outputs with a blur level 8.
Figure 16: Example outputs with a blur level 9.

The contest (or "challenge") was overall very well organized, and the validation dataset came with a twist: unlike the previously released dataset, which only contained alphabetical characters, the new dataset also had digits. This means that the network cannot be trained to always produce alphabetical outputs, but rather it has to be a more general algorithm. Full results can be found on the website, but in conclusion, out of the 17 teams which registered to the contest, only nine submitted their final results according to the rules (on time and open sourced on GitHub). If I had entered the contest it would have made ten of us, and the current (unoptimized) solution would have been in 7th place, beating three submitted solutions. Most of the solutions could deblur images up to step 10 and beyond, and the top two reached the amazing steps of 18 and 19, which look just plain impossible!

Figure 17: OCR accuracy results of the training dataset. Note that the largest used blur was step 9, but already there the mean accuracy (60.4%) was below the competition's threshold of 70%. The x-axis shows different percentiles, except that the values at 50 are replaced with the mean.
Figure 18: OCR accuracy results of the validation dataset (10% of the used data). Here the model could pass blur step 7, with a mean accuracy of 76.2%.
Figure 19: OCR accuracy results of the competition's validation dataset. As with the other validation dataset, here the model could pass blur step 7, with a mean accuracy of 77.0%. This indicates that the network didn't overfit to alphabetical inputs but can also correctly deblur digits. On the other hand the validation error is noticeably higher than the training error.

A summary of the results is shown in Table 1. There is some random variation between tightly contested models, and their rank depends on which blur step is examined. At steps 4, 6 and 8 the algorithm described in this article was ranked 5th, 6th and 9th from the top.

Table 1: Competition results (the stop step) and mean scores at blur steps 4, 6 and 8. *Note that this network was not actually submitted to the competition. But the official validation set was not used to train it or to do hyperparameter tuning, so the comparison is fair.

    Rank  Team  University                            Stop  Step 4       Step 6       Step 8
                                                            score  rank  score  rank  score  rank
    1st   15_A  Technische Universität Berlin, GER    19    94.53  3     94.03  2     93.12  1
    2nd   12_B  National University of Singapore      18    94.75  2     92.62  3     92.62  2
    3rd   01    Leiden University, NL                 14    94.92  4     91.75  4     91.65  3
    4th   11_C  University of Bremen, GER / UK        10    92.22  6     87.78  5     81.25  5
    5th   06    Heinrich Heine Uni. Düsseldorf, GER   10    96.40  1     94.33  1     85.92  4
    6th   13    Federal University of ABC, BRA        7     91.45  7     71.12  8     67.12  7
    7th*  -     Niko Nyrhilä, FIN                     7     94.50  5     81.08  6     60.80  9
    8th   16_B  Technical University of Denmark       6     90.20  8     76.45  7     68.35  6
    9th   04    Dipartimento di Sciente Fisiche, ITA  5     82.45  9     68.00  9     62.85  8
    10th  09_B  University of Campinas, BRA           2     16.15  10    6.33   10    2.27   10

There remain several areas to be improved upon, some of which might result in a higher rank:

  • More hyperparameter tuning, including different activation functions.
  • Train the model for as long as it takes, do not limit it to 12 hours.
  • Speed up the training by showing fewer samples of the easier blur levels, since the model architecture seems to interpolate between them very well. Maybe this could be a dynamic process which happens automatically during training?
  • Switch to a lower learning rate upon initial convergence, do not use the fixed 10^-2.5.
  • Use further blur steps for training, do not limit to steps 0 - 9. If convergence is an issue one could try to adjust sample weights so that initially the least blurred images are weighted more. More blurred images would be used mainly for fine-tuning.
  • Down-scale images by a smaller factor than 1 : 7, so that character details are black & white rather than grayscale.
  • Include the sharp images' ∂x and ∂y as an additional target for the network's output, maybe it would provide stronger gradients and the network would learn faster?
  • Reduce overfitting by data augmentation, ideally the network would be rotation-invariant.
  • Add dropout, either the standard or the 2D variant. The dropout rate is yet another hyperparameter.
  • Try regularization as well, especially on the shared layer which has the most parameters.
  • The current pre-processing steps aren't ideal, the model should learn not to get confused by vignetting.
  • Use another network architecture like a U-net to get local and global context, although it is closer to cheating at this point since such a network was the contest winner.

But if it takes 24 hours to train the net­work, it will take quite a long time to find the op­ti­mal mix of even some of these op­tions. At least it is triv­ial to par­al­lelize, and con­sumer GPUs are still get­ting faster. The used GTX 1080 Ti is from 2017, so it is al­ready 4 years old as of 2021.

