Engineering Reference Check - Part Two

Read Part One first.

So some of you (in my imaginary future audience) may be wondering about the results of my first phone call as a reference. Before I found out, I got an email from the other company where my previous co-worker had listed me as a reference.

Here were the questions:

  1. Relationship: Please describe your current and/or past professional relationship with this individual (co-worker, supervisor, instructor, client, industry colleague, etc.)

  2. The Job Description and Work Environment: Please provide a summary of his/her duties and responsibilities in the position you knew him/her best.

  3. The Applicant’s Competencies: (Primary strengths, skills, talents, etc.) Points of interest: communication skills, interpersonal skills, follows direction, punctuality, time management, dealing with stress or frustrations. Please provide examples if available.

  4. Overall Job Performance: How well did this person satisfy the objectives of the position in which you knew him/her?

  5. Areas for Improvement: Please describe any areas where you feel this person could improve, develop, or benefit from additional training, education, or experience.

  6. Reason for Job Separation: Why did this person leave his/her employment at the company from which you knew him/her?

  7. Rehire and/or Recommendation Comments: Do you have any hesitation in recommending this person for a position requiring a high degree of responsibility or public trust?

  8. Closing comments?

Answering these questions over email was much easier than on the phone since I could take my time to eloquently express my thoughts. In fact, I wish I had gotten this email first so that I could have taken the time to talk about my former teammate’s strengths and come up with concrete examples. In addition, it would have helped me answer the nuclear questions (#5 and #6) with a more positive spin.

Before I replied, I received a message from my referencee. (I know referencee isn’t a word but the English language needs a word for the person who asks you to be their reference.)

Here was his exact message:

THANKS FOR THE REC! IM PRETTY SURE THAT IT HELPED A LOT BECAUSE I BOMBED THE ONSITE!!!!

At first I told myself that he had done so well that he got the job in spite of my stuttering and lack of preparedness. Apparently the reference carried more weight than I thought, but all’s well that ends well!

How to Prepare for an Engineering Reference Call

Yesterday, I had my first call as a reference for one of my previous co-workers. He had done a great job on our team and I hoped to convey that in the call. However, like in my first technical interview, nerves got the better of me as I struggled to come up with concrete examples on the spot. Here are some tips to avoid making the same mistakes that I did.

  1. Research who is calling you.

I had looked up the company ahead of time, but a simple Google search would have also revealed that I was talking to the founder and not an HR representative.

  2. Think about a time you disagreed with the person and how you were able to work it out.

I was asked to give an example of a time I disagreed with my former co-worker. I was caught off guard and stumbled to come up with a good example: although we disagreed often, I had never been upset about it because we respected each other’s opinions. If that is the case for you, instead of fumbling for an answer, you can simply say so.

However, had I anticipated this question, I could have taken it one step further and recalled the time he was emotionally committed to an approach. He disagreed with our CTO at first, but after listening to the explanation he quickly adopted the suggested tool, learning it in his own free time. This is the classic move of spinning a question designed to surface a weakness into a story about the candidate’s strength.

  3. Why did this person leave?

In this case, it was related to revenue cuts caused by the coronavirus, and I never knew the true reason. However, this would have been a great time to talk through all the conspiracy theories we invent when dealing with survivor’s guilt. For example, were they simply unlucky to have just been switched to another project, or to a manager with whom they hadn’t yet fostered a sense of loyalty? Or were they willing to stand up for what they thought was right and unfairly punished for it?

  4. Tell the truth by using your imposter syndrome.

For once, imposter syndrome can be a good thing! Based on my conversations with friends and coworkers, it seems that most people deal with some degree of imposter syndrome, often as self-doubt created by comparing yourself to others. While you would normally try to hide that in an interview, this is the ideal time to talk about it! What are all the amazing things this person did that you were envious of? What characteristics did you try to emulate? Where did this teammate help cover one of your weaknesses on a project?

Example:

I really struggled with cleaning up the code for a complex SQL query that we were building in Typescript. However, in that moment, [Name] stepped up, refactored and redesigned the code to make it more composable, and reduced the line count. This change made the query far easier to maintain.

  5. Take some time to think about his or her true strengths.

At the surface level you can reach for adjectives such as “extremely productive” and talk about how they churned through Jira cards, but take the time to dig deeper; you might be nervous on the phone, and you don’t want to sound like a broken record repeating the same things. If they did a good job, there will be plenty of examples to think of. But it is a lot easier to brainstorm at your leisure in the shower or during a workout than while you are being interrogated.

After the phone call, I remembered that one of this person’s strengths was his ability to put on headphones, start hammering away on his Kinesis keyboard, and get extended periods of deep work done. Although I wish I had thought of it beforehand, I shot a quick email to the person who had called me to mention that I had not fully verbalized this thought during our talk.

  6. Prepare for other hard questions, including the follow-ups that might come up based on your responses.

For example, I was asked to stack rank my coworker against the rest of the engineers in the company. Although I honestly ranked him very highly for his level of experience and said so, I was challenged to answer “How did this person’s greenness manifest?” It is impossible to anticipate all of these questions but at least you can prepare for the ones that caught me off guard.

What other questions did I miss? Leave them in the comments below please for posterity’s sake!

Quantity Beats Quality?

There is a claim that “Quantity Always Trumps Quality”.

The ceramics teacher announced on opening day that he was dividing the class into two groups. All those on the left side of the studio, he said, would be graded solely on the quantity of work they produced, all those on the right solely on its quality.

His procedure was simple: on the final day of class he would bring in his bathroom scales and weigh the work of the “quantity” group: fifty pounds of pots rated an “A”, forty pounds a “B”, and so on. Those being graded on “quality”, however, needed to produce only one pot – albeit a perfect one – to get an “A”.

Well, came grading time and a curious fact emerged: the works of highest quality were all produced by the group being graded for quantity. It seems that while the “quantity” group was busily churning out piles of work – and learning from their mistakes – the “quality” group had sat theorizing about perfection, and in the end had little more to show for their efforts than grandiose theories and a pile of dead clay.

I heard this story, which comes from a book, retold on a James Clear podcast that I had “temptation bundled” with playing some Pokemon Sword.

I immediately thought back to the time my boss challenged me to take one good picture every day to get better at photography and develop an artistic side I could apply to software engineering. However, I got so hung up on what counted as a good picture that it never became a habit. I get hung up on the same thing when trying to write or work out consistently.

So does quantity trump quality in software engineering especially with the emphasis on clean code and reducing technical debt?

At work I have noticed that when I am more aggressive about building software and talking about design, I put myself in a position to make more mistakes and be embarrassed about looking stupid. However, these ego-crushing moments are actually the best learning experiences, and the more I get now, the better. For example, I now remember that std::collections::VecDeque is a ring buffer after misspeaking and being corrected.

Another example is that we’ve had to rebuild our stream processing “reusable infrastructure” multiple times. Even though we all want to design our software to be future-proof, design requirements change, and every piece of software has a purpose it is best suited to.

For example, my first full time job was to work on a stream processing platform. Whereas delivering augmented video in 30 minutes was the cutting edge the first year, that time was cut to 3 minutes the next year, and then sub-second the next. In the first case, it might be best to optimize for individual team productivity by building around a RabbitMQ message passing system for multiple services to interact with. However in the sub-second case, message passing between services on different servers was out of the question and all these services would need to be on the same machine.

Thus, by focusing on quantity, we can eventually achieve greater quality than if we focused on quality to begin with.

Atomic Habit

What if every day I focused on writing one line of reusable code?

In 2001, researchers in Great Britain began working with 248 people to build better exercise habits over the course of two weeks. The subjects were divided into three groups.

The first group was the control group. They were simply asked to track how often they exercised.

The second group was the “motivation” group. They were asked not only to track their workouts but also to read some material on the benefits of exercise. The researchers also explained to the group how exercise could reduce the risk of coronary heart disease and improve heart health.

Finally, there was the third group. These subjects received the same presentation as the second group, which ensured that they had equal levels of motivation. However, they were also asked to formulate a plan for when and where they would exercise over the following week. Specifically, each member of the third group completed the following sentence: “During the next week, I will partake in at least 20 minutes of vigorous exercise on [DAY] at [TIME] in [PLACE].”

In the first and second groups, 35 to 38 percent of people exercised at least once per week. (Interestingly, the motivational presentation given to the second group seemed to have no meaningful impact on behavior.) But 91 percent of the third group exercised at least once per week, more than double the normal rate.

He also goes on to describe how each one of these decisions is a vote for the identity you are building; cast enough votes and that identity eventually becomes who you are.

Every day, I will write at least 1 line of reusable code at 9:30 am in my room. I am a 10x engineer. (I am not there yet but will keep saying it to myself until it becomes my identity.)

Also, building reusable code is a good way to get future projects done in a weekend! -_xj7x on Hacker News

Alias Generic Type Constraints in Rust

Code Smell

While experimenting in Rust, I found myself often using the same trait bounds over and over.

use std::{cmp, hash, fmt};

pub struct Chunk<T: cmp::Eq + hash::Hash + fmt::Display + Copy + ChunkKey> { ... }

async fn receive_chunks<
  T: cmp::Eq + hash::Hash + fmt::Display + Copy + ChunkKey,
  U: cmp::Eq + hash::Hash + fmt::Display + Copy + ChunkKey,
>(...) { ... }

Repeating these trait bounds everywhere meant I was duplicating myself unnecessarily and would pay an increased maintenance cost whenever I needed to add or remove a bound.

The Solution

To solve this, you can declare a new trait that requires all of these bounds as supertraits, add a blanket implementation for every type that satisfies them, and then use this “alias” in place of the full list of bounds.

use std::{cmp, hash, fmt};

pub trait DropKey: cmp::Eq + ChunkKey + hash::Hash + Copy + fmt::Display {}
impl<T> DropKey for T where T: cmp::Eq + ChunkKey + hash::Hash + Copy + fmt::Display {}

pub struct Chunk<T: DropKey> { ... }

async fn receive_chunks<T: DropKey, U: DropKey>(...) { ... }

Tools for Learning Rust

“the book”

Affectionately nicknamed “the book,” The Rust Programming Language will give you an overview of the language from first principles. You’ll build a few projects along the way, and by the end, you’ll have a solid grasp of the language.

Although many seasoned developers may prefer to jump straight into experimenting with Rust programs, it is still worthwhile to read the chapters on ownership and lifetimes first. Coming from professional experience with Elm and Typescript and side projects in Elixir and Haskell, I had already used functional programming, but it was still worth reviewing that section as well.

cargo-watch

https://github.com/passcod/cargo-watch

cargo install cargo-watch

cargo watch -x "test -- --nocapture"

cargo-watch is a useful command line tool for shortening your iteration cycle. Just run cargo watch -x test in one of your terminal windows and it will fetch new packages, rebuild, and rerun your tests every time you save a file.

If you don’t want to keep changing the command you pass to cargo watch, you can have it execute a wrapper script; then you only need to update the script instead of pressing Ctrl-C and restarting cargo watch in its terminal pane.

For example, run cargo watch -s './run.sh'

#!/bin/bash
# run.sh

set -e
export RUST_LOG="sample_app=info"
# avoid printing rustc warnings which are probably also visible in your editor
cargo rustc --bin sample-app -- -Awarnings && \
  target/debug/sample-app sample-arg.txt

cargo-edit

https://github.com/killercup/cargo-edit

If you ever find yourself wishing for an easier way to add packages without having to look up the version and then opening your Cargo.toml and adding a line, look no further!

Run cargo install cargo-edit and then you can run cargo add [dependency], cargo rm [dependency] and even cargo update to update all your dependencies.

clippy

https://github.com/rust-lang/rust-clippy

Clippy is a linting tool to help improve your code. After installing it, you can run cargo clippy to see the suggestions and cargo fix -Z unstable-options --clippy to automatically apply the suggestions. If you use VS Code, you can also enable Clippy suggestions in your settings in order to view the Clippy suggestions and explanations in your code.

Failing With Rabbitmq

Last post, we went over RabbitMQ best practices. This time I’ll go over how we’ve failed to follow them in development and in production, the side effects we observed, and how we resolved each issue.

Unlimited Prefetch

One of the first issues we encountered was using the default prefetch which is unlimited. Once we tried load testing our application, we saw the consumer run out of memory and then restart. The prefetch should always be set for consumers. The rule of thumb is to set the prefetch value to (the total round trip time) / (processing time on the client for each message) if you have a single consumer or a prefetch of 1 if there are multiple consumers and/or slow consumers.

For example if it takes 10 ms for the message to be delivered to the consumer, 20 ms to process the message, and 10 ms to send the acknowledgement, the prefetch should be (10 + 20 + 10 ms) / (20 ms) = 2. However, for multi-threaded consumers, the prefetch should be larger than this rule of thumb since more than one message can be processed simultaneously.
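
If you are consuming from Node, the prefetch is a single channel setting. Here is a minimal sketch using amqplib; the client library, connection URL, queue name, and handler are assumptions for illustration:

import * as amqp from "amqplib";

// Placeholder for the real work done per message.
async function processMessage(body: Buffer): Promise<void> {
  console.log("processing", body.toString());
}

export async function startConsumer(): Promise<void> {
  const connection = await amqp.connect("amqp://localhost");
  const channel = await connection.createChannel();

  // (10 ms delivery + 20 ms processing + 10 ms ack) / 20 ms processing = 2
  await channel.prefetch(2);

  await channel.consume("work-queue", async (msg) => {
    if (msg === null) return; // consumer was cancelled by the broker
    await processMessage(msg.content);
    channel.ack(msg);
  });
}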

RabbitMQ Sizing

Another failure case we encountered was the crash of RabbitMQ due to too many connections for the instance size. We had doubled the connections we were making without checking that our RabbitMQ instance could handle the additional connections. This can be monitored using the metric monitors on the RabbitMQ Management UI. When scaling up the number of consumers and messages, it would be a good idea to first assess whether or not a larger instance size would be required based on these metrics.

After this incident, we bumped the instance size and moved over to a two-node cluster for failover.

The RabbitMQ plans can be found here.

Requeueing Failed Messages

Requeueing failed messages, by nack’ing them without setting requeue to false or by throwing an error, causes them to be redelivered. These messages will be resent to the consumer until they are rejected with requeue = false or they are successfully processed. This can leave messages stuck in a failure loop and can be catastrophic if there are side effects before the consumer fails.

We protected against this by retrying once before rejecting/n’acking with requeue = false and then logging these messages for debugging.
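
Here is a rough sketch of that pattern with amqplib; the queue name and handler are placeholders, and the retry happens in-process rather than via broker redelivery:

import * as amqp from "amqplib";

// Hypothetical message handler; throws when processing fails.
async function handle(body: Buffer): Promise<void> {
  // ... do the actual work ...
}

export async function consumeWithSingleRetry(channel: amqp.Channel): Promise<void> {
  await channel.consume("work-queue", async (msg) => {
    if (msg === null) return;
    try {
      await handle(msg.content);
      channel.ack(msg);
    } catch (firstError) {
      try {
        await handle(msg.content); // retry once
        channel.ack(msg);
      } catch (secondError) {
        // Log for debugging, then reject without requeueing to avoid a failure loop.
        console.error("dropping message after retry", secondError, msg.content.toString());
        channel.nack(msg, false, false);
      }
    }
  });
}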

Too Many Connections

Earlier we discussed running into resource constraints due to having too many connections. Another cause could be a bug in an application making more and more connections. We had an API which was making new connections for each request when it should have been reusing a single connection. Fortunately we were monitoring metrics and caught this issue soon after it was deployed and before our alerts were triggered.

Having alerting set up through one of the RabbitMQ integrations with Datadog, Kibana, etc. can help catch these situations early. Having a development environment with its own RabbitMQ server would also serve to catch these issues before they get deployed to production.

RabbitMQ Management UI Issues

Another issue we ran into was the RabbitMQ Management UI failing to load: after login, the headers would be visible with no data in the body. We could sometimes work around it by logging in through the CloudAMQP management dashboard. We also realized that the logstream queue was missing a consumer, which meant logs weren’t being ingested. Although we never diagnosed the root cause, a server restart resolved the issue.

RabbitMQ Best Practices

Choosing your prefetch

The rule of thumb is to set the prefetch value to (the total round trip time) / (processing time on the client for each message) if you have a single consumer.

The default prefetch is unlimited, which is a problem for high availability: a consumer crashing with unlimited prefetch causes a lot of message redelivery. It also causes performance issues since all unacknowledged messages are stored in RAM on the broker.

With multiple consumers and/or slow consumers, you probably want a lower prefetch (1 is recommended) to keep consumers from idling.

Queues

Keep queues short if possible to free up RAM. This means setting a max-length, setting a TTL, or enabling lazy queues. Setting a max-length will discard messages from the head of the queue, which is important to remember when creating debug queues so that they don’t cause performance degradation.
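
For example, with amqplib you could declare a bounded debug queue like this; the queue name and limits are illustrative:

import * as amqp from "amqplib";

// Declare a debug queue that can never grow unbounded:
// old messages are dropped from the head once the limits are hit.
export async function declareDebugQueue(channel: amqp.Channel): Promise<void> {
  await channel.assertQueue("debug-queue", {
    durable: true,
    arguments: {
      "x-max-length": 10000,  // keep at most 10k messages
      "x-message-ttl": 60000, // drop messages older than 60 seconds
    },
  });
}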

Delete queues you aren’t using, for example with a queue TTL. You could also use auto-delete, which removes the queue when the last consumer has cancelled or when the channel is closed, but this could lead to lost messages.

“Queue performance is limited to one CPU core.” The consistent hash exchange plugin and the RabbitMQ sharding plugin help you load balance or partition queues, respectively.

One queue can handle up to 50k messages/s.

Connections

Connections and channels should be long lived but channels can be opened and closed more frequently.

Messages

Persistent messages have to be written to disk which prevents data loss but hurts performance. Make queues durable to survive broker restarts.
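
For instance, with amqplib that combination looks roughly like this; the queue name and payload are placeholders:

import * as amqp from "amqplib";

// Durable queue plus persistent messages survive a broker restart,
// at the cost of a disk write per message.
export async function publishDurably(channel: amqp.Channel, payload: object): Promise<void> {
  await channel.assertQueue("jobs", { durable: true });
  channel.sendToQueue("jobs", Buffer.from(JSON.stringify(payload)), {
    persistent: true,
  });
}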

Cluster setup

For high availability, having more than one RabbitMQ node is desirable in case one node goes down.

Exchange type

Direct exchanges are the fastest.

Caution with TTL

“TTL and dead lettering can generate performance effects that you have not foreseen.”

Dead Lettering

Be aware that throwing an error and rejecting/nack’ing a message will cause the message to be requeued unless requeue is set to false; only then can it be routed to a configured dead letter exchange.

Versioning

Update your RabbitMQ/Erlang versions.

References

[RabbitMQ Best Practice (CloudAMQP)](https://www.cloudamqp.com/blog/2017-12-29-part1-rabbitmq-best-practice.html)

Speeding Up Batch Deletes and Updates in PostgreSQL

We ran into some slow insert and delete queries this week. In one instance, we had to delete tens of thousands of rows from a table. In another, we were updating hundreds of thousands of rows in the database. In both cases we went through a similar process to analyze and optimize the queries.

Slow Delete

This is what our slow delete looked like. In this case there were about 10 million rows that needed to be deleted.

delete from pokemon_box where pokedex_no = 6;

First we ran the delete for a single row to see how long it would take.

delete from pokemon_box where id in (select id from pokemon_box where pokedex_no = 6 limit 1);

It took about 2.5 seconds, which seemed excessive. After that we tried increasing the limit to 10, and the time scaled almost linearly.

Analyzing Slowness

When trying to speed up a slow query, PostgreSQL’s explain analyze is your best friend. Limit the number of updates and run it with an explain analyze to figure out what is slow. Remember to run this in a transaction if you don’t want to modify the database yet since explain analyze will run the insert/update/delete.

begin; -- Begin a transaction
-- Delete a single row
explain analyze delete from pokemon_box where id in (select id from pokemon_box where pokedex_no = 6 limit 1);
-- Delete 10 rows
explain analyze delete from pokemon_box where id in (select id from pokemon_box where pokedex_no = 6 limit 10);
abort; -- Rollback or abort the transaction

The explain analyze output gives us a breakdown of where the query spends its time. In our case, a constraint or trigger was being checked or fired for every row, and that was taking up the majority of the time.

Speeding Up the Query

To speed up these queries, we want to remove the constraints that are being checked or the triggers that are being fired. However, this could leave our database in a bad state if something modifies data in a way that violates the constraints while they are gone. To prevent this, we remove the constraints/triggers, update the rows, and then add the constraints/triggers back, all in a single transaction.

We could also do the update or delete in a single migration, since migrations are run in transactions.
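
If the project uses a migration tool like Knex (which we use elsewhere on this blog), a sketch of such a migration could look like the following; the table and constraint names are the hypothetical ones from the next section:

import { Knex } from "knex";

// The whole migration runs in one transaction, so no committed state
// ever exists without the foreign key constraint in place.
export async function up(knex: Knex): Promise<void> {
  await knex.raw(
    "alter table items_in_bag drop constraint items_in_bag_pokemon_holding_item_id_fkey"
  );
  await knex("pokemon_box").where({ pokedex_no: 6 }).del();
  await knex.raw(
    "alter table items_in_bag add foreign key (pokemon_holding_item_id) references pokemon_box(id)"
  );
}

export async function down(): Promise<void> {
  // Nothing to undo: the deleted rows are gone.
}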

Constraint

To speed up our delete, this is how we dropped the foreign key constraint and then ran our delete before restoring the foreign key constraint.

begin;
-- Drop the foreign key constraint that is slowing down the delete
alter table items_in_bag drop constraint items_in_bag_pokemon_holding_item_id_fkey;

delete from pokemon_box where pokedex_no = 6;

-- Restore database state
alter table items_in_bag add foreign key (pokemon_holding_item_id) references pokemon_box(id);
commit;

Trigger

To speed up our bulk update, we ran these commands. We actually didn’t use the begin; and commit; since our update was inside a migration, which already runs in a transaction.

begin;

alter table [TABLE_NAME] disable trigger [TRIGGER_NAME];

update [TABLE_NAME] set [UPDATE] where [CONDITION];

alter table [TABLE_NAME] enable trigger [TRIGGER_NAME];

commit;

Reword Git Commit Without Merge Conflicts

Motivation

We needed to change the commit message for an old commit. The problem was that there had been multiple merges since that commit with merge conflicts that had been resolved. If we tried to change the commit message by using git rebase, it seemed that we would need to re-resolve all the merge conflicts between the commit that we wanted to change and the latest commit on the branch.

It seemed like there should be a way to reuse the merge resolutions that we had already written. Although normally we wouldn’t want to rewrite history, a special character had been pushed as part of the commit message and was breaking our Jenkins build and deploy process, so we couldn’t deploy our application.

Pitfalls and Warnings

Before we go further, let’s acknowledge the dangers of rewriting Git history.

Normally when we rebase, we would only want to rewrite local history that had not yet been pushed to a remote repository. Why? Everyone who has already fetched the remote repository would need to coordinate to rewrite their history, or else we could get duplicate commits (the original commits and the new ones) for every commit after the earliest commit that was changed. Instead of running git pull on the affected branches, every user would need to git fetch and then git rebase to replace their local branch.

Another danger is not being up to date with the remote repository. Since rewriting remote history requires you to force push, any changes to the branch between your last pull and when you force push would be replaced. To protect against losing code that someone else pushes in this time frame, we should use git push --force-with-lease instead of git push --force in almost all situations.

Finally, if other branches already contain the commit we are trying to change, all our work could be undone when those branches are merged in. We could change the commit in all affected branches, but the rewritten commits would get different hashes on each branch and leave duplicate commits in history. To handle this, we merged all of the branches containing the bad commit message into one branch before rewriting history, so we would only have to update that branch. For the branches that weren’t ready to be merged, we would later branch off the fixed branch and cherry-pick their new commits onto the replacement branch.

Solution

Here’s the basic solution followed by the explanation below.

  1. Set up git rerere. git config --global rerere.enabled 1
  2. Download the rerere-train.sh script. (In the next step, we also assume you have moved the script into the repository you are trying to rebase. Otherwise, just use the path to the script, {PATH}/rerere-train.sh, instead of ./rerere-train.sh.)
  3. Run sh ./rerere-train.sh [branch_name] to save all previous merge resolutions in cache. (You will probably need to hit q a bunch of times to escape the git messages as they show up.)
  4. Rebase interactively to reword the commit. git rebase --rebase-merges --no-verify -i HEAD~46.
  5. Force push with git push --force-with-lease.

git-rerere to record conflicted merge resolutions

git rerere exists to “reuse recorded resolution of conflicted merges”. By setting rerere.enabled, we can enable recording of conflicted automerge results and the corresponding resolutions. Unfortunately, it only starts recording once it is enabled, which brings us to the next step.

Recording previous resolutions with rerere-train.sh

We can record how we resolved previous merge conflicts using rerere-train.sh. The script lists all the merges that have taken place for the branch to reach its current state, walks through those merges to check for conflicts, and saves every resolution it encounters. To refresh the recorded resolutions, you can also run the script with the --overwrite flag.

Rebasing interactively with --rebase-merges

I started off with git rebase -i HEAD~46, doing a rough binary search on the number passed to HEAD~ until I reached the latest commit before the one I wanted to reword, running git rebase --abort each time (or deleting every line of the rebase todo so the rebase never started).

A normal git rebase -i is not what we want because it would flatten all the commits that you are rebasing to a linear history.

 o---o---o---o---A---C---E---o---o---o develop
                  \     /
                   B'--D'

After normal git rebase starting from A, this is what the branch would look like:

 o---o---o---o---A---C---B---D---o---o---o develop

Instead we wanted to preserve the branch structure, so at first we tried --preserve-merges. From the man pages: “Recreate merge commits instead of flattening the history by replaying commits a merge commit introduces. Merge conflict resolutions or manual amendments to merge commits are not preserved.” Since the conflict resolutions were not preserved, our recorded resolutions were not being applied even after training git-rerere. A git diff should show whether conflicts were resolved using the recorded resolutions, since the >>>>>>>, =======, <<<<<<< markers should be gone, but they weren’t.

After aborting that attempt, we tried using the --rebase-merges option. From the man pages: “By default, a rebase will simply drop merge commits from the todo list, and put the rebased commits into a single, linear branch. With --rebase-merges, the rebase will instead try to preserve the branching structure within the commits that are to be rebased, by recreating the merge commits. Any resolved merge conflicts or manual amendments in these merge commits will have to be resolved/re-applied manually.” In this case, the resolutions we recorded with the rerere-train.sh script were reused; we double checked each step with git diff and verified that the >>>>>>>, =======, <<<<<<< markers were gone.

We also use --no-verify to avoid running our precommit hooks at every step of the rebase.

Force push

After changing the commit message, we used git push --force-with-lease to remove the breaking commit message from remote history. Then every person who had pulled the repository had to check out the branch, git fetch, and then git rebase to replace their local branch with the new history.

References

This Stack Overflow post asked the same question and introduced me to git rerere and Do you even rerere? gave some useful examples. The git-rebase man pages (man git-rebase) led us to the --rebase-merges option which was the final piece of the puzzle. Here is the equivalent documentation.

Lessons From Deploying Elm

Background

On February 23rd, the Elm application at app.socalstatescioly.org will be used for the first and second tournaments for its second consecutive Science Olympiad season.

The site is a web application designed for mobile users to navigate to and from their events on the day of their Science Olympiad competition. Science Olympiad is a multi-disciplinary competition for elementary, middle, and high schoolers competing in three separate divisions. These tournaments are usually held on high school or college campuses, with many spectators and competitors finding their way to their different events throughout the day.

The Team

Our development team consisted of two freshly graduated software engineers and a computer science student working with feedback and direction from the Southern California Science Olympiad co-directors. We had experience competing and volunteering in Science Olympiad so this was a fun and exciting opportunity to give back to the community.

The Stack

The user-facing frontend of the application is written in Elm. We chose Elm because it was a relatively new functional language that we were excited about, with the guarantee of “no runtime exceptions in practice” 1.

It hits an API currently implemented in Typescript. The user interface for admins to create tournaments is currently implemented in React. The previous API and admin interface were implemented using the Phoenix framework.

Elm Experience

The Learning Curve

Since Elm is a functional language which compiles to Javascript, it comes with a bit of a learning curve especially for team members who hadn’t written in a functional language before.

Although the Elm community is very beginner friendly, certain aspects of functional programming, such as the immutability of variables2, force programmers to think about solving problems in a different way. For example, since you can no longer change a variable, instead of loops that rely on updating an index variable, you have to use recursion or functions like map.

Although it could be overcome by working at it full time, the learning curve made it difficult for newer programmers without frontend and functional programming experience to contribute features to this side project.

No Runtime Exceptions

By far the biggest win in using Elm was that there are “no runtime exceptions in practice”. Whereas our Phoenix and React UIs had the occasional error from reaching a bad state, once the Elm code compiled we could be confident that users would not hit an error state, because all of these errors have to be represented in the data. For example, the commonly used RemoteData library represents remote data as NotAsked, Loading, Failure e, or Success a. Since you have to handle all four of these cases in your code, you must define a behavior for the error state. With every case accounted for, there is nothing left to crash.
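
For readers more familiar with Typescript, the RemoteData idea is roughly a tagged union like the sketch below (names are illustrative); the point is that the failure case is ordinary data you are forced to handle, not an exception:

// A rough Typescript analogue of Elm's RemoteData, for illustration only.
type RemoteData<E, A> =
  | { kind: "NotAsked" }
  | { kind: "Loading" }
  | { kind: "Failure"; error: E }
  | { kind: "Success"; data: A };

// The switch must cover every state, including Failure,
// so the error view is designed up front instead of crashing at runtime.
function viewSchedule(events: RemoteData<string, string[]>): string {
  switch (events.kind) {
    case "NotAsked":
      return "Pick a tournament to load its schedule.";
    case "Loading":
      return "Loading schedule...";
    case "Failure":
      return `Could not load the schedule: ${events.error}`;
    case "Success":
      return events.data.join(", ");
  }
}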

This allowed us to ship changes close to the tournament and during tournaments without worrying about introducing new unhandled error states. The confidence and guarantees that Elm provided were critical since this was a side project that saw progress through bursts of features implemented on nights and weekends.

One Year Later

While we rewrote both the API and the admin UI in Typescript and React to make it easier for everyone to contribute, most of the Elm code has remained static between last season’s release and this season’s. Some features changed, such as the in-app geolocation being replaced by a link to Google Maps, but the core Elm code hasn’t required much change.

Elm Upgrade 0.18 -> 0.19

Between the end of the 2018 Science Olympiad season (January - April) and the 2019 season, Elm 0.19 was released 3. When we tried to upgrade to Elm 0.19, many of our features broke including geolocation4 and how we handled dates.

Since we were pressed for time rewriting the admin user interface to make it more usable, we had less time to spend on the Elm code. So for now we have a branch partially updated to Elm 0.19, and we are trying to keep it up to date with the main branch, which has mostly received CSS changes and some bug fixes.

Admin UI Rewrite

The difficulty of onboarding Elm newbies on a side project to contribute in a significant manner led to our decision to rewrite the Admin user interface, which was originally built with the Phoenix web framework, in React instead of Elm. Since the Admin user interface saw fewer users who were more likely to have a direct line of communication with the developers and more patience with bugs, we could more easily afford to have errors as we developed the user interface.

Takeaways

Writing the Science Olympiad web application in Elm probably made features take longer to implement than they would have in a more familiar stack such as React. Some of the extra time was offset by the excitement of experimenting with Elm.

The cost of choosing a newer functional language such as Elm was that it took more time for people to learn the language and contribute, since fewer developers come in already knowing it. In addition, the smaller community meant less community support, which in turn made learning the language and handling version upgrades more difficult.

Overall, choosing Elm for the main customer facing app paid dividends when we needed to finally release the application. “No runtime exceptions (in practice)” gave us the confidence to move fast and deploy given our time and resource constraints.


  1. “no runtime exceptions in practice” ↩︎

  2. Immutability means that once you set a variable you cannot change it. ↩︎

  3. Elm 0.19 Release Notes ↩︎

  4. Github Issue to update geolocation to Elm 0.19 - Due to the Elm 0.19 changes, the Elm geolocation no longer can be supported. Instead you must now use Elm ports and implement the geolocation code in the Javascript handlers. ↩︎

Debugging Knex: Timeout acquiring a connection

Background and Why Knex

Knex.js is a Javascript query builder for PostgreSQL (and many other relational databases).

It is preferable to use a query builder such as Knex to avoid many of the security vulnerabilities of constructing raw SQL as well as to get useful functions for using transactions without dealing with the implementation complexities.

I have used Knex for many Node applications written in Typescript. They have great documentation and even provide type definitions for Typescript users.

Knex: Timeout acquiring a connection

Let’s dive into why you’re probably here. The timeout acquiring a connection has been a pretty common error (unhandled promise rejection) that I’ve run into when using Knex for applications under load. The diagnosis and solution suggested below will also apply to other query building packages.

Unhandled rejection TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

Diagnosis

The error message contains a suggestion to get you started: “Are you missing a .transacting(trx) call?” If you are using knex.transaction, maybe you did forget the .transacting(trx) call. Unfortunately, in my experience, this was usually not the problem.

More commonly, this error occurs due to making too many queries at a time or very slow queries. Unfortunately the error message doesn’t include the actual query that is timing out. To help us figure out if this is the case, we can time queries to see if they are taking a long time.

Here’s an example of adding logs to time your knex queries while debugging.

async function getNodes(conn: Connection, nodeId: string): Promise<Node[]> {
  const getNodesTimingTag = `getNodes ${Math.random()}`;
  console.time(getNodesTimingTag);
  const nodes = await conn('nodes').where({ nodeId });
  console.timeEnd(getNodesTimingTag);
  return nodes;
}

When you run your application, this should output a bunch of logs looking like:

getNodes 0.7576625381743156: 148.150ms
getNodes 0.6049968597856601: 168.230ms
getNodes 0.02798437817797117: 178.130ms

If the logs indicate that the time for queries to resolve starts out very low and is taking longer and longer, we can confirm that this is a performance issue. Usually we will be looking for the times to increase to 60 seconds1 which indicates that we are either overloading Knex or Postgres with queries.

These logs also show us which queries are slow, so we know which ones to prioritize optimizing with the solutions suggested below.

Solutions

Increasing Pool Size

Increasing the pool size may be the fastest solution. An example of when this will work is if you have around a dozen slow running queries that need to run concurrently and the rest of the queries are relatively fast. By increasing the pool size, we allow these slow queries to run while still having the ability to run our fast queries.

However, this may not work if you have so many concurrent queries that all the connections in the pool are still used up. In that case, consider decreasing the load on Postgres by limiting concurrency, writing more efficient queries, caching, or using promise queues.

Otherwise, let’s start with increasing the pool size.

According to the docs the default pool size is 0 to 10. You may also need to increase the pool size or memory limit of your Postgres database to support the extra connections.

var knex = require('knex')({
  client: 'pg',
  connection: {
    host : '127.0.0.1',
    user : 'your_database_user',
    password : 'your_database_password',
    database : 'myapp_test'
  },
  pool: { min: 0, max: 50 }
});

Limit Concurrency

You could simply reduce the concurrent calls to Knex if performance is not critical. Even if performance is critical, you don’t want to fire off a thousand requests simultaneously, since most of them will have to wait for a connection to free up, and the timeout timer starts as soon as you send each request to Knex.

To limit concurrency use bluebird’s map (or mapSeries) which allows you to control concurrency, instead of the ES6 map.

import * as bluebird from 'bluebird';

// Use bluebird map with concurrency limited
async function databaseCallsInParallel(nodeIds: string[]): Promise<void> {
  await bluebird.map(
    nodeIds,
    async nodeId => updateNodes(nodeId),
    { concurrency: 10 }, // Limit concurrency to 10 at a time
  );
}

// Instead of ES6 map which does not limit concurrency
async function unlimitedDatabaseCallsInParallel(nodeIds: string[]): Promise<void> {
  await Promise.all(nodeIds.map(async nodeId => updateNodes(nodeId)));
}

Promise Queues - Another Way to Limit Concurrency

Maybe your application operates in bursts and needs to make many queries during these bursts. In this case we can limit the queries sent to Knex at a time by using promise queues. Promise queues enable us to rate limit async operations such as Knex Postgres queries. These queries will still wait in memory but Knex won’t try to acquire a connection until the previous queries are fulfilled. p-queue is the library that we use.
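
A minimal sketch with p-queue follows; updateNode here is a stand-in for whatever Knex query you are rate limiting:

import PQueue from 'p-queue';

// Stand-in for the real Knex query being rate limited.
async function updateNode(nodeId: string): Promise<void> {
  // ... knex query here ...
}

// At most 10 queries run at once; the rest wait in the queue in memory,
// so Knex does not start its connection acquire timeout for them yet.
const queue = new PQueue({ concurrency: 10 });

export async function processBurst(nodeIds: string[]): Promise<void> {
  await Promise.all(nodeIds.map(nodeId => queue.add(() => updateNode(nodeId))));
}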

Improving Database performance - Indices

Increasing the pool size and limiting concurrency may just be treating the symptoms of the problem if we need our application to be very performant. On the other hand, those solutions may not work at all if database performance (CPU usage) is our bottleneck.

For example, if the tables we are querying are very large, we will want to consider creating and utilizing database indices to speed up access to our data.
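
For example, the getNodes query above filters on nodeId, so if the nodes table is large, an index on that column is usually the first thing to try. Sketched as a Knex migration (table and column names follow the earlier example):

import { Knex } from 'knex';

// Add an index on the column the slow query filters by.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable('nodes', table => {
    table.index(['nodeId']);
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable('nodes', table => {
    table.dropIndex(['nodeId']);
  });
}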

Caching

You could also reduce database load by reducing the number of database calls by caching in memory. Check out the memoizee library for caching function calls or store query results in variables.

If you are caching static results then this is very straightforward. Otherwise, make sure to handle cache invalidation and keep the cache up to date.
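
Here is a sketch of memoizing an async lookup with memoizee; the function being cached is hypothetical, and promise and maxAge are the relevant options:

import memoizee from 'memoizee'; // assumes esModuleInterop

// Hypothetical slow lookup we want to avoid repeating.
async function getNodeType(nodeTypeId: string): Promise<string> {
  // ... knex query here ...
  return 'unknown';
}

// Cache resolved results for 60 seconds; repeated calls with the same
// argument within that window never hit the database.
export const getNodeTypeCached = memoizee(getNodeType, {
  promise: true,
  maxAge: 60_000,
});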

Takeaways

Knex timing out trying to acquire a connection is often due to an overload of queries.

There are some fixes we can make in our application such as increasing pool size or limiting concurrency.

Ideally, we can also implement caching to reduce calls to the database. If we are dealing with large tables, creating new database indices and optimizing queries to use these indices will also reduce load on the database and increase performance.

Hope this was helpful! Leave feedback in the comments section below.

Are you still having trouble debugging the timeout acquiring a connection in your application? Leave a comment below and I’ll get back to you as soon as possible!


  1. This is the default timeout for acquiring a connection. See Knex source code. ↩︎

Concurrent Job Queue in Postgres

Implement a concurrent job queue in PostgreSQL.

Motivation

Message brokers such as RabbitMQ have first in first out (FIFO) queues. That means that jobs are executed in the order that they are placed on the queue.

With Postgres, we can implement more complex ordering of jobs. For example, we could manipulate the message order by adding more columns to store job data and changing the query to fetch the job. In addition, we can disable and enable jobs in the database at any time.

A drawback is that Postgres used this way does not scale to millions of messages per second, since each consumer must query and update the database. If the message rate is not an issue and we want more control over job execution order, Postgres is a good candidate for implementing a job queue.

Schema

Let’s consider a simple example where order of importance of jobs is the same as the order of their chunk_idx's.

create table jobs (
  chunk_idx integer not null primary key,
  is_complete boolean default false
);

Get Job

Here is an example query to get a job and mark it as complete. See if you can figure out what is wrong or inefficient before scrolling to the next section.

-- Query 1
begin;
update jobs
set is_complete = true
where chunk_idx = (
  select chunk_idx
  from jobs
  where not is_complete
  order by chunk_idx
  limit 1
)
returning chunk_idx;
commit;

Example

Usually we won’t be able to complete a job instantaneously so let’s add a sleep before committing the update to simulate a slow job. Assuming we have jobs in the database, what would happen if we run this new query (Query 1 with sleep) and then our original query (Query 1) before Query 1 with sleep completes?

-- Query 1 with sleep
begin;
update jobs
set is_complete = true
where chunk_idx = (
  select chunk_idx
  from jobs
  where not is_complete
  order by chunk_idx
  limit 1
)
returning chunk_idx;
select pg_sleep(10); -- This is where we would do work
commit;

Both queries return the same chunk_idx, the second only after the first commits. This means that jobs will be duplicated if we have multiple consumers.

Fixing Duplication

Let’s see if we can fix the job duplication so that we don’t have consumers doing the same job.

The issue is that in Query 1 with sleep, the selected row is not updated until the transaction completes. When we run Query 1, it doesn’t block until it tries to update the row. When it gets blocked, it already has selected the same row as the Query 1 with sleep. After the Query 1 with sleep transaction completes, Query 1 will do its update1.

In Postgres, if we select ... for update we can lock the row so that other select for update's will have to wait2.

-- Query 2
begin;
update jobs
set is_complete = true
where chunk_idx = (
  select chunk_idx
  from jobs
  where not is_complete
  order by chunk_idx
  for update
  limit 1
)
returning chunk_idx;
commit;

-- Query 2 with sleep
begin;
update jobs
set is_complete = true
where chunk_idx = (
  select chunk_idx
  from jobs
  where not is_complete
  order by chunk_idx
  for update
  limit 1
)
returning chunk_idx;
select pg_sleep(10); -- This is where we would do work
commit;

Again we can run Query 2 with sleep and then Query 2. This time, the two queries will complete different jobs. In addition, if the first query aborts before it is complete, the second query will still grab the correct chunk.

Although we fixed multiple consumers doing the same job, the second transaction still waits for the first transaction to finish, which means that consumers are not yet doing jobs concurrently.

Concurrency

In Postgres 9.5, the SKIP LOCKED feature was added. This gives us exactly what we want which is to skip a job if it is currently locked for an update.

-- Query 3
begin;
update jobs
set is_complete = true
where chunk_idx = (
  select chunk_idx
  from jobs
  where not is_complete
  order by chunk_idx
  for update skip locked
  limit 1
)
returning chunk_idx;
commit;

-- Query 3 with sleep
begin;
update jobs
set is_complete = true
where chunk_idx = (
  select chunk_idx
  from jobs
  where not is_complete
  order by chunk_idx
  for update skip locked
  limit 1
)
returning chunk_idx;
select pg_sleep(10); -- This is where we would do work
commit;

This doesn’t block: if we run Query 3 with sleep, which locks job 1, and then run Query 3, Query 3 will complete job 2 without waiting for other jobs to finish. If Query 3 with sleep fails, the next consumer to ask for work will pick up job 1.

Takeaways

By using the SELECT FOR UPDATE SKIP LOCKED command, we can implement a job queue in Postgres which assigns the most important job at all times and can be used concurrently. By using Postgres for our job queue, we can implement more complex logic for how priority of jobs should be calculated compared to a message broker.
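
To tie it together, here is a sketch of a Node worker using node-postgres (the pg package); the claim query is Query 3 from above, while the connection settings, error handling, and doWork callback are assumptions:

import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from PG* environment variables

const CLAIM_JOB = `
  update jobs
  set is_complete = true
  where chunk_idx = (
    select chunk_idx
    from jobs
    where not is_complete
    order by chunk_idx
    for update skip locked
    limit 1
  )
  returning chunk_idx;
`;

// Claim the next unlocked job, do the work, and only then commit,
// mirroring "Query 3 with sleep" above. Returns false when the queue is empty.
export async function workOnce(doWork: (chunkIdx: number) => Promise<void>): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query('begin');
    const result = await client.query(CLAIM_JOB);
    if (result.rowCount === 0) {
      await client.query('rollback'); // no jobs available right now
      return false;
    }
    await doWork(result.rows[0].chunk_idx);
    await client.query('commit');
    return true;
  } catch (err) {
    await client.query('rollback'); // the job stays available for another worker
    throw err;
  } finally {
    client.release();
  }
}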

Acknowledgements

The idea and the queries in this post are from a company technical talk by Nibir Bora and Tim Higgins. The resources used to develop the talk and original code can be found at https://github.com/nbir/pg-queue-talk


  1. The UPDATE command takes a row-level lock on each row it modifies (in addition to a table-level ROW EXCLUSIVE lock), so another UPDATE of the same row must wait, while plain SELECTs only acquire an ACCESS SHARE lock and are not blocked. See PSQL Locking. ↩︎

  2. The SELECT FOR UPDATE command acquires a row level FOR UPDATE lock which prevents other FOR UPDATE locks from being acquired on that row. So if two queries try to SELECT FOR UPDATE on the same row, the second query will need to wait until the first one relinquishes the lock. See PSQL Row Locking. ↩︎

Commitment

I’m committing to publishing one post every week for the next 51 weeks.

Jeff Atwood, the co-founder of Stack Overflow and Discourse, attributes all the success he has had to starting and regularly writing his blog.

John Sonmez, the founder of Simple Programmer, attributes all the success he has had to starting and regularly writing his blog.

Both recommend that every software engineer should have their own blog.

I’ve started several blogs but never got very far with any of them. However, those experiences have sped up the setup and publication of this blog.

So today I’m starting again and this time I’m seeing it through to at least 52 posts of hopefully useful content to share what I’ve learned as a software engineer and improve my communication skills.

Currently I’m using Hugo as my static site generator, for the speed of static sites compared to content management systems such as Wordpress and for the ability to write posts in Vim and Markdown. Jane is the current theme since it looks beautiful and is optimized for speed. It also has a ton of features, such as support for multiple commenting systems.

This site is hosted on Gitlab Pages since a static site doesn’t require a server and Gitlab Pages is free. DNSimple is the DNS provider as the user interface is excellent for developers1.


  1. Warning: This is a referral URL and both you and I get $5 off our subscriptions if you use this link to sign up. Students used to be able to sign up for a Github Developer Pack and get the first year free but it doesn’t look like DNSimple is still part of the pack. ↩︎