Transactional email deadlock

Today I would like to tell a real story about email delivery and SendGrid. While investigating the root causes I had to talk to SendGrid’s customer service; the problem as a whole has not been solved, but we have a plan. I also hope that some readers have run into similar problems and will be able to help answer the questions at the end of this post. For those who are only considering services that market themselves as transactional email providers (SendGrid, MailGun, Mandrill), I hope this post will help you understand which problems they solve, which they do not, and whether to use such systems at all.

Since last year we have been developing and supporting a large SaaS project management system with a freemium model; its development teams are located in the United States, Australia, Bulgaria and Ukraine. SendGrid is used to send and receive emails. The notifications such a system sends are the obvious ones – registration, email confirmation, password recovery – but most of them are update notifications triggered by users: a new comment is posted, a task is due, and so on. We already had some experience with SendGrid: we started using it when we added the functional monitoring feature to Nerrvana. But the volume of email we sent there was negligible compared to our project management system, so it was here that we first ran into problems.

So what was the deal?

The client is in China and is, ironically, a leading email marketing company, with 20 employees registered in our project management system. Their domain is in the .asia zone. Some employees complained that they were not receiving email notifications. I jumped into SendGrid’s interface and began my investigation. Here is what I saw:

[Screenshot: SendGrid problem we faced]

It turned out that some users were being notified, while for others our attempts were marked as ‘Dropped’ with the reason ‘Bounce’. It was very strange: how were these users different from the others? The concept of “Bounce” was new to me, so I decided to first learn what it means. If it is a common standard, I would read about it; if not, I would have to grasp the meaning its creator, SendGrid, puts into it.

It turned out that “Bounce” means that the mail server accepted the mail but was not able to deliver it to the user’s inbox. What remained was to figure out why these bounces were occurring, so I opened a ticket with SendGrid’s customer service asking why this was happening and what distinguishes the two types of bounces I had found on the page http://sendgrid.com/logs/index while toying with filters:

[Screenshot: Bounce and result of being bounced]

In their response I received a link to the documentation page http://sendgrid.com/docs/Delivery_Metrics/index.html and learned that SendGrid divides bounces into soft and hard. I was also pointed to the page http://sendgrid.com/bounces, which I had not discovered at that point. It shows when an email address was added to a bounce list and for what reason, and it also lets you delete an address from the list. Here, for the first time, I thought that there should be an automatic way to do this, since at our volumes it would be unrealistic to scan the lists, analyze the error messages and clean them up manually. I was told that SendGrid does not send (it drops) all subsequent emails to an address on a bounce list until we remove the address from the list. “Gee” was my first thought, in untranslatable Russian, and I wrote to the help desk again. I had more questions, although it would seem that SendGrid could have described all of this in the documentation. Judging by the funding figures on CrunchBase, they should have the resources to document their own product.

From my point of view, it would be quite logical in the activity log to say:

- This delivery attempt to the recipient “bruce.lee@our_client_domain.asia” returned the response “such and such” -> adding the address to the “Hard bounce” list

- This delivery attempt to the recipient “bruce.lee@our_client_domain.asia” was dropped because the address is already in the “Hard bounce” list. To view all dropped attempts and the original server response that led to the bounce, go to http://sendgrid.com/bounces/bruce.lee@our_client_domain.asia.

In this case everything is simple and clear: you can see what happened and when, which stop list an address ended up in, and how many attempts have been dropped so far. It is clear what was treated as a soft or a hard bounce, and whether you are looking at the original bounce or at a drop caused by a previous attempt. You can see the connection between the activity log and the invalid, bounce and spam reports available on the Email Reports page. That is, in every case there is a root cause and there are consequences. Come on! Please show it to me in a user-friendly way! I won’t bore you with all the questions I pestered tech support with.
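The two log lines proposed above can be derived mechanically from a chronological list of events. Here is a minimal sketch in Python, assuming each event is a dict with hypothetical `email`, `event` and `reason` keys; these names are illustrative and are not taken from SendGrid's documentation.

```python
from collections import defaultdict


def annotate(events):
    """Turn a chronological event list into root-cause-aware log lines.

    Each event is a dict with 'email', 'event' and optional 'reason'
    keys (hypothetical field names chosen for this sketch).
    """
    root_cause = {}              # address -> response that caused the bounce
    dropped = defaultdict(int)   # address -> attempts dropped so far
    lines = []
    for e in events:
        addr = e["email"].lower()
        reason = e.get("reason", "")
        if e["event"] == "bounce":
            root_cause[addr] = reason
            lines.append(f'{addr}: returned "{reason}" -> added to bounce list')
        elif e["event"] == "dropped" and addr in root_cause:
            dropped[addr] += 1
            lines.append(
                f'{addr}: dropped (attempt {dropped[addr]}), already on the '
                f'bounce list because of "{root_cause[addr]}"'
            )
        else:
            lines.append(f'{addr}: {e["event"]}')
    return lines
```

Every dropped attempt then carries the original server response with it, which is the link between root cause and consequence that the activity log does not show.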

To prevent addresses from ending up in the bounce list, technical support suggested adding the domains to the “Address Whitelist” application, which is available at our subscription level (Silver), but that is not a solution. Continuing to send email to a provider that has blacklisted you, without understanding the root cause, is not our approach.

Then it got even more interesting. In the Email Reports that technical support had pointed me to, I found the root cause: “550 Connection frequency limited”. From our client I knew that their ESP (email service provider) is the largest ISP in China, QQ.com. The client added our domain to their white list, and it did not help: QQ itself was bouncing us before our emails ever reached the client’s white list. The client was unhappy and said they would leave for Basecamp. With all our love for 37signals, this was not pleasant news for us. The client told us that QQ limits the volume of mail each user can receive, which would explain why some users received emails from us while others did not. The client also explained that QQ does not allow small ESPs (in this case, us) to send large volumes of email to its users. That raised a reasonable question: who is the ESP in this case, us or SendGrid? It turned out that we are, and that it is entirely our problem. QQ has established, for example, that every sender (except those it considers large) can send up to 10 emails a day to each QQ user. It seems that as soon as one of our users has received these ten QQ-rationed emails from us, we start getting the “550 Connection frequency limited” error and have to wait for a new day in the Middle Kingdom to send the next ten. On top of that, the address also lands in SendGrid’s bounce list, and we cannot send to it at all until we remove it from the list (and we now know that it will end up there on a regular basis, thanks to QQ).

By the way, if you search Google for “550 Connection frequency limited”, you will immediately see that all the results either mention QQ.com or are pages on QQ.com itself. It is a known problem.

[Screenshot: QQ is famous for 550 Connection frequency limited]

Why does SendGrid not know anything about this and not warn its clients: “you guys are sending to QQ – keep in mind that ….”? Why can SendGrid, as a major player in this market, not negotiate with QQ to have this restriction lifted for its own customers, or at least act as a mediator between its eligible clients and QQ?

Moving on: our client in China advised us to predict(?) the number of emails we send to all of our customers served by QQ, so that we do not send too many emails to the same users. Can you imagine that? I cannot. The other alternative they had in mind for us was to ask SendGrid for help, which by that time we had already tried.
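To make concrete what the client was asking for, here is a rough sketch of a sender-side counter that refuses to exceed a per-recipient daily cap. It assumes the figure of 10 emails per day mentioned above and, as a pure guess, that the quota resets at midnight China Standard Time; the class and method names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: the receiving side resets its quota at midnight UTC+8.
CST = timezone(timedelta(hours=8))


class DailyRecipientQuota:
    """Counts sends per recipient per calendar day (hypothetical helper)."""

    def __init__(self, limit=10, tz=CST):
        self.limit = limit
        self.tz = tz
        self._counts = defaultdict(int)  # (address, date) -> sends so far

    def try_send(self, address, now=None):
        """Return True and record the send if the recipient is under quota."""
        now = now or datetime.now(self.tz)
        key = (address.lower(), now.astimezone(self.tz).date())
        if self._counts[key] >= self.limit:
            return False  # caller would have to queue or digest the message
        self._counts[key] += 1
        return True
```

Even this toy version shows the cost: every notification path has to consult the counter, and anything over the cap must be queued or merged into a digest. The counts are also kept in memory and never pruned, which a real system could not afford.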

SendGrid replied that, unfortunately, QQ’s web pages are in Chinese (so they apparently learned about this problem from us, and to this day are not aware that online translators exist). They also said that we needed to contact QQ ourselves, send them our source IP address and ask them to ease the restrictions on incoming email from us. SendGrid also offered to sell us additional IP addresses for $20 a month so that some emails would be sent from different IPs. A “good” solution, but what is the probability that QQ blocks by IP at all (they can just as well block by domain name)? This is where it all got stuck.

I wrote to QQ, but no one got back to me, and nothing has changed on QQ’s side: still the same error. I added this particular domain to SendGrid’s white list application, but that only stops SendGrid from blocking our attempts on its side, so that on the next day we can get through to our users’ mailboxes at QQ again. As you can imagine, this did not solve the problem at all. As soon as the ‘glass’ labelled ‘our domain name’ that QQ allocates to each user is ‘filled’, everything else from us spills over in the form of “550 Connection frequency limited”.

So here is the transactional email deadlock. I understand that SendGrid has borrowed a lot of tasty features from MailChimp, such as click tracking and content inspection to detect potential problems with spam filters, but most SaaS applications do not need them. That functionality is for marketing companies sending newsletters, where the same email is distributed to a large number of users, and it is a problem MailChimp has traditionally solved very well. SendGrid was targeting developers, and developers (usually) do not send newsletters. I understand SendGrid’s desire to offer a single platform for everything related to sending mail. It is a battle for market share: MailChimp launches Mandrill, and SendGrid offers services for which MailChimp was traditionally the vendor of choice. But guys, you have a huge amount of work ahead of you to nail transactional email, at the very least to remove headaches like the deadlock I have just described. If you do not agree with me, share your thoughts in the comments.

This and other cases have led us to decide to keep the status of every email delivery attempt in our own application, for analysis and for automating problem resolution in the future. It has become clear that we first need to start gathering information about the statuses of sent messages in our own database; then we need to analyze and classify the issues and work on them according to their impact on our system. In the next post, Alexander Savchenko will therefore describe how he used SendGrid Events when sending email via SendGrid’s API, and how he moved the system from SendGrid’s SMTP to the API.
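As an illustration of the first step, gathering statuses in our own database, the sketch below stores delivery events in SQLite and ranks bounce reasons by frequency. The table layout, the function names and the event dict keys (`email`, `event`, `reason`, `timestamp`) are assumptions made for this example, not a description of our production schema or of SendGrid's payload.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local event store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS delivery_events ("
        " email TEXT NOT NULL, event TEXT NOT NULL,"
        " reason TEXT, ts INTEGER NOT NULL)"
    )
    return db


def record(db, events):
    """Insert a batch of event dicts in a single transaction."""
    rows = [
        (e["email"].lower(), e["event"], e.get("reason"), int(e["timestamp"]))
        for e in events
    ]
    with db:  # commits on success, rolls back on error
        db.executemany("INSERT INTO delivery_events VALUES (?, ?, ?, ?)", rows)


def top_reasons(db, event="bounce", limit=5):
    """Return (reason, count) pairs for one event type, most frequent first."""
    cur = db.execute(
        "SELECT reason, COUNT(*) AS n FROM delivery_events"
        " WHERE event = ? GROUP BY reason ORDER BY n DESC, reason LIMIT ?",
        (event, limit),
    )
    return cur.fetchall()
```

With the raw events stored locally, a question like "which server responses cause most of our bounces" becomes a single query instead of a session of clicking through filters.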

Igor Kryltsov

3 comments

  1. Craig Pfeffer says:

    I am experiencing the same issue, and I think I know what is causing it. SendGrid has several servers whose DNS is not set up correctly. If your recipient checks RDNS, the email will be blocked by the recipient’s spam filter, and the address will then be put on a blocked list in SendGrid.

    I did go to the link sendgrid.com/bounces and they have a settings option that allows you to automatically remove these after x number of days.

    I have placed a ticket with SendGrid asking about the DNS problem and will update this post with what I find out.

  2. Craig Pfeffer says:

    Update: I received a reply from SendGrid about the RDNS issue. They have been standing up new servers recently and their compliance team is behind getting DNS setup correctly.

  3. Igor Kryltsov says:

    Hi Craig,

    Thanks for sharing. Regarding “They have been standing up new servers recently and their compliance team is behind getting DNS setup correctly.” – we have been seeing this for 5 months, so to me it cannot be related to any recent changes at SendGrid. They have so many reasons for a bounce. I personally found their invalid, soft and hard bounce lists, and the rules around them, very confusing, and the explanations from their support only make it worse.

    So we simply work with the API, see how it responds, and fix our processing logic without trying to figure out why things work one way or another on SendGrid’s side. We also found that it is much easier to always clean all the lists at SendGrid via the API and to maintain our own stop lists with our own logic. So every time SendGrid puts something into a stop list, we remove it via the API, so the lists on their side are always empty for us.
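The "own stop lists, empty lists on their side" approach can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual code: `remove_remote` is a hypothetical callable standing in for whatever API call deletes an address from the provider's list, and treating “Connection frequency limited” as a temporary block with a 24-hour retry is a guess based on the QQ behaviour described in the post.

```python
import time

# Assumption: reasons containing these fragments are only temporary blocks.
TEMPORARY_MARKERS = ("connection frequency limited",)


class LocalStopList:
    """Our own stop list; the provider-side list is cleared on every bounce."""

    def __init__(self, remove_remote, retry_after=24 * 3600):
        self._remove_remote = remove_remote  # hypothetical callable(address)
        self._retry_after = retry_after      # seconds to hold a temporary block
        self._blocked = {}  # address -> unblock time, or None if permanent

    def on_bounce(self, address, reason, now=None):
        """Record the bounce locally and clear it on the provider side."""
        now = time.time() if now is None else now
        temporary = any(m in reason.lower() for m in TEMPORARY_MARKERS)
        self._blocked[address.lower()] = (
            now + self._retry_after if temporary else None
        )
        self._remove_remote(address)

    def may_send(self, address, now=None):
        """Return True if our own rules allow sending to this address."""
        now = time.time() if now is None else now
        key = address.lower()
        if key not in self._blocked:
            return True
        until = self._blocked[key]
        if until is not None and now >= until:
            del self._blocked[key]  # temporary block has expired
            return True
        return False
```

The point of the design is that the decision whether to send is made in one place, by rules we can read and change, rather than by several provider-side lists whose rules we could not work out.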