
2021 09 11 (Saturday) Deployment

Mike Marcotte edited this page Sep 13, 2021 · 3 revisions

General Notes

This deployment includes a few bugfixes, including two that require a data migration.

Until we can accomplish a more seamless user experience for deployments, we will be performing deployments on weekends or after midnight on weekdays.

Bugfixes

Feature Stories

None

Timeline

  • 22:00 - made the Pull Request

  • 22:01 - deleted efcms-prod-beta (empty) table east & west

  • 22:08 - merged PR ✅; CircleCI build

  • 22:18 - tests pass ✅

  • 22:19 - confirm that the deploy was updated correctly, and the migration will be alpha => beta

  • 22:39 - deploy completed successfully

  • 22:40 - migration started

  • 23:58 - observed that the final segment to migrate (53880/68806) timed out at 15 min, perhaps because it holds too many items.

  • 00:13 - observed that the same segment (53880/68806) timed out again at 15 min. Logs point to an overloaded pk, section-outbox|docket:

    The item of section-outbox|docket 2021-04-13T19:26:23.218Z alread existed in the destination table, probably due to a live migration.
    
  • 00:28 - observed that the same segment (53880/68806) timed out again at 15 min with similar logs. The scan is trying to work through too many items for a single Lambda invocation to process. Will try once more, and if it fails I will call it.

  • 00:43 - observed that the same segment (53880/68806) timed out again at 15 min with similar logs. Aborting the deployment; will revisit in the morning. Emailing the team.

Conclusion

The application will continue to function on the previous version without these bug fixes.

Tonight’s deployment failed for a technical reason, which is somewhat hard to explain, but here goes. It relied on a “migration” that converts the data in the database from an old structure to a new one, so that the updated code finds the data in the structure it expects.

These deployments are the ones that take 4-5 hours because they comb through every record in the database, fix the ones that need fixing, and then reindex everything into the Elasticsearch cluster. The migration does this by slicing the database into many pieces (68,806 pieces tonight) and then working through each record in each slice.

The problem is that the pieces are divided by the value of their primary identifier (the pk seen in the logs), and the codebase is overloading one of those identifiers – the outbox of the docket section, which has been very busy of late. So, the code that was migrating that slice was churning through record after record, and it kept hitting the maximum execution time of 15 minutes before it could finish.
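To illustrate why one busy identifier can stall a single slice, here is a minimal simulation. It assumes that a record's slice is chosen by hashing its primary identifier, so every record sharing an identifier lands in the same slice. The `segment_for` helper, the CRC32 hash, and the sample record counts are all hypothetical; this is not the project's migration code, nor the database's actual partitioning algorithm.

```python
import zlib
from collections import Counter

TOTAL_SEGMENTS = 68806  # number of slices reported for this migration


def segment_for(pk: str, total_segments: int = TOTAL_SEGMENTS) -> int:
    """Map a primary identifier to a slice (hypothetical: CRC32 modulo slice count)."""
    return zlib.crc32(pk.encode("utf-8")) % total_segments


def segment_sizes(pks, total_segments: int = TOTAL_SEGMENTS) -> Counter:
    """Count how many records land in each slice."""
    return Counter(segment_for(pk, total_segments) for pk in pks)


# Made-up data: 1,000 identifiers with one record each, plus one
# overloaded identifier that holds 50,000 records.
records = [f"case|{n}" for n in range(1000)] + ["section-outbox|docket"] * 50000

sizes = segment_sizes(records)
hot_segment = segment_for("section-outbox|docket")

print("records in the overloaded slice:", sizes[hot_segment])
print("slices that received any records:", len(sizes))
```

Under these assumptions, adding more slices does not help: the overloaded identifier still maps to exactly one slice, and whichever worker picks up that slice has to process all of its records.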

When we tested this bugfix about one and a half weeks ago, there weren’t as many records, and the migration completed successfully. The activity between then and now pushed this segment past the threshold.
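One possible way to keep a growing slice from crossing the 15-minute limit is to stop shortly before the deadline and resume later from a saved position. The sketch below shows that idea with everything injected as plain functions. `migrate_segment`, `fetch_page`, the cursor format, and the 60-second safety margin are assumptions for illustration, not the project's code. In a real Lambda, the remaining time would typically come from `context.get_remaining_time_in_millis()`, and the cursor would be the `LastEvaluatedKey` that a DynamoDB scan returns and accepts back as `ExclusiveStartKey`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A page is (items, next_cursor); next_cursor is None when the slice is exhausted.
Page = Tuple[List[Dict[str, Any]], Optional[Any]]


def migrate_segment(
    fetch_page: Callable[[Optional[Any]], Page],
    migrate_item: Callable[[Dict[str, Any]], None],
    remaining_ms: Callable[[], int],
    cursor: Optional[Any] = None,
    safety_margin_ms: int = 60_000,
) -> Optional[Any]:
    """Process pages until the slice is done or time is nearly up.

    Returns None when the slice is finished, otherwise the cursor to
    resume from in a later invocation.
    """
    while True:
        items, next_cursor = fetch_page(cursor)
        for item in items:
            migrate_item(item)
        if next_cursor is None:
            return None  # slice fully processed
        cursor = next_cursor
        if remaining_ms() < safety_margin_ms:
            return cursor  # stop early; the caller re-queues the slice with this cursor


# Usage with fake data: 10 records served in pages of 3.
DATA = [{"id": n} for n in range(10)]


def fake_fetch(cursor: Optional[int]) -> Page:
    start = cursor or 0
    end = start + 3
    return DATA[start:end], (end if end < len(DATA) else None)


migrated: List[Dict[str, Any]] = []
budget = iter([120_000, 30_000])  # the second time check falls under the margin

# First invocation runs out of time after two pages and hands back a cursor.
resume_at = migrate_segment(fake_fetch, migrated.append, lambda: next(budget))

# Second invocation picks up where the first one stopped.
done = migrate_segment(fake_fetch, migrated.append, lambda: 900_000, cursor=resume_at)
```

This treats the symptom (the timeout) rather than the cause (the overloaded identifier), and it relies on the migration being safe to re-run on a record it has already handled.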

So, going forward we are going to have to reconsider the approach that left this segment overloaded. It’s an issue we have addressed before with Work Items.
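One common way to relieve an overloaded identifier is to fold a coarser value, such as the day, into the identifier itself so that records spread across many identifiers instead of piling onto one. The sketch below assumes the timestamp format seen in tonight's logs. The `bucketed_pk` helper and the resulting key layout are hypothetical, and this is not necessarily the approach taken for Work Items or the one that will be chosen here.

```python
from datetime import datetime


def bucketed_pk(section: str, created_at: str) -> str:
    """Build an outbox identifier that includes the day the record was created."""
    # created_at is an ISO-8601 UTC timestamp such as "2021-04-13T19:26:23.218Z"
    day = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%S.%fZ").date().isoformat()
    return f"section-outbox|{section}|{day}"


# Records from different days now fall under different identifiers.
key_a = bucketed_pk("docket", "2021-04-13T19:26:23.218Z")
key_b = bucketed_pk("docket", "2021-04-14T08:00:00.000Z")
print(key_a)
print(key_b)
```

The trade-off is on the read side: showing the outbox for a date range would mean querying one identifier per day in that range rather than a single identifier.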

Some related docs to consult when considering what approach to take:
