Active Incident


We continuously monitor our services. Any issues will be posted here. Regular deployments happen weekly on Mondays between 18:00 and 18:30 CET.

Incident Status

Operational

Components

Platform API and processing

Locations

gcp europe-west6



May 13, 2025 09:21 CEST
May 13, 2025 07:21 UTC
[Investigating] A critical component crashed; we're working to get it back up.

May 13, 2025 09:30 CEST
May 13, 2025 07:30 UTC
[Investigating] We've brought the crashed component back up and processing is back to normal. Training will remain off until we fix a duplicate key violation that appears to be the root cause.

May 13, 2025 10:11 CEST
May 13, 2025 08:11 UTC
[Identified] We've identified the root cause and are working on a hotfix. Failed documents are being reset so they will be processed. Newly uploaded documents have been processing normally since 09:30.

May 13, 2025 12:47 CEST
May 13, 2025 10:47 UTC
[Monitoring] We're testing the fix while monitoring processing closely. The root cause lay in a performance optimization that skips OCR for a document when it was already done for the exact same pages in the batch; this is intended to speed up processing. It caused a unique constraint violation when a year-old document was reset. Newly uploaded documents were not affected. We're also seeing elevated pressure on the database, Redis, and Sidekiq workers that we're looking into. It's unclear at this point whether this is related to the above change.
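
For illustration, here is a minimal sketch of the failure mode described above, assuming a unique index on (document, page hash) for stored OCR results. The table, column, and function names are hypothetical and not the platform's actual schema.

```python
# Hypothetical sketch (not the platform's actual schema): reusing a cached OCR
# result can collide with a unique constraint when an old document is re-queued.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ocr_results (
        document_id INTEGER NOT NULL,
        page_hash   TEXT    NOT NULL,
        text        TEXT,
        UNIQUE (document_id, page_hash)  -- one OCR result per document page
    )
""")

def store_ocr_result(document_id, page_hash, text):
    """Naive insert: fails if this page was already OCRed for this document."""
    conn.execute(
        "INSERT INTO ocr_results (document_id, page_hash, text) VALUES (?, ?, ?)",
        (document_id, page_hash, text),
    )

# First pass: the document is processed and its OCR result is stored.
store_ocr_result(1, "abc123", "hello world")

# The document is reset and re-enters processing. The skip-OCR optimization
# carries the existing result forward, and the second insert hits the unique
# constraint -- the "duplicate key violation" from the incident updates.
try:
    store_ocr_result(1, "abc123", "hello world")
except sqlite3.IntegrityError as exc:
    print("duplicate key violation:", exc)

# One possible mitigation (an assumption, not necessarily the deployed hotfix):
# make the write idempotent so a re-processed page is simply skipped.
conn.execute(
    "INSERT OR IGNORE INTO ocr_results (document_id, page_hash, text) VALUES (?, ?, ?)",
    (1, "abc123", "hello world"),
)
```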

Platform API and processing
Operational

Platform frontend
Operational

Login and Authorization services
Operational

Webhooks
Operational

Team Vietnam
Operational

Team Switzerland
Operational

Upcoming Maintenances: 0
Incidents Last 30 Days: 3
Maintenances Last 30 Days: 1

External Services

History (Last 7 days)

Delayed processing (Degraded Performance)

Incident Status

Degraded Performance


Components

Platform API and processing


Locations

gcp europe-west6




May 7, 2025 11:42 CEST
May 7, 2025 09:42 UTC
[Investigating] We are investigating elevated failure rates and processing delays. We're focusing on stabilizing the system first, and will then proceed to reset failed documents.

May 7, 2025 11:59 CEST
May 7, 2025 09:59 UTC
[Monitoring] The system seems to have stabilized, and we'll increase throughput again to work on the processing backlog.

May 7, 2025 13:49 CEST
May 7, 2025 11:49 UTC
[Monitoring] We're still throttling to avoid overloads while processing the backlog. We've also disabled some write-ahead caching of ground truth data to reduce pressure on certain components, and are looking into more ways to throttle upstream (see the sketch after this timeline). As for a cause, we've seen slowly increasing response times on some queries that we're looking into; the database looks healthy otherwise.

May 7, 2025 14:14 CEST
May 7, 2025 12:14 UTC
[Monitoring] The processing backlog has been worked through, and we'll reset documents that failed over the past hours so they get processed too. Formulation of preventative measures is ongoing.

May 7, 2025 14:53 CEST
May 7, 2025 12:53 UTC
[Monitoring] The system has been behaving well since 14:25 CEST. Failed documents have been reset. We'll continue to monitor closely and look into the root cause.

May 8, 2025 08:13 CEST
May 8, 2025 06:13 UTC
[Resolved] No further issues. The upstream throttling put in place to stabilize the system will remain active to reduce the likelihood of this issue recurring.
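
For illustration, a minimal sketch of the kind of upstream throttling mentioned in the updates above, assuming a simple token-bucket limiter placed in front of document intake. The class, rate, and function names are hypothetical and not the platform's actual implementation.

```python
# Hypothetical sketch: cap the rate at which new documents enter processing so
# a backlog can drain without overloading downstream components.
import threading
import time


class TokenBucket:
    """Token-bucket limiter: allows roughly `rate` operations per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.01)


# Example: throttle intake to ~5 documents per second while a backlog drains.
bucket = TokenBucket(rate=5, capacity=5)

def enqueue_document(doc_id: int) -> None:
    bucket.acquire()                      # wait for throttling budget
    print(f"enqueued document {doc_id}")  # stand-in for the real enqueue call

for doc_id in range(10):
    enqueue_document(doc_id)
```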