[ AI  Kubernetes  Python  Performance  Cloud  ]

There was a nasty bug in Hugging Face's tokenizers that caused a random runtime error, depending on how you handle the tokenizer when running your neural network in a multi-threaded environment. Since I'm using Kubernetes, I was able to work around it by letting the pod be rescheduled, so I didn't pay much attention to it: the bug was in the library itself, not in my code.

NOTE: Details here

But then, while checking my AppInsights, I found something very bad :). The bug that I thought was occurring “sometimes” was happening all the time:

This is the error rate of my backend with transformers version 4.6.0:


After digging a little, I did a couple of things to improve performance (none of them required a change in my backend code):

  • Take control of building the whole image from scratch:
    • Change the base image from pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime to ubuntu:20.04
      • The “official” image is 7.5 GB
      • “Mine” is 2.7 GB (the one with CPU-only support)
    • Manually compile pytorch 1.9.0 with or without CUDA support, depending on my container requirements
  • Ensure fastapi[all]>=0.68.0
  • Ensure transformers==4.10.2
    • This was the one that fixed the bug
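Since the fix came down to a version pin, it can be worth checking the pins when the container starts. Below is a minimal sketch of such a check, not what my backend actually does: the helper names (`version_tuple`, `find_violations`, `installed_versions`) are hypothetical, and the version comparison is deliberately simplified (numeric parts only). A real project would more likely rely on the `packaging` library for this.

```python
import itertools
from importlib import metadata

# Pins taken from the list above; everything else here is illustrative.
REQUIREMENTS = {
    "transformers": ("==", "4.10.2"),
    "fastapi": (">=", "0.68.0"),
}


def version_tuple(version):
    """Turn '4.10.2' into (4, 10, 2). Local suffixes such as '+cpu' are ignored."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        leading = "".join(itertools.takewhile(str.isdigit, piece))
        if not leading:
            break  # stop at the first non-numeric component
        numbers.append(int(leading))
    return tuple(numbers)


def find_violations(installed):
    """Return a list of human-readable problems for a {name: version} mapping."""
    problems = []
    for name, (op, wanted) in REQUIREMENTS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name} is not installed")
            continue
        have, want = version_tuple(found), version_tuple(wanted)
        ok = have == want if op == "==" else have >= want
        if not ok:
            problems.append(f"{name} {found} does not satisfy {op}{wanted}")
    return problems


def installed_versions():
    """Look up the installed version of every pinned package."""
    versions = {}
    for name in REQUIREMENTS:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # reported later as "not installed"
    return versions


if __name__ == "__main__":
    for problem in find_violations(installed_versions()):
        print("WARNING:", problem)
```

Running this as part of the image build or at container startup would flag an image that still ships the old transformers release before it starts serving traffic.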

And this is now, with transformers version 4.10.2:


Nearly 3x faster, with a 15x lower error rate :)