A random failure: Problems upgrading Ansible and Python

2020-05-20

What follows is the story of a data loss bug I caused at work, how I fixed it, and my journey to discovering surprising behavior in Python and Ansible.

Chapter 1: The glorious upgrade

Wherein we undertake major Ansible and Python upgrades at work, and in so doing, discover an unfortunate, undocumented behavior change.

Ominous foreshadowing

It’s a Friday. I am “on triage”, meaning I am the point person for all new tickets and requests.

In the mid-afternoon, a coworker lets me know that some of our machines have wiped important data after a reboot.

I don’t take this too seriously. I’m aware that these machines have been recently moved around, and that wiping data is expected in such circumstances. I ask him to gather more data, but I don’t investigate on my own.

This will prove to be a mistake.

When my coworker gets back to me with additional information, it’s clear that machines which had rebooted since Tuesday are losing data incorrectly.

What happened on Tuesday?

Tuesday is the day I upgraded Ansible from 2.5 to 2.9, and – it turns out, most critically – Python from 2.7 to 3.6, for all of our Ansible provisioners globally.

Some of my best work …

… is implicated in this failure.

Last year, I wrote and deployed a system that encrypted certain critical data on every machine running our customized Linux distribution within the firm, worldwide. My team collaborated with me on the design, but the implementation and deployment were entirely on my shoulders.

In any other circumstance, I would be really proud of that.

We actually intended for it to wipe this data on boot – under certain conditions. Namely, if the machine were repurposed, or if the underlying hardware were swapped. This would protect us in at least two scenarios:

  1. If a recycler gets careless and liquidates machines without destroying disks, potentially leaking proprietary data belonging to us or our customers.

  2. If a machine gets repurposed, and contains data that new users are not allowed to see. In our world, changing the purpose of a machine to this extent would always come with a hostname change.

I implemented this design by mixing a randomly generated secret (stored securely) with the hostname and serial number of the machine to form an encryption password.
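As a rough sketch (hypothetical names and mixing scheme; the real role's details differ), the derivation might look like:

```python
import hashlib

def derive_passphrase(secret, hostname, serial):
    """Mix a securely stored secret with the machine's identity, so the
    resulting passphrase changes whenever the hostname or serial does."""
    material = "{}:{}:{}".format(secret, hostname, serial)
    return hashlib.sha512(material.encode()).hexdigest()
```

Rename the machine or swap the hardware and an input changes, so the old passphrase, and with it the encrypted data, becomes unrecoverable by design.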

The system worked in production for at least 5 months, and has been deployed to hundreds of machines so far. Any machine that is new, moved, or renamed will have this system applied to it automatically. (Machines already in production when I deployed this system will continue to run without it after rebooting until they are moved, renamed, or retired.) We are on our way to rolling it out to several thousand machines around the world.

I certainly haven’t forgotten that the Ansible/Python upgrade was also my project. I will soon discover that my creations are conspiring against me.

Because today, data is getting wiped from machines which have NOT changed hostnames or serial numbers.

Chapter 2: Here be dragons

Wherein we come face to face with certain death.

I am actually pretty lucky

The machines which my coworker found to have lost data are not yet in production. They didn’t have any critical data on them when they were rebooted, and there are about ten of them in total, which means I have a pool to test with.

I’m also lucky that no other machines running this code have been rebooted since Tuesday.

So, you know. It could be worse.

I’m sweating pretty hard at this point

It’s about 4PM. Did I mention it’s Friday? It’s also A’s birthday, and things haven’t gone… perfectly. A few gifts have arrived late or wrong, and we’re under the damned COVID-19 lockdown, so the best party we can throw is margarita delivery and a FaceTime call.

Two of the most impactful changes I’ve made at the company are colliding at a really inconvenient time.

REBOOTS MAY CAUSE DATA LOSS

The first thing to do is write a strongly worded post on the internal wiki.

IT Notables – REBOOTS MAY CAUSE DATA LOSS, investigation ongoing

Not the kind of thing you generally want to have your name on.

Fortunately, we have no critical reboots scheduled for this weekend, and after the wiki post, emails, chat rooms, and a lot of swearing in the privacy of my home office, I get to work as quickly as possible.

Debugging

I can’t really describe the initial panic. It’s a drowning feeling.

I know this is a tractable problem, but the timing, the birthday, the attention I’d called to it, and the fact that these are the two biggest projects of my time at the firm put some pretty ugly visions in my head.

Debugging

My job is not a calm introvert’s paradise on the best day and, well, whenever that best day was, we’d started a global pandemic since then.

I’m breathing shallowly and my eyes are darting around. I’m not sure what to look at first.

Debugging

Right.

I’ve got to fix this.

Chapter 3: The plot thickens

Wherein we press on against imminent destruction amidst thick swamps of doubt, to find ourselves at an intriguing discovery.

Narrowing down the problem

I add some elegant1 printf debugging2 to the Ansible role responsible for the encryption.

Between that and a few shells open in different Python virtual environments, I am able to determine that yes, under our new Ansible 2.9/Python 3.6 system, the passwords I’m generating are different from those under the old Ansible 2.5/Python 2.7 system.

To narrow down the problem, I test Ansible 2.9 under Python 2.73, and verify that the Python upgrade alone generates different passwords for the same input.

But why?

Some more specifics on the algorithm

Here are relevant tasks from my Ansible role, in pseudocode.

- name: Get the secret
  get_secret:
    ...
  register: secret_result

- name: Calculate encryption passphrase
  vars:
    salt: "{{ 65534 | random(seed=inventory_hostname+ansible_product_serial) | string }}"
  set_fact:
    passphrase: "{{ (secret_result + ...) | password_hash('sha512', salt) }}"

With a little more debugging, I can see that the salt is different under Python 3 than it was under Python 2.

Incidentally, I copied this magic 65534 number out of the Ansible filter documentation without thinking too much about it. Those docs use it in an example very close to what I am trying to do in my Ansible role:

An idempotent method to generate unique hashes per system is to use a salt that is consistent between runs:

{{ 'secretpassword' | password_hash('sha512', 65534 | random(seed=inventory_hostname) | string) }}

Is that supposed to work?

I remember when I wrote this implementation, I had wanted to make sure the password would be the same every time, given the same secret material, hostname, and serial number. I know I had checked the Ansible documentation:

As of Ansible version 2.3, it’s also possible to initialize the random number generator from a seed. This way, you can create random-but-idempotent numbers…

Hmm. Ansible says it should be the same since version 2.3. There’s no mention of any breakage from new Ansible or Python versions here.

Did Ansible’s implementation change?

In the stable 2.9 branch, I find the code in question. I do see changes in the history, but nothing seems to really have changed the algorithm in the Ansible code. In every version, from 2.5 to 2.9 (and probably before and after), calling the random filter as I do results in a call to Python’s random.Random() with the optional seed parameter passed.

I do find one curious thing: in the 2.5 code for the filter, it increments the end argument (my magic 65534 number) by one.

Uhh. But wait. In the 2.9 code for the filter, this doesn’t seem to happen?

I don’t understand why this is. I try copying the Ansible 2.9 version of the filter and modifying it to increment the end argument by 1 as the Ansible 2.5 version does. This produces results dissimilar to both Ansible 2.5/Python 2.7 and Ansible 2.9/Python 3.6.

I decide to leave this question unanswered. My previous test shows that Ansible 2.5 under Python 3.6 has the same wrong result as the Ansible 2.9/Python 3.6 production deployment. I think that line of inquiry will be more fruitful.

Did Python’s implementation change?

It turns out, this code returns a different result under Python 2 and Python 3.

import random
print(random.Random("example test seed").randrange(0, 65535, 1))

On my personal Mac, at least, Python 2 gives me a result of 35879, while Python 3 gives me a result of 13619.

Chapter 4: Intermission

Wherein we enjoy the comforts of the world, and sleep the sleep, not of the dead, but of the dying4.

A good decision

I stop here. By this point, I have what I believe is a fix, but it isn’t tested yet. I’m not sure that there isn’t also something else wrong, hidden behind the change in random.Random() behavior.

I’ve already spoken with my team, letting them know that I would need to spend some time with A, and that I might not have a fix pushed until the next day.

I leave for a while to join the FaceTime party. (I look sadly at the freshly made margarita pitcher.) Then I go back to my desk to try to finish up that night.

Ultimately, though, I decide not to attempt a production fix before bed. Rather than exhaustedly deploying at the end of the night, I would get a good night’s sleep and tackle the problem when there was more runway.

Chapter 5: An ignoble solution

Wherein we commit evil in pursuit of good; wherein we sin, yet achieve salvation.

Implementation

The most dumbfuck dipshit fix possible is to shell out to the python2 binary from within Python 3 Ansible and run the filter code as a one-liner.

You had better fucking believe that’s exactly what I do.

It is true that this is kicking the can down the road. At some point, we’ll install machines without python2, right? I figure we have plenty of runway until then, plus an obvious failure during testing when the Ansible role calls python2. And I’m pretty anxious to let people reboot machines again.
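In spirit (hypothetical helper names; the real role shells out from Ansible tasks), the workaround looks like this:

```python
import subprocess

END = 65534  # the magic number from the role

def legacy_salt_cmd(seed):
    # Build a python2 one-liner that reproduces the old Python 2 salt.
    code = ("import random, sys; "
            "print(random.Random(sys.argv[1]).randrange(0, %d, 1))" % (END + 1))
    return ["python2", "-c", code, seed]

def legacy_salt(seed):
    # Runs the system python2; fails loudly once python2-less machines
    # appear, which is exactly the early warning we want.
    return subprocess.check_output(legacy_salt_cmd(seed)).decode().strip()
```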

Testing

What comes next is testing, which is tricky. I have a limited number of machines that were booted before my Python 3 upgrade and contain data no one would miss; fewer than ten such machines are easily available.

Even so, my implementation works perfectly on every one of them. I feel comfortable pushing the change, and immediately communicate to everyone that it is now safe to reboot machines (which is to say, I remind everyone that I once made a horrible mistake).

I close my laptop and am able to enjoy the rest of my weekend. I have all the margaritas I missed the night before, and return to work on Monday to write a giant post-mortem5.

Fortunately, my hack works! Machines are rebooted and data survives.

Why wasn’t this tested before?

Well, I did test it.

The thing is, I tested it on machines in our lab that get wiped all the time. Early on I’m sure I was testing something else about the Ansible 2.9/Python3.6 upgrade, and just didn’t notice that I’d ended up wiping the storage location for critical data, which would have been more or less empty on the lab machines anyway.

Then, later on, when I specifically looked for potential data loss, I used the same machines, which had already encrypted the data under the new version of the filter. No data was lost at this stage, so I deployed to production.

I regret the error.

How will I prevent this in the future?

I plan to improve my testing practice as a result.

When making changes to code that manages encryption keys or otherwise interacts with storage at a low level, I plan to add a stage to the deployment process.

  1. Deploy to lab machines, as I did for this change. For us, this involves keeping the test machines out of the normal deployment environment, at least for the duration of the test.
  2. Duplicate the lab environment on production machines that have been up for a while. I will have to take care when selecting these machines, because not only is there inherent risk of data loss when testing storage changes, but taking these machines out of the normal deployment environment will carry its own risk.
  3. Deploy to production globally.

If we had a persistent, managed staging environment, it could serve as the new second step in that list. Sadly, we don’t have that, at least for now.

Epilogue Part 1: Consulting the ancient texts

Wherein we obtain crumbling tomes of past incantations, the works of men failing, and worse – forgotten.

I later discover that the Ansible project has been notified of this behavior

Oh hey look.

Filter random with seed not stable between python 2 and 3 #63784

I’m pretty frustrated that this didn’t even trigger a documentation update.

Even if they had updated the documentation, the bug was probably filed too late to save me from rolling out a dependency on this. I had already begun to deploy to production by October 22, when this bug is dated. However, it certainly would have given me a reason not to complete the Python 3 upgrade until I had a resolution for it.

But, even worse, this behavior was already tested and documented in the PR for the seed feature.

Python compatibility 👍

$ python2.4 -c 'from random import Random, SystemRandom, shuffle; r = Random("foo"); print r.randrange(0, 100, 1)'
88
$ python2.5 -c 'from random import Random, SystemRandom, shuffle; r = Random("foo"); print r.randrange(0, 100, 1)'
88
$ python2.6 -c 'from random import Random, SystemRandom, shuffle; r = Random("foo"); print r.randrange(0, 100, 1)'
88
$ python2.7 -c 'from random import Random, SystemRandom, shuffle; r = Random("foo"); print r.randrange(0, 100, 1)'
88
$ python3.4 -c 'from random import Random, SystemRandom, shuffle; r = Random("foo"); print(r.randrange(0, 100, 1))'
58
$ python3.5 -c 'from random import Random, SystemRandom, shuffle; r = Random("foo"); print(r.randrange(0, 100, 1))'
58

What must have happened is that 58 and 88 looked similar enough to be read as equal at a glance.

Something that bit me so hard definitely hurts to see in plain text, though.

Actually, Python still offers backwards compatibility…

…but Ansible doesn’t use it.

Python’s documentation for random.Random(), linked above, helpfully points directly to the implementation.

That code is kind of cool because there is code from 21 years ago right next to code that’s less than 2 years old!

Unfortunately, there isn’t a good way to link to a diff. The differences between 2.7 and 3.6 are huge, and GitHub cannot show single-file diffs between two revisions. You can, however, clone the repo and see one with git diff origin/2.7..origin/3.6 -- Lib/random.py.

And it was there that I noticed the documentation for the seed() function:

def seed(self, a=None, version=2):
    """Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.

    If *a* is an int, all bits are used.

    For version 2 (the default), all of the bits are used if *a* is a str,
    bytes, or bytearray.  For version 1 (provided for reproducing random
    sequences from older versions of Python), the algorithm for str and
    bytes generates a narrower range of seeds.
    """

So they did offer some backwards compatibility, but it was opt-in. And as we saw in the Ansible code above, Ansible calls the generator identically under Python 2 and Python 3, so the opt-in compatibility mode is never used.

Epilogue Part 2: Cursed experiments

Wherein we read the forbidden grimoires and attempt to duplicate deranged works, whose infernal effects surely drove their authors mad.

Maybe I can use this instead of my shitty hack?

To use this, we have to change our previous short program slightly. Recall that before:

import random
print(random.Random("example test seed").randrange(0, 65535, 1))
# python2: returns 35879
# python3: returns 13619

Now we try this version:

import random
r = random.Random()
r.seed(a="example test seed", version=1)
print(r.randrange(0, 65535, 1))
# python3: returns 35880

That gives me a result of 35880 under Python 3, while Python 2 gives me 35879.

Not equal, yet suspiciously close.

Python’s backwards compatibility might not be very usable

Hmm.

Let’s use a word list I have lying around under /usr/share/dict on my Mac, and try to run the code in normal mode under Python 2, and in compatibility mode under Python 3.

#!/usr/bin/env python
import random, sys
r = random.Random()
with open('/usr/share/dict/web2') as webdict:
  for w in webdict.readlines():
    word = w.strip()
    if sys.version_info[0] == 2:
      r.seed(word)
    elif sys.version_info[0] == 3:
      r.seed(word, version=1)
    else:
      raise Exception("Who could have expected this?")
    print("{}: {}".format(word, r.randrange(0, 65535, 1)))

And let’s run it with both Python 2 and 3:

python2 ./test.py > test.results.python2
python3 ./test.py > test.results.python3

And take a quick look at the results. Note that the middle column shows whether the two are equal.

$ head test.results.python2         $ head test.results.python3
A: 8866                          <  A: 8867
a: 56458                         <  a: 56459
aa: 29724                        =  aa: 29724
aal: 11248                       =  aal: 11248
aalii: 16623                     =  aalii: 16623
aam: 62302                       <  aam: 62303
Aani: 31381                      =  Aani: 31381
aardvark: 6397                   =  aardvark: 6397
aardwolf: 32525                  <  aardwolf: 32526
Aaron: 32019                     =  Aaron: 32019

OK, that definitely looks unintentional. In that tiny sample, 6/10 of the compatibility-mode results are equal, and the other 4/10 are greater than the old result by exactly 1.
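Using the Python 2 results as a baseline, the pattern is easy to reproduce from Python 3 alone (baseline values copied from the table above; assumes CPython 3.5 or later):

```python
import random

# Python 2 baselines, copied from the results above
py2 = [("A", 8866), ("a", 56458), ("aa", 29724), ("aal", 11248),
       ("aalii", 16623), ("aam", 62302), ("Aani", 31381),
       ("aardvark", 6397), ("aardwolf", 32525), ("Aaron", 32019)]

r = random.Random()
for word, old in py2:
    r.seed(word, version=1)  # opt-in "version 1" compatibility mode
    diff = r.randrange(0, 65535, 1) - old
    print("{}: {:+d}".format(word, diff))  # every diff is 0 or +1
```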

Python claims compatibility

I wonder if anyone has ever tried to use this compatibility mode before?

I do see in the Python 3.9 documentation for random:

Sometimes it is useful to be able to reproduce the sequences given by a pseudo-random number generator. By re-using a seed value, the same sequence should be reproducible from run to run as long as multiple threads are not running.

Most of the random module’s algorithms and seeding functions are subject to change across Python versions, but two aspects are guaranteed not to change:

  • If a new seeding method is added, then a backward compatible seeder will be offered.
  • The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
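The second guarantee is easy to check within a single interpreter (the documentation extends it across versions only for .random() with the compatible seeder):

```python
import random

# Two generators with the same seed must produce the same .random() stream.
a = random.Random("example test seed")
b = random.Random("example test seed")
stream_a = [a.random() for _ in range(5)]
stream_b = [b.random() for _ in range(5)]
print(stream_a == stream_b)  # True
```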

More comprehensive tests

I started to worry that maybe I was using some unsupported Python version, or that maybe this had already been fixed upstream. Luckily, you can get a nightly CI-built Docker container with Python versions 2.7 and 3.4-3.9a, linked from the python.org download page. I decided to script my tests so they would be reproducible.

For context, I’m running this script on a Mac running macOS 10.15.4 and Docker Desktop 2.3.0.2.

I think my version of Python 3 came with the system? Or maybe it came with Xcode? But in any case, the ci-image Python versions should remove any doubts.

#!/bin/sh
set -eux
mkdir -p testpy
cp test.py testpy/test.py
cp /usr/share/dict/web2 testpy/web2
/usr/bin/python testpy/test.py > testpy/results.macos10.15.4.system.python2.7.16
/Library/Frameworks/Python.framework/Versions/3.8/bin/python3 testpy/test.py > testpy/results.macos10.15.4.system.python3.8.2
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python3.9 /testpy/test.py > /testpy/results.ci-image.python3.9'
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python3.8 /testpy/test.py > /testpy/results.ci-image.python3.8'
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python3.7 /testpy/test.py > /testpy/results.ci-image.python3.7'
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python3.6 /testpy/test.py > /testpy/results.ci-image.python3.6'
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python3.5 /testpy/test.py > /testpy/results.ci-image.python3.5'
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python3.4 /testpy/test.py > /testpy/results.ci-image.python3.4'
docker run -v $PWD/testpy:/testpy:rw -u root -it --rm quay.io/python-devs/ci-image sh -c 'python2.7 /testpy/test.py > /testpy/results.ci-image.python2.7'

I also made my test Python script a bit more robust:

#!/usr/bin/env python
import os, random, sys
mydir = os.path.dirname(__file__)
r = random.Random()
maxidx = None
with open('{}/web2'.format(mydir)) as webdict:
  for idx, raw_word in enumerate(webdict.readlines()):
    word = raw_word.strip()
    if sys.version_info[0] == 2:
      r.seed(word)
    elif sys.version_info[0] == 3:
      r.seed(word, version=1)
    else:
      raise Exception("Unexpected python version")
    print("{}: {}".format(word, r.randrange(0, 65535, 1)))
    if maxidx is not None and idx >= maxidx:
      break

More comprehensive results

In short, the results I got from the ci-image container matched my local results, except that Python 3.4’s compatibility mode didn’t seem to be compatible with either Python 2 or the later Python 3 releases.

I think the Python 3.4 result is ignorable: I know earlier Python 3.x versions had compatibility problems that were addressed in later 3.x releases, and since all subsequent releases had matching results, the problem has clearly been fixed. For that reason, I don’t include the Python 3.4 results below.

My locally installed Python 2.7 results matched the ci-image Python 2.7 results.

> head results.ci-image.python2.7
A: 8866
a: 56458
aa: 29724
aal: 11248
aalii: 16623
aam: 62302
Aani: 31381
aardvark: 6397
aardwolf: 32525
Aaron: 32019

My locally installed Python 3.8 results matched the ci-image Python 3.5-3.9a results.

> head results.ci-image.python3.5
A: 8867
a: 56459
aa: 29724
aal: 11248
aalii: 16623
aam: 62303
Aani: 31381
aardvark: 6397
aardwolf: 32526
Aaron: 32019

Epilogue Part 3: The real ending is the bugs we reported along the way

Wherein we document the traps for future explorers.

Not a Python bug

I filed a Python bug, and after I provided a bit more context, they closed my issue as “not a bug”.

I had wondered whether they’d even want to fix this: even though I thought it broke the reproducibility guarantee, the current behavior has been in place for so long that maybe it isn’t worth fixing.

I was surprised to learn from the response that actually, the reproducibility guarantee is not intended to cover .randrange(), but only .random(). However, a closer reading of the documentation does make this clear.

Therefore, if the Ansible project decides they want to offer their own reproducibility guarantee on the random filter, they should probably move away from using Python’s .randrange(), and towards using .random() directly.
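A sketch of what that could look like (hypothetical helper, not Ansible's actual code):

```python
import random

def stable_random_int(end, seed):
    """Derive an integer in [0, end] from .random(), which (unlike
    .randrange()) is covered by the cross-version reproducibility
    guarantee once a compatible seeder is used."""
    return int(random.Random(seed).random() * (end + 1))
```

One caveat remains: seeding from a string would itself need to be pinned down (for example, by explicitly hashing the string to an int first) for the guarantee to apply end to end.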

An existing Ansible bug

I commented on the Ansible bug, but at the time of this writing, there has been no response from the Ansible dev team.

I am hopeful that the documentation and/or implementation will be updated as a result.


  1. No. ↩︎

  2. Actually, for Ansible, this would literally be “debug debugging”, which sounds very silly. ↩︎

  3. A previous version of this post claimed that Ansible 2.9 doesn’t support Python 2, which was not correct, per the documentation. I’m not sure why I thought that, but thanks to Alex Willmer who sent me the correction. Apologies for the error. ↩︎

  4. I’m telling you, I did not sleep very well. ↩︎

  5. Which I love doing, by the way. Failures, and the technical details of what happened and how they were fixed, are so fascinating. I mean, it’s not like you can judge, you’re reading a footnote at the end of a 3900 word failure log. ↩︎