There is quite a bit of chatter these days about testing in production when it comes to large-scale distributed systems. However, there isn't much chatter about the elephant in the room: state.
What happens to state that got mutated or deleted by a test run in production? Moreover, what happens to state that has been corrupted by wrong or missing data? What happens when this state gets pushed out widely (as is common in pub-sub architectures)? And what happens when the state that was pushed out gets picked up by other consumers, making "rollbacks" non-trivial?
More broadly, this talk goes into best practices for managing the lifecycle of state when testing in production. When testing in a staging environment, there's often a parallel staging stack with its own copy of data stores, caches, queues, etc. There, data loss, data corruption, cache invalidation, cache poisoning and so forth can be tolerated. This isn't quite the case when testing in production.
This talk will cover strategies for testing STATEFUL services in production safely.