One bright day, the integration tests on our build server started to fail randomly. Every failing test showed this message:
ORA-02049: timeout: distributed transaction waiting for lock
These were integration tests against our Oracle database. Each test opened a distributed transaction (using System.Transactions.TransactionScope) and rolled it back so that nothing changed in the database. On every run, a different set of tests failed. The strangest thing was that the newer tests we had written, which used an NHibernate transaction instead of a distributed transaction, passed successfully.
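For readers who have not used this pattern, here is a minimal sketch of a test body that rolls back through TransactionScope. It is only an illustration, not our actual test code: the class name is made up and the database work is left as a comment, so no connection is opened and nothing is promoted to a real DTC transaction. The key point is that disposing the scope without calling Complete() rolls the ambient transaction back.

```csharp
using System;
using System.Transactions;

// Hypothetical class name, for illustration only.
class RollbackSketch
{
    static void Main()
    {
        TransactionStatus? finalStatus = null;

        using (var scope = new TransactionScope())
        {
            // Inside the scope, Transaction.Current is the ambient transaction.
            // Record its final status once it completes.
            Transaction.Current.TransactionCompleted += (sender, e) =>
                finalStatus = e.Transaction.TransactionInformation.Status;

            // A real test would open its database connection here, so that
            // the connection enlists in the ambient transaction, and then
            // run the code under test.

            // scope.Complete() is deliberately NOT called, so disposing the
            // scope at the end of this block rolls everything back.
        }

        // Print whatever final status was recorded for the transaction.
        Console.WriteLine(finalStatus);
    }
}
```

Because nothing is ever committed, every test starts from the same data, which is why this style is attractive for integration tests against a shared schema.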
A couple of days later, the same tests started to fail on our development computers. That was the last straw for me, and I dove into the tests. If you ran each test alone, it always passed. If you ran a couple of tests together, some of them failed, in no particular order. When you debugged a test, it always, ALWAYS passed. This was very frustrating.
After a couple of fruitless debugging sessions, I started to look for ways to monitor the DTC transactions. I found that Performance Monitor has some counters (under the Distributed Transaction Coordinator category) which provide useful information. I used the Active Transactions, Aborted Transactions and Transactions/sec counters. I ran the tests and saw that before each test that failed, a previous transaction was still hanging. I put a Thread.Sleep call in the Setup method (the method that runs before each test) and, surprisingly, the tests passed. This was very weird. I couldn’t understand why it was happening.
After a couple of days, I almost gave up. I asked Doron, a programmer from our infrastructure team, for advice, and he mentioned that another project on the same server, which also runs integration tests with DTC, never fails. He suggested that I start moving tests from one project to the other and see what happens.
To do that, I needed to create some tables in the tests’ DB schema. When I logged in to the schema, it warned me that the schema’s password would expire in a couple of days… And then it hit me. I changed the connection string to point to another schema and, surprise surprise, all the tests passed…
I looked at the schema’s definitions and saw that its status was ‘EXPIRED (GRACE)’, which means that the password is about to expire and Oracle is giving the user a grace period to change it before the account expires. This was very strange, because this user had been defined with a password that never expires. It turned out that while exporting and importing this schema, the Oracle system guys had made a mistake and defined the user with a password that expires after two months. Another look at the definitions showed that the password had expired on exactly the day the tests started to fail…
The one thing I still don’t understand in all this is why the NHibernate tests passed while the DTC tests failed. I assume it is because NHibernate uses a local transaction. So the open question is why an account in its grace period breaks a distributed transaction but not a local one.