Solaris SMF Oracle Grid agent service killing Oracle instance

Over the last couple of months we had a strange problem: all processes of an Oracle database or ASM instance would simply disappear. The shared memory segment was still there, and you could still connect internally to the instance and even run some queries, until you issued a statement that needed the background processes to do real work. The result was a kind of “ghost” instance. We had to shut the instance down using “shutdown abort” and start it up again. Nothing was logged in any logfile, so we didn’t know what had killed the instance. We knew the kill had to be a very abrupt one, because when you kill the processes of an Oracle instance one by one, something gets logged and the remaining processes terminate the instance.

A couple of days ago we accidentally reproduced the problem (all Oracle processes suddenly gone, but the shared memory segment still there) when we patched the Oracle Grid Control agent on one of our database servers. We used to have a Solaris SMF service defined for starting, stopping and restarting the Oracle Grid Control agent on all of our database servers, so before patching we stopped the agent with the svcadm -v disable command. When we tried to start the agent again after patching, we noticed that the service had been placed in maintenance state, meaning something had gone wrong when the agent was stopped.

The Solaris SMF framework uses Solaris contracts for managing events and monitoring the status of a service. A contract is a group of processes that belong together; which contract a process belongs to is determined by how the process was started. When a service within the SMF framework is stopped, SMF executes the configured stop command (for the Oracle Grid Control agent this is emctl stop agent) and then checks, even if that command returned exit code 0, whether all processes within the contract are gone once a defined timeout has passed. If processes are still running in the contract, the SMF framework kills the remaining processes with kill -9! That signal cannot be caught, which explains why the killed processes had no chance to log anything.
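To see which processes would be affected, you can list the members of the service's contract. The following is a minimal sketch, assuming a POSIX shell such as ksh or bash; the function name list_contract_members and the service FMRI in the comments are hypothetical, not names from our setup. It filters "ctid pid args" lines, the format produced by ps -e -o ctid,pid,args on Solaris.

```shell
# list_contract_members CTID
# Reads "ctid pid args" lines on stdin and prints "pid args" for every
# process that belongs to the given contract ID. The ps header line is
# skipped automatically because "CTID" never equals a numeric ID.
list_contract_members() {
  want=$1
  while read -r ctid pid args; do
    if [ "$ctid" = "$want" ]; then
      printf '%s %s\n' "$pid" "$args"
    fi
  done
  return 0
}

# Usage on Solaris (the FMRI is a hypothetical example):
#   FMRI=svc:/application/oracle/gc-agent:default
#   CTID=$(svcs -H -o ctid "$FMRI")
#   ps -e -o ctid,pid,args | list_contract_members "$CTID"
```

The command svcs -p followed by the service FMRI gives a similar per-service process listing directly; the sketch above only makes the contract ID explicit.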

It became clear that when we stopped the Oracle Grid Control agent by disabling the service, the Oracle instance got killed once the timeout configured for the agent service had expired. That’s nice: when you stop your Oracle Grid Control agent, you kill your database!

The source of the problem was clear, but why did the Oracle instance processes belong to the same contract as the Oracle Grid Control agent?
The reason is that we do our patching using Oracle Enterprise Manager. The OEM provisioning pack stops the database, applies the patch and starts the database again through the Oracle Grid Control agent on the database server. Because a process normally ends up in the same contract as the process that started it, every time you install a patch using the provisioning pack or stop/start a database using OEM, the database/ASM instance processes become part of the Oracle Grid Control agent contract.
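A simple pre-check before disabling the agent service is to look for instance processes inside the agent's contract. The sketch below assumes a POSIX shell such as ksh or bash and assumes that the instances can be recognised by their pmon background process (ora_pmon_<SID> for databases, asm_pmon_<SID> for ASM); the function name db_in_contract and the FMRI in the comments are hypothetical.

```shell
# db_in_contract CTID
# Reads "ctid pid args" lines on stdin (e.g. from ps -e -o ctid,pid,args).
# Prints "pid args" for every database/ASM pmon process found in the given
# contract and returns 0 if at least one was found, 1 otherwise.
db_in_contract() {
  want=$1
  found=1
  while read -r ctid pid args; do
    if [ "$ctid" = "$want" ]; then
      case $args in
        ora_pmon_*|asm_pmon_*)
          printf '%s %s\n' "$pid" "$args"
          found=0
          ;;
      esac
    fi
  done
  return "$found"
}

# Usage on Solaris (the FMRI is a hypothetical example):
#   CTID=$(svcs -H -o ctid svc:/application/oracle/gc-agent:default)
#   if ps -e -o ctid,pid,args | db_in_contract "$CTID"; then
#     echo "Instance processes share the agent contract; do not disable the service."
#   fi
```

If the check reports a pmon process, disabling the agent service would put that instance at risk as described above.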

The diagram below gives a graphical view of the problem when using a Solaris SMF service for starting the Oracle Grid Control agent.

The solution for this problem is “simple”: either don’t use a Solaris SMF service for managing an Oracle Grid Control agent on your database server, or don’t use the OEM provisioning pack for installing patches/deploying databases and don’t stop and start databases using OEM.