I added 4 new management servers to our SCOM 2012 R2 environment on Friday, and while they seemed to join the management group successfully at first, I have since had a number of problems with the Management Configuration Service.
First, some detail on our environment:
Prior to Friday it looked like this:
4x Management Servers, two of which are in a UNIX/Linux Servers resource pool
1x Web Server (that is also a management server but no agents report directly to it)
plus the usual 1x DB and Data Warehouse servers.
On Friday I added 4 new management servers, with the intention that 2 of them be used for Windows monitoring and web application monitoring, and the other 2 be used for network monitoring.
All management servers are in the same VLAN.
The two database servers are together in a separate VLAN.
The web server is in a separate VLAN on its own.
Looking through the logs I can see that the two new servers successfully received management configuration and seemed to join the management group correctly.
At some point over the weekend, the two servers I had earmarked for network monitoring (though I hadn't started doing any monitoring on them) and the web server (which has been around for ages) started constantly reporting problems receiving management configuration.
The other management servers in my management group seem to be having intermittent but similar problems, i.e. they still sometimes
On all three servers I'm seeing a lot of event ID 29121, which alerts in SCOM with the following detail:
Management Configuration Service failed to process agent configuration request.
OpsMgr Management Configuration Service failed to process configuration request (Xml configuration file or management pack request) due to the following exception
Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.ExecuteOperationSynchronously(IDataAccessConnectedOperation operation, String operationName)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.GetConfiguration(IDictionary`2 agentList, ConfigurationSignature configurationSignature)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentRequestProcessor.ProcessConfigurationRequest(ICollection`1 requestList, Int32& processedRequestsCount)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentRequestProcessor.Execute()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.ThreadManager.ResponseThreadStart(Object state)
-----------------------------------
System.Data.SqlClient.SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at System.Data.SqlClient.SqlDataReader.Read()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.AgentConfigGetOperation.ReadChildrenAgentInfo(IDataReader reader, AgentConfigurationBuilder builder)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.AgentConfigGetOperation.ReadData(SqlDataReader reader)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
ClientConnectionId:ebed46a4-fb23-42af-aa57-ac7c835b058a
I am also getting alerts for the entire Management Group as follows:
Management Configuration Service group failed to perform delta synchronization work item for a period of time. Last error message (if available):
Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperationTimeoutException: Exception of type 'Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperationTimeoutException' was thrown.
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.ExecuteOperationSynchronously(IDataAccessConnectedOperation operation, String operationName)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.WriteConfigurationDelta(IConfigurationDeltaDataSet dataSet)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.TransferData(String watermark)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.ExecuteSharedWorkItem()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
and
Management Configuration Service group failed to perform snapshot synchronization work item for a period of time. Last error message (if available):
Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.ExecuteOperationSynchronously(IDataAccessConnectedOperation operation, String operationName)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.StartSnapshot()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.StartSnapshot()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.ExecuteSharedWorkItem()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
-----------------------------------
System.Data.SqlClient.SqlException (0x80131904): Sql execution failed. Error 50000, Level 16, State 1, Procedure SnapshotSynchronizationStart, Line 49, Message: Failed to set configuration space lock. Other process uses config space
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader()
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteNonQuery(IAsyncResult asyncResult)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.NonQuerySqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
ClientConnectionId:9ad5c75f-70d8-497c-a695-407704a30016
At the same time, I'm seeing a health service heartbeat failure on the web server.
So to summarise, I've got 4 new servers in the environment, of which 2 seem to be working fine and 2 are failing to get configuration data, and 1 of my original servers is now failing to get configuration data too. I am also occasionally seeing problems with ALL of the management servers getting configuration data, but most of them seem to resolve by themselves.
I have tried flushing the health service state and cache on the three problematic servers (no result) and have trawled through the event log looking for an explanation but can't see anything.
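To make the trawl through the alerts a bit more systematic, something like the rough sketch below could tally them per server by the error text quoted above. It is only an illustration: the category labels are my own, and it assumes the alert descriptions have already been exported by some other means as (server, message) pairs.

```python
from collections import Counter

# Substrings copied from the alert text quoted above.
# The category labels on the right are my own shorthand, not SCOM terminology.
# Order matters: the most specific signature is checked first.
SIGNATURES = {
    "Failed to set configuration space lock": "config-space lock contention",
    "DataAccessOperationTimeoutException": "data access timeout",
    "Timeout expired": "SQL timeout",
}


def classify(message):
    """Return a rough label for one alert description, or 'other' if nothing matches."""
    for needle, label in SIGNATURES.items():
        if needle in message:
            return label
    return "other"


def tally(events):
    """Count (server, label) pairs from an iterable of (server, message) tuples."""
    return Counter((server, classify(message)) for server, message in events)
```

Feeding it the exported alerts would show at a glance whether the three problem servers are mostly hitting SQL timeouts or the configuration space lock, and whether the "healthy" servers show the same mix at a lower rate.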
I'm getting desperate and kind of pulling my hair out here. Can anyone help?