
Troubleshooting SQL Server Performance in a Manufacturing Environment

Jonathan Bourke 7 min read
Tags: SQL Server · performance tuning · VMware · manufacturing · database

We were asked to help with SQL Server performance issues affecting a manufacturer’s production lines. Their SQL Server cluster supported critical manufacturing systems, and it had been running for years without significant problems. Then the random slowdowns started — unpredictable, difficult to reproduce, and severe enough to impact production. Failovers to the passive node provided temporary relief, but the problem followed. Late-night callouts became routine.

The investigation uncovered three separate root causes, each contributing to the overall performance degradation. This is the kind of problem that looks like one thing but turns out to be several things layered on top of each other.

The environment

The SQL Server instance was running an older version of SQL Server Standard Edition on Windows Server 2012 R2. The cluster was virtualised on VMware, backed by a SAN for storage.

The initial symptom reports were vague — “the system is slow” — which is typical for performance issues that affect end users. The event logs were inconclusive. There were no obvious errors, no crash dumps, no clear indicators of what was wrong.

At the infrastructure level, CPU, RAM, and storage latency all looked normal at first glance. The hardware was not obviously overwhelmed. In fact, it appeared to be mostly idle, which made the reported slowdowns even more confusing. If the hardware is not busy, why is the system slow?

That contradiction — idle hardware with poor application performance — was the first clue.

Issue one: VMware socket configuration

This was the most impactful finding, and it is a mistake I have seen in multiple VMware environments running SQL Server.

The symptom: CPU utilisation on the SQL Server VM never exceeded 50%. Not during peak production, not during heavy reporting, never. The VM had 8 vCPUs allocated, and monitoring showed that only 4 were ever active at any given time.

The root cause was in how VMware had configured the virtual hardware. VMware allows you to allocate vCPUs as either multiple sockets with one core each, or fewer sockets with multiple cores each. The default configuration — and the one in use here — was “wide”: 8 sockets x 1 core per socket.

SQL Server Standard Edition has a licensing limitation: it can use at most the lesser of 4 sockets or 24 cores. With 8 single-core sockets, SQL Server hit the socket limit after 4 sockets (4 cores) and ignored the other half of its allocated CPU.

The fix was straightforward:

  1. Reconfigure the passive cluster node’s virtual hardware to 2 sockets x 4 cores (same total vCPU count, different topology)
  2. Failover to the reconfigured node
  3. Verify SQL Server now sees and uses all 8 cores
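Step 3 can be verified from SQL Server itself. On SQL Server 2016 SP2 and later, sys.dm_os_sys_info exposes the topology directly (on older builds, the startup messages in the SQL Server error log report the detected sockets and cores):

```sql
-- CPU topology as SQL Server sees it
-- (socket_count / cores_per_socket are available from SQL Server 2016 SP2 onwards)
SELECT cpu_count,        -- logical CPUs visible to SQL Server
       socket_count,     -- should read 2 after the reconfiguration
       cores_per_socket  -- should read 4
FROM   sys.dm_os_sys_info;

-- Scheduler states confirm whether all CPUs are actually usable
SELECT COUNT(*) AS online_schedulers
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE';
```

If the socket limit is still in effect, the number of online schedulers will be lower than cpu_count.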

The improvement was immediate. CPU utilisation could now reach 100% of the allocated capacity, and the workload that previously saturated 4 cores now had twice the headroom.

The lesson: VMware’s default vCPU configuration is not optimised for SQL Server licensing. Any SQL Server Standard instance on VMware should have its socket/core configuration reviewed. This is a five-minute fix with an outsized impact.

Issue two: configuration debt

With the CPU constraint resolved, the system was noticeably faster, but there were still performance anomalies during peak loads. We ran the First Responder Kit — specifically sp_Blitz — to get a comprehensive health check of the SQL Server configuration.

The findings were extensive. This was a system that had been installed years ago and never revisited:

Out-of-support SQL Server version. The instance was running a version that had fallen out of Microsoft’s support lifecycle. No security patches, no bug fixes, no support if something goes wrong.
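Confirming what an instance is actually running takes one query:

```sql
-- Version, patch level, and edition of the running instance
SELECT SERVERPROPERTY('ProductVersion') AS product_version,  -- build number
       SERVERPROPERTY('ProductLevel')   AS product_level,    -- RTM / SPn / CUn
       SERVERPROPERTY('Edition')        AS edition;
```

The build number can then be checked against Microsoft's support lifecycle pages.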

Incorrect memory configuration. SQL Server’s minimum and maximum memory settings were at their defaults. In a dedicated SQL Server VM, you want to set max server memory to leave enough for the operating system (typically 4-8 GB) and give everything else to SQL Server. The defaults allow SQL Server to either starve itself or starve the OS, depending on workload patterns.
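As a sketch, on a dedicated VM with 64 GB of RAM (an assumed figure; the actual allocation in this environment is not stated), the change might look like:

```sql
-- Cap SQL Server at 56 GB, leaving ~8 GB for the OS (example figures only)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 57344;  -- 56 * 1024
RECONFIGURE;
```

The right cap depends on total RAM and what else runs on the VM; the principle is simply that the limit should be explicit, not the default of effectively unlimited.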

Wrong MAXDOP and Cost Threshold for Parallelism. MAXDOP (Maximum Degree of Parallelism) controls how many cores a single query can use. Cost Threshold for Parallelism controls how expensive a query must be before the optimiser considers parallel execution. Both were at defaults that did not match the workload characteristics or the hardware configuration.
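Both settings are changed through sp_configure. The values below are illustrative only: the right MAXDOP depends on the core and NUMA layout, and a cost threshold in the 25-50 range is a common starting point versus the default of 5:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Illustrative values, not a universal recommendation
EXEC sp_configure 'max degree of parallelism', 8;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```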

Misconfigured Auto-close, Auto-shrink, and Recovery Interval settings. Auto-close shuts a database down once its last connection closes and reopens it on the next one, which is catastrophic for performance when connections are frequent and short-lived. Auto-shrink causes ongoing data file fragmentation. Recovery Interval at its default meant the checkpoint process was not tuned for the write volume.
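Turning the problem database options off is a one-line change per database ([YourDatabase] is a placeholder name):

```sql
-- Disable the two problem settings; [YourDatabase] is a placeholder
ALTER DATABASE [YourDatabase] SET AUTO_CLOSE  OFF;
ALTER DATABASE [YourDatabase] SET AUTO_SHRINK OFF;

-- Find any databases still affected
SELECT name
FROM   sys.databases
WHERE  is_auto_close_on = 1 OR is_auto_shrink_on = 1;
```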

We also deployed Ola Hallengren’s maintenance solution to establish proper database maintenance:

  • Database Integrity Checks (DBCC CHECKDB) on a weekly schedule to detect corruption early
  • Index Maintenance to rebuild or reorganise fragmented indexes based on fragmentation thresholds
  • Statistics Updates to ensure the query optimiser has current data distribution information for efficient query plans
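A typical IndexOptimize job from that solution covers the index and statistics tasks above in one call; the fragmentation thresholds shown are the solution's documented defaults (reorganise between 5% and 30%, rebuild above 30%):

```sql
-- Ola Hallengren IndexOptimize: reorganise moderately fragmented indexes,
-- rebuild heavily fragmented ones, and refresh statistics that have changed.
-- Online rebuilds fall back to offline on editions that do not support them.
EXEC dbo.IndexOptimize
     @Databases              = 'USER_DATABASES',
     @FragmentationMedium    = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
     @FragmentationHigh      = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
     @FragmentationLevel1    = 5,
     @FragmentationLevel2    = 30,
     @UpdateStatistics       = 'ALL',
     @OnlyModifiedStatistics = 'Y';
```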

These are table-stakes maintenance tasks for any production SQL Server, but they had never been implemented. The index fragmentation alone was severe enough to cause visible performance degradation.

Issue three: the runaway query

With the CPU socket issue fixed and the configuration debt addressed, overall performance was substantially improved. But sp_BlitzFirst — which monitors real-time SQL Server activity — was still flagging periods of high CPU consumption that correlated with the remaining performance complaints.

We used sp_BlitzCache to identify the most resource-intensive queries in the plan cache. One application stood out immediately: a timing and data collection application was executing over 5,000 queries per minute, consuming more than 40% of the server’s total CPU resources.
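sp_BlitzCache can sort the plan cache along different dimensions; sorting by execution count is what surfaces a high-frequency, individually cheap query pattern like this one:

```sql
-- Top plan-cache entries by execution count (finds chatty applications)
EXEC sp_BlitzCache @SortOrder = 'executions', @Top = 10;

-- Sorting by CPU shows where processor time is actually going
EXEC sp_BlitzCache @SortOrder = 'cpu', @Top = 10;
```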

This was not a poorly written query in the traditional sense — each individual query was lightweight. The problem was the volume. The application was polling the database at an extreme rate, and the cumulative CPU cost was enormous.

We confirmed the diagnosis by temporarily disabling the application during a maintenance window. CPU consumption dropped by approximately 40% immediately. The remaining workload — the actual production systems — ran without any performance issues.

The resolution required working with the application vendor to re-architect the data collection approach. Instead of polling the database thousands of times per minute, the application was reconfigured to use a more efficient data retrieval pattern with longer polling intervals and batch operations. The monitoring data was still collected with adequate granularity for the manufacturing process, but the database load was reduced by an order of magnitude.
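One common shape for this kind of re-architecture is watermark-based batch retrieval: rather than polling for the latest value thousands of times a minute, the client periodically asks for everything since the last reading it saw. A hypothetical sketch, with table and column names invented for illustration (the vendor's actual implementation is not documented here):

```sql
-- Hypothetical batch pull: all readings since the client's last watermark,
-- fetched once every 30-60 seconds instead of many times per second.
-- dbo.ProcessReadings and @LastReadingId are illustrative names only.
SELECT reading_id, sensor_id, reading_value, recorded_at
FROM   dbo.ProcessReadings
WHERE  reading_id > @LastReadingId   -- watermark held by the client
ORDER  BY reading_id;
```

The client stores the highest reading_id it received and passes it back on the next call, so each row is transferred exactly once.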

The cumulative result

The three fixes delivered compounding improvements:

  1. 100% increase in available CPU resources — from the VMware socket reconfiguration. SQL Server could use all 8 allocated cores instead of being limited to 4.

  2. ~40% reduction in CPU consumption — from identifying and re-architecting the runaway query. The database server was freed from an application that was consuming nearly half its capacity for a non-critical function.

  3. Dozens of misconfigurations corrected — from the First Responder Kit analysis and Ola Hallengren maintenance deployment. Proper memory allocation, parallelism settings, and ongoing index and statistics maintenance.

  4. Vendor-supported configuration — the SQL Server was brought to a supported version with current settings and documented maintenance procedures.

  5. Better instrumentation — the diagnostic tools deployed during the investigation were left in place for ongoing monitoring. Future performance issues can be investigated with data instead of guesswork.

The practical impact: production line slowdowns stopped. Late-night callouts stopped. The manufacturing team had confidence that the database supporting their production systems was properly configured, properly maintained, and properly monitored.

Patterns that apply beyond this project

SQL Server performance issues in manufacturing environments share common characteristics that I have seen across multiple engagements.

The problem is rarely one thing. This case had three distinct root causes. Fixing any one of them would have improved performance, but the system would still have been degraded by the other two. Systematic investigation with proper diagnostic tools is essential — guessing which fix to apply first wastes time and risks production.

Virtualisation defaults are not database defaults. VMware, Hyper-V, and other hypervisors configure virtual hardware with general-purpose defaults. SQL Server has specific requirements around CPU topology, memory reservation, and storage I/O that do not align with those defaults. Every virtualised SQL Server should be reviewed against Microsoft’s best practices for the platform.

Configuration debt accumulates silently. A SQL Server instance that was correctly configured at installation can drift significantly over years of unreviewed operation. Software updates, workload changes, and application additions all create potential for misconfiguration. Periodic health checks with tools like sp_Blitz catch these before they become production incidents.

Not every database problem is a database problem. The runaway query in this case was an application issue, not a SQL Server issue. The database was doing exactly what it was asked to do — it was being asked too often. Effective performance investigation looks at the full stack, not just the database engine.
