I spend a lot of time looking at logs. Over the years, I have analyzed more Windows Server Failover Cluster (WSFC), Windows Event, and SQL Server logs than most.
That said, I want to address something I run into from time to time. For whatever reason, people seem to think what logs say is a personal attack on them, their skills, and/or something adjacent. Let me explain.
I always say this: “If you are reading cluster logs, you are in a bad place.” One could argue reading any log, especially on a frequent basis, means you are not where you want to be. Logs are not light summer beach reading.
Proactive monitoring often uses logs to surface events or information. Monitoring must be tailored to your day-to-day IT requirements. Trawling through multiple logs and correlating events to determine why your AG is down is a vastly different exercise from just seeing if your CPU spikes.
A log’s job is to document what is going on for a particular widget and often includes associated things. For example, a WSFC’s log contains information about networking as it relates to the WSFC. The WSFC log does not measure network performance from the various WMI counters but it may inspire you to go look there if an issue is seen.
When I troubleshoot issues, part of that exercise is often reporting findings. Sometimes that is a call, other times a summarized e-mail, and other times a detailed report that covers not only the “why” but also how to mitigate or improve the situation. In other words, it depends.
As with most things, there are good and bad ways to convey a message. Unfortunately, the baby can be very ugly even in the best of scenarios. In those situations, there is no (easy) way to sugar coat that message. The core problem is that once human beings are involved, feelings and emotions come along for the ride. Deliver findings in a humane way – leading with “you suck at your job” is not going to help matters. If your findings are meant to hurt others, rethink your position because karma comes back around.
Information presented in a log has no opinion, nor is it alternate facts. Hopefully, the log output does not contain the words “you are a moron” or “you suck” (if so, find better vendors). Barring some AI that is probably in the works somewhere (hello robot overlords!), when a log documents that the server lost connectivity to the network, that’s all it is saying.
A problem documented in a log is not a referendum on your job as an administrator or your skills. It is most likely an indicator that something must be addressed. The widget and its log do not know why the network dropped out from underneath at a given moment. Knowing the network drop happened and its timeframe points you in a direction to resolve the issue. That may mean correlating other logs and their events. After that you may investigate if this was transient/one time or something that has happened multiple times. Let the information lead you on a treasure hunt.
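To make the correlation step a little more concrete, here is a minimal Python sketch of the idea: take one event (say, the network drop) and pull events from other logs that fall within a short window around it. Everything in it is an assumption for illustration: the `LogEvent` shape, the `events_near` helper, the source labels, and the sample messages are made up and are not real log output. It also assumes the timestamps have already been parsed and normalized to the same time zone.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class LogEvent(NamedTuple):
    source: str          # hypothetical label for the log the event came from
    timestamp: datetime  # assumed to be normalized to one time zone already
    message: str


def events_near(anchor, events, window=timedelta(seconds=30)):
    """Return events from OTHER logs within `window` of the anchor, oldest first."""
    nearby = [
        e for e in events
        if e.source != anchor.source
        and abs(e.timestamp - anchor.timestamp) <= window
    ]
    return sorted(nearby, key=lambda e: e.timestamp)


# Made-up sample events; real logs need their own parsing.
events = [
    LogEvent("cluster", datetime(2024, 1, 5, 2, 14, 7), "node lost network connectivity"),
    LogEvent("system", datetime(2024, 1, 5, 2, 14, 3), "network adapter link down"),
    LogEvent("sql", datetime(2024, 1, 5, 2, 14, 21), "availability group went offline"),
    LogEvent("system", datetime(2024, 1, 5, 1, 2, 0), "scheduled task completed"),
]

anchor = events[0]  # the event that started the investigation
related = events_near(anchor, events)
for e in related:
    print(e.timestamp.isoformat(), e.source, e.message)
```

The window size is a judgment call: too narrow and you miss the cause, too wide and you are back to reading everything. Running the same search across a longer span of history is one way to see whether the drop was a one-time event or a repeat offender.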
If any of the above resonates with you, here’s how you can address this unique challenge.
What Can You Do?
First, don’t be defensive. Many companies have walls and divisions that exist between groups or organizations. This phenomenon often leads to finger pointing in times of crisis. Chances are if you are in a server down situation, multiple people/groups will need to work together to resolve the issue. Put egos and agendas aside. Never make others jump through hoops to prove “it’s not my problem”. Everyone is on the same team – Team <insert name of company here>.
Second, remember troubleshooting is a logical exercise. You start with logs or a set of assumptions and you either prove them true or false based on the data you have. Sometimes people know right away what an issue is. This is because they have either seen it before or they have years of experience that quickly rules out noise. Do not dismiss instinct; it is a very real thing.
DBAs are often the default blame acceptors since the problem will show up in SQL Server first even if it started somewhere else in the chain. If an Always On Availability Group (AG) goes down because the underlying network drops out, it is not the fault of SQL Server, the WSFC, or Pacemaker. The network team will most likely need to investigate what is happening and resolve the situation.
DBAs often must go out of their way to prove performance problems (physical or virtual) are not things like a query issue. Of course, performance problems manifest themselves in SQL Server, but what is the actual cause? For example, a noisy neighbor on a virtualization host can impact another VM. The problem shows up in SQL Server, but it is something the virtualization administrator will probably need to address in some way – not the DBA. Work together, not with two middle fingers held high by each camp.
When it’s time to look at what the logs tell us, have an open mind. Sometimes it’s not our problem and we can breathe a sigh of relief. Other times the spotlight is directly shining on you. Both are ok. Deal with it and move on. Making it personal is not a winning scenario.
Have you encountered situations like this? Chime in below.