The Daily Grind:

“Okay, now why is this happening?”

You deal with pop-up problems all day: user complaints, crashed nodes, app failures. Are the nodes and the OS properly configured, or is PerfMiner reporting them in a separate configuration cluster? Did the configuration change recently? The same questions apply to software versions and to code libraries or modules.

  • What was running on that node before it crashed? What was it doing?
  • Which resources are under contention?
  • Which of your users are running the least efficient code? Is that normal?
  • Which developers are producing the least efficient code? Is that necessary?
  • Are your batch systems and CPUs oversubscribed?