A storage area network (SAN) comprises three layers: hosts/servers, the fabric layer, and storage arrays. SAN performance issues are often caused by misconfigurations or faulty components within any of these three layers. Proper monitoring tools expose the performance metrics you need during troubleshooting.
If you have issues with your SAN, take a step-by-step troubleshooting approach. This article covers the common SAN performance bottlenecks, how to identify them, and troubleshooting best practices.
Common SAN Performance Issues
Many things can go wrong in a complex storage environment. Identifying performance issues correctly speeds up troubleshooting and resolution. We’ve grouped the storage area network performance challenges into seven categories:
- Compatibility issues. Bottlenecks often arise when incompatible hardware or software components are introduced into the SAN environment. Most vendors publish a list of supported configurations, hardware, and software; staying within that list avoids compatibility issues.
- Incorrect zoning. Zones are commonly defined by World Wide Names (WWNs), which are written as 16 hexadecimal digits. Making frequent manual changes to these zoning entries can often lead to configuration errors.
- Faulty cables and connections. Failing fiber cables are a common cause for concern because they often degrade slowly, causing critical issues before they fail completely. Cable performance monitoring tools can help identify the problem before it compounds into severe downtime.
- Exceeding SAN’s capacity limits. Overloading the inter-switch link, saturating the SAN ports, or connecting too many switches in the fabric layer are some of the capacity issues that can cause critical bottlenecks. Detecting these problems can be challenging, so it’s necessary to use the right troubleshooting solutions or software.
- Storage and Host Configuration challenges. Manual LUN (Logical Unit Number) configuration often results in errors that can be challenging to troubleshoot. Several things can also go wrong on the server side. Components such as the host bus adapter (HBA) driver, volume manager, OS, and multi-pathing software must be configured to match the vendor’s specifications. Any misconfiguration can lead to problems that are difficult to troubleshoot.
- Slow Storage response times. If the storage devices used in the SAN environment are slow or failing, the overall SAN performance will be affected adversely. High-performance SSDs are often used in high-end deployments where speed and reliability are non-negotiable.
- Hardware Failures. In a robust and well-managed SAN environment, hardware failures are rare but can cause severe issues. Typical hardware components that can fail are switches, port cards, and SFP transceivers.
How to Identify SAN Performance Problems
System administrators often report errors in their SAN environment, which can be linked to several performance issues. However, sometimes the problem is caused by high expectations that exceed what the system can offer. This often occurs when the technology or equipment within the SAN environment does not meet the unique business needs; hence the network doesn’t yield the expected results. Knowing how to differentiate the two is critical to identifying and fixing various bottlenecks.
You cannot identify a problem you are unaware of. Before you can map out any problem, track the performance of the entire system so that you have a performance baseline or reference points. That way, you can compare current metrics against the baseline and pinpoint when the system degraded or went down and the likely reasons why.
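As a rough illustration of comparing a new measurement against a baseline, here is a minimal Python sketch. The function name, the sample values, and the three-standard-deviation cutoff are all assumptions made for this example; they are not taken from any particular monitoring product.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, max_sigma=3.0):
    """Return True if `current` lies more than `max_sigma` sample
    standard deviations away from the mean of `baseline`."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > max_sigma


# Hypothetical read-latency samples (ms) recorded during normal operation.
baseline_read_latency_ms = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

print(deviates_from_baseline(baseline_read_latency_ms, 5.4))
print(deviates_from_baseline(baseline_read_latency_ms, 12.0))
```

In practice the baseline would come from weeks of collected metrics rather than a handful of values, and the cutoff would be tuned to how noisy each metric is.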
Some of the critical data points you should collect to help with your SAN troubleshooting include:
- Response times. If the latency for a read operation is above 15 milliseconds, start troubleshooting. The problem could be with your storage or host bus adapters. Similarly, if the latency for write operations is above three milliseconds, it indicates that the write cache could be full, which points to a problem with the disks.
- Average queue length. An average queue length higher than the number of spindles making up the volume is often a sign of SAN storage problems.
- LUN utilization percentage. This metric shows the spindles’ performance, helping locate possible problems.
- I/O Operations per Second (IOPS). This metric indicates the input/output per second serviced by the SAN’s storage array.
- CRC Errors. The higher the number of cyclic redundancy check (CRC) errors, the higher the chances of SAN switch problems. Here, the performance issues could be due to failing connectors or cables.
- Port utilization. This metric indicates the ports’ workload. Examining it can help you or your system administrators understand the throughput and identify whether there are any switch/port performance issues.
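The checks in the list above can be sketched as a small Python function. The latency limits (15 ms for reads, 3 ms for writes) and the queue-length rule come from the text; the `VolumeSample` type, its field names, the 80% port utilization limit, and the rule of flagging any CRC error are assumptions made for illustration only.

```python
from dataclasses import dataclass

READ_LATENCY_LIMIT_MS = 15.0    # from the text
WRITE_LATENCY_LIMIT_MS = 3.0    # from the text
PORT_UTILIZATION_LIMIT = 0.80   # assumed value, tune for your fabric


@dataclass
class VolumeSample:
    """One hypothetical snapshot of metrics for a volume and its port."""
    name: str
    read_latency_ms: float
    write_latency_ms: float
    avg_queue_length: float
    spindles: int
    crc_errors: int
    port_throughput_gbps: float
    port_speed_gbps: float


def check_sample(s):
    """Return a list of human-readable findings for one sample."""
    findings = []
    if s.read_latency_ms > READ_LATENCY_LIMIT_MS:
        findings.append("read latency high: check storage and HBAs")
    if s.write_latency_ms > WRITE_LATENCY_LIMIT_MS:
        findings.append("write latency high: write cache may be full")
    if s.avg_queue_length > s.spindles:
        findings.append("queue length exceeds spindle count")
    if s.crc_errors > 0:  # assumed rule: any CRC error is worth a look
        findings.append("CRC errors present: check cables and connectors")
    if s.port_throughput_gbps / s.port_speed_gbps > PORT_UTILIZATION_LIMIT:
        findings.append("port utilization high")
    return findings


sample = VolumeSample("vol01", 22.0, 1.5, 6.0, 8, 0, 4.0, 16.0)
for finding in check_sample(sample):
    print(f"{sample.name}: {finding}")
```

Keeping the limits as named constants makes it easy to adjust them once you know what is normal for your own environment.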
When troubleshooting your SAN, ensure you are familiar with the common performance issues and configuration techniques. Documenting the status of your SAN environment also makes troubleshooting easier since you’ll have data and metrics to refer to. Most monitoring software on the market sends text and email alerts when key thresholds are breached, helping you respond in real time. If upgrades or significant changes are planned, anticipate the issues that could arise. Conducting a thorough “what if” analysis is an excellent place to start. You should also back up the configuration before and after significant changes. That way, you can quickly restore your SAN’s previous state in case of unexpected bottlenecks.
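Configuration backups taken before and after a change are also useful for seeing exactly what changed. The sketch below uses Python's standard `difflib` module to compare two snapshots. The snapshot contents and the `config_changes` helper are hypothetical; real zoning and array configuration formats are vendor-specific.

```python
import difflib


def config_changes(before, after):
    """Return the lines removed ('-') and added ('+') between two
    configuration snapshots given as plain text."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm="", n=0,
    ))
    # The first two lines are the '---' / '+++' file headers; skip them,
    # then keep only added and removed lines (hunk headers start with '@@').
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Made-up snapshots in an invented format, for illustration only.
before_change = "zone_a: host1_hba0; array1_p0\nzone_b: host2_hba0; array1_p1"
after_change = "zone_a: host1_hba0; array1_p0\nzone_b: host2_hba0; array1_p2"

for line in config_changes(before_change, after_change):
    print(line)
```

If performance drops right after a change, a short diff like this narrows the search to the settings that were actually touched.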