Network design for mission-critical systems


In any industrial or utility control, protection, production and/or safety system, the communication network is becoming one of the most critical components. However, it is often overlooked, and as a result a sub-par network can be designed and implemented. These networks will often barely meet the minimum requirements to work, meaning that while the attached systems all test out as working correctly, the network is not actually as stable and reliable as might be believed. If the network is not directly tested and checked, this could lead to sign-off of a network that is one failed switch or cable away from crashing.

Another common issue is sourcing switches that do not cater for all the requirements that might arise in the future. This could mean hardware that is not rugged enough for a harsh environment, or devices that lack certain diagnostic and troubleshooting tools, leading to long-term issues that can be very costly or difficult to resolve correctly.

Rectifying such issues can also be extremely time consuming, and can affect production on a live plant, meaning even greater losses. Redesigning and recommissioning a network takes time and effort, and may require physical changes such as cabling if not correctly planned for from the beginning.

A component that can have an even greater impact than incorrectly selected hardware is the logical design and configuration of the network. This includes everything from the VLAN and IP subnet design, to the Layer 3 routing between and across subnets, to firewalls on WAN links or links to other organisations/networks.

Fixing physical issues like hardware and cabling can, in most cases at least, be done in a piece-by-piece fashion across a network – replacing a single switch or cable at a time can be planned properly so as to have minimal impact on the network and attached systems. However, redesigning something like an IP subnet or routing infrastructure can lead to outages on entire sections, as not only will the network devices require reconfiguration, but most of the end devices will too. This not only means downtime of these end devices for maintenance, but also requires assistance from people who know the edge device requirements and configuration, who may in some cases be from third-party companies and thus much harder to coordinate with than a single company handling the network itself.

Physical layout and site-specific considerations

The most important step in putting together a truly reliable, resilient network that properly caters for the system it is being built to support, is the initial design of this network, and specifically some initial decisions that will greatly affect everything from cabling to hardware layout.

Before anything else, one needs to consider the physical site/system layout that this network is going to support, and how the cables will be laid across this site. If no cabling or trunking exists, this can be one of the most costly components of the network itself, as the civil work required to install the cable can be expensive.

A couple of things need to be catered for here. First, one must of course make sure that network connection points exist for all end devices that require it. In most cases, the end devices in industrial and utility environments support fibre connections directly nowadays, and with multimode 100 Mbps fibre supporting distances of up to 2 km as standard, this normally allows for ample flexibility. However, fibre (especially the ruggedised fibre required for long runs) can be quite costly, and so in other cases copper (Cat5e or Cat6) cable is used instead.

The important thing here is to remember that copper cable can be susceptible to electromagnetic interference (EMI), which can be quite strong in high-power environments such as substations, or near arc furnaces or similar high-current machinery. In these cases, either shielded copper cable should be used, or in cases of highly critical end devices, fibre should be considered despite the increased cost. Changing from copper to fibre at a later stage will of course cost far more than going with fibre from the beginning.

Physical and logical topology design

Once the basic connection points for all edge devices have been decided upon, one can start looking at the actual physical topology of the network, meaning the interconnections between the switches that will have the edge devices connected to them. In some cases, separate switches for edge device connections may be considered, which connect to a central backbone of switches running a high level of redundancy. This is the more common design used in power grid environments such as substations, with each bay containing one or two edge switches for edge devices, connecting to a central mesh of backbone switches interconnecting the bays with the process and station levels, as well as any WAN connections back to the control centre. In these cases, a distinction can be made between backbone and edge switches, with edge switches focusing on a high number of 100 Mbps connections, and backbones focusing on gigabit connections to each other and to the bay switches.

In an industrial environment, it is more common to find a single loop of network switches, or a number of interconnected loops. Often, each of these loops will service a separate function in the overall plant/factory, such as one loop for exterior security, another for monitoring of shipping and trucks, another for conveyors, and so on. These switches will all interconnect at one or more central points, such as in the control room for the site. In these cases, it is harder to distinguish between backbone and edge switches, as most switches will be both.

In some cases, a branched-off switch for expansion or connection of specific remote edge devices may clearly be an edge switch rather than backbone, but most of the time in these types of networks, no specific distinction is made between backbone and edge switches. In these cases, it is recommended to look at a more flexible, modular network switch which allows on-the-fly changing of modules according to requirements.

Either way, the backbone of the network now needs to be designed. With the decisions made about where individual edge devices will connect, one can start working with those locations as the nodes for the backbone. All these individual locations then need to be interconnected, keeping in mind that this will translate into a physical interconnection between the locations/nodes. This means things like cable runs must also be considered. Cables cannot always simply be run directly between points, and so the restrictions and limitations of the actual site must be considered, with the relevant specialists involved. At this time, one can start confirming all the distances between nodes, and the longest and shortest distances can be used to decide what cabling to use.

Copper cabling has a maximum run of 100 m (although it is recommended to stick to 95 m for actual cable runs due to losses from connections, patch leads, etc.), which really limits it to single buildings or very short outside runs. It is also susceptible to EMI as mentioned above, meaning that in certain environments it must be properly shielded. The general rule of thumb in utility and industrial environments is to only use copper within cabinets, and normally only in the control centre.

For field devices, multimode fibre is recommended instead. Multimode fibre offers distances of 2 km for 100 Mbps connections (edge device to edge switch normally) and 500 m for gigabit connections (backbone connections and uplinks to the backbone switches), and so is quite well suited to most sites. In cases where this is still too short, single-mode fibre can be used instead, however the cost of single-mode fibre is higher than multimode in most cases, and increases further as the distances required increase (due to the requirement for more precise lasers, cabling, etc. so as to minimise signal loss).
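As a rough illustration, the distance limits above can be turned into a simple check of planned cable runs. The sketch below (in Python, with hypothetical run names, lengths and thresholds based on the figures discussed here) simply flags which media could serve each run:

# Rough media check for planned cable runs, using the distance limits
# discussed above (100 m copper, 2 km multimode at 100 Mbps,
# 500 m multimode at gigabit). Run names and lengths are hypothetical.

COPPER_MAX_M = 95          # keep a margin below the 100 m copper limit
MM_100M_MAX_M = 2000       # multimode fibre, 100 Mbps
MM_1G_MAX_M = 500          # multimode fibre, gigabit

def media_options(length_m: float, gigabit: bool, high_emi: bool) -> list[str]:
    """Return the media that could serve a run of this length."""
    options = []
    # simplification: in high-EMI areas, rule out plain copper (shielded copper
    # or fibre would be considered instead, as discussed above)
    if length_m <= COPPER_MAX_M and not high_emi:
        options.append("copper (Cat5e/Cat6)")
    mm_limit = MM_1G_MAX_M if gigabit else MM_100M_MAX_M
    if length_m <= mm_limit:
        options.append("multimode fibre")
    options.append("single-mode fibre")   # always possible, at higher cost
    return options

# Hypothetical runs: (name, length in metres, gigabit?, high-EMI area?)
planned_runs = [
    ("control room to bay 1", 80, False, False),
    ("bay 1 to bay 4 backbone", 650, True, True),
    ("plant to shaft bottom", 3200, True, False),
]

for name, length, gigabit, emi in planned_runs:
    print(f"{name}: {', '.join(media_options(length, gigabit, emi))}")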

In some cases, one may be able to standardise a local site to use multimode fibre throughout (plus some copper connections for HMIs, scada, machines and the like), with single-mode fibre only required for WAN breakout (or not required at all for some small sites). This is generally the case in the power grid environment, where substations are small enough to use multimode fibre, and these are then connected to each other and the control room through a wider-scale single-mode fibre connection. In other cases, which can often be seen in mining or similar applications, the majority of the network can use a combination of multimode fibre and copper (especially when EMI is not a major concern), with single-mode used only for certain longer cable runs (down shafts, for example).

When using fibre for cable runs, one can also consider using multi-core fibre cables where possible or required. These cables include a number of fibre cores within a single armoured/protected cable. This can be much more efficient when multiple cables are required (such as cables out to the field), and is also useful for future expansion and maintenance.

Having a multi-core cable with a number of dark fibre cores (unconnected to the network) allows for quick resolution of individual core breaks, as well as for future expansion if needed. The flip side of this coin, however, is that one should not over-rely on multi-core cable for redundancy.

Having two redundant connections between sections of the site is good practice, but when both of these are within a single multi-core cable, a complete break of that cable will break both redundant connections. This is a commonly seen issue where the logical/physical topology of the network is not correctly tied to the actual physical site, leading to designs that seem to be highly resilient on paper but are actually very susceptible to single points of failure in the physical world.

A similar issue is seen when devices are specified with redundant power inputs, but upon installation these are both bridged to a single power source, once again leading to a single point of failure.
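One way to catch both of these traps before commissioning is to check, for every physical cable or power source (rather than every logical link), whether the system survives its loss. The minimal sketch below groups logical links by the physical cable that carries them, then tests whether the network stays connected when each cable is cut; all node and cable names are hypothetical:

# Check whether any single physical cable is a single point of failure.
# Logical links are grouped by the physical cable that carries them, so two
# "redundant" links in one multi-core cable are cut together. All names are
# hypothetical.
from collections import defaultdict

# (node A, node B, physical cable ID)
links = [
    ("control_room", "bay1", "C1"),
    ("bay1", "control_room", "C2"),
    ("control_room", "remote_io_room", "C3"),
    ("remote_io_room", "control_room", "C3"),  # "redundant" link in the same multi-core cable
]

def connected(links, nodes):
    """Simple graph traversal: are all nodes reachable over these links?"""
    adj = defaultdict(set)
    for a, b, _ in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, queue = set(), [next(iter(nodes))]
    while queue:
        n = queue.pop()
        if n in seen:
            continue
        seen.add(n)
        queue.extend(adj[n] - seen)
    return seen == nodes

nodes = {n for a, b, _ in links for n in (a, b)}
for cable in {c for _, _, c in links}:
    remaining = [l for l in links if l[2] != cable]
    if not connected(remaining, nodes):
        print(f"Cutting cable {cable} splits the network - single point of failure")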

While evaluating the various options for cable runs, one must also keep redundancy in mind. Obviously, at least one physical connection is needed to each location, but various redundancy protocols are available to provide higher resilience and reliability on the network. The choice of redundancy method will either depend on, or dictate, the topology that must be used.

Some redundancy protocols, such as Media Redundancy Protocol (MRP), specify that a ring topology must be used, but others such as Rapid Spanning Tree Protocol (RSTP) allow for full meshes of a limited number of switches. Newer redundancy protocols are being developed all the time, such as the more recent Parallel Redundancy Protocol (PRP), which requires two completely separate and independent networks running in parallel. As such, it is critical to consider the options available from a site-specific viewpoint in terms of where cables can be run, as well as from a logical perspective in terms of what redundancy can or will be used.

One must also be realistic with regard to the requirements of the site. PRP provides extremely high levels of redundancy, since other redundancy protocols can be layered within its infrastructure. PRP also allows instant recovery since, rather than having to recover the network, the protocol duplicates all traffic from the beginning, so even when an entire internal network goes down, a duplicate packet is already in transmission around a separate, independent network.
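The duplicate-and-discard behaviour just described can be illustrated with a toy sketch. This is heavily simplified compared to the real mechanism defined in IEC 62439-3 (which appends a redundancy control trailer to every frame and tracks per-source sequence windows), and the device names are invented:

# Toy illustration of PRP's duplicate-discard idea: the sender transmits every
# frame on both LAN A and LAN B with a sequence number; the receiver keeps the
# first copy that arrives and drops the duplicate.

class PrpReceiver:
    def __init__(self):
        self.seen = set()            # (source, sequence number) already accepted

    def receive(self, source, seq, payload, lan):
        key = (source, seq)
        if key in self.seen:
            print(f"seq {seq} from {source}: duplicate on LAN {lan}, discarded")
            return None
        self.seen.add(key)
        print(f"seq {seq} from {source}: accepted from LAN {lan}")
        return payload

rx = PrpReceiver()
# Frame 1 arrives on both LANs; frame 2 only arrives on LAN B (LAN A has failed)
rx.receive("ied_01", 1, "trip signal", "A")
rx.receive("ied_01", 1, "trip signal", "B")   # discarded as a duplicate
rx.receive("ied_01", 2, "status update", "B") # still delivered despite LAN A failure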

However, PRP requires not only two completely separate networks (normally exact physical duplicates), but also specialised hardware to allow edge devices to interface with both networks correctly. Very few systems outside of the utility market require such levels of redundancy, and even within utility systems one must be careful to not over-specify PRP as it can lead to extremely costly network capital and operational expenditure.

Alternatively, one can look at using PRP only for certain highly critical systems, while using other redundancy protocols for less critical parts of the system. PRP even allows its two internal networks to be used as networks in their own right (with some restrictions on where data can flow), which again can be used to balance redundancy against the cost of implementing that redundancy.

At this stage it is important that one has an intermediate understanding of industrial networking design, or employs someone to help with the process, even at just a basic level for now. These initial decisions about the network can be critical to the entire process, but are often not given the respect they deserve, leading to subpar or overly costly networks being implemented.

In fact, these early decisions are generally much more costly and time consuming to fix than the decisions that need to be made at later stages of the design. An incorrect cable design and installation can require an upheaval of the entire site to solve, not to mention the man-hours required. Changing an IP subnet, on the other hand, while painful and disruptive, can normally be done in a few hours of downtime. As with many things in life, the impact of not doing things properly from the beginning generally far outweighs the cost of doing it correctly.

So far, everything discussed has mostly revolved around Layer 1 of the Open Systems Interconnection (OSI) model, that being the physical layer. There is one last major physical-layer point that must be considered, and that is the actual port count available for edge devices at each connection point of the network. In some cases, a switch may be used exclusively for backbone interconnection, meaning no edge device ports are required, while in other cases the switch may sit closer to the logical edge of the network and will provide network connections for tens of devices.

A switch/network’s main purpose in any solution is to provide connectivity to the end devices and allow them to intercommunicate. With this very obvious functionality in mind, it is surprising how often the edge device count that a network must cater for is considered almost as an afterthought. The edge port connections required on the network should be at least partially understood at this stage, and these requirements should be kept in mind for the rest of the design phase.

Hardware choice and spares strategy

Different manufacturers will offer different models of switches, most of which these days are at least slightly modular. This will normally allow a degree of flexibility; however, whether this flexibility caters for your requirements must be carefully checked, ideally by contacting your hardware supplier and asking for their assistance.

Choosing hardware based only on port count and physical details like power supply can easily lead to the incorrect hardware being provided, which will often lead to overpaying for the hardware as you end up purchasing features that will never be implemented – or even worse, hardware that does not cater for a critical feature that will be needed at some point.

Another big consideration at this stage is spares and expansions management. Especially in today’s uncertain world where devices can take over half a year to be delivered due to component shortages, spares are vital to reliable network operations.

However, it is undesirable to waste storage space and other resources on too many spares of differing types. The best strategy is therefore to look at standardising the network hardware as much as possible. In some networks this might be easier, with the network requirements leading to a single switch build throughout. In others, a system might be desired that allows for modules to be installed into a switch chassis to provide the port counts required. This could mean that, for a central backbone switch, a few modules with high-bandwidth ports like gigabit fibre are ideal, while for an edge switch, some of the fibre modules could be swapped for higher port-count 100 Mbps copper or fibre modules, providing more ports with lower individual transfer speeds.

In most mission-critical networks, the rule of thumb is to look at gigabit-speed backbones, with 100 Mbps for edge device connections. Unless your network is also meant to handle high-traffic systems such as CCTV, a 100 Mbps edge connection/gigabit backbone philosophy should suffice in most cases. For higher-traffic networks, gigabit edge connections may be needed, and in some cases 10 gigabit backbones.

Going from physical to logical design

At this stage one should have considered the physical topology of the network and have an idea of the edge device counts per switch. An idea of what switches are suitable can thus be established, although there may be some upcoming decisions that will still determine what features the switches must support. More importantly, one should now have a rough idea of the size of the network, and what devices are going to be using it.

The next step is to consider these end devices and how they can be logically segregated on the network. For instance, devices can generally be grouped according to their functions, so scada-related devices could make up one group while engineering access terminals such as HMIs may fall into another group.

In some instances, it might be desirable to group security-specific devices together, or allowances might need to be made for guest access by contractors, with limited Internet and network access. On other sites the segregation may be geographical, or may relate to the functionality of the plant, such as separating different stages of manufacture into their own groups. This logical segregation is very site- and scenario-specific, and should be done with input from both the users of the end devices as well as a network design specialist.

The goal in the end is to split devices into these logical groups, which can then each be assigned their own VLAN and IP subnet. IP subnetting is common to all larger networks, whether a corporate office or a mission-critical power plant. IP subnetting separates devices at Layer 3 (the IP layer), meaning that they cannot communicate with one another without a router in place to route traffic between the different subnets. However, even with IP subnets in place, certain traffic types, such as broadcasts, can still reach between devices.

VLANs, on the other hand, segregate traffic at the switching level (Layer 2), meaning that traffic from one VLAN can be prevented from reaching a device in another VLAN unless it is specifically routed and allowed through.

This more stringent segregation has a number of advantages, not the least of which is security, as viruses and malware will often use abnormal methods to transmit themselves between devices, such as by exploiting broadcasts and similar. It also increases reliability and availability, while reducing the volumes of traffic that the end devices have to deal with.

Problems are also much better kept within a VLAN, meaning that issues in one system are less likely to affect other systems on the network. In many cases, such as in utility networks that must comply with IEC 61850 standards, VLANs and other protocols are not only recommended but are in fact required to provide the necessary functionality.

The final VLAN and IP subnet designs will both be very specific to the application in question, and many different approaches can be correct. Here, again, a design can greatly benefit from the expertise of someone familiar with mission-critical network design – following some of the best industry practices as well as having hands-on familiarity with these types of systems is essential to providing a well-balanced and efficient VLAN design.

Similarly, the IP subnetting will be dependent on certain factors, not least of which is the question of whether a particular network segment is part of a larger network that needs to be interfaced with at some point. For instance, in something like a substation network design for a new Independent Power Producer (IPP) that needs to interface to a government department’s existing systems at some point, one needs to consider their existing networking layout to a degree, and cater for things like routing and remote access where needed.

The VLAN and IP subnetting setup is critical to efficient and reliable operation of the network and attached systems, and also creates the foundation of the logical network design upon which the rest of the network design will be based.

The next major consideration is routing on the network, both for those cases where devices in one VLAN need to communicate with devices in another VLAN, as well as potentially allowing devices on the network to communicate outside to a wide-area network such as a country-wide private network, or more commonly, the Internet.

Routing between VLANs within a private network is generally secure and does not raise security concerns; however, any network not under your direct control should always be considered insecure, even if it is another company network such as corporate IT. From a security point of view, one must always protect against attacks from such insecure networks, which can very easily originate within a corporate network. These attacks can be deliberately targeted by users with malicious intent, but can also originate from innocent users who are simply in the wrong place, doing the wrong things. In either case, an appropriate level of cybersecurity protection must be in place.
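To make the "deny by default, route only what is explicitly allowed" idea concrete, a toy inter-zone allow-list check might look like the following. The zone names, subnets and permitted flows are invented for illustration and are no substitute for a real firewall policy:

# Toy inter-zone allow-list: traffic between subnets is denied unless a rule
# explicitly permits it. Zones, subnets and permitted flows are hypothetical.
import ipaddress

zones = {
    "process": ipaddress.ip_network("10.20.10.0/24"),
    "scada": ipaddress.ip_network("10.20.20.0/24"),
    "corporate_it": ipaddress.ip_network("172.16.0.0/16"),
}

# (source zone, destination zone, destination TCP port) tuples that are allowed
allowed_flows = {
    ("scada", "process", 102),        # e.g. IEC 61850 MMS from scada to process
    ("corporate_it", "scada", 443),   # read-only HTTPS dashboard access
}

def zone_of(ip):
    addr = ipaddress.ip_address(ip)
    return next((z for z, net in zones.items() if addr in net), None)

def permitted(src_ip, dst_ip, dst_port):
    src, dst = zone_of(src_ip), zone_of(dst_ip)
    if src is None or dst is None:
        return False                  # unknown networks are denied by default
    if src == dst:
        return True                   # intra-zone traffic is handled within the VLAN itself
    return (src, dst, dst_port) in allowed_flows

print(permitted("10.20.20.5", "10.20.10.7", 102))   # True  - explicitly allowed
print(permitted("172.16.4.9", "10.20.10.7", 502))   # False - denied by default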

Cybersecurity is an open-ended spectrum when talking about a mission-critical network, and can also be very difficult to justify, especially in today’s world where more than just a basic firewall is generally required. Depending on the levels of security applied, this can cost in the region of tens or hundreds of thousands of Rands, if not more.

Security does not add convenience to the network or its systems (in fact it generally restricts convenience quite severely), and will not directly improve profits or productivity. In fact, it will generally have a detrimental effect on profits (especially when monthly or annual licence renewals are required) and also productivity (since extra steps are often required in many processes as a result of stronger security).

As such, security can be very difficult to quantify, especially given that this aspect can outweigh the cost of the network in some cases. However, the cost of a security breach can be devastating, and the cost of not having proper cybersecurity can end up crippling a company or putting it out of business. It is not uncommon to hear of a new major data breach every few days, and an attack such as ransomware can not only completely shut down production at a site but, even worse, could interfere with safety and security systems, leading to further losses or health and safety risks.

In summary, it is important to realise that the network is effectively the nervous system of a modern site, allowing all the various protection, control, monitoring, security and safety devices to intercommunicate in a quick, reliable and efficient fashion.

Oftentimes the design and implementation of the network is handed off as a side project to someone involved with one of the main control systems at a site, and they are simply worried about getting basic communications to work. However, this generally leads to an under-designed and unreliable network. Often such a network will work fine for the first year or two, but as soon as devices and cables start ageing or something gets damaged, the network starts to struggle, and resolving the issue at this stage can be much more costly and disruptive than designing a proper, resilient network from the beginning.

This design and implementation should be handled by specialists in industrial networking, who understand the process and have experience dealing with the unforeseen problems that can arise on these systems. As with many such components of a mission-critical system, skimping on effort and cost early on in the network design can lead to much more effort and money being expended in the long run, whereas spending the time and capex early on will lead to a resilient network that requires minimal maintenance, conserving budget and time to spend on maintaining and upgrading the actual control, protection and monitoring systems.

For more information contact H3iSquared, +27 11 464 6025, sales@h3isquared.com, www.h3isquared.com
