AoIP: the benefits


Because it needs far less bandwidth than video and is easier to process digitally, audio was the first signal in the audiovisual sector to migrate to the IP environment. Although this happened more than twenty years ago, implementing an AoIP workflow still raises many doubts, especially for those coming from a linear, or even analog, environment. Let’s take a look at what Audio over IP (AoIP) is and what it entails.

Within the audiovisual field many people, including myself, think that audio is not just the other half of what we do, but something much more important than video. To the viewer, an isolated fault in the video, or even a loss of video quality, is less critical than the same problem in the audio. Proof of this is the quality of most ‘video interviews’ that have filled the news programs in the last year. The image is more than questionable, whereas the audio holds up much better against what normally comes out of a TV set.

It is also true that providing quality equipment is much easier in audio than in video. A good microphone connected to the computer and well placed is enough: auto-configurable preamps and noise reduction and optimization algorithms do the rest more than decently. When dealing with images, things are quite different. A good camera, good framing, lighting… It is understandable that the same standards are not met when having to take care of all these factors from home.

Audio signal

Bit depth: 16 bits
Sampling rate: 48 kHz
Bandwidth: 16 * 48,000 = 768,000 bits/sec (750 Kbits/s, less than 1 Mbps)
1 Gbps Ethernet (80% usage rate)
Audio channels on Ethernet: 800,000,000 / 768,000 = 1,041 channels per link (equivalent to more than 16 MADI cables of 64 channels each)
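
The arithmetic in the box above can be checked with a few lines of Python. This is only a rough sketch: it counts raw PCM payload and ignores packet headers, the same simplification the figures above make, and the function and parameter names are illustrative.

```python
def channel_bitrate(bit_depth: int, sample_rate_hz: int) -> int:
    """Raw PCM bitrate of one mono channel, in bits per second."""
    return bit_depth * sample_rate_hz


def channels_per_link(link_bps: int, usage: float,
                      bit_depth: int, sample_rate_hz: int) -> int:
    """Whole channels that fit in the usable share of a link (payload only)."""
    usable_bps = round(link_bps * usage)
    return usable_bps // channel_bitrate(bit_depth, sample_rate_hz)


if __name__ == "__main__":
    # 16-bit, 48 kHz audio on a 1 Gbps link used at 80%.
    print(channel_bitrate(16, 48_000))
    print(channels_per_link(1_000_000_000, 0.8, 16, 48_000))
```

Changing the bit depth or sampling rate (24 bits, 96 kHz) shows how quickly the channel count drops as audio quality goes up.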

The same holds for audio over IP: it is easier than video over IP, partly for technical reasons and partly because of the experience already gathered by manufacturers and professionals in the field. However, complications do exist. We can have audio signals with different sampling rates and bit depths, and we are expected to be able to use them interchangeably. Getting there is not trivial and requires some experience and skill.

At TM Broadcast we are going to cover the world of AoIP in a trilogy of articles, so as not to make it too dense. These articles will review the benefits, practical applications and management of AoIP networks in production environments.

Increasing complexity

The first source of added complexity in the audio field is the number of channels. There is only one image, but there are at least two audio channels and, from there, up to 16 or even more. MADI was one of the first technologies to emerge for transferring multiple audio channels, up to 64, over a single cable; digitally, of course. It is based on time-division multiplexing, which requires specific equipment to embed and de-embed those channels.
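
To illustrate the general idea of time-division multiplexing, here is a minimal sketch in which each frame carries one sample from every channel, in a fixed order. It is not MADI's actual frame format, only the principle; the function names are made up for the example.

```python
from typing import List, Sequence


def tdm_mux(channels: Sequence[Sequence[int]]) -> List[int]:
    """Interleave equal-length channels: one sample per channel per frame."""
    if len({len(ch) for ch in channels}) > 1:
        raise ValueError("all channels must have the same number of samples")
    stream: List[int] = []
    for frame in zip(*channels):
        stream.extend(frame)
    return stream


def tdm_demux(stream: Sequence[int], channel_count: int) -> List[List[int]]:
    """Recover each channel by taking every channel_count-th sample."""
    if len(stream) % channel_count != 0:
        raise ValueError("stream length is not a whole number of frames")
    return [list(stream[i::channel_count]) for i in range(channel_count)]


if __name__ == "__main__":
    left, right = [1, 2, 3], [10, 20, 30]
    stream = tdm_mux([left, right])
    print(stream)
    print(tdm_demux(stream, 2))
```

Note that the receiver must know the channel count and order in advance to de-embed anything, which is exactly why dedicated equipment is needed at both ends.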

Something similar happens in SDI. This technology is capable of transmitting multiple audio channels, 16 in this instance, within the transport network. In this case, we have an even greater limitation: the audio always has to go along with the image; they cannot be separated for transport. Again, specific equipment is required in order to aggregate and separate said channels, process them, route them, send them and receive them, which limits operability and increases costs.

Asynchronous operation

IP solves some of these problems, as it is much more flexible and scalable than MADI, SDI, AES, or similar ‘traditional’ digital audio standards. Although it also basically uses time-division multiplexing, this is performed asynchronously, thus allowing information to be inserted and extracted from the transport stream in a simpler way.

This results in one of the greatest benefits that IP brings to the world of audio: it is practically agnostic with respect to the hardware used to transmit information and signals. While it is true that nearly everyone uses Ethernet technology to transport IP data packets, other protocols can be used, which results in even lower costs and greater interoperability. Using Ethernet to carry IP signals in production environments is only logical. Firstly, because its current bandwidth allows it without the need to resort to more exclusive and therefore more expensive protocols; secondly, because it has been around for more than 30 years now, with the trust and reliability this implies.

Bidirectional data

Another aspect connected to adoption of IP in audio is that, through the same physical medium, we can transmit signals in both directions without greater technical complexity or the need for more complex infrastructure. This is a very important paradigm shift for the broadcast world, which is more used to thinking only about one-way signals.

Not only does it enable the exchange of audio signals between two points in both directions, which is a direct application of having bidirectional information, but it also allows controlling the source from the destination and vice versa. Let’s not just think about remotely controlling equipment (for example, muting a microphone from the mobile unit), but also about allowing the destination point to detect the signals that the source is sending to it in order to seamlessly adapt the source to the needs or features of the network.

This solves one of the big problems that asynchronous communication introduces and that MADI, SDI or AES did not have: interoperability. In a synchronous communication system, signal specifications are defined and fixed. However, in asynchronous systems this is not necessarily the case. We can have signals with different technical specifications on the same network without differentiating them. The fact that both the source and the destination of these signals, or even the network itself, can communicate allows them to adapt to these kinds of signals, and so this interoperability issue is solved at once.

Better flexibility

As we have just mentioned, closed synchronous systems guarantee interoperability at the expense of giving up a lot of flexibility. The capacity of the equipment, the network and the infrastructure in general, as well as the configuration and technical specifications of the signals, must all be defined and established beforehand… and remain unchanged throughout the system. And what happens when something unforeseen occurs, such as a configuration or equipment error? How can every potential incident be anticipated and dealt with? One answer: by oversizing. This is how, until now, we made sure we could deal with any signal in any situation. Obviously, this is expensive whichever way you look at it.

In IP, the receiver of a signal does not have to know how that signal is going to get there or even how many signals will be received or in what order: it simply listens and adapts to what it receives. For this, some degree of intelligence and the aforementioned bidirectionality are necessary in order to communicate with the source and ‘reach an agreement’. I am oversimplifying, I know, but this is basically correct.

Not only this, but a single signal can be sent to multiple destinations without increasing the bandwidth used: multicast. This provides yet another layer of flexibility which, coupled with increased interoperability and two-way communication capabilities, offers a plethora of possibilities to ponder.
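
A quick back-of-the-envelope comparison shows why multicast matters. The sketch below only models the bandwidth leaving the sender's network port and ignores packet overhead; with multicast the sender emits a single copy and the network replicates it where needed. The function name and parameters are illustrative.

```python
def sender_bandwidth_bps(stream_bps: int, receivers: int, multicast: bool) -> int:
    """Bandwidth leaving the sender's port for one stream (payload only)."""
    if receivers < 1:
        raise ValueError("at least one receiver is required")
    # Unicast: one copy per receiver. Multicast: one copy in total.
    return stream_bps if multicast else stream_bps * receivers


if __name__ == "__main__":
    stream = 768_000  # one 16-bit, 48 kHz channel
    for n in (1, 10, 100):
        print(n,
              sender_bandwidth_bps(stream, n, multicast=False),
              sender_bandwidth_bps(stream, n, multicast=True))
```

With unicast the load on the sender grows linearly with the number of destinations; with multicast it stays flat.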

And what about the clock?

Amongst the basics in any broadcast environment we find reference signals. All the equipment items have to work guided by the same reference signal, since they are all synchronous. All of them must operate according to the same pattern, whatever it is. In the audio environment, these reference signals are called clocks. In both MADI and AES or SDI, there is a master clock that sets the sampling pattern for all equipment, thus allowing interconnection for exchange of signals without loss.

A single missing sample of an audio signal is clearly perceptible to ordinary mortals, so this is critical. But we have already mentioned that in the IP world communications are asynchronous, so the existence of these clocks is not necessary. Can you imagine a clock to which all the computers in the world that connect to the Internet had to link up in order to talk to each other? Unthinkable, right? Well, the same thing happens in AoIP: they are not required.

In any case, a certain level of synchronization is necessary, but this is generated autonomously within the network itself. RTP (Real-time Transport Protocol) was initially used to synchronize the network, and later on PTP (Precision Time Protocol) was adopted. In the latter case, a dedicated clock generator is not necessary, as any signal-generating equipment within the network can act as the PTP master, thus simplifying operations and allowing redundancy, since any other generator of an audio stream can take over as PTP master.
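
To make the role of PTP a little more concrete, this is the basic arithmetic a device performs to line up its clock with the master, using the four timestamps of the delay request-response exchange defined in IEEE 1588. It is a simplified sketch that assumes the network delay is the same in both directions; the function name is illustrative.

```python
from typing import Tuple


def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int) -> Tuple[float, float]:
    """Return (clock offset, mean path delay) in the unit of the timestamps.

    t1: master sends Sync          (master clock)
    t2: device receives Sync       (device clock)
    t3: device sends Delay_Req     (device clock)
    t4: master receives Delay_Req  (master clock)
    Assumes a symmetric path: same delay in both directions.
    """
    master_to_device = t2 - t1  # path delay + offset
    device_to_master = t4 - t3  # path delay - offset
    offset = (master_to_device - device_to_master) / 2
    delay = (master_to_device + device_to_master) / 2
    return offset, delay


if __name__ == "__main__":
    # Device clock runs 500 ns ahead of the master; path delay is 1000 ns.
    print(ptp_offset_and_delay(t1=0, t2=1500, t3=5000, t4=5500))
```

The device then subtracts the offset from its own clock, and repeats the exchange periodically to follow any drift.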

Synchronizing ‘several’ clocks

So there is no single clock, but several clocks. What a mess, right? It sounds complex, but it is really simpler than it seems and, above all, much more practical. It is true that there is no dedicated master clock, only multiple clocks running on the same network, and this has to be dealt with; but there is no need to reinvent the wheel. The solution is called a buffer.

Although it is assumed that, thanks to PTP, both the sender and the receiver are synchronized, certain anomalies are tolerated in the network, caused by other signals that we do not control or by changes in infrastructure during transmission. To absorb them, safety ‘buffers’ are implemented. Incoming data is stored before the audio itself is recreated, which makes it possible to correct errors and reorder packets, thus minimizing losses and tolerating a certain degree of asynchrony.
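
A receive buffer of this kind can be sketched in a few lines. The class below is a deliberately minimal illustration, not how any particular AoIP product does it: it holds back a fixed number of packets, releases them in sequence-number order, and discards packets that arrive after their turn has passed. Sequence-number wraparound and duplicate packets are ignored for brevity, and all names are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Hold a few packets so reordered ones can be played out in order."""

    def __init__(self, depth: int) -> None:
        self.depth = depth  # packets held back before playout starts
        self._heap: List[Tuple[int, bytes]] = []
        self._next_seq: Optional[int] = None  # lowest sequence still playable

    def push(self, seq: int, payload: bytes) -> None:
        # Drop packets whose slot has already been played out.
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[Tuple[int, bytes]]:
        """Release the oldest packet once more than `depth` are waiting."""
        if len(self._heap) <= self.depth:
            return None
        seq, payload = heapq.heappop(self._heap)
        self._next_seq = seq + 1
        return seq, payload


if __name__ == "__main__":
    buf = JitterBuffer(depth=2)
    for seq in (1, 3, 2, 4, 5):  # packets 2 and 3 arrive swapped
        buf.push(seq, b"audio")
        released = buf.pop()
        if released is not None:
            print(released[0])
```

The deeper the buffer, the more disorder it can absorb, at the price of a longer wait before each packet is played out.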

The only problem is that the time taken to process the signals increases: if part of the signal has to be stored in memory before being regenerated, processing obviously takes longer. However, thanks to the large capacity of the network and the high sampling rate used in audio (48 kHz is equivalent to one sample roughly every 0.02 milliseconds), a lot of samples must be stored before the delay of these buffers becomes a problem. Only over long distances or on very busy networks can the necessary buffer size result in noticeable delays with an impact on the signal, and that is really unusual.
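
The relationship between buffer size and delay is simple enough to compute directly. The helper names below are illustrative, and the calculation only covers the buffering delay itself, not network transit or conversion time.

```python
def sample_period_ms(sample_rate_hz: int) -> float:
    """Time between two consecutive samples, in milliseconds."""
    return 1000.0 / sample_rate_hz


def buffer_delay_ms(buffered_samples: int, sample_rate_hz: int) -> float:
    """Delay added by holding back a given number of samples."""
    return buffered_samples * 1000.0 / sample_rate_hz


if __name__ == "__main__":
    print(sample_period_ms(48_000))
    for samples in (48, 480, 4_800):
        print(samples, buffer_delay_ms(samples, 48_000))
```

At 48 kHz, 48 buffered samples add just 1 ms, so a buffer has to hold hundreds of samples before the delay reaches tens of milliseconds.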


In this first article of the series on AoIP we have discussed the benefits that this technology brings, as well as certain problems together with their respective solutions. Increased interoperability, information transmission capacity and flexibility are its three greatest advantages. However, there are certain drawbacks to be sorted out, such as the reliability of the network or its complexity.
The extensive experience gathered by IP equipment manufacturers, and the use of IP in other industries, makes better equipment available at a lower cost, something that must always be considered.
In part two of this trilogy on AoIP we will delve into practical AoIP solutions in production environments, and then move on in part three to the management of this type of network and its security.

Author: Yeray Alfageme
