Basic knowledge of streaming video encoding.
Adobe Media Encoder CS4
Flash Media Encoding Server 3.5
As a producer of video on the web, you know that you're judged by the quality of your video. In this regard, many producers are considering converting from the venerable On2 VP6 codec to H.264. H.264 offers better visual quality than VP6, and the AAC audio codec offers much better quality than the MP3 codec paired with VP6. Starting with Adobe Flash Player 9 Update 3, you could play back files encoded in H.264/AAC formats. As of September 2008, the penetration of H.264/AAC-compatible players exceeded 89% of all Internet-connected PCs. No wonder producers are switching over.
This article first discusses the issues involved in such a changeover, including the potential requirement for royalties. I then describe the H.264-specific encoding parameters offered by most encoding programs. Finally, I cover how you can produce H.264 video with Adobe Media Encoder CS4 and Adobe Flash Media Encoding Server 3.5.
To begin, I should explain some introductory concepts related to H.264 video.
What is H.264?
H.264 is a video compression standard known as MPEG-4 Part 10, or MPEG-4 AVC (for "advanced video coding"). It's a joint standard promulgated by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
H.264's audio sidekick is AAC (advanced audio coding), which is designated MPEG-4 Part 3. Both H.264 and AAC are technically MPEG-4 codecs—though it's more accurate to call them by their specific names—and compatible bitstreams should conform to the requirements of Part 14 of the MPEG-4 spec.
According to Part 14, MPEG-4 files containing both audio and video, including those with H.264/AAC, should use the .mp4 extension, while audio-only files should use .m4a and video-only files should use .m4v. Different vendors have adopted a range of extensions that are recognized by their proprietary players, such as Apple with .m4p for files using FairPlay Digital Rights Management and .m4r for iPhone ringtones. (Mobile phones use the .3gp and .3g2 extensions, though I don't discuss producing for mobile phones in this article.)
Like MPEG-2, H.264 uses three types of frames: each group of pictures (GOP) consists of I-, B-, and P-frames. I-frames are compressed on their own, much like the DCT-based compression used in DV, while B- and P-frames reference redundancies in other frames to increase compression. I'll cover much more on this later in this article.
Like most video coding standards, H.264 actually standardizes only the "central decoder...such that every decoder conforming to the standard will produce similar output when given an encoded bitstream that conforms to the constraints of the standard," according to Overview of the H.264/AVC Video Coding Standard published in IEEE Transactions on Circuits and Systems for Video Technology (ITCSVT). Basically, this means that there's no standardized H.264 encoder. In fact, H.264 encoding vendors can utilize a range of different techniques to optimize video quality, so long as the bitstream plays on the target player. This is one of the key reasons that H.264 encoding interfaces vary so significantly among the various tools.
Will there be royalties?
If you stream H.264 encoded video after December 31, 2010, there may be an associated royalty obligation. As yet, however, it's undefined and uncertain. Here's an overview of what's known about royalties to date.
Briefly, H.264 was developed by a group of patent holders now represented by the MPEG Licensing Authority, or MPEG-LA for short. According to the Summary of AVC/H.264 License Terms (PDF, 34K), which you can download from the MPEG-LA site, there are three classes of video producers subject to a potential royalty obligation.
If you're in the first two classes, and are either distributing via pay-per-view or subscription, you may already owe MPEG-LA royalties. The third group, which is clearly the largest, is for free Internet broadcast. Here, there will be no royalties until December 31, 2010 (source: AVC/H.264 License Agreement). After that, "the royalty shall be no more than the economic equivalent of royalties payable during the same time for free television."
According to their website, MPEG-LA must disclose licensing terms at least one year before they become due, or no later than December 31, 2009. Until then, we're unfortunately in the dark as to which uses of H.264 video will incur royalties, and the extent of these charges. For more information on H.264-related royalties, check out my article, The Future's So Bright: H.264 Year in Review, at StreamingMedia.com.
H.264 and Flash Player
As I mentioned, Adobe added H.264 playback support to Adobe Flash Player 9 Update 3 back in 2007. The apparent goal was to support the widest possible variety of files containing H.264-encoded video. Flash Player should play .mp4, .m4v, .m4a, .mov, and .3gp files, H.264 files using the .flv extension, as well as files using the newer extensions that were released along with Flash Player 9 (see Table 1).
Table 1. File extensions for H.264 files produced for Flash Player playback
|File extension||FTYP||MIME type||Description|
|.f4v||'F4V '||video/mp4||Video for Flash Player|
|.f4p||'F4P '||video/mp4||Protected media for Flash Player|
|.f4a||'F4A '||audio/mp4||Audio for Flash Player|
|.f4b||'F4B '||audio/mp4||Audio book for Flash Player|
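If you serve these files yourself, the mapping in Table 1 is easy to keep in one place. The following is a minimal Python sketch, not anything taken from Flash Player or an Adobe tool: the dictionary simply restates Table 1, the helper name mime_type_for is hypothetical, and I'm assuming the second column of the table is the file-type (ftyp) brand.

```python
import os

# Extension -> (ftyp brand, MIME type, description), restated from Table 1.
# Treating the second column as the ftyp brand is an assumption.
F4_TYPES = {
    ".f4v": ("F4V ", "video/mp4", "Video for Flash Player"),
    ".f4p": ("F4P ", "video/mp4", "Protected media for Flash Player"),
    ".f4a": ("F4A ", "audio/mp4", "Audio for Flash Player"),
    ".f4b": ("F4B ", "audio/mp4", "Audio book for Flash Player"),
}


def mime_type_for(filename):
    """Return the MIME type for an F4x file name, or None if it is not listed."""
    extension = os.path.splitext(filename)[1].lower()
    entry = F4_TYPES.get(extension)
    return entry[1] if entry else None
```

For example, mime_type_for("lecture.f4v") returns "video/mp4", while an extension that isn't in Table 1 returns None so the caller can fall back to its own defaults.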
I'll describe profiles and levels in the next section. For now, understand that Flash Player supports the Baseline, Main, High, and High 10 H.264 profiles with no levels excluded. Accordingly, when you're producing H.264 video for Flash Player, you're free to choose the most advanced profile supported by the encoding tool, which is typically the High profile. On the audio side, Flash Player can play AAC Main, AAC Low Complexity, and AAC SBR (spectral band replication), which is otherwise known as High-Efficiency-AAC, or HE-AAC.
Producing H.264 video
You have seen that you have nearly complete flexibility regarding profiles and extensions; what else do you need to know before you dig into the details? A couple of things.
First, unlike VP6, which is available only from On2, there are multiple suppliers of H.264 codecs, including MainConcept, whose codec Adobe uses in Adobe Media Encoder and Adobe Flash Media Encoding Server. I've compared the quality of H.264 files produced with H.264 codecs from other vendors, and MainConcept has proven to be the best.
In general, while the overall quality of other codecs has improved, there are some tools to avoid out there. If you're producing with a different tool and not achieving the quality you were hoping for, try encoding with one of the Adobe tools.
Second, some older encoding tools do not offer output directly into F4V format. If F4V format is not offered in your encoding tool, the best alternative is to produce an MPEG-4 compatible streaming media file using the .mp4 extension.
With this as background, I'll describe the most common H.264 encoding parameters.
H.264 encoding parameters
Though H.264 codecs come from different vendors, they use the same general encoding techniques and typically present similar encoding options. Here I review the most common H.264 encoding options.
Understanding profiles and levels
According to the aforementioned article, Overview of the H.264/AVC Video Coding Standard, a profile "defines a set of coding tools or algorithms that can be used in generating a conforming bitstream," whereas a level "places constraints on certain key parameters of the bitstream." In other words, a profile defines specific encoding techniques that you can or can't utilize when encoding the files (such as B-frames), while the level defines details such as the maximum resolutions and data rates.
Take a look at Figure 1, which is a filtered screen capture of a features table from Wikipedia's description of H.264. On top are H.264 profiles, including the Baseline, Main, High, and High 10 profiles that Flash Player supports. On the left are the different encoding techniques available, with the table detailing those supported by the respective profiles.
Figure 1. Encoding techniques enabled by profile (source: Wikipedia)
As you would guess, the more advanced profiles use more advanced encoding algorithms and produce better quality (see Figure 2). To produce this comparison, I encoded the same source file to the same encoding parameters. The file on the left uses the Main Profile; the file on the right uses the Baseline Profile. A quick check of the chart in Figure 1 reveals that the Main Profile enables B slices (also called B-frames) and the higher-quality CABAC encoding, which I define later in this article. As you can see, these do help the Main Profile deliver higher-quality video than the Baseline Profile.
Figure 2. File encoded using the Main profile (left) retaining much more quality than a file encoded using the Baseline profile (right)
So, the Main and High profiles deliver better quality than the Baseline Profile; what's the catch? The catch is, as you use more advanced encoding techniques, the file becomes more difficult to decompress, and may not play smoothly on older, slower computers.
This observation illustrates one of the two trade-offs typically presented by H.264 encoding parameters. One trade-off is better quality for a file that is harder to decompress. The other trade-off is a parameter that delivers better quality at the expense of encoding time. In some rare instances, as with the decision to include B-frames in the stream, you trigger both trade-offs, increasing both decoding complexity and encoding time.
To return to profiles: At a high level, think about profiles as a convenient point of agreement for device manufacturers and video producers. Mobile phone vendor A wants to build a phone that can play H.264 video but needs to keep the cost, heat, and size requirements down. So the crafty chief of engineering searches and finds the optimal processor that's powerful enough to play H.264 files produced to the Baseline Profile. If you're a video producer seeking to create video for that device, you know that if you encode using the Baseline profile, the video will play.
Accordingly, when producing H.264 video, the general rule is to use the maximum profile supported by the target playback platform, since that delivers the best quality at any given data rate. If producing for mobile devices, this typically means the Baseline Profile, but check the documentation for that device to be sure. If producing for Flash Player consumption on Windows or Macintosh computers, this means the High Profile.
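The rule of thumb above amounts to picking the most advanced profile that both the player and the encoding tool support. Here is a small Python sketch of that selection. The function name and the simple Baseline-to-High 10 ranking are my own simplification for illustration; the H.264 profiles are not strictly nested in the standard.

```python
# Profiles ranked from least to most advanced. This linear ranking is a
# simplification for illustration, not a statement from the H.264 standard.
PROFILE_ORDER = ["Baseline", "Main", "High", "High 10"]


def best_profile(player_profiles, encoder_profiles):
    """Return the most advanced profile supported by both player and encoder."""
    common = [
        profile
        for profile in PROFILE_ORDER
        if profile in player_profiles and profile in encoder_profiles
    ]
    if not common:
        raise ValueError("player and encoder share no H.264 profile")
    return common[-1]
```

With Flash Player's list (Baseline, Main, High, and High 10) and an encoder that tops out at High, this returns "High"; for a device that plays only Baseline, it returns "Baseline".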
This sounds nice and tidy, but understand this: While encoding using the Baseline Profile ensures smooth playback on your target mobile device, using the High Profile for files bound for computer playback doesn't provide the same assurance. That's because the High Profile supports H.264 video produced at a maximum resolution of 4096 × 2048 pixels and a data rate of 720 Mbps. Few desktop computers could display a complete frame, much less play back that stream at 30 frames per second.
Accordingly, while producing for devices is all about profile, producing for computers is all about your video configuration. Here, the general rule is that decoding H.264 video is about as computationally intense as VP6—or Windows Media, for that matter. So long as you produce your H.264 video at a similar resolution and data rate as the other two codecs, it should play fine on the same class of computer. (For comparative playback statistics for H.264, VP6 and VC-1, check out my StreamingMedia.com article, Decoding the Truth About Hi-Def Video Production.)
In general, this means that as long as you're producing SD video at 640 × 480 resolution and lower, it should play fine on most post–2003 computers. If you're producing at 720p or higher, these streams won't play smoothly on many of these computers. You should consider offering an alternative SD stream for these viewers.
What about H.264 levels? If producing for mobile devices with limited screen resolution and bandwidth, you also have to choose the correct level, which again should be specified by the device manufacturer. However, since Flash Player can handle any level supported by any of the supported profiles, you don't have to worry about levels when producing for Flash Player playback on a personal computer.
When you select the Main or High Profiles, some encoding tools will give you two options for entropy coding mode (see Figure 3):
- CAVLC: Context-based adaptive variable-length coding
- CABAC: Context-based adaptive binary arithmetic coding
Of the two, CAVLC is the lower-quality, easier-to-decode option, while CABAC is the higher-quality, harder-to-decode option.
Figure 3. Your entropy coding choices: CABAC and CAVLC
Though results are source-dependent, CABAC is generally regarded as being 5–15% more efficient than CAVLC. This means that CABAC should deliver equivalent quality at a 5–15% lower data rate, or better quality at the same data rate. In my own tests, CABAC produced noticeably better quality, though only in HD test clips encoded at very low data rates. This is shown in Figure 4, which shows a portion of a frame cut from a 16:9 720p video produced with CABAC on the left and CAVLC on the right, both at the same 800 kbps video data rate. Now 800 kbps is very low for 720p footage; by way of comparison, YouTube encodes H.264 720p footage at 2 Mbps, 2.5 times that data rate.
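To put the 5–15% figure in concrete terms, here is a short Python sketch of the arithmetic. The function name and the default percentages are illustrative only; they restate the range quoted above rather than anything I measured.

```python
def cabac_equivalent_rate(cavlc_kbps, savings_percent=(5, 15)):
    """Return the (lowest, highest) CABAC data rate in kbps that should match
    the quality of a CAVLC stream at cavlc_kbps, given a savings range."""
    smallest_saving, largest_saving = savings_percent
    lowest_rate = cavlc_kbps * (100 - largest_saving) / 100
    highest_rate = cavlc_kbps * (100 - smallest_saving) / 100
    return lowest_rate, highest_rate
```

For the 800 kbps stream used in my test, cabac_equivalent_rate(800) gives a range of 680 to 760 kbps. In other words, CABAC should match the CAVLC file's quality with 40 to 120 kbps to spare.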
Figure 4. 720p file produced using CABAC on the left, CAVLC on the right
Though neither image would win an award for clarity, the ballerina's face and other details are clearly more visible on the left. The bottom line is that CABAC should deliver better quality, however modest the difference. Now the question becomes, How much harder is the file to decompress and play?
Not that much, it turns out. I tested this on two of the less-powerful multiple-core computers in my office, one a Hewlett-Packard notebook with a Core 2 Duo processor, and the other a PowerPC-based Apple PowerMac. As you can see in Table 2, the CABAC file increased the CPU load by less than 1% on the HP notebook, and less than 2% on the Mac. Based on the improved quality and minimal difference in the required playback CPU, I recommend choosing CABAC whenever the option is available.
Table 2. CPU consumed when playing back H.264 files encoded using CABAC and CAVLC
|Computer||CABAC (% CPU)||CAVLC (% CPU)||Difference|
|HP Compaq 8710w Mobile Workstation – Core 2 Duo|
|Apple PowerMac – Dual 2.7 GHz PPC G5||35.5||33.7||1.8%|
I, P, and B-frames
It's common knowledge that talking-head footage, where very little changes from frame to frame, encodes at higher quality than dynamic, motion-filled video. That's because H.264, like all high-quality motion codecs, is designed to take advantage of redundancies between video frames. The more redundancy, the higher the quality at any given bit rate.
To leverage this redundancy, H.264 streams include three types of frames (see Figure 5):
- I-frames: Also known as key frames, I-frames are completely self-contained and don't use information from any other frames. These are the largest frames of the three, and the highest-quality, but the least efficient from a compression perspective.
- P-frames: P-frames are "predicted" frames. When producing a P-frame, the encoder can look backwards to previous I or P-frames for redundant picture information. P-frames are more efficient than I-frames, but less efficient than B-frames.
- B-frames: B-frames are bi-directional predicted frames. As you can see in Figure 5, this means that when producing B-frames, the encoder can look both forwards and backwards for redundant picture information. This makes B-frames the most efficient frame of the three. Note that B-frames are not available when producing using H.264's Baseline Profile.
Figure 5. I, P, and B-frames in an H.264-encoded stream
Now that you know the function of each frame type, I'll show you how to optimize their usage.
Working with I-frames
Though I-frames are the least efficient from a compression perspective, they do perform two invaluable functions. First, all playback of an H.264 video file has to start at an I-frame because it's the only frame type that doesn't refer to any other frames during encoding.
Since almost all streaming video may be played interactively, with the viewer dragging a slider around to different sections, you should include regular I-frames to ensure responsive playback. This is true when playing a video streamed from Flash Media Server, or one distributed via progressive download. While there is no magic number, I typically use an I-frame interval of 10 seconds, which means one I-frame every 300 frames when producing at 30 frames per second (and 240 and 150 for 24 fps and 15 fps video, respectively).
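The conversion from seconds to frames is simple multiplication, but it's worth writing down because most encoders ask for the interval in frames. This is a minimal Python sketch with a hypothetical function name. Rounding to the nearest whole frame is my own choice, made for fractional rates such as 29.97 fps.

```python
def iframe_interval_frames(interval_seconds, frames_per_second):
    """Convert an I-frame interval in seconds to a whole number of frames."""
    return round(interval_seconds * frames_per_second)
```

A 10-second interval works out to 300 frames at 30 fps, 240 frames at 24 fps, and 150 frames at 15 fps, matching the figures above.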
The other function of an I-frame is to help reset quality at a scene change. Imagine a sharp cut from one scene to another. If the first frame of the new scene is an I-frame, it's the best possible frame, which is a better starting point for all subsequent P and B-frames looking for redundant information. For this reason, most encoding tools offer a feature called "scene change detection," or "natural key frames," which you should always enable.
Figure 6 shows the I-frame related controls from Flash Media Encoding Server. You can see that Enable Scene Change detection is enabled, and that the size of the Coded Video Sequence is 300, as in 300 frames. This would be simpler to understand if it simply said "I-frame interval," but it's easy enough to figure out.
Figure 6. I-frame related controls from Flash Media Encoding Server
Specifically, the Coded Video Sequence refers to a "Group of Pictures" or GOP, which is the building block of the H.264 stream—that is, each H.264 stream is composed of multiple GOPs. Each GOP starts with an I-frame and includes all frames up to, but not including, the next I-frame. By choosing a Coded Video Sequence size of 300, you're telling Flash Media Encoding Server to create a GOP of 300 frames, or basically the same as an I-frame interval of 300.
I'll describe the Number of B-Pictures setting further on, and I've addressed Entropy Coding Mode already; but I wanted to explain the Minimum IDR interval and IDR frequency. I'll start by defining an IDR frame.
Briefly, the H.264 specification enables two types of I-frames: normal I-frames and IDR frames. With IDR frames, no frame after the IDR frame can refer back to any frame before the IDR frame. In contrast, with regular I-frames, B and P-frames located after the I-frame can refer back to reference frames located before the I-frame.
In terms of random access within the video stream, playback can always start on an IDR frame because no frame refers to any frames behind it. However, playback cannot always start on a non-IDR I-frame because subsequent frames may reference previous frames.
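To see why this matters for seeking, consider what a player has to do when the viewer drags the slider: it must back up to the nearest IDR frame at or before the requested frame and decode forward from there. The Python sketch below illustrates that lookup. The function name is hypothetical, and this is not how Flash Player is actually implemented.

```python
from bisect import bisect_right


def seek_start_frame(idr_frames, target_frame):
    """Return the last IDR frame at or before target_frame.

    idr_frames must be a sorted list of frame numbers of IDR frames.
    """
    position = bisect_right(idr_frames, target_frame)
    if position == 0:
        raise ValueError("no IDR frame at or before the target frame")
    return idr_frames[position - 1]
```

With IDR frames at 0, 300, and 600, a seek to frame 450 starts decoding at frame 300. The farther apart the IDR frames are, the more frames the player has to decode before it can display the one the viewer asked for.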
Since one of the key reasons to insert I-frames into your video is to enable interactivity, I use the default IDR frequency setting of 1, which makes every I-frame an IDR frame. If you use a setting of 0, only the first I-frame in the video file will be an IDR frame, which could make the file sluggish during random access. A setting of 2 makes every second I-frame an IDR frame, while a setting of 3 makes every third I-frame an IDR frame, and so on. Again, I just use the default setting of 1.
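The effect of the IDR frequency setting can be sketched in a few lines of Python. This is an illustration of the behavior described above, not Flash Media Encoding Server's actual logic. The function name is hypothetical, and I'm assuming the count starts at the first I-frame, which always has to be an IDR frame.

```python
def idr_flags(number_of_iframes, idr_frequency):
    """Return one boolean per I-frame: True if that I-frame is an IDR frame.

    Assumes counting starts at the first I-frame, which is always IDR.
    """
    flags = []
    for index in range(number_of_iframes):
        if index == 0:
            flags.append(True)       # the first I-frame is always an IDR frame
        elif idr_frequency == 0:
            flags.append(False)      # setting 0: no further IDR frames
        else:
            flags.append(index % idr_frequency == 0)
    return flags
```

With a setting of 1, every entry is True; with 0, only the first is; with 2, the flags alternate, starting with True.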
Minimum IDR interval defines the minimum number of frames in a group of pictures. Though you've set the size of the Coded Video Sequence at 300, you also enabled Scene Change Detection, which allows the encoder to insert an I-frame at scene changes. In a very dynamic MTV-like sequence, this could result in very frequent I-frames, which could degrade overall video quality. For these types of videos, you could experiment with extending the minimum IDR interval to 30–60 frames to see if this improves quality. For most videos, however, the default interval of 1 provides the encoder with the necessary flexibility to insert frequent I-frames in short, highly dynamic periods, like an opening or closing logo. For this reason, I also use the default option of 1 for this control.
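To make the interaction between these controls concrete, here is a simplified Python model of I-frame placement. It is only a sketch under stated assumptions: real encoders, including Flash Media Encoding Server, may behave differently; the function and parameter names are hypothetical; the scene changes are passed in rather than detected; and I'm assuming the GOP counter restarts after every I-frame, including those inserted at scene changes.

```python
def place_iframes(total_frames, gop_size, scene_changes, min_interval=1):
    """Return the frame numbers that become I-frames in a simplified model.

    An I-frame is forced once gop_size frames have passed since the last one,
    and is also inserted at a scene change unless fewer than min_interval
    frames have passed since the last I-frame.
    """
    iframes = [0]                      # playback always starts on an I-frame
    changes = set(scene_changes)
    for frame in range(1, total_frames):
        frames_since_last = frame - iframes[-1]
        if frames_since_last >= gop_size:
            iframes.append(frame)      # regular I-frame at the GOP boundary
        elif frame in changes and frames_since_last >= min_interval:
            iframes.append(frame)      # extra I-frame at a scene change
    return iframes
```

For a 900-frame clip with a GOP size of 300 and scene changes at frames 10, 20, and 450, a minimum interval of 1 yields I-frames at 0, 10, 20, 320, 450, and 750. Raising the minimum interval to 30 suppresses the two early scene-change I-frames and yields 0, 300, 450, and 750.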
Working with B-frames
B-frames are the most efficient frames because they can search both ways for redundancies. Though controls and control nomenclature vary from encoder to encoder, the most common B-frame related control is simply the number of B-frames, or "B-Pictures" as shown in Figure 6. Note that the number in Figure 6 actually refers to the number of B-frames between consecutive I-frames or P-frames.
Using the value of 2 found in Figure 6, you would create a GOP that looks like this:
IBBPBBPBBPBBPBB...
...all the way to frame 300. If the number of B-Pictures were 3, the encoder would insert three B-frames between each I-frame and/or P-frame. While there is no magic number, I typically use two sequential B-frames.
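The frame pattern that results from the B-Pictures setting can be generated with a short Python sketch. This is an illustration only: the function name is hypothetical, the frames are listed in display order, and it assumes a fixed pattern with no scene-change I-frames inside the GOP.

```python
def gop_pattern(gop_size, b_frames):
    """Return the frame types of one GOP, in display order, as a string."""
    frames = []
    for index in range(gop_size):
        if index == 0:
            frames.append("I")         # every GOP starts with an I-frame
        elif index % (b_frames + 1) == 0:
            frames.append("P")         # a P-frame follows each run of B-frames
        else:
            frames.append("B")
    return "".join(frames)
```

Calling gop_pattern(10, 2) returns "IBBPBBPBBP", and gop_pattern(9, 3) returns "IBBBPBBBP". For the full 300-frame GOP with two B-frames, the result contains one I-frame, 99 P-frames, and 200 B-frames.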
The algorithm of context-based adaptive binary arithmetic coding (CABAC) has been developed within the joint standardization activities of ITU-T and ISO/IEC for the design and specification of the video coding standard H.264/AVC. In a first preliminary version, the new entropy-coding method of CABAC was introduced as a standard contribution [VCEG-L13] to the ITU-T VCEG meeting in January 2001. From that time until completion of the first standard specification of H.264/AVC (Version 1) in May 2003, the CABAC algorithm underwent a series of revisions and further refinements.
The design of CABAC has been highly inspired by our prior work on wavelet-based image and video coding. However, in comparison to this research work, additional aspects previously largely ignored have been taken into account during the development of CABAC. These aspects are mostly related to implementation complexity and additional requirements in terms of conformity and applicability. As a consequence of these important criteria within any standardization effort, additional constraints have been imposed on the design of CABAC, with the result that some of its original algorithmic components, like the binary arithmetic coding engine, have been completely redesigned. Other components that are needed to alleviate potential losses in coding efficiency when using small-sized slices, as further described below, were added at a later stage of the development. Support for additional coding tools such as interlaced coding and variable-block size transforms (as considered for Version 1 of H.264/AVC), as well as the later re-introduced, simplified use of 8x8 transforms, was also integrated at a time when the core design of CABAC was already reaching a level of technical maturity.
At that time - and also at a later stage when the scalable extension of H.264/AVC or High Efficiency Video Coding (HEVC) was designed - another feature of CABAC proved to be very useful. It turned out that, in contrast to entropy-coding schemes based on variable-length codes (VLCs), the CABAC coding approach offers an additional advantage in terms of extensibility, such that support for newly added syntax elements can be achieved in a simpler and fairer manner. Usually the addition of syntax elements also affects the distribution of already available syntax elements, which, for a VLC-based entropy-coding approach, may in general require re-optimizing the VLC tables of the given syntax elements rather than just adding a suitable VLC code for the new syntax element(s). Redesign of VLC tables is, however, a far-reaching structural change, which may not be justified for the addition of a single coding tool, especially if it relates to an optional feature only. Since CABAC guarantees an inherent adaptivity to the actually given (conditional) probability, there is no need for further structural adjustments besides the choice of a binarization or context model (and associated initialization values), which, as a first approximation, can be chosen in a canonical way by using the prototypes already specified in the CABAC design.
CABAC has been adopted as a normative part of H.264/AVC as well as of the HEVC draft standard; in H.264/AVC, it is one of two alternative methods of entropy coding. The other method specified in H.264/AVC is a low-complexity entropy-coding technique based on the usage of context-adaptively switched sets of variable-length codes, so-called Context-Adaptive Variable-Length Coding (CAVLC). Compared to CABAC, CAVLC offers reduced implementation costs at the price of lower compression efficiency. For TV signals in standard- or high-definition resolution, CABAC typically provides bit-rate savings of 10-20% relative to CAVLC at the same objective video quality. Note that for the HEVC draft, CABAC is the only entropy coding method.