#888 Improve performance of AV1 decoder

Open
opened 2 years ago by wolfbeast · 14 comments
wolfbeast commented 2 years ago (Migrated from github.com)

The AV1 video decoder has abysmal performance in comparison to any and all other codecs we have in use in our tree.

This is a known issue; libaom is primarily a reference library. Alternatives (e.g. dav1d) are available but rely on assembly that is (currently) incompatible with our build system assembly compiler (yasm).

This issue is to discuss alternatives and research improvements.

mattatobin commented 2 years ago (Migrated from github.com)
Owner

Is there any reason both assembly compilers can’t be used, or that the assembly can’t be made yasm-compatible?

wolfbeast commented 2 years ago (Migrated from github.com)
Owner

I’d like to know what makes dav1d yasm-incompatible, too. I’d think assembly = assembly; there’s much less code type variation there, it being a low-level language.

kn-yami commented 2 years ago (Migrated from github.com)
Owner

According to the [`yasm` website](http://yasm.tortall.net/), it has “Nearly feature-complete lexing and parsing of NASM syntax”, so maybe `dav1d` would require only trivial changes to be compatible with `yasm`?
trav90 commented 2 years ago (Migrated from github.com)
Owner

I know that Mozilla reached out to the dav1d developers by [opening an issue](https://code.videolan.org/videolan/dav1d/issues/114) on their bug tracker and asked them to support yasm, but it was unfortunately WONTFIXED with the comment “just use nasm”.

I’m with @wolfbeast in that I’d think assembly = assembly. I’ll see if I can find out what the incompatibility is and if it’s possible to work around. I’ll also investigate if it’s possible to use nasm for dav1d only while continuing to use yasm for the other assembly-optimized libs in our tree.
roytam1 commented 2 years ago (Migrated from github.com)
Owner

In the Mozilla bug tracker (https://bugzilla.mozilla.org/show_bug.cgi?id=1501796#c1), it is stated that dav1d requires an assembler that has AVX512 support, which yasm doesn’t support (yet?).

Related issue in yasm: https://github.com/yasm/yasm/issues/101

NintendoManiac64 commented 2 years ago (Migrated from github.com)
Owner

I do not know if @wolfbeast is still using a Phenom II, but if so, it may be problematic that the dav1d team seems to be focused on optimizing only for SSSE3 and above, SIMD-wise:

https://code.videolan.org/videolan/dav1d/issues/207#note_26325

wolfbeast commented 2 years ago (Migrated from github.com)
Owner

Oh.. so dav1d is produced by videolan? In that case I’m not at all interested in adopting that code. I suggest we keep our focus on the AOM lib or any other alternatives that might pop up.

(I’m also not surprised about the answer to “just use nasm” as a brush-off to Mozilla bringing that up. VL is a prime example of the “elitist our way only” kind of development of software.)

wolfbeast commented 2 years ago (Migrated from github.com)
Owner

Since this is the first time I’ve even heard of AVX512, I did some research into it. It’s quite pointless even as a concept, because for that kind of data churn you’d want to be using the GPU, not the CPU. The decoder lib should be focusing on leveraging DXVA instead of making ridiculous demands of CPU hardware for software decoding. The uptake of AVX2 is already really slow for the same reason.

More technical:
The current CPUs execute AVX on 128-bit vector units, and that seems like the most natural width for code. 4-element FP32 vectors let you accelerate the vast majority of 3D and video code, because XYZW homogeneous coordinates and 4x4 transform matrices are a fundamental part of design there. Not to mention that 4 channels of FP32 gives you all the bandwidth and register space you need for any real-time decoding and filtering. 16-element vectors are a lot trickier to utilize efficiently, and when you start getting data wide enough to saturate AVX-512 you would be better off offloading the entire process to either the integrated or dedicated GPU - In case of integrated it would even work in the same memory pool already making it an easy port. Even using a dedicated GPU would be simple enough using the designated APIs for it. So, I really don’t understand the requirement for AVX512 in dav1d, nor the focus on CPU-only decoding of HD video.

So, what is AVX512 good for otherwise, then? Specialized, dedicated vector calculations that you would only see in specialized professional software like CAD and architectural software.

tl;dr

> AVX512 will be super niche feature that will be irrelevant for 90+% of server/workstation market and 99% of desktop market.

roytam1 commented 2 years ago (Migrated from github.com)
Owner

> Therefore, the VideoLAN, VLC and FFmpeg communities have started to work on a new decoder, sponsored by the Alliance of Open Media.

And that is dav1d. I don’t think an ffvpx-alike for AV1 in ffmpeg will be available any time soon, and neither will hardware-accelerated decoding.

As for the AVX instruction set(s), the Pentium/Celeron class of the latest Core i series still has no AVX support, as stated in http://forum.doom9.net/showthread.php?p=1859890#post1859890

A summary of the current dav1d state is here: http://forum.doom9.net/showthread.php?p=1859992#post1859992

mattatobin commented 2 years ago (Migrated from github.com)
Owner

What does it matter.. Like h264 vs VPX .. HEVC will be way more widely used in commercial contexts than AV1, thus the priority of support for AV1 should be considered secondary at best.

Yes, I know the chance for that not to be true in the end goes up when someone makes those kinds of statements.

Regardless, I agree with Moonchild that AOM should be improved rather than taking in that videolan abomination.

wolfbeast commented 2 years ago (Migrated from github.com)
Owner

Unless some really smart (read: inaccurate with cutting corners) decoding is being done, you can’t expect full-HD video with a high SNR from a tightly-compressed format to reach 30+ fps on CPU-only decoding. That is, if the codec even allows for cutting corners, which AV1 might not at acceptable bitrates. Focusing on that is folly and a total waste of resources. Because of the nature of video media, codecs can also not easily use parallel decoding (which would be required for such demand for performance) on a generic CPU, due to the fact that bitstreams for video compression make heavy use of delta frames which can’t be decoded any other way than sequential/progressive.
Even so, not even doing basic SIMD optimization for the largest available cross-section of processors out of principle in a supposedly-performant alternative to the reference implementation is the wrong direction of development. For this kind of library/codec you want to get solid, broad cross-section optimization in first (i.e.: SSE2) and then further refine with later instruction sets.
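
To illustrate the "broad baseline first, refine later" approach described above, here is a minimal sketch of tiered kernel selection. It is not taken from libaom or dav1d: the tier names, the `KERNELS` table, and `select_kernel` are hypothetical, and the "kernels" are plain Python stand-ins for what would be assembly or intrinsics in a real decoder.

```python
from typing import Callable, Dict, List, Set, Tuple

Kernel = Callable[[List[int], List[int]], List[int]]


def saturating_add(a: List[int], b: List[int]) -> List[int]:
    """Reference behaviour: add 8-bit samples and clamp the result to 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]


# Hypothetical tiers, best first. "c" is the plain fallback without SIMD.
TIERS: List[str] = ["avx2", "ssse3", "sse2", "c"]

# Stand-in kernels (assumption): every tier must produce identical output,
# so the same reference function is registered for each implemented tier.
# No "avx2" entry exists yet, which models "refine later".
KERNELS: Dict[str, Kernel] = {
    "c": saturating_add,
    "sse2": saturating_add,
    "ssse3": saturating_add,
}


def select_kernel(cpu_features: Set[str]) -> Tuple[str, Kernel]:
    """Return the best registered kernel that the given CPU can run."""
    for tier in TIERS:
        implemented = tier in KERNELS
        usable = tier == "c" or tier in cpu_features
        if implemented and usable:
            return tier, KERNELS[tier]
    raise RuntimeError("no kernel registered, not even the C fallback")


# A CPU with SSE2/SSE3 but no SSSE3 still gets an optimized path,
# as long as an SSE2 kernel was written in the first place.
tier, kernel = select_kernel({"sse2", "sse3"})
print(tier, kernel([250, 10], [10, 20]))
```

With this layout, adding a faster tier later only means registering one more entry; CPUs that lack it keep using the baseline kernel.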

trav90 commented 2 years ago (Migrated from github.com)
Owner

> I suggest we keep our focus on the AOM lib or any other alternatives that might pop up.

With what I’ve seen of videolan and dav1d, I would tend to agree.

> Even so, not even doing basic SIMD optimization for the largest available cross-section of processors out of principle in a supposedly-performant alternative to the reference implementation is the wrong direction of development. For this kind of library/codec you want to get solid, broad cross-section optimization in **first** (i.e.: SSE2) and then further refine with later instruction sets.

videolan’s comments regarding SIMD instructions:

> SSE3 is a bunch of float instructions that have no use for a video decoder, and SSE2 is missing way too many instructions that make writing asm considerably more complex, so i fear no one will give that a try.
> ...We won’t reject patches, of course, for SSE3. But I think the team will focus on SSSE3.

([source](https://code.videolan.org/videolan/dav1d/issues/207))

Also, for what it’s worth, while it may be a bit slower than dav1d, libaom does at least support SSE2 assembly optimizations.
trav90 commented 2 years ago (Migrated from github.com)
Owner

Given that adopting dav1d into our tree is not a preferred or really viable option, and to my knowledge there are currently no other alternatives to libaom, does this issue need to stay open? I feel we’ve pretty well reached a conclusion on which decoder we are going to use for AV1.

wolfbeast commented 2 years ago (Migrated from github.com)
Owner

The main reason for this issue to be opened was to investigate performance improvements, regardless of whether it means a different lib or improvements in the current lib. Discarding alternative libs that are currently present doesn’t solve this issue but there’s little reason to keep this open for now since there are no clear ways forward at this time. I’d like to keep it open since it is still something that needs to be addressed, even though there’s little we can do -right now-.

Let’s just put this on hold for now.
