So a potential downside of VBR, when it is done poorly, would be not being able to follow cluttered, fast-moving scenes so well, and being unable to get a clean image when screen-capturing a fast-moving scene? Or is it more about getting blocks in shadows and banding on walls?
But then would CBR have the opposite problem, that it can't allocate more bits to a complex or fast scene if its fixed rate is set too low, with a similar outcome?
The potential downside of anything done poorly is that the result will suck. So the first part of your question is something of a strawman argument.
What VBR excels at is dealing with both fast moving scenes and static scenes by allocating the bits needed to represent both without wastage.
Like anything with variable quality it depends on how well you do it and what your priorities are for the result.
By and large, none of the providers we work with do encoding poorly. It is just too easy to do it right and too expensive to do it wrong. But storage costs, bandwidth costs, and circumstances change. Thus, any entity that pays for storage (that includes me) and bandwidth (me too) is motivated to reduce the size of the encoded video to cut both costs. On the other side, consumers are no longer satisfied with blocky 320x240 video (if they ever were), so there are limits to how small video streams can be.
The balance that the providers strike I would call 'pretty good, but not pristine' for overall quality. My mother can't tell the difference. As an aside, it used to be that most consumers couldn't tell the difference between SD video and HD video on the same HD display. I want to believe that is no longer true, but I am not at all sure.
You mentioned capturing a clean image from one frame of a fast-moving scene. This is indeed an area where any encoding method may sacrifice some detail, because the human mind perceives visual images differently when studying them at leisure than when they are moving fast. This is part of the hunter-prey paradigm that used to be really important to eating regularly.
With either VBR or CBR, when you have to fit a scene into a fixed (CBR) or maximum-average (VBR) bit budget, such perceptual encoding may be applied. The more limited the space, the more aggressively it is applied. CBR is usually more constrained because you are always trying to keep the stream full but never let it overflow, and you can't put anything in the 'bank' as VBR does.
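To make the 'bank' idea concrete, here is a toy sketch (not a real encoder) of how the two strategies hand out bits over a clip that mixes easy and hard seconds. The complexity numbers are invented for the example:

```python
import math

# Relative bits each second of video "wants" (invented values:
# two hard, fast-moving seconds surrounded by easy, static ones).
complexity = [1, 1, 8, 8, 1, 1]

cbr_rate = 4      # CBR: fixed bits per second
vbr_average = 4   # VBR: must average this over the whole clip

# CBR gives every second the same budget, regardless of need.
cbr_budget = [cbr_rate] * len(complexity)

# VBR allocates in proportion to complexity, constrained so the
# total spent matches the same average rate as the CBR stream.
total = vbr_average * len(complexity)
vbr_budget = [total * c / sum(complexity) for c in complexity]

# Same total bits, but VBR banks bits from the easy seconds to
# spend on the hard ones, while CBR starves the hard seconds.
assert math.isclose(sum(vbr_budget), sum(cbr_budget))
print(cbr_budget)   # [4, 4, 4, 4, 4, 4]
print(vbr_budget)   # [1.2, 1.2, 9.6, 9.6, 1.2, 1.2]
```

Both streams spend the same total; the difference is purely where the bits land.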
You mention banding on walls (I tend to think of the sky, myself). Banding is an artifact not of an encoding strategy like VBR or CBR, but of the structure of the pixels used to represent the images. What I mean by this is that in a digital system (including modern video) each unit of information (pixel) is represented by a group of numbers which can have only so many separate values in a range. For 8 bits, the number of values is 256, for 10 bits 1024, for 12 bits 4096, etc.
Banding occurs when there is a smooth gradient of shade or color in the original (like the sky or a wall) which changes by only a few steps across a wide space. When the values are 8-bit, the difference between shade 60 and shade 61, for example, is quite visible on modern displays and looks like a discrete band. When the values are 12-bit, the difference between shade 960 and shade 961 is so small as to be invisible to the eye, and with such small steps the displayed sky or wall does not appear to have steps but a smooth gradient.
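To put numbers on that, a quick sketch of how the size of one code-value step shrinks as bit depth grows (assuming the full code range maps to the same displayed brightness range):

```python
# One code-value step as a fraction of the full range, per bit depth.
# Smaller steps mean adjacent shades are closer, so banding is less visible.
for bits in (8, 10, 12):
    levels = 2 ** bits
    step_pct = 100 / (levels - 1)
    print(f"{bits}-bit: {levels} levels, one step = {step_pct:.3f}% of range")
# 8-bit: 256 levels, one step = 0.392% of range
# 10-bit: 1024 levels, one step = 0.098% of range
# 12-bit: 4096 levels, one step = 0.024% of range
```

So going from 8 to 12 bits makes each step roughly sixteen times finer, which is why the 960-to-961 transition disappears.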
This is deep color (10, 12, or 16 bits per color component).
There is a similar banding effect that occurs in very dark and, to a lesser degree, very light scenes. This happens when the range of light to dark between 0 and 255 is narrower than what the original source portrays. Everything looks like a nice gradient until you hit too dark or too light, and you get a single band where it gets no darker or lighter. This is the failing of inadequate dynamic range. In order to represent normal lighting well, very bright and very dark quality are sacrificed.
Additional information can be encoded in the video stream to deal with the rare but important scenes with very dark or very light pixels. One way is marketed as HDR (High Dynamic Range).
Both deep color and HDR become compelling because our displays are good enough to show us that we don't have them.
The pixel structure chosen is independent of whether CBR or VBR is used.
If you set up a hypothetical case where the CBR rate is the same as the maximum VBR limit, CBR can look just as good as VBR. With energetic video, though, the CBR stream will be 4 to 10 times the size of the VBR stream when bit depth, dynamic range, optimizations, and codec are equal. If you cut corners on any of these, with either, the resulting video will show it.
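Some back-of-the-envelope arithmetic on that size difference. The rates here are illustrative only (a hypothetical 20 Mbit/s peak and 4 Mbit/s VBR average; real numbers vary by codec and content):

```python
# Hypothetical case: CBR rate pinned to the VBR peak, while the
# VBR stream averages well below that peak over the whole movie.
peak_mbps = 20        # CBR rate = VBR maximum (illustrative)
vbr_avg_mbps = 4      # VBR average for the same content (illustrative)
runtime_s = 2 * 3600  # a two-hour movie

cbr_gb = peak_mbps * runtime_s / 8 / 1000     # megabits -> gigabytes
vbr_gb = vbr_avg_mbps * runtime_s / 8 / 1000
print(f"CBR: {cbr_gb:.1f} GB, VBR: {vbr_gb:.1f} GB, {cbr_gb / vbr_gb:.0f}x larger")
# CBR: 18.0 GB, VBR: 3.6 GB, 5x larger
```

With these assumed rates the CBR file lands at 5x the VBR file, squarely inside the 4-to-10x range mentioned above.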
If you set the CBR rate above the maximum VBR limit, it could even look better, as one of the CBR proponents suggested. I give that a 'Duh?' mark.
You will note he imagined such a stream in a hand drawn graph but did not have one for the actual bandwidth display he captured. That CBR stream may be as elusive or apocryphal as the Loch Ness Monster.
If such streams exist and are offered by Amazon (seems doubtful) I would support making them selectable. But I would prioritize it below most anything else that developers might spend their time on. Including a nap.