As has been pointed out, it’s a question of sampling rate, bandwidth, and sync. Video just won’t work in an audio signal path. Think about the video card in your PC… they usually have two video outputs. If you want a third or fourth, you need more video cards! And you need PCIe bus sockets to support them.
SD video is sampled with a 13.5 MHz or 27 MHz clock. Audio is commonly sampled with a clock somewhere between 44.1 kHz and 192 kHz. Generalizing here, but say we’re sampling video at 13.5 MHz and audio at 44.1 kHz, both at 24 bits per sample (24-bit RGB, 24-bit audio). We could stream roughly 300 channels of audio within the same bandwidth as one SD video source.
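To put a number on that, here is a quick back-of-the-envelope sketch in Python. It only uses the figures from the paragraph above (13.5 MHz video, 44.1 kHz audio, 24 bits per sample) and ignores blanking, sync, and any container overhead, so treat it as a rough comparison rather than an exact channel count.

```python
# Rough bandwidth comparison: one SD video stream vs. one audio channel.
# Assumes 24 bits per sample for both and ignores blanking/sync overhead.
VIDEO_SAMPLE_RATE_HZ = 13_500_000
AUDIO_SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 24

video_bits_per_second = VIDEO_SAMPLE_RATE_HZ * BITS_PER_SAMPLE
audio_bits_per_second = AUDIO_SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# How many audio channels fit in the bandwidth of one video source.
audio_channels_per_video = video_bits_per_second / audio_bits_per_second

print(f"Video: {video_bits_per_second / 1e6:.1f} Mbit/s")
print(f"Audio: {audio_bits_per_second / 1e6:.4f} Mbit/s per channel")
print(f"Audio channels per video stream: {audio_channels_per_video:.1f}")
```

Since the bit depth is the same on both sides, the ratio comes down to 13.5 MHz / 44.1 kHz, which is a little over 306.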
In addition to the wider frequency range, if we’re going to mix two video sources in relation to each other, every pixel has to be precisely aligned, down to the nanosecond (this is the function of video sync and genlock). Otherwise, when you mix them, the images don’t line up. Furthermore, even though NTSC HSYNC is about 15.734 kHz (within audio range), that frequency would have to be an exact integer division of your audio sampling frequency in order to produce stable sync for video purposes. 13.5 MHz is the standard video sampling frequency because it divides down into clocks suitable for both NTSC and PAL timings.
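The divisibility point is easy to check. The sketch below assumes the usual line rates, 4.5 MHz / 286 (about 15,734.27 Hz) for NTSC and 15,625 Hz for PAL; those values and the `samples_per_line` helper are not from the post above, they are just what I plugged in for illustration. It uses exact fractions so rounding doesn’t hide anything.

```python
from fractions import Fraction

# Assumed horizontal line rates (not stated in the post itself).
NTSC_LINE_RATE_HZ = Fraction(4_500_000, 286)  # about 15,734.27 Hz
PAL_LINE_RATE_HZ = Fraction(15_625)

def samples_per_line(sample_rate_hz, line_rate_hz):
    """Exact number of sample clocks in one horizontal line."""
    return Fraction(sample_rate_hz) / line_rate_hz

for label, rate_hz in [("13.5 MHz video", 13_500_000),
                       ("44.1 kHz audio", 44_100),
                       ("48 kHz audio", 48_000)]:
    ntsc = samples_per_line(rate_hz, NTSC_LINE_RATE_HZ)
    pal = samples_per_line(rate_hz, PAL_LINE_RATE_HZ)
    # A denominator of 1 means a whole number of samples per line.
    print(f"{label}: NTSC {float(ntsc):.4f} (whole: {ntsc.denominator == 1}), "
          f"PAL {float(pal):.4f} (whole: {pal.denominator == 1})")
```

With these assumptions, 13.5 MHz gives a whole number of samples per line in both systems (858 for NTSC, 864 for PAL), while the common audio rates land on fractions of a sample, so a sync signal derived from them would not stay locked to the line.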
What’s so powerful about analog hardware instruments for modular video is that you don’t run out of memory bandwidth. Every single CV input and every single image generator has the same computational power as a CPU crunching 10,357,632 pixels per second!
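For anyone wondering where a number like that comes from: it matches 720 × 480 active pixels at 29.97 frames per second. That breakdown is an assumption on my part, not something spelled out above, but the arithmetic lines up exactly.

```python
from fractions import Fraction

# Assumed frame geometry and rate (an SD NTSC-style frame); these values
# are a guess at the source of the pixel-rate figure.
ACTIVE_WIDTH = 720
ACTIVE_HEIGHT = 480
FRAMES_PER_SECOND = Fraction(2997, 100)  # 29.97, kept exact

pixels_per_frame = ACTIVE_WIDTH * ACTIVE_HEIGHT
pixels_per_second = pixels_per_frame * FRAMES_PER_SECOND

print(f"Pixels per frame: {pixels_per_frame:,}")
print(f"Pixels per second: {int(pixels_per_second):,}")
```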