I've spent a while playing about with this, and will make a few comments and observations.
* Firstly, with regard to the injecting of CEGUI into existing apps, I know nothing about this at all; it's clearly hackish (which is fine, nobody likes a good hack more than me!) so results are always going to vary wildly. I guess you should get advice from the creator of the injection system.
* I imagine the issue with the module failing to load with that 'quota' error is tied to the above (if everything worked in a separate app, it must be).
* With regards to making it work with the earliest DDraw version, I got it to compile ok, though my 'flip' function (running in windowed mode) always failed with DDERR_SURFACEBUSY.
With regards to the code itself:
* In the function "DirectDraw7Renderer::doRender(void)" you're clearing the target surface every time (and doing it the hard way, too - why not Blt with DDBLT_COLORFILL?). If you're intending that surface to have been previously drawn to, you surely do not want to be doing this. If you're 'caching' the GUI output to a separate offscreen surface, then that's a different matter.
* You may like to dispense with the 'z' coordinate and the sorting of quads. This is generally not required unless you're employing a hack to manipulate imagery layering, so dropping it can save some cycles.
* Some of the preparation work, such as the setup of the source and dest rects and the colour value, in "DirectDraw7Renderer::doRender(RenderQuad *quad)" could be saved by storing that info precalculated in the RenderQuad structs you queue (ideally you should fill that structure with as close to the final data as possible in queueQuad, rather than recalculating it every rendering pass through the quad list). I know we're using similar techniques in the 0.6.x code, but it's totally wasteful and should have been fixed years ago.
* Ok. The BIG issue. Obviously, to get the required blending you're having to perform the blit operations manually on the processor, but I can't ever see you getting acceptable performance by doing this (I got between 10 and 11 FPS running the FirstWindow sample from CEGUI). I played about with various things here, but was unable to get any acceptable results from the accelerated functions. I don't know what to suggest as far as this goes, and clearly this is the key issue (I have no suggestions partly because I've forgotten most of what I ever knew about DDraw - I really struggled to get anything working at all).
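To illustrate the point about precalculating in queueQuad, here is a rough sketch of the idea. The names `Rect`, `PreparedQuad` and `QuadQueue` and the parameter list are made up for illustration and are not the actual renderer types; the rounding and colour packing are assumptions too. The rects and the packed colour are computed once when the quad is queued, and quads are kept in submission order with no 'z' and no sorting.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins, NOT the real CEGUI / DirectDraw types.
struct Rect { int left, top, right, bottom; };

struct PreparedQuad {
    Rect src;            // source rect in texture pixels, ready to use
    Rect dst;            // destination rect in screen pixels, ready to use
    std::uint32_t argb;  // colour already packed as 0xAARRGGBB
};

class QuadQueue {
public:
    // All conversion work happens once here, not on every render pass.
    void queueQuad(float x, float y, float w, float h,          // dest, pixels
                   float u0, float v0, float u1, float v1,      // source, 0..1
                   int texW, int texH,                          // texture size
                   float a, float r, float g, float b)          // colour, 0..1
    {
        PreparedQuad q;
        q.dst = { toInt(x), toInt(y), toInt(x + w), toInt(y + h) };
        q.src = { toInt(u0 * texW), toInt(v0 * texH),
                  toInt(u1 * texW), toInt(v1 * texH) };
        q.argb = (toByte(a) << 24) | (toByte(r) << 16) |
                 (toByte(g) << 8)  |  toByte(b);
        quads_.push_back(q);  // submission order is kept; no sorting
    }

    // The render pass only has to walk this list and blit each entry.
    const std::vector<PreparedQuad>& quads() const { return quads_; }
    void clear() { quads_.clear(); }

private:
    // Round to nearest; assumes non-negative coordinates.
    static int toInt(float v) { return static_cast<int>(v + 0.5f); }

    // Clamp a 0..1 colour component and scale it to 0..255.
    static std::uint32_t toByte(float c) {
        c = std::min(1.0f, std::max(0.0f, c));
        return static_cast<std::uint32_t>(c * 255.0f + 0.5f);
    }

    std::vector<PreparedQuad> quads_;
};
```

With something like this, the per-quad render function just reads `src`, `dst` and `argb` straight from each entry. It won't fix the blending cost, but it does take the setup work out of the inner loop.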
CE.