I can adapt the current Python code to run on the GPU, deploy it to your existing AWS cluster, and run the graphical processing in parallel.
I have worked with YOLO, OpenCV, Darknet and TensorFlow.
To reach the target performance (100 parallel REST API image requests, each answered within 1 second independently), we should first measure actual results and then tune the instance types and resources accordingly.
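As a rough illustration of the measurement step above, here is a minimal sketch of how the 100-request / 1-second budget could be checked. The endpoint and request function are hypothetical stand-ins (I don't have your API sources yet); a real test would POST images to the actual inference endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one REST image request; in a real load test
# this would be an HTTP POST of an image to the inference endpoint.
def fake_inference_request(i):
    start = time.perf_counter()
    time.sleep(0.05)  # simulated per-image model latency
    return time.perf_counter() - start

# Fire 100 requests in parallel and check the 1-second budget per request.
with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(fake_inference_request, range(100)))

print(max(latencies) < 1.0)  # every request must finish within 1 s
```

Once this runs against the real endpoint, the worst-case latency tells us whether we need more GPU instances or request batching.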
I expect standard NVIDIA CUDA based Docker containers to work without problems. If that turns out to be an issue, we can try one of the alternative container options.
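For reference, this is the kind of container setup I mean. The CUDA image tag is just an example and would need to match the CUDA version your TensorFlow/Darknet builds expect; it assumes the NVIDIA Container Toolkit is installed on the host.

```shell
# Pull an official NVIDIA CUDA runtime image (example tag; adjust to the
# CUDA version required by the framework builds in use).
docker pull nvidia/cuda:12.2.0-runtime-ubuntu22.04

# Run with GPU access and confirm the GPU is visible inside the container.
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi
```

If `nvidia-smi` lists the GPU inside the container, the same base image can host the inference service.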
To move forward, I would need the REST API sources, the deployment details, and the versions currently in use.
I develop and test on my own instances, then deploy on yours.
I will need a few days for this, since it requires a lot of reading, trying and testing.
Regards.