GPL code is the least of the concerns: you could always just declare the AI-generated code GPL as well. What about training on leaked proprietary code? The training data is already known to include medical records, CSAM, etc., so it wouldn't be surprising if it also contained proprietary code.
They actually plan to use some local models in addition to the rest; otherwise, I don't trust anything from OpenAI/Google.