So, we have had a call open with MSPS for more than a few days. The problem got worse yesterday, so much so that we escalated it from SEV B to SEV A. That finally seemed to get MSPS's attention.
We enabled troubleshooting logging for OLK2011 and uploaded the logs to MS. Within a short time, the Service Health Dashboard (SHD) was showing an issue under investigation. A little while later, the SHD changed to a more specific notice about EWS and application memory (Incident ID EX3517):
Microsoft is working to restore service for some customers served from the European region that are experiencing intermittent connectivity issues when using the Exchange Web Service (EWS). Outlook Web App and Outlook client connectivity is unaffected. Investigation determined that memory utilization and CPU consumption for EWS specific service was unexpectedly high. Exchange Online engineers are recycling the affected EWS application pools to mitigate end-user impact.
As seems to be common with MS, it is up to the customer to prove that the issue is not their own but Microsoft's. You have to jump through hoops to drive the point home before they act. I am surprised there isn't a more proactive approach to what appear to be resource issues on application pools, since most IT departments monitor their own in-house services. I am even more surprised that it sounds like they rebooted a few servers to fix things.
Oh, and we're not seeing any issues today.