Sorry guys, I didn't realise Kev was relaying my quick WhatsApp replies to the forum while I was trying to monitor the spikes. If I had known, I would have been a bit more informative. Since a few of you may or may not understand what most of it actually means, I'll try to give an overview of the findings.
So, although the site is behind Cloudflare, that only really helps when the attack is forced through the proxy and the origin server is unknown. When you are faced with relatively small front-facing attacks plus UDP floods and port flooding at the origin, Cloudflare doesn't help a great deal. Given how small the attacks were, they shouldn't have been making a dent, so we knew something else was at play but couldn't pinpoint it.
It was so hit and miss, perfectly fine one moment and not the next, with load averages normally sitting around the 1-2 range, that it wasn't as simple as looking for traffic and blocking or filtering it in Cloudflare. We eventually found the compounding factor during one of the spikes.
Bash:
%Cpu(s): 20.2 us, 12.5 sy, 3.0 ni, 19.9 id, 0.0 wa, 0.0 hi, 7.4 si, 45.1 st
The figure to look at is the last one, 45.1 st, which is steal time. This is what we call noisy neighbours: other VPSs on the node stealing resources and limiting ours, which slows down our processes. That soon compounds considerably when you have as many users as we do, plus small attacks and normal resource usage.
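For anyone who wants to pull that figure out of their own `top` output, here's a rough sketch in Python. The function name `parse_cpu_line` is made up for illustration, and it assumes the summary line looks like the one quoted above (number, then a two-letter label, with `st` being steal).

```python
import re

def parse_cpu_line(line: str) -> dict:
    """Turn a top '%Cpu(s):' summary line into {'us': 20.2, ..., 'st': 45.1}."""
    # Each field is a number followed by a two-letter label, e.g. "45.1 st".
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s+([a-z]{2})\b", line)
    return {label: float(value) for value, label in pairs}

line = "%Cpu(s): 20.2 us, 12.5 sy, 3.0 ni, 19.9 id, 0.0 wa, 0.0 hi, 7.4 si, 45.1 st"
fields = parse_cpu_line(line)
print(fields["st"])  # steal time as a percentage -> 45.1
```

On a box that isn't being squeezed by the host, `st` normally sits at or very near zero.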
The spikes are caused by external hypervisor contention (steal time), amplified by occasional CPU-heavy PHP requests on our side, so the load climbs out of all proportion to the actual work being done.
This is further confirmed by our monitoring graphs. During the spike at 20:13 today (and multiple times since), CPU usage on the VM dropped to 15% while load average climbed to 45. Memory usage stayed flat. Network TX briefly dropped to zero.
This pattern is consistent with CPU starvation by the hypervisor: the workload on the server wanted to run but couldn't get scheduled. If it were a localised load spike, CPU usage would likely increase, possibly to 100%+, not drop.
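To put that reasoning in concrete terms, here's a minimal sketch of the check in Python. It's only a heuristic: the function name, the 50% and 10% thresholds and the core count are all my own placeholders, not anything our monitoring actually uses.

```python
def looks_like_cpu_starvation(cpu_busy_pct, load_avg, cores, steal_pct,
                              steal_threshold=10.0):
    """Rough heuristic: work is queued up, but the guest isn't getting CPU time."""
    queue_backed_up = load_avg > cores                      # more tasks waiting than cores
    guest_not_busy = cpu_busy_pct < 50.0                    # the VM itself reports low usage
    hypervisor_taking_time = steal_pct >= steal_threshold   # host is holding CPU back
    return queue_backed_up and guest_not_busy and hypervisor_taking_time

# CPU and load are the 20:13 figures, steal is from the top output captured
# during one of the spikes. The core count of 4 is a placeholder.
print(looks_like_cpu_starvation(cpu_busy_pct=15, load_avg=45, cores=4, steal_pct=45.1))
# -> True
```

A genuine local load spike would fail the second condition, because the VM's own CPU usage would be pinned high rather than sitting at 15%.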
I've raised this with the datacenter and am awaiting their response. Typically this means the DC will offer to move the VM to a new node with less contention. So for the time being we may see the same issue for a day or so, and once the VM has been moved we'll monitor further to make sure that was actually the cause.
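For the follow-up monitoring, one way to track steal without eyeballing `top` is to compare two readings of the first `cpu` line in `/proc/stat`. This is just a sketch of the arithmetic, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). The function name and the sample numbers are invented.

```python
def steal_percent(prev_line: str, curr_line: str) -> float:
    """Percentage of CPU time stolen between two samples of the 'cpu' line in /proc/stat."""
    def parse(line):
        # Fields after the 'cpu' label: user nice system idle iowait irq softirq steal ...
        values = [int(v) for v in line.split()[1:9]]
        return sum(values), values[7]

    prev_total, prev_steal = parse(prev_line)
    curr_total, curr_steal = parse(curr_line)
    elapsed = curr_total - prev_total
    if elapsed <= 0:
        return 0.0  # no time passed between samples, nothing to report
    return 100.0 * (curr_steal - prev_steal) / elapsed

# Made-up sample values, purely to show the arithmetic.
before = "cpu 1000 0 500 8000 0 0 100 400 0 0"
after = "cpu 1100 0 550 8100 0 0 110 540 0 0"
print(steal_percent(before, after))  # -> 35.0
```

In practice the two lines would come from reading the first line of `/proc/stat` a few seconds apart. If the move fixes things, this number should stay close to zero even when the site is busy.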