Used 51, which is the latest; same problem.
I'm guessing it's due to our use of two out-of-tree modules. I have a coredump, just need to dig through it later.
Over the past couple of weeks the site has randomly become extremely sluggish (5-30 second page loads). At the moment it's pretty horrible, though there are occasional periods when everything loads quickly.
Is the site becoming disk I/O bottlenecked, or is this the SSL problem again?
IO load, and I'm not willing to buy more hardware.
I've implemented some hard speed limits:
zip downloads now max out at 5 Mbit/s, with one connection max
samples max out at 5 Mbit/s per connection, with five connections max
original images (image/jpeg) max out at 1 Mbit/s per connection, with five connections max
thumbnails are unchanged
So in total you can get 35 Mbit/s by downloading a zip and 10 images at once.
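For the curious, this is just nginx's limit_rate/limit_conn. Roughly along these lines -- a simplified sketch, not the literal config; the paths, zone names and location matching here are made up, and it assumes a build with the stock limit_conn_zone support:

# Sketch only: one connection-counting zone per content type, keyed by client IP.
# Zone definitions live at http{} level; the locations go inside the existing server{}.
cat > /etc/nginx/conf.d/speed-caps.conf <<'EOF'
limit_conn_zone $binary_remote_addr zone=zip_ip:10m;
limit_conn_zone $binary_remote_addr zone=sample_ip:10m;
limit_conn_zone $binary_remote_addr zone=image_ip:10m;
EOF

cat > /etc/nginx/speed-caps-locations.inc <<'EOF'
# pool zips: ~5 Mbit/s (625 KB/s) per connection, 1 connection per IP
location /data/zips/   { limit_conn zip_ip 1;    limit_rate 625k; }
# samples: ~5 Mbit/s per connection, 5 connections per IP
location /data/sample/ { limit_conn sample_ip 5; limit_rate 625k; }
# original jpegs: ~1 Mbit/s (128 KB/s) per connection, 5 connections per IP
location /data/image/  { limit_conn image_ip 5;  limit_rate 128k; }
# thumbnails deliberately left alone
EOF

# the .inc gets an `include /etc/nginx/speed-caps-locations.inc;` line inside
# the existing server{} block, then:
nginx -t && nginx -s reload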
The site seems nice and speedy now, but maybe only temporarily(?), since it seems you reloaded the server, which kicked everybody off.
Peak time is around 7am to 12pm PST, so once tomorrow arrives we'll see how fast it is...
The extended downtime earlier for the config change was some stupidity on my part; never do sysadmin stuff right after you wake up!
On second thought, 100KB/s for downloading original images and samples seems to be a bit painful.
How about:
zip downloads = 512KB/s (4 Mbps) max | one connection max
samples and original images = 1024KB/s (8 Mbps) combined max | combined four connections max
Or possibly:
zip downloads = one connection max
samples and original images = combined four connections max
global speed limit per IP: 1280KB/s (10 Mbps)
But like you said, I guess we'll wait until tomorrow to see if the current restrictive limits you set help. The most important thing is that all the non-image content and thumbnails continue to load quickly during peak time.
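For what it's worth, the "combined four connections" part looks expressible by pointing both locations at one shared limit_conn zone -- untested sketch below, reusing the hypothetical file and zone names from the sketch above. A true combined byte rate is the harder bit, since limit_rate is per connection rather than per IP:

cat >> /etc/nginx/conf.d/speed-caps.conf <<'EOF'
limit_conn_zone $binary_remote_addr zone=media_ip:10m;
EOF

cat > /etc/nginx/speed-caps-locations.inc <<'EOF'
location /data/zips/   { limit_conn zip_ip 1;   limit_rate 512k; }  # ~4 Mbit/s
# samples + originals share one 4-connection budget per IP;
# 256k x 4 connections only approximates the proposed 1024 KB/s combined cap
location /data/sample/ { limit_conn media_ip 4; limit_rate 256k; }
location /data/image/  { limit_conn media_ip 4; limit_rate 256k; }
EOF

nginx -t && nginx -s reload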
I might script it so the speed limits only apply during peak time, but we'll see how everything goes after the OS upgrade later tonight.
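If I do end up scripting it, it'd probably just be cron swapping which include file nginx loads and then reloading -- a rough, untested sketch (file paths and the peak window are placeholders):

#!/bin/sh
# toggle-caps.sh (sketch): enable the speed-cap locations during peak hours,
# disable them otherwise, then reload nginx. Assumes the box runs on PST/PDT.
case $(date +%H) in
    07|08|09|10|11) CONF=/etc/nginx/speed-caps.peak ;;  # ~7am-12pm peak
    *)              CONF=/etc/nginx/speed-caps.off  ;;
esac
cp "$CONF" /etc/nginx/speed-caps-locations.inc
nginx -t && nginx -s reload
# cron entry: 0 * * * * /usr/local/sbin/toggle-caps.sh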
So the limits after the upgrade are basically the same except:
Now: 128KB/s (~1 Mbit) per connection on original images.
Previously: 100KB/s (~0.8 Mbit) per connection on original images.
Now: Original + Sample share a combined limit of 5 connections.
Previously: Original + Sample had 5 connections each.
Now: Unlimited speed on pool zip downloads.
Previously: 512KB/s (~4 Mbit) limit.
Again, not willing to throw more money at the problem.
Please understand that.
I was just observing the changes since the update. The new limits seem to work decently for preventing lag during peak.
That said, after the limits were put in place the server itself certainly seems to have a lot of bandwidth and I/O to spare, considering pools instantly max out my 50Mbit connection with only a single connection during the height of peak time. I'm not suggesting you throw more money at the problem, just that you play around with different limiting schemes. I suspect higher burst speeds through the use of global limits would offer a superior user experience compared to per-connection limits.
nginx doesn't have per-IP limiting of traffic.
I also hadn't tested the site without limiting since the upgrade, so the caps are off for a few days to judge how well it performs.
Once IO load gets too high, all the other hosted VMs perform poorly, which affects a lot of things down the line.
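The closest approximation I'm aware of with the stock directives is to bound the product: cap connections per IP with limit_conn and cap each connection with limit_rate, so an IP's total can't exceed connections x rate. Sketch only, reusing the made-up zone name from earlier:

cat > /etc/nginx/per-ip-ceiling.inc <<'EOF'
# ~10 Mbit/s ceiling per IP on originals: 5 connections x 256 KB/s each.
# Downside: a single download is still stuck at ~2 Mbit/s, so bursts suffer.
location /data/image/ { limit_conn image_ip 5; limit_rate 256k; }
EOF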
Any idea why the server was performing significantly better during the few days prior to the raid scrub, compared to after it completed? Were you running non-redundant?
I think at the time you posted it was pretty much peak hour. I can also explain the improved performance before the scrub:
We also switched back to the deadline I/O scheduler; since the upgrade to 12.04 we had been stuck on CFQ, which impacted things greatly.
I also adjusted the read-ahead values on the underlying disks of the RAID during the scrub so it would finish faster; I don't know whether that has an impact on day-to-day performance currently.
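For anyone who wants to poke at the same knobs, they're just sysfs settings plus blockdev -- the device names and values below are only illustrative, not necessarily what's in use here:

# pick the deadline elevator for each member disk of the array
echo deadline > /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler
cat /sys/block/sdb/queue/scheduler          # verify: the active one shows in [brackets]

# read-ahead is set in 512-byte sectors; 4096 sectors = 2 MB
blockdev --setra 4096 /dev/sdb
blockdev --setra 4096 /dev/sdc
blockdev --getra /dev/sdb                   # verify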
All I know is that for the few days prior to the RAID scrub, the server had no lag even during peak and was extremely responsive. Your performance graphs show the same anomaly, and also that this period of excellent, stable performance started half a week after the upgrade to 12.04.
http://img841.imageshack.us/img841/3562/diskstatslatencyweek.png
http://img850.imageshack.us/img850/558/diskstatsutilizationwee.png
http://img440.imageshack.us/img440/2008/iostatiosweek.png
We had fewer visitors that week; this Saturday we got 10% more than normal, which would explain that.
Would a 10% increase in visitors really explain a 325% increase in disk latency and a 25% increase in disk utilization, along with an entirely different performance profile? Unless an entire country's worth of abusive traffic decided not to access the site for a few days, I'd still find the sudden change in performance on that Tuesday suspicious.
Peak time performance before RAID scrub:
13k packets/sec
1.9k connections
200Mbps traffic
80% Disk Utilization
200ms latency
Off-Peak time performance before RAID scrub:
7k packets/sec
0.8k connections
100Mbps traffic
50% Disk Utilization
200ms latency
to
Peak time performance after RAID scrub:
14k packets/sec
2.1k connections
220Mbps traffic
100% Disk Utilization
650ms latency
Semi-Peak time performance after RAID scrub (for comparison):
12k packets/sec
1.6k connections
180Mbps traffic
70% Disk Utilization
400ms latency
Off-Peak time performance after RAID scrub:
7k packets/sec
0.8k connections
100Mbps traffic
60% Disk Utilization
300ms latency
I'd suspect that yande.re has a mostly random access pattern, so the increased read-ahead may account for some of the performance degradation, but it still wouldn't explain the change which occurred on that Tuesday.
Remember, I host a few other sites besides this one; I don't know their traffic patterns, but they all share the same disk.
That was actually a thought which came to mind, since you recently kicked off all those VN sites. Though it would mean that one of the remaining hosted sites must be very I/O intensive with horrible caching to have such a large effect on disk load and latency. And if you're still stuck in a completely filled 1U, I assume it's not possible to move the hosted sites off the RAID, since there's no room to add a separate physical drive.
If this really is the case, I guess the only good news to take from it is that yande.re traffic isn't the primary cause of the I/O bottleneck which emerged out of nowhere on your server this past month. It probably says a lot about how well yande.re has been optimized over the years.
Well, we'll see what happens. There are some options I can take with the VM owners to reduce their IO, but they would degrade the experience of others.
Tracked down one of the causes of the IO problems: a minecraft server, which apparently is /very/ IO intensive. Luckily there's some extra space on the SSD, so I'll just give them some space on that in the near future.
Another option may be to move the minecraft server to a small ramdisk and back it up to the raid daily.
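Something along these lines, for example (untested; the paths and the tmpfs size are guesses):

# run the world out of tmpfs, sync it back to the RAID once a day
mkdir -p /mnt/mc-ram
mount -t tmpfs -o size=2g tmpfs /mnt/mc-ram
rsync -a /srv/minecraft/ /mnt/mc-ram/       # seed the ramdisk from the on-disk copy
# point the minecraft server at /mnt/mc-ram, then back it up nightly via cron:
# 0 5 * * * rsync -a --delete /mnt/mc-ram/ /srv/minecraft/

The obvious catch is that a crash or reboot loses up to a day of world data.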
Still dealing with IO problems; download rates have been throttled until resolved.
Might be resolved; I need some trending data before uncapping downloads again. Will check back in 12 hours.
It seems you're still having issues during peak time. Disk utilization at 100% for the past few hours.
Been waiting on a shipment for a 2U case, an SSD, and multiple HDDs.
Site speed is also capped again until those arrive.
Once those come in and are set up, I should be able to remove the caps.
Disk utilization has been at 100% for the past few hours, even with the caps in place; it's a bit strange to see the server being hammered so much on a Tuesday. I guess it's a good thing you ultimately decided to upgrade to a 2U, since something just caused your I/O performance to degrade again during the past day or so.
We're back to reduced functionality; the RAID is rebuilding to add the final drive to the array. Once that's done I'll remove the speed caps.
So I implemented an auto-ban system for abusive bots. If you generate 30 503s in 10 minutes, you'll get banned for an hour.
Hopefully there won't be too many false positives.
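For anyone wanting something similar on their own box, a fail2ban jail watching the access log gets you most of the way. This is just a generic sketch with those numbers plugged in, not what's actually running here:

# ban an IP for an hour after 30 responses with status 503 inside 10 minutes
cat > /etc/fail2ban/filter.d/nginx-503.conf <<'EOF'
[Definition]
failregex = ^<HOST> -.*"(GET|POST|HEAD)[^"]*" 503 \d+
ignoreregex =
EOF

cat >> /etc/fail2ban/jail.local <<'EOF'
[nginx-503]
enabled  = true
filter   = nginx-503
port     = http,https
logpath  = /var/log/nginx/access.log
maxretry = 30
findtime = 600
bantime  = 3600
EOF

service fail2ban restart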
I assume you started scrubbing the array 6 hours ago (1AM PDT), and this sudden high load since then isn't a disk failure?
Cyberbeing:
http://nginx.org/patches/spdy/
http://nginx.org/patches/spdy/CHANGES.txt
They must have their work cut out for them if the August 8th version is still leaking and producing segfaults.