Site changes | yande.re

Cyberbeing

2012-08-11 04:53:10 UTC

I assume you've already tried the latest version of the SPDY patch?

http://nginx.org/patches/spdy/

http://nginx.org/patches/spdy/CHANGES.txt

They must have their work cut out for them if the August 8th version is still leaking and producing segfaults.

admin2

2012-08-11 05:51:00 UTC

used 51, which is latest same problem

its due to our usage of two out of tree modules that I'm guessing is causing the problem. I have a coredump, just gotta dig though it later

Cyberbeing

2012-08-25 14:52:01 UTC

Randomly over the past couple weeks the site has become extremely sluggish (5-30 second page loads). At the moment it's pretty horrible, though randomly there will be periods when everything loads quickly.

Is the site becoming Disk I/O bottlenecked, or is this SSL problems again?

admin2

2012-08-25 15:25:58 UTC

ioload, not willing to buy more hardware.

I've implemented some hard speed limits

zip downloads now max out at 5mbit with one connection max
samples max out at 5mbit per connection with five connections max
image/jpeg max out at 1mbit per connection with five connections max
thumbnails are unchanged

so in total you can get 35mbit by downloading a zip and 10 images at once

Cyberbeing

2012-08-25 16:07:27 UTC

The site seems nice and speedy now, but maybe only temporary (?) since it seems you reloaded the server which kicked everybody off.

admin2

2012-08-25 16:12:12 UTC

peak time is around 7am to 12pm PST so once tomorrow arrives we'll see how fast it is...

extended downtime earlier for the config change was some stupidity on my part, never do sysadmin stuff when you just wake up!

Cyberbeing

2012-08-25 16:53:29 UTC

On second thought, 100KB/s downloading original images and samples seems to be a bit be a painful.

How about:
zip downloads = 512KB/s (4 Mbps) max | one connection max
samples and original images = 1024KB/s (8 Mbps) combined max | combined four connections max

Or possibly:
zip downloads = one connection max
samples and original images = combined four connections max
global speed limit per IP: 1280KB/s (10Mbps)

But like you said, I guess we'll wait and see how it is tomorrow to see if these current restrictive limits you set help. The most important thing is that all the non-image content and thumbnails continue to load quickly during peak time.

admin2

2012-08-25 17:03:27 UTC

I might script it where the speed limits only apply during peak time, but we'll see how everything goes after the OS upgrade later tonight.

Cyberbeing

2012-08-27 14:49:45 UTC

So the limits after the upgrade are basically the same except:

Now: 128KB/s (~1Mbit) per connection on original images.
Previously: 100KB/s (~0.8 Mbit) per connection on original images.

Now: Original + Sample share a combined limit of 5 connections.
Previously: Original + Sample had 5 connections each.

Now: Unlimited speed on pool zip downloads.
Previously: 512KB/s (~4Mbit) limit.

admin2

2012-08-30 03:33:16 UTC

Again, not willing to throw more money at the problem.
Please understand that.

Cyberbeing

2012-08-30 06:52:56 UTC

I was just observing the changes since the update. The new limits seems to work decently for preventing lag during peak.

That said, after the limits were put in place the server itself certainly seems to have a lot of bandwidth and I/O to spare, considering pools instantly max out my 50Mbit connection with only a single connection during the height of peak time. I'm not suggesting you throw more money at the problem, just play around with different limiting schemes. Higher burst speeds through use of global limits I suspect would offer a superior user experience compared to per-connection limits.

admin2

2012-08-30 14:20:15 UTC

nginx doesn't have per ip limiting of traffic.

I also didn't test the site without limiting since the upgrade. So caps are off for a few days to judge how well it performs.

Once ioload gets too high all the other hosted VMs perform poorly which effects a lot of things down the way.

Cyberbeing

2012-09-09 02:35:14 UTC

Any idea why the server was performing significantly better during the few days prior to the raid scrub, compared to after it completed? Were you running non-redundant?

admin2

2012-09-09 22:27:24 UTC

I think at the time you posted it was pretty much peak hour. I can also explain the improved performance before the scrub:

We also switched back to deadline io scheduler, since the upgrade to 12.04 we were stuck on cfq which impacted things greatly.

I've adjusted read ahead values on the underlaying disks on the raid during the scrub so it would finish faster, I don't know if it has a impact on day to day currently.

Cyberbeing

2012-09-10 02:13:01 UTC

All I know is that for the few days prior to the RAID scrub, the server was having no lag even during peak and was extremely responsive. Your performance graphs show the same anomaly, and also that this period of excellent stable performance started a half a week after the upgrade to 12.04.

http://img841.imageshack.us/img841/3562/diskstatslatencyweek.png
http://img850.imageshack.us/img850/558/diskstatsutilizationwee.png
http://img440.imageshack.us/img440/2008/iostatiosweek.png

admin2

2012-09-10 02:42:03 UTC

we had less visitors that week, this Sat we got 10% more then normal, would explain that

Cyberbeing

2012-09-10 03:32:08 UTC

Would a 10% increase in visitors really explain a 325% increase in disk latency, 25% increase in disk utilization, along with an entirely different performance profile? Unless an entire country worth of abusive traffic decided to not access the site for a few days, I'd still find the sudden change of performance on that Tueday suspicious.

Peak time performance before RAID scrub:
13k packets/sec
1.9k connections
200M traffic
80% Disk Utilization
200ms latency

Off-Peak time performance before RAID scrub:
7k packets/sec
0.8k connections
100Mbps traffic
50% Disk Utilization
200ms latency

to

Peak time performance after RAID scrub:
14k packets/sec
2.1k connections
220Mbps traffic
100% Disk Utilization
650ms latency

Semi-Peak time performance after RAID scrub (for comparison):
12k packets/sec
1.6k connections
180Mbps traffic
70% Disk Utilization
400ms latency

Off-Peak time performance after RAID scrub:
7k packets/sec
0.8k connections
100Mbps traffic
60% Disk Utilization
300ms latency

I'd suspect that yande.re has a mostly random access pattern, so how you increased the read-ahead may account for some of the performance degradation, but it still wouldn't explain the change which occurred on that Tuesday.

admin2

2012-09-10 14:08:50 UTC

remember I host a few other sites besides this one, I don't know their traffic patterns but they all share the same disk

Cyberbeing

2012-09-11 00:07:50 UTC

That was actually I thought which came to mind since you recently kicked off all those VN sites. Though it would mean that one of the other hosted sites which remains must be very I/O intensive with horrible caching to have such a large effect on disk load and latency. Though if you're still stuck in a completely filled 1U, I assume it's not possible to kick the hosted sites off the RAID since there is no room for adding a separate physical drive.

If this really is the case, I guess the only good news to take from it is yande.re traffic isn't the primary cause of the I/O bottleneck on your server which emerged out of nowhere this past month. Probably says a lot for how well yande.re has been optimized over the years.

admin2

2012-09-11 01:50:14 UTC

well, we'll see what happens, there's some options I can take with the VM owners to reduce their IO, but it would degrade the experience of others.

admin2

2012-09-13 23:15:39 UTC

tracked down one of the cause of IO problems, a minecraft server apparently is /very/ IO intensive, luckily there's some extra space on the SSD so I'll just give them some space on that in the near future.

Cyberbeing

2012-09-13 23:56:21 UTC

Another option may be to move the minecraft server to a small ramdisk and back it up to the raid daily.

admin2

2012-09-24 14:24:52 UTC

Still dealing with IO problems, download rates have been throttled until resolved.

admin2

2012-09-25 01:39:08 UTC

Might be resolved, need some trending data before uncapping downloads again. Will check back in 12hours.

Cyberbeing

2012-09-30 08:54:39 UTC

It seems you're still having issues during peak time. Disk utilization at 100% for the past few hours.

admin2

2012-10-01 01:41:32 UTC

been waiting on a shipment for 2u case, ssd, and multiple hdds
site speed is also capped again until those arrive

once those come in and setup, should be able to remove the caps

Cyberbeing

2012-10-02 16:44:42 UTC

Disk utilization has been at 100% for the past few hours, even with the caps in place, which is a bit strange to see the server being hammered so much on a Tuesday. I guess it's a good thing you ultimately decided to upgrade to a 2u, since something just caused your I/O performance to degrade again during the past day or so.

Name
Email
Password
Confirm Password