Questions about the API and guidelines on how to reduce load
Introduction
I have been working on a Danbooru browser similar to Danbooru Client (I haven't tried that one though). It is written in PHP and accessed through your browser, as I doubt I would be able to top Opera's tab handling and navigation without a lot of work. (Opera is just too awesome in that regard imo.)
I started it because I wanted to fix some issues in the layouts and I was getting tired of keeping my user CSS up to date (which was how I had solved it until then). Mainly I wanted as much screen space as possible, so I was getting rid of site logos, huge margins and other useless stuff. Imouto and Konachan weren't too bad, but some of the others had quite a few issues. For example, Sankaku chan once had an issue where the whole site broke if the window width became less than something like 1024px. (Mixing % and px, bad idea! And lol, they actually still haven't fixed it.)
Anyway, I don't know how you site owners feel about such applications, so I wouldn't make it public unless you are okay with it.

Questions about how API calls affect server load
Moving on to the real topic: since I'm using PHP, all data retrieved through the API is lost when the page finishes loading, so if you go from the index to a specific post it needs to fetch the post information again, even though it had it on the previous page. I'm currently rewriting pretty much everything to cache as much as possible in a database in order to reduce the number of requests significantly.
I want to start working on caching searches. However, once I get that done, it is no longer necessary to use a fixed limit that equals the number of posts on the page. I could continue using a limit of 24 on /post/index.xml, or I could increase it to 100. Obviously the latter would decrease the average page loading time on my end, but how does limit affect the load on your servers? I want to find a balance between reducing loading time on my end and avoiding too much load on your servers.

Another question: Updating a cache will be required every once in a while, and if I want to fetch the data about a post (let's say with id=267) I could request /post/index.xml?tags=id:267 . However I could also use something like /post/index.xml?tags=id:218..317&limit=100 and attempt to fetch some random posts at the low chance of getting something useful. Again, how does it affect load? Any other suggestions for efficient queries?
I'm still not quite sure how I should calculate when to refresh a certain search or post though. If anyone has suggestions I'm all ears, but it is not that important as I do not intend to work on it right now.

How limit affects load has been puzzling me a bit, since there is a hard limit for /post/index.xml, but a way to fetch all tags on /tag/index.xml. On Gelbooru, for example, this results in a massive 16MB file with over 200,000 tags. Really useful for me, but I guess it isn't something you would want happening too often...

Questions about the API
For post data:
First of all, what the heck is 'frames_pending_string' and 'frames_string'? I have never seen anything in those fields and only Yande.re and Konachan have it. I'm starting to expect some kind of awesome special feature, don't disappoint me now ;)
Secondly (also Yande.re and Konachan only), why do you not provide 'has_comments' and 'has_notes' like every other Booru I have seen? It would require me to throw two extra API calls for every post if I want to include them, even when a post does not have any. Notes are pretty much unused here and I can live without the comments, so I will just disable them to improve performance.
What is 'change' btw?

With /tag/related.xml, how do you get the related tags for a complete search and not just a single tag? If I try to space separate them, it returns a list for each tag.
Btw, does anyone know how to do this on Gelbooru-based boards? There is no documentation at all >:|

Is there any way of finding which pool a certain post is in? I have found that you could use 'id:[some id] pool:*' to check if a post is in a pool or not. Same thing as comments and notes, I will not make an extra call, but I might make a 'search for pool' option. (I was just thinking it would be cool to have some sort of book view for the manga pools on other Boorus.)

Screenshots
In case you are curious... It is still a bit crude since I just rewrote it though.

Comparison of the main post page: (Blacklisting turned off!)
Yande.re: http://i.imgur.com/2sloU.jpg
BooruSurfer: http://i.imgur.com/wt9oM.jpg
Not too different, since I stole the colors from Yande.re (as I've never been good at designing stuff), but the cool thing is it looks and works like this on all supported Boorus (seen at the top; ignore that it says Konachan, I accidentally hardcoded that and haven't fixed it yet).
I hadn't noticed it before, but what are you basing the tags on the left on? The numbers are quite low and don't appear to change on other pages.
Anyway, the preview feature I use is pure CSS based, so you can't enable/disable it with the shift key. Just a quick thing I added yesterday after all. (The images are also sized with the object-fit property which is currently only implemented in Opera. I haven't bothered using something more compatible yet, object-fit is quite nice...)

Post page:
Yande.re: http://i.imgur.com/uACJG.jpg
BooruSurfer: http://i.imgur.com/BXupT.jpg
Haven't done much here, but notice the much smaller margins. Also, previews of parent and child posts. (Hover effect works, but no details because I forgot this feature when I was adding it yesterday.) It is a small but very lovely feature imo, you should add something similar here as well.

Also, automatic resizing. I think you could turn this on here as well, but IIRC it didn't resize when you resized your browser window. Again, no JavaScript shit here and AFAIK it should work with everything except IE6. (The details on hover as well, but might not work if inserted by JavaScript in IE7-8, not sure.)
Both: http://i.imgur.com/VycTR.jpg
(Subtle difference: Yande.re resizes the info panel, while BooruSurfer keeps it fixed.)

I'm thinking of fetching all reduced/original images through a simple proxy, partly to get around hot-linking protection but also to enable custom filenames. Imagine Konachan/Yande.re-like filenames on all Boorus.
Other Boorus tend to use a lot of tags though, and you shouldn't use file paths longer than 260 characters on Windows. What do you use as a limit here?
To get the most relevant tags (which is subjective), I was thinking of making the filename like so: 'site_name post_id - artist_tags - copy_tags - character_tags - other_tags.ext'. When the filename becomes too long, take the last group of tags, sort them by tag count and remove the least common tags. Repeat through the other tag groups until the filename is short enough. Suggestions, anyone?
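To make the idea concrete, here is a rough Python sketch of that truncation scheme (the 260-character limit and the group order are from above; `tag_counts` is an assumed mapping from tag to post count that would come from my cached tag data):

```python
def build_filename(site, post_id, tag_groups, tag_counts, ext, max_len=260):
    """Build 'site post_id - artists - copyrights - characters - other.ext',
    dropping the rarest tags from the last groups until it fits."""
    # tag_groups: list of tag lists, ordered most important first
    groups = [list(g) for g in tag_groups]

    def render():
        parts = [f"{site} {post_id}"] + [" ".join(g) for g in groups if g]
        return " - ".join(parts) + ext

    # Walk the groups from least important (last) to most important (first)
    for i in range(len(groups) - 1, -1, -1):
        # Sort so the least common tag sits at the end and is removed first
        groups[i].sort(key=lambda t: tag_counts.get(t, 0), reverse=True)
        while len(render()) > max_len and groups[i]:
            groups[i].pop()          # drop the rarest remaining tag
        if len(render()) <= max_len:
            break
    return render()
```

Dropping whole groups only when necessary keeps the artist and copyright tags around the longest, which is usually what you want in a filename.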

EDIT: Oh, the joy of posting on a new forum; second edit now to make the formatting work. Can't preview because: '414 Request-URI Too Large'...
It looks nice, keep going. The features you suggest, like searching for posts that have comments, notes or belong to a pool, have been suggested before and could be added in the future. I hope so.
Oh man, what a wall of text.

We forked off of danbooru a /long/ time ago and we've not really followed its additional API features.

To be honest, such discussion could happen in the IRC* channel instead of the forums. Anyway, some answers below:

Anyway, I don't know how you site owners feel about such applications, so I wouldn't make it public unless you are okay with it.
I don't really care, the API is there to be used heh, not sure about others!

server load stuff
Currently we're not really bottlenecked; can't say for others. Quite a few people use limit:1000 and it's more of an annoyance than a problem currently.

frame stuff and API differences
It's used for post/browse and a job task that runs; the API differences are the same as above. 'change' is an id you can use to see if a post's tags were updated or not (I think!)

I'll have our dev try to answer the other questions, but no promises. The 414 too large thing is probably a bug as well. woohoo

*#moe-imouto@irc.rizon.net
Obviously the latter would decrease the average page loading time on my end, but how does limit affect the load on your servers? I want to find a balance between reducing loading time on my end and avoiding too much load on your servers.
The load for an XML request is much lower than for an HTML request.

Another question: Updating a cache will be required every once in a while, and if I want to fetch the data about a post (let's say with id=267) I could request /post/index.xml?tags=id:267 . However I could also use something like /post/index.xml?tags=id:218..317&limit=100 and attempt to fetch some random posts at the low chance of getting something useful. Again, how does it affect load? Any other suggestions for efficient queries?
You can save the last maximum change you have fetched and query everything after that: change:1094888.. (see below)
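In code, the idea could look something like this (a rough Python sketch; `fetch_page` is a hypothetical wrapper around a /post/index.json request, and the paging loop is just one way to drain results when more posts changed than fit in one page):

```python
def sync_changes(fetch_page, last_change, limit=100):
    """Fetch every post whose 'change' id is newer than last_change.

    fetch_page(tags, page, limit) -> list of post dicts; in practice it
    would wrap /post/index.json?tags=...&page=...&limit=... (hypothetical).
    Returns the updated posts and the new maximum change id to store."""
    tags = f"change:{last_change + 1}.."   # everything newer than our cache
    updated, page, max_change = [], 1, last_change
    while True:
        posts = fetch_page(tags, page, limit)
        updated.extend(posts)
        max_change = max([max_change] + [p["change"] for p in posts])
        if len(posts) < limit:             # short page: nothing left
            break
        page += 1
    return updated, max_change
```

Paging until a short page comes back also covers the case where the number of changed posts exceeds the limit.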

First of all, what the heck is 'frames_pending_string' and 'frames_string'? I have never seen anything in those fields and only Yande.re and Konachan have it. I'm starting to expect some kind of awesome special feature, don't disappoint me now ;)
First see this as example: https://yande.re/post/browse#/id:52058
The content is in "%dx%d,%dx%d" % [left, top, width, height] format, joined by ';' if there is more than one frame.
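If you want to parse it, something like this should work (a Python sketch based on the format described above):

```python
def parse_frames(frames_string):
    """Parse a frames string like '10x20,300x400;50x60,70x80' into a
    list of (left, top, width, height) tuples."""
    frames = []
    if not frames_string:
        return frames
    for frame in frames_string.split(";"):   # one entry per frame
        pos, size = frame.split(",")         # 'leftxtop', 'widthxheight'
        left, top = map(int, pos.split("x"))
        width, height = map(int, size.split("x"))
        frames.append((left, top, width, height))
    return frames
```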

Secondly, (also Yande.re and Konachan only) why do you not provide 'has_comments' and 'has_notes' like every other Booru I have seen? It would require me to trow two extra API calls for every post if I want to include them, even when a post do not have them. Notes are pretty much unused here and I can live without the comments, so I will just disable it to improve performance.
I guess the API for that hasn't been updated in moebooru.

What is 'change' btw?
Think of it as a last-modified identifier.

With /tag/related.xml, how do you get the related tags for a complete search and not just a single tag? If I try to space separate them, it returns a list for each tag.
Currently not possible.

Is there any way of finding which pool a certain post is in? I have found that you could use 'id:[some id] pool:*' to check if a post is in a pool or not. Same thing as comments and notes, I will not make an extra call, but I might make a 'search for pool' option. (I was just thinking it would be cool to have some sort of book view for the manga pools on other Boorus.)
Currently there is no better way.

I hadn't noticed it before, but what are you basing the tags on the left on? The numbers are quite low and don't appear to change on other pages.
Magic. Or the tags of posts from last 24 hours. (hint: the latter)

Other Boorus tend to use a lot of tags though, and you shouldn't use file paths longer than 260 characters on Windows. What do you use as a limit here?
No limit(tm). The actual filename on the server is totally different anyway.

To get the most relevant tags (which is subjective), I was thinking of making the filename like so: 'site_name post_id - artist_tags - copy_tags - character_tags - other_tags.ext'. When the filename becomes too long, take the last group of tags, sort them by tag count and remove the least common tags. Repeat through the other tag groups until the filename is short enough. Suggestions, anyone?
Yeah, I guess some kind of filename sanitizing would be nice.

EDIT: Oh, the joy of posting on a new forum; second edit now to make the formatting work. Can't preview because: '414 Request-URI Too Large'...
A bug. Fixed.
dovac said:
Quite a few people use limit:1000 and it's more of an annoyance than a problem currently.
I didn't even consider that you could do that; I shouldn't have trusted that copypasta documentation...

I haven't heard much rage about the batch downloaders either, so I will make the repository public. (If anyone wants the link, send me a PM.)

edogawaconan said:
First see this as example: https://yande.re/post/browse#/id:52058
The content is in "%dx%d,%dx%d" % [left, top, width, height] format, joined by ';' if there is more than one frame.
Interesting. What is the difference between 'frames' and 'frames_pending'? For this post, they contain exactly the same data. (It is not something I want to use, I'm just curious.)

The 'change' parameter is nice, but it also appears a bit risky to me. What if the number of changed posts is larger than the limit I use, and the wanted post is not on that page? But it gave me some other ideas nevertheless.

For the tags, I might just calculate them myself based on the current cache in those cases. I noticed that most Danbooru-based sites appear to use a limit of something like 300 anyway, and Gelbooru-based boards don't support this API access at all...

edogawaconan said:
Yeah, I guess some kind of filename sanitizing would be nice.
FYI, Opera sometimes bugs out on certain characters and crops part of the filename. Once when I was browsing for steins;gate images I ended up with several images named 'gate.jpg' -__-

Thanks for the help!
Spiller said:
The 'change' parameter is nice, but it also appears a bit risky to me. What if the number of changed posts is larger than the limit I use, and the wanted post is not on that page? But it gave me some other ideas nevertheless.
iqdb queries every hour or so to refresh the tag cache and every half hour for new posts.

Also feel free to make suggestions for the site on the bug tracker.
I'm not interested in keeping a copy of the whole database (for this site and the other 10+ sites I added just because I could) like iqdb and the Booru search engines do. It is overkill for just browsing around, in my opinion. It does make the caching system quite a bit more complex though, if you want it running efficiently that is.

With my old code this was happening each time you entered a search or changed page:
- /post/index.xml?limit=1&tags=x (just to get the total number of pages, because Sankaku does not include any post data here but does provide 'count' and 'offset')
- /post/index.json?tags=x&page=y
- /tag/related.json?tags=x
When you just change page, the first and third requests are exactly the same and unneeded. However, even though it wasted quite some time on those extra requests, performance wasn't actually that bad on most sites.

So by avoiding repeating the same requests and a little bit of prefetching, I'm sure most people would be satisfied with the performance. (But honestly, I'm just embarrassed at how inefficient it was and want to improve it...)
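For the curious, the kind of request deduplication I have in mind is roughly this (a minimal Python sketch; `fetch` is a hypothetical helper that performs the HTTP request and parses the response):

```python
import time

class RequestCache:
    """Memoize API responses for a short time, so that changing page
    does not re-fire identical count/related-tag requests."""

    def __init__(self, fetch, ttl=300):
        self.fetch = fetch          # fetch(url) -> parsed response
        self.ttl = ttl              # seconds before an entry goes stale
        self.store = {}

    def get(self, url, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(url)
        if hit and now - hit[0] < self.ttl:
            return hit[1]           # fresh cached copy, no request fired
        value = self.fetch(url)     # miss or stale: fetch and remember
        self.store[url] = (now, value)
        return value
```

Wrapping the count and related-tag lookups in something like this means only the page request itself hits the server when you flip pages within a search.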
Hi there :)

I'm exhuming this topic because I have a very similar question (although my constraints are different). Hope it's the right place to ask.

I'm writing a shell & C++ script to build a local, partial representation of the yande.re database. I don't want to run this script too often (manually, say once every 6 months or so), but I absolutely want to reduce server load as much as possible. (The motivation is to bother the server the least possible, to be environmentally friendly, and also as an exercise.)

I basically want to query:
For all posts: (id, width, height, file_size, file_url, tags, rating, parent_id)
For all pools: (id, name, description, posts)
(All these to be understood as raw strings/number such as returned by the API.)

I already have a basic JSON parser working on:
https://yande.re/pool/show.json?id=2100
https://yande.re/post.json?limit=100&tags=id:123400..123499

Now my plan is as follows: I haven't tried it yet, but I believe that `pool/show.json` without parameters will return all the pools at once. I will take advantage of the fact that posts are described in detail in `pool/show` responses to parse and store them right away. (If that doesn't work, I will just query all pools one by one :/. At the moment, I don't know of a way of finding the maximum pool id :(. But that doesn't prevent me from reusing the post info in it.)
Then I will query all *remaining* posts. I identified two types of query that would work:
https://yande.re/post.json?limit=1000&tags=id:123000..123999
https://yande.re/post.json?limit=750&tags=id:123000,123002,123010 ...
I tried to combine ranges and singletons, or to use multiple ranges, without success.
The former is (I assume) the most server-friendly, but if I have a batch of, say, fewer than 50 not-yet-known posts, is it still worthwhile?
The latter consists of specifying explicitly all the post_ids I'm interested in (there is a limit of around 750 posts, probably because the URL gets too long beyond that). I expect this request to be harder to process (or is it?).
I guess if I have, say, 500 contiguous posts to query except the 200th, I'd better ask for the whole range, even if that implies asking for something I already have. But what is your opinion? What kind of rule could I follow to choose one type of query or the other? Is there a better way to achieve the same result?
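One rule I'm considering: greedily merge the missing ids into ranges whenever the gap of already-known posts between them is small, trading a few redundantly fetched posts for fewer requests (a Python sketch of the idea; the `max_gap` threshold is my own guess at the trade-off):

```python
def ids_to_ranges(missing_ids, max_gap=50):
    """Group missing post ids into (start, end) ranges, bridging gaps of
    at most max_gap already-known posts. Re-fetching a few known posts
    is cheap compared to issuing an extra request."""
    ids = sorted(missing_ids)
    if not ids:
        return []
    ranges = [[ids[0], ids[0]]]
    for i in ids[1:]:
        if i - ranges[-1][1] <= max_gap:
            ranges[-1][1] = i        # small gap: extend the current range
        else:
            ranges.append([i, i])    # gap too large: start a new range
    return [(a, b) for a, b in ranges]
```

Each resulting (start, end) pair would then become one `tags=id:start..end` query.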

EDIT 1:
I think these last questions could be reworded as: Is it better to optimize the number of requests, the expected volume (in bytes) of the responses, the complexity of the requests, or a combination of these?
Also, I noticed in the examples above that I get a significant number of *deleted* posts. I handle that already, but I guess ideally, if I could instruct the server not to list them, it would be better for everyone. I tried adding `deleted:false` to the list of tags, but no luck. Is there a proper way of doing that?

Thanks :D