yande.re

kompil

2020-09-29 17:48:08 UTC

Hi there :)

I'm exhuming this topic because I have a very similar question (although my constraints are different). Hope it's the right place to ask.

I'm writing a shell & C++ script to build a local, partial representation of the yande.re database. I don't intend to run it often (manually, say once every 6 months or so), but I absolutely want to keep server load as low as possible. (The motivation is to be as little of a bother as possible, to be environmentally friendly, and also to treat it as an exercise.)

I basically want to query:
For all posts: (id, width, height, file_size, file_url, tags, rating, parent_id)
For all pools: (id, name, description, posts)
(All of these are to be understood as raw strings/numbers, as returned by the API.)
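
For concreteness, here is how I picture the two record types on the C++ side. This is only a sketch: the struct names are my own, the integer widths are guesses, and I'm assuming `parent_id` can be missing for posts without a parent and that a pool only needs to remember the ids of its posts.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record types; the field names follow the two lists above.
struct Post {
    std::uint64_t id = 0;
    std::uint32_t width = 0;
    std::uint32_t height = 0;
    std::uint64_t file_size = 0;
    std::string file_url;
    std::string tags;    // raw tag string, stored exactly as returned
    std::string rating;  // raw rating string, stored exactly as returned
    // Assumption: empty when the post has no parent.
    std::optional<std::uint64_t> parent_id;
};

struct Pool {
    std::uint64_t id = 0;
    std::string name;
    std::string description;
    // Assumption: only post ids are kept here; the posts themselves
    // are stored once, in a separate collection of Post records.
    std::vector<std::uint64_t> post_ids;
};
```

Keeping only ids in `Pool` avoids storing the same post twice when it belongs to several pools.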

I already have a basic JSON parser working on:
https://yande.re/pool/show.json?id=2100
https://yande.re/post.json?limit=100&tags=id:123400..123499

Now my plan is as follows: I haven't tried it yet, but I believe that `pool/show.json` without parameters will return all the pools at once. I will take advantage of the fact that posts are described in detail in `pool/show` responses to parse and store them right away. (If that doesn't work, I will just query all pools one by one :/. At the moment, I don't know of a way to find the maximum pool id :(. But that doesn't prevent me from reusing the post info in the responses.)
Then I will query all *remaining* posts. I identified two types of query that would work:
https://yande.re/post.json?limit=1000&tags=id:123000..123999
https://yande.re/post.json?limit=750&tags=id:123000,123002,123010 ...
I tried to combine ranges and singletons, or to use multiple ranges, without success.
The former is (I assume) the most server-friendly, but if I have a batch of not-yet-known posts of, say, fewer than 50 posts, is it still worthwhile?
The latter consists of explicitly specifying all the post ids I'm interested in (there is a limit of around 750 posts, probably because the URL becomes too long beyond that). I expect this request to be harder to process (or is it?).
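
To make the two query forms concrete, this is roughly how I would build the URLs. It is a minimal sketch: the function names are my own, and setting `limit` to the exact number of requested ids is my own choice, not something I know the API requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Endpoint taken from the example URLs above.
const std::string kPostEndpoint = "https://yande.re/post.json";

// Range form: tags=id:first..last (both bounds included).
// Assumes first <= last.
std::string range_query_url(std::uint64_t first, std::uint64_t last) {
    return kPostEndpoint + "?limit=" + std::to_string(last - first + 1) +
           "&tags=id:" + std::to_string(first) + ".." + std::to_string(last);
}

// List form: tags=id:a,b,c with every wanted id spelled out.
std::string list_query_url(const std::vector<std::uint64_t>& ids) {
    std::string url =
        kPostEndpoint + "?limit=" + std::to_string(ids.size()) + "&tags=id:";
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i > 0) url += ',';  // comma-separated, no trailing comma
        url += std::to_string(ids[i]);
    }
    return url;
}
```

For example, `range_query_url(123000, 123999)` gives the first URL above, and `list_query_url({123000, 123002, 123010})` gives the second form with `limit=3`.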
I guess that if I have, say, 500 contiguous posts to query, except the 200th, I'd better ask for the whole range, even if that means asking for something I already have. But what is your opinion? What kind of rule could I follow to choose one type of query or the other? Is there a better way to achieve the same result?
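
One rule I'm considering, sketched below, is to merge missing ids greedily into ranges as long as the hole of already-known ids between two neighbours stays small. The names `Batch` and `plan_ranges` and the thresholds `max_gap` and `max_span` are all my own invention; I have no idea yet what values would actually suit the server.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A contiguous id range to request, and how many ids in it are really wanted.
struct Batch {
    std::uint64_t first;
    std::uint64_t last;
    std::size_t wanted;
};

// Greedy grouping of missing ids into ranges.
// Precondition: `missing` is sorted in strictly increasing order.
// max_gap:  largest run of already-known ids tolerated inside a range.
// max_span: largest number of ids a single range may cover.
std::vector<Batch> plan_ranges(const std::vector<std::uint64_t>& missing,
                               std::uint64_t max_gap,
                               std::uint64_t max_span) {
    std::vector<Batch> out;
    for (std::uint64_t id : missing) {
        if (!out.empty()) {
            Batch& b = out.back();
            const bool close_enough = id - b.last - 1 <= max_gap;
            const bool fits = id - b.first + 1 <= max_span;
            if (close_enough && fits) {
                b.last = id;  // extend the current range over the small hole
                ++b.wanted;
                continue;
            }
        }
        out.push_back({id, id, 1});  // start a new range at this id
    }
    return out;
}
```

With ids 1 to 500 missing except the 200th, and `max_gap = 10`, `max_span = 1000`, this yields a single range 1..500 with 499 wanted posts. Batches that end up with very few wanted ids could then be pooled into one explicit-list query instead.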

EDIT 1:
I think these last questions could be reworded as: Is it better to optimize the number of requests; the expected volume (in bytes) of the responses; the complexity of the requests; or a combination of these?
Also, I noticed that in the examples above I get a significant number of *deleted* posts. I already handle that, but I guess it would be better for everyone if I could instruct the server not to list them. I tried adding `deleted:false` to the list of tags, but no luck. Is there a proper way of doing that?

Thanks :D
