USChess.org Site Down?

Web scraping is basically a program that makes a web call, captures the results, and parses them for the information it is after, usually saving that information in a file or database.
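For anyone who hasn't seen one, here is a minimal sketch of the parse-and-save half of a scraper, using only the Python standard library. The page layout, the names `CellCollector`, `parse_table` and `rows_to_csv`, and the sample rows are all made up for illustration; this is not the actual layout of any US Chess page.

```python
import csv
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row being built, or None outside a <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


def parse_table(html):
    """Step 2: parse the captured page for the data we are after."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Step 3: turn the rows into CSV text, ready to write to a file."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Step 1 (the web call) is left out on purpose. A real scraper would
# download the page first, e.g. urllib.request.urlopen(url).read().decode().
# This sample page and its contents are invented.
SAMPLE_HTML = """
<table>
  <tr><td>00000001</td><td>DOE, JANE</td><td>1500</td></tr>
  <tr><td>00000002</td><td>ROE, RICHARD</td><td>1720</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(parse_table(SAMPLE_HTML)), end="")
```

Wrap that in a loop over a few thousand IDs with no pause between requests and you have the kind of traffic being discussed here.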

The main difference between web scraping and user browsing is speed: an automated program doesn’t have to wait for a user to read each page; it just goes on to the next one. I’ve seen automated web-scraping tools make 100 queries a second! And some companies shotgun it, setting up multiple sites all hitting the same system. You need huge resources to handle that many queries, and that can cost tens of thousands of dollars a week!

My son works as a site reliability engineer for a large-scale database company. (Their ‘small’ clients have monthly bills of $50,000 or more.) He’s helped to build sites that can handle tens of thousands of queries a second. (Too bad the ticketing company handling the Taylor Swift tour wasn’t using them.)

When you submit a JTP, you get “Tournament directors: please note that submitted memberships will appear in the ratings system shortly (typically within 5 minutes).” When you submit a paid membership, you get the name and the ID back along with other verbiage. JTPs also do not appear in the Submitted Memberships list for an affiliate.

I’m looking to see what’s in the CiviCRM database for JTP memberships. You have to be associated with an affiliate to submit a membership batch, so surely it has the affiliate ID somewhere.

Why would someone need to scrape data from the MUIR system? Is he or she trying to get data, such as ratings, for a computer to process immediately? Or, say, ratings from the Monthly database for tournament purposes?

There are probably lots of reasons why, but I can only guess at a few. Some people might just want to play with the data, or keep a local database of players from their club or their state. Others are handling event registrations and want to update their local player database more frequently than once a month. Some may not like the US Chess site and would prefer to have (and possibly offer to others) a site organized differently.

The disadvantage all 3rd party sites have is that their data is seldom going to be more up-to-date than the US Chess site’s data.

MUIR also appears unusable now; I know you said MSA was having trouble but has that spread?

Seems to be working for me. How are you trying to get to it? www.uschess.org is not bringing up new.uschess.org right now, though if you enter new.uschess.org directly, that seems to be OK. ratings.uschess.org seems to be up, as does the TD portal. I haven’t tried to upload a tournament, though.

@nolan We are still experiencing issues with the website. We have seen ‘Unable to connect to the API’ multiple times today when trying to submit a ratings report. Once you get that error, it deletes the draft and you have to start all over again. Is there an ETA for when this will be fixed?

Did the old site ever experience web scraping issues?

Trying to upload a tournament, getting a lot of this:

This is the same error that we are getting.

Are you starting at new.uschess.org?

If I start there, I get on.

If I go directly to ratings.uschess.org and try to enter the TD portal, I get that error. This could be related to the high traffic issue.

I have seen some indication that new.uschess.org is also getting higher than normal hits. Sometimes it doesn’t completely draw the site, like it’s missing a CSS file.

This is web scraping on the MSA system, but then MUIR is starting to have issues now too.

There are legitimate reasons for web scraping. However, if you are a competent and ethical programmer, you keep the server health in mind and keep the automatic page loads to a minimum.
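To make "keep the server health in mind" concrete, here is a rough sketch of a well-behaved client: it waits a minimum interval between requests and never asks for the same page twice in one run. The class name `PoliteFetcher`, the two-second default, and the injectable `fetch`, `clock` and `sleep` parameters are my own assumptions for illustration, not anything US Chess has published or asked for.

```python
import time


class PoliteFetcher:
    """Wraps a fetch function so real requests are at least
    `min_interval` seconds apart, and repeat URLs come from a cache."""

    def __init__(self, fetch, min_interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch              # callable: url -> page body
        self._min_interval = min_interval
        self._clock = clock              # injectable so it can be tested
        self._sleep = sleep
        self._last = None                # time of the previous real request
        self._cache = {}                 # url -> body already downloaded

    def get(self, url):
        # A cached page costs the server nothing, so return it right away.
        if url in self._cache:
            return self._cache[url]

        # Otherwise wait out whatever is left of the minimum interval.
        if self._last is not None:
            wait = self._min_interval - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)

        self._last = self._clock()
        body = self._fetch(url)
        self._cache[url] = body
        return body
```

At two seconds per request, a thousand lookups take a little over half an hour instead of ten seconds at 100 queries a second, and the server barely notices.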

As I told our new ED, looking at the logs it looks like (poorly programmed) web scraping, but it might as well be a deliberate DDOS attack for the effect it has on us.

I’m sure this will be the major focus of the IT team on Tuesday.

MUIR seems to be having major page load issues right now. Makes me think it’s having its own DDOS type issue.

Is it possible to place an internet “throttle” or “governor” on the speed or frequency of scrape requests processed at any given moment? Or to set up a toll booth of sorts, which processes one scrape request at a time, the way a booth takes one car’s payment, and doesn’t let the next request onto the highway until that one is processed? Or is that what already happens?

@nolan There are still ongoing issues.

e.g. new.uschess.org is not loading properly. Affiliate page looks like this.

While the developers are trying to fix the issues, is there a workaround? Thanks!

It’s difficult to tell which requests are web-scraping and which are just people looking for data. There are rate-limiting filters in place, but they may need refinement.
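For readers wondering what a rate-limiting filter looks like, one common technique is a token bucket kept per client. The sketch below is generic and is not a description of the filters US Chess actually runs; the class names, the `client_id` key (an IP address, say) and the rate and capacity numbers are all assumptions.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity` requests, then a steady
    `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start with a full bucket
        self._updated = clock()

    def allow(self):
        # Refill according to the time elapsed, never above capacity.
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

        # Spend one token per request; refuse when the bucket is empty.
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client so a heavy user only throttles itself."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

The "refinement" problem is visible even in this toy version: set the numbers too low and a TD clicking quickly through crosstables gets refused, set them too high and a scraper spread over many addresses slips under the limit.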

Without going into details, we’re working on ways to ameliorate the issue and will continue that process until we’re satisfied it’s under control.

I find if I hit the new.uschess.org page two or three times, it usually finishes building.

Well, for now at least it appears we’ve got things under control, but we’ll keep watching the load patterns.

This may have a negative impact on legitimate users of US Chess data, and we’re sorry that abusive users (whether through bad intent or bad programming) caused us to tighten things up.

A project to have key-based APIs that we can use to control bandwidth usage in a more fine-grained fashion is on the planning boards. It might have to be put higher up the priority list, but that’s a decision for senior staff.
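Since that project is still on the planning boards, the following is only a guess at the general idea and not the planned design: each caller gets a key, each key gets its own quota, and requests without a known key are refused. The names `ApiKeyGate` and `KeyQuota`, the daily-limit model and the example keys are all hypothetical; the only real things here are the HTTP status codes 401 (Unauthorized) and 429 (Too Many Requests).

```python
from dataclasses import dataclass


@dataclass
class KeyQuota:
    daily_limit: int      # requests this key may make per day
    used_today: int = 0   # requests counted so far today


class ApiKeyGate:
    """Decides, per API key, whether a request may proceed."""

    def __init__(self, limits):
        # limits: mapping of api_key -> daily request limit
        self._quotas = {key: KeyQuota(limit) for key, limit in limits.items()}

    def check(self, api_key):
        """Return an HTTP-style status code for one request."""
        quota = self._quotas.get(api_key)
        if quota is None:
            return 401                      # unknown or missing key
        if quota.used_today >= quota.daily_limit:
            return 429                      # over quota for today
        quota.used_today += 1
        return 200

    def reset_day(self):
        """Zero every counter, e.g. from a job that runs at midnight."""
        for quota in self._quotas.values():
            quota.used_today = 0


if __name__ == "__main__":
    # Hypothetical keys: a club site with a small quota, a partner with more.
    gate = ApiKeyGate({"club-site-key": 2, "partner-key": 1000})
    print([gate.check("club-site-key") for _ in range(3)])
    print(gate.check("no-such-key"))
```

The point of keys is that the server knows who is asking, so a heavy user can be given a bigger quota, or cut off, without touching anyone else.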