Unfortunately, last night the incremental restore borked, pushing the cluster into an unusable state. It started actively refusing any new connections, and to prevent damage the grid has again been taken offline and put into maintenance mode. The backend team will need more time to fine-tune the process that strips free space out of the existing assets.
The idea was to have the ‘conversion’ process use smaller batches of around 10 million assets at a time. This is meant to reduce the overall unavailability of the assets being processed, and it also lets the team intervene more efficiently and resolve any issues if they surface. Unfortunately we pushed this a little sooner than we probably should have, and it’s going to take more time. We were already aware we were pushing the envelope in terms of load, as we noticed in-world things are not as snappy as we would like them to be.
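For readers curious what “smaller batches” means in practice, here is a minimal sketch of the general idea, with made-up names (convert_batch, cluster_healthy). It is not the actual OSgrid conversion code, just an illustration of why batching keeps only a slice of assets unavailable at a time and gives natural pause points where the team can step in.

```python
# Minimal sketch of a batch-wise conversion loop (hypothetical names and
# structure; not the actual OSgrid backend code). Only one batch of assets
# is worked on at a time, so only that slice is temporarily unavailable,
# and the loop can be stopped between batches if problems show up.

BATCH_SIZE = 10_000_000  # roughly 10 million assets per batch

def convert_in_batches(asset_ids, convert_batch, cluster_healthy):
    """Run the conversion one batch at a time, stopping early on trouble."""
    for start in range(0, len(asset_ids), BATCH_SIZE):
        batch = asset_ids[start:start + BATCH_SIZE]
        convert_batch(batch)        # e.g. strip free space from these assets
        if not cluster_healthy():   # natural pause point between batches
            print(f"Stopping after {start + len(batch)} assets; cluster looks unhealthy")
            break
```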
Q & A
Why can’t you immediately tell us exactly what failed? Because we are technicians, not computer psychics. We observe an issue, try to find the cause, and see if we can fix it. Not everything is simple and obvious: the failure itself may be obvious, but finding its source sometimes takes time. So the grid goes offline and we put up an alert message. We know the cluster is under a lot of load, and the issues we found were indeed caused by that workload: a process hung while writing. But it’s not always immediately obvious why something fails. It can be caused by a lack of resources, a memory leak, an attack on our servers, a network failure, broken code, a system or config issue, or any of a gazillion other things. But it’s assets we deal with, so priority one is to protect your “things”. We don’t take the grid down unless we feel we need to.
How long will this take? We can’t answer this question. The educated guess is that we probably opened back up too soon. Opening up ASAP would still be my personal preference, but the backend team will have the only sensible, and probably more conservative, opinion on this. So we have no ETA to share at this moment; we’re working on it. We explained in the previous post that we are online provisionally. I know it’s not what you want to read, but we’re in an ongoing process here; it’s not a new issue. We can only ask you to hang in there.
Were there a lot of issues reported last week? So far we have only seen sporadic issues, like an avatar or texture (like my OSgrid shirt from 2013) that didn’t want to load, and loading outfits also appeared slower than normal. But there were no reports of significant failures, wiped regions, or things going completely MIA. Yesterday region servers started to report asset failures due to refused connections. Note that it does not say “broken assets” anywhere. The cluster was just so busy / stuck that it decided, “Nope, just go away” at every new request, so we had to turn off access.
Community Meeting Sunday December 23rd
OSgrid has its annual Membership meeting planned at noon at Hurliman Plaza. We welcome your input and feedback. This is a good moment to reflect on where you stand as a resident, donator, entertainer, artist, host, or greeter, and to provide feedback on where you’d like to see things go and what concerns you.