David Gasaway
2018-06-28 17:27:45 UTC
Hi,
I've had a couple of instances lately where an s3ql filesystem crashed due
to temporary network issues, "ConnectionTimedOut" or "Network is
unreachable", after the 12th failure. The first time, it looks like
the crash happened after about 30 minutes. The second time, about 8
minutes. First, I'd like to know if there is an option to make s3ql
more forgiving of these network issues. The documentation seems to
say that s3ql will wait until 24 hours have passed to fail in this
way, but that doesn't seem to be what happened in my case.
Anyway, my bigger question has to do with recovery from the crash.
Both times, I've had to 'fusermount -u' and 'fsck.s3ql' to get the
filesystem back. The second time, files were moved to lost+found,
which likely caused backup repo corruption (uploads were in progress).
Is there some way that I'm missing to get s3ql to pick back up where
it left off, preferably without unmounting first?
Unfortunately, I was running a somewhat older version at the time
(2.24; I've since upgraded to 2.28). I see the version 2.27
release notes say "fsck.s3ql is now able to recover blocks that were
in transit during a file system crash." I'm not sure exactly what
that means. Does fsck.s3ql followed by mount.s3ql resume any pending
operations (uploads) with 2.27+?
Finally, I'm wondering whether a "suspended" mode could be added to
s3ql that is somewhere between "active" and "crashed". A mode where
uncached reads fail, writes succeed, but nothing goes over the
network. Persistent network errors would drop s3ql into this mode.
This would be accompanied by a s3qlctrl action to resume network
operations. Taking it a step further, there could be a s3qlctrl
action to intentionally trigger this mode, which I would find really
helpful. I studied the code a little myself to figure out how this
might be implemented. Perhaps an attribute on BlockCache that
_upload_loop() and get() could check. I'm not sure how to get the
uncached read failure to fail the client filesystem operation without
an s3ql failure, though.
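To make the idea concrete, here is a rough, standalone sketch of the behaviour I have in mind. It is not s3ql code: SuspendableCache, suspend(), resume(), write() and upload_pending() are hypothetical names, the backend is just a dict, and raising OSError with EIO is only my assumption about how an uncached read could be failed back to the client without taking the whole filesystem down.

```python
import errno
import threading


class SuspendableCache:
    """Toy stand-in for a block cache; NOT s3ql's actual BlockCache."""

    def __init__(self, backend):
        self.backend = backend   # dict-like remote store (hypothetical)
        self.cache = {}          # blocks available locally
        self.dirty = set()       # keys written locally but not yet uploaded
        self.suspended = threading.Event()

    def suspend(self):
        """Stop all network activity (what a s3qlctrl action might trigger)."""
        self.suspended.set()

    def resume(self):
        """Allow network activity again."""
        self.suspended.clear()

    def write(self, key, data):
        # Writes only touch the local cache, so they succeed while suspended.
        self.cache[key] = data
        self.dirty.add(key)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        if self.suspended.is_set():
            # Fail only this operation; the cache object itself stays usable.
            raise OSError(errno.EIO, "block not cached and network suspended")
        data = self.backend[key]
        self.cache[key] = data
        return data

    def upload_pending(self):
        """Upload dirty blocks and return how many were sent (0 if suspended)."""
        if self.suspended.is_set():
            return 0
        count = 0
        for key in sorted(self.dirty):
            self.backend[key] = self.cache[key]
            count += 1
        self.dirty.clear()
        return count
```

In this sketch, a write made while suspended stays in the dirty set and is uploaded by the first upload_pending() call after resume(), which is the "pick back up where it left off" behaviour I am after.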
Thanks.
--
-:-:- David K. Gasaway
-:-:- Email: ***@gasaway.org
--
You received this message because you are subscribed to the Google Groups "s3ql" group.
To unsubscribe from this group and stop receiving emails from it, send an email to s3ql+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.