Discussion:
[s3ql] Recovery from network issue
David Gasaway
2018-06-28 17:27:45 UTC
Hi,

I've had a couple of instances lately where an s3ql filesystem crashed due
to temporary network issues, "ConnectionTimedOut" or "Network is
unreachable", after the 12th failure. The first time, it looks like
the crash happened after about 30 minutes. The second time, after about 8
minutes. First, I'd like to know if there is an option to make s3ql
more forgiving of these network issues. The documentation seems to
say that s3ql will wait until 24 hours have passed before failing in this
way, but that isn't what happened in my case.

Anyway, my bigger question has to do with recovery from the crash.
Both times, I've had to 'fusermount -u' and 'fsck.s3ql' to get the
filesystem back. The second time, files were moved to lost+found,
which likely caused backup repo corruption (uploads were in progress).
Is there some way that I'm missing to get s3ql to pick back up where
it left off, preferably without unmounting first?

Unfortunately, I was running a little bit older version at the time
(2.24, which I've now upgraded to 2.28). I see the version 2.27
release notes say "fsck.s3ql is now able to recover blocks that were
in transit during a file system crash." I'm not sure exactly what
that means. Does fsck.s3ql followed by mount.s3ql resume any pending
operations (uploads) with 2.27+?

Finally, I'm wondering whether a "suspended" mode could be added to
s3ql that is somewhere between "active" and "crashed". A mode where
uncached reads fail, writes succeed, but nothing goes over the
network. Persistent network errors would drop s3ql into this mode.
This would be accompanied by an s3qlctrl action to resume network
operations. Taking it a step further, there could be an s3qlctrl
action to intentionally trigger this mode, which I would find really
helpful. I studied the code a little myself to figure out how this
might be implemented. Perhaps an attribute on BlockCache that
_upload_loop() and get() could check. I'm not sure how to get an
uncached read to fail the client filesystem operation without
an s3ql failure, though.
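
To illustrate the idea, here is a toy sketch; the SuspendableCache class,
the 'suspended' flag, and everything else in it are placeholders I made up,
not existing s3ql code:

import errno
import threading
import time

import llfuse


class SuspendableCache:
    """Toy stand-in for BlockCache; everything here is hypothetical."""

    def __init__(self):
        self.cache = {}  # (inode, blockno) -> block data
        self.suspended = threading.Event()  # would be toggled by an s3qlctrl action

    def _upload_loop(self):
        while True:
            if self.suspended.is_set():
                # Nothing goes over the network while suspended; dirty
                # entries simply stay in the local cache.
                time.sleep(1)
                continue
            # ... normal upload of dirty cache entries would happen here ...
            time.sleep(1)

    def get(self, inode, blockno):
        key = (inode, blockno)
        if key in self.cache:
            return self.cache[key]
        if self.suspended.is_set():
            # Uncached read while suspended: fail just this request with
            # EIO instead of crashing the whole mount.
            raise llfuse.FUSEError(errno.EIO)
        # ... otherwise the block would be fetched from the backend here ...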

Thanks.
--
-:-:- David K. Gasaway
-:-:- Email: ***@gasaway.org
--
Nikolaus Rath
2018-06-28 19:33:34 UTC
Post by David Gasaway
Hi,
I've had a couple of instances lately where an s3ql filesystem crashed due
to temporary network issues, "ConnectionTimedOut" or "Network is
unreachable", after the 12th failure. The first time, it looks like
the crash happened after about 30 minutes. The second time, after about 8
minutes. First, I'd like to know if there is an option to make s3ql
more forgiving of these network issues. The documentation seems to
say that s3ql will wait until 24 hours have passed before failing in this
way, but that isn't what happened in my case.
Can you take a look at the logs? You should see entries like
"Encountered %s (%s), retrying %s.%s (attempt %d)...", and this should
indeed repeat for up to 24 hours. If you want to change this, look at
src/s3ql/backends/common.py, RETRY_TIMEOUT variable. However, I suspect
that in your case the error eventually changed from something that was
considered temporary to something that was considered permanent.
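Roughly, the retry logic does something like this (a simplified sketch, not
the actual decorator in backends/common.py; the backoff constants are made
up, only the ~24 hour total corresponds to RETRY_TIMEOUT):

import logging
import time

log = logging.getLogger(__name__)

RETRY_TIMEOUT = 60 * 60 * 24  # give up after roughly 24 hours


def retry_call(fn, is_temp_failure, *args, **kwargs):
    # Temporary errors are retried with an increasing delay until
    # RETRY_TIMEOUT is exceeded; permanent errors propagate immediately.
    interval = 0.2
    waited = 0.0
    attempt = 0
    while True:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if not is_temp_failure(exc) or waited > RETRY_TIMEOUT:
                raise
            attempt += 1
            log.warning('Encountered %s (%s), retrying %s (attempt %d)...',
                        type(exc).__name__, exc, fn.__name__, attempt)
            time.sleep(interval)
            waited += interval
            interval = min(5 * 60, 2 * interval)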
Post by David Gasaway
Anyway, my bigger question has to do with recovery from the crash.
Both times, I've had to 'fusermount -u' and 'fsck.s3ql' to get the
filesystem back. The second time, files were moved to lost+found,
which likely caused backup repo corruption (uploads were in progress).
That should *prevent* repo corruption, not cause it. Otherwise you'd
be left with partial data in the files that were being written when the
network was interrupted. By moving them into lost+found, fsck.s3ql makes
you aware of the issue (so you can re-run the backup and re-copy the
files).
Post by David Gasaway
Is there some way that I'm missing to get s3ql to pick back up where
it left off, preferably without unmounting first?
Unfortunately no.
Post by David Gasaway
Unfortunately, I was running a little bit older version at the time
(2.24, which I've now upgraded to 2.28). I see the version 2.27
release notes say "fsck.s3ql is now able to recover blocks that were
in transit during a file system crash." I'm not sure exactly what
that means. Does fsck.s3ql followed by mount.s3ql resume any pending
operations (uploads) with 2.27+?
fsck.s3ql has always uploaded dirty cache entries that weren't uploaded
yet. You are only losing data that was truly in-flight when mount.s3ql
crashed (otherwise you'd probably have seen many more files being moved
to lost+found). With S3QL 2.27, being "in transit" has been somewhat
redefined, so you should see even fewer losses, but there is still a
chance of losing something.
Post by David Gasaway
Finally, I'm wondering whether a "suspended" mode could be added to
s3ql that is somewhere between "active" and "crashed". A mode where
uncached reads fail, writes succeed, but nothing goes over the
network. Persistent network errors would drop s3ql into this mode.
This would be accompanied by an s3qlctrl action to resume network
operations. Taking it a step further, there could be an s3qlctrl
action to intentionally trigger this mode, which I would find really
helpful. I studied the code a little myself to figure out how this
might be implemented. Perhaps an attribute on BlockCache that
_upload_loop() and get() could check. I'm not sure how to get an
uncached read to fail the client filesystem operation without
an s3ql failure, though.
In principle this could be done, yes. It just needs someone to do the
work.


Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
--
David Gasaway
2018-06-28 21:50:59 UTC
Post by Nikolaus Rath
Can you take a look at the logs? You should see entries like
"Encountered %s (%s), retrying %s.%s (attempt %d)...", and this should
indeed repeat for up to 24 hours.
I was looking at the logs when I wrote my post. :) From the first warning to
the crash was 8 minutes. The filesystem wasn't even mounted for 24 hours.
The following is the log leading up to the exception - let me know if
you'd like to see the full stack trace.

Jun 26 20:30:28 wolfie mount.s3ql[29667]: fuse-worker-9]
s3ql.backends.gs._get_access_token: Requesting new access token
Jun 26 21:17:21 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 3)...
Jun 26 21:17:24 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 4)...
Jun 26 21:17:27 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 5)...
Jun 26 21:17:31 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 6)...
Jun 26 21:17:35 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 7)...
Jun 26 21:17:37 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 8)...
Jun 26 21:17:40 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 9)...
Jun 26 21:17:47 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 10)...
Jun 26 21:18:00 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 11)...
Jun 26 21:18:24 wolfie mount.s3ql[29667]: fuse-worker-8]
s3ql.backends.common.wrapped: Encountered OSError ([Errno 101] Network
is unreachable), retrying Backend.open_read (attempt 12)...
Jun 26 21:25:37 wolfie mount.s3ql[29667]: fuse-worker-17]
llfuse.(unknown function): handler raised <class
'dugong.HostnameNotResolvable'> exception (Host
commondatastorage.googleapis.com does not have any ip addresses),
terminating main loop.
Jun 26 21:25:57 wolfie mount.s3ql[29667]: fuse-worker-26]
llfuse.(unknown function): Only one exception can be re-raised in
`llfuse.main`, the following exception will be lost
Traceback (most recent call last):
File "/usr/lib64/python3.4/site-packages/s3ql/block_cache.py", line
751, in _get_entry
el = self.cache[(inode, blockno)]
KeyError: (548156, 1)
Post by Nikolaus Rath
If you want to change this, look at
src/s3ql/backends/common.py, RETRY_TIMEOUT variable. However, I suspect
that in your case the error eventually changed from something that was
considered temporary to something that was considered permanent.
I suppose you are right that there was another trigger -
HostnameNotResolvable. I wrongly assumed that HostnameNotResolvable
was the same underlying issue behind the retries. This error is
considered permanent, then?
Post by Nikolaus Rath
That should *prevent* repo corruption, not cause it. Otherwise you'd
be left with partial data in the files that were being written when the
network was interrupted. By moving them into lost+found, fsck.s3ql makes
you aware of the issue (so you can re-run the backup and re-copy the
files).
Yes and no. Please bear in mind that "repo" here does not mean the
filesystem. It means a borg repository stored in the filesystem.
Partial or missing data files both corrupt the repo (its metadata no
longer matches the filesystem). Another backup run does nothing, as the
repo believes the data is already stored. It's looking like I'll have
to download all the data to repair it. I'm aware that my setup may be
ill-conceived. There is a reason I posted to the list a while back
asking for other folks' experience. :)

At any rate, I appreciate that fsck.s3ql made me aware of the issue.
I'm just trying to sort out how to get out of this quandary.
Post by Nikolaus Rath
fsck.s3ql has always uploaded dirty cache entries that weren't uploaded
yet. You are only losing data that was truly in-flight when mount.s3ql
crashed (otherwise you'd probably have seen many more files being moved
to lost+found). With S3QL 2.27, being "in transit" has been somewhat
redefined, so you should see even fewer losses, but there is still a
chance of losing something.
Allow me to play devil's advocate. Assuming the filesystem has not
been mounted elsewhere in the meantime, why can't all dirty cache
blocks be uploaded, including those that were in transit at the time
of the crash?
Post by Nikolaus Rath
Post by David Gasaway
Perhaps an attribute on BlockCache that
_upload_loop() and get() could check. I'm not sure how to get an
uncached read to fail the client filesystem operation without
an s3ql failure, though.
In principle this could be done, yes. It just needs someone to do the
work.
I'm not ruling myself out here. Can you say if my specific ideas
above are off the mark?

Thanks.
--
-:-:- David K. Gasaway
-:-:- Email: ***@gasaway.org
--
Nikolaus Rath
2018-06-29 15:26:55 UTC
Post by David Gasaway
Post by Nikolaus Rath
If you want to change this, look at
src/s3ql/backends/common.py, RETRY_TIMEOUT variable. However, I suspect
that in your case the error eventually changed from something that was
considered temporary to something that was considered permanent.
I suppose you are right that there was another trigger -
HostnameNotResolvable. I made a poor assumption that
HostnameNotResolvable was the underlying issue behind the retries.
This is considered permanent, then?
Yes. But HostnameNotResolvable is only raised if the specific hostname
is not resolvable. On lookup failure, S3QL (or, rather, python-dugong)
tries to look up a number of well-known hostnames (google,
root-servers). If those lookups fail too, the problem is considered
temporary. If only the hostname of the service fails to resolve, the
problem is considered permanent.
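In rough pseudocode (a sketch only; the probe hostnames and the function
name are placeholders, not the exact list or API that python-dugong uses):

import socket

# Placeholder probe hosts; the exact list python-dugong uses may differ.
PROBE_HOSTS = ('www.google.com', 'a.root-servers.net')


def name_resolution_failure_is_temporary(hostname):
    # If the probe hosts do not resolve either, DNS as a whole is broken
    # and the failure is treated as temporary. If the probes resolve but
    # *hostname* does not, the failure is treated as permanent
    # (HostnameNotResolvable).
    for probe in PROBE_HOSTS:
        try:
            socket.getaddrinfo(probe, 80)
            return False  # DNS works in general, this hostname is the problem
        except socket.gaierror:
            continue
    return True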
Post by David Gasaway
Post by Nikolaus Rath
fsck.s3ql has always uploaded dirty cache entries that weren't uploaded
yet. You are only losing data that was truly in-flight when mount.s3ql
crashed (otherwise you'd probably have seen many more files being moved
to lost+found). With S3QL 2.27, being "in transit" has been somewhat
redefined, so you should see even fewer losses, but there is still a
chance of losing something.
Allow me to play devil's advocate. Assuming the filesystem has not
been mounted elsewhere in the meantime, why can't all dirty cache
blocks be uploaded, including those that were in transit at the time
of the crash?
They can, as of version 2.27 :-). What I meant is that there's still
lots of code in fsck.s3ql that will move things to lost&found. It's hard
to give an exhaustive list of how the respective problems could be
triggered, so I'm hedging my assurances.
Post by David Gasaway
Post by Nikolaus Rath
Post by David Gasaway
Perhaps an attribute on BlockCache that
_upload_loop() and get() could check. I'm not sure how to get an
uncached read to fail the client filesystem operation without
an s3ql failure, though.
In principle this could be done, yes. It just needs someone to do the
work.
I'm not ruling myself out here. Can you say if my specific ideas
above are off the mark?
No, I was being serious. This could be done. The read() failure can just
return an error code (EIO probably).
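Something along these lines in the FUSE read handler would do (a sketch
only; the cache object, its 'suspended' flag, and is_cached() are the
hypothetical pieces from your proposal, not existing s3ql code):

import errno
import llfuse


class Operations(llfuse.Operations):
    # Only the read handler is shown; self.cache and its 'suspended'
    # flag are assumed to come from the proposed BlockCache changes.

    def read(self, fh, off, size):
        if self.cache.suspended and not self.cache.is_cached(fh, off, size):
            # Fail only this request; the mount itself keeps running.
            raise llfuse.FUSEError(errno.EIO)
        return self.cache.read(fh, off, size)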

Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
--
David Gasaway
2018-06-29 16:39:53 UTC
Post by Nikolaus Rath
Yes. But HostnameNotResolvable is only raised if the specific hostname
is not resolvable. On lookup failure, S3QL (or, rather, python-dugong)
tries to look up a number of well-known hostnames (google,
root-servers). If those lookups fail too, the problem is considered
temporary. If only the hostname of the service fails to resolve, the
problem is considered permanent.
Hmm. I use dnsmasq and Google's DNS servers. I'm going to guess that
dnsmasq expired the Google Storage cache entry while at least one
other was still active. The only other scenario I can imagine is
Google DNS not resolving their own Google Storage hostnames. The
python-dugong behavior certainly seems reasonable, though.
Post by Nikolaus Rath
They can, as of version 2.27 :-). What I meant is that there's still
lots of code in fsck.s3ql that will move things to lost&found. It's hard
to give an exhaustive list of how the respective problems could be
triggered, so I'm hedging my assurances.
Ah, nice! I regret not upgrading sooner.
Post by Nikolaus Rath
No, I was being serious. This could be done. The read() failure can just
return an error code (EIO probably).
OK. I don't know when I'll find enough time to dig deep into this, but
I'll give it a try.

Related question. I was thinking it would be nice to change the
filesystem to read-only as soon as the backup is finished writing to
the cache. It looks like that would be really easy to implement with an
s3qlctrl action and failsafe mode. Yet, failsafe mode is currently
only used when corruption is detected. In fact, the mount.s3ql man
page actually lists an error code stating that read-only mount is not
supported. Is there some good reason this isn't allowed now?

Thanks.
--
-:-:- David K. Gasaway
-:-:- Email: ***@gasaway.org
--
Nikolaus Rath
2018-06-29 18:55:02 UTC
Post by David Gasaway
In fact, the mount.s3ql man
page actually lists an error code stating that read-only mount is not
supported. Is there some good reason this isn't allowed now?
No one wrote the code to support this.

Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
--