Re: [tlug] Help with fsck and ocfs2 (or even ext4?)...

Date: Tue, 05 Oct 2021 21:58:23 +0900
From: Jim Blackson <blackson@example.com>
Subject: Re: [tlug] Help with fsck and ocfs2 (or even ext4?)...
References: <20210928184725.4C0F.A5A534A3@a1d.co.jp> <CAAhy3dud=yzGQyxdSnjDrbVCyzfbNbQeUFTVrB_HUp=P9Hyg=Q@mail.gmail.com>

Hi Ray,

On Tue, 5 Oct 2021 13:52:01 +0800 Raymond Wan <rwan.kyoto@example.com> wrote:

> Sorry for the late reply and thanks a lot for your advice!  After
> Christian's first reply, I had already started shifting away from
> OCFS2 and over to NFS/ext4.  And that meant not doing fsck any more
> and copying files to external hard drives so that I can do a reformat.
> So, in terms of the danger of data loss, I am fine now!  Thank you!

You're welcome.  I am relieved to hear you are out of danger of data
loss.

> So, the problem was at one of the upper layers.  The hard disks appear
> to be healthy.  While one server was writing to the OCFS2 file system,
> it was restarted because it froze.  However, whether something within
> the SAN caused it to freeze, I don't know.  

Just being a little paranoid here...   You have a large file system
full of important data running on top of a complicated storage system 
(the SAN).  There was an outage.  Before the cause of the outage was
confirmed, it looked like the top level file system was being modified 
(by fsck). 

This is a common recipe for data loss. I was worried fsck could be
modifying or deleting critical inodes/pointers to your files, which
would mean losing track of where the data is located.  Taking a dd copy
of the logical volume containing the whole file system, and only then
attempting a recovery or file copy, is one safe option.  True, 70TB is
pretty big ... :-)
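
As a rough illustration of "image first, touch later", here is a minimal
Python sketch that copies a volume to an image file and records a
SHA-256 digest so the copy can be verified afterwards.  It is a stand-in
for dd, not a replacement for it; the function names and the 4 MiB chunk
size are assumptions made for this example, and the source is only ever
opened for reading.

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice for this sketch


def sha256_of(path):
    """Return the SHA-256 hex digest of the file (or device) at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_copy(src_path, dst_path):
    """Copy src to dst chunk by chunk; return the digest of the bytes read.

    The source is opened read-only, so nothing on it is modified.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK), b""):
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the value returned by image_copy() with sha256_of() on the
image confirms that what landed on the backup disk matches what was
read, before anything like fsck is run.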

I was also worried about the integrity of the SAN. It is very bad if the
SAN becomes corrupted.  Taking a dd copy of each physical HDD is my
preferred backup option if the SAN is suspect.  Recovery here means
virtually rebuilding the SAN and file system from the physical HDDs -
not for the faint of heart. $$$ helps too. :-)
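
For the per-disk case, a small sketch of how the dd invocations could be
generated, one per physical drive.  The device names and the backup
directory are hypothetical, the 1M block size is an arbitrary choice,
and the options shown (bs=, conv=noerror,sync, status=progress) are GNU
dd options.  The function only builds the argument list; it does not run
anything.

```python
import posixpath


def dd_image_command(device, image_dir, block_size="1M"):
    """Build a dd argument list that images one disk into image_dir.

    conv=noerror,sync tells dd to keep going after a read error and to
    pad the failed block, so the image keeps the same offsets as the disk.
    """
    image = posixpath.join(image_dir, posixpath.basename(device) + ".img")
    return [
        "dd",
        f"if={device}",
        f"of={image}",
        f"bs={block_size}",
        "conv=noerror,sync",
        "status=progress",
    ]


# Hypothetical device list; the real names depend on the hardware.
for dev in ["/dev/sdb", "/dev/sdc"]:
    print(" ".join(dd_image_command(dev, "/mnt/backup")))
```

On drives that are actually failing, GNU ddrescue is often preferred
over plain dd, since it retries and logs bad regions.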

> The SAN is healthy.  All of the lights are fine and the hard drives appear fine.

How about the SAN's internal system logs: any errors show up?

> For now, I've moved everything off to external hard drives,

Nice. It could be worth hanging on to those drives for a (long) while.

> So, back to one of my earlier ideas.  If I knew some part of the
> directory structure was corrupted, is it possible to "edit" the data
> structure of the directories (i.e., using debugfs.ocfs2, for example,
> which I believe has an ext4 equivalent) to "fix" the problem?  If this
> were to happen again, then I would like to consider this as one
> solution.

To me, it looks the same as using fsck. First I would confirm the
integrity of the SAN, then make backups and/or copy off what I could.
Once the important data is safe, one can then try to "fix" the file
system.
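
The ordering above can be written down as a tiny decision helper.  This
is only a sketch: the function name, the step labels and the device path
are made up for the example, and starting with a no-modify check is an
extra precaution assumed here, not something stated above.  The -n
option of e2fsck and fsck.ocfs2 answers "no" to every question, so the
file system is not changed.

```python
# Read-only (no-modify) check commands for the two file systems discussed.
READONLY_CHECK = {
    "ext4": ["e2fsck", "-n"],
    "ocfs2": ["fsck.ocfs2", "-n"],
}


def next_action(fs_type, device, san_verified, backup_verified):
    """Return (step, command) for the next safe step, in the order
    SAN check -> backup -> file system check."""
    if not san_verified:
        return ("verify-san", None)
    if not backup_verified:
        return ("backup", None)
    # Only once both earlier steps are done is the file system touched,
    # and even then with a check that does not write to it.
    return ("check-readonly", READONLY_CHECK[fs_type] + [device])


# Hypothetical device path, for illustration only.
print(next_action("ocfs2", "/dev/mapper/san-vol", True, True))
```

Any real repair (fsck without -n, or hand edits with a debugfs tool)
would come only after that.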

Best regards,
jimb.
