
Re: [tlug] Help with fsck and ocfs2 (or even ext4?)...



Hi Ray,

Raymond Wan <rwan.kyoto@example.com> wrote:
> ... I'm more concerned about data loss and how I can get this file system back into read/write mode.

Raymond Wan <rwan.kyoto@example.com> wrote:
> The data itself isn't "raw" data, but processed data that represents 
> about 9 months of processing time...I can't lose it or else I'm doomed... 
> Wait -- is this mailing list public?  Oops...  :-P

If data loss is a primary concern, please do not fsck around with (write
to) the SAN.  The worst case is losing the SAN entirely, so how about
shutting it down, carefully labelling and recording the position of each
HDD/SSD in the SAN (so you know which disk goes in which slot in which
array), then "dd/ddrescue" duplicating each and every drive to a matching
set of spares.  This new set is your backup.
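For the duplication step, GNU ddrescue is the usual tool because it can skip bad areas on a first pass and retry them later with a map file. A minimal sketch of the copy-and-verify idea, demonstrated on a scratch file so it is safe to run anywhere (the /dev/sdX and /dev/sdY device paths in the comments are placeholders to substitute):

```shell
# Copy-and-verify sketch on a scratch file.  On the real SAN, prefer
# GNU ddrescue with a map file:
#   ddrescue -f -n  /dev/sdX /dev/sdY sdX.map   # fast pass, skip bad areas
#   ddrescue -f -r3 /dev/sdX /dev/sdY sdX.map   # retry the bad areas 3 times
src=$(mktemp); dst=$(mktemp)
head -c 1M /dev/urandom > "$src"        # stand-in for the source disk
dd if="$src" of="$dst" bs=64K status=none
cmp -s "$src" "$dst" && result=ok || result=mismatch
echo "copy: $result"
rm -f "$src" "$dst"
```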

After that, you can reboot the SAN and immediately copy off as many
files as you can.

Once your critical data is copied and verified, then you can debug the
SAN and file system with a little less fear. :-)
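Verification can be as simple as a checksum manifest built from the rescued copy and re-checked afterwards. A sketch using sha256sum on a scratch directory (the directory and file names are stand-ins for wherever your copied data lands):

```shell
# Build a checksum manifest of the rescued copy, then re-verify it.
copy=$(mktemp -d); manifest=$(mktemp)
printf 'precious results\n' > "$copy/run1.dat"   # stand-ins for real data files
printf 'more results\n'     > "$copy/run2.dat"
( cd "$copy" && find . -type f -exec sha256sum {} + ) > "$manifest"
( cd "$copy" && sha256sum -c "$manifest" >/dev/null ) && verify=ok || verify=failed
echo "verification: $verify"
rm -rf "$copy" "$manifest"
```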

As for SANs, sorry, I'm not familiar with ocfs2 and don't know your
configuration.  However, many commercial SANs I have seen come in for
recovery are composed of 5 or 6 layers.  One challenge is identifying
the source of errors while not making things worse.

The lowest layer is the individual SSD/HDD drives themselves.  These
are formed into hardware or software low-level RAIDs, which are in turn
combined into a few large second-level RAIDs.  These second-level RAIDs
are gathered into "pools" or "tiers" of storage for use by the SAN
system.  The SAN software then divides the pools into system, snapshot,
and logical volume areas: the logical volumes hold user data; the system
area maps user blocks to pool storage; and the snapshot area holds
internal system backups, if configured.
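If the lower layers are built from standard Linux pieces (software RAID via mdadm, pools via LVM; vendor appliances often hide these behind their own CLI), the stack can be inspected read-only from the bottom up. A sketch, assuming those tools are present on the host:

```shell
# Read-only inspection, bottom-up.  All of these commands only report state.
lsblk -o NAME,TYPE,SIZE >/dev/null 2>&1 && probe=ok || probe=unavailable
lsblk -o NAME,TYPE,SIZE 2>/dev/null      # disks, partitions, RAID/LVM members
command -v mdadm >/dev/null 2>&1 && mdadm --detail --scan   # low-level RAID arrays
command -v lvs   >/dev/null 2>&1 && { pvs; vgs; lvs; }      # pools / logical volumes
echo "probe: $probe"
```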

The worst case is a failed rebuild of a live low-level RAID.  A rebuild
overwrites all the data on that RAID; a failure will corrupt the
second-level RAID, which corrupts the pool, which corrupts the system
mapping and logical user data. When that happens you don't know what you
have, and you don't know where it is.

Is your SAN healthy?  One possibility is a failing HDD: retries on bad
sectors would explain the access delays, and bad data read back could
produce a bad inode number.
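One read-only way to test that hypothesis is to stream the suspect disk end to end and watch for read errors (smartmontools' `smartctl -a /dev/sdX` will also report pending/reallocated sectors, if it is installed). A sketch on a scratch file; `dev` is a stand-in for the suspect device:

```shell
dev=$(mktemp); errlog=$(mktemp)
head -c 1M /dev/urandom > "$dev"        # stand-in for the suspect /dev/sdX
# conv=noerror keeps dd reading past bad sectors; diagnostics go to stderr.
dd if="$dev" of=/dev/null bs=64K conv=noerror status=none 2>"$errlog"
[ -s "$errlog" ] && scan=errors || scan=clean
echo "surface scan: $scan"
rm -f "$dev" "$errlog"
```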

Hope this helps,
jimb.
