Re: [tlug] Help with fsck and ocfs2 (or even ext4?)...



Hoi Raymond,

On Sat, Sep 18, 2021 at 10:24:50PM +0800, Raymond Wan wrote:
> The file system is an Oracle Cluster Filesystem (ocfs2) on a SAN.
    
[recalling]
- with ocfs2, there are basically one or more block devices, which
  multiple systems access at the same time
- so the data is in one place, and the accessing systems talk to each
  other over the network, so that they can lock files for writing

> Upon restarting, it was mounted as read only.  I ran fsck.ocfs2 and
> many lost files were found.  But I ran it again to make sure it was
> fine and this error came up: [..]
> No matter how many times I run the command, I get these same errors.
> There is a debugfs.ocfs2 command which I can run interactively and
> figure out that 4273782381 isn't a file I need.  But I don't know if
> this command can fix my problem -- I've never used it or the ext4
> equivalent.

I can't make sense of the output either.

> When I mount it, it now mounts as read-write but once I enter the
> directory in question, the file system switches to read-only.  There
> isn't a whole lot of information for ocfs2.  I tried to register
> myself on the ocfs2 mailing list and it seems they're not approving my
> application.  I also tried ServerFault with no luck.

I think Oracle provides support for ocfs2, so if these servers are
under a support contract, I would contact them.

> Anyway, any suggestions would be appreciated!  I am considering
> copying the files out and reformatting as a last resort, but at 70 TB,
> I would rather not...  :-(

- At least GFS2 also has a mount mode which basically disables the network
  locking for a moment and says "believe me, you are the only node with
  access"; a mode like that might work better
- If the block device were smaller, it would be best practice to make a
  copy to a file or another block device, and only then try repairs.
  If 70 TB of spare space does not exist, taking snapshots (dmsetup
  snapshot, or on the SAN storage itself) could also provide a safety net
  for repair attempts
- But I suspect you will end up copying off what you can still read,
  and restoring the rest from a backup
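
The last point (copy off what is still readable, restore the rest from
backup) can be sketched roughly as below. This is only a minimal Python
sketch and not something taken from this thread: the function name
salvage_tree and the example paths are made up for illustration, and it
assumes that unreadable files or directories simply show up as ordinary
I/O errors (OSError) while walking the tree.

```python
import os
import shutil


def salvage_tree(src, dst):
    """Copy every readable file under src to dst.

    Returns the list of paths that could not be read, so they can be
    restored from backup afterwards. (Hypothetical helper, illustration only.)
    """
    failed = []

    def on_walk_error(err):
        # os.walk calls this when a directory cannot be listed,
        # e.g. because of a damaged directory block.
        failed.append(err.filename)

    for root, dirs, files in os.walk(src, onerror=on_walk_error):
        # Recreate the same relative directory below dst.
        rel = os.path.relpath(root, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)

        for name in files:
            source = os.path.join(root, name)
            try:
                # copy2 also tries to preserve timestamps and permissions.
                shutil.copy2(source, os.path.join(target_dir, name))
            except OSError:
                # Record the failure and continue with the next file.
                failed.append(source)

    return failed
```

Called as salvage_tree("/mnt/ocfs2", "/mnt/rescue") (both paths are
hypothetical), it returns the paths it had to skip, which is then the
list of things to look for in the backup. For 70 TB a tool like rsync is
probably the more practical choice; the sketch only shows the idea of
skipping and recording what cannot be read instead of aborting.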

> On an unrelated note, does anyone have an opinion about GFS2 and if
> it's better than ocfs2?

GFS2 has the same operation mode, so it's also "coordinating multiple
systems accessing a single block device".  We offer support for GFS2,
but honestly, in most cases I would like to hear the exact use case and
based on that decide whether GFS2 is really needed.  In many cases,
keeping data on XFS or ext4, and sharing it via NFS, would provide
better availability and be easier to administer.
Ideally, if one or more nodes accessing an ocfs2 or GFS2 volume
go down, the others stay operating.  If a single/unclustered NFS
server goes down, that whole service is down.  But the easier
administration might effectively make up for that.

Christian

On Sat, Sep 18, 2021 at 10:24:50PM +0800, Raymond Wan wrote:
> Hi all,
> 
> I'm having a problem with a file system at work and I'm unfortunately
> not very good with these type of problems.
> 
> The file system is an Oracle Cluster Filesystem (ocfs2) on a SAN.
> While performing a copy, one of the servers attached to it was frozen
> so I thought I should restart it.  Well, not surprisingly, the file
> system didn't like it.
> 
> Upon restarting, it was mounted as read only.  I ran fsck.ocfs2 and
> many lost files were found.  But I ran it again to make sure it was
> fine and this error came up:
> 
> ...
> Pass 1: Checking inodes and blocks
> [Scanning inodes 100%]
>   I/O read disk/cache: 1000MB / 392MB, write: 0MB, rate: 2.10MB/s
>   Times real: 11m57.073s, user: 4m1.084s, sys: 0m1.030s
> Pass 2: Checking directory entries
> pass2: Bad magic number in directory block while reading dir block 1439634968
> pass2: Bad magic number in directory block while reading dir block 1439634969
> pass2: Bad magic number in directory block while reading dir block 1439634970
>   I/O read disk/cache: 16MB / 2239MB, write: 0MB, rate: 1.48MB/s
>   Times real: 0m11.971s, user: 0m1.156s, sys: 0m0.041s
> Pass 3: Checking directory connectivity
> [DIR_DOTDOT] Directory inode 4273782381 is referenced by a dirent in
> directory  4273782380 but its '..' entry points to inode 0. Fix the
> '..' entry to reference 4273782380? <y> y
>  fix_dot_dot: Bad magic number in directory block while iterating
> through dir inode 4273782380's directory entries.
>  I/O read disk/cache: 0MB / 1MB, write: 0MB, rate: 0.00MB/s
>  ...
>  All passes succeeded
> 
> No matter how many times I run the command, I get these same errors.
> There is a debugfs.ocfs2 command which I can run interactively and
> figure out that 4273782381 isn't a file I need.  But I don't know if
> this command can fix my problem -- I've never used it or the ext4
> equivalent.
> 
> When I mount it, it now mounts as read-write but once I enter the
> directory in question, the file system switches to read-only.  There
> isn't a whole lot of information for ocfs2.  I tried to register
> myself on the ocfs2 mailing list and it seems they're not approving my
> application.  I also tried ServerFault with no luck.
> 
> I expanded my search to see what steps are there for ext4 and what
> I've seen so far is "keep running fsck until it's happy".  And in the
> end, if you can't make fsck happy, then you copy the files out and
> reformat...  I'm unable to find another solution.  I'm happy to erase
> the problematic files, actually...  Anything to save the other
> files...
> 
> Anyway, any suggestions would be appreciated!  I am considering
> copying the files out and reformatting as a last resort, but at 70 TB,
> I would rather not...  :-(
> 
> On an unrelated note, does anyone have an opinion about GFS2 and if
> it's better than ocfs2?
> 
> Thank you!
> 
> Ray
> 

