Mailing List Archive


Re: [tlug] Help with fsck and ocfs2 (or even ext4?)...

Hi Christian,

Thanks a lot for this! I've been struggling with this problem for a few days but OCFS2 in general for a lot longer and have had very little luck hearing what more experienced people have to say...

I really appreciate your time in your response as I have very little experience with SANs...learning as I go...

On 19/9/2021 3:40 pm, Christian Horn wrote:
On Sat, Sep 18, 2021 at 10:24:50PM +0800, Raymond Wan wrote:
The file system is an Oracle Cluster Filesystem (ocfs2) on a SAN.
- with ocfs2, there are basically one or more block devices, which
   multiple systems are accessing
- so the data is in one place, and the accessing systems are talking
   via network, so that they can lock files for writing

Based on my limited understanding, that is correct.

When I mount it, it now mounts as read-write, but once I enter the
directory in question, the file system switches to read-only.  There
isn't a whole lot of information on ocfs2.  I tried to register
on the ocfs2 mailing list, but it seems they're not approving my
application.  I also tried ServerFault with no luck.
I think Oracle provides support for ocfs2, so if these servers are
under a support contract, I would contact them.
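As a side note on the symptom itself: when ocfs2 flips a mount to read-only, the kernel usually logs the reason, and the on-error behaviour is a mount option much like ext4's. A quick sketch, where the device and mount point are placeholders for illustration:

```shell
# Check the kernel log for the reason the filesystem went read-only:
dmesg | grep -i ocfs2

# The on-error behaviour is controlled at mount time, analogous to
# ext4's errors= option; remount-ro is the usual default:
mount -t ocfs2 -o errors=remount-ro /dev/mapper/san-vol /mnt/san
```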

Unfortunately, we don't have a support contract for OCFS2. We're running Ubuntu, and I'm using the official package distributed with Ubuntu 20.04.

I'm not sure if a support contract is still possible. But if my employers were willing to purchase that, they should have bought double the disk space so that I could do a proper backup... :-P

Anyway, any suggestions would be appreciated!  I am considering
copying the files out and reformatting as a last resort, but at 70 TB,
I would rather not...  :-(

- GFS2 at least also has a mount mode which basically disables the network
   locking for a moment and says "believe me, you are the only node with
   access"; that might work better
- if the block device were smaller, it would be best practice to make a
   copy to a file or another block device, and only then try repairs.
   If that 70 TB buffer does not exist, taking snapshots (dmsetup snapshot,
   or on the SAN storage itself) could also provide a safety net for
   repair attempts
- but I suspect you will end up copying off what you can still read,
   and restoring the rest from a backup

I see... I was hoping that there was some way to "repair" the inode information. Actually, the area on the disk that is causing problems for me can be erased...if I could just do that, that would be great!

All those times I ran (or auto-ran) fsck on ext4 partitions, the journal saved my data; I guess it's about time that my luck ran out. I'm just a bit disappointed that there's no manual way to fix this problem, even when fsck.ocfs2 tells me exactly where the problems are.

I've been spending the last couple of days scrounging up 70 TB of space to copy out the files that are fine. I guess I will end up formatting this drive and restoring from that backup.

Thank you for your comments! I thought what I was doing was "dumb"...I see now that it might be my only option.

The data itself isn't "raw" data, but processed data that represents about 9 months of processing time...I can't lose it or else I'm doomed... Wait -- is this mailing list public? Oops... :-P

On an unrelated note, does anyone have an opinion about GFS2 and if
it's better than ocfs2?

GFS2 has the same operation mode, so it's also "coordinating multiple
systems accessing a single block device".  We offer support for GFS2,
but honestly, in most cases I would like to hear the exact use case and
based on that decide whether gfs2 is really needed.  In many cases,
keeping data on an XFS or ext4, and sharing via NFS, would provide
better availability and be easier to administer.
Ideally, if one of multiple nodes accessing an ocfs2 or gfs2 volume
goes down, the others keep operating.  If a single, unclustered NFS
server goes down, the whole service is down.  But the easier
administration might effectively make up for that.
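For reference, the NFS setup Christian describes is only a few lines. A minimal sketch, with the export path and client subnet as placeholders:

```shell
# /etc/exports on the server -- path and subnet are illustrative:
#   /srv/data  192.0.2.0/24(rw,sync,no_subtree_check)

# After editing /etc/exports, re-export and verify:
exportfs -ra
exportfs -v

# On each client (or the equivalent line in /etc/fstab):
mount -t nfs nfsserver:/srv/data /mnt/data
```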

My limited understanding was that since this is a SAN, my only option was to use OCFS2 or GFS2.

Actually, the vendor of the SAN performed the initial installation (I won't say who the vendor was, but let's say their name rhymes with "Dell" :-P ). And they used ext4. Since they're the experts, I didn't question it. Within minutes of using it on our cluster, files started mysteriously disappearing. It was quite frustrating.

I asked on ServerFault and a couple of people clarified to me that ext4 wouldn't work. I still don't understand it...I thought a SAN could look after the disk the same way a server looks after an ext4 disk that is NFS exported...

I guess it's because it's a block device? The decision to use OCFS2 over GFS2 was a 50/50 decision. Now that I've had to remove all of the data to format the drive, I'm thinking maybe I should give GFS2 a try...

Neither seems to have a lot of help pages on the Internet. I presume it's because they have few users...

Perhaps I will give GFS2 a try. I just hope it's better and not worse... We have many ext4 file systems on the servers that are NFS mounted. So far, none of them have given me any problems in the last few years. But this SAN makes me want to scream every couple of months... *sigh*
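In case it helps anyone attempting the same switch: creating a GFS2 volume wants a cluster lock-table name and one journal per node that will mount it, and the "single node" escape hatch Christian mentioned is the lock_nolock protocol. A sketch with made-up cluster and device names:

```shell
# GFS2 needs a lock table of the form <clustername>:<fsname> and
# one journal (-j) per node that will mount the filesystem:
mkfs.gfs2 -p lock_dlm -t mycluster:sandata -j 4 /dev/mapper/san-gfs2

# Normal clustered mount (DLM locking via the cluster stack):
mount -t gfs2 /dev/mapper/san-gfs2 /mnt/sandata

# The "believe me, you are the only node" mode: override the lock
# protocol at mount time -- only safe if no other node has it mounted:
mount -t gfs2 -o lockproto=lock_nolock /dev/mapper/san-gfs2 /mnt/sandata
```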

Thanks again for your reply!

