Re: [HPDD-discuss] Anyone Backing up a Large LUSTRE file systems, any issues
by Dilger, Andreas
On 2013/29/05 3:17 PM, "gary.k.sweely(a)census.gov"
<gary.k.sweely(a)census.gov> wrote:
>Our backend storage is all virtualized using IBM SVC clusters.
That is probably not going to work very well for performance, especially
if OSTs are
going to share the same underlying disks.
> We can have the SVC cluster maintain a mirror or create a snapshot for
>the backup. Assuming the snapshot requires a short lockup of the MDS
>write services, it seems creating a mirror and a snapshot of that mirror
>would maintain accessibility of the MDS. The mirror/snapshot could be
>presented to backup for DR, or for backup/restores of the MDS after major
>file corruption. The MDS space is small enough that we can probably
>accept the cost to mirror that.
>
>
>We can't do the same for the OSTs. It is too much to mirror. I'm
>thinking the OSTs would be backed up by several LUSTRE Client backup
>media servers. The clients would be backing up different subdirectories
>concurrently and then be restoring them concurrently to meet our RTO.
>Seems like it should be handled OK. But I may be misunderstanding how
>LUSTRE works and clearly don't know the internals to know if this might
>cause it to choke.
That is exactly what Lustre is designed to do - parallel IO.
>
>So question becomes, I have major failure. I restore the MDS from backup.
You only need to restore the MDT from backup if the MDT had a critical
failure. In every other case, you restore specific files from your
file-level backup. It is possible to use e.g. "lfs find" to find files
located on a specific failed OST if there is a critical OST failure. In
that case, you would unlink the failed files located on that OST, and then
restore them from backup.
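For example, something along these lines (the OST UUID, paths, and options
here are illustrative and should be checked against your Lustre version):

  # list the files that have objects on the failed OST
  lfs find --obd lustre-OST0004_UUID /mnt/lustre > /tmp/ost0004-files.txt
  # unlink the damaged files, then restore those same paths from the
  # file-level backup
  xargs -d '\n' -a /tmp/ost0004-files.txt rm -f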
>I rebuild the OST server OS from an image (Satellite service). Then I
>restore the files from the most recent client backup. How well does the
>MDS handle all the files being recreated by several clients when the
>restored MDS had a previous image of metadata that no longer exists?
I think you are confused. If you are restoring a file-level backup then
Lustre doesn't care what you are writing (i.e. restoring), it is just a
file write.
Cheers, Andreas
>Does it just overwrite all of its tables? Or does it start choking
>because of the streaming writes from multiple concurrent restore sessions?
>
>---------------------------------------------------------------
>Truth 23. Your common sense is not always someone else's common sense.
>Don't assume that just because it's obvious to you, it will be obvious to
>others.
>
>---------------------------------------------------------------
>Gary K Sweely, 301-763-5532, Cell 301-651-2481
>SAN and Storage Systems Manager
>US Census, Bowie Computer Center
>
>Paper Mail to:
>US Census, CSVD-BCC
>Washington DC 20233
>
>Office Physical or Delivery Address
>17101 Melford Blvd
>Bowie MD, 20715
>
>
>-----"Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com> wrote: -----
>To: "gary.k.sweely(a)census.gov" <gary.k.sweely(a)census.gov>,
>"hpdd-discuss(a)lists.01.org \" <hpdd-discuss(a)lists.01.org>\""
><hpdd-discuss(a)lists.01.org " <hpdd-discuss(a)lists.01.org>">
>From: "Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com>
>Date: 05/29/2013 04:51PM
>Cc: "Dilger, Andreas" <andreas.dilger(a)intel.com>, Chris Churchey
><churchey(a)theatsgroup.com>, "james.m.lessard(a)census.gov"
><james.m.lessard(a)census.gov>, "Holtz, JohnX" <johnx.holtz(a)intel.com>,
>Prasad Surampudi <prasad.surampudi(a)theatsgroup.com>,
>"raymond.illian(a)census.gov"
> <raymond.illian(a)census.gov>, "anthony.t.li(a)census.gov"
><anthony.t.li(a)census.gov>
>Subject: RE: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
>systems, any issues
>
>Gary,
>
>
>The MDT backup is there to provide a fast time to restore of the MDS
>should that service be irrevocably compromised. It would not play a part
>in normal file restore operations
> of e.g. individual files or directories (nor would it provide any
>benefit if the whole file system was lost completely). Backing up the MDT
>is a shortcut for the eventuality of losing just the metadata. In your
>scenario, how do you intend to mirror the MDTs?
>
>You can multi-home the Lustre servers and attach backup hosts exclusively
>to that network. You will need to do this for the MGS, MDS and OSS
>servers. If your backup system
> supports it, attaching the backup servers to the tape library through an
>FC SAN is way less overhead than Ethernet.
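As an illustration, a multi-homed LNET setup on the servers might look
roughly like the following (interface and network names are assumptions;
adjust to your fabric):

  # /etc/modprobe.d/lustre.conf on MGS/MDS/OSS nodes
  options lnet networks="tcp0(eth0),tcp1(eth1)"

Regular compute clients would then stay on tcp0, while the backup/data-mover
clients mount the filesystem over tcp1 so their traffic is kept on the
dedicated interface.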
>
>Ethernet bonding can deliver benefits but is very implementation
>dependent (switches also play a part here) and I'm not well versed. The
>use cases for Linux are pretty widely
> documented though.
>
>Malcolm.
>
>--
>Malcolm Cowe, Systems Engineer
>Intel High Performance Data Division
>malcolm.j.cowe(a)intel.com
>+61 408 573 001
>
>From: gary.k.sweely(a)census.gov [mailto:gary.k.sweely@census.gov]
>
>Sent: Thursday, May 30, 2013 1:49 AM
>To: Cowe, Malcolm J; hpdd-discuss(a)lists.01.org
>Cc: Dilger, Andreas; Chris Churchey; james.m.lessard(a)census.gov;
>Holtz, JohnX <johnx.holtz(a)intel.com>;
>Prasad Surampudi <prasad.surampudi(a)theatsgroup.com>;
>raymond.illian(a)census.gov; anthony.t.li(a)census.gov
>Subject: RE: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
>systems, any issues
>
>
>
>Thanks, Lots of helpful information.
>
>
>
>Our LUSTRE service requirement isn't urgent enough (doesn't have the
>clout) to justify a full mirror of the LUSTRE FS for HA and DR purposes.
>So we will be falling back to a lower level of backup/DR functionality,
>focused on DR restoration in 5 days and the ability to rapidly restore a
>file from the previous day's backup.
>
>
>
>If I create an image backup of the MDSs, get a client-based file backup
>of the OST's over a longer window or different time period, and then have
>to do a full restore, will the OST's be so far out of sync with the MDT's
>that the file system would be unreadable, or is there a process to
>resynchronize the MDT data with the actual files in the OST's?
>
>
>
>Sounds like I should be focused on:
>
>
>* Getting solid backup of the MDS servers and at least the MDT volumes.
> Configuration design would be to mirror the MDS server/servers. Snapshot
>the mirror. Run backup against the snapshot.
> This provides rapid recovery of the MDS using the mirror, and DR using
>the backup, with no downtime to the MDS/MDTs. Because the MDS/MDTs are
>smaller in size, with a total of less than 1TB, I can probably afford to
>mirror these.
>
>
>
>If I don't want to tie up the disk space of a mirror, then just snapshot
>the MDS and run the backup against the snapshot, but this requires a
>short period of locked access to the MDT's while the snapshot is being
>generated.
>
>
>
>* Getting basic file backups from within the LUSTRE file system for
>individual file restores. I need to set up a few more LUSTRE clients to
>act as backup media servers that can concurrently back up directories and
>files of the LUSTRE file systems. The quantity of "media" servers depends
>on how many directories they can back up in the backup window and how
>much load they create on the OSSs and MDS while running the backup.
>
>Because we have jobs running round the clock the backups will be
>competing for IO from the OSSs.
>
>Our OSS ethernet fabric is currently limited to 10ge. This leads to more
>research questions about what kind of load the backup will generate
>against the OSSs, to determine how it will impact regular workload
>activities.
>
>
>
>Can an OST be presented out over two different IP addresses on a single
>OSS, such that we could push the backup/restores over their own interface
>and IP address?
>
>An alternative would be bonding multiple 10ge interfaces together.
>
>
>Has anyone tested whether 2 bonded 10ge interfaces perform as well as 2
>separate fully loaded 10ge interfaces?
>
>
>
>Which also leads to a question about CPU load. Does LUSTRE distribute IO
>load well across multiple CPUs?
>
>The ethernet traffic for bonded 10ge is likely to use the full capacity
>of a couple of CPUs. So we would need to see the load distributed across
>multiple CPU cores, and hopefully have the ethernet IO kept adjacent to
>the core servicing the file activity.
>
>
>
>We will also need to take into consideration the backup network load and
>the number of LUSTRE backup clients that are running, to help define how
>many OSSs are needed and how many OSTs of what sizes would be on each
>OSS.
>
>
>
>Additional research digging: is a file service backup process likely to
>thrash the LUSTRE caching tables?
>
>We are going to be running a test configuration with a few OSSs, a few
>OSTs, and a couple of client servers, so I guess we will see what load
>our backup system creates and what performance design issues arise. This
>is going to turn into an interesting exercise, not meant to be a full
>benchmark, but instead a feasibility analysis to help determine LUSTRE
>design requirements for the environment we go to production with.
>
>
>
>---------------------------------------------------------------
>Truth 23. Your common sense is not always someone else's common sense.
>Don't assume that just because it's obvious to you, it will be obvious to
>others.
>
>---------------------------------------------------------------
>Gary K Sweely, 301-763-5532, Cell 301-651-2481
>SAN and Storage Systems Manager
>US Census, Bowie Computer Center
>
>Paper Mail to:
>US Census, CSVD-BCC
>Washington DC 20233
>
>Office Physical or Delivery Address
>17101 Melford Blvd
>Bowie MD, 20715
>
>
>
>-----"Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com> wrote: -----
>To: "Dilger, Andreas" <andreas.dilger(a)intel.com>
>From: "Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com>
>Date: 05/29/2013 06:30AM
>Cc: "gary.k.sweely(a)census.gov" <gary.k.sweely(a)census.gov>,
>"hpdd-discuss(a)lists.01.org" <hpdd-discuss(a)lists.01.org>,
> Prasad Surampudi <prasad.surampudi(a)theatsgroup.com>, "Holtz, JohnX"
><johnx.holtz(a)intel.com>, "raymond.illian(a)census.gov"
> <raymond.illian(a)census.gov>, Chris Churchey <churchey(a)theatsgroup.com>,
>"james.m.lessard(a)census.gov" <james.m.lessard(a)census.gov>
>Subject: RE: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
>systems, any issues
>Agreed. In addition to a file based backup strategy, capturing the MDTs
>with a device level backup can protect against catastrophic loss of the
>MDT. In this situation, restoring the MDT from
> backup and running consistency checks on the file system will be far
>quicker than recreating the FS and implementing a full restore.
>
>A lot depends on the criticality of the service being supported and the
>SLA for operational availability of the platform. A strategy that is
>built on device level backup of the MDTs along with
> file-based backup of the whole file system should provide sound
>coverage. In addition, replication, properly implemented, represents the
>fastest time to recovery in the context of a DR plan and can be useful in
>quickly rectifying mistakes in production systems
> as well. The overhead is a reduction in overall usable capacity, as well
>as the additional processes required to fail over and fail back.
>
>Regards,
>
>Malcolm.
>
>
>From: Dilger, Andreas
>
>Sent: Wednesday, May 29, 2013 7:20 PM
>To: Cowe, Malcolm J
>Cc: gary.k.sweely(a)census.gov;
>hpdd-discuss(a)lists.01.org; Prasad Surampudi; Holtz, JohnX;
>raymond.illian(a)census.gov; Chris Churchey;
>james.m.lessard(a)census.gov
>Subject: Re: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
>systems, any issues
>
>
>
>I'd still also recommend a device level backup (using "dd", preferably of
>a snapshot) for the MDT filesystem. This is absolutely critical
>information, and backup/restore using "dd" is much more efficient than
>file-level backups, and not unreasonable given the
> relatively small size of the MDT compared to the total filesystem size.
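As a rough sketch only (this assumes the MDT sits on an LVM volume; device
names, sizes, and paths are placeholders, and an SVC-level snapshot could be
substituted for the LVM one):

  # create a snapshot of the MDT volume, image it with dd, then drop it
  lvcreate -s -L 10G -n mdt-snap /dev/vg_mdt/mdt
  dd if=/dev/vg_mdt/mdt-snap of=/backup/mdt-backup.img bs=1M
  lvremove -f /dev/vg_mdt/mdt-snap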
>
>Cheers, Andreas
>
>
>On 2013-05-28, at 17:27, "Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com>
>wrote:
>
>
>Hi Gary,
>
>I would recommend a file-based backup strategy where the backup processes
>run on Lustre clients that are connected to the backup infrastructure. In
>fact this is the only realistic way to be
> able to provide targeted restores of files/directories. We quite often
>see data management or mover nodes in HPC architectures: servers on the
>boundary of the cluster that can interface with external data systems
>such as tape libraries, either over a network
> or fibre channel. By managing the backups like this, there is no need to
>interface directly with the OSTs or MDTs and most if not all backup
>applications will work perfectly well on the data management Lustre
>client.
>
>One might also want to consider an online duplicate of the most critical
>data by syncing to a separate lustre fs, since restore time from a tape
>vault can be considerable for a large volume
> of data. Several strategies exist, depending on requirements and the
>applications in use.
>
>
>Regards,
>
>Malcolm.
>
>--
>Malcolm Cowe, Systems Engineer
>Intel High Performance Data Division
>malcolm.j.cowe(a)intel.com
>+61 408 573 001
>
>From: hpdd-discuss-bounces@lists.01.org
>[mailto:hpdd-discuss-bounces@lists.01.org]
>On Behalf Of gary.k.sweely(a)census.gov
>Sent: Wednesday, May 29, 2013 1:16 AM
>To: hpdd-discuss(a)lists.01.org
>Cc: Prasad Surampudi; Chris Churchey; Holtz, JohnX;
>raymond.illian(a)census.gov;
>james.m.lessard(a)census.gov
>Subject: [HPDD-discuss] Anyone Backing up a Large LUSTRE file systems,
>any issues
>
>
>
>Has anyone identified issues with backing up and restoring a large LUSTRE
>file system?
>
>
>
>We want to be able to backup the file system and restore both individual
>files, and the full file system.
>
>Has anyone identified specific issues with backup and restore of the
>LUSTRE file system?
>
>Backup needs to run while users are accessing and writing files to the
>file system.
>
>
>
>Backup concern:
>
>
>1. How does it handle backup of data spread across multiple OST/OSS's yet
>maintain consistency of the file segments?
>2. Will the backup system require a backup media service pulling data
>over Ethernet, or can the OSS's do direct backup and restore of the EXT4
>file systems for full system backup/restores while maintaining
>consistency of the files spread across OSTs?
>3. Is there a specific backup product used to solve some of the file
>consistency issues?
>
>We would be using a large tape drive library cluster that can stripe the
>backup across multiple tape drives to improve backup media performance.
>This would most likely mean having several systems
>running backup concurrently to multiple tape drive stripe sets. I expect
>we would need to break the LUSTRE file systems into several backup
>segments running concurrently, which would also mean several independent
>restores to restore the whole system. But one
> major requirement is being able to restore a single file or directory
>when needed.
>Backup windows would be 8-14 hours.
>
>RTO of single file would need to be under 1 hour.
>
>RTO of full file system would be 4 days.
>
>RPO is one day's worth of project data, 1 week's worth of source data.
>
>We are considering a LUSTRE environment as follows:
>
>
>
>30TB-50TB source data, potentially will grow out to about 200TB.
>
>
>100TB to 500TB Project workspace.
>
>30TB of user Scratch space (does not need to be backed up).
>
>
>
>Initial total capacity 170TB growing to max size of 1PB.
>
>
>
>Most likely initially using 2TB OST's, across 11+ OSS's. We may use
>larger OST's if no issues are found in services/supportability/throughput.
>
>
>
>We were thinking of breaking the total space into separate file systems
>to allow using multiple MDS/MDT's for improving performance of the MDS's,
>which would also facilitate easier full LUSTRE
> file system backup/restores. But this means losing the flexibility of
>having one large file system.
>
>
>
>OSTs using EXT4 or XFS file systems.
>
>
>
>About 25 dedicated client servers with 20 to 40 CPU cores and 200GB-1TB
>RAM running scheduled batch compute jobs. This grows as loads dictate.
>
>Potentially add about 10-100 VMware Virtual client compute servers
>running batch jobs. (4 or 8 cores with 8 to 32GB ram).
>
>About 2-5 interactive user nodes, nodes added as load needs dictate.
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
Porting patches between branches
by Dilger, Andreas
I'd like to get some consistency with the commit messages when porting
patches between branches, and when porting patches to the upstream kernel.
There are a number of standard tags that are already used, such as
Signed-off-by: and Change-Id: that should continue to be used, but there
needs to be some consistent way to track if a patch landed on b2_1 is
based on a patch landed originally on another branch, possibly originally
under a different bug number and with a different commit summary.
Firstly, when porting a patch it is important to use "git cherry-pick
{hash}" or "git am {patch}" to preserve the original author, commit
comments, and patch summary line. It is fine to _expand_ the commit
comments if they were originally not clear enough, but rarely does it make
sense to reduce the comments. Patch descriptions like "backport LU-1234
fix to b2_1" are not very useful, since they impose a burden on the reader
to dig into Jira and Gerrit to find what the patch actually does.
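As a minimal sketch (the branch name matches the example above; the hash
and patch file name are placeholders):

  git checkout b2_1
  # cherry-pick keeps the original author, summary line, and commit comments
  git cherry-pick {hash}
  # or, when working from an exported patch file:
  git am {patch}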
If the patch was originally from a different Jira bug that was
rediscovered on a different branch, or if it is landing upstream then it
makes sense to link the old patch to the new bug for tracking. There has
already been the XXXXX-bug-id: tag in use for identifying patches from
Xyratex Jira, or originally from an Oracle bugzilla, so it is possible to
use:
Intel-bug-id: LU-NNNN
to identify the duplicate issue. The original Jira ticket should stay in
the patch summary, so that "git shortlog" can be used for comparison.
The current version of Gerrit does not play well with re-using the same
Change-Id: line for multiple branches, so a new Change-Id: line is needed.
We are planning to upgrade Gerrit after 2.4.0 is released, and maybe it
will be possible to keep the same Change-Id: in the future.
I'd like to keep links to the original Git commit hash and change like:
Lustre-commit: f62391735dce440addba54c9a7640e2fdff5ea23
Lustre-change: http://review.whamcloud.com/6383
The full commit hash is important for upstream kernel submission, since a
short commit hash like f62391 has a good chance of colliding with objects
in the upstream kernel repository (the kernel already has about 200k
patches, so 6-character abbreviations have a real chance of being
ambiguous).
Finally, it makes sense to preserve the original reviewers of the patch.
This automatically adds them as reviewers for the new patch in Gerrit as
well:
Reviewed-by: Andreas Dilger <andreas.dilger(a)intel.com>
Reviewed-by: Mike Pershin <mike.pershin(a)intel.com>
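Putting all of the above together, the commit message for a ported patch
might look something like the following, where LU-NNNN is the original
ticket kept in the summary, LU-MMMM is the duplicate ticket on the target
branch, and the hash, Gerrit URL, summary text, and names are placeholders:

  LU-NNNN subsystem: short description from the original patch

  <original commit comments, expanded if they were not clear enough>

  Lustre-commit: <full 40-character hash of the original commit>
  Lustre-change: http://review.whamcloud.com/NNNN
  Intel-bug-id: LU-MMMM
  Signed-off-by: Original Author <original.author(a)example.com>
  Change-Id: <new Change-Id generated for this branch>
  Reviewed-by: Andreas Dilger <andreas.dilger(a)intel.com>
  Reviewed-by: Mike Pershin <mike.pershin(a)intel.com>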
I'll add all of this information to the wiki as well, but wanted to get it
to a wider audience also.
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
filter_do_bio()) ASSERTION(rw == OBD_BRW_READ)
by Wojciech Turek
We have experienced an LBUG on one of our OSS servers which I have not
seen before.
LustreError: 16066:0:(filter_io_26.c:344:filter_do_bio()) ASSERTION(rw ==
OBD_BRW_READ) failed
LustreError: 16066:0:(filter_io_26.c:344:filter_do_bio()) LBUG
Pid: 16066, comm: ll_ost_io_126
Now, after rebooting that OSS, the same LBUG is triggered as soon as the
OSTs finish recovery and start serving their data. Has anyone seen this
before?
Our environment:
servers: RHEL6 2.6.32-220.17.1.el6_lustre.x86_64 Lustre-2.1.2
clients: RHEL6 2.6.32-358.6.2.el6.x86_64 Lustre-2.1.5 patchless
We have resolved this by first identifying which OST was involved in the
LBUG and then running fsck on that OST. The file system check found an
incorrect inode number and repaired it. It is very worrying, though, how
this corruption crept into the filesystem. We did not experience any
hardware problems or unexpected crashes of the storage that could cause
this.
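For reference, the repair was roughly along these lines (the device and
mount point names are illustrative; the OST has to be stopped first, and
the Lustre-patched e2fsprogs should be used):

  umount /mnt/ost0007                 # stop the affected OST on the OSS
  e2fsck -fy /dev/mapper/ost0007      # force a full check and repair
  mount -t lustre /dev/mapper/ost0007 /mnt/ost0007   # bring the OST back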
--
Wojciech Turek
Lustre 2.4 update - May 24th 2013
by Jones, Peter A
Hi there
Here is an update on the Lustre 2.4 release.
Landings
========
-A number of landings made - see http://git.whamcloud.com/?p=fs/lustre-release.git;a=shortlog;h=refs/heads...
-Landed support for RHEL6.4 update (LU-3354); this kernel update meant that an RC2 was required
Testing
=======
-Testing on the 2.4.0-RC1 tag has completed; testing on RC2 is underway
-RC1 was tested during dedicated time on Hyperion (17 OSS, 34 OSTs and 737 clients)
-ORNL will be testing the latest tag on Titan during a maintenance window this coming week
-Cray have been testing with RC1
Blockers
========
-Full list available at https://jira.hpdd.intel.com/issues/?filter=10292
-If there are any issues not presently marked as blockers that you believe should be, please let me know
Other
=====
-b2_4 branch now created
Thanks
Peter
Fwd: Regarding the Parallel File System..
by linux freaker
I have a few questions regarding a Lustre filesystem with 1 MDS/MDT on the
same machine, 12 OSTs configured with LVM across 2 OSSs in total, and 4
Lustre clients running Hadoop (1 namenode and 3 datanodes), with Hadoop
using Lustre instead of HDFS.
Question: I have created LVM volumes for the OSTs instead of using
physical hard disks. How is that going to affect my wordcount example
running on 1 namenode and 3 datanodes? Say it takes 30 min for a wordcount
over 18GB of plain data to finish; would using physical hard disks lessen
the time?
Question: I would like to use another dataset, such as a Wikipedia dump,
instead of the simple wordcount input. How should I put
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml...
into Lustre?
In the case of HDFS, I simply loaded it through the -copyFromLocal command.
Please suggest an approach for Lustre.
Re: [HPDD-discuss] [Lustre-discuss] will obdfilter-survey destroy an already formatted file system
by Dilger, Andreas
On 2013/21/03 4:09 AM, "Michael Kluge" <Michael.Kluge(a)tu-dresden.de> wrote:
>I have read through the documentation for obdfilter-survey but could not
>find any information on how invasive the test is. Will it destroy an
>already formatted OST or render user data unusable?
It shouldn't - the obdfilter-survey uses a different object sequence (2)
compared to normal filesystem objects (currently always 0), so the two do
not collide.
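For reference, a typical disk-backed run on an OSS looks roughly like the
following (the parameters and target name are only an example; check the
lustre-iokit documentation for your version):

  nobjhi=2 thrhi=16 size=1024 case=disk targets="lustre-OST0000" obdfilter-survey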
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
Getting per OST quota stats
by Brock Palen
We have some OST's that are way out of balance. We think a single user set up their stripe settings in an odd way.
Is there a way to read the quota database for a single OST? We have quotas enabled on this filesystem and we can quickly get total usage, but we want just the single OST.
Right now we are grinding through robinhood SQL directly or lfs find messiness.
Thanks!
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp(a)umich.edu
(734)936-1985
lustre-diagnostics tool dead?
by Nate Pearlstein
Has the lustre-diagnostics tool died? Is it no longer public?
--
Nate Pearlstein - npearl(a)sgi.com - Product Support Engineer