First, thanks for your patches and efforts spent on these cleanups.
On Sep 19, 2014, at 12:45 AM, Julia Lawall wrote:
> With respect to the upper case lower case issue, does the thing need to be
> a macro? I think that the lowercase is more or less fine, but only if
> what is behind it is a function.
I don't have a strong opinion either way, as long as we keep all the functionality.
> I say more or less fine, because normally in the kernel the special
> allocators have special purposes, eg allocating and initializing the xyz
> structure. Here what is wanted is a general purpose allocator with lots
> of special tracing features, so it is not quite the same thing. And one
> can wonder why all of these special tracing features are not relevant to
> the kernel as a whole?
Like I explained in my previous email, many of the tracing features are already
possible to replace with other existing in-kernel mechanisms like kmemleak.
Except for the total tally of allocations/frees, so that a memory leak can be
easily seen at module unload time. I think this would be useful for other
kinds of modules too, not just Lustre, so having that as a generic allocator
feature would be cool too.
> In reading through the description of the needed features, it seems like
> only the _ptr extension requires being a macro. Do we need that? The
We only need that as a less error-prone way of writing
x = obd_kzalloc(sizeof(*x), ….)
The real free function does not take a size argument, but we need one for the
total allocated/freed accounting. It is easy for the size argument of
obd_free to drift out of sync and end up wrong.
> rest of the kernel manages to do x = kzalloc(sizeof(*x),...) ok. It's
> unpleasant to have an assignment hidden in this way. And currently it is
> not used consistently. There are some OBD_ALLOCs that have the same form.
Yes, those are converted as they are noticed.
> Sorry for overlooking the frees. I was focusing on trying one thing at a
I kind of think it's a related issue.
Touching one requires touching the other, if not in the same patch then in
a following patch. That's why I think the considerations for what the FREEs would
need should be settled from the start, so the FREE-removal patch does not go and re-patch a bunch of just-patched allocs.
On Sep 19, 2014, at 5:11 AM, Dan Carpenter wrote:
> On Fri, Sep 19, 2014 at 02:57:03AM +0000, Drokin, Oleg wrote:
>> 4. Sometimes we need large allocations. general kmalloc is less
>> reliable as system lives on and memory fragmentation worsens. So we
>> have this "allocations over 2-4 pages get switched to vmalloc" logic,
>> if there's a way to do that automatically - that would be great.
> Julia's patch only changes OBD_ALLOC() functions and those are always
> kmalloc so that's not an issue here.
That's true. Though it would be strange to convert just a subset of the
allocating macros; a solution for OBD_ALLOC_LARGE would
also be needed.
> The OBD_ALLOC_LARGE() macro is vmalloc() or kmalloc() if the size is
> small enough. We don't really want to choose between kmalloc and
> vmalloc automatically. My instinct is that we should change all the
> OBD_ALLOC_LARGE() to vmalloc() and trust it to allocate them in the most
> sane way possible. But I haven't really looked very closely.
We found that vmalloc (at least with larger allocs; I did not look into the details)
carries quite a big penalty from an internal lock that creates heavy contention.
So of course the desire is there to reduce those allocations whenever we see a way to do so.
(Additionally, on 32-bit arches there's always the issue of the vmalloc region
possibly not being large enough to allocate everything we might want.)
On Sep 18, 2014, at 7:43 PM, Dan Carpenter wrote:
> On Thu, Sep 18, 2014 at 10:24:02PM +0200, Julia Lawall wrote:
>> From: Julia Lawall <Julia.Lawall(a)lip6.fr>
>> This patch removes some kzalloc-related macros and rewrites the
>> associated null tests to use !x rather than x == NULL.
> This is sort of exactly what Oleg asked us not to do in his previous
> email. ;P
Hey, Thanks for remembering me ;)
> I think there might be ways to get good enough tracing using standard
> kernel features but it's legitimately tricky to update everything to
> using mempools or whatever. Maybe we should give Oleg some breathing
> room to do this.
In fact I was mostly mourning the ENTRY/EXIT/GOTO stuff back then - I don't know how to
replace anything like that, even a bit at a time, at the scale we need.
The OBD_ALLOC()/FREE() macros served multiple purposes, most of which could now be handled in other ways:
1. General accounting of Lustre memory usage (all allocations made directly through the macros), with a warning printed at module unload if we freed less than we allocated, which would set off a search for the leak. At least we learn there is a leak fairly fast, we know how much was leaked, and we could probably find out how many allocations went unfreed if we wanted to add that statistic. I don't know how to replace this, so perhaps a macro for it would still be useful.
2. Hunting memory leaks (the variable allocated, where it was allocated, where it was freed, and the address of the allocation are all printed). On non-production systems this could now be replaced with the kernel memory leak detector; in fact kmemleak is even more convenient, since you don't need a script to match up logs to see what allocated what, and you get a backtrace printed as a bonus. I used it and really liked the result.
It's not a great fit for production, as kmemleak tends to eat memory like there's no tomorrow (at least in my config) and may require a kernel rebuild. But it's not as if getting people to gather proper debug logs was easy either, so we can probably do away with this.
3. Fault injection: there's now a generic way to do this in the kernel, so we can probably do away with this too.
4. Sometimes we need large allocations. Plain kmalloc becomes less reliable as the system stays up and memory fragmentation worsens, so we have this "allocations over 2-4 pages get switched to vmalloc" logic.
If there's a way to do that automatically, that would be great.
> I hate looking at the OBD_ALLOC* macros, but really it's not as if we
> don't have allocation helper functions anywhere in the rest of the
> kernel. It's just that the style of the lustre helpers is so very very
> ugly. It took me a while to spot that OBD_ALLOC() zeroes memory, for example.
> It should be relatively easy to re-write the macros so we can change
> formats like this:
> old: OBD_ALLOC(ptr, size);
> new: ptr = obd_zalloc(size, GFP_NOFS);
> old: OBD_ALLOC_WAIT(ptr, size);
> new: ptr = obd_zalloc(size, GFP_KERNEL);
> old: OBD_ALLOC_PTR(ptr);
> new: ptr = obd_zalloc(sizeof(*ptr), GFP_NOFS);
> Writing it this way means that we can't put the name of the pointer
> we're allocating in the debug output but we could use the file and line
> number instead or something. Oleg, what do you think?
I think we don't really need the allocated pointer and the name all that much now with kmemleak.
But we still need to remember the allocation amount like we do now (and when freeing them later).
This is where OBD_ALLOC_PTR/OBD_FREE_PTR come in handy - the size is derived from the structure size
automatically, leaving less room for error or unintentional mismatch (since kfree does not actually care
about the number of bytes freed).
So if you prefer to have everything lowercased, we could probably still do obd_zalloc and obd_zalloc_ptr?
(Of course, in some other world there might have been "context-aware" general alloc/free functions
that would know when an allocation came from a module context, do the tallying internally,
and warn on module unload if something did not match. But I imagine such module context determination
would not be easy. Perhaps registered callbacks for pools, called on alloc and on free, would work - though such pools would also need to allow allocating chunks of different sizes.)
> If we decide to mass convert to standard functions later then it's dead
> simple to do that with sed.
It's uglier with OBD_FREE*, though, where the size is needed while kfree/vfree does not take one.
Also, converting the allocs while not converting the frees makes the code even uglier (see the current patch at hand for an example of that).
> The __OBD_MALLOC_VERBOSE() is hard to read. It has side effect bugs if
> you try to call OBD_ALLOC(ptr++, size); The kernel already has a way to
> inject kmalloc() failures for a specific module so that bit can be
> removed. Read Documentation/fault-injection/fault-injection.txt
Yes, I think I agree here.
I installed the latest 2.5.3 version on the latest CentOS 6.5 version with
all upgrades and used the following commands to create a lustre file system:
1. Created /etc/modprobe.d/lustre.conf
options lnet networks=tcp0(eth0)
2. Started lnet
service lnet start
lctl net up
3. Created MGS and MDT:
mkfs.lustre --mgs /dev/sdb
mkfs.lustre --mdt --mgsnode=188.8.131.52@tcp0 --network=tcp0 --fsname=fs01
4. Updated /etc/ldev.conf
mgs - MGS /dev/sdb -
mgs - fs01:MDT0000 /dev/sdc -
5. Started Lustre:
service lustre start MGS
service lustre start fs01:MDT0000
6. And got the following error:
Sep 18 12:29:03 localhost kernel: LustreError: 11-0:
fs01-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect
failed with -11.
I checked that lctl ping 0@lo is working properly.
After installing Mellanox MOFED, I configured Luster in the following way
./configure --disable-server --with-o2ib=/usr/src/mlnx-ofed-kernel-2.3
and was able to compile kernel modules.
But the versions of the function symbols compiled into the ko2iblnd
module belong to the original kernel modules and not to the OFED
modules. As a result, the compiled module refuses to load.
For example, the ib_destroy_cq function.
And this is the address from the MOFED Module.symvers file
root@kickseed:/usr/src/lustre-release# grep ib_destroy_cq Module.symvers
It looks like an issue in Lustre.
The backward compatibility is usually listed in the release notes for each new release (it is more difficult to report future compatibility when the old release is originally made).
The 2.1 client _should_ be compatible with newer 2.4 and 2.5 servers, but we only ever test interoperability with the newest release of any branch (2.1.6 in this case). It is recommended to upgrade your clients to that release anyway to get a large number of bug fixes.
> On Sep 16, 2014, at 6:06, "vaibhav pol" <vaibhav4947(a)gmail.com> wrote:
> Hi ,
> Does anyone have a compatibility matrix of Lustre client versions with Lustre server versions?
> I'd like to know the highest Lustre server version with which the Lustre 2.1.1 client is compatible.
> Thanks and regards,
> Vaibhav Pol
> Senior Technical Officer
> National PARAM Supercomputing Facility
> Centre for Development of Advanced Computing
> Ganeshkhind Road
> Pune University Campus
> Lustre-discuss mailing list
I'm wondering if anyone has seen this before I open a bug report.
On a metadata server running Lustre 2.5.1-2, the 'exports' stats aren't
updating. Basically what you'd see with:
lctl get_param mdt.*MDT*.exports.*@*.stats
The md_stats and jobstats are updating, and the exports stats on the OSSes are
incrementing. I've not seen this on some 2.4 servers we're running.
as a long-standing member of the open source community (sell hardware, not
software!) and coincidentally, as a multi-alum of iu, i am delighted to
contribute my dirt-simple hack.
give the archway under memorial hall a pat for me...
i hope that this helps,
On 09/16/2014 08:57 AM, Stephen Simms wrote:
> Hi Steve-
> Here at IU we just hit the rm -rf in spades because lots of bio guys here are running Trinity which has a step that creates thousands of files in thousands of directories. We didn't realize the extent of the removal problem until we met with the bio folks last week. So, we are about to embark on creating a script like yours.
> Is there any chance you could share your script with the community or even just my team? Sadly, we are currently hamstrung and can't upgrade to 2.6. It would certainly save us some time scripting it and would give us something to offer the Trinity users in a hurry so they could clean up their files in a timely fashion after each run.
> If you can't / won't , I completely understand the difficulties of sharing codes with other organizations and institutions.
> Thanks very much for your time and consideration!
> Stephen Simms
> Lustre Community Representative Board Member, OpenSFS
> Manager, High Performance File Systems, Indiana University
>> On Sep 16, 2014, at 8:07, steve ayer <steve.ayer(a)trd2inc.com> wrote:
>> hi rick,
>> oh, yeah.
>> a very expensive filesystem operation, aggravated by myriad related bugs in <= 2.5.x. while still saddled with these versions i went so far as to write a little shell script that walked the directory-tree depth first and blitzed files singly to keep from hanging the mds.
>> the short answer is that the diagnosis of this problem will consume far more time -- after which you will still have the problem -- than the simplest solution, which is to upgrade your machinery to 2.6.
>>> On 09/15/2014 05:32 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>>> I have run into a situation where a user's "rm -rf" process seems to cause very high buffer usage on our mds server. I verified that this process was the cause of the issue by sending the STOP signal to the "rm" command and noticing that the growth of the buffers on the mds server slowed to a crawl. If I then sent the CONT signal, the buffer size would start growing again. I dropped all the caches on the mds server and timed the growth. It increased about 17 GB in 15 mins. I have been dropping the caches periodically in an effort to contain the growth while I investigate the problem. Unfortunately, after I drop the caches, the new low point is always a little higher than it was before, which means there will come a point where dropping the caches will no longer be effective.
>>> With that in mind, I have had to stop the user's "rm" process to contain the damage. Since that is only a temporary band-aid, I am trying to get a handle on what might be the underlying problem. I searched the lustre bugs, and the closest thing I could find was LU-4740 (and maybe LU-4906?), but it's not clear if those are the cause of my problem. The odd thing is that a recursive rm is not an uncommon command to run, and I have not noticed this behavior before.
>>> The server is running:
>>> - CentOS 6.5
>>> - kernel 2.6.32-358.23.2
>>> - lustre 2.4.3
>>> The client is running:
>>> - CentOS 6.2
>>> - kernel 2.6.32-358.23.2
>>> - lustre 1.8.9-wc1
>>> Has anyone else seen a similar issue?
>> HPDD-discuss mailing list
Using Lustre 2.5.3: 1 combined MDS/MDT, 44 OSTs, currently holding 120TB of
data in over 35M files.
Over the weekend, our MDS server crashed due to an IO hang. After restarting the
server, we started hitting the LU-5040 bug during recovery:
kernel BUG at fs/jbd2/transaction.c:1033!
kernel: invalid opcode: 0000 [#1] SMP
I attempted a restart of all OST and MDT mounts with abort_recov, and the
filesystem was able to mount on a client with all OSTs connected. However, the
first access to any files or metadata caused the MDS to panic and also show
indications of LU-5392.
Is this indicating a corrupted quota subsystem? I was trying to find a means
of rebuilding the quota records; however, "lfs quotacheck" is no longer
supported, as it states "since space accounting is always enabled".
If the quotas are corrupted, how can I recover them? Likewise, how can I
recover from the two bugs mentioned above? I have some time flexibility to
resolve it, if that would assist in getting the bugs addressed and my filesystem
recovered.
Any assistance would be appreciated.
Gary Molenkamp SHARCNET
Systems Administrator University of Western Ontario
Compute/Calcul Canada http://www.computecanada.org
(519) 661-2111 x88429 (519) 661-4000
On Sep 15, 2014, at 5:57 PM, Matt Bettinger <iamatt(a)gmail.com>
> what are you using to measure the buffers and what buffers are we
> talking about specifically?
I was just using "top" to monitor the amount of memory in the buffers and caches. I also used "slabtop" to monitor the /proc/slabinfo data, and I noticed that the amount of memory used by all the slabs seemed to be much less than the amount of memory "top" reported in buffers or caches.
> Also have same lustre client but on 5.8. I can do some test tomorrow
> to confirm . We use collectl--> graphite for monitoring.
Thanks. I appreciate it.
Senior HPC System Administrator
National Institute for Computational Sciences