Just to save time, and to bring everyone else up to speed, I'll just
include my responses and then yours to this thread and then finally
respond to your last message to me about this.
Let me know if I get any of the attribution wrong, this is going to be a
fun cut and paste...
On Thu, Jun 12, 2003 at 12:10:48PM -0700, Steven Dake wrote:
> Folks,
>
> I have been looking at the udev idea that Greg KH has developed.
> Userland device enumeration definately is the way to go, however, there
> are some problems with using /sbin/hotplug to transmit device
> enumeration events:
>
> 1) performance of fork/exec with lots of devices is very poor
You later said your binary of udev was 500k. I said:
500k binary? Who's running anything that big? udev weighs in
around 6k right now! You have to add a _lot_ of functionality
to move up to 500k.
You responded:
Hmm perhaps my binary was not built optmized when I looked at
it, I'll take another look.
> 2) lots of processes can be forked with no policy to control how many
> are forked at once potentially leading to OOM
I responded:
No, this is pure speculation. Have you actually tried this out?
I have. I've hotplugged pci drawers connected to huge numbers
of SCSI disks and never even seen a blip on system load. It
turns out that pretty much everything happens consecutively, with
a reasonable amount of time just spent probing the SCSI busses.
USB is pretty slow in enumerating their devices too, so I don't
see a real world application that has this problem. Do you have
any proof of this?
You responded:
I don't agree. I've timed bootup with 12 fibrechannel disks
with 4 partitions each with udev using the SDEF patches. I've
then timed bootup and added in the time required to add those 12
fibrechannel disks (via a script) using /sbin/hotplug. The
script method (executing /sbin/hotplug 12*4 times with the
appropriate data hardcoded) takes rougly 1.25-1.75 seconds
longer. Since I'm using a chronograph, its not very accurate,
but it is definately slower to enumerate with /sbin/hotplug
using this methodlogy. That is only 12 disks. What about the
case of 1000 disks?
fork and exec are very expensive system calls, to be avoided at
all costs...
To which I responded:
Well, if you don't dynamically link a 500K file, perhaps it
would be a lot faster to start up :)
Anyway, who really cares? This is _not_ a time critical thing.
Who really cares that it takes 1 second longer to see the device
node for the newly plugged in device? And if you plug in 5000
devices all at once, you better be willing to wait the extra
minute or so for the system to calm down and make all of the
devices available to userspace.
And again you came back with:
System bootup time and time between failures are the two factors
used to determine availability. Reducing bootup time and
increasing time between failures both improve availability. Even
1 second is considerable when 5 9s is 30seconds of downtime per
year. 1 more second is a considerable change in availability
when you run the equations.
I now respond:
Bootup time reduction would be a great thing to have. And I agree for
situations where 1 second is a considerable amount of time, it is
important. However, 99.99% of the people running Linux out there, do
not have those problems. And even then, for those 00.01% of the people
who do, they do whatever they possibly can to keep from having to reboot
at all. I still say the measurable ammount of time to plug in a new
disk and have it's device node created is not measurable, from using
/sbin/hotplug and udev, and using your character node. Also, the
ultimate solution for these kinds of people is the in-kernel devfs, or
the current static /dev. If they use either of them, it's guaranteed to
be faster then yours or mine implementation.
> 3) /sbin/hotplug events can occur out of order, eg: remove event occurs,
> /sbin/hotplug sleeps waiting for something, insert event occurs and
> completes immediately. Then the remove event completes, after the
> insert, resulting in out-of-order completion and a broken /dev. I have
> seen this several times with udev.
I responded:
Yes this happens. I have a fix for this for udev itself. No
kernel changes are needed. I'll show it at OLS in July if you
want to see it :)
You responded:
There is a problem in udev that fork doesn't wait for the parent
to exit (mknod) which is solved by using the mknod system call.
But there is still another problem that /sbin/hotplug can
execute out of order. I'll be at OLS so I'll take a look at
your solution then.
So we agree to talk about this at OLS, fine.
> 4) early device enumeration (character devices, etc) are missed because
> /sbin/hotplug is not available during early device init.
I responded:
Then use early initramfs to catch this. I've done this and seen
it work already. If you don't want to do early init, walk the
sysfs tree yourself from userspace, it works just as well, and
we don't have to do any buffering in kernelspace at all.
You responded:
Initramfs still misses early events. That leaves walking sysfs.
I have considered and rejected using a user space program to
walk sysfs for this information. There are several problems
1) kobject device name is not present in sysfs (how do you know
if its a character or block, then without some external mapping
db, yuck)
2) scan of sysfs is not atomic, resulting in potential lost
events and lost device enumerations
I responded:
You don't know this from hotplug either. You have to figure it
out from the kobject name either way. Actually it's quite easy,
only block devices show up in /sys/block, everything else is a
character device.
No, this is quite simple to fix:
- set up /sbin/hotplug udev handler after init has
started.
- start walking sysfs to "cold plug" devices.
- any new events that happen are caught by udev and
handled there, while you are walking the tree.
- no events are lost.
You responded:
Events could be handled in this situation twice, which is
definately not what is desireable.
I now respond:
Doing a mknod() on a node that is already present doesn't harm anything.
And since you have to keep a database around of what devices are
present, and what you have called them, removing them after already
removing them again, is detectable. So handing the same event twice
isn't a problem.
> To solve these problems, I've developed a driver and some minor
> modifications to the lib/kobject.c to queue these events and allow a
> user space program to read the events at its leisure. The driver
> supports poll/select for effecient use of the processor.
>
> The feature is called the System Device Enumeration Event Queue. I've
> attached a tarball which includes udev-0.1 modified to use the SDEQ, a
> kernel patch against linux 2.5.70 that implements the SDEQ, and a
> test/library that tests the SDEQ functionality.
I responded:
Ok, I looked at your code. To put it kindly, it will never work
properly. See the long thread on lkml by Inaky after I
announced udev. He tried to create something like what you have
done, but got a big better implementation (yours has a number of
memory leaks, and permission problems...)
See that thread for why doing this in the kernel is not the way
to go. Please don't waste your time on this anymore, it's a
loosing battle...
You responded:
It obviously works properly when using the udev application so I
am at a loss as to how you can make a claim that "it will never
work". Please explain your comments about memory leaks. Data
structures are allocated and then freed at appropriate times.
Even in the case of a crash during a transaciton (get event,
clera events), there are no leaks. Please explain permission
problems. Permissions are controlled through policy in the
filesystem (/dev/sdeq) which fully controls access to the system
device enumeration event queue.
I did look at Inaky's code, and frankly it was too complicated
to solve the simple problem that needed to be solved. We also
have a full-blown event mechanism (for redundant system slot
compact pci support), but feel its too heavy-weight to be of
general purpose use to the kernel. This particular
implementation is lightweight and solves a specific need with
real applications that can really use the functionality.
I have read the thread and atleast 70% of the comments where "we
should not use /sbin/hotplug, but instead events should be
queued in the kernel". The thread raised all of the issues that
are solved with this kernel patch. Several core kernel
maintainers were the authors of the comments, so I'm not alone
in this belief.
Your obviously opposed to using character devices to queue
kobject creation/deletion events for some reason which I don't
understand. /sbin/hotplug is an inferior solution to queueing
events in the kernel. Your not the maintainer of the code this
feature is going into (which is the linux driver model). I'd
really like to hear Pat's thoughts on this issue as he is the
one that has to live with the changes. If you want to continue
using /sbin/hotplug, with all of its numerous problems for your
work, your more then welcome to do so. No one is forcing you to
use sdeq, however, for those other vendors that may want to make
persistent device naming solutions, this is a core feature that
can not easily be solved by poorly thought out hacks.
I will respond now:
Ok, permission problems can be solved by relying on the permission of
the device node, you are correct.
As for the memory issues, if no one ever reads from the character node,
it will constantly fill up with events, right? Shouldn't they be
eventually flushed out at some point in time?
Also for trying to remove events, why is userspace allowed to remove a
multiple of events at once? I would think that a valid read would
remove that event, you can have two applications grab the same event,
which is not what I think you want to have happen.
And finally...
Yes, I know I'm not the maintainer of the code that this feature is
going into. But my position is that this code is not needed in the main
kernel tree, and offer that opinion to the maintainer of this code.
I agree, most of the arguments in the previous udev thread talked about
out of order events, but I think I've gotten that solved now (I can wave
my hands about how it will be done, or I can show you working code at
OLS. I think I'll wait until OLS, as code matters, wavy hands don't.)
So, because of this, I still think that everything you are trying to do
can be successfully done from userspace today (or use devfsd today, it
will give you the same info!) And due to that, I do not think this code
is necessary to be in the kernel.
Thanks for reading this far, and my apologies if I got any of the above
quotes out of order or misspoken, I just wanted to cut to the chase, and
not have to go through 4 more rounds of email to make it to this point
in the conversation again.
greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/