Re: Bug or feature? Priority in /proc/<pid>/stat for RT processes

Lev Makhlis (mlev@despammed.com)
Wed, 21 Nov 2001 02:36:41 -0500


On Tuesday 20 November 2001 12:51 am, Albert D. Cahalan wrote:
> Lev Makhlis wwrites:
> > On Monday 19 November 2001 06:31 am, Albert D. Cahalan wrote:
> >> I can tell you what procps will do. The very first thing is
> >> to undo your transformation. Don't bother having the kernel
> >> muck with the numbers. The procps code will transform the
> >> numbers as needed to match UNIX convention and/or the tools
> >> which users run to set these values.
> >
> > Hmm, how would you explain that the kernel mucks with the numbers
> > for SCHED_OTHER, but not for SCHED_FIFO/SCHED_RR?
>
> Long ago, the kernel didn't muck with the numbers. It has to
> do that now because the numbers used inside the kernel are
> different than they used to be. If the interface were being
> designed today, it could supply the raw numbers.

Yes, and then there would be no question about supplying the raw
numbers for FIFO/RR as well. I'm not sure, though, that it would be
such a good idea, because the raw numbers for SCHED_OTHER
do not have a rigid scale -- it changes with the definition of HZ.
It wouldn't be trivial for an app programmer to find out just how high
on the priority scale a particular process is.

>
> > IIRC, procps does not attempt to undo the f(x) = 20 - (10x + 5) / 10
> > (assuming HZ=100) transformation currently used for SCHED_OTHER.
>
> Yes it does, more-or-less. This depends on what is being
> supplied to the user. You can have the data UNIX-style,
> SunOS-style, and traditional Linux-style. Like this:
>
> ps -eo pri,opri,priority

But you can't have the raw task->counter value, can you?
I don't think it's possible, since the mapping is not 1-1.

>
> > Granted, procps can do the transformation itself, but procps does not
> > have a monopoly on using procfs data -- any other performance-monitoring
> > application would have to duplicate the transformation, if it is to be
> > consistent with the standard (procps) tools. I thought it would be
> > nice if the kernel provided a consistent interface through procfs to
> > begin with.
>
> Maybe you should consider why, if true, the kernel internals
> are not consistent with the API.

I can only speculate that it's for the same reason that /proc/partitions
shows sizes in units of 1024 regardless of the actual block size, and
that /proc/stat shows CPU times in units of 10ms even if HZ is redefined
to something other than 100 -- so that the API remains backward-compatible
as the kernel internals continue to evolve.

Perhaps this isn't a performance
> advantage. I also wonder why RT tasks have a separate priority
> in the task struct when they leave the regular one unused and
> regular tasks leave the RT one unused. If these could be the same
> data type, then there isn't even any need for a union.

I do not claim to understand all of the scheduler code, but it appears
to me that the regular priority field still has some limited use for
RT processes, especially for RR.

>
> For compatibility with the rest of the world, procps needs to
> display the scheduling policy ("RR", "TS", etc.) and remap RT
> priority values in several different ways. Having the kernel
> remap values just obfuscates what the data really means, making
> more work for every app developer and wasting kernel CPU time.

Frankly, as an app developer, I can't see the benefit of having
the raw values, as long as I get a 1-1 mapping. For example, when
I am programming on Solaris, I know that FIFO/RR priorities can range
from 0 (lowest) to 59 (highest) when I look at them via /proc, or
from 100 (lowest) to 159 (highest) when using the POSIX interface
(<sched.h>). On HP-UX, I know that they range from -32 (highest)
to -1 (lowest) when using pstat(), or from 0 (lowest) to 31 (highest)
when using sched_*(). And on Tru64 Unix, they can be between 0
(highest) and 63 (lowest) when using the mach interface, or between
0 (lowest) and 63 (highest) when using sched_*(). In each case,
there is a 1-1 mapping between the POSIX values and the "native"
values that are used by ps/top by default. In each case, I don't know
which values the kernel uses internally -- could be the POSIX ones,
the "native" ones, or neither. I simply fail to see what additional
benefit I would have if I knew, for example, that Solaris really uses
values from 300 to 359 (just an off the wall example) under the hood.
Unless, perhaps, I wanted to bypass the API and read the priority
straight from process tables in /dev/kmem or something.

On Linux, I know the RT priorities range from 1 (lowest) to 99 (highest)
in the POSIX interface, and I happen to know, thanks to source
availability, that 1..99 is used internally. Would I see anything wrong
with Linux, unlike the other platforms, providing exactly the same
numbers through its "native" (procfs) interface? Not at all. My only
objection is that it would be inconsistent with the mapping for
SCHED_OTHER that's already in place, which is from -20 (highest)
to 20 (lowest). I am saying the scales should either both go upwards,
or both go downwards. I suggested a reversal of the RT scale
because I doubt a reversal of the TS scale would be readily accepted
at this stage, but maybe I'm wrong...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/