Re: kernel lock contention and scalability

Manfred Spraul (manfred@colorfullife.com)
Sun, 25 Feb 2001 10:52:19 +0100



Jonathan Lahr wrote:
>
> To discover possible locking limitations to scalability, I have collected
> locking statistics on 2-way, 4-way, and 8-way systems performing as networked
> database servers. I patched the [48]-way kernels with Kravetz's multiqueue
> patch in the hope that mitigating runqueue_lock contention might better
> reveal other lock contention.
>

The dual-CPU numbers are really odd: extremely high counts of
add_timer(), del_timer_sync(), schedule(), and process_timeout().

That could be a kernel bug:
perhaps someone uses

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(100);
	}

without checking signal_pending()?
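
If a signal is pending, schedule_timeout() returns immediately, so a loop
like that degenerates into calling add_timer()/schedule()/del_timer_sync()
back to back. The usual idiom checks for signals before sleeping again,
roughly like this (just a sketch of the pattern, not taken from any
particular driver):

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (signal_pending(current))
			break;
		schedule_timeout(100);
	}
	set_current_state(TASK_RUNNING);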

> In the attached document, I describe my test environment and excerpt
> lockstat output to show the more contentious locks for a typical run on
> each of my server configurations. I'm interested in comparing these data
> to other lock contention data, so information regarding previous or ongoing
> lock contention work would be appreciated. I'm aware of timer scalability
> work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone
> working on reducing sem_ids contention?
>

Is that really a problem?
The contention is high, but the actual lost time is quite small.

The 8-way test ran for ~129 seconds of wall clock time (total cpu time
1030 seconds), and only around 0.7 seconds were lost to spinning.
The high contention is caused by the wakeups: cpu0 scans the list of
waiting processes under the semaphore spinlock and wakes up any whose
operation can now complete. If a woken thread starts running on another
cpu before cpu0 has released the spinlock, that cpu spins.
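
Roughly, the interaction looks like this (paraphrased from the 2.4
update_queue()/sys_semop() code, details from memory, so take it as a
sketch only):

	/* cpu0, inside update_queue(), still holding the ipc spinlock: */
	for (q = sma->sem_pending; q; q = q->next) {
		error = try_atomic_semop(sma, q->sops, q->nsops,
					 q->undo, q->pid, q->alter);
		if (error <= 0)
			wake_up_process(q->sleeper);
		/* ... update q->status, remove_from_queue(sma,q) ... */
	}

	/* the sleeper, back in sys_semop() right after schedule(),
	 * immediately calls sem_lock(semid) again; if it got another
	 * cpu before cpu0 dropped the lock, it sits there spinning. */

The last hunk of the attached patch lets the woken process skip that
sem_lock() when its operation has already completed (queue.status == 0).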

I've attached two changes that might reduce the contention, but they are
just an idea and completely untested:

* a slightly more efficient try_atomic_semop().
* don't acquire the spinlock after waking up if q->alter was 0. It could
slightly improve performance, but I assume that q->alter will always be 1.

Btw, I found a small bug in try_atomic_semop():
If a semaphore operation with sem_op==0 blocks, then the semaphore's
sempid value is corrupted. The bug also exists in 2.2.
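
The reason, as far as I can see (annotation mine, old code quoted from
memory): for sem_op==0 the would_block check fires before sempid is
updated, but the undo loop still starts at the operation that blocked and
shifts sempid back anyway:

	if (!sem_op && curr->semval)
		goto would_block;	/* blocks before sempid was touched */
	curr->sempid = (curr->sempid << 16) | pid;
	...
undo:
	while (sop >= sops) {
		curr = sma->sem_base + sop->sem_num;
		curr->semval -= sop->sem_op;	/* harmless here, sem_op is 0 */
		curr->sempid >>= 16;		/* drops a valid pid */
		...
		sop--;
	}

The attached patch avoids this by doing all would_block/out_of_range
checks before touching the semaphore and by making the undo loop skip
the operation that blocked.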

--
	Manfred
--------------6A6AF872F4E618CA2D9E3555
Content-Type: text/plain; charset=us-ascii;
 name="patch-sem"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="patch-sem"

--- sem.c.old	Sun Feb 25 10:50:55 2001
+++ sem.c	Sun Feb 25 10:51:19 2001
@@ -250,23 +250,23 @@
 		curr = sma->sem_base + sop->sem_num;
 		sem_op = sop->sem_op;
 
-		if (!sem_op && curr->semval)
+		result = curr->semval;
+		if (!sem_op && result)
 			goto would_block;
 
+		result += sem_op;
+		if (result < 0)
+			goto would_block;
+		if (result > SEMVMX)
+			goto out_of_range;
 		curr->sempid = (curr->sempid << 16) | pid;
-		curr->semval += sem_op;
+		curr->semval = result;
 		if (sop->sem_flg & SEM_UNDO)
 			un->semadj[sop->sem_num] -= sem_op;
-
-		if (curr->semval < 0)
-			goto would_block;
-		if (curr->semval > SEMVMX)
-			goto out_of_range;
 	}
 
 	if (do_undo) {
-		sop--;
 		result = 0;
 		goto undo;
 	}
@@ -285,6 +285,7 @@
 	result = 1;
 
 undo:
+	sop--;
 	while (sop >= sops) {
 		curr = sma->sem_base + sop->sem_num;
 		curr->semval -= sop->sem_op;
@@ -305,7 +306,9 @@
 {
 	int error;
 	struct sem_queue * q;
+	int do_retry = 0;
 
+retry:
 	for (q = sma->sem_pending; q; q = q->next) {
 
 		if (q->status == 1)
@@ -323,10 +326,17 @@
 				q->status = 1;
 				return;
 			}
-			q->status = error;
 			remove_from_queue(sma,q);
+			wmb();
+			q->status = error;
+			/* FIXME: retry only required if an increase was
+			 * executed
+			 */
+			do_retry = 1;
 		}
 	}
+	if (do_retry)
+		goto retry;
 }
 
 /* The following counts are associated to each semaphore:
@@ -919,7 +929,13 @@
 		sem_unlock(semid);
 
 		schedule();
-
+		if (queue.status == 0) {
+			error = 0;
+			if (queue.prev)
+				BUG();
+			current->semsleeping = NULL;
+			goto out_free;
+		}
 		tmp = sem_lock(semid);
 		if(tmp==NULL) {
 			if(queue.prev != NULL)

