Linus,
I wish you could have made it to the OLS RAS BOF and seen this for
yourself - the vendor support, the need and the drive towards a
unified and flexible dumping framework.
The problem with dump has not been lack of vendor interest. There
wouldn't have been multiple dump type implementations floating around
if there wasn't a need -- LKCD, Mission Critical dump, Ingo's
network dump, kmsgdump, Rusty's oops dumper to cite some. The difficulty
has been technical and hence the diversity of approaches that different
projects came up with to tackle the problem (arising from slightly
different priorities and environments in each case). The second has
been related to preferences in the kind of user level analysis tools.
And the LKCD project has been evolving to address these very
problems to bring the best of these worlds together and also allow
flexibility on the choice of analysis tools !
Mission critical Linux project code base for example is now being
maintained as part of the LKCD project. Either lcrash or mission
critical linux crash can be used for analysing LKCD dumps.
And on the kernel side of things:
(a) The dump driver interface in LKCD has been specifically
designed to enable different kinds of dumping mechanisms and
targets to be supported -- generic block, network dump ,
polled-IDE (Rusty style) etc, even alternate dump targets failover
and multiple dump devices in the future if required. We are also
experimenting with a memory dump driver to save dump to memory
and dump after a memory preserving soft-boot, reusing the mission
critical mcore technique.
(b) Selective dumping, for different levels of dump data - one
option that was added recently would dump all kernel pages
and is likely to be commonly used (gzip compressed dump). Its
pretty easy to extend to more selectivity or different levels
and the dump also occurs in passes from more critical data to
less critical.
(The page in use flag was added to help with this)
(c) The core pieces which touch the kernel as such just add basic
infrastructure that is needed in the kernel for any dumping
facility. Includes:
- Enabling IPI to collect CPU state on all processors in the
system right when dump is triggered (may not be a normal
situation, so NMIs where supported are the best option)
- Ability to quiesce (silence) the system before dumping
(and if in non-disruptive mode, then restore it back)
- Calls into dump from kernel paths (panic, oops, sysrq
etc).
- Exports of symbols to help with physical memory
traversal and verification
As Matt has said there is an active development community behind
LKCD and lot of the drive for that has come from companies who use it
and are really hoping hard that it becomes part of the mainline.
BTW, the code has also been scrutinised and reviewed over
lkml as well and undergone iterations of releases following
that. Anything else there that you think needs to be fixed please
do let us know.
Regards
Suparna
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/