<!-- received="Fri Jun 26 14:07:47 1998 EET DST" -->
<!-- sent="Fri, 26 Jun 1998 02:09:08 -0700 (PDT)" -->
<!-- name="Linus Torvalds" -->
<!-- email="torvalds@transmeta.com" -->
<!-- subject="Re: Thread implementations..." -->
<!-- id="" -->
<!-- inreplyto="Pine.LNX.3.96dg4.980626014229.17648H-100000@twinlark.arctic.org" -->
<title>Linux-kernel mailing list archive 1998-25: Re: Thread implementations...</title>
<body bgcolor="#FFFFFF"><font face="Arial,Helvetica">
<h1>Re: Thread implementations...</h1>
<b>Linus Torvalds</b> (<a href="mailto:torvalds@transmeta.com"><i>torvalds@transmeta.com</i></a>)<br>
<i>Fri, 26 Jun 1998 02:09:08 -0700 (PDT)</i>
<p>
<ul>
<li> <b>Messages sorted by:</b> <a href="date.html#1653">[ date ]</a><a href="index.html#1653">[ thread ]</a><a href="subject.html#1653">[ subject ]</a><a href="author.html#1653">[ author ]</a>
<!-- next="start" -->
<li> <b>Next message:</b> <a href="1654.html">Pawel S. Veselov: "Re: Secure-linux and standard kernel"</a>
<li> <b>Previous message:</b> <a href="1652.html">David Luyer: "Re: uniform input device packets?"</a>
<!-- nextthread="start" -->
<!-- reply="end" -->
</ul>
<hr>
<!-- body="start" -->
On Fri, 26 Jun 1998, Dean Gaudet wrote:<br>
<i>&gt; </i><br>
<i>&gt; This thread is, uh, fun ;)  I started out liking sendfile, and now I'm</i><br>
<i>&gt; thinking it may not be worth it!</i><br>
<p>
Well, I decided I might as well implement it. Appended is a stupid example<br>
program (and I mean _really_ stupid), and the diff against 2.1.107. It was<br>
pretty much exactly as I expected it to be: a small amount of judicious<br>
re-organization made the whole system call be less than 100 lines of code,<br>
and most of that is just checking the inputs to the system call rather<br>
than actual work.<br>
<p>
CAUTION! The system call may well be completely broken. It seems to work<br>
for me with the silly test-program, and I tried to make it do all the<br>
right checks, but it's 2AM, and I did this in little more than an hour.<br>
Caveat emptor. But I'd like to hear what people think about it..<br>
<p>
<i>&gt; Another problem with sendfile():  the file FD's seek pointer is changed</i><br>
<i>&gt; while it's used, which means multiple clients can't be serviced from the</i><br>
<i>&gt; same FD.  This kind of defeats the purpose of caching the open FD in a</i><br>
<i>&gt; threaded server...</i><br>
<p>
My current implementation does this, but that is purely coincidental. I<br>
could equally well have done a "pread()+pwrite()" kind of thing, it just<br>
requires more arguments to the system call. It's technically trivial to<br>
implement (the kernel internally does everything with the pread/pwrite<br>
interface anyway). <br>
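<p>
To make the "pread()" idea concrete from the user-space side, here is a<br>
minimal sketch, assuming a POSIX system that provides pread(). The helper<br>
name read_at() is hypothetical and not part of the patch; it only shows<br>
that a positional read leaves the shared seek pointer alone, which is what<br>
a threaded server caching one FD would need.<br>

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): read up to "len" bytes
 * starting at "offset" without moving the descriptor's seek pointer,
 * so several threads can share one cached fd.
 *
 * Returns the number of bytes read (short at end of file), or -1 with
 * errno set if nothing could be read.
 */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset)
{
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		ssize_t n = pread(fd, (char *) buf + done, len - done,
				  offset + (off_t) done);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			return done ? (ssize_t) done : -1;
		}
		if (n == 0)			/* end of file */
			break;
		done += (size_t) n;
	}
	return (ssize_t) done;
}
```

Each caller passes its own offset, so no per-client state lives in the<br>
file descriptor itself.<br>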
<p>
But as far as I'm concerned, the kernel is better at looking up filenames<br>
than apache will ever be unless you guys start really doing well. Under<br>
those kinds of circumstances it's fairly pointless to do fd caching.<br>
That's especially true in SMP environments and threading, where I just<br>
don't think that the apache guys seem to be able to scale. <br>
<p>
Note! I don't think apache is going to be the best thing to use with this,<br>
if only because apache tries to be too clever. This is really meant for<br>
something that<br>
<p>
 - uses threads<br>
 - does no caching _at_all_, because it knows the kernel can cache<br>
   everything better than most user mode programs<br>
 - just has a simple main loop that looks something like<br>
<p>
	for (;;) {<br>
		fd = accept();<br>
		if (fd &gt;= 0)<br>
			clone(fd, connection);<br>
	}<br>
<p>
	connection(socket)<br>
	{<br>
		fd = open(filename, O_RDONLY);<br>
		fstat(fd, &amp;st);<br>
		sendfile(socket, fd, st.st_size);<br>
	}<br>
<p>
   (Yes, the above is _very_ simplistic, please don't tell me that<br>
   web-serving is slightly more complex than this ;)<br>
<p>
Anyway, the "sendfile()" system call "man-page" is:<br>
<p>
	sendfile(int outfd, int infd, size_t size);<br>
<p>
is pretty much equivalent to<br>
<p>
	write(outfd, tempbuf, read(infd, tempbuf, size));<br>
<p>
(with all the error handling details etc), except that "infd" has to be a<br>
real file on a filesystem that supports the page cache.<br>
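<p>
Spelled out with "all the error handling details", the read()+write()<br>
equivalent looks roughly like the sketch below. This is a user-space<br>
stand-in under stated assumptions, not what the kernel patch does: the<br>
name copy_fd() and the 4096-byte buffer are made up for illustration, and<br>
unlike sendfile() it copies every byte through user space.<br>

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical user-space stand-in for sendfile(outfd, infd, size):
 * copy up to "size" bytes through a temporary buffer.
 *
 * Returns the number of bytes copied (short at end of input), or -1
 * with errno set if nothing could be copied. If a write fails part way
 * through a chunk, the input has already been advanced past that chunk.
 */
ssize_t copy_fd(int outfd, int infd, size_t size)
{
	char tempbuf[4096];
	size_t total = 0;

	assert(outfd >= 0 && infd >= 0);
	while (total < size) {
		size_t want = size - total;
		ssize_t got, put, off;

		if (want > sizeof(tempbuf))
			want = sizeof(tempbuf);
		got = read(infd, tempbuf, want);
		if (got < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the read */
			return total ? (ssize_t) total : -1;
		}
		if (got == 0)			/* end of input */
			break;
		/* write() may be partial, so loop until the chunk is out */
		for (off = 0; off < got; off += put) {
			put = write(outfd, tempbuf + off, (size_t) (got - off));
			if (put < 0) {
				if (errno == EINTR) {
					put = 0;	/* retry the same span */
					continue;
				}
				return total ? (ssize_t) total : -1;
			}
			total += (size_t) put;
		}
	}
	return (ssize_t) total;
}
```

The return convention (bytes copied, or the error only if nothing was<br>
copied) follows the one used by sys_sendfile() in the patch.<br>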
<p>
		Linus<br>
-----<br>
/*<br>
 * Very stupid example of using the sendfile()<br>
 * system call.<br>
 */<br>
<p>
#include &lt;stdio.h&gt;<br>
#include &lt;stdlib.h&gt;<br>
#include &lt;sys/types.h&gt;<br>
#include &lt;unistd.h&gt;<br>
#include &lt;errno.h&gt;<br>
#include &lt;fcntl.h&gt;<br>
<p>
ssize_t sendfile(int out, int in, size_t size)<br>
{<br>
	ssize_t retval;<br>
<p>
	asm volatile(<br>
		"pushl %%ebx\n\t"<br>
		"movl %%esi,%%ebx\n\t"<br>
		"int $0x80\n\t"<br>
		"popl %%ebx"<br>
		:"=a" (retval)<br>
		:"0" (187),<br>
		 "S" (out),	/* pseudo-ebx */<br>
		 "c" (in),<br>
		 "d" (size));<br>
	if ((unsigned long) retval &gt; (unsigned long)-1000) {<br>
		errno = -retval;<br>
		retval = -1;<br>
	}<br>
	return retval;<br>
}<br>
<p>
int main(int argc, char **argv)<br>
{<br>
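	int in, out, error;<br>
<p>
	/* Need both an input and an output filename. */<br>
	if (argc &lt; 3) {<br>
		fprintf(stderr, "usage: %s infile outfile\n", argv[0]);<br>
		exit(1);<br>
	}<br>
	in = open(argv[1], O_RDONLY);<br>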
	if (in &lt; 0) {<br>
		perror("open input");<br>
		exit(1);<br>
	}<br>
	out = open(argv[2], O_WRONLY | O_CREAT, 0666);<br>
	if (out &lt; 0) {<br>
		perror("open output");<br>
		exit(1);<br>
	}<br>
	error = sendfile(out, in, 1024);<br>
	printf("sendfile returned %d\n", error);<br>
	if (error &lt; 0) {<br>
		perror("sendfile");<br>
	}<br>
	return 0;<br>
}<br>
-----<br>
diff -u --recursive v2.1.107/linux/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S<br>
--- v2.1.107/linux/arch/i386/kernel/entry.S	Tue Jun  9 00:55:09 1998<br>
+++ linux/arch/i386/kernel/entry.S	Fri Jun 26 01:04:21 1998<br>
@@ -547,7 +547,8 @@<br>
 	.long SYMBOL_NAME(sys_capget)<br>
 	.long SYMBOL_NAME(sys_capset)           /* 185 */<br>
 	.long SYMBOL_NAME(sys_sigaltstack)<br>
+	.long SYMBOL_NAME(sys_sendfile)<br>
 	<br>
-	.rept NR_syscalls-186<br>
+	.rept NR_syscalls-187<br>
 		.long SYMBOL_NAME(sys_ni_syscall)<br>
 	.endr<br>
diff -u --recursive v2.1.107/linux/include/asm-i386/unistd.h linux/include/asm-i386/unistd.h<br>
--- v2.1.107/linux/include/asm-i386/unistd.h	Tue Jun  9 00:55:10 1998<br>
+++ linux/include/asm-i386/unistd.h	Fri Jun 26 01:14:18 1998<br>
@@ -192,6 +192,7 @@<br>
 #define __NR_capget		184<br>
 #define __NR_capset		185<br>
 #define __NR_sigaltstack	186<br>
+#define __NR_sendfile		187<br>
 <br>
 /* user-visible error numbers are in the range -1 - -122: see &lt;asm-i386/errno.h&gt; */<br>
 <br>
diff -u --recursive v2.1.107/linux/mm/filemap.c linux/mm/filemap.c<br>
--- v2.1.107/linux/mm/filemap.c	Fri Jun 26 01:31:42 1998<br>
+++ linux/mm/filemap.c	Fri Jun 26 01:09:30 1998<br>
@@ -567,6 +567,23 @@<br>
 	return page_cache;<br>
 }<br>
 <br>
+/*<br>
+ * "descriptor" for what we're up to with a read.<br>
+ * This allows us to use the same read code yet<br>
+ * have multiple different users of the data that<br>
+ * we read from a file.<br>
+ *<br>
+ * The simplest case just copies the data to user<br>
+ * mode.<br>
+ */<br>
+typedef struct {<br>
+	size_t written;<br>
+	size_t count;<br>
+	char * buf;<br>
+	int error;<br>
+} read_descriptor_t;<br>
+<br>
+typedef int (*read_actor_t)(read_descriptor_t *, const char *, unsigned long);<br>
 <br>
 /*<br>
  * This is a generic file read routine, and uses the<br>
@@ -576,23 +593,14 @@<br>
  * This is really ugly. But the goto's actually try to clarify some<br>
  * of the logic when it comes to error handling etc.<br>
  */<br>
-<br>
-ssize_t generic_file_read(struct file * filp, char * buf,<br>
-			  size_t count, loff_t *ppos)<br>
+static void do_generic_file_read(struct file * filp, loff_t *ppos, read_descriptor_t * desc, read_actor_t actor)<br>
 {<br>
 	struct dentry *dentry = filp-&gt;f_dentry;<br>
 	struct inode *inode = dentry-&gt;d_inode;<br>
-	ssize_t error, read;<br>
 	size_t pos, pgpos, page_cache;<br>
 	int reada_ok;<br>
 	int max_readahead = get_max_readahead(inode);<br>
 <br>
-	if (!access_ok(VERIFY_WRITE, buf, count))<br>
-		return -EFAULT;<br>
-	if (!count)<br>
-		return 0;<br>
-	error = 0;<br>
-	read = 0;<br>
 	page_cache = 0;<br>
 <br>
 	pos = *ppos;<br>
@@ -620,12 +628,12 @@<br>
  * Then, at least MIN_READAHEAD if read ahead is ok,<br>
  * and at most MAX_READAHEAD in all cases.<br>
  */<br>
-	if (pos + count &lt;= (PAGE_SIZE &gt;&gt; 1)) {<br>
+	if (pos + desc-&gt;count &lt;= (PAGE_SIZE &gt;&gt; 1)) {<br>
 		filp-&gt;f_ramax = 0;<br>
 	} else {<br>
 		unsigned long needed;<br>
 <br>
-		needed = ((pos + count) &amp; PAGE_MASK) - pgpos;<br>
+		needed = ((pos + desc-&gt;count) &amp; PAGE_MASK) - pgpos;<br>
 <br>
 		if (filp-&gt;f_ramax &lt; needed)<br>
 			filp-&gt;f_ramax = needed;<br>
@@ -678,20 +686,20 @@<br>
 <br>
 		offset = pos &amp; ~PAGE_MASK;<br>
 		nr = PAGE_SIZE - offset;<br>
-		if (nr &gt; count)<br>
-			nr = count;<br>
 		if (nr &gt; inode-&gt;i_size - pos)<br>
 			nr = inode-&gt;i_size - pos;<br>
-		nr -= copy_to_user(buf, (void *) (page_address(page) + offset), nr);<br>
-		release_page(page);<br>
-		error = -EFAULT;<br>
-		if (!nr)<br>
-			break;<br>
-		buf += nr;<br>
+<br>
+		/*<br>
+		 * The actor routine returns how many bytes were actually used..<br>
+		 * NOTE! This may not be the same as how much of a user buffer<br>
+		 * we filled up (we may be padding etc), so we can only update<br>
+		 * "pos" here (the actor routine has to update the user buffer<br>
+		 * pointers and the remaining count).<br>
+		 */<br>
+		nr = actor(desc, (const char *) (page_address(page) + offset), nr);<br>
 		pos += nr;<br>
-		read += nr;<br>
-		count -= nr;<br>
-		if (count)<br>
+		release_page(page);<br>
+		if (nr &amp;&amp; desc-&gt;count)<br>
 			continue;<br>
 		break;<br>
 	}<br>
@@ -709,7 +717,7 @@<br>
 			 */<br>
 			if (page_cache)<br>
 				continue;<br>
-			error = -ENOMEM;<br>
+			desc-&gt;error = -ENOMEM;<br>
 			break;<br>
 		}<br>
 <br>
@@ -738,11 +746,14 @@<br>
 		if (reada_ok &amp;&amp; filp-&gt;f_ramax &gt; MIN_READAHEAD)<br>
 			filp-&gt;f_ramax = MIN_READAHEAD;<br>
 <br>
-		error = inode-&gt;i_op-&gt;readpage(filp, page);<br>
-		if (!error)<br>
-			goto found_page;<br>
-		release_page(page);<br>
-		break;<br>
+		{<br>
+			int error = inode-&gt;i_op-&gt;readpage(filp, page);<br>
+			if (!error)<br>
+				goto found_page;<br>
+			desc-&gt;error = error;<br>
+			release_page(page);<br>
+			break;<br>
+		}<br>
 <br>
 page_read_error:<br>
 		/*<br>
@@ -750,15 +761,18 @@<br>
 		 * Try to re-read it _once_. We do this synchronously,<br>
 		 * because this happens only if there were errors.<br>
 		 */<br>
-		error = inode-&gt;i_op-&gt;readpage(filp, page);<br>
-		if (!error) {<br>
-			wait_on_page(page);<br>
-			if (PageUptodate(page) &amp;&amp; !PageError(page))<br>
-				goto success;<br>
-			error = -EIO; /* Some unspecified error occurred.. */<br>
+		{<br>
+			int error = inode-&gt;i_op-&gt;readpage(filp, page);<br>
+			if (!error) {<br>
+				wait_on_page(page);<br>
+				if (PageUptodate(page) &amp;&amp; !PageError(page))<br>
+					goto success;<br>
+				error = -EIO; /* Some unspecified error occurred.. */<br>
+			}<br>
+			desc-&gt;error = error;<br>
+			release_page(page);<br>
+			break;<br>
 		}<br>
-		release_page(page);<br>
-		break;<br>
 	}<br>
 <br>
 	*ppos = pos;<br>
@@ -766,9 +780,143 @@<br>
 	if (page_cache)<br>
 		free_page(page_cache);<br>
 	UPDATE_ATIME(inode);<br>
-	if (!read)<br>
-		read = error;<br>
-	return read;<br>
+}<br>
+<br>
+static int file_read_actor(read_descriptor_t * desc, const char *area, unsigned long size)<br>
+{<br>
+	unsigned long left;<br>
+	unsigned long count = desc-&gt;count;<br>
+<br>
+	if (size &gt; count)<br>
+		size = count;<br>
+	left = __copy_to_user(desc-&gt;buf, area, size);<br>
+	if (left) {<br>
+		size -= left;<br>
+		desc-&gt;error = -EFAULT;<br>
+	}<br>
+	desc-&gt;count = count - size;<br>
+	desc-&gt;written += size;<br>
+	desc-&gt;buf += size;<br>
+	return size;<br>
+}<br>
+<br>
+/*<br>
+ * This is the "read()" routine for all filesystems<br>
+ * that can use the page cache directly.<br>
+ */<br>
+ssize_t generic_file_read(struct file * filp, char * buf, size_t count, loff_t *ppos)<br>
+{<br>
+	ssize_t retval;<br>
+<br>
+	retval = -EFAULT;<br>
+	if (access_ok(VERIFY_WRITE, buf, count)) {<br>
+		retval = 0;<br>
+		if (count) {<br>
+			read_descriptor_t desc;<br>
+<br>
+			desc.written = 0;<br>
+			desc.count = count;<br>
+			desc.buf = buf;<br>
+			desc.error = 0;<br>
+			do_generic_file_read(filp, ppos, &amp;desc, file_read_actor);<br>
+<br>
+			retval = desc.written;<br>
+			if (!retval)<br>
+				retval = desc.error;<br>
+		}<br>
+	}<br>
+	return retval;<br>
+}<br>
+<br>
+static int file_send_actor(read_descriptor_t * desc, const char *area, unsigned long size)<br>
+{<br>
+	ssize_t written;<br>
+	unsigned long count = desc-&gt;count;<br>
+	struct file *file = (struct file *) desc-&gt;buf;<br>
+	struct inode *inode = file-&gt;f_dentry-&gt;d_inode;<br>
+<br>
+	if (size &gt; count)<br>
+		size = count;<br>
+	down(&amp;inode-&gt;i_sem);<br>
+	set_fs(KERNEL_DS);<br>
+	written = file-&gt;f_op-&gt;write(file, area, size, &amp;file-&gt;f_pos);<br>
+	set_fs(USER_DS);<br>
+	up(&amp;inode-&gt;i_sem);<br>
+	if (written &lt; 0) {<br>
+		desc-&gt;error = written;<br>
+		written = 0;<br>
+	}<br>
+	desc-&gt;count = count - written;<br>
+	desc-&gt;written += written;<br>
+	return written;<br>
+}<br>
+<br>
+asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, size_t count)<br>
+{<br>
+	ssize_t retval;<br>
+	struct file * in_file, * out_file;<br>
+	struct inode * in_inode, * out_inode;<br>
+<br>
+	/*<br>
+	 * Get input file, and verify that it is ok..<br>
+	 */<br>
+	retval = -EBADF;<br>
+	in_file = fget(in_fd);<br>
+	if (!in_file)<br>
+		goto out;<br>
+	if (!(in_file-&gt;f_mode &amp; FMODE_READ))<br>
+		goto fput_in;<br>
+	retval = -EINVAL;<br>
+	in_inode = in_file-&gt;f_dentry-&gt;d_inode;<br>
+	if (!in_inode)<br>
+		goto fput_in;<br>
+	if (!in_inode-&gt;i_op || !in_inode-&gt;i_op-&gt;readpage)<br>
+		goto fput_in;<br>
+	retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, in_file-&gt;f_pos, count);<br>
+	if (retval)<br>
+		goto fput_in;<br>
+<br>
+	/*<br>
+	 * Get output file, and verify that it is ok..<br>
+	 */<br>
+	retval = -EBADF;<br>
+	out_file = fget(out_fd);<br>
+	if (!out_file)<br>
+		goto fput_in;<br>
+	if (!(out_file-&gt;f_mode &amp; FMODE_WRITE))<br>
+		goto fput_out;<br>
+	retval = -EINVAL;<br>
+	if (!out_file-&gt;f_op || !out_file-&gt;f_op-&gt;write)<br>
+		goto fput_out;<br>
+	out_inode = out_file-&gt;f_dentry-&gt;d_inode;<br>
+	if (!out_inode)<br>
+		goto fput_out;<br>
+	retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file-&gt;f_pos, count);<br>
+	if (retval)<br>
+		goto fput_out;<br>
+<br>
+	retval = 0;<br>
+	if (count) {<br>
+		read_descriptor_t desc;<br>
+<br>
+		desc.written = 0;<br>
+		desc.count = count;<br>
+		desc.buf = (char *) out_file;<br>
+		desc.error = 0;<br>
+		do_generic_file_read(in_file, &amp;in_file-&gt;f_pos, &amp;desc, file_send_actor);<br>
+<br>
+		retval = desc.written;<br>
+		if (!retval)<br>
+			retval = desc.error;<br>
+	}<br>
+<br>
+<br>
+fput_out:<br>
+	fput(out_file);<br>
+fput_in:<br>
+	fput(in_file);<br>
+out:<br>
+	return retval;<br>
 }<br>
 <br>
 /*<br>
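<p>
The descriptor/actor split in the patch above can be sketched outside the<br>
kernel too. What follows is a minimal user-space illustration, not the<br>
kernel code: desc_t, actor_t, walk_chunks(), copy_actor() and sum_actor()<br>
are all hypothetical names, and the "pages" are just fixed-size chunks of<br>
a memory buffer. It shows the same idea: one generic walker, with the<br>
actor deciding what happens to each chunk and what "buf" really points at.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as read_descriptor_t in the patch; names are hypothetical. */
typedef struct {
	size_t written;		/* bytes consumed so far */
	size_t count;		/* bytes still wanted */
	char *buf;		/* actor-private pointer */
	int error;
} desc_t;

typedef size_t (*actor_t)(desc_t *, const char *, size_t);

/* Generic walker: hand "src" to the actor in chunks of at most "chunk". */
void walk_chunks(const char *src, size_t len, size_t chunk,
		 desc_t *desc, actor_t actor)
{
	size_t pos = 0;

	assert(desc != NULL && actor != NULL);
	while (pos < len && desc->count) {
		size_t nr = len - pos;

		if (nr > chunk)
			nr = chunk;
		nr = actor(desc, src + pos, nr);	/* bytes actually used */
		if (!nr)
			break;
		pos += nr;
	}
}

/* Actor 1: plain copy into the buffer that desc->buf points at. */
size_t copy_actor(desc_t *desc, const char *area, size_t size)
{
	if (size > desc->count)
		size = desc->count;
	memcpy(desc->buf, area, size);
	desc->buf += size;
	desc->count -= size;
	desc->written += size;
	return size;
}

/*
 * Actor 2: reuse desc->buf as a pointer to something else entirely
 * (here an unsigned long accumulator), much like file_send_actor()
 * stores a struct file pointer there.
 */
size_t sum_actor(desc_t *desc, const char *area, size_t size)
{
	unsigned long *sum = (unsigned long *) (void *) desc->buf;
	size_t i;

	if (size > desc->count)
		size = desc->count;
	for (i = 0; i < size; i++)
		*sum += (unsigned char) area[i];
	desc->count -= size;
	desc->written += size;
	return size;
}
```

As in the patch, the walker only advances its own position; each actor<br>
is responsible for updating the remaining count and its private pointer.<br>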
<!-- body="end" -->
</font></body>
