

Chapter 6. Using /proc For Input

6.1. Using /proc For Input

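We now have two ways to get output from kernel modules: we can register a device driver and use mknod to create a device file, or we can create a /proc file. Either way, the kernel can tell us whatever it needs to. The one remaining problem is that we have no way to talk back. The first way we will send input to a kernel module is by writing to a /proc file.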

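Because the /proc filesystem was designed mainly to let the kernel report its state to processes, it makes no special provision for input. struct proc_dir_entry has a pointer to an output function but no corresponding pointer to an input function. To write to a /proc file, we therefore have to use the standard filesystem mechanism.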

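Linux has a standard mechanism for registering filesystems. Since every filesystem must have its own functions for handling inodes and the files themselves[8], there is a structure that holds pointers to those functions. This is struct inode_operations, which in turn includes a pointer to a struct file_operations. In /proc, whenever we register a new file, we are allowed to specify which struct inode_operations is used to access it. This is the mechanism we use: a struct inode_operations, together with a struct file_operations that holds pointers to our module_input and module_output functions. In the example below, both structures are attached to the entry through its proc_iops and proc_fops fields.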

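Note that the usual roles of read and write are reversed in the kernel. Read functions are used for output, and write functions are used for input. The names describe the user's point of view: when a process reads from the kernel, the kernel has to produce output, and when a process writes to the kernel, the kernel receives it as input.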
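Seen from the process side, the reversal looks like the following minimal user-space sketch. The helper name `proc_roundtrip` is hypothetical, and the sketch assumes the file it is given behaves like the example's /proc/rw_test (writable by the caller, which for the example module means root). `O_CREAT` is there only so the sketch can also be tried on an ordinary file.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg to path, then reopen path and read back up to
 * out_size - 1 bytes into out (always NUL-terminated on success).
 * Returns the number of bytes read back, or -1 on any failure.
 */
long proc_roundtrip(const char *path, const char *msg,
                    char *out, size_t out_size)
{
    int fd;
    ssize_t n;

    if (out_size == 0)
        return -1;

    /* A write by the process is input for the kernel module. */
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    n = write(fd, msg, strlen(msg));
    close(fd);
    if (n < 0)
        return -1;

    /* A read by the process is output from the kernel module. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, out, out_size - 1);
    close(fd);
    if (n < 0)
        return -1;

    out[n] = '\0';
    return (long)n;
}
```

On an ordinary file the text comes back unchanged. Against the example module, the text read back would be whatever module_output produces from the stored input.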

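Another interesting point here is the module_permission function. It is called whenever a process tries to do something with the /proc file, and it decides whether to allow the operation. At present the decision is based only on the operation and the uid of the current user, but it could be based on anything we like, such as what other processes are doing with the same file, the time of day, or the last input we received.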
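The decision rule can be tried outside the kernel. The following is a minimal user-space sketch, not the module's code: `check_permission`, `OP_EXEC`, `OP_WRITE`, and `OP_READ` are hypothetical names chosen for illustration, and the caller's effective uid is passed in as a parameter instead of being read from `current`.

```c
#include <errno.h>

/* Hypothetical names for the operation codes the example module checks. */
enum { OP_EXEC = 0, OP_WRITE = 2, OP_READ = 4 };

/*
 * User-space model of the permission rule: anyone may read,
 * only root (effective uid 0) may write, everything else is denied.
 * Returns 0 to allow the operation or -EACCES to refuse it.
 */
int check_permission(int op, unsigned int euid)
{
    if (op == OP_READ || (op == OP_WRITE && euid == 0))
        return 0;
    return -EACCES;
}
```

Keeping the rule in one small function makes it easy to extend with further conditions, such as the time of day, without touching the read and write handlers.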

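The reason for the put_user and get_user macros is that Linux memory (on Intel architectures; other architectures may differ) is divided into separate address spaces, which the code comments below call memory segments. A pointer by itself therefore does not name a unique physical memory location, only a location within one address space, and you must know which address space it belongs to before you can use it. The kernel has one address space, and each process has its own.

A process can see only its own address space, so ordinary user programs never need to think about this. When you write a kernel module, you normally access the kernel's own memory, which the system handles automatically. However, when the contents of a buffer have to be passed between the currently running process and the kernel, the kernel function receives a pointer to a buffer that lives in the process's address space. The put_user and get_user macros let you access that memory.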
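The byte-by-byte copy that a write handler performs can be modelled in user space. This is only a sketch of the loop's bounds and termination: `store_message`, `stored_message`, and `DEMO_MESSAGE_LENGTH` are hypothetical names, and a plain assignment stands in for get_user, which has no user-space equivalent.

```c
#include <stddef.h>

#define DEMO_MESSAGE_LENGTH 80  /* hypothetical buffer size for this sketch */

static char demo_message[DEMO_MESSAGE_LENGTH];

/*
 * Copy at most DEMO_MESSAGE_LENGTH - 1 bytes from src into demo_message,
 * one byte at a time, and terminate the result. Returns the number of
 * bytes consumed, which is what a write handler reports to its caller.
 */
size_t store_message(const char *src, size_t len)
{
    size_t i;

    for (i = 0; i < (size_t)(DEMO_MESSAGE_LENGTH - 1) && i < len; i++)
        demo_message[i] = src[i];   /* a module would call get_user() here */

    demo_message[i] = '\0';
    return i;
}

/* Accessor for the stored, NUL-terminated message. */
const char *stored_message(void)
{
    return demo_message;
}
```

Input longer than the buffer is truncated, never overflowed, and the return value tells the caller how much was actually accepted.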

Example 6-1. procfs.c

/* procfs.c - create a "file" in /proc, which allows both input and output. */
#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */
#include <linux/proc_fs.h>  /* Necessary because we use proc fs */
#include <asm/uaccess.h>    /* for get_user and put_user */

/* Here we keep the last message received, to prove
 * that we can process our input */
#define MESSAGE_LENGTH 80
static char Message[MESSAGE_LENGTH];
static struct proc_dir_entry *Our_Proc_File;

#define PROC_ENTRY_FILENAME "rw_test"

static ssize_t module_output(struct file *filp, /* see include/linux/fs.h */
                             char *buffer,      /* buffer to fill with data */
                             size_t length,     /* length of the buffer */
                             loff_t * offset)
{
    static int finished = 0;
    int i;
    char message[MESSAGE_LENGTH + 30];

    /* We return 0 to indicate end of file, that we have
     * no more information. Otherwise, processes will
     * continue to read from us in an endless loop. */
    if (finished) {
        finished = 0;
        return 0;
    }

    /* We use put_user to copy the string from the kernel's
     * memory segment to the memory segment of the process
     * that called us. get_user, BTW, is
     * used for the reverse. */
    sprintf(message, "Last input:%s", Message);
    for (i = 0; i < length && message[i]; i++)
        put_user(message[i], buffer + i);

    /* Notice, we assume here that the size of the message
     * is below len, or it will be received cut. In a real
     * life situation, if the message were longer than len
     * we'd return len and on the second call start filling
     * the buffer with the len+1'th byte of the message. */
    finished = 1;

    return i;   /* Return the number of bytes "read" */
}

static ssize_t
module_input(struct file *filp, const char *buff, size_t len, loff_t * off)
{
    int i;

    /* Put the input into Message, where module_output
     * will later be able to use it */
    for (i = 0; i < MESSAGE_LENGTH - 1 && i < len; i++)
        get_user(Message[i], buff + i);

    Message[i] = '\0';  /* we want a standard, zero terminated string */
    return i;
}

/* This function decides whether to allow an operation
 * (return zero) or not allow it (return a non-zero
 * which indicates why it is not allowed).
 * The operation can be one of the following values:
 * 0 - Execute (run the "file" - meaningless in our case)
 * 2 - Write (input to the kernel module)
 * 4 - Read (output from the kernel module)
 * This is the real function that checks file
 * permissions. The permissions returned by ls -l are
 * for reference only, and can be overridden here. */
static int module_permission(struct inode *inode, int op, struct nameidata *foo)
{
    /* We allow everybody to read from our module, but
     * only root (uid 0) may write to it */
    if (op == 4 || (op == 2 && current->euid == 0))
        return 0;

    /* If it's anything else, access is denied */
    return -EACCES;
}

/* The file is opened - we don't really care about
 * that, but it does mean we need to increment the
 * module's reference count. */
int module_open(struct inode *inode, struct file *file)
{
    try_module_get(THIS_MODULE);
    return 0;
}

/* The file is closed - again, interesting only because
 * of the reference count. */
int module_close(struct inode *inode, struct file *file)
{
    module_put(THIS_MODULE);
    return 0;   /* success */
}

static struct file_operations File_Ops_4_Our_Proc_File = {
    .read = module_output,
    .write = module_input,
    .open = module_open,
    .release = module_close,
};

/* Inode operations for our proc file. We need it so
 * we'll have some place to specify the file operations
 * structure we want to use, and the function we use for
 * permissions. It's also possible to specify functions
 * to be called for anything else which could be done to
 * an inode (although we don't bother, we just leave them
 * NULL). */
static struct inode_operations Inode_Ops_4_Our_Proc_File = {
    .permission = module_permission,    /* check for permissions */
};

/* Module initialization and cleanup */
int init_module()
{
    Our_Proc_File = create_proc_entry(PROC_ENTRY_FILENAME, 0644, NULL);

    /* Check for failure before touching any field of the entry */
    if (Our_Proc_File == NULL) {
        printk(KERN_INFO "Error: Could not initialize /proc/%s\n",
               PROC_ENTRY_FILENAME);
        return -ENOMEM;
    }

    Our_Proc_File->owner = THIS_MODULE;
    Our_Proc_File->proc_iops = &Inode_Ops_4_Our_Proc_File;
    Our_Proc_File->proc_fops = &File_Ops_4_Our_Proc_File;
    Our_Proc_File->mode = S_IFREG | S_IRUGO | S_IWUSR;
    Our_Proc_File->uid = 0;
    Our_Proc_File->gid = 0;
    Our_Proc_File->size = 80;

    return 0;
}

void cleanup_module()
{
    remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
}

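Still hungry for procfs examples? Keep two things in mind. First, there are rumors that procfs may before long be replaced by sysfs. Second, if you really want to learn more about procfs, see the technical documentation under linux/Documentation/DocBook/. Run make help in the top-level directory of the kernel tree for instructions on converting those documents into your preferred format, for example make htmldocs. Consider using the same mechanism if you want to contribute kernel documentation of your own.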

Chapter 7. Talking To Device Files

7.1. Talking To Device Files (writes and IOCTLs)

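Device files are supposed to represent physical devices. Most physical devices are used for output as well as input, so the kernel needs some mechanism for getting the output that processes want to send to the device. This is done by opening the device file and writing to it, just like writing to an ordinary file. In the following example, this is implemented by device_write.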

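This is not always enough. Imagine a modem connected through a serial port (even an internal modem looks, from the CPU's point of view, like a modem attached to a serial port). The natural thing is to open the device file and write to it to send things to the modem (commands, or data to go out over the line) and read from it to receive things from the modem (responses to commands, or data that came in over the line). That still leaves one question open: how do we talk to the serial port itself, for example to set the rate at which data is sent and received?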

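The answer in Unix is a special function called ioctl (short for Input Output ConTroL). Every device can have its own ioctl commands, which can send information from a process to the kernel, return information to a process [9], do both, or do neither. The ioctl function is called with three parameters: the file descriptor of the appropriate device file, the ioctl number, and a parameter of type long, which a cast lets you use to pass anything [10].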
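The three-parameter shape of the call can be seen with an ioctl that ordinary Linux systems already provide. This sketch does not use the example driver at all: it assumes a Linux system, asks a pipe how many bytes are waiting to be read using the standard FIONREAD request, and `pending_after_write` is a hypothetical helper name.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Write len bytes into a fresh pipe, then ask the kernel with the
 * FIONREAD ioctl how many bytes are waiting on the read end.
 * Returns that count, or -1 on failure.
 */
int pending_after_write(const char *data, int len)
{
    int fds[2];
    int pending = -1;

    if (pipe(fds) < 0)
        return -1;

    if (write(fds[1], data, (size_t)len) == len) {
        /* file descriptor, ioctl number, and a pointer as the argument */
        if (ioctl(fds[0], FIONREAD, &pending) < 0)
            pending = -1;
    }

    close(fds[0]);
    close(fds[1]);
    return pending;
}
```

Here the third argument is a pointer that the kernel fills in, which is one common way of using the long-sized parameter to pass "anything".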

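The ioctl number encodes the major device number, the type of the ioctl, the command, and the type of the parameter. It is usually created by a macro call in a header file (_IO, _IOR, _IOW or _IOWR, depending on the type). That header should be included both by the programs that use ioctl (so they can generate the appropriate ioctl numbers) and by the kernel module (so it can understand them). In the example below, the header file is chardev.h and the program that uses it is ioctl.c.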
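How the fields are packed can be checked from user space with the decoding macros that accompany _IO, _IOR, _IOW and _IOWR in the Linux headers. This is a sketch assuming a Linux system with kernel headers installed. The `DEMO_*` names, `struct ioctl_fields`, and `decode_ioctl` are hypothetical and deliberately kept apart from the names in chardev.h. In the kernel headers' convention, _IOW marks data flowing from the process to the kernel and _IOR the reverse.

```c
#include <linux/ioctl.h>

#define DEMO_MAJOR 100  /* hypothetical; plays the role of a major number */

/* Hypothetical command numbers for illustration only. */
#define DEMO_SET_MSG      _IOW(DEMO_MAJOR, 0, char *)
#define DEMO_GET_MSG      _IOR(DEMO_MAJOR, 1, char *)
#define DEMO_GET_NTH_BYTE _IOWR(DEMO_MAJOR, 2, int)

struct ioctl_fields {
    unsigned int dir;   /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both */
    unsigned int type;  /* the "magic" value, here DEMO_MAJOR */
    unsigned int nr;    /* command number within that type */
    unsigned int size;  /* sizeof the argument type */
};

/* Split an ioctl number back into the fields the macros packed into it. */
struct ioctl_fields decode_ioctl(unsigned int cmd)
{
    struct ioctl_fields f;

    f.dir = _IOC_DIR(cmd);
    f.type = _IOC_TYPE(cmd);
    f.nr = _IOC_NR(cmd);
    f.size = _IOC_SIZE(cmd);
    return f;
}
```

Because the direction, type, command and size are all part of the number, two drivers that accidentally share a command index still end up with different ioctl numbers as long as their type values differ.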

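Even if you only intend to use ioctls in your own kernel modules, it is best to get an official ioctl assignment, so that if you accidentally use somebody else's ioctls, or they use yours, you will know something is wrong. For details, see Documentation/ioctl-number.txt in the kernel source tree.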

Example 7-1. chardev.c

/* chardev.c - Create an input/output character device */

#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */
#include <linux/fs.h>       /* for struct file_operations and register_chrdev */
#include <asm/uaccess.h>    /* for get_user and put_user */

#include "chardev.h"
#define SUCCESS 0
#define DEVICE_NAME "char_dev"
#define BUF_LEN 80

/* Is the device open right now? Used to prevent
 * concurrent access into the same device */
static int Device_Open = 0;

/* The message the device will give when asked */
static char Message[BUF_LEN];

/* How far did the process reading the message get?
 * Useful if the message is larger than the size of the
 * buffer we get to fill in device_read. */
static char *Message_Ptr;

/* This is called whenever a process attempts to open the device file */
static int device_open(struct inode *inode, struct file *file)
{
#ifdef DEBUG
    printk("device_open(%p)\n", file);
#endif

    /* We don't want to talk to two processes at the same time */
    if (Device_Open)
        return -EBUSY;

    Device_Open++;

    /* Initialize the message */
    Message_Ptr = Message;
    return SUCCESS;
}

static int device_release(struct inode *inode, struct file *file)
{
#ifdef DEBUG
    printk("device_release(%p,%p)\n", inode, file);
#endif

    /* We're now ready for our next caller */
    Device_Open--;

    return SUCCESS;
}

/* This function is called whenever a process which has already opened the
 * device file attempts to read from it. */
static ssize_t device_read(struct file *file,    /* see include/linux/fs.h */
                           char __user * buffer, /* buffer to be
                                                  * filled with data */
                           size_t length,        /* length of the buffer */
                           loff_t * offset)
{
    /* Number of bytes actually written to the buffer */
    int bytes_read = 0;

#ifdef DEBUG
    printk("device_read(%p,%p,%zu)\n", file, buffer, length);
#endif

    /* If we're at the end of the message, return 0
     * (which signifies end of file) */
    if (*Message_Ptr == 0)
        return 0;

    /* Actually put the data into the buffer */
    while (length && *Message_Ptr) {

        /* Because the buffer is in the user data segment,
         * not the kernel data segment, assignment wouldn't
         * work. Instead, we have to use put_user which
         * copies data from the kernel data segment to the
         * user data segment. */
        put_user(*(Message_Ptr++), buffer++);
        length--;
        bytes_read++;
    }

#ifdef DEBUG
    printk("Read %d bytes, %zu left\n", bytes_read, length);
#endif

    /* Read functions are supposed to return the number
     * of bytes actually inserted into the buffer */
    return bytes_read;
}

/* This function is called when somebody tries to
 * write into our device file. */
static ssize_t
device_write(struct file *file,
             const char __user * buffer, size_t length, loff_t * offset)
{
    int i;

#ifdef DEBUG
    /* Print the pointer only: buffer is a user-space address */
    printk("device_write(%p,%p,%zu)\n", file, buffer, length);
#endif

    /* Leave room for the terminating zero */
    for (i = 0; i < length && i < BUF_LEN - 1; i++)
        get_user(Message[i], buffer + i);
    Message[i] = '\0';

    Message_Ptr = Message;

    /* Again, return the number of input characters used */
    return i;
}

/* This function is called whenever a process tries to do an ioctl on our
 * device file. We get two extra parameters (additional to the inode and file
 * structures, which all device functions get): the number of the ioctl called
 * and the parameter given to the ioctl function.
 * If the ioctl is write or read/write (meaning output is returned to the
 * calling process), the ioctl call returns the output of this function. */
int device_ioctl(struct inode *inode,       /* see include/linux/fs.h */
                 struct file *file,         /* ditto */
                 unsigned int ioctl_num,    /* number and param for ioctl */
                 unsigned long ioctl_param)
{
    int i;
    char *temp;
    char ch;

    /* Switch according to the ioctl called */
    switch (ioctl_num) {
    case IOCTL_SET_MSG:
        /* Receive a pointer to a message (in user space) and set that
         * to be the device's message. Get the parameter given to
         * ioctl by the process. */
        temp = (char *)ioctl_param;

        /* Find the length of the message */
        get_user(ch, temp);
        for (i = 0; ch && i < BUF_LEN; i++, temp++)
            get_user(ch, temp);

        device_write(file, (char *)ioctl_param, i, 0);
        break;

    case IOCTL_GET_MSG:
        /* Give the current message to the calling process -
         * the parameter we got is a pointer, fill it. */
        i = device_read(file, (char *)ioctl_param, 99, 0);

        /* Put a zero at the end of the buffer, so it will be
         * properly terminated */
        put_user('\0', (char *)ioctl_param + i);
        break;

    case IOCTL_GET_NTH_BYTE:
        /* This ioctl is both input (ioctl_param) and
         * output (the return value of this function).
         * Refuse indexes outside the message buffer. */
        if (ioctl_param >= BUF_LEN)
            return -EINVAL;
        return Message[ioctl_param];
    }

    return SUCCESS;
}

/* Module Declarations */

/* This structure will hold the functions to be called
 * when a process does something to the device we
 * created. Since a pointer to this structure is kept in
 * the devices table, it can't be local to
 * init_module. NULL is for unimplemented functions. */
struct file_operations Fops = {
    .read = device_read,
    .write = device_write,
    .ioctl = device_ioctl,
    .open = device_open,
    .release = device_release,  /* a.k.a. close */
};

/* Initialize the module - Register the character device */
int init_module()
{
    int ret_val;

    /* Register the character device (at least try) */
    ret_val = register_chrdev(MAJOR_NUM, DEVICE_NAME, &Fops);

    /* Negative values signify an error */
    if (ret_val < 0) {
        printk("%s failed with %d\n",
               "Sorry, registering the character device ", ret_val);
        return ret_val;
    }

    printk("%s The major device number is %d.\n",
           "Registration is a success", MAJOR_NUM);
    printk("If you want to talk to the device driver,\n");
    printk("you'll have to create a device file. \n");
    printk("We suggest you use:\n");
    printk("mknod %s c %d 0\n", DEVICE_FILE_NAME, MAJOR_NUM);
    printk("The device file name is important, because\n");
    printk("the ioctl program assumes that's the\n");
    printk("file you'll use.\n");

    return 0;
}

/* Cleanup - unregister the character device */
void cleanup_module()
{
    int ret;

    /* Unregister the device */
    ret = unregister_chrdev(MAJOR_NUM, DEVICE_NAME);

    /* If there's an error, report it */
    if (ret < 0)
        printk("Error in module_unregister_chrdev: %d\n", ret);
}

Example 7-2. chardev.h

/* chardev.h - the header file with the ioctl definitions.
 * The declarations here have to be in a header file, because
 * they need to be known both to the kernel module
 * (in chardev.c) and the process calling ioctl (ioctl.c) */

#ifndef CHARDEV_H
#define CHARDEV_H

#include <linux/ioctl.h>

/* The major device number. We can't rely on dynamic
 * registration any more, because ioctls need to know
 * it. */
#define MAJOR_NUM 100

/* Set the message of the device driver */
#define IOCTL_SET_MSG _IOW(MAJOR_NUM, 0, char *)
/* _IOW means that we're creating an ioctl command
 * number for passing information from a user process
 * to the kernel module.
 * The first argument, MAJOR_NUM, is the major device
 * number we're using.
 * The second argument is the number of the command
 * (there could be several with different meanings).
 * The third argument is the type we want to get from
 * the process to the kernel. */

/* Get the message of the device driver */
#define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *)
/* This IOCTL is used for output, to get the message
 * of the device driver. However, we still need the
 * buffer to place the message in to be input,
 * as it is allocated by the process. */

/* Get the n'th byte of the message */
#define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int)
/* The IOCTL is used for both input and output. It
 * receives from the user a number, n, and returns
 * Message[n]. */

/* The name of the device file */
#define DEVICE_FILE_NAME "char_dev"

#endif

Example 7-3. ioctl.c

/* ioctl.c - the process to use ioctl's to control the kernel module
 * Until now we could have used cat for input and output. But now
 * we need to do ioctl's, which require writing our own process. */

/* device specifics, such as ioctl numbers and the
 * major device file. */
#include "chardev.h"

#include <stdio.h>      /* printf, putchar */
#include <stdlib.h>     /* exit */
#include <fcntl.h>      /* open */
#include <unistd.h>     /* close */
#include <sys/ioctl.h>  /* ioctl */

/* Functions for the ioctl calls */

void ioctl_set_msg(int file_desc, char *message)
{
    int ret_val;

    ret_val = ioctl(file_desc, IOCTL_SET_MSG, message);

    if (ret_val < 0) {
        printf("ioctl_set_msg failed:%d\n", ret_val);
        exit(-1);
    }
}

void ioctl_get_msg(int file_desc)
{
    int ret_val;
    char message[100];

    /* Warning - this is dangerous because we don't tell
     * the kernel how far it's allowed to write, so it
     * might overflow the buffer. In a real production
     * program, we would have used two ioctls - one to tell
     * the kernel the buffer length and another to give
     * it the buffer to fill */
    ret_val = ioctl(file_desc, IOCTL_GET_MSG, message);

    if (ret_val < 0) {
        printf("ioctl_get_msg failed:%d\n", ret_val);
        exit(-1);
    }

    printf("get_msg message:%s\n", message);
}

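void ioctl_get_nth_byte(int file_desc)
{
    int i;
    int c;  /* int, so a negative error return can be told from a byte */

    printf("get_nth_byte message:");

    i = 0;
    do {
        c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++);

        if (c < 0) {
            printf("ioctl_get_nth_byte failed at the %d'th byte:\n", i);
            exit(-1);
        }

        putchar(c);
    } while (c != 0);
    putchar('\n');
}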

/* Main - Call the ioctl functions */
int main()
{
    int file_desc;
    char *msg = "Message passed by ioctl\n";

    file_desc = open(DEVICE_FILE_NAME, 0);
    if (file_desc < 0) {
        printf("Can't open device file: %s\n", DEVICE_FILE_NAME);
        exit(-1);
    }

    ioctl_set_msg(file_desc, msg);
    ioctl_get_nth_byte(file_desc);
    ioctl_get_msg(file_desc);

    close(file_desc);
    return 0;
}

Chapter 8. System Calls

8.1. System Calls

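So far, all we have done is use well-defined kernel mechanisms to register /proc files and device handlers. That is fine if you want to do something the kernel programmers anticipated, such as writing a device driver. But what if you want to do something unusual and change the behaviour of the system in some way? Then you are mostly on your own.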

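This is where kernel programming gets dangerous. While writing the example below, I killed the open() system call. That meant I could not open any files, could not run any programs, and could not even shut the machine down with shutdown. I had to use the power switch. Luckily, no files were lost. To make sure you do not lose any files either, run sync right before you do the insmod and the rmmod.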

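Forget about /proc files, forget about device files. They are just minor details. The real mechanism by which every process communicates with the kernel is the system call. When a process requests a service from the kernel (such as opening a file, forking a new process, or requesting more memory), a system call is made. If you want to change the behaviour of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run it under strace.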

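In general, a user process is not supposed to access the kernel directly, and is not able to. It cannot access kernel memory and it cannot call kernel functions. The CPU hardware enforces this (which is why the mode is called "protected mode").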

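System calls are the exception to this rule. The process fills the registers with the appropriate values and then executes a special instruction that jumps to a previously defined location in the kernel (that location is readable by user processes, but of course not writable by them). On Intel CPUs this is done by means of interrupt 0x80. The hardware knows that once you jump to this location you are no longer running in restricted user mode but as the operating system kernel, and are therefore allowed to do whatever you want.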
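From user space, the same path can be exercised without the usual library wrappers by calling the C library's generic `syscall()` function with a system call number. This is a minimal sketch assuming Linux with glibc; `raw_getpid` is a hypothetical helper name. Note that `syscall()` chooses the entry instruction for the platform, so on current x86 systems it does not necessarily use interrupt 0x80.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the kernel for our process id by system call number instead of
 * through the getpid() wrapper. SYS_getpid is the number the kernel
 * looks up in its system call table.
 */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The result matches what `getpid()` returns, because both end up in the same kernel function.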

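The location in the kernel that a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. It then looks in the table of system calls (sys_call_table) to find the address of the kernel function to call, and calls it. After the function returns, it performs a few system checks and then returns to the process (or to a different process, if the process's time has run out). If you want to read this code, it is in the source file arch/<architecture>/kernel/entry.S, after the line ENTRY(system_call).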

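So, if we want to change the way a certain system call works, we write our own function to implement it (usually by adding a bit of our own code and then calling the original function) and then change the pointer in sys_call_table to point to our function. Because the module may be removed later and we do not want to leave the system in an unstable state, it is important for cleanup_module to restore the table to its original state.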

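The source code here is an example of such a kernel module. We want to "spy" on a certain user and printk() a message whenever that user opens a file. To that end, we replace the system call that opens a file with our own function, our_sys_open. This function checks the uid (user's id) of the current process, and if it matches the uid we are spying on, it calls printk() to display the name of the file about to be opened. Then, either way, it calls the original open() function with the same parameters to actually open the file.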

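The init_module function replaces the appropriate entry in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to put everything back. This approach is dangerous, because two kernel modules may change the same system call. Imagine two modules, A and B. A replaces the system's sys_open with A_open, and B replaces it with B_open. When A is loaded first, the system call is replaced with A_open, which calls the original sys_open when it is done. Next, B is loaded. It replaces the system call, which is by now A_open, with B_open, and B_open calls what it believes to be the original system call, A_open, when it is done.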

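Now, if B is removed first, everything is fine: the system call is restored to A_open, which calls the original sys_open. However, if A is removed first, the system will crash. Removing A restores the system call to the original sys_open, cutting B out of the chain. When B is then removed, it restores the system call to what it thinks is the original, A_open, which is no longer in memory. At first glance it seems we could solve this by checking whether the system call still equals our own open function and doing nothing if it does not (so that B would not touch the table when it is removed). That is actually worse. When A is removed first, it sees that the system call has been changed to B_open, so it does not restore the entry. B_open then still tries to call A_open, which no longer exists, and the system crashes even without B being removed.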
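The unload-order problem can be reproduced harmlessly in user space with a single function pointer standing in for the table entry. This is only a simulation under stated assumptions: `table_slot`, `struct hook`, `load`, `unload`, and `slot_is_original_after` are hypothetical names, and nothing here touches a real system call table. In user space the stale function still exists, so the sketch can only show that the slot ends up pointing at it, not the crash itself.

```c
typedef int (*open_fn)(const char *path);

/* One-slot stand-in for sys_call_table[__NR_open]. */
static open_fn table_slot;

/* Stand-in for the original sys_open. */
static int real_open(const char *path) { (void)path; return 3; }

struct hook {
    open_fn ours;   /* replacement installed by this "module" */
    open_fn saved;  /* whatever was in the slot when it loaded */
};

static struct hook hook_a, hook_b;

/* Each replacement does its own work, then calls what it saved. */
static int a_open(const char *path) { return hook_a.saved(path); }
static int b_open(const char *path) { return hook_b.saved(path); }

/* "init_module": remember the current entry, then install ours. */
static void load(struct hook *h, open_fn fn)
{
    h->ours = fn;
    h->saved = table_slot;
    table_slot = fn;
}

/* "cleanup_module": blindly restore what we saved. */
static void unload(struct hook *h)
{
    table_slot = h->saved;
}

/*
 * Load A, then B, then unload in the requested order.
 * Returns 1 if the slot points at real_open afterwards, 0 otherwise.
 */
int slot_is_original_after(int unload_a_first)
{
    table_slot = real_open;
    load(&hook_a, a_open);
    load(&hook_b, b_open);

    if (unload_a_first) {
        unload(&hook_a);    /* slot = real_open, B is cut out */
        unload(&hook_b);    /* slot = a_open, which A "took with it" */
    } else {
        unload(&hook_b);    /* slot = a_open */
        unload(&hook_a);    /* slot = real_open */
    }
    return table_slot == real_open;
}
```

Unloading in reverse order of loading leaves the slot pointing at the original function. Unloading A first leaves it pointing at a function that, in a real kernel, would already have been freed.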

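All of these problems make replacing system calls unsuitable for production use, where stability and reliability matter. To keep people from doing potentially harmful things to the system call table, sys_call_table is no longer exported by the kernel. This means that if you want to actually run this example, you have to patch your kernel tree so that sys_call_table is exported. You will find the patch and instructions in the example directory. As you can imagine, this is not to be taken lightly. If the system is valuable (for example, it is not yours, or it would be hard to restore), you had better not try. If you still insist: applying the patch is unlikely to cause much trouble in itself, but the kernel maintainers certainly had good reasons for not supporting this kind of hack in 2.6 kernels. See the README for details. If you decide against it, skipping this example is the safe choice.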

Example 8-1. syscall.c

/* syscall.c
 * System call "stealing" sample. */

/* Copyright (C) 2001 by Peter Jay Salzman */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>       /* We're doing kernel work */
#include <linux/module.h>       /* Specifically, a module, */
#include <linux/moduleparam.h>  /* which will have params */
#include <linux/unistd.h>       /* The list of system calls */

/* For the current (process) structure, we need
 * this to know who the current user is. */
#include <linux/sched.h>
#include <asm/uaccess.h>

/* The system call table (a table of functions). We
 * just define this as external, and the kernel will
 * fill it up for us when we are insmod'ed
 * sys_call_table is no longer exported in 2.6.x kernels.
 * If you really want to try this DANGEROUS module you will
 * have to apply the supplied patch against your current kernel
 * and recompile it. */
extern void *sys_call_table[];

/* UID we want to spy on - will be filled from the
 * command line */
static int uid;
module_param(uid, int, 0644);

/* A pointer to the original system call. The reason
 * we keep this, rather than call the original function
 * (sys_open), is because somebody else might have
 * replaced the system call before us. Note that this
 * is not 100% safe, because if another module
 * replaced sys_open before us, then when we're inserted
 * we'll call the function in that module - and it
 * might be removed before we are.
 * Another reason for this is that we can't get sys_open.
 * It's a static variable, so it is not exported. */
asmlinkage int (*original_call) (const char *, int, int);

/* The function we'll replace sys_open (the function
 * called when you call the open system call) with. To
 * find the exact prototype, with the number and type
 * of arguments, we find the original function first
 * (it's at fs/open.c).
 * In theory, this means that we're tied to the
 * current version of the kernel. In practice, the
 * system calls almost never change (it would wreak havoc
 * and require programs to be recompiled, since the system
 * calls are the interface between the kernel and the
 * processes). */
asmlinkage int our_sys_open(const char *filename, int flags, int mode)
{
    int i = 0;
    char ch;

    /* Check if this is the user we're spying on */
    if (uid == current->uid) {
        /* Report the file, if relevant */
        printk("Opened file by %d: ", uid);
        do {
            get_user(ch, filename + i);
            i++;    /* advance, or the loop never ends */
            printk("%c", ch);
        } while (ch != 0);
        printk("\n");
    }

    /* Call the original sys_open - otherwise, we lose
     * the ability to open files */
    return original_call(filename, flags, mode);
}

/* Initialize the module - replace the system call */
int init_module()
{
    /* Warning - too late for it now, but maybe for
     * next time... */
    printk("I'm dangerous. I hope you did a ");
    printk("sync before you insmod'ed me.\n");
    printk("My counterpart, cleanup_module(), is even ");
    printk("more dangerous. If\n");
    printk("you value your file system, it will ");
    printk("be \"sync; rmmod\" \n");
    printk("when you remove this module.\n");

    /* Keep a pointer to the original function in
     * original_call, and then replace the system call
     * in the system call table with our_sys_open */
    original_call = sys_call_table[__NR_open];
    sys_call_table[__NR_open] = our_sys_open;

    /* To get the address of the function for system
     * call foo, go to sys_call_table[__NR_foo]. */

    printk("Spying on UID:%d\n", uid);

    return 0;
}

/* Cleanup - restore the original system call */
void cleanup_module()
{
    /* Return the system call back to normal */
    if (sys_call_table[__NR_open] != our_sys_open) {
        printk("Somebody else also played with the ");
        printk("open system call\n");
        printk("The system may be left in ");
        printk("an unstable state.\n");
    }

    sys_call_table[__NR_open] = original_call;
}

Chapter 9. Blocking Processes

9.1. Blocking Processes

9.1.1. Enter Sandman

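What do you do when somebody asks you for something you cannot do right away? If you are a human being and you are bothered by another human being, all you can say is: "Not right now, I'm busy. Go away!" But if you are a kernel module and you are bothered by a process, you have another option. You can put the process to sleep until you can service it. After all, the kernel puts processes to sleep and wakes them up all the time (that is how multiple processes appear to run at the same time on a single CPU).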

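This kernel module is an example of that. The file (/proc/sleep) can be opened by only one process at a time. If the file is already open, the kernel module calls wait_event_interruptible[11]. This function changes the state of the task (a task is the kernel data structure that holds information about a process and the system call it is in, if any) to TASK_INTERRUPTIBLE, which means the task will not run until it is woken up somehow, and adds it to WaitQ, the queue of tasks waiting to open the file. The function then calls the scheduler to switch to a different process, one that has some use for the CPU.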

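When a process is done with the file, it closes it, and module_close is called. That function wakes up all the processes in the queue (this example has no mechanism for waking only one of them). It then returns, and the process that just closed the file continues to run. In time, the scheduler decides that process has had enough and gives the CPU to another process. Eventually one of the processes that was in the queue gets the CPU and resumes right after its call to wait_event_interruptible[12]. It can then set a global variable to tell the other processes that the file is open again. When the other waiting processes get a slice of CPU time, they see that variable and go back to sleep.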

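More interestingly, module_close does not have a monopoly on waking up the processes that are waiting for the file. A signal, such as Ctrl+c (SIGINT), can also wake up a process [13]. In that case we want to return -EINTR immediately. This matters to users: for example, it lets them kill a process before it ever gets the file.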

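There is one more point to remember. Sometimes processes do not want to sleep: they want either to get what they asked for immediately or to be told that it cannot be done. Such processes use the O_NONBLOCK flag when opening the file. The kernel is supposed to respond by returning the error code -EAGAIN from operations that would otherwise block, such as opening the file in this example. The program cat_noblock, available in the source directory for this chapter, can be used to open a file with O_NONBLOCK.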
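From the process side, a non-blocking open and its possible outcomes might be handled as in the following sketch. It is not the cat_noblock program mentioned above. `probe_open` is a hypothetical helper, and the meaning attached to EAGAIN assumes a driver that behaves like the example module.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open path for reading without blocking.
 * Returns  1 if the open succeeded (the descriptor is closed again),
 *          0 if the driver answered EAGAIN, i.e. it would have blocked,
 *          a negative errno value for any other failure.
 */
int probe_open(const char *path)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);

    if (fd >= 0) {
        close(fd);
        return 1;
    }
    if (errno == EAGAIN)
        return 0;
    return -errno;
}
```

A caller that gets 0 back can do other work and try again later instead of sleeping inside the open call.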

Example 9-1. sleep.c

/* sleep.c - create a /proc file, and if several processes try to open it at
 * the same time, put all but one to sleep */

#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */
#include <linux/proc_fs.h>  /* Necessary because we use proc fs */
#include <linux/sched.h>    /* For putting processes to sleep and
                               waking them up */
#include <asm/uaccess.h>    /* for get_user and put_user */

/* The module's file functions */

/* Here we keep the last message received, to prove that we can process our
 * input */
#define MESSAGE_LENGTH 80
static char Message[MESSAGE_LENGTH];

static struct proc_dir_entry *Our_Proc_File;
#define PROC_ENTRY_FILENAME "sleep"

/* Since we use the file operations struct, we can't use the special proc
 * output provisions - we have to use a standard read function, which is this
 * function */
static ssize_t module_output(struct file *file, /* see include/linux/fs.h */
                             char *buf,         /* The buffer to put data to
                                                   (in the user segment) */
                             size_t len,        /* The length of the buffer */
                             loff_t * offset)
{
    static int finished = 0;
    int i;
    char message[MESSAGE_LENGTH + 30];

    /* Return 0 to signify end of file - that we have nothing
     * more to say at this point. */
    if (finished) {
        finished = 0;
        return 0;
    }

    /* Format the reply in kernel memory, then copy it to the
     * user buffer one byte at a time with put_user. */
    sprintf(message, "Last input:%s\n", Message);
    for (i = 0; i < len && message[i]; i++)
        put_user(message[i], buf + i);

    finished = 1;
    return i;   /* Return the number of bytes "read" */
}

/* This function receives input from the user when the user writes to the /proc
 * file. */
static ssize_t module_input(struct file *file, /* The file itself */
                            const char *buf,   /* The buffer with input */
                            size_t length,     /* The buffer's length */
                            loff_t * offset)   /* offset to file - ignore */
{
    int i;

    /* Put the input into Message, where module_output will later be
     * able to use it */
    for (i = 0; i < MESSAGE_LENGTH - 1 && i < length; i++)
        get_user(Message[i], buf + i);
    /* we want a standard, zero terminated string */
    Message[i] = '\0';

    /* We need to return the number of input characters used */
    return i;
}

/* 1 if the file is currently open by somebody */
int Already_Open = 0;

/* Queue of processes who want our file */
DECLARE_WAIT_QUEUE_HEAD(WaitQ);

/* Called when the /proc file is opened */
static int module_open(struct inode *inode, struct file *file)
{
    /* If the file's flags include O_NONBLOCK, it means the process doesn't
     * want to wait for the file. In this case, if the file is already
     * open, we should fail with -EAGAIN, meaning "you'll have to try
     * again", instead of blocking a process which would rather stay awake. */
    if ((file->f_flags & O_NONBLOCK) && Already_Open)
        return -EAGAIN;

    /* This is the correct place for try_module_get(THIS_MODULE) because
     * if a process is in the loop, which is within the kernel module,
     * the kernel module must not be removed. */
    try_module_get(THIS_MODULE);

    /* If the file is already open, wait until it isn't */
    while (Already_Open) {
        int i, is_sig = 0;

        /* This function puts the current process, including any system
         * calls, such as us, to sleep. Execution will be resumed right
         * after the function call, either because somebody called
         * wake_up(&WaitQ) (only module_close does that, when the file
         * is closed) or when a signal, such as Ctrl-C, is sent
         * to the process */
        wait_event_interruptible(WaitQ, !Already_Open);

        /* If we woke up because we got a signal we're not blocking,
         * return -EINTR (fail the system call). This allows processes
         * to be killed or stopped. */

        /* Emmanuel Papirakis:
         * This is a little update to work with 2.2.*. Signals now are contained in
         * two words (64 bits) and are stored in a structure that contains an array of
         * two unsigned longs. We now have to make 2 checks in our if.
         * Ori Pomerantz:
         * Nobody promised me they'll never use more than 64 bits, or that this book
         * won't be used for a version of Linux with a word size of 16 bits. This code
         * would work in any case. */
        for (i = 0; i < _NSIG_WORDS && !is_sig; i++)
            is_sig = (current->pending.signal.sig[i] &
                      ~current->blocked.sig[i]) != 0;

        if (is_sig) {
            /* It's important to put module_put(THIS_MODULE) here,
             * because for processes where the open is interrupted
             * there will never be a corresponding close. If we
             * don't decrement the usage count here, we will be
             * left with a positive usage count which we'll have no
             * way to bring down to zero, giving us an immortal
             * module, which can only be killed by rebooting
             * the machine. */
            module_put(THIS_MODULE);
            return -EINTR;
        }
    }

    /* If we got here, Already_Open must be zero */

    /* Open the file */
    Already_Open = 1;
    return 0;   /* Allow the access */
}

/* Called when the /proc file is closed */
int module_close(struct inode *inode, struct file *file)
{
    /* Set Already_Open to zero, so one of the processes in the WaitQ will
     * be able to set Already_Open back to one and to open the file. All
     * the other processes will be called when Already_Open is back to one,
     * so they'll go back to sleep. */
    Already_Open = 0;

    /* Wake up all the processes in WaitQ, so if anybody is waiting for the
     * file, they can have it. */
    wake_up(&WaitQ);

    module_put(THIS_MODULE);

    return 0;   /* success */
}

/* This function decides whether to allow an operation (return zero) or not
 * allow it (return a non-zero which indicates why it is not allowed).
 * The operation can be one of the following values:
 * 0 - Execute (run the "file" - meaningless in our case)
 * 2 - Write (input to the kernel module)
 * 4 - Read (output from the kernel module)
 * This is the real function that checks file permissions. The permissions
 * returned by ls -l are for reference only, and can be overridden here. */
static int module_permission(struct inode *inode, int op, struct nameidata *nd)
{
    /* We allow everybody to read from our module, but only root (uid 0)
     * may write to it */
    if (op == 4 || (op == 2 && current->euid == 0))
        return 0;

    /* If it's anything else, access is denied */
    return -EACCES;
}

/* Structures to register as the /proc file, with pointers to all the relevant
 * functions. */

/* File operations for our proc file. This is where we place pointers to all
 * the functions called when somebody tries to do something to our file. NULL
 * means we don't want to deal with something. */
static struct file_operations File_Ops_4_Our_Proc_File = {
    .read = module_output,      /* "read" from the file */
    .write = module_input,      /* "write" to the file */
    .open = module_open,        /* called when the /proc file is opened */
    .release = module_close,    /* called when it's closed */
};

/* Inode operations for our proc file. We need it so we'll have somewhere to
 * specify the file operations structure we want to use, and the function we
 * use for permissions. It's also possible to specify functions to be called
 * for anything else which could be done to an inode (although we don't bother,
 * we just leave them NULL). */
static struct inode_operations Inode_Ops_4_Our_Proc_File = {
    .permission = module_permission,    /* check for permissions */
};

/* Module initialization and cleanup */

/* Initialize the module - register the proc file */
int init_module()
{
    Our_Proc_File = create_proc_entry(PROC_ENTRY_FILENAME, 0644, NULL);

    /* Check for failure before touching any field of the entry */
    if (Our_Proc_File == NULL) {
        printk(KERN_INFO "Error: Could not initialize /proc/%s\n",
               PROC_ENTRY_FILENAME);
        return -ENOMEM;
    }

    Our_Proc_File->owner = THIS_MODULE;
    Our_Proc_File->proc_iops = &Inode_Ops_4_Our_Proc_File;
    Our_Proc_File->proc_fops = &File_Ops_4_Our_Proc_File;
    Our_Proc_File->mode = S_IFREG | S_IRUGO | S_IWUSR;
    Our_Proc_File->uid = 0;
    Our_Proc_File->gid = 0;
    Our_Proc_File->size = 80;

    return 0;
}

/* Cleanup - unregister our file from /proc. This could get dangerous if
 * there are still processes waiting in WaitQ, because they are inside our
 * open function, which will get unloaded. I'll explain how to avoid removal
 * of a kernel module in such a case in chapter 10. */
void cleanup_module()
{
    remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
}

Chapter 10. Replacing Printks

10.1. Replacing printk

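In an earlier section, I said that X and kernel module programming do not mix. That is true while developing kernel modules. In actual use, however, you want to be able to send messages to whichever tty[14] the command to load the module came from.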

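This is done by using current, a pointer to the currently running task, to get the current task's tty structure. We then look inside that tty structure for a pointer to the function that writes a string to the tty, and call it to write our message to the terminal.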

Example 10-1. print_string.c

/* print_string.c - Send output to the tty we're running on, regardless if it's
 * through X11, telnet, etc. We do this by printing the string to the tty
 * associated with the current task. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>    /* For current */
#include <linux/tty.h>      /* For the tty declarations */
#include <linux/version.h>  /* For LINUX_VERSION_CODE */

MODULE_AUTHOR("Peter Jay Salzman");

static void print_string(char *str)
{
    struct tty_struct *my_tty;

    /* tty struct went into signal struct in 2.6.6 */
#if (LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,5))
    /* The tty for the current task */
    my_tty = current->tty;
#else
    /* The tty for the current task, for 2.6.6+ kernels */
    my_tty = current->signal->tty;
#endif

    /* If my_tty is NULL, the current task has no tty you can print to
     * (ie, if it's a daemon). If so, there's nothing we can do. */
    if (my_tty != NULL) {

        /* my_tty->driver is a struct which holds the tty's functions,
         * one of which (write) is used to write strings to the tty.
         * It can be used to take a string either from the user's or
         * kernel's memory segment.
         * The function's 1st parameter is the tty to write to,
         * because the same function would normally be used for all
         * tty's of a certain type. The 2nd parameter controls whether
         * the function receives a string from kernel memory (false, 0)
         * or from user memory (true, non zero). The 3rd parameter is
         * a pointer to a string. The 4th parameter is the length of
         * the string. */
        ((my_tty->driver)->write) (my_tty,  /* The tty itself */
                                   0,       /* Don't take the string
                                               from user space */
                                   str,     /* String */
                                   strlen(str));    /* Length */

        /* ttys were originally hardware devices, which (usually)
         * strictly followed the ASCII standard. In ASCII, to move to
         * a new line you need two characters, a carriage return and a
         * line feed. On Unix, the ASCII line feed is used for both
         * purposes - so we can't just use \n, because it wouldn't have
         * a carriage return and the next line will start at the
         * column right after the line feed.
         * This is why text files are different between Unix and
         * MS Windows. In CP/M and derivatives, like MS-DOS and
         * MS Windows, the ASCII standard was strictly adhered to,
         * and therefore a newline requires both a LF and a CR. */
        ((my_tty->driver)->write) (my_tty, 0, "\015\012", 2);
    }
}

static int __init print_string_init(void)
print_string("The module has been inserted. Hello world!");
return 0;

static void __exit print_string_exit(void)
print_string("The module has been removed. Farewell world!");


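10.2. Flashing keyboard LEDs

In certain conditions you may want a simpler and more direct way for your module to communicate with the outside world. Flashing the keyboard LEDs can be such a solution: it is an immediate way to attract attention or to display a status condition. The LEDs need no setup, and using them is far less intrusive than writing to a tty or to a file on disk.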


Example 10-2. kbleds.c

/*
 * kbleds.c - Blink keyboard leds until the module is unloaded.
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/tty.h>             /* For fg_console, MAX_NR_CONSOLES */
#include <linux/kd.h>              /* For KDSETLED */
#include <linux/console_struct.h>  /* For vc_cons */

MODULE_DESCRIPTION("Example module illustrating the use of Keyboard LEDs.");
MODULE_AUTHOR("Daniele Paolo Scarpazza");
MODULE_LICENSE("GPL");

struct timer_list my_timer;
struct tty_driver *my_driver;
int kbledstatus = 0;    /* int, because my_timer_func reads it through an int pointer */

#define BLINK_DELAY   HZ/5
#define ALL_LEDS_ON   0x07
#define RESTORE_LEDS  0xFF

/*
 * Function my_timer_func blinks the keyboard LEDs periodically by invoking
 * command KDSETLED of ioctl() on the keyboard driver. To learn more on virtual
 * terminal ioctl operations, please see file:
 * /usr/src/linux/drivers/char/vt_ioctl.c, function vt_ioctl().
 *
 * The argument to KDSETLED is alternately set to 7 (thus causing the led
 * mode to be set to LED_SHOW_IOCTL, and all the leds are lit) and to 0xFF
 * (any value above 7 switches back the led mode to LED_SHOW_FLAGS, thus
 * the LEDs reflect the actual keyboard status). To learn more on this,
 * please see file:
 * /usr/src/linux/drivers/char/keyboard.c, function setledstate().
 */
static void my_timer_func(unsigned long ptr)
{
    int *pstatus = (int *)ptr;

    if (*pstatus == ALL_LEDS_ON)
        *pstatus = RESTORE_LEDS;
    else
        *pstatus = ALL_LEDS_ON;

    (my_driver->ioctl) (vc_cons[fg_console].d->vc_tty, NULL, KDSETLED,
                        *pstatus);

    /* Re-arm the timer so that we are called again */
    my_timer.expires = jiffies + BLINK_DELAY;
    add_timer(&my_timer);
}

static int __init kbleds_init(void)
{
    int i;

    printk(KERN_INFO "kbleds: loading\n");
    printk(KERN_INFO "kbleds: fgconsole is %x\n", fg_console);
    for (i = 0; i < MAX_NR_CONSOLES; i++) {
        if (!vc_cons[i].d)
            break;
        printk(KERN_INFO "kbleds: console[%i/%i] #%i, tty %lx\n", i,
               MAX_NR_CONSOLES, vc_cons[i].d->vc_num,
               (unsigned long)vc_cons[i].d->vc_tty);
    }
    printk(KERN_INFO "kbleds: finished scanning consoles\n");

    my_driver = vc_cons[fg_console].d->vc_tty->driver;
    printk(KERN_INFO "kbleds: tty driver magic %x\n", my_driver->magic);

    /*
     * Set up the LED blink timer the first time
     */
    init_timer(&my_timer);
    my_timer.function = my_timer_func;
    my_timer.data = (unsigned long)&kbledstatus;
    my_timer.expires = jiffies + BLINK_DELAY;
    add_timer(&my_timer);

    return 0;
}

static void __exit kbleds_cleanup(void)
{
    printk(KERN_INFO "kbleds: unloading...\n");
    del_timer(&my_timer);
    (my_driver->ioctl) (vc_cons[fg_console].d->vc_tty, NULL, KDSETLED,
                        RESTORE_LEDS);
}

module_init(kbleds_init);
module_exit(kbleds_cleanup);


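If none of the examples above fit your debugging needs, there are other tricks to try. Remember CONFIG_LL_DEBUG in make menuconfig? If you activate it, you get low-level access to the serial port. If that is still not enough, you can patch kernel/printk.c or any other essential low-level call to use printascii, which makes it possible to trace nearly every step the kernel takes over a serial line. If your architecture does not support the examples above but does have a standard serial port, this is probably the first thing to consider. Logging over a network console (netconsole) is also worth a try.

There are plenty of debugging techniques, but keep in mind that debugging is almost always intrusive. Adding debug code can change conditions enough to make the original bug seem to disappear, so keep debug code to a minimum and make sure it does not show up in production code.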

Chapter 11. Scheduling Tasks

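11.1. Scheduling Tasks

Very often we have "housekeeping" tasks which have to be done at a certain time, or every so often. If the task is to be done by a user process, we put it in a crontab file. If the task is to be done by a kernel module, we have two possibilities. The first is to put a process in the crontab file which wakes up the module through a system call when necessary, for example by opening a file. This is terribly inefficient: we run a new process off of crontab and read a new executable into memory, all just to wake up a kernel module which is in memory anyway.

The second is to create a function that will be called once for every timer interrupt. We do this by creating a task, held in a tq_struct structure, which holds a pointer to the function. Then we use queue_task to put that task on a task list called tq_timer, which is the list of tasks to be executed on the next timer interrupt. Because we want the function to keep on being executed, we need to put it back on tq_timer whenever it is called, for the next timer interrupt. Example 11-1 below, written for 2.6 kernels, does the same job with a work queue and queue_delayed_work instead of tq_timer.

There is one more point to remember here. When a module is removed by rmmod, its reference count is checked first. If it is zero, cleanup_module is called. Then the module is removed from memory with all its functions. Nobody checks whether the task list happens to contain a pointer to one of those functions. Ages later (from the computer's perspective; from a human perspective it's nothing, about a hundredth of a second), the kernel gets a timer interrupt and tries to call the function on the task list. Unfortunately, the function is no longer there. In most cases the memory page where it sat is unused, and you just get an ugly error message. But if some other code is now sitting at the same memory location, things could get very ugly. Unfortunately, we don't have an easy way to unregister a task from a task list either.

Since cleanup_module can't return an error code (it's a void function), the solution is to not let it return at all. Instead, it calls sleep_on or module_sleep_on[15] to put the rmmod process to sleep. Before that, it tells the function called on the timer interrupt to stop putting itself back on the queue. Then, on the next timer interrupt, the rmmod process is woken up; by then our function is no longer in the queue and it is safe to remove the module.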
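The shutdown handshake described above can be sketched in user space. This is a hypothetical simulation under stated assumptions, not kernel code: sim_queue, sim_tick, sim_cleanup and housekeeping are invented names that stand in for the timer task queue, the timer interrupt, cleanup_module and the periodic function.

```c
#include <stddef.h>

/* Hypothetical one-slot stand-in for the kernel's timer task queue. */
typedef void (*task_fn)(void);

static task_fn sim_queue = NULL; /* task to run on the next simulated tick */
static int die = 0;              /* set to 1 to ask the task to stop requeueing */
static int tick_count = 0;       /* how many times the task has run */

/* The periodic task: does its work, then puts itself back on the queue
 * unless shutdown has been requested. */
static void housekeeping(void)
{
    tick_count++;
    if (!die)
        sim_queue = housekeeping;
}

/* Simulated timer interrupt: take the pending task off the queue and run it.
 * Returns 1 if a task ran, 0 if the queue was empty. */
static int sim_tick(void)
{
    task_fn pending = sim_queue;

    if (pending == NULL)
        return 0;
    sim_queue = NULL;
    pending();
    return 1;
}

/* Simulated cleanup: request shutdown, then keep waiting (here: ticking)
 * until the queue no longer references the task. Only then would it be
 * safe to unload the code the queue pointed at. */
static void sim_cleanup(void)
{
    die = 1;
    while (sim_tick())
        ;
}
```

The point of the design is the ordering: the flag is set first, and the cleanup path returns only after one more tick has drained the queue, so nothing can call into freed code afterwards.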

Example 11-1. sched.c

/*
 * sched.c - schedule a function to be called on every timer interrupt.
 *
 * Copyright (C) 2001 by Peter Jay Salzman
 */

/*
 * The necessary header files
 */

/*
 * Standard in kernel modules
 */
#include <linux/kernel.h>     /* We're doing kernel work */
#include <linux/module.h>     /* Specifically, a module */
#include <linux/proc_fs.h>    /* Necessary because we use the proc fs */
#include <linux/workqueue.h>  /* We schedule tasks here */
#include <linux/sched.h>      /* We need to put ourselves to sleep
                                 and wake up later */
#include <linux/init.h>       /* For __init and __exit */
#include <linux/interrupt.h>  /* For irqreturn_t */

struct proc_dir_entry *Our_Proc_File;
#define PROC_ENTRY_FILENAME "sched"
#define MY_WORK_QUEUE_NAME "WQsched.c"

/*
 * The number of times the timer interrupt has been called so far
 */
static int TimerIntrpt = 0;

static void intrpt_routine(void *);

static int die = 0;    /* set this to 1 for shutdown */

/*
 * The work queue structure for this task, from workqueue.h
 */
static struct workqueue_struct *my_workqueue;

static struct work_struct Task;
static DECLARE_WORK(Task, intrpt_routine, NULL);

/*
 * This function is called each time the delayed work fires (every 100
 * jiffies here). Notice the void* pointer - task functions can be used
 * for more than one purpose, each time getting a different parameter.
 */
static void intrpt_routine(void *irrelevant)
{
    /*
     * Increment the counter
     */
    TimerIntrpt++;

    /*
     * Requeue ourselves unless cleanup wants us to die
     */
    if (die == 0)
        queue_delayed_work(my_workqueue, &Task, 100);
}

/*
 * Put data into the proc fs file.
 */
static int
procfile_read(char *buffer,
              char **buffer_location,
              off_t offset, int buffer_length, int *eof, void *data)
{
    int len;    /* The number of bytes actually used */

    /*
     * It's static so it will still be in memory
     * when we leave this function
     */
    static char my_buffer[80];

    /*
     * We give all of our information in one go, so if anybody asks us
     * if we have more information the answer should always be no.
     */
    if (offset > 0)
        return 0;

    /*
     * Fill the buffer and get its length
     */
    len = sprintf(my_buffer, "Timer called %d times so far\n", TimerIntrpt);

    /*
     * Tell the function which called us where the buffer is
     */
    *buffer_location = my_buffer;

    /*
     * Return the length
     */
    return len;
}

/*
 * Initialize the module - register the proc file
 */
int __init init_module()
{
    int rv = 0;

    /*
     * Put the task on the work queue, so it will be executed
     * once the delay has expired
     */
    my_workqueue = create_workqueue(MY_WORK_QUEUE_NAME);
    queue_delayed_work(my_workqueue, &Task, 100);

    Our_Proc_File = create_proc_entry(PROC_ENTRY_FILENAME, 0644, NULL);

    /* Check for failure before touching any field of the entry */
    if (Our_Proc_File == NULL) {
        rv = -ENOMEM;
        remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
        printk(KERN_INFO "Error: Could not initialize /proc/%s\n",
               PROC_ENTRY_FILENAME);
        return rv;
    }

    Our_Proc_File->read_proc = procfile_read;
    Our_Proc_File->owner = THIS_MODULE;
    Our_Proc_File->mode = S_IFREG | S_IRUGO;
    Our_Proc_File->uid = 0;
    Our_Proc_File->gid = 0;
    Our_Proc_File->size = 80;

    return rv;
}

/*
 * Cleanup
 */
void __exit cleanup_module()
{
    /*
     * Unregister our /proc file
     */
    remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
    printk(KERN_INFO "/proc/%s removed\n", PROC_ENTRY_FILENAME);

    die = 1;                        /* keep intrpt_routine from queueing itself */
    cancel_delayed_work(&Task);     /* no "new ones" */
    flush_workqueue(my_workqueue);  /* wait till all "old ones" finished */
    destroy_workqueue(my_workqueue);

    /*
     * We must not return before intrpt_routine has run for the last time:
     * otherwise we would deallocate the memory holding intrpt_routine and
     * Task while the work queue still references them.
     */
}

/*
 * some work_queue related functions
 * are just available to GPL licensed Modules
 */
MODULE_LICENSE("GPL");

Chapter 12. Interrupt Handlers

12.1. Interrupt Handlers

12.1.1. Interrupt Handlers

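Except for the last chapter, everything we did in the kernel so far we've done as a response to a process asking for it, either by dealing with a special file, sending an ioctl(), or issuing a system call. But the job of the kernel isn't just to respond to process requests. Another job, which is every bit as important, is to speak to the hardware connected to the machine.

There are two types of interaction between the CPU and the rest of the computer's hardware. The first type is when the CPU gives orders to the hardware, the other is when the hardware needs to tell the CPU something. The second, called interrupts, is much harder to implement because it has to be dealt with when it is convenient for the hardware, not for the CPU. Hardware devices typically have a very small amount of buffer memory, and if you don't read their information in time, it is lost.

Under Linux, hardware interrupts are called IRQs (Interrupt Requests)[16]. There are two types of IRQs, short and long. A short IRQ is one which is expected to take a very short period of time, during which the rest of the machine is blocked and no other interrupts are handled. A long IRQ is one which can take longer, and during which other interrupts may occur (but not interrupts from the same device). If at all possible, declare an interrupt handler to be long.

When the CPU receives an interrupt, it stops whatever it's doing (unless it's processing a more important interrupt, in which case it will deal with this one only when the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the interrupt handler itself, because the system is in an unknown state. The solution is for the interrupt handler to do what needs to be done immediately, usually reading something from the hardware or sending something to it, then schedule the handling of the new information for a later time (this is called the "bottom half"[17]) and return. The kernel is then guaranteed to call the bottom half as soon as possible, and when it does, everything allowed in kernel modules is allowed.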
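The split between the immediate part and the deferred part can be sketched in user space. The code below is only an illustration under stated assumptions: top_half, bottom_half, the pending buffer and processed_sum are invented names, and the "slow" processing is reduced to adding up the bytes.

```c
#include <stddef.h>

#define PENDING_MAX 16

/* Hypothetical deferred-work queue shared by the two halves. */
static unsigned char pending[PENDING_MAX];
static size_t pending_len = 0;
static unsigned long processed_sum = 0; /* stand-in for the "slow" processing */

/* Top half: runs in (simulated) interrupt context. It only grabs the data
 * from the "hardware" and defers everything else. Returns 0 on success,
 * -1 if the queue is full and the byte had to be dropped. */
static int top_half(unsigned char hw_byte)
{
    if (pending_len == PENDING_MAX)
        return -1;
    pending[pending_len++] = hw_byte;
    return 0;
}

/* Bottom half: runs later, when it is safe to do anything a normal
 * module may do. Returns the number of items it handled. */
static size_t bottom_half(void)
{
    size_t handled = pending_len;
    size_t i;

    for (i = 0; i < pending_len; i++)
        processed_sum += pending[i];
    pending_len = 0;
    return handled;
}
```

Calling top_half twice and then bottom_half once handles both bytes in a single deferred pass, which is the behaviour the paragraph above describes: the handler returns quickly and the real work happens afterwards.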

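The way to implement this is to call request_irq() to get your interrupt handler called when the relevant IRQ is received (on Intel platforms there are 15 of them, plus 1 which is used to cascade the interrupt controllers). This function receives the IRQ number, the name of the handler function, flags, a name for /proc/interrupts and a parameter to pass to the interrupt handler. The flags can include SA_SHIRQ to indicate you're willing to share the IRQ with other interrupt handlers (usually because a number of hardware devices sit on the same IRQ) and SA_INTERRUPT to indicate this is a fast interrupt. The function only succeeds if there isn't already a handler on this IRQ, or if you are both willing to share.

Then, from within the interrupt handler, we communicate with the hardware and then use queue_task_irq() with tq_immediate and mark_bh(BH_IMMEDIATE) to schedule the bottom half. The reason we can't use the standard queue_task in version 2.0 is that the interrupt might happen right in the middle of somebody else's queue_task[18]. We need mark_bh because earlier versions of Linux only had an array of 32 bottom halves, and now one of them (BH_IMMEDIATE) is used for the linked list of bottom halves for drivers which didn't get a bottom half entry assigned to them. Example 12-1 below, written for 2.6 kernels, uses a work queue and queue_work for the same purpose.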

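12.1.2. Keyboards on the Intel Architecture


I had a problem with writing the sample code for this chapter. On one hand, for an example to be useful it has to run on everybody's computer with meaningful results. On the other hand, the kernel already includes device drivers for all of the common devices, and those drivers won't coexist with what I'm going to write. The solution I found was to write something for the keyboard interrupt, and to disable the regular keyboard interrupt handler first. Since that handler is defined as a static symbol in the kernel source (see drivers/char/keyboard.c), there is no way to restore it. So before you insmod this code, open another terminal and run sleep 120 ; reboot if you value your machine.

This code binds itself to IRQ 1, which is the IRQ of the keyboard on the Intel architecture. Then, when it receives a keyboard interrupt, it reads the keyboard's status (that's the purpose of the inb(0x64)) and the scan code, which is the value returned by the keyboard. Then, as soon as the kernel thinks it's feasible, it runs got_char, which gives the code of the key used (the first seven bits of the scan code) and whether it has been pressed (the 8th bit is zero) or released (the 8th bit is one).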
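The bit layout of the scan code can be tried out in user space. This is a minimal sketch; struct key_event and decode_scancode are hypothetical names that do not appear in the example module, which does the same masking inline in got_char.

```c
/* Hypothetical helper type: not part of the example module. */
struct key_event {
    unsigned char key;  /* key code: the low seven bits of the scan code */
    int released;       /* 1 if the key was released, 0 if it was pressed */
};

/* Split a raw scan code into its key code and press/release flag. */
static struct key_event decode_scancode(unsigned char scancode)
{
    struct key_event ev;

    ev.key = scancode & 0x7F;              /* bits 0-6 */
    ev.released = (scancode & 0x80) != 0;  /* bit 7, i.e. the eighth bit */
    return ev;
}
```

With this layout, 0x1E and 0x9E decode to the same key code 0x1E, the first as a press and the second as a release.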

Example 12-1. intrpt.c

/*
 * intrpt.c - An interrupt handler.
 *
 * Copyright (C) 2001 by Peter Jay Salzman
 */

/*
 * The necessary header files
 */

/*
 * Standard in kernel modules
 */
#include <linux/kernel.h>     /* We're doing kernel work */
#include <linux/module.h>     /* Specifically, a module */
#include <linux/workqueue.h>  /* For the work queue */
#include <linux/interrupt.h>  /* We want an interrupt */
#include <asm/io.h>           /* For inb */

#define MY_WORK_QUEUE_NAME "WQsched.c"

static struct workqueue_struct *my_workqueue;

/*
 * This will get called by the kernel as soon as it's safe
 * to do everything normally allowed by kernel modules.
 */
static void got_char(void *scancode)
{
    printk(KERN_INFO "Scan Code %x %s.\n",
           (int)*((char *)scancode) & 0x7F,
           *((char *)scancode) & 0x80 ? "Released" : "Pressed");
}

/*
 * This function services keyboard interrupts. It reads the relevant
 * information from the keyboard and then puts the non time critical
 * part into the work queue. This will be run when the kernel considers it safe.
 */
irqreturn_t irq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
    /*
     * These variables are static because they need to be
     * accessible (through pointers) to the bottom half routine.
     */
    static int initialised = 0;
    static unsigned char scancode;
    static struct work_struct task;
    unsigned char status;

    /*
     * Read keyboard status, then the scan code
     */
    status = inb(0x64);
    scancode = inb(0x60);

    if (initialised == 0) {
        INIT_WORK(&task, got_char, &scancode);
        initialised = 1;
    } else {
        PREPARE_WORK(&task, got_char, &scancode);
    }

    queue_work(my_workqueue, &task);

    return IRQ_HANDLED;
}

/*
 * Initialize the module - register the IRQ handler
 */
int init_module()
{
    my_workqueue = create_workqueue(MY_WORK_QUEUE_NAME);

    /*
     * Since the keyboard handler won't co-exist with another handler,
     * such as us, we have to disable it (free its IRQ) before we do
     * anything. Since we don't know where it is, there's no way to
     * reinstate it later - so the computer will have to be rebooted
     * when we're done.
     */
    free_irq(1, NULL);

    /*
     * Request IRQ 1, the keyboard IRQ, to go to our irq_handler.
     * SA_SHIRQ means we're willing to have other handlers on this IRQ.
     * SA_INTERRUPT can be used to make the handler into a fast interrupt.
     */
    return request_irq(1,            /* The number of the keyboard IRQ on PCs */
                       irq_handler,  /* our handler */
                       SA_SHIRQ, "test_keyboard_irq_handler",
                       (void *)(irq_handler));
}

/*
 * Cleanup
 */
void cleanup_module()
{
    /*
     * This is only here for completeness. It's totally irrelevant, since
     * we don't have a way to restore the normal keyboard interrupt so the
     * computer is completely useless and has to be rebooted.
     */
    free_irq(1, NULL);
}

/*
 * some work_queue related functions are just available to GPL licensed Modules
 */
MODULE_LICENSE("GPL");

Chapter 13. Symmetric Multi Processing

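13.1. Symmetrical Multi-Processing

One of the easiest and cheapest ways to improve performance is to put a second CPU on the board (if your board supports it). This can be done either by making the different CPUs take on different jobs (asymmetrical multi-processing) or by making them all run in parallel, doing the same job (symmetrical multi-processing, SMP). Doing asymmetrical multi-processing effectively requires specialized knowledge about the tasks the computer will perform, which is unavailable in a general purpose operating system such as Linux. Symmetrical multi-processing, on the other hand, is relatively easy to implement.

By relatively easy I mean exactly that: not that it is really easy. In a symmetrical multi-processing environment the CPUs share the same memory, and as a result code running on one CPU can affect the memory used by another. You can no longer be certain that a variable you set to a certain value in one line still has that value in the next line; another CPU might have changed it while you weren't looking. Obviously, it is impossible to program like this.

In the case of process programming this normally isn't an issue, because a process normally runs on only one CPU at a time [19]. The kernel, on the other hand, can be called by different processes running on different CPUs at the same time.

In version 2.0.x this isn't a problem, because the entire kernel is inside one big spinlock [20]. This means that once one CPU is in kernel mode, no other CPU is allowed in until the first one is done. This makes Linux SMP safe [21], but inefficient.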
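To make the idea of a single lock around all kernel entry concrete, here is a user-space sketch built on C11 atomics. It is an illustration only and rests on assumptions: toy_spinlock, toy_spin_lock, toy_spin_trylock, toy_spin_unlock, whole_kernel_lock and enter_kernel_and_count are invented names, and the kernel's own spinlock implementation is different.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space spinlock; the kernel's own spinlock_t differs. */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once; returns true if the lock was acquired. */
static bool toy_spin_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Spin (busy-wait) until the lock is acquired. */
static void toy_spin_lock(toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ; /* another CPU holds the lock: keep trying */
}

/* Release the lock so that a waiting CPU can get in. */
static void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* One lock for everything, in the spirit of the 2.0.x design. */
static toy_spinlock whole_kernel_lock = TOY_SPINLOCK_INIT;
static long shared_counter = 0;

/* Every "kernel entry" takes the same lock, so updates never interleave. */
static void enter_kernel_and_count(void)
{
    toy_spin_lock(&whole_kernel_lock);
    shared_counter++;
    toy_spin_unlock(&whole_kernel_lock);
}
```

Because every caller goes through the same lock, the shared state stays consistent, but only one caller makes progress at a time, which is the safety-for-efficiency trade described above.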


Chapter 14. Common Pitfalls

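14.1. Common Pitfalls

Before I send you on your way into the world of kernel modules, there are a few things I need to warn you about. If I fail to warn you and something bad happens, please report the problem to me for a full refund of the amount you paid for the book.

Using standard libraries

You can't do that. In a kernel module you can only use kernel functions, which are the functions you can see in /proc/kallsyms.

Disabling interrupts

You might need to do this for a short time, and that is OK. But if you don't enable them again afterwards, your system will be stuck and you'll have to power it off.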



Appendix A. Changes: 2.0 To 2.2

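A.1. Changes between 2.0 and 2.2

A.1.1. Changes between 2.0 and 2.2

I don't know the entire kernel well enough to document all of the changes. In the course of converting the examples (or, more precisely, adopting Emmanuel Papirakis's changes) I came across the following differences. I list them all here to help module programmers, especially those who learned from previous versions of this book and are familiar with the techniques I use, convert to the new version.

More references on this subject are available on Richard Gooch's site.

asm/uaccess.h

If you need put_user or get_user, you have to #include it.

get_user

In version 2.2, get_user receives both the pointer into user memory and the variable in kernel memory to fill with the information. The reason for this is that get_user can now read two or four bytes at a time if the variable we read is two or four bytes long.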
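The way a single macro can read one, two or four bytes depending on its destination can be imitated in user space. This is a sketch under stated assumptions: toy_get_user is a hypothetical name, it copies between ordinary buffers rather than from user memory, and unlike the real get_user it evaluates to the number of bytes copied so that the size selection is visible.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for get_user: copies exactly sizeof(dst) bytes from
 * src_ptr into the variable dst, so the width of the read follows the type
 * of the destination. Evaluates to the number of bytes copied (an
 * illustrative choice, not the behaviour of the kernel macro). */
#define toy_get_user(dst, src_ptr) \
    (memcpy(&(dst), (src_ptr), sizeof(dst)), sizeof(dst))

/* Small demo: read the same source buffer at three different widths and
 * return the total number of bytes copied (1 + 2 + 4). */
static size_t toy_get_user_demo(const unsigned char *src)
{
    uint8_t  one;
    uint16_t two;
    uint32_t four;
    size_t total = 0;

    total += toy_get_user(one, src);
    total += toy_get_user(two, src);
    total += toy_get_user(four, src);
    (void)one; (void)two; (void)four; /* values unused in this demo */
    return total;
}
```

The caller never states a size: passing a two-byte variable yields a two-byte read and passing a four-byte variable yields a four-byte read.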



close in file_operations


read,write in file_operations

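The headers for these functions changed. They now return ssize_t instead of an integer, and their parameter list is different. The inode is no longer a parameter, and on the other hand the offset into the file is.


proc_register_dynamic

This function no longer exists. Instead, you call the regular proc_register and put zero in the inode field of the structure.


Signals

The signals in the task structure are no longer a 32 bit integer, but an array of _NSIG_WORDS integers.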
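Storing a signal set in an array of words instead of one 32-bit integer can be sketched as follows. The names and sizes are assumptions chosen for illustration: TOY_NSIG, TOY_NSIG_WORDS, toy_sigset, toy_sigaddset and toy_sigismember are hypothetical and only mirror the idea behind _NSIG_WORDS.

```c
#include <limits.h>

#define TOY_NSIG       64  /* assumed number of signals */
#define TOY_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define TOY_NSIG_WORDS ((TOY_NSIG + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

/* A signal set wide enough for TOY_NSIG signals, whatever the word size. */
typedef struct {
    unsigned long sig[TOY_NSIG_WORDS];
} toy_sigset;

/* Signals are numbered from 1, so signal n lives at bit n - 1. */
static void toy_sigaddset(toy_sigset *set, int signum)
{
    unsigned int bit = (unsigned int)signum - 1;

    set->sig[bit / TOY_WORD_BITS] |= 1UL << (bit % TOY_WORD_BITS);
}

/* Returns 1 if signum is in the set, 0 otherwise. */
static int toy_sigismember(const toy_sigset *set, int signum)
{
    unsigned int bit = (unsigned int)signum - 1;

    return (int)((set->sig[bit / TOY_WORD_BITS] >> (bit % TOY_WORD_BITS)) & 1UL);
}
```

Signal 40 does not fit in a 32-bit integer, but with the word array it is simply a bit in whichever word covers it.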


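queue_task_irq

Even if you want to schedule a task from inside an interrupt handler, you use queue_task, not queue_task_irq.

Module Parameters

You no longer just declare module parameters as global variables. In 2.2 you also have to use MODULE_PARM to declare them. This is an improvement, because it allows the module to receive string parameters that begin with a digit without getting confused.

Symmetrical Multi-Processing

The kernel is no longer inside one huge spinlock, which means that kernel modules have to be aware of SMP.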

Appendix B. Where To Go From Here

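B.1. Where From Here?

I could easily have squeezed a few more chapters into this book. I could have added a chapter about creating new file systems, or about adding new protocol stacks (as if there were a need for that; finding a network protocol not supported by Linux is already very hard). I could have added explanations of the kernel mechanisms we haven't touched upon, such as bootstrapping or disk storage.

However, I chose not to. My purpose in writing this book was to provide an initiation into the mysteries of kernel module programming and to teach the common techniques for that purpose. For people seriously interested in kernel programming, I recommend Juan-Mariano de Goyeneche's list of kernel resources. Also, as Linus himself said, the best way to learn the kernel is to read the source code yourself.

If you're interested in more examples of short kernel modules, I recommend Phrack magazine. Even if you're not interested in security, and as a programmer you should be, the kernel modules there are short enough that they don't take much effort to understand.

I hope I have helped you in your quest to become a better programmer, or at least to have fun through technology. And if you do write useful kernel modules, I hope you publish them under the GPL, so I can use them too.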





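In earlier versions of Linux, this was a daemon named kerneld.


If you are modifying the kernel, to avoid overwriting your existing modules you may want to use the EXTRAVERSION variable in the kernel Makefile to create a separate module directory.




I'm a physicist, not a computer scientist.




This is by convention. It is fine to put the device file in your own directory while you are working on a driver. Just make sure to place it in /dev for a production driver.


That was the practice in version 2.0. In version 2.2 this is done for us automatically if we set the inode to zero.


The difference between the two is that file operations deal with the file itself, while inode operations deal with ways of referencing the file, such as creating links to it.


Notice that here the roles of read and write are reversed again: in ioctl's, read is to send information to the kernel and write is to receive information from the kernel.


This isn't exact. You won't be able to pass a structure through an ioctl, for example, but you will be able to pass a pointer to the structure.


The easiest way to keep a file open is to open it with tail -f.


This means that the process is still in kernel mode: it has issued the open system call, and the system call hasn't returned yet. The process doesn't know that somebody else was using the CPU during that time.


This is because we used module_interruptible_sleep_on. We could have used module_sleep_on instead, but that would have resulted in extremely angry users whose Ctrl+c is ignored.


Teletype: originally a combination keyboard and printer used to communicate with a Unix system. Today it is an abstraction for the text stream used by a Unix or similar system, regardless of whether it is a physical terminal, an xterm on an X display, or a network connection used with telnet.






Translator's note on "bottom half", based on English material found through a web search:

The term "bottom half" usually comes up with device drivers that handle interrupts.

When the kernel receives an interrupt request, the corresponding device driver is called. Because nothing else can be handled during that time, it is very important that interrupt handling finishes quickly and the kernel returns to its normal work. This design idea led to the notions of a driver's "top half" and "bottom half": the top half is the part that runs first when the kernel calls the driver. It quickly does the small amount of work that cannot wait (such as operations that need exclusive access to the hardware or other resources and must happen immediately), and then arranges for the bottom half to do the remaining, less time-critical work.

When and how the bottom half runs is a matter of kernel design. You may hear that bottom halves have been abolished in recent kernels. That is not quite accurate: in newer kernels you can choose how the deferred work is executed, either as a softirq or tasklet, much as before, or by adding it to a work queue, which is more like starting a user process.


queue_task_irq is protected from this by a global lock. In version 2.2 there is no queue_task_irq, and queue_task is protected by a lock.




Translator's note on "spinlock": a kernel mechanism that locks critical kernel data structures to protect them from being corrupted.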

