epoll

概述

epoll 系列描述符用来监控多个文件描述符是否可用。epoll 系列相关函数包括：epoll_create、epoll_ctl、epoll_wait，其中：

epoll_create

创建一个 epoll 实例，并返回一个引用该实例的文件描述符。
epoll_ctl

通过 epoll_ctl 将感兴趣的文件描述符注册到 epoll 实例。
epoll_wait

等待 IO 事件发生，如果当前无事件则阻塞当前线程。

1
2
3

int epoll_create(int size)；// 创建一个 epoll 句柄，size 参数已被忽略
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)；
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

epoll_create

创建一个 epoll 的句柄，2.6.8 以后版本 size 被忽略。返回的 epoll 示例实际是一个文件描述符，在使用完毕时需要关闭。

epoll_ctl

1	int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)；

对 epoll 实例进行操作。op 参数可选值：

EPOLL_CTL_ADD：添加参数 fd 的监听事件
EPOLL_CTL_DEL：删除参数 fd 的监听事件
EPOLL_CTL_MOD：修改参数 fd 的监听事件

struct epoll_event 结构如下：

typedef union epoll_data {
    void        *ptr;
    int          fd;
    uint32_t     u32;
    uint64_t     u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;      /* Epoll events */
    epoll_data_t data;        /* User data variable */
};

结构体中 events 成员描述关注的事件类型掩码，取值范围如下：

EPOLLIN
表示对应的文件描述符可以读
EPOLLOUT

表示对应的文件描述符可以写
EPOLLPRI

表示对应的文件描述符有紧急的数据可读，一般是带外数据
EPOLLERR

表示对应的文件描述符发生错误
EPOLLET

将文件描述符设置为边沿触发模式
EPOLLEXCLUSIVE

Linux 4.5 内核添加，在操作系统内核级别解决了“惊群”问题

epoll_wait

1	int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

等待 epfd 关联的 poll 实例上发生事件。参数 events 用来从内核接收发生的事件，maxevents 设置最多返回的事件数目，timeout 为超时时间。与 epoll_wait 相似的函数 epoll_pwait 可以忽略信号产生，仅当事件发生或超时函数才返回。

边沿触发与水平触发

epoll 对于 fd 支持边沿触发（ET，edge trigger）和水平触发（LT，level trigger）两种模式。man 手册（man 7 epoll）有两者区别的讲解。假设有如下场景：

在一个线程 A 将管道描述符写关闭，将管道描述符（rfd）在 epoll 实例上注册，并调用 epoll_wait 等待事件发生。
另一个线程 B 将管道描述符的读关闭，在管道上写入 2kB 的数据。
在线程 A 中，rfd 作为已准备好事件被 epoll_wait 返回。
线程 A 读取 rfd 中的 1kB 数据。
对 epoll_wait 的调用已完成。

如果在第一步中添加 rfd 到 epoll 实例时设置触发方式为 EPOLLET 模式，在 5 步完成 epoll_wait 的调用后即使 rdf 中有可读数据，rfd 的事件也不会触发，此时请求被 hang（线程 B 在等待线程 A 的应答数据，线程 A 在继续等待请求数据）。这是因为 EPOLLET 模式仅当变化发生在文件描述符上时才会触发事件。因此在第 5 步中，线程 A 一直在等待已经存在与 rfd 缓冲区中的数据。在上面的示例中，第 2 步因为写数据导致 rfd 的可读事件触发，同时此事件在第 3 步被消费。由于第 4 步读操作并未将所有可读数据读取，因此在第 5 步完成对 epoll_wait 的调用后，有可能导致此缓冲区中的数据一直未被读取。

使用 EPOLLET 模式的应用程序应该使用非阻塞文件描述符，以避免在处理多个文件描述符是阻塞读或写数据。将 epoll 用作 EPOLLET 模式的建议如下：

使用非阻塞文件描述符
在进行 read 或 write 操作时等待 EAGAIN 错误发生才认为事件处理结束

使用示例

 #define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];

int listen_sock, conn_sock, nfds, epollfd;

/* Code to set up listening socket, 'listen_sock', (socket(), bind(), listen()) omitted */
// listen_sock 已设置为非阻塞套接字
epollfd = epoll_create1(0);
if (epollfd == -1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            conn_sock = accept(listen_sock, (struct sockaddr *) &addr, &addrlen);
            if (conn_sock == -1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET; // 边沿触发模式
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, &ev) == -1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            // 函数 do_use_fd（）使用新的就绪文件描述符，直到 read 或 write 返回 EAGAIN。 
            // 事件驱动的状态机应用程序应在收到 EAGAIN 后记录其当前状态，以便在下次调用 do_use_fd（）时，它将继续从之前停止的位置 read 或 write。
            do_use_fd(events[n].data.fd);
        }
    }
}

出于性能原因，使用边缘触发接口时可以通过指定（EPOLLIN | EPOLLOUT）在 epoll 接口 EPOLL_CTL_ADD 中添加文件描述符一次。这允许您使用 EPOLL_CTL_MOD 调用 epoll_ctl，避免在 EPOLLIN 和 EPOLLOUT 之间连续切换。

manpage QA

本节摘选自 man 7 epoll QA 部分：

What happens if you register the same file descriptor on an epoll instance twice?

You will probably get EEXIST. However, it is possible to add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) file descriptor to the same epoll instance. This can be a useful technique for filtering events, if the duplicate file descriptors are registered with different events masks.
Can two epoll instances wait for the same file descriptor? If so, are events reported to both epoll file descriptors?

Yes, and events would be reported to both. However, careful programming may be needed to do this correctly.
Is the epoll file descriptor itself poll/epoll/selectable?

Yes. If an epoll file descriptor has events waiting, then it will indicate as being readable.
What happens if one attempts to put an epoll file descriptor into its own file descriptor set?

The epoll_ctl(2) call will fail (EINVAL). However, you can add an epoll file descriptor inside another epoll file descriptor set.
Can I send an epoll file descriptor over a UNIX domain socket to another process?

Yes, but it does not make sense to do this, since the receiving process would not have copies of the file descriptors in the epoll set.
Will closing a file descriptor cause it to be removed from all epoll sets automatically?

Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?

They will be combined.
Does an operation on a file descriptor affect the already collected but not yet reported events?

You can do two operations on an existing file descriptor. Remove would be meaningless for this case. Modify will reread available I/O.
Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior) ?

Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation. You must consider it ready until the next (nonblocking) read/write yields EAGAIN. When and how you will use the file descriptor is entirely up to you.
For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is to continue to read/write until EAGAIN.
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)

zhou cheng jie

coder