How it started (June 2001)

This is my first humble attempt at hacking on the OpenBSD kernel. Since as of late we lack a packet filter, this looks like a good start. Even if in all probability this won't end up as a usable piece of code, it might prove to be instructive nonetheless.

Where to get the packets

The right place to interface with the kernel is netinet/ip_input.c and ip_output.c. I placed the hooks between the checksum calculation and the call to the lower level (ethernet) functions. Possibly they can/should be moved slightly, for instance to use the kernel's fragment reassembly.

Adding a pseudo device

A character type pseudo device /dev/pf looks like a good interface between the kernel code and userland. Using ioctl(), a userland tool can load filter rules into the kernel, query stats, etc.

Stateful filter and NAT

Currently, stateful filtering and NAT are implemented. The ruleset is identical to IPF, minus some options that are not yet implemented (keep frags, return-icmp, etc.).

A state basically consists of three address/port pairs: one for the external host, one for the gateway, and one for the internal machine that is being NATed (if it is a NATed connection; otherwise two of the pairs are identical).

Each stateful connection creates a state, and a pointer to it is inserted into two trees and one linked list. One tree is sorted on external and gateway address/port, the other on external and internal address/port. The entries in each tree are unique (no duplicate keys). When a packet comes in on the (external) interface, the external-gateway tree is searched. When a packet goes out, the external-internal tree is searched. NATed connections always have state, and the three address/port pairs are used to translate addresses; there is no other NAT mapping table of any kind. Tree searches (the primary operation, which occurs for every packet) are O(log n), as are inserts and removals (which occur only once per connection).

The linked list is used to traverse all states when expired states are purged. This is an O(n) operation, but it occurs at most once every 10 seconds (not for every packet). A sorted container doesn't make sense here: the expire time of a state (the key) usually changes with each packet.

The design handles large numbers of states (concurrent connections) decently. Right now, the memory usage of the states is the limiting factor. There is much to be optimized (whole bytes are wasted to hold single bits, etc.).

Here's a sample filter rule file and nat rule file.

Note that while protocols and ports can be given as names (translated through /etc/protocols and /etc/services), icmp-type and code must be numeric. The parser doesn't do much error checking yet, so watch the parser output to verify you get the rules that you meant.

The filter rules are now applied after NAT takes place (both in and out), which is the same behaviour as IPF.

Missing parts, open questions

keep frags

Currently, fragments are handled cheaply. The first fragment, if it contains the complete IP and TCP/UDP/ICMP headers, is filtered according to the rules. Any fragments with higher offsets (which don't contain header data) are passed without checks. If the headers themselves are fragmented (this almost certainly is an attack; I don't know of any real-life MTU less than 64), those packets are dropped. This, of course, is not "keep frags". But it's cheap, memory-wise, for the filter. And with the first fragment being filtered, I can't see how an attacker gains anything from the higher fragments being passed unfiltered: missing the first fragment, the destination host will be unable to reassemble them. If someone can show me how this is dangerous and worth the time and memory of a fragment cache on the filter, I'm happy to listen. Consider that if legitimate connections get fragmented, the cache would require large amounts of memory, especially for high-speed connections (remember that fragments can arrive in any order); I can easily imagine several dozen megabytes.

Please let me know if you have some ideas or know of a good reference for these tasks.


Here's the code so far, based on 2.9-current, relative to /usr/src/sys/. It's possible to use the code on -stable (which still has IPF installed), but care must be taken to disable IPF correctly. If you want this, ask me. It's a pain to keep even one set of patches relatively up to date; I stopped doing it for -stable.

Note: This is a historical snapshot from June 26th 2001, the last version prior to the import into the OpenBSD source tree. pfm was later rewritten using yacc and renamed to pfctl. The entire source, including kernel (with an AVL tree implementation) and userland, was only 3118 lines of text. It has grown a bit since.

All further changes to the source code are documented in the OpenBSD CVS tree, which you can browse using cvsweb:

Latest changes (newest at the top)