Firewall Ruleset Optimization

Goals

Ideally, the operation of a packet filter should not affect legitimate network traffic. Packets violating the filtering policy should be blocked, and compliant packets should pass the device as if the device wasn't there at all.

In reality, several factors limit how well a packet filter can achieve that goal. Packets have to pass through the device, adding some amount of latency between the time a packet is received and the time it is forwarded. Any device can only process a finite number of packets per second. When packets arrive at a higher rate than the device can forward them, packets are lost.

Most protocols, like TCP, deal well with added latency. You can achieve high TCP transfer rates even over links that have several hundred milliseconds of latency. On the other hand, in interactive network gaming even a few tens of milliseconds are usually perceived as too much. Packet loss is generally a worse problem: TCP performance will seriously deteriorate when a significant number of packets are lost.

This article explains how to identify when pf is becoming the limiting factor in network throughput and what can be done to improve performance in this case.

The significance of packet rate

One commonly used unit to compare network performance is throughput in bytes per second. But this unit is completely inadequate to measure pf performance. The real limiting factor isn't throughput but packet rate, that is, the number of packets per second the host can process. The same host that handles 100Mbps of 1500-byte packets without breaking a sweat can be brought to its knees by a mere 10Mbps of 40-byte packets. The former amounts to only about 8,000 packets/second, but the latter traffic stream amounts to about 32,000 packets/second, which causes roughly four times the amount of work for the host.
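
The arithmetic behind these figures can be checked with a short sketch. The function name is made up for this illustration, and link-layer framing overhead is ignored, so the results are approximations.

```python
def packets_per_second(bits_per_second: float, packet_bytes: int) -> float:
    """Packet rate of a stream of equally sized packets (framing overhead ignored)."""
    return bits_per_second / (packet_bytes * 8)

large = packets_per_second(100e6, 1500)  # 100 Mbps of 1500-byte packets
small = packets_per_second(10e6, 40)     # 10 Mbps of 40-byte packets

print(f"100 Mbps @ 1500 bytes: {large:,.0f} packets/s")
print(f" 10 Mbps @   40 bytes: {small:,.0f} packets/s")
print(f"ratio: {small / large:.2f}x")
```

The exact values are about 8,333 and 31,250 packets per second, a ratio of 3.75, which the text rounds to 8,000, 32,000 and "roughly four times".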

To understand this, let's look at how packets actually pass through the host. Packets are received from the wire by the network interface card (NIC) and read into a small memory buffer on the NIC. When that buffer is full, the NIC triggers a hardware interrupt, causing the NIC driver to copy the packets into network memory buffers (mbufs) in kernel memory. The packets are then passed through the TCP/IP stack in the form of these mbufs. Once a packet is transferred into an mbuf, most operations the TCP/IP stack performs on the packet are not dependent on the packet size, as these operations only inspect the packet headers and not the payload. This is also true for pf, which gets passed one packet at a time and makes the decision of whether to block it or pass it on. If the packet needs forwarding, the TCP/IP stack will pass it to a NIC driver, which will extract the packet from the mbuf and put it back onto the wire.

Most of these operations have a comparatively high cost per packet, but a very low cost per size of the packet. Hence, processing a large packet is only slightly more expensive than processing a small packet.

Some limits are based on hardware and software outside of pf. For instance, i386-class machines are not able to handle much more than 10,000 interrupts per second, no matter how fast the CPU is, due to architectural constraints. Some network interface cards will generate one interrupt for each packet received. Hence, the host will start to lose packets when the packet rate exceeds around 10,000 packets per second. Other NICs, like more expensive gigabit cards, have larger built-in memory buffers that allow them to bundle several packets into one interrupt. Hence, the choice of hardware can impose some limits that no optimization of pf can surpass.

When pf is the bottleneck

The kernel passes packets to pf sequentially, one after the other. While pf is being called to decide the fate of one packet, the flow of packets through the kernel is briefly suspended. During that short period of time, further packets read off the wire by NICs have to fit into memory buffers. If pf evaluations take too long, packets will quickly fill up the buffers, and further packets will be lost. The goal of optimizing pf rulesets is to reduce the amount of time pf spends for each packet.

An interesting exercise is to intentionally push the host into this overloaded state by loading a very large ruleset like this:

  $ i=0; while [ $i -lt 100 ]; do \
      printf "block from any to %d.%d.%d.%d\n" \
        `jot -r -s " " 4 1 255`; \
      let i=i+1; \
    done | pfctl -vf -

  block drop inet from any to 151.153.227.25
  block drop inet from any to 54.186.19.95
  block drop inet from any to 165.143.57.178
  ...

This represents a worst-case ruleset that defies all automatic optimizations. Because each rule contains a different random non-matching address, pf is forced to traverse the entire ruleset and evaluate each rule for every packet. Loading a ruleset that solely consists of thousands of such rules, and then generating a steady flow of packets that must be filtered, inflicts noticeable load on even the fastest machine. While the host is under load, check the interrupt rate with:

  $ vmstat -i

And watch CPU states with:

  $ top

This will give you an idea of how the host reacts to overloading, and will help you spot similar symptoms when using your own ruleset. You can use the same tools to verify effects of optimizations later on.

Then try the other extreme. Completely disable pf like:

  $ pfctl -d

Then compare the vmstat and top values.

This is a simple way to get a rough estimate and upper limit on what to realistically expect from optimization. If your host handles your traffic with pf disabled, you can aim to achieve similar performance with pf enabled. However, if the host already shows problems handling the traffic with pf disabled, optimizing pf rulesets is probably pointless, and other components should be changed first.

If you already have a working ruleset and are wondering whether you should spend time on optimizing it for speed, repeat this test with your ruleset and compare the results with both extreme cases. If running your ruleset shows effects of overloading, you can use the guidelines below to reduce those effects.

In some cases, the ruleset shows no significant amount of load on the host, yet connections through the host show unexpected problems, like delays during connection establishment, stalling connections or disappointingly low throughput. In most of these cases, the problem is not filtering performance at all, but a misconfiguration of the ruleset which causes packets to get dropped. See Testing Your Firewall for how to identify and deal with such problems.

And finally, if your ruleset is evaluated without causing significant load and everything works as expected, the most reasonable conclusion is to leave the ruleset as it is. Often, rulesets written in a straightforward approach without respect for performance are evaluated efficiently enough to cause no packet loss. Manual optimizations will only make the ruleset harder to read for the human maintainer, while having only insignificant effect on performance.

Filter statefully

The amount of work done by pf mainly consists of two kinds of operations: ruleset evaluations and state table lookups.

For every packet, pf first does a state table lookup. If a matching state entry is found in the state table, the packet is immediately passed. Otherwise pf evaluates the filter ruleset to find the last matching rule for the packet which decides whether to block or pass it. If the rule passes the packet, it can optionally create a state entry using the 'keep state' option.

When filtering statelessly, without using 'keep state' to create state entries for connections, every packet causes an evaluation of the ruleset, and ruleset evaluation is the single most costly operation pf performs in this scenario. Each packet still causes a state table lookup, but since the table is empty, the cost of the lookup is basically zero.

Filtering statefully means using 'keep state' in filter rules, so packets matching those rules will create a state table entry. Further packets related to the same connections will match the state table entries and get passed automatically, without evaluations of the ruleset. In this scenario, only the first packet of each connection causes a ruleset evaluation, and subsequent packets only cause a state lookup.

Now, a state lookup is much cheaper than a ruleset evaluation. A ruleset is basically a list of rules which must be evaluated from top to bottom. The cost increases with every rule in the list: twice as many rules mean twice the amount of work. And evaluating a single rule can cause comparison of numerous values in the packet. The state table, on the other hand, is a tree. The cost of lookup increases only logarithmically with the number of entries: twice as many states mean only one additional comparison, a fraction of additional work. And comparison is needed only for a limited number of values in the packet.
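
The difference in growth can be made concrete with a rough cost model. This is only a sketch: it assumes an idealized, perfectly balanced binary tree and counts one step per rule or per tree level, and both function names are invented for the illustration.

```python
import math

def ruleset_cost(num_rules: int) -> int:
    """Worst case for a linear list: every rule is visited once."""
    return num_rules

def state_lookup_cost(num_states: int) -> int:
    """Comparisons along one root-to-leaf path of a balanced binary tree."""
    return max(1, math.ceil(math.log2(num_states + 1)))

for n in (100, 200, 10_000, 20_000):
    print(f"{n:>6} entries: list {ruleset_cost(n):>6} steps, "
          f"tree {state_lookup_cost(n):>2} steps")
```

In this model, doubling the list doubles the steps, while doubling the tree adds a single step.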

There is some cost to creating and removing state entries. But assuming the state will match several subsequent packets and saves ruleset evaluation for them, the sum is much cheaper. For specific connections like DNS lookups, where each connection only consists of two packets (one request and one reply), the overhead of state creation might be worse than two ruleset evaluations. Connections that consist of more than a handful of packets, like most TCP connections, will benefit from the created state entry.

In short, you can make ruleset evaluation a per-connection cost instead of a per-packet cost. This can easily make a difference of a factor of 100 or more. For example, I see the following counters when I run:

  $ pfctl -si

  State Table                          Total             Rate
    searches                       172507978          887.4/s
    inserts                          1099936            5.7/s
    removals                         1099897            5.7/s
  Counters
    match                            6786911           34.9/s

This means pf gets called about 900 times per second. I'm filtering on multiple interfaces, so that would mean I'm forwarding about 450 packets per second, each of which gets filtered twice, once on each interface it passes through. But ruleset evaluation occurs only about 35 times per second, and state insertions and deletions only 6 times per second. With anything but a tiny ruleset, this is very well worth it.
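
The ratios can be derived from the totals rather than the rates. The following sketch parses the sample output shown above, pasted in as a string. The parser is a hypothetical helper that only understands lines of the form "name total rate/s"; it is not a general parser for pfctl output.

```python
SAMPLE = """\
State Table                          Total             Rate
  searches                       172507978          887.4/s
  inserts                          1099936            5.7/s
  removals                         1099897            5.7/s
Counters
  match                            6786911           34.9/s
"""

def parse_totals(text: str) -> dict:
    """Map counter name -> total, for lines of the form 'name  total  rate/s'."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1].isdigit() and parts[2].endswith("/s"):
            totals[parts[0]] = int(parts[1])
    return totals

totals = parse_totals(SAMPLE)
per_eval = totals["searches"] / totals["match"]
per_state = totals["searches"] / totals["inserts"]
print(f"state lookups per ruleset evaluation: {per_eval:.1f}")
print(f"state lookups per state created:      {per_state:.1f}")
```

For these counters, roughly 25 state lookups happen for every ruleset evaluation, and each state entry serves well over a hundred lookups on average.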

To make sure that you're really creating state for each connection, search for 'pass' rules which don't use 'keep state', like in:

  $ pfctl -sr | grep pass | grep -v 'keep state'

Make sure you have a tight 'block by default' policy, as otherwise packets might pass not only due to explicit 'pass' rules, but also because they mismatch all rules and pass by default.

The downside of stateful filtering

The only downside to stateful filtering is that state table entries need memory, around 256 bytes for each entry. When pf fails to allocate memory for a new state entry, it blocks the packet that should have created the state entry instead, and increases an out-of-memory counter shown by:

  $ pfctl -si
  Counters
    memory                                 0            0.0/s

Memory for state entries is allocated from the kernel memory pool called 'pfstatepl'. You can use vmstat(8) to show various aspects of pool memory usage:

  $ vmstat -m
  Memory resource pool statistics
  Name        Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
  pfstatepl    256  1105099    0  1105062   183   114    69   127     0   625   62

The difference between 'Requests' and 'Releases' equals the number of currently allocated state table entries, which should match the counter shown by:

  $ pfctl -si
  State Table                          Total             Rate
    current entries                       36

Other counters shown by pfctl can get reset by pfctl -Fi.

Not all memory of the host is available to the kernel, and the way the amount of physical RAM affects the amount available to the kernel depends on architecture and kernel options and version. As of OpenBSD 3.6, an i386 kernel can use up to 256MB of memory. Prior to 3.6, that limit was much lower for i386. You could have 8GB of RAM in your host, and still pf would fail to allocate memory beyond a small fraction of that amount.

To make matters worse, when pf really hits the limit where pool_get(9) fails, the failure is not as graceful as one might wish. Instead, the entire system becomes unstable after that point, and eventually crashes. This really isn't pf's fault, but a general problem with kernel pool memory management.

To address this, pf itself limits the number of state entries it will allocate at the same time, using pool_sethardlimit(9), also shown by vmstat -m output. The default for this limit is 10,000 entries, which is safe for any host. The limit can be printed with:

  $ pfctl -sm
  states     hard limit  10000
  src-nodes  hard limit  10000
  frags      hard limit    500

If you need more concurrent state entries, you can increase the limit in pf.conf, for example to twice the default:

  set limit states 20000

The problem is determining a large value that is still safe enough not to trigger a pool allocation failure. This is still a sore topic, as there is no simple formula to calculate the value. Basically, you have to increase the limit and verify the host remains stable after reaching that limit, by artificially creating many entries.

On the bright side, if you have 512MB or more of RAM, you can now use 256MB for the kernel, which should be safe for at least 500,000 state entries. And most people consider that a lot of concurrent connections. Just imagine each of those connections generating just one packet every ten seconds, and you end up with a packet rate of 50,000 packets/s.
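
A back-of-the-envelope sketch of these numbers follows. It assumes the approximate entry size of 256 bytes mentioned earlier and ignores any other per-state overhead; the function names are illustrative only.

```python
STATE_BYTES = 256  # approximate size of one state entry, as stated above

def state_memory_mib(num_states: int) -> float:
    """Pool memory needed for the given number of state entries, in MiB."""
    return num_states * STATE_BYTES / 2**20

def packet_rate(num_states: int, seconds_between_packets: float) -> float:
    """Aggregate packets/s if every state sees one packet per interval."""
    return num_states / seconds_between_packets

for n in (10_000, 500_000):
    print(f"{n:>7} states: {state_memory_mib(n):6.1f} MiB")
print(f"{packet_rate(500_000, 10):,.0f} packets/s")
```

Under these assumptions, 500,000 entries need about 122 MiB, which fits into a 256MB kernel limit with room to spare.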

More likely, you don't expect that many states at all. But whatever your state limit is, there are cases where it will be reached, like during a denial-of-service attack. Remember, pf will fail closed, not open, when state creation fails. An attacker could create state entries until the limit is reached, just for the purpose of denying service to legitimate users.

There are several ways to deal with this problem.

You can limit the number of states created from specific rules, for instance like:

  pass in from any to $ext_if port www keep state (max 256)

This would limit the number of concurrent connections to the web server to 256, while other rules could still create state entries. Similarly, the maximum number of connections per source address can be restricted with:

  pass keep state (source-track rule, max-src-states 16)

Once a state entry is created, various timeouts define when it is removed. For instance:

  $ pfctl -st
  tcp.opening                  30s

The timeout for TCP states that are not yet fully established is set to 30 seconds. These timeouts can be lowered to remove state entries more aggressively. Individual timeout values can be set globally in pf.conf:

  set timeout tcp.opening 20

They can also be set in individual rules, applying only to states created by these rules:

  pass keep state (tcp.opening 10)

There are several pre-defined sets of global timeouts which can be selected in pf.conf:

  set optimization aggressive

Also, there are adaptive timeouts, which means these timeouts are not constants, but variables which adjust to the number of state entries allocated. For instance:

  set timeout { adaptive.start 6000, adaptive.end 12000 }

pf will use constant timeout values as long as there are fewer than 6,000 state entries. When there are between 6,000 and 12,000 entries, all timeout values are linearly scaled from 100% at 6,000 to 0% at 12,000 entries, i.e. with 9,000 entries all timeout values are reduced to 50%.
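
The linear scaling described here can be written as a small function. This is a sketch of the rule as stated in the text, not pf's kernel code, and the function name and defaults are chosen for the example.

```python
def adaptive_timeout(base_seconds: float, states: int,
                     start: int = 6000, end: int = 12000) -> float:
    """Scale a timeout linearly from 100% at `start` states to 0% at `end`."""
    if states <= start:
        return base_seconds
    if states >= end:
        return 0.0
    return base_seconds * (end - states) / (end - start)

for n in (3000, 6000, 9000, 11000, 12000):
    print(f"{n:>6} states: tcp.opening = {adaptive_timeout(30, n):4.1f}s")
```

With the default tcp.opening of 30 seconds, 9,000 state entries yield an effective timeout of 15 seconds.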

In summary, you probably can specify a number of maximum states you expect to support. Set this as the limit for pf. Expect the limit to get reached during certain attacks, and define a timeout strategy for this case. In the worst case, pf will drop packets when state insertion fails, and the out-of-memory counter will increase.

Ruleset evaluation

A ruleset is a linear list of individual rules, which are evaluated from top to bottom for a given packet. Each rule either does or does not match the packet, depending on the criteria in the rule and the corresponding values in the packet.

Therefore, to a first approximation, the cost of ruleset evaluation grows with the number of rules in the ruleset. This is not precisely true for reasons we'll get into soon, but the general concept is correct. A ruleset with 10,000 rules will almost certainly cause much more load on your host than one with just 100 rules. The most obvious optimization is to reduce the number of rules.

Ordering rulesets to maximize skip steps

The first reason why ruleset evaluation can be cheaper than evaluating each individual rule in the ruleset is called skip steps. This is a transparent and automatic optimization done by pf when the ruleset is loaded. It's best explained with an example. Imagine you have the following simple ruleset:

  1. block in all
  2. pass in on fxp0 proto tcp from any to 10.1.2.3 port 22 keep state
  3. pass in on fxp0 proto tcp from any to 10.1.2.3 port 25 keep state
  4. pass in on fxp0 proto tcp from any to 10.1.2.3 port 80 keep state
  5. pass in on fxp0 proto tcp from any to 10.2.3.4 port 80 keep state

A TCP packet arrives in on fxp0 to destination address 10.2.3.4 on some port.

pf will start the ruleset evaluation for this packet with the first rule, which fully matches. Evaluation continues with the second rule, which matches the criteria 'in', 'on fxp0', 'proto tcp', 'from any' but doesn't match 'to 10.1.2.3'. So the rule does not match, and evaluation should continue with the third rule.

But pf is aware that the third and fourth rules also specify the same criterion 'to 10.1.2.3' which caused the second rule to mismatch. Hence, pf can be absolutely certain that the third and fourth rules cannot possibly match this packet either, and it immediately jumps to the fifth rule, saving several comparisons.

Imagine the packet under inspection was UDP instead of TCP. The first rule would have matched, evaluation would have continued with the second rule. There, the criterion 'proto tcp' would have made the rule mismatch the packet. Since the subsequent rules also specify the same criterion 'proto tcp' which was found to mismatch the packet, all of them could be safely skipped, without affecting the outcome of the evaluation.

Here's how pf analyzes your ruleset when you load it. Each rule can contain a list of criteria like 'to 10.1.2.3', restricting the rule to match packets with that destination address. For each criterion in each rule, pf counts the number of rules immediately below that rule which specify the exact same criterion. This can be zero, when the next rule does not use the exact same criterion. Or it can be any number up to the number of remaining rules, when they all specify it. The counted numbers are stored in memory for later use. They're called skip steps because they tell pf how many subsequent steps (rules) can be skipped when any criterion in any rule is found to not match the packet being inspected.

Rule evaluation compares the criteria in the rule against the values in the packet in a fixed order:

  1. interface ('on fxp0')
  2. direction ('in', 'out')
  3. address family ('inet' or 'inet6')
  4. protocol ('proto tcp')
  5. source address ('from 10.1.2.3')
  6. source port ('from port < 1024')
  7. destination address ('to 10.2.3.4')
  8. destination port ('to port 80')

If the rule completely matches, evaluation continues on the very next rule. If the rule does not match, the first criterion from the list above which mismatches decides which skip step is used. There might be more than one criterion which mismatches, but only the first one, in the order of the list above, matters.
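
The mechanism can be modelled in a few lines. The following is a simplified sketch, not pf's implementation: rules are plain dictionaries, a missing key means 'any', addresses are compared as strings, and all names are invented for the illustration. It replays the five-rule example from the beginning of this section, with rule indices starting at 0.

```python
# Criteria in the fixed order pf compares them (see the list above).
FIELDS = ("iface", "dir", "af", "proto", "src", "sport", "dst", "dport")

def skip_steps(rules):
    """For each rule and criterion, the index of the next rule whose value differs."""
    table = []
    for i, rule in enumerate(rules):
        row = {}
        for field in FIELDS:
            j = i + 1
            while j < len(rules) and rules[j].get(field) == rule.get(field):
                j += 1
            row[field] = j
        table.append(row)
    return table

def evaluate(rules, packet, use_skip_steps=True):
    """Return (index of last matching rule, number of rules evaluated)."""
    steps = skip_steps(rules)
    i, evaluated, last_match = 0, 0, None
    while i < len(rules):
        rule = rules[i]
        evaluated += 1
        # First criterion, in FIELDS order, that the packet fails to match.
        miss = next((f for f in FIELDS if f in rule and rule[f] != packet[f]), None)
        if miss is None:
            last_match = i
            i += 1
        elif use_skip_steps:
            i = steps[i][miss]
        else:
            i += 1
    return last_match, evaluated

def tcp_rule(dst, dport):
    return {"iface": "fxp0", "dir": "in", "proto": "tcp", "dst": dst, "dport": dport}

RULES = [
    {"dir": "in"},             # 1. block in all
    tcp_rule("10.1.2.3", 22),  # 2.
    tcp_rule("10.1.2.3", 25),  # 3.
    tcp_rule("10.1.2.3", 80),  # 4.
    tcp_rule("10.2.3.4", 80),  # 5.
]

tcp_packet = {"iface": "fxp0", "dir": "in", "proto": "tcp",
              "dst": "10.2.3.4", "dport": 80}
udp_packet = dict(tcp_packet, proto="udp", dport=53)

print(evaluate(RULES, tcp_packet))                        # with skip steps
print(evaluate(RULES, tcp_packet, use_skip_steps=False))  # plain linear scan
print(evaluate(RULES, udp_packet))
```

In this model the TCP packet is decided after three rule evaluations instead of five, and the UDP packet after two, while the last matching rule is the same with and without skip steps.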

Obviously, the order of rules in your ruleset affects the skip step values calculated for each rule. For instance:

  1. pass on fxp0
  2. pass on fxp1
  3. pass on fxp0
  4. pass on fxp1

This ruleset will produce skip steps with value zero for the interface criterion in each rule, because no adjacent rules contain the same interface criterion.

Those rules could instead be ordered like:

  1. pass on fxp0
  2. pass on fxp0
  3. pass on fxp1
  4. pass on fxp1

The skip step value for the interface criterion would then equal one in the first and third rule.

This makes a small difference when the ruleset is evaluated for a packet on fxp2. Before the reordering, all four rules are evaluated because none of them can be skipped. After the reordering, only rules one and three need to be evaluated, and rules two and four can be skipped.

The difference may be insignificant in this little example, but imagine a ruleset containing 1,000 rules which all apply only to two different interfaces. If you order these rules so all rules applying to one interface are adjacent, followed by the rules applying to the other interface, pf can reliably skip 500 rules in each and every evaluation of the ruleset, reducing the cost of ruleset evaluation to 50%, no matter what packets your traffic consists of.
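
A small simulation illustrates the effect of grouping. It is a sketch under strong simplifications: every rule specifies only an interface, and the skip is computed on the fly instead of being precomputed at load time as pf does. The function and variable names are hypothetical.

```python
def evaluations(ifaces, packet_iface):
    """Count rules visited when each rule only specifies an interface."""
    n = len(ifaces)
    i = visited = 0
    while i < n:
        visited += 1
        if ifaces[i] == packet_iface:
            i += 1  # rule matches, continue with the very next rule
        else:
            rule_iface = ifaces[i]
            while i < n and ifaces[i] == rule_iface:
                i += 1  # skip step: jump over adjacent rules on the same interface
    return visited

interleaved = ["fxp0", "fxp1"] * 500
grouped = ["fxp0"] * 500 + ["fxp1"] * 500

for name, ruleset in (("interleaved", interleaved), ("grouped", grouped)):
    print(name, [evaluations(ruleset, nic) for nic in ("fxp0", "fxp1", "fxp2")])
```

In this model the interleaved ruleset always costs 1,000 evaluations, while the grouped one costs 501 for either interface (the 500 matching rules plus one mismatching rule that triggers the skip) and only 2 for a packet on a third interface.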

Hence, you can help pf maximize its skip steps by ordering your rules by the criteria in the order they are listed above, i.e. order your rules by interface first. Within the block of rules for the same interface, order rules by direction. Within the block for the same interface and direction, order by address family, etc.

To verify the effects, run

  $ pfctl -gsr

pfctl prints the calculated skip step values for each criterion in each rule, for instance

  @18 block return-rst in quick on kue0 proto tcp from any to any port = 1433
  [ Skip steps: i=38 d=38 f=41 p=27 sa=48 sp=end da=43 ]

In this output, 'i' stands for interface, 'd' for direction, 'f' for address family, etc. The 'i=38' part means that packets which don't match 'on kue0' will skip to rule number 38.

This also affects the number of evaluations counted for each rule. Try:

  $ pfctl -vsr

pfctl counts how many times each rule has been evaluated, how many packets and bytes it matched and how many states it created. When a rule is skipped by skip steps during evaluation, its evaluation counter is not increased.

Use tables for address lists

The use of lists in curly braces allows you to write very compact rules in pf.conf, like:

  pass proto tcp to { 10.1.2.3, 10.2.3.4 } port { ssh, www }

But these lists are not actually loaded into a single rule in the kernel. Instead, pfctl expands the single input rule to multiple rules for the kernel, in this case

  $ echo "pass proto tcp to { 10.1.2.3, 10.2.3.4 } port { ssh, www }" |
      pfctl -nvf -
  pass inet proto tcp from any to 10.1.2.3 port = ssh keep state
  pass inet proto tcp from any to 10.1.2.3 port = www keep state
  pass inet proto tcp from any to 10.2.3.4 port = ssh keep state
  pass inet proto tcp from any to 10.2.3.4 port = www keep state

The short syntax in pf.conf hides the real cost of evaluating it. Your pf.conf might be only a dozen rules long, but if those expand to hundreds of rules in kernel, evaluation cost is the same as if you put those hundreds of rules in pf.conf in the first place. To see what rules are really being evaluated, check:

  $ pfctl -sr
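
The growth caused by brace lists is multiplicative, which a short sketch makes visible. It only mimics the expansion of one address list and one port list for the rule shown above; the helper name is made up, and pfctl's real expansion handles many more cases.

```python
from itertools import product

def expand(addresses, ports):
    """One rule per (address, port) combination, addresses varying slowest."""
    return [f"pass inet proto tcp from any to {a} port = {p} keep state"
            for a, p in product(addresses, ports)]

rules = expand(["10.1.2.3", "10.2.3.4"], ["ssh", "www"])
for rule in rules:
    print(rule)

# Ten addresses and ten ports in one short line already yield 100 rules.
many = expand([f"10.0.0.{i}" for i in range(1, 11)], range(8001, 8011))
print(len(many))
```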

For one specific type of list, addresses, there is a container in kernel, called 'table'. For example:

  pass in from { 10.1.2.3, 10.2.3.4, 10.3.4.5 }

The list of addresses can be expressed as a table:

  table <clients> const { 10.1.2.3, 10.2.3.4, 10.3.4.5 }
  pass in from <clients>

This construct can be loaded as a single rule (and a table) into the kernel, whereas the non-table version would expand to three rules.

During evaluation of the rule referencing the table, pf will do a lookup of the packet's source address in the table to determine whether the rule matches the packet. This lookup is very cheap, and the cost does not increase with the number of entries in the table.

If the list of addresses is large, the performance gain of one rule evaluation with one table lookup vs. one rule evaluation for each address is significant. As a rule of thumb, tables are cheaper when the list contains six or more addresses. For a list of 1,000 addresses, the difference will be a factor of 1,000.

Use quick to abort ruleset evaluation when rules match

When a rule does match, pf (unlike other packet filtering products) does not by default abort ruleset evaluation, but continues until all rules have been evaluated. When the end is reached, the last rule that matched (the last-matching rule) makes the decision.

The option 'quick' can be used in rules to make them abort ruleset evaluation when they match. When 'quick' is used on every single rule, pf's behaviour effectively becomes first-matching, but that's not the default.
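
The two behaviours can be contrasted in a toy model. This sketch represents each rule as a tuple of action, quick flag and a match predicate; those representations and names are assumptions made for the example, not pf data structures.

```python
def decide(rules, packet):
    """Return (action, rules_evaluated). Each rule is (action, quick, predicate)."""
    action = "pass"  # implicit default when no rule matches
    evaluated = 0
    for rule_action, quick, matches in rules:
        evaluated += 1
        if matches(packet):
            action = rule_action
            if quick:
                break  # 'quick' aborts evaluation at the first match
    return action, evaluated

last_match_rules = [
    ("block", False, lambda p: True),
    ("pass", False, lambda p: p["port"] == 22),
    ("pass", False, lambda p: p["port"] == 80),
]
quick_rules = [
    ("pass", True, lambda p: p["port"] == 22),
    ("pass", True, lambda p: p["port"] == 80),
    ("block", True, lambda p: True),
]

for port in (22, 25):
    print(port, decide(last_match_rules, {"port": port}),
          decide(quick_rules, {"port": port}))
```

Both rulesets reach the same decisions, but the 'quick' variant stops after one evaluation for port 22, whereas the last-matching variant always walks all three rules.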

For instance, pf filters packets passing through any interface, including virtual interfaces such as loopback. If, like most people, you don't intend to filter loopback traffic, a line like the following at the top can save a lot of rule evaluations:

  set skip on { lo0 }

The ruleset might contain hundreds of rules all mismatching the loopback interface, and loopback traffic might just pass by the implicit default pass. The difference is between evaluating these hundreds of rules for every loopback packet and evaluating none of them.

Usually, you'd place a rule with 'quick' at the top of the ruleset, reasoning that it has the potential of matching and saving the evaluation of the rules further down. But in those cases where the rule does not match a packet, placement of the rule at the top has caused one more evaluation. In short, the frequency with which a rule is expected to match on average is also relevant when deciding placement within the ruleset for performance reasons. And the frequency with which it does match depends on your actual traffic.

Instead of guessing how likely a rule should match on average, you can use the rule evaluation and matching counters that are printed by:

  $ pfctl -vsr

When you see a rule near the top that is evaluated a lot but rarely matches, you can move it further down in the ruleset.

Anchors with conditional evaluation

An anchor is basically a ruleset separate from the main ruleset, or a sub-ruleset. You can load entire rulesets into anchors, and cause them to get evaluated from the main ruleset.

Another way to look at them is to compare filtering rules with a programming language. Without anchors, all your code is in a single main function, the main ruleset. Anchors, then, are just subroutines, code in separate functions that you can call from the main function.

As of OpenBSD 3.6, you can also nest anchors within anchors, building a hierarchy of subroutines, and call one subroutine from another. In OpenBSD 3.5 and before, the hierarchy could only be one level deep, that is, you could have multiple subroutines, but could call subroutines only from the main ruleset.

For instance:

  pass in proto tcp from 10.1.2.3 to 10.2.3.4 port www
  pass in proto udp from 10.1.2.3 to 10.2.3.4
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port www
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port ssh
  pass in proto udp from 10.1.2.4 to 10.2.3.5
  pass in proto tcp from 10.1.2.5 to 10.2.3.6 port www
  pass in proto udp from 10.1.2.5 to 10.2.3.6
  pass in proto tcp from 10.1.2.6 to 10.2.3.7 port www

You could split the ruleset into two sub-rulesets, one for UDP called "udp-only":

  pass in proto udp from 10.1.2.3 to 10.2.3.4
  pass in proto udp from 10.1.2.4 to 10.2.3.5
  pass in proto udp from 10.1.2.5 to 10.2.3.6

And a second one for TCP called "tcp-only":

  pass in proto tcp from 10.1.2.3 to 10.2.3.4 port www
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port www
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port ssh
  pass in proto tcp from 10.1.2.5 to 10.2.3.6 port www
  pass in proto tcp from 10.1.2.6 to 10.2.3.7 port www

Both of them can be called from the main ruleset with:

  anchor udp-only
  anchor tcp-only

That would not improve performance much, though. Actually, there is some overhead involved when the kernel has to step into and out of these sub-rulesets.

But anchor calls can also contain filter criteria, much like pass/block rules:

  anchor udp-only in on fxp0 inet proto udp
  anchor tcp-only in on fxp0 inet proto tcp

The sub-ruleset is only evaluated for packets that match the criteria. In other words, the subroutine is only conditionally evaluated. When the criteria do not match, the call is skipped, and the evaluation cost is limited to the comparison of the criteria in the call.
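
The saving can be estimated with a simple count. The sketch below charges one comparison per anchor call plus one per rule inside an anchor that is entered; skip steps inside the anchors and the stepping overhead are ignored, and all names are hypothetical.

```python
UDP_ONLY = ["udp rule"] * 3  # three rules, as in "udp-only" above
TCP_ONLY = ["tcp rule"] * 5  # five rules, as in "tcp-only" above

def rules_evaluated(anchor_calls, packet):
    """Count anchor-call comparisons plus rules visited inside entered anchors."""
    count = 0
    for criteria, subrules in anchor_calls:
        count += 1  # comparing the criteria of the call itself
        if all(packet.get(key) == value for key, value in criteria.items()):
            count += len(subrules)
    return count

unconditional = [({}, UDP_ONLY), ({}, TCP_ONLY)]
conditional = [
    ({"dir": "in", "iface": "fxp0", "af": "inet", "proto": "udp"}, UDP_ONLY),
    ({"dir": "in", "iface": "fxp0", "af": "inet", "proto": "tcp"}, TCP_ONLY),
]

udp_packet = {"dir": "in", "iface": "fxp0", "af": "inet", "proto": "udp"}
tcp_packet = dict(udp_packet, proto="tcp")

print(rules_evaluated(unconditional, udp_packet))
print(rules_evaluated(conditional, udp_packet))
print(rules_evaluated(conditional, tcp_packet))
```

With only eight rules the gain is small (10 comparisons versus 5 or 7), but it grows with the size of the sub-rulesets that can be bypassed.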

For performance, this is mainly relevant when the sub-ruleset contains many rules, and the call criteria are not those primarily optimized by skip steps.

Let pfctl do the work for you

As of OpenBSD 3.6, several of the optimizations discussed can be automated by pfctl -o. The optimizer analyzes a ruleset and makes modifications that do not change the effect of the ruleset.

First, pfctl splits the ruleset into blocks of adjacent rules in such a way that reordering rules within one block cannot possibly affect the outcome of evaluation for any packet.

For example, the rules in the following block can be arbitrarily reordered:

  pass proto tcp to 10.1.2.3 port www keep state
  pass proto udp to 10.1.2.3 port domain keep state
  pass proto tcp to 10.1.0.0/16 keep state

But in most cases rule order is relevant. For instance:

  block log all
  block from 10.1.2.3
  pass from any to 10.2.3.4

Changing the position of any of those rules produces completely different effects. After swapping the first two rules, packets from 10.1.2.3 still get blocked, but they're now also logged. Exchange the last two rules, and packets from 10.1.2.3 to 10.2.3.4 are suddenly blocked. And switching the first and last rule blocks every packet.
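
These three claims can be checked with a minimal last-match model. In this sketch a rule is a tuple of action, log flag, source and destination, with None meaning 'any'; this representation is an assumption for the example only.

```python
def decide(rules, src, dst):
    """Last matching rule wins; returns (action, logged)."""
    result = ("pass", False)  # implicit default when nothing matches
    for action, log, rule_src, rule_dst in rules:
        if rule_src in (None, src) and rule_dst in (None, dst):
            result = (action, log)
    return result

BLOCK_LOG_ALL = ("block", True, None, None)       # block log all
BLOCK_FROM = ("block", False, "10.1.2.3", None)   # block from 10.1.2.3
PASS_TO = ("pass", False, None, "10.2.3.4")       # pass from any to 10.2.3.4

original = [BLOCK_LOG_ALL, BLOCK_FROM, PASS_TO]
swap_first_two = [BLOCK_FROM, BLOCK_LOG_ALL, PASS_TO]
swap_last_two = [BLOCK_LOG_ALL, PASS_TO, BLOCK_FROM]
swap_first_last = [PASS_TO, BLOCK_FROM, BLOCK_LOG_ALL]

print(decide(original, "10.1.2.3", "10.9.9.9"),
      decide(swap_first_two, "10.1.2.3", "10.9.9.9"))
print(decide(original, "10.1.2.3", "10.2.3.4"),
      decide(swap_last_two, "10.1.2.3", "10.2.3.4"))
print(decide(original, "10.5.5.5", "10.2.3.4"),
      decide(swap_first_last, "10.5.5.5", "10.2.3.4"))
```

Each printed pair shows the decision before and after the swap for a packet affected by it. The destination 10.9.9.9 and the source 10.5.5.5 are arbitrary addresses picked for the test.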

In every case of possible dependency, pfctl splits the rules into separate blocks. In the worst case, when no two adjacent rules can be freely reordered, each rule becomes a separate block containing only that rule, and pfctl can't make any modifications.

Otherwise, pfctl sorts the rules in each block so that skip step values are maximized:

  $ cat example
  pass proto tcp from 10.0.0.3 to 10.0.0.8
  pass proto udp from 10.0.0.1
  pass proto tcp from 10.0.0.2
  pass proto tcp from 10.0.0.4
  pass proto udp from 10.0.0.6
  pass proto tcp from 10.0.0.3 to 10.0.0.7

  $ pfctl -onvf example
  pass inet proto tcp from 10.0.0.3 to 10.0.0.8
  pass inet proto tcp from 10.0.0.3 to 10.0.0.7
  pass inet proto tcp from 10.0.0.2 to any
  pass inet proto tcp from 10.0.0.4 to any
  pass inet proto udp from 10.0.0.1 to any
  pass inet proto udp from 10.0.0.6 to any

When duplicate rules are found, they are removed:

  $ cat example
  pass proto tcp from 10.0.0.1
  pass proto udp from 10.0.0.2
  pass proto tcp from 10.0.0.1

  $ pfctl -onvf example
  pass inet proto tcp from 10.0.0.1 to any
  pass inet proto udp from 10.0.0.2 to any

Redundant rules are removed as well:

  $ cat example
  pass proto tcp from 10.1/16
  pass proto tcp from 10.1.2.3
  pass proto tcp from 10/8

  $ pfctl -onvf example
  pass inet proto tcp from 10.0.0.0/8 to any
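
The idea behind this step can be sketched with Python's ipaddress module. This is not pfctl's algorithm: the sketch assumes IPv4 sources written in full notation (Python does not accept the 10.1/16 shorthand), and it is only valid for rules that are otherwise identical and belong to the same reorderable block. The function name is hypothetical.

```python
import ipaddress

def drop_redundant(sources):
    """Keep only source networks that are not contained in another listed network."""
    nets = [ipaddress.ip_network(s) for s in sources]
    kept = []
    for net in nets:
        covered = any(net != other and net.subnet_of(other) for other in nets)
        if not covered and net not in kept:  # the second test drops exact duplicates
            kept.append(net)
    return [str(n) for n in kept]

print(drop_redundant(["10.1.0.0/16", "10.1.2.3", "10.0.0.0/8"]))
print(drop_redundant(["10.0.0.1", "10.0.0.2", "10.0.0.1"]))
```

For the three sources from the example only 10.0.0.0/8 remains, since it contains the other two.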

Multiple rules are combined into a single rule using a table where possible and advantageous:

  $ cat example
  pass from 10.1.2.3
  pass from 10.2.3.4
  pass from 10.3.4.5
  pass from 10.4.5.6
  pass from 10.5.6.7
  pass from 10.8.9.1

  $ pfctl -onvf example
  table <__automatic_0> const { 10.1.2.3 10.2.3.4 10.3.4.5 10.4.5.6
                                10.5.6.7 10.8.9.1 }
  pass inet from <__automatic_0> to any

When called with -oo, pfctl also consults the evaluation counters shown by pfctl -vsr to reorder 'quick' rules according to matching frequency.

It's very conservative about making changes, only performing changes that are certain to not affect the outcome of the ruleset evaluation under any circumstances for any packet. This has the advantage that the optimizer can be used safely with any ruleset. The drawback is that pfctl might not dare change something which you could, if you thought about the effect of the change. Like the skip step optimization, the performance improvement depends on how large the blocks of reorderable rules are. By manually reordering rules first, you can potentially improve the gain the optimizer can produce.

The easiest way to see what -o or -oo does with your ruleset is to compare its suggestion with the original ruleset, like this:

  $ pfctl -nvf /etc/pf.conf >before
  $ pfctl -oonvf /etc/pf.conf >after
  $ diff -u before after

When run against a manually optimized ruleset, the differences are usually unspectacular. Significant improvements can be expected for rulesets that were automatically generated by rule editing frontends.

Copyright (c) 2004-2006 Daniel Hartmeier <daniel@benzedrine.ch>. Permission to use, copy, modify, and distribute this documentation for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.