After talking to Moritz and Olaf privately and asking them about their nodes, and after running some experiments with some high-capacity relays, I've begun to realize that running a fast Tor relay is a pretty black art, with a lot of ad-hoc practice. Only a few people know how to do it, and if you just use Linux and Tor out of the box, your relay will likely underperform on 100Mbit links and above.

In the interest of trying to help grow and distribute the network, my ultimate plan is to collect all of this lore, use Science to divine out what actually matters, and then write a more succinct blog post about it.

However, that is a lot of work. It's also not totally necessary, when you can get a pretty good setup with a rough superset of all the ad-hoc voodoo. This post is thus about that voodoo.
Hopefully others will spring forth from the darkness to dump their own voodoo in this thread, as I suspect there is one hell of a lot of it out there, some (much?) of which I don't yet know. Likewise, if any blasphemous heretic wishes to apply Science to this voodoo, they should yell out, "Stand back, I'm doing Science!" (at home please, not on this list) and run some experiments to try to eliminate options that are useless to Tor performance. Or cite academic research papers. (But that's not Science, that's computer science - which is a religion like voodoo, but with cathedrals.)

Anyway, on with the draft:
== Machine Specs ==

First, you want to run your OS in x64 mode, because OpenSSL should do crypto faster in 64-bit.

Tor is currently not fully multithreaded, and tends not to benefit beyond 2 cores per process. Even then, the benefit beyond 1 core is marginal. 64-bit Tor nodes require about one 2GHz Xeon/Core2 core per 100Mbit of capacity.

Thus, to fill an 800Mbit link, you need at least a dual-socket, quad-core CPU config. You may be able to squeeze a full gigabit out of one of these machines. As far as I know, no one has ever done this with Tor on any one machine.
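As a back-of-the-envelope check against that one-core-per-100Mbit rule (my arithmetic, not a benchmark), you can compare your core count to your link:

# Estimate capacity from core count, assuming ~100Mbit per 2GHz core.
CORES=$(grep -c ^processor /proc/cpuinfo)
echo "$CORES cores => roughly $((CORES * 100))Mbit of Tor capacity"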
The i7s also just came out in this form factor, and can do hyperthreading (previous models may list 'ht' in cpuinfo, but actually don't support it). This should give you a decent bonus if you set NumCPUs to 2, since hyperthreading tends to work better with pure integer math (like crypto). We have not benchmarked this config yet, but I suspect it should fill a gigabit link fairly easily, possibly approaching 2Gbit.
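Since the 'ht' flag alone can lie, one way to check whether hyperthreading is actually enabled (my suggestion) is to compare the sibling and core counts in /proc/cpuinfo:

# With hyperthreading actually enabled, "siblings" (logical CPUs per
# package) will be double the physical "cpu cores" count.
grep -E '^(siblings|cpu cores)' /proc/cpuinfo | sort -u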
At full capacity, exit node Tor processes running at this rate consume about 500MB of RAM. You want to ensure your RAM speed is sufficient, but most newish hardware is good. Using this chart:

https://secure.wikimedia.org/wikipedia/en/wiki/List_of_device_bandwidths#Memory_Interconnect.2FRAM_buses

you can do the math and see that with a dozen memcpys in each direction, you come out needing DDR2 to be able to push 1Gbit full duplex.
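To spell that math out (my arithmetic, using the dozen-memcpy figure above): 1Gbit full duplex is roughly 125MByte/sec in each direction, so:

# 12 memcpys/direction * 125 MByte/sec * 2 directions:
echo $((12 * 125 * 2))   # => 3000 MByte/sec; DDR2-400 peaks near 3200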
As far as ethernet cards go, the Intel e1000e *should* be theoretically good, but it seems to fail at properly balancing its IRQs across multiple CPUs on recent kernels, which can cause you to bottleneck at 100% CPU on one core. At least that has been Moritz's experience. In our experiments, the RTL-8169 works fine (once tweaked, see below).
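A quick way to watch for this problem (assuming your NIC shows up as eth0 in /proc/interrupts):

# If one CPU column accumulates all the interrupts while the others
# stay flat, your card is not balancing its IRQs.
watch -n1 'grep eth /proc/interrupts'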
== System Tweakscript Wibbles and Config Splatters ==

First, ensure that you run no more than 2 Tor instances per IP. Any more than that and clients will ignore them.

Next, paste the following smattering into the shell (or just read it and make your own script):
# Set the hard limit of open file descriptors really high.
# Tor will also potentially run out of ports.
ulimit -SHn 65000

# Set the txqueuelen high, to prevent premature drops.
ifconfig eth0 txqueuelen 20000

# Tell our ethernet card (interrupt found from /proc/interrupts)
# to balance its IRQs across one whole CPU socket (4 cpus, mask 0f).
# You only want one socket for optimal ISR and buffer caching.
#
# Note that e1000e does NOT seem to obey this, but RTL-8169 will.
echo 0f > /proc/irq/17/smp_affinity

# Make sure you have auxiliary nameservers. I've seen many ISP
# nameservers fall over under load from fast tor nodes, both on our
# nodes and from scans. Or run caching named and closely monitor it.
echo "nameserver 8.8.8.8" >> /etc/resolv.conf
echo "nameserver 4.2.2.2" >> /etc/resolv.conf
# Load an amalgam of gigabit-tuning sysctls from:
# http://datatag.web.cern.ch/datatag/howto/tcp.html
# http://fasterdata.es.net/TCP-tuning/linux.html
# http://www.acc.umu.se/~maswan/linux-netperf.txt
# http://www.psc.edu/networking/projects/tcptune/#Linux
# and elsewhere...
# We have no idea which of these are needed yet for our actual use
# case, but they do help (especially the nf_conntrack ones):
# Write the settings to a file and load them with sysctl -p. (A bare
# "sysctl -p << EOF" won't work: -p reads from a file, not stdin.
# Where the amalgam set the same key twice, only the later, winning
# value is kept below.)
cat << EOF > /tmp/tor-sysctl.conf
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.rmem_max = 1048575
net.core.wmem_max = 1048575
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_fin_timeout = 30
net.netfilter.nf_conntrack_tcp_timeout_established = 7200
net.netfilter.nf_conntrack_checksum = 0
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 15
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.ip_local_port_range = 1025 65530
net.core.netdev_max_backlog = 300000
net.core.somaxconn = 20480
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_timestamps = 0
vm.min_free_kbytes = 65536
net.ipv4.ip_forward = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.default.proxy_arp = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.all.send_redirects = 0
EOF
sysctl -p /tmp/tor-sysctl.conf
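# (My suggestion: spot-check that a few of the values actually took.
# Also note that none of this survives a reboot unless it also goes
# into /etc/sysctl.conf.)
sysctl net.ipv4.tcp_keepalive_time net.core.somaxconn net.netfilter.nf_conntrack_max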
# XXX: ethtool wibbles
# You may also have to tweak some parameters with ethtool, possibly
# also enabling some checksum offloading or irq coalescing options to
# spare CPU, but for us this hasn't been needed yet.
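For reference, the sort of knobs I mean look like this (we haven't needed them, so treat these as a starting point to experiment with, not a recommendation):

# Show the current offload settings:
ethtool -k eth0
# Enable segmentation/receive offloads to spare CPU:
ethtool -K eth0 tso on gso on gro on
# Coalesce rx interrupts (fire at most every 100 microseconds):
ethtool -C eth0 rx-usecs 100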
== Setting up the Torrc ==

Basically you can just read through the stock example torrc, but there are some as-yet undocumented magic options, and options that need new defaults:

# NumCPUs doesn't provide any benefit beyond 2, and setting it higher
# may cause cache misses.
NumCPUs 2

# These options have archaic maximums of 2-5MByte:
BandwidthRate 100 MB
BandwidthBurst 200 MB
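Putting those together with the usual relay basics, a minimal torrc sketch might look like this (nickname, ports, and exit policy are placeholders - adjust to taste):

Nickname MyFastRelay
ORPort 443
DirPort 80
NumCPUs 2
BandwidthRate 100 MB
BandwidthBurst 200 MB
# Use "reject *:*" instead if you don't want to be an exit:
ExitPolicy accept *:80, accept *:443, reject *:*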
== Waiting for the Bootstrap and Measurement Process ==

Perhaps the most frustrating part of this setup is how long it takes for you to acquire traffic. If you are starting fresh at an ISP, I would consider only 200-400Mbit for your first month. Hitting even that by the end of the month may be a challenge, mostly because there may be dips and total setbacks along the way.

The slow ramp-up is primarily due to limitations in Tor's ability to rapidly publish descriptor updates, and to measure relays.

It ends up taking about 2-3 days to hit an observed bandwidth of 2MByte/sec per relay, but it can take well over a week (Moritz, do you have a better number?) to reach 8-9MByte/sec per relay. This is for an exit node; a middle node will likely gather traffic more slowly. Also, once you crash, you lose it. This bug is about that issue:

https://trac.torproject.org/projects/tor/ticket/1863
There is also a potential dip when you get the Guard flag, as our load balancing formulas try to avoid you, but no clients have chosen you as their Guard yet. Changes to the authority voting on Guards in Tor 0.2.2.15 should make this less drastic; it is even possible that your observed bandwidth will end up greater with the flag than without it. However, it will still take up to 2 months for clients to choose you as their new Guard.
== Running temporary auxiliary nodes ==

One way to shortcut this process and avoid paying for bandwidth you don't use is to spin up a bunch of temporary nodes to utilize the CPU and quickly gather that easy first 2MByte/sec of observed bandwidth per relay. You need the spare IPs to do this, though, given the two-relays-per-IP limit above.
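A sketch of what spinning up one extra instance looks like (paths, nickname, and ports here are placeholders; remember the two-per-IP limit):

# Give the second instance its own datadir, pidfile, and ports:
mkdir -p /var/lib/tor2
cat << EOF > /etc/tor/torrc2
DataDirectory /var/lib/tor2
PidFile /var/run/tor2.pid
Nickname MyFastRelay2
ORPort 9002
DirPort 9031
NumCPUs 2
EOF
tor -f /etc/tor/torrc2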
== Monitoring ==

Personally, I prefer console-based options like nload, top, and Damian's arm (http://www.atagar.com/arm/), because I don't like the idea of running extra services to publish my monitoring data to the world.

Other people have web-based monitoring using things like munin and mrtg. It would be nice to get a script/howto for that too.
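If you want to try arm, note that it talks to Tor's control port, so you need something like this in your torrc first (then just run 'arm' as the same user as the Tor process):

ControlPort 9051
CookieAuthentication 1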
== Current 1-Box Capacity Record ==

Our setup has topped out at 450Mbit, but averages between 300-400Mbit. We are currently having uptime issues due to heat (melting, poorly ventilated hard drives). It is likely that once we resolve this, we will continue climbing toward our CPU ceiling.

I believe Moritz and Olaf also push this much capacity, possibly a bit more, but with fewer nodes (4, as opposed to our 8). I hear Jake is also ramping up some Guard nodes (or maybe I didn't? Did I just betray you again, Jake?)
== Did I leave anything out? ==

Well, did I?