After talking to Moritz and Olaf privately and asking them about their nodes, and after running some experiments with some high-capacity relays, I've begun to realize that running a fast Tor relay is a pretty black art, with a lot of ad-hoc practice. Only a few people know how to do it, and if you just use Linux and Tor out of the box, your relay will likely underperform on 100Mbit links and above.

In the interest of trying to help grow and distribute the network, my ultimate plan is to collect all of this lore, use Science to divine out what actually matters, and then write a more succinct blog post about it.

However, that is a lot of work. It's also not totally necessary, when you can get a pretty good setup with a rough superset of all the ad-hoc voodoo. This post is thus about that voodoo.
Hopefully others will spring forth from the darkness to dump their own voodoo in this thread, as I suspect there is one hell of a lot of it out there, some (much?) of which I don't yet know. Likewise, if any blasphemous heretic wishes to apply Science to this voodoo, they should yell out, "Stand back, I'm doing Science!" (at home please, not on this list) and run some experiments to try to eliminate options that are useless to Tor performance. Or cite academic research papers. (But that's not Science, that's computer science - which is a religion like voodoo, but with cathedrals.)

Anyway, on with the draft:
== Machine Specs ==

First, you want to run your OS in x64 mode, because OpenSSL should do crypto faster in 64-bit.

Tor is currently not fully multithreaded, and tends not to benefit beyond 2 cores per process. Even then, the benefit beyond 1 core is marginal. 64-bit Tor nodes require about one 2GHz Xeon/Core2 core per 100Mbit of capacity.

Thus, to fill an 800Mbit link, you need at least a dual-socket, quad-core CPU config. You may be able to squeeze a full gigabit out of one of these machines. As far as I know, no one has ever done this with Tor on any one machine.
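As a back-of-the-envelope check against that one-core-per-100Mbit rule (my arithmetic, not a benchmark), you can compare your core count to your link:

# Estimate capacity from core count, assuming ~100Mbit per 2GHz core.
CORES=$(grep -c ^processor /proc/cpuinfo)
echo "$CORES cores => roughly $((CORES * 100))Mbit of Tor capacity"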
The i7s also just came out in this form factor, and can do hyperthreading (previous models may list 'ht' in cpuinfo, but actually don't support it). This should give you a decent bonus if you set NumCPUs to 2, since hyperthreading tends to work better with pure integer math (like crypto). We have not benchmarked this config yet, but I suspect it should fill a gigabit link fairly easily, possibly approaching 2Gbit.
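Since the 'ht' flag alone can lie, one way to check whether hyperthreading is actually enabled (my suggestion) is to compare the sibling and core counts in /proc/cpuinfo:

# With hyperthreading actually enabled, "siblings" (logical CPUs per
# package) will be double the physical "cpu cores" count.
grep -E '^(siblings|cpu cores)' /proc/cpuinfo | sort -u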
At full capacity, exit node Tor processes running at this rate consume about 500MB of RAM. You want to ensure your RAM speed is sufficient, but most newish hardware is good. Using this chart:

https://secure.wikimedia.org/wikipedia/en/wiki/List_of_device_bandwidths#Memory_Interconnect.2FRAM_buses

you can do the math and see that with a dozen memcpys in each direction, you come out needing DDR2 to be able to push 1Gbit full duplex.
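To spell that math out (my arithmetic, using the dozen-memcpy figure above): 1Gbit full duplex is roughly 125MByte/sec in each direction, so:

# 12 memcpys/direction * 125 MByte/sec * 2 directions:
echo $((12 * 125 * 2))   # => 3000 MByte/sec; DDR2-400 peaks near 3200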
As far as ethernet cards go, the Intel e1000e *should* be theoretically good, but it seems to fail at properly balancing its IRQs across multiple CPUs on recent kernels, which can cause you to bottleneck at 100% CPU on one core. At least that has been Moritz's experience. In our experiments, the RTL-8169 works fine (once tweaked, see below).
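A quick way to watch for this problem (assuming your NIC shows up as eth0 in /proc/interrupts):

# If one CPU column accumulates all the interrupts while the others
# stay flat, your card is not balancing its IRQs.
watch -n1 'grep eth /proc/interrupts'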
== System Tweakscript Wibbles and Config Splatters ==

First, ensure that you run no more than 2 Tor instances per IP. Any more than that and clients will ignore them.

Next, paste the following smattering into the shell (or just read it and make your own script):
# Set the hard limit of open file descriptors really high.
# Tor will also potentially run out of ports.
ulimit -SHn 65000

# Set the txqueuelen high, to prevent premature drops.
ifconfig eth0 txqueuelen 20000

# Tell our ethernet card (interrupt found from /proc/interrupts)
# to balance its IRQs across one whole CPU socket (4 cpus, mask 0f).
# You only want one socket for optimal ISR and buffer caching.
#
# Note that e1000e does NOT seem to obey this, but RTL-8169 will.
echo 0f > /proc/irq/17/smp_affinity

# Make sure you have auxiliary nameservers. I've seen many ISP
# nameservers fall over under load from fast tor nodes, both on our
# nodes and from scans. Or run caching named and closely monitor it.
echo "nameserver 8.8.8.8" >> /etc/resolv.conf
echo "nameserver 4.2.2.2" >> /etc/resolv.conf
# Load an amalgam of gigabit-tuning sysctls from:
# http://datatag.web.cern.ch/datatag/howto/tcp.html
# http://fasterdata.es.net/TCP-tuning/linux.html
# http://www.acc.umu.se/~maswan/linux-netperf.txt
# http://www.psc.edu/networking/projects/tcptune/#Linux
# and elsewhere...
# We have no idea which of these are needed yet for our actual use
# case, but they do help (especially the nf_conntrack ones):
# Write the settings to a file and load them with sysctl -p. (A bare
# "sysctl -p << EOF" won't work: -p reads from a file, not stdin.
# Where the amalgam set the same key twice, only the later, winning
# value is kept below.)
cat << EOF > /tmp/tor-sysctl.conf
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.rmem_max = 1048575
net.core.wmem_max = 1048575
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_fin_timeout = 30
net.netfilter.nf_conntrack_tcp_timeout_established = 7200
net.netfilter.nf_conntrack_checksum = 0
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 15
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.ip_local_port_range = 1025 65530
net.core.netdev_max_backlog = 300000
net.core.somaxconn = 20480
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_timestamps = 0
vm.min_free_kbytes = 65536
net.ipv4.ip_forward = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.default.proxy_arp = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.all.send_redirects = 0
EOF
sysctl -p /tmp/tor-sysctl.conf
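# (My suggestion: spot-check that a few of the values actually took.
# Also note that none of this survives a reboot unless it also goes
# into /etc/sysctl.conf.)
sysctl net.ipv4.tcp_keepalive_time net.core.somaxconn net.netfilter.nf_conntrack_max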
# XXX: ethtool wibbles
# You may also have to tweak some parameters with ethtool, possibly
# also enabling some checksum offloading or irq coalescing options to
# spare CPU, but for us this hasn't been needed yet.
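For reference, the sort of knobs I mean look like this (we haven't needed them, so treat these as a starting point to experiment with, not a recommendation):

# Show the current offload settings:
ethtool -k eth0
# Enable segmentation/receive offloads to spare CPU:
ethtool -K eth0 tso on gso on gro on
# Coalesce rx interrupts (fire at most every 100 microseconds):
ethtool -C eth0 rx-usecs 100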
== Setting up the Torrc ==

Basically you can just read through the stock example torrc, but there are some as-yet undocumented magic options, and options that need new defaults:

# NumCPUs doesn't provide any benefit beyond 2, and setting it higher
# may cause cache misses.
NumCPUs 2

# These options have archaic maximums of 2-5MByte:
BandwidthRate 100 MB
BandwidthBurst 200 MB
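Putting those together with the usual relay basics, a minimal torrc sketch might look like this (nickname, ports, and exit policy are placeholders - adjust to taste):

Nickname MyFastRelay
ORPort 443
DirPort 80
NumCPUs 2
BandwidthRate 100 MB
BandwidthBurst 200 MB
# Use "reject *:*" instead if you don't want to be an exit:
ExitPolicy accept *:80, accept *:443, reject *:*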
== Waiting for the Bootstrap and Measurement Process ==

Perhaps the most frustrating part of this setup is how long it takes for you to acquire traffic. If you are starting fresh at an ISP, I would consider only 200-400Mbit for your first month. Hitting even that by the end of the month may be a challenge, mostly because there may be dips and total setbacks along the way.

The slow ramp-up is primarily due to limitations in Tor's ability to rapidly publish descriptor updates, and to measure relays.

It ends up taking about 2-3 days to hit an observed bandwidth of 2MByte/sec per relay, but it can take well over a week (Moritz, do you have a better number?) to reach 8-9MByte/sec per relay. This is for an exit node; a middle node will likely gather traffic more slowly. Also, once you crash, you lose it. This bug is about that issue:

https://trac.torproject.org/projects/tor/ticket/1863
There is also a potential dip when you get the Guard flag, as our load balancing formulas try to avoid you, but no clients have chosen you as their Guard yet. Changes to the authority voting on Guards in Tor 0.2.2.15 should make this less drastic; it is even possible that your observed bandwidth will end up greater with the flag than without it. However, it will still take up to 2 months for clients to choose you as their new Guard.
== Running temporary auxiliary nodes ==

One way to shortcut this process and avoid paying for bandwidth you don't use is to spin up a bunch of temporary nodes to utilize the CPU and quickly gather that easy first 2MByte/sec of observed bandwidth per relay. You need the spare IPs to do this, though, given the two-relays-per-IP limit above.
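A sketch of what spinning up one extra instance looks like (paths, nickname, and ports here are placeholders; remember the two-per-IP limit):

# Give the second instance its own datadir, pidfile, and ports:
mkdir -p /var/lib/tor2
cat << EOF > /etc/tor/torrc2
DataDirectory /var/lib/tor2
PidFile /var/run/tor2.pid
Nickname MyFastRelay2
ORPort 9002
DirPort 9031
NumCPUs 2
EOF
tor -f /etc/tor/torrc2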
== Monitoring ==

Personally, I prefer console-based options like nload, top, and Damian's arm (http://www.atagar.com/arm/), because I don't like the idea of running extra services to publish my monitoring data to the world.

Other people have web-based monitoring using things like munin and mrtg. It would be nice to get a script/howto for that too.
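If you want to try arm, note that it talks to Tor's control port, so you need something like this in your torrc first (then just run 'arm' as the same user as the Tor process):

ControlPort 9051
CookieAuthentication 1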
== Current 1-Box Capacity Record ==

Our setup has topped out at 450Mbit, but averages between 300-400Mbit. We are currently having uptime issues due to heat (melting, poorly ventilated hard drives). It is likely that once we resolve this, we will continue climbing toward our CPU ceiling.

I believe Moritz and Olaf also push this much capacity, possibly a bit more, but with fewer nodes (4, as opposed to our 8). I hear Jake is also ramping up some Guard nodes (or maybe I didn't? Did I just betray you again, Jake?)
== Did I leave anything out? ==

Well, did I?