Mitmproxy is an enormously flexible tool. Knowing exactly how the proxying
process works will help you deploy it creatively, and take into account its
fundamental assumptions and how to work around them. This document explains
mitmproxy's proxy mechanism in detail, starting with the simplest unencrypted
explicit proxying, and working up to the most complicated interaction -
transparent proxying of SSL-protected traffic[^ssl] in the presence of
[SNI](http://en.wikipedia.org/wiki/Server_Name_Indication).


<div class="page-header">
    <h1>Explicit HTTP</h1>
</div>

Configuring the client to use mitmproxy as an explicit proxy is the simplest
and most reliable way to intercept traffic. The proxy protocol is codified in
the [HTTP RFC](http://www.ietf.org/rfc/rfc2068.txt), so the behaviour of both
the client and the server is well defined, and usually reliable. In the
simplest possible interaction with mitmproxy, a client connects directly to the
proxy, and makes a request that looks like this:

<pre>GET http://example.com/index.html HTTP/1.1</pre>

This is a proxy GET request - an extended form of the vanilla HTTP GET request
that includes a scheme and host specification, and it includes all the
information mitmproxy needs to proceed.
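
As a rough illustration, the sketch below shows how a proxy might pull the scheme, host and port out of such a request line. It is not mitmproxy's own code: the function name `parse_proxy_request_line` is hypothetical, and it relies only on `urllib.parse.urlsplit` from the Python standard library.

```python
from urllib.parse import urlsplit

# Default ports for the schemes we expect in a proxy request.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_proxy_request_line(line):
    """Split a proxy request line into (method, scheme, host, port, path).

    Hypothetical helper for illustration only. Raises ValueError when the
    request target carries no scheme and host, as in a vanilla request.
    """
    method, target, version = line.split()
    url = urlsplit(target)
    if not url.scheme or not url.hostname:
        raise ValueError("request target has no scheme and host")
    port = url.port or DEFAULT_PORTS.get(url.scheme)
    path = url.path or "/"
    if url.query:
        path += "?" + url.query
    return method, url.scheme, url.hostname, port, path
```

Called with the request line shown above, this yields the method, the `http` scheme, the host `example.com`, the default port 80 and the path `/index.html`, which is everything needed to open the upstream connection.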

<img src="explicit.png"/>

<table class="table">
    <tbody>
        <tr>

            <td><b>1</b></td>

            <td>The client connects to the proxy and makes a request.</td>

        </tr>

        <tr>

            <td><b>2</b></td>

            <td>Mitmproxy connects to the upstream server and simply forwards
            the request on.</td>

        </tr>
    </tbody>
</table>


<div class="page-header">
    <h1>Explicit HTTPS</h1>
</div>

The process for an explicitly proxied HTTPS connection is quite different. The
client connects to the proxy and makes a request that looks like this:

<pre>CONNECT example.com:443 HTTP/1.1</pre>

A conventional proxy can neither view nor manipulate an SSL-encrypted data
stream, so a CONNECT request simply asks the proxy to open a pipe between the
client and server. The proxy here is just a facilitator - it blindly forwards
data in both directions without knowing anything about the contents. The
negotiation of the SSL connection happens over this pipe, and the subsequent
flow of requests and responses is completely opaque to the proxy.
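
The only thing a proxy learns from a CONNECT request is where to open the pipe. The following minimal sketch, with a hypothetical helper name `parse_connect_target`, shows how that host and port could be read from the request line; it is an illustration rather than mitmproxy's implementation.

```python
def parse_connect_target(line):
    """Return (host, port) from a CONNECT request line.

    Hypothetical helper for illustration only.
    """
    method, target, version = line.split()
    if method != "CONNECT":
        raise ValueError("not a CONNECT request")
    # Split on the last colon so that bracketed IPv6 literals survive.
    host, sep, port = target.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("CONNECT target must be host:port")
    return host.strip("[]"), int(port)
```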

## The MITM in mitmproxy

This is where mitmproxy's fundamental trick comes into play. The MITM in its
name stands for Man-In-The-Middle - a reference to the process we use to
intercept and interfere with these theoretically opaque data streams. The basic
idea is to pretend to be the server to the client, and pretend to be the client
to the server, while we sit in the middle decoding traffic from both sides. The
tricky part is that the [Certificate
Authority](http://en.wikipedia.org/wiki/Certificate_authority) system is
designed to prevent exactly this attack, by allowing a trusted third party to
cryptographically sign a server's SSL certificates to verify that they are
legitimate. If this signature doesn't match or is from a non-trusted party, a secure
client will simply drop the connection and refuse to proceed. Despite the many
shortcomings of the CA system as it exists today, this is usually fatal to
attempts to MITM an SSL connection for analysis. Our answer to this conundrum
is to become a trusted Certificate Authority ourselves. Mitmproxy includes a
full CA implementation that generates interception certificates on the fly. To
get the client to trust these certificates, we [register mitmproxy as a trusted
CA with the device manually](@!urlTo("ssl.html")!@).

## Complication 1: What's the remote hostname?

To proceed with this plan, we need to know the domain name to use in the
interception certificate - the client will verify that the certificate is for
the domain it's connecting to, and abort if this is not the case. At first
blush, it seems that the CONNECT request above gives us all we need - in this
example, the host in the CONNECT request and the domain the certificate must
cover are both "example.com". But what if the client had initiated the
connection as follows:

<pre>CONNECT 10.1.1.1:443 HTTP/1.1</pre>

Using the IP address is perfectly legitimate because it gives us enough
information to initiate the pipe, even though it doesn't reveal the remote
hostname.
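
A proxy can tell the two cases apart by checking whether the CONNECT host is an IP literal. The sketch below uses the standard library's `ipaddress` module; the function name `is_ip_literal` is a hypothetical one chosen for this example.

```python
import ipaddress


def is_ip_literal(host):
    """Return True if the CONNECT host is an IPv4 or IPv6 address.

    Hypothetical helper: when this is True, the CONNECT request alone
    does not tell us which hostname the certificate must cover.
    """
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True
```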

Mitmproxy has a cunning mechanism that smooths this over - [upstream
certificate sniffing](@!urlTo("features/upstreamcerts.html")!@). As soon as we
see the CONNECT request, we pause the client part of the conversation, and
initiate a simultaneous connection to the server. We complete the SSL handshake
with the server, and inspect the certificates it used. Now, we use the Common
Name in the upstream SSL certificates to generate the dummy certificate for the
client. Voila, we have the correct hostname to present to the client, even if
it was never specified.
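
The sniffing step can be approximated with Python's standard `ssl` module, as in the sketch below. This is an illustration under stated assumptions and not mitmproxy's implementation: the function names are hypothetical, and the certificate is read in the dictionary form returned by `SSLSocket.getpeercert()`, which is only populated when the upstream certificate validates against the system trust store.

```python
import socket
import ssl


def fetch_upstream_cert(address, port=443):
    """Complete an SSL handshake with the server and return its certificate.

    Hypothetical helper. Hostname checking is disabled because we may
    only know the server's IP address at this point.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    with socket.create_connection((address, port), timeout=10) as sock:
        with ctx.wrap_socket(sock) as tls:
            return tls.getpeercert()


def common_name(cert):
    """Return the Common Name from a getpeercert()-style dict, or None."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

Under these assumptions, `common_name(fetch_upstream_cert("10.1.1.1"))` would supply the name to place in the dummy certificate.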


## Complication 2: Subject Alternative Name

Enter the next complication. Sometimes, the certificate Common Name is not, in
fact, the hostname that the client is connecting to. This is because of the
optional [Subject Alternative
Name](http://en.wikipedia.org/wiki/SubjectAltName) field in the SSL certificate
that allows an arbitrary number of alternative domains to be specified. If the
expected domain matches any of these, the client will proceed, even though the
domain doesn't match the certificate Common Name. The answer here is simple:
when we extract the CN from the upstream cert, we also extract the SANs, and add
them to the generated dummy certificate.
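
The sketch below illustrates why copying the SANs matters, by collecting the names from an upstream certificate and checking a hostname against them the way a client roughly would. It assumes the dictionary form returned by Python's `SSLSocket.getpeercert()`; the function names are hypothetical, and the wildcard rule is a simplification that matches exactly one left-most label.

```python
def certificate_names(cert):
    """Return the CN followed by all DNS SANs of a getpeercert()-style dict."""
    names = []
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                names.append(value)
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "DNS" and value not in names:
            names.append(value)
    return names


def name_matches(pattern, hostname):
    """Simplified match: '*.example.org' covers exactly one extra label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        head, dot, rest = hostname.partition(".")
        return bool(head) and dot + rest == pattern[1:]
    return pattern == hostname


def cert_covers(cert, hostname):
    """Return True if any name in the certificate matches the hostname."""
    return any(name_matches(n, hostname) for n in certificate_names(cert))
```

A dummy certificate carrying only the CN would fail this check for any hostname that the upstream certificate covers through a SAN alone.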


## Complication 3: Server Name Indication

One of the big limitations of vanilla SSL is that each certificate requires its
own IP address. This means that you couldn't do virtual hosting where multiple
domains with independent certificates share the same IP address. In a world
with a rapidly shrinking IPv4 address pool this is a problem, and we have a
solution in the form of the [Server Name
Indication](http://en.wikipedia.org/wiki/Server_Name_Indication) extension to
the SSL and TLS protocols. This lets the client specify the remote server name
at the start of the SSL handshake, which then lets the server select the right
certificate to complete the process.

SNI breaks our upstream certificate sniffing process, because when we connect
without using SNI, we get served a default certificate that may have nothing to
do with the certificate expected by the client. The solution is another tricky
complication to the client connection process. After the client connects, we
allow the SSL handshake to continue until just _after_ the SNI value has been
passed to us. Now we can pause the conversation, and initiate an upstream
connection using the correct SNI value, which then serves us the correct
upstream certificate, from which we can extract the expected CN and SANs.
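
To show where the SNI value lives, the sketch below reads it straight out of the bytes of a ClientHello message. It is a simplified illustration, not the mechanism mitmproxy uses through its SSL library: the function name `extract_sni` is hypothetical, and the code assumes the whole ClientHello arrives in a single TLS record.

```python
import struct


def extract_sni(record):
    """Return the SNI hostname from a single-record ClientHello, or None."""
    try:
        # Record type 0x16 is a handshake; handshake type 0x01 is ClientHello.
        if record[0] != 0x16 or record[5] != 0x01:
            return None
        # Skip record header (5), handshake header (4), version (2), random (32).
        pos = 5 + 4 + 2 + 32
        pos += 1 + record[pos]                      # session id
        (suites_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + suites_len                       # cipher suites
        pos += 1 + record[pos]                      # compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:                       # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = record[pos + 2]
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                if name_type != 0:
                    return None
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
        return None
    except (IndexError, struct.error, UnicodeDecodeError):
        # Truncated or malformed input, or a ClientHello with no extensions.
        return None
```

Once the hostname is known, it can be passed as the server name on the upstream connection so that the server selects the matching certificate.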

There's another wrinkle here. Due to a limitation of the SSL library mitmproxy
uses, we can't detect that a connection _hasn't_ sent an SNI request until it's
too late for upstream certificate sniffing. In practice, we therefore make a
vanilla SSL connection upstream to sniff non-SNI certificates, and then discard
the connection if the client sends an SNI notification. If you're watching your
traffic with a packet sniffer, you'll see two connections to the server when an
SNI request is made, the first of which is immediately closed after the SSL
handshake. Luckily, this is almost never an issue in practice.

## Putting it all together

Let's put all of this together into the complete explicitly proxied HTTPS flow.

<img src="explicit_https.png"/>

<table class="table">
    <tbody>
        <tr>
            <td><b>1</b></td>
            <td>The client makes a connection to mitmproxy, and issues an HTTP
            CONNECT request.</td>
        </tr>
        <tr>
            <td><b>2</b></td>

            <td>Mitmproxy responds with a 200 Connection Established, as if it
            had set up the CONNECT pipe.</td>
        </tr>
        <tr>
            <td><b>3</b></td>

            <td>The client believes it's talking to the remote server, and
            initiates the SSL connection. It uses SNI to indicate the hostname
            it is connecting to.</td>
        </tr>

        <tr>
            <td><b>4</b></td>

            <td>Mitmproxy connects to the server, and establishes an SSL
            connection using the SNI hostname indicated by the client.</td>

        </tr>
        <tr>
            <td><b>5</b></td>

            <td>The server responds with the matching SSL certificate, which
            contains the CN and SAN values needed to generate the interception
            certificate.</td>
        </tr>
        <tr>
            <td><b>6</b></td>

            <td>Mitmproxy generates the interception cert, and continues the
            client SSL handshake paused in step 3.</td>
        </tr>
        <tr>
            <td><b>7</b></td>

            <td>The client sends the request over the established SSL
            connection.</td>
        </tr>
        <tr>
            <td><b>8</b></td>

            <td>Mitmproxy passes the request on to the server over the SSL
            connection initiated in step 4.</td>
        </tr>
    </tbody>
</table>


<div class="page-header">
    <h1>Transparent HTTP</h1>
</div>

When a transparent proxy is used, the HTTP/S connection is redirected into a
proxy at the network layer, without any client configuration being required.
This makes transparent proxying ideal for those situations where you can't
change client behaviour - proxy-oblivious Android applications being a common
example.

To achieve this, we need to introduce two extra components. The first is a
redirection mechanism that transparently reroutes a TCP connection destined for
a server on the Internet to a listening proxy server. This usually takes the
form of a firewall on the same host as the proxy server -
[iptables](http://www.netfilter.org/) on Linux or
[pf](http://en.wikipedia.org/wiki/PF_\(firewall\)) on OS X. Once the client has
initiated the connection, it makes a vanilla HTTP request, which might look
something like this:

<pre>GET /index.html HTTP/1.1</pre>

Note that this request differs from the explicit proxy variation, in that it
omits the scheme and hostname. How, then, do we know which upstream host to
forward the request to? The routing mechanism that has performed the
redirection keeps track of the original destination for us.  Each routing
mechanism has a different way of exposing this data, so this introduces the
second component required for working transparent proxying: a host module that
knows how to retrieve the original destination address from the router. In
mitmproxy, this takes the form of a built-in set of
[modules](https://github.com/mitmproxy/mitmproxy/tree/master/libmproxy/platform)
that know how to talk to each platform's redirection mechanism. Once we have
this information, the process is fairly straightforward.
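
On Linux, for example, a connection redirected by iptables keeps its original destination available through the `SO_ORIGINAL_DST` socket option. The sketch below shows one way to read it for IPv4. The function names are hypothetical, and the constant is defined by hand because Python's `socket` module does not export it.

```python
import socket
import struct

# Value from <linux/netfilter_ipv4.h>; not exported by Python's socket module.
SO_ORIGINAL_DST = 80


def parse_sockaddr_in(raw):
    """Decode a 16-byte IPv4 sockaddr_in structure into (address, port)."""
    # Layout: family (2 bytes, skipped), port (2, network order),
    # address (4), zero padding (8, skipped).
    port, packed_ip = struct.unpack("!2xH4s8x", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(client_sock):
    """Ask the kernel where a redirected connection was originally headed.

    Linux only, and only meaningful for sockets redirected by iptables.
    """
    raw = client_sock.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

Other platforms expose the same information through different interfaces, which is why a separate module is needed for each redirection mechanism.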

<img src="transparent.png"/>


<table class="table">
    <tbody>
        <tr>
            <td><b>1</b></td>
            <td>The client makes a connection to the server.</td>
        </tr>
        <tr>
            <td><b>2</b></td>

            <td>The router redirects the connection to mitmproxy, which is
            typically listening on a local port of the same host. Mitmproxy
            then consults the routing mechanism to establish what the original
            destination was.</td>
        </tr>
        <tr>
            <td><b>3</b></td>

            <td>Now, we simply read the client's request...</td>
        </tr>

        <tr>
            <td><b>4</b></td>

            <td>... and forward it upstream.</td>

        </tr>
    </tbody>
</table>

<div class="page-header">
    <h1>Transparent HTTPS</h1>
</div>

The first step is to determine whether we should treat an incoming connection
as HTTPS. The mechanism for doing this is simple - we use the routing mechanism
to find out what the original destination port is. By default, we treat all
traffic destined for ports 443 and 8443 as SSL.
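
Expressed as code, the decision is a simple port lookup on the original destination. This is a minimal sketch with hypothetical names; the port set mirrors the defaults stated above.

```python
# Default destination ports treated as SSL, per the text above.
SSL_PORTS = frozenset({443, 8443})


def is_ssl_destination(original_dst, ssl_ports=SSL_PORTS):
    """Return True if the original (address, port) should be treated as SSL."""
    address, port = original_dst
    return port in ssl_ports
```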

From here, the process is a merger of the methods we've described for
transparently proxying HTTP, and explicitly proxying HTTPS. We use the routing
mechanism to establish the upstream server address, and then proceed as for
explicit HTTPS connections to establish the CN and SANs, and cope with SNI.

<img src="transparent_https.png"/>


<table class="table">
    <tbody>
        <tr>
            <td><b>1</b></td>
            <td>The client makes a connection to the server.</td>
        </tr>
        <tr>
            <td><b>2</b></td>

            <td>The router redirects the connection to mitmproxy, which is
            typically listening on a local port of the same host. Mitmproxy
            then consults the routing mechanism to establish what the original
            destination was.</td>
        </tr>
        <tr>
            <td><b>3</b></td>

            <td>The client believes it's talking to the remote server, and
            initiates the SSL connection. It uses SNI to indicate the hostname
            it is connecting to.</td>
        </tr>

        <tr>
            <td><b>4</b></td>

            <td>Mitmproxy connects to the server, and establishes an SSL
            connection using the SNI hostname indicated by the client.</td>

        </tr>
        <tr>
            <td><b>5</b></td>

            <td>The server responds with the matching SSL certificate, which
            contains the CN and SAN values needed to generate the interception
            certificate.</td>
        </tr>
        <tr>
            <td><b>6</b></td>

            <td>Mitmproxy generates the interception cert, and continues the
            client SSL handshake paused in step 3.</td>
        </tr>
        <tr>
            <td><b>7</b></td>

            <td>The client sends the request over the established SSL
            connection.</td>
        </tr>
        <tr>
            <td><b>8</b></td>

            <td>Mitmproxy passes the request on to the server over the SSL
            connection initiated in step 4.</td>
        </tr>
    </tbody>
</table>


[^ssl]: I use "SSL" to refer to both SSL and TLS in the generic sense, unless otherwise specified.