Issues with https proxy in Python via suds and urllib2

Linux
Python
Coding
Author

Vinh Nguyen

Published

July 31, 2014

I recently had the need to access a SOAP API to obtain some data. SOAP works by posting an xml file to a site url in a format defined by the API's schema. The API then returns data, also in a form of an xml file. Based on this post, I figured suds was the easiest way to utilize Python to access the API so I could sequentially (and hence, parallelize) query data repeatedly. suds did turn out to be relatively easy to use:

```{python}
from suds.client import Client
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
client = Client(url)
print client
client.service.LocalTimeByZipCode('90210')
```

This worked on my home network. At work, I had to utilize a proxy in order to access the outside world. Otherwise, I'd get a connection refuse message: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:

```{python}
from suds.client import Client
proxy = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
# client = Client(url)
client = Client(url, proxy=proxy)
print client
client.service.LocalTimeByZipCode('90210')
```

The previous examples were from a public SOAP API I found online. Now, the site I wanted to actually hit uses ssl for encryption (i.e., https site) and requires authentication. I thought the fix would be as simple as:

```{python}
from suds.client import Client
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client
client.service.someFunction(args)
```

However, I got the error message: Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, as my previous toy example worked. Tried the same site on my home network where there is no firewall, and things worked:

```{python}
from suds.client import Client
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client
client.service.someFunction(args)
```

Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:

```{python}
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
logging.getLogger('suds.xsd.schema').setLevel(logging.DEBUG)
logging.getLogger('suds.wsdl').setLevel(logging.DEBUG)
```

My initial thoughts after some debugging: there must be something wrong with the proxy as the log shows python sending the request to the target url, but I get back a response that shows the path (minus the domain name) not found. What happened to the domain name? I notified the firewall team to look into this, as it appears the proxy is modifying something (url is not complete?). The firewall team investigated, and found that the proxy is returning a message that warns the ClientHello message is too large. This is one clue. The log also shows that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, as the python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. Now that convinced me that something else may be the culprit.

Googled for help, and thought this would be helpful.

```{python}
import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http' : 80, 'https' : 443}
    def request(self, method, url, body=None, headers={}):
        #request is called before connect, so can interpret url and get
        #real host/port to be used to make CONNECT request to proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        #get host
        host, rest = urllib.splithost(rest)
        #try to get port
        host, port = urllib.splitport(host)
        #if port is not defined try to get from proto
        if port is None:
            try:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)
    def connect(self):
        httplib.HTTPConnection.connect(self)
        #send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        #expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        #probably here we can handle auth requests...
        if code != 200:
            #proxy returned and error, abort connection, and raise exception
            self.close()
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        #eat up header block from proxy....
        while True:
            #should not use directly fp probablu
            line = response.fp.readline()
            if line == '\r\n': break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443
    def __init__(self, host, port = None, key_file = None, cert_file = None, strict = None, timeout=0): # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file
    def connect(self):
        ProxyHTTPConnection.connect(self)
        #make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler ## these are code from above classes
import urllib2, urllib
from suds.transport.http import HttpTransport
opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
urllib2.install_opener(opener)
t = HttpTransport()
t.urlopener = opener
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='https://some_server.com/path/to/soap_api?wsdl') ## some site suggests specifying location
```

This too did not work. Continued to google, and found that lot's of people are having issues with https and proxy. I knew suds depended on urllib2, so googled about that as well, and people too had issues with urllib2 in terms of https and proxy. I then decided to investigate using urllib2 to contact the https url through a proxy:

```{python}
## http://stackoverflow.com/questions/5227333/xml-soap-post-error-what-am-i-doing-wrong
## http://stackoverflow.com/questions/34079/how-to-specify-an-authenticated-proxy-for-a-python-http-connect
### at home this works
import urllib2
url = 'https://some_server.com/path/to/soap_api?wsdl'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)
page.read()

### work network, does not work:
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = urllib2.ProxyHandler({'https':'proxy_username:proxy_password@proxy_server.com:port', 'http':'proxy_username:proxy_password@proxy_server.com:port'})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
urllib2.install_opener(opener)
page = urllib2.urlopen(site)
### also tried re-doing above, but with the custom handler as defined in the previous code chunk (http://code.activestate.com/recipes/456195/) running first (run the list of classes)
```

No luck. I re-read this post that I ran into before, and really agreed that urllib2 is severely flawed, especially when using https proxy. At the end of the page, the author suggested using the requests package. Tried it out, and I was able to connect using the https proxy:

```{python}
import requests
import xmltodict
p1 = 'http://proxy_username:proxy_password@proxy_server.com:port'
p2 = 'https://proxy_username:proxy_password@proxy_server.com:port'
proxy = {'http': p1, 'https':p2}

site = 'https://some_server.com/path/to/soap_api?wsdl'
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text ## works
soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
...
"""
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'""'}
soap_xml_out = requests.post(site, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text
```

My learnings?

NOTE: I had to install libuuid-devel in cygwin64 because I was getting an installation error.