urllib2 and BeautifulSoup: a nice couple, but too slow. urllib3 and threads?

Question

Apr 22, 2012, 05:59 AM

urllib2 beautifulsoup multithreading python performance


I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree about which solution is best.

The problem with my script below is the execution time: it is so slow!

Step 1: I fetch this page: http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on

Step 2: I parse the page with BeautifulSoup.

Step 3: I put the data in an Excel document.

Step 4: I do it again and again for every country in my list (a big list), changing "Afghanistan" in the URL to another country each time.

Here is my code:

<code>import urllib  # Python 2
import xlwt  # assumption: the workbook comes from xlwt
from BeautifulSoup import BeautifulSoup as soup  # BeautifulSoup 3

wb = xlwt.Workbook()  # assumption: not shown in the original snippet
name_excel = "BULATS_IA"  # placeholder: the real file name is not shown

ws = wb.add_sheet("BULATS_IA")  # We add a new tab in the Excel doc
x = 0  # x (row) and y (column) locate each cell in the Excel doc
y = 0
Countries_List = ['Afghanistan','Albania','Andorra','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
Longueur = len(Countries_List)

for Countries in Countries_List:
    y = 0

    # Open the results page with the corresponding country name in the URL
    htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()
    s = soup(htmlSource)
    tableGood = s.findAll('table')
    try:
        # The results are in the fourth table of the page
        rows = tableGood[3].findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            y = 0
            x = x + 1
            for td in cols:
                hum = td.text
                ws.write(x, y, hum)
                y = y + 1
                # Note: the workbook is saved again after every single cell
                wb.save("%s.xls" % name_excel)

    except IndexError:
        # Pages with fewer than four tables have no results; skip them
        pass
</code>

I know it is far from perfect, but I am looking forward to learning new things in Python! I think the script is very slow because urllib2 is not that fast, and neither is BeautifulSoup. For the soup part, I guess I can't really do better, but for urllib2 I don't know.
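One way to check that guess would be to time each stage separately. The sketch below is only an illustration, written for Python 3 and the standard library. `time_stages` and the stand-in stage functions are hypothetical names, not part of my script; in the real script the stages would be the download, the BeautifulSoup parsing and the Excel write.

```python
import time
from collections import defaultdict


def time_stages(items, stages):
    """Run each item through the named stages in order and total the time per stage.

    `stages` is a list of (name, function) pairs; each function receives the
    result of the previous stage. All names are hypothetical placeholders.
    """
    totals = defaultdict(float)
    for item in items:
        value = item
        for name, func in stages:
            start = time.perf_counter()
            value = func(value)
            totals[name] += time.perf_counter() - start
    return dict(totals)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without network access.
    demo_stages = [
        ("fetch", lambda country: "<td>%s</td>" % country),
        ("parse", lambda html: html.replace("<td>", "").replace("</td>", "")),
        ("save", lambda text: len(text)),
    ]
    timings = time_stages(["Afghanistan", "Albania"], demo_stages)
    for name, seconds in timings.items():
        print("%-6s %.6f s" % (name, seconds))
```

Whichever stage dominates the totals is the one worth optimizing first.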

EDIT 1: "Multiprocessing useless with urllib2?" seems interesting in my case. What do you think of this potential solution?

<code>import urllib2  # Python 2

# Make sure that the queue is thread-safe!!
# self.queue is assumed to be a Queue.Queue, which is thread-safe.

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.put(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.get()  # blocks until a URL is available
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh:  # gotta put it somewhere
            fh.write(dh.read())
</code>
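For comparison, the same producer/consumer idea could be sketched with a thread pool from the Python 3 standard library. This is only an illustration under stated assumptions: `fetch_all`, `download` and `max_workers=8` are hypothetical choices, and the `fetch` argument is injected so the sketch can be tried without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url):
    """Download one URL and return the response body as bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_all(urls, fetch=download, max_workers=8):
    """Fetch every URL with a pool of worker threads.

    Returns a dict mapping each URL to its body. The threads overlap the time
    spent waiting on the network; any exception raised by `fetch` propagates.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        bodies = list(pool.map(fetch, urls))
    return dict(zip(urls, bodies))


if __name__ == "__main__":
    # Fake fetch function so the example runs offline.
    pages = fetch_all(["a", "b", "c"], fetch=lambda url: url.upper())
    print(pages)
```

The parsing and the Excel writing would still happen in the main thread afterwards, so only the waiting on the network is parallelized here.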

EDIT 2: urllib3. Can anyone tell me more about it?

Re-use the same socket connection for multiple requests (HTTPConnectionPool and HTTPSConnectionPool) (with optional client-side certificate verification). https://github.com/shazow/urllib3

Since I am requesting different pages from the same website 122 times, I guess reusing the same socket connection could be interesting. Am I wrong? Couldn't it be faster?

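<code>import urllib3
from BeautifulSoup import BeautifulSoup as soup  # BeautifulSoup 3

# Pages_List is assumed to be defined elsewhere (the page numbers to fetch)
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.bulats.org')
for Pages in Pages_List:
    # Requests to the same host go through the same connection pool
    r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
    s = soup(r.data)
</code>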
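The idea of reusing one socket for many requests can also be sketched with the Python 3 standard library alone. This is a swapped-in illustration using `http.client`, not urllib3's API. `fetch_paths` and its `connection_factory` parameter are hypothetical, and the sketch assumes the server keeps the connection alive between requests.

```python
from http.client import HTTPConnection


def fetch_paths(host, paths, connection_factory=HTTPConnection, timeout=30):
    """Request several paths from one host through a single connection object.

    Each response is read completely before the next request is sent,
    which is required for the connection to be reused.
    """
    conn = connection_factory(host, timeout=timeout)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            bodies.append(response.read())
    finally:
        conn.close()
    return bodies
```

With the site from my example, this might be called as `fetch_paths("www.bulats.org", ["/agents/find-an-agent?page=%s" % n for n in Pages_List])`, where the shortened query string is only for illustration.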