Hacer una solicitud POST posterior en la sesión no funciona - web scraping

Question

Apr 19, 2016, 06:33 PM

python beautifulsoup python-requests web-scraping

Hacer una solicitud POST posterior en la sesión no funciona - web scraping

Esto es lo que estoy tratando de hacer: iraquí, luego presione "buscar". Toma los datos, luego presiona "siguiente" y sigue presionando siguiente hasta que te quedes sin páginas. Todo hasta llegar al "siguiente" funciona. Aquí está mi código. El formato de r.content es radicalmente diferente las dos veces que lo imprimo, lo que indica que algo diferente está sucediendo entre las solicitudes GET y POST aunque quiero un comportamiento muy similar. ¿Por qué podría estar ocurriendo esto?

Lo que me parece extraño es que incluso después de la solicitud POST que parece devolver el material incorrecto, aún puedo analizar las URL que necesito, pero no el campo de entrada __EVENTVALIDATION.

El mensaje de error (final del código) indica que el contenido no incluye estos datos que necesito para hacer una solicitud posterior, pero navegar a la página muestra que sí tiene esos datos y que el formato es muy similar al primera página.

EDITAR: Estoy haciendo que abra páginas web basadas en el HTML que está analizando, y definitivamente algo no está bien. ejecutar el código a continuación abrirá esas páginas.

GET me consigue un sitio web con datos como este:

<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="4424DBE6">
<input type="hidden" name="__VIEWSTATEENCRYPTED" id="__VIEWSTATEENCRYPTED" value="">
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="TlIgNH

Mientras que la POST produce un sitio con todos esos datos en la parte inferior de la página en texto sin formato, así:

|0|hiddenField|__EVENTTARGET||0|hiddenField|__EVENTARGUMENT||0|hiddenField|_

Contenido incorrecto

Buen contenido

import requests
from lxml import html
from bs4 import BeautifulSoup



page = requests.get('http://search.cpsa.ca/physiciansearch')
print('got page!')
d = {"ctl00$ctl13": "ctl00$ctl13|ctl00$MainContent$physicianSearchView$btnSearch",
     "ctl00$MainContent$physicianSearchView$txtLastName": "",
     'ctl00$MainContent$physicianSearchView$txtFirstName': "",
     'ctl00$MainContent$physicianSearchView$txtCity': "",
     "__VIEWSTATEENCRYPTED":"",
     'ctl00$MainContent$physicianSearchView$txtPostalCode': "",
     'ctl00$MainContent$physicianSearchView$rblPractice': "",
     'ctl00$MainContent$physicianSearchView$ddDiscipline': "",
     'ctl00$MainContent$physicianSearchView$rblGender': "",
     'ctl00$MainContent$physicianSearchView$txtPracticeInterests': "",
     'ctl00$MainContent$physicianSearchView$ddApprovals': "",
     'ctl00$MainContent$physicianSearchView$ddLanguage': "",
     "__EVENTTARGET": "ctl00$MainContent$physicianSearchView$btnSearch",
     "__EVENTARGUMENT": "",
     'ctl00$MainContent$physicianSearchView$hfPrefetchUrl': "http://service.cpsa.ca/OnlineService/OnlineService.svc/Services/GetAlbertaCities?name=",
     'ctl00$MainContent$physicianSearchView$hfRemoveUrl': "http://service.cpsa.ca/OnlineService/OnlineService.svc/Services/GetAlbertaCities?name=%QUERY",
     '__ASYNCPOST': 'true'}

h ={ "X-MicrosoftAjax":"Delta = true",
"X-Requested-With":"XMLHttpRequest",
     "User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
}

urls = []

with requests.session() as s:
    r = s.get("http://search.cpsa.ca/PhysicianSearch",headers=h)
    soup = BeautifulSoup(r.content, "lxml")
    tree = html.fromstring(r.content)
    html.open_in_browser(tree)

    ev = soup.select("#__EVENTVALIDATION" )[0]["value"]
    vs = soup.select("#__VIEWSTATE")[0]["value"]
    vsg = soup.select("#__VIEWSTATEGENERATOR")[0]["value"]
    d["__EVENTVALIDATION"] = ev
    d["__VIEWSTATEGENERATOR"] = vsg
    d["__VIEWSTATE"] = vs
    r = s.post('http://search.cpsa.ca/PhysicianSearch', data=d,headers=h)



    print('opening in browser')
    retrievedUrls = tree.xpath('//*[@id="MainContent_physicianSearchView_gvResults"]/tr/td[2]/a/@href')
    print(retrievedUrls)

    for url in retrievedUrls:
        urls.append(url)

    endSearch = False    
    while endSearch == False:

        tree = html.fromstring(r.content)
        html.open_in_browser(tree)


        soup = BeautifulSoup(r.content, "lxml")
        print('soup2:')
        ## BREAKS HERE
        ev = soup.select("#__EVENTVALIDATION" )[0]["value"]
        ## BREAKS HERE, 
        vs = soup.select("#__VIEWSTATE")[0]["value"]
        vsg = soup.select("#__VIEWSTATEGENERATOR")[0]["value"]

        d["ctl00$ctl13"] = "ctl00$MainContent$physicianSearchView$ResultsPanel|ctl00$MainContent$physicianSearchView$gvResults$ctl01$btnNextPage"
        d["__EVENTVALIDATION"] = ev
        d["__EVENTTARGET"] = ""
        d["__VIEWSTATEGENERATOR"] = vsg
        d["__VIEWSTATE"] = vs
        d["ctl00$MainContent$physicianSearchView$gvResults$ctl01$ddlPager"] = 1
        d["ctl00$MainContent$physicianSearchView$gvResults$ctl01$ddlPager"] = 1
        d["ctl00$MainContent$physicianSearchView$gvResults$ctl01$btnNextPage"] = "Next"
        r = requests.post('http://search.cpsa.ca/PhysicianSearch', data=d,headers=h)
        tree = html.fromstring(r.content)
        tree = html.fromstring(r.content)
        retrievedUrls = tree.xpath('//*[@id="MainContent_physicianSearchView_gvResults"]/tr/td[2]/a/@href')
        print(urls)
        print(retrievedUrls)
        endSearch = True

...

Traceback (most recent call last):
  File "C:\Users\daniel.bak\workspace\Alberta Physician Scraper\main\main.py", line 63, in <module>
    ev = soup.select("#__EVENTVALIDATION" )[0]["value"]
IndexError: list index out of range