Scrapy: start crawling after login
Disclaimer: the site I am crawling is a corporate intranet, and I have modified the URL slightly to protect corporate privacy.
I managed to log in to the site, but I have not been able to crawl it.
Starting from start_url
https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (this site redirects you to a similar site with a more complex URL, namely
https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {unid=ADE682E34FC59D274825770B0037D278})
For every page, including the start_url, I want to crawl every href found in //li/a.
(For each crawled page there will be an abundant number of hyperlinks, and some of them will be duplicates, because you can reach both parent and child sites from the same page.)
As you can see, the href does not contain the actual link (the one quoted above) that we see when we crawl to that page. There is also a # in front of its useful content. Could that be the source of the problem?
For restrict_xpaths, I restricted the path to the 'Log Out' link on the page.
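If the hrefs really begin with '#', a LinkExtractor will treat them as fragment-only links and discard them; its process_value hook lets you clean each raw href before it is joined into a URL. A minimal, self-contained sketch (the commented Rule wiring and the sample href are illustrative, not your actual markup):

```python
# A process_value callable for Scrapy's LinkExtractor: it receives each raw
# href string and must return a cleaned URL, or None to discard the link.
def strip_fragment(value):
    # Drop a leading '#' so '#/LotusQuickr/dept/...' becomes a crawlable path.
    if value.startswith('#'):
        value = value[1:]
    return value or None

# Hypothetical wiring inside the spider's rules (requires Scrapy):
# Rule(LinkExtractor(restrict_xpaths=('//li',), process_value=strip_fragment),
#      callback='parse_item', follow=True)

print(strip_fragment('#/LotusQuickr/dept/Main.nsf/h_Toc/abc/?OpenDocument'))
```

Whether this is actually needed depends on what the real href attributes contain; if they are normal relative URLs, the extractor resolves them against the page URL on its own.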
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
import scrapy

class KmssSpider(CrawlSpider):
    name = 'kmss'
    start_url = ('https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',)
    login_page = 'https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    # Note: the attribute Scrapy's OffsiteMiddleware reads is "allowed_domains",
    # not "allowed_domain".
    allowed_domains = ["kmssqkr.sarg"]

    rules = (
        Rule(LinkExtractor(allow=(r'https://kmssqkr.sarg/LotusQuickr/dept/\w*'),
                           restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),
                           unique=True),
             callback='parse_item', follow=True),
    )
    # r"LotusQuickr/dept/^[ A-Za-z0-9_@./#&+-]*$"
    # restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'), unique=True)

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'user': 'user', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfully logged in \n\n\n ")
            yield Request(url=self.start_url[0])
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
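One thing worth noticing in the spider above: the Rule's restrict_xpaths points at the login/logout widget, so the LinkExtractor only ever sees that single link rather than the //li/a links you want. To harvest those, the restriction would need to target the list items, e.g. restrict_xpaths=('//li',). What such a restriction selects can be illustrated with a stdlib-only parser (the sample HTML is made up):

```python
from html.parser import HTMLParser

class LiLinkCollector(HTMLParser):
    """Collect href values of <a> tags nested inside <li> elements,
    mimicking what restrict_xpaths=('//li',) would let a LinkExtractor see."""
    def __init__(self):
        super().__init__()
        self.li_depth = 0   # how many <li> elements we are currently inside
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_depth += 1
        elif tag == 'a' and self.li_depth > 0:
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == 'li' and self.li_depth > 0:
            self.li_depth -= 1

sample = ('<ul><li><a href="#/dept/child">Child</a></li></ul>'
          '<a href="/logout">Log Out</a>')
parser = LiLinkCollector()
parser.feed(sample)
print(parser.hrefs)  # only the link inside <li> is collected; /logout is not
```

This is only a model of the XPath restriction, not a replacement for the LinkExtractor; the point is that with the logout-link XPath in place, the crawl has nothing to follow after the first page, which matches the log below.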
Log:
2015-07-27 16:46:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-27 16:46:18 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-27 16:46:19 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2015-07-27 16:46:19 [boto] ERROR: Unable to read instance data, giving up
2015-07-27 16:46:19 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-27 16:46:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-27 16:46:19 [scrapy] INFO: Enabled item pipelines:
2015-07-27 16:46:19 [scrapy] INFO: Spider opened
2015-07-27 16:46:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-27 16:46:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-27 16:46:24 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-27 16:46:28 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr.ccgo.sarg/names.nsf?Login> (referer: https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-27 16:46:29 [kmss] DEBUG:
Successfuly Logged in
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf>
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument>
2015-07-27 16:46:29 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> (referer: https://kmssqkr.sarg/names.nsf?Login)
2015-07-27 16:46:29 [scrapy] INFO: Closing spider (finished)
2015-07-27 16:46:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1954,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 31259,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 27, 8, 46, 29, 286000),
'log_count/DEBUG': 8,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2015, 7, 27, 8, 46, 19, 528000)}
2015-07-27 16:46:29 [scrapy] INFO: Spider closed (finished)
[1]: http://i.stack.imgur.com/REQXJ.png
---------------------------------- UPDATE ----------------------------------
I saw the cookie format in http://doc.scrapy.org/en/latest/topics/request-response.html. These are my cookies on the site, but I am not sure what I should add to the request, or how.
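For what it's worth, Scrapy's CookiesMiddleware (which the log shows is enabled) normally carries session cookies from the login response to later requests on its own. If you do need to set cookies by hand, Request accepts either a plain name-to-value dict or a list of dicts that can also carry 'domain' and 'path'. A sketch with placeholder names and values (substitute the cookies from your screenshot):

```python
# Simple form: a name -> value mapping.
cookies_simple = {'SessionID': 'placeholder-value'}

# Full form: a list of dicts, each optionally scoped by domain and path.
cookies_full = [{
    'name': 'SessionID',
    'value': 'placeholder-value',
    'domain': 'kmssqkr.sarg',
    'path': '/',
}]

# Hypothetical use inside check_login_response (requires Scrapy):
# yield Request(url=self.start_url[0], cookies=cookies_simple)

print(sorted(cookies_full[0]))
```

Only reach for this if the automatic cookie handling demonstrably fails; manually set cookies can otherwise conflict with the ones the middleware is already tracking.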