Crawling a website

Question

Oct 25, 2013, 01:44 AM

I'm new to both Python and Scrapy, and I'm trying to scrape data from a website that has an untrusted certificate. That may be why I can't crawl it, although it's also possible that I simply wrote the spider incorrectly.

Here is the error log I get when I try to crawl:

2013-10-24 21:19:08-0200 [scrapy] INFO: Scrapy 0.18.4 started (bot: tutorial)
2013-10-24 21:19:08-0200 [scrapy] DEBUG: Optional features available: ssl, http11, libxml2
2013-10-24 21:19:08-0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2013-10-24 21:19:12-0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-10-24 21:19:15-0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-10-24 21:19:15-0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-10-24 21:19:15-0200 [scrapy] DEBUG: Enabled item pipelines: 
2013-10-24 21:19:15-0200 [tutorial] INFO: Spider opened
2013-10-24 21:19:15-0200 [tutorial] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-10-24 21:19:15-0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-10-24 21:19:15-0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-10-24 21:19:16-0200 [tutorial] DEBUG: Redirecting (301) to <GET https://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx> from <GET http://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx>
2013-10-24 21:19:16-0200 [tutorial] DEBUG: Retrying <GET https://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2013-10-24 21:19:17-0200 [tutorial] DEBUG: Retrying <GET https://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2013-10-24 21:19:17-0200 [tutorial] DEBUG: Gave up retrying <GET https://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2013-10-24 21:19:17-0200 [tutorial] ERROR: Error downloading <GET https://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2013-10-24 21:19:17-0200 [tutorial] INFO: Closing spider (finished)
2013-10-24 21:19:17-0200 [tutorial] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/scrapy.xlib.tx._newclient.ResponseNeverReceived': 3,
     'downloader/request_bytes': 1064,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 4,
     'downloader/response_bytes': 384,
     'downloader/response_count': 1,
     'downloader/response_status_count/301': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 10, 24, 23, 19, 17, 283862),
     'log_count/DEBUG': 10,
     'log_count/ERROR': 1,
     'log_count/INFO': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 10, 24, 23, 19, 15, 955787)}
2013-10-24 21:19:17-0200 [tutorial] INFO: Spider closed (finished)
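
The log only reports a generic OpenSSL.SSL.Error, which does not say whether the certificate was rejected or the handshake failed for some other reason. One way to tell the two apart outside of Scrapy might be a small probe like the one below. This is only a sketch under assumptions: it targets Python 3.7+ and its standard ssl module rather than the Python 2 / Twisted stack the spider runs on, and probe_tls is a hypothetical helper name, not part of Scrapy.

```python
import socket
import ssl


def probe_tls(host, port=443, verify=True, timeout=10.0):
    """Try a TLS handshake and return (succeeded, detail)."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname has to be switched off before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                # Handshake completed; report the negotiated protocol version.
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # The server answered, but its certificate was not trusted.
        return False, "certificate rejected: %s" % exc.verify_message
    except (ssl.SSLError, OSError) as exc:
        # Any other handshake or connection failure.
        return False, "%s: %s" % (type(exc).__name__, exc)
```

Calling probe_tls('matriculaweb.unb.br') and then probe_tls('matriculaweb.unb.br', verify=False) would show whether the failure goes away once certificate checks are disabled. If it does not, the certificate is probably not the cause.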

And here is my code:

from __future__ import absolute_import
import re
from urlparse import urljoin  # Python 2 module

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

# Assumed item definition; the fields are the ones the spider fills in below.
class Disciplina(Item):
    nome = Field()
    codigo = Field()
    requisitos = Field()

class MySpider(CrawlSpider):
    name = 'tutorial'
    allowed_domains = ['matriculaweb.unb.br']
    start_urls = ['http://matriculaweb.unb.br/matriculaweb/graduacao/oferta_campus.aspx']

    # "." and "?" are regex metacharacters, so they are escaped to match literally.
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/oferta_dis\.aspx\?cod=\d+',))),
        Rule(SgmlLinkExtractor(allow=(r'/oferta_dados\.aspx\?cod=\d+&dep=\d+',)),
             callback='parse_dep'),
    ]

    def parse_dep(self, response):
        # HtmlXPathSelector with .select() is the Scrapy 0.18 API.
        sel = HtmlXPathSelector(response)
        # Browsers insert <tbody> elements that the raw HTML may not contain.
        hrefs = sel.select(
            '/html/body/center/table/tbody/tr/td/table[4]/tbody/tr/td[2]/div/center/table/tbody/tr/td/font/strong/a/@href'
        ).extract()
        for href in hrefs:
            if re.search(r'cod=[0-9]+', href):
                # Request needs an absolute URL, so resolve the href against the page URL.
                yield Request(urljoin(response.url, href),
                              meta={'disc': Disciplina()}, callback=self.parse_disc)

    def parse_disc(self, response):
        sel = HtmlXPathSelector(response)
        disciplina = response.request.meta['disc']

        disciplina['nome'] = sel.select('/html/body/center/table/tbody/tr/td/center/table/tbody/tr[3]/td[2]').extract()
        disciplina['codigo'] = sel.select('/html/body/center/table/tbody/tr/td/center/table/tbody/tr[2]/td[2]').extract()
        disciplina['requisitos'] = sel.select('/html/body/center/table/tbody/tr/td/center/table/tbody/tr[6]/td[2]').extract()

        yield disciplina
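
A side note on the allow patterns in rules: they are regular expressions that the link extractor searches for in each link's URL, so a bare "?" acts as a quantifier rather than a literal question mark. The sketch below uses only the standard re module, and the URL in it is made up in the shape of the ones above, so treat it as an illustration rather than a statement about the real site.

```python
import re

# Hypothetical absolute URL of the kind a link extractor would test.
url = 'https://matriculaweb.unb.br/matriculaweb/graduacao/oferta_dis.aspx?cod=123'

unescaped = r'/oferta_dis.aspx?cod=\d+'   # "?" makes the preceding "x" optional
escaped = r'/oferta_dis\.aspx\?cod=\d+'   # "\." and "\?" match the literal characters


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(unescaped, url))  # False: nothing in the pattern matches the "?"
print(matches(escaped, url))    # True
```

With the unescaped pattern, neither rule would ever pick up a link, even once the SSL error is out of the way.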

If anyone could help me, I would be incredibly grateful.