As an example, let's take http://skidpaste.org.
If you try to fetch the main page with the Python requests module:
>>> import requests
>>> r = requests.get('http://skidpaste.org')
>>> r.status_code
503
>>>
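That 503 is not an ordinary outage: it is CloudFlare's JavaScript-challenge page. A rough way to tell the two apart is to inspect the response headers and body (a sketch; the marker strings are assumptions based on typical CloudFlare challenge pages of that era):

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristic: does this response look like CloudFlare's JS challenge?"""
    server = headers.get('Server', '').lower()
    # The challenge page answers 503, identifies itself as cloudflare,
    # and embeds a 'jschl'-named challenge form (assumed markers).
    return (status_code == 503
            and 'cloudflare' in server
            and 'jschl' in body)
```

You would call it as `looks_like_cloudflare_challenge(r.status_code, r.headers, r.text)` on the response above.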
or with the mechanize module:
>>> import mechanize
>>> br = mechanize.Browser()
>>> br.set_handle_robots(False)
>>> br.open('http://skidpaste.org')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
>>>

The response codes differ, but the main point is the same: you don't get the website content.
If we open the resource in Firefox and wait for the actual website to load, we receive CloudFlare cookies, which are then checked every time we access the resource.
So the idea is to grab these cookies and pass them to my beloved requests module :)
The pseudocode looks like this:
1. Open website with selenium
2. Wait for 10 seconds
3. Get CloudFlare cookies
4. Close the Selenium browser.
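Before the full cookielib version below, note that the hand-off in step 3 can be much smaller: requests also accepts a plain name → value dict as cookies, so when the full cookie attributes don't matter you can flatten Selenium's get_cookies() output directly (a minimal sketch):

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Flatten Selenium's list-of-dicts cookies to a {name: value} dict."""
    # requests.get(url, cookies={...}) is happy with bare name/value pairs;
    # domain, path and expiry are dropped in this simplification.
    return dict((c['name'], c['value']) for c in selenium_cookies)
```

Then a request is simply `requests.get('http://skidpaste.org', cookies=selenium_cookies_to_dict(browser.get_cookies()), headers=h)`.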
Python example:
#!/usr/bin/python
from selenium import webdriver
from time import sleep
import cookielib
import requests

print 'Launching Firefox..'
browser = webdriver.Firefox()

print 'Entering to skidpaste.org...'
browser.get('http://skidpaste.org')

print 'Waiting 10 seconds...'
sleep(10)

a = browser.get_cookies()
print 'Got cloudflare cookies:\n'
print a

print 'Closing Firefox..'
browser.close()

h = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0'}
b = cookielib.CookieJar()
for i in a:
    # Rebuild each Selenium cookie as a cookielib.Cookie;
    # note 'rest' must be a dict, not False.
    ck = cookielib.Cookie(name=i['name'], value=i['value'],
                          domain=i['domain'], path=i['path'],
                          secure=i['secure'], rest={}, version=0,
                          port=None, port_specified=False,
                          domain_specified=False, domain_initial_dot=False,
                          path_specified=True, expires=i['expiry'],
                          discard=True, comment=None, comment_url=None,
                          rfc2109=False)
    b.set_cookie(ck)

r = requests.get('http://skidpaste.org', cookies=b, headers=h)
print len(r.content)
print r.status_code

The output:
# ./cloudflare_bypass.py
Launching Firefox..
Entering to skidpaste.org...
Waiting 10 seconds...
Got cloudflare cookies:

[{u'domain': u'.skidpaste.org', u'name': u'__cfduid', u'value': u'd8af70c3b49361a5a1b818e91171e598d1431355518', u'expiry': 1462891518, u'path': u'/', u'secure': False}, {u'domain': u'.skidpaste.org', u'name': u'cf_clearance', u'value': u'5857af9797c612cde4ac590fe900e0e9f3d7098f-1431355526-57600', u'expiry': 1431416726, u'path': u'/', u'secure': False}, {u'domain': u'skidpaste.org', u'name': u'PHPSESSID', u'value': u'eefc5d29f6cea1ddb70ca5a0baaf60e1', u'expiry': None, u'path': u'/', u'secure': False}]
Closing Firefox..
115026
200
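For follow-up requests it is tidier to install the cookies into a requests.Session once, instead of passing a jar to every call (a sketch; `session_from_selenium` is a hypothetical helper name, and reusing the browser's exact User-Agent is an assumption about how the cf_clearance cookie is validated):

```python
import requests

def session_from_selenium(selenium_cookies, user_agent):
    """Build a requests.Session preloaded with Selenium's cookies."""
    s = requests.Session()
    # Present the same User-Agent as the browser that passed the challenge
    # (assumed to matter for CloudFlare's cookie check).
    s.headers['User-Agent'] = user_agent
    for c in selenium_cookies:
        s.cookies.set(c['name'], c['value'],
                      domain=c['domain'], path=c['path'])
    return s
```

With the script above this would be `s = session_from_selenium(a, h['User-Agent'])`, then every `s.get('http://skidpaste.org/...')` carries the cookies automatically.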
Comments:
- Is it okay to use https://github.com/Anorov/cloudflare-scrape instead?
- After the sleep(10) the redirect is already done, so you can also work directly on the updated page in Selenium.