Enterprise search (Qichacha) cannot be searched with a Selenium headless browser
The site may have anti-crawler measures, and Selenium leaves detectable fingerprints, such as special properties on the page's global objects (for example, navigator.webdriver is true when the browser is under WebDriver control).
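To make the fingerprinting idea concrete, here is a minimal sketch of the kind of check an anti-crawler script might run. This is not any site's actual code: the function name and the dict-based simulation of browser globals are illustrative, though the probed properties (navigator.webdriver, the PhantomJS _phantom global, and the old ChromeDriver $cdc_ marker) are well-known, version-dependent fingerprints. In reality such a check runs as JavaScript inside the page.

```python
# Sketch of an anti-crawler fingerprint check. Browser globals are simulated
# as plain dicts so the logic can run outside a browser; a real check would be
# JavaScript probing the live navigator/window/document objects.
def looks_like_selenium(navigator, window, document):
    return bool(
        navigator.get("webdriver")                        # set under WebDriver control
        or window.get("_phantom")                         # injected by PhantomJS
        or any(k.startswith("$cdc_") for k in document)   # old ChromeDriver marker
    )

# A WebDriver-controlled page typically exposes navigator.webdriver:
print(looks_like_selenium({"webdriver": True}, {}, {}))  # True
# A normal browser exposes none of these:
print(looks_like_selenium({}, {}, {}))                   # False
```

A site that runs a check like this can refuse to return search results to the headless browser even though the page appears to load normally.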
I see that you have asked a lot of crawling questions, so here is a reminder: if you use ChromeDriver in headless mode, you may not be able to visit sites that insert JS scripts through document.write(). See the related question on Stack Overflow.
Example:
>>> from selenium import webdriver
>>> option = webdriver.ChromeOptions()
>>> option.add_argument('--headless')
>>> driver = webdriver.Chrome(chrome_options=option)
[0608/163830.206:ERROR:gpu_process_transport_factory.cc(1007)] Lost UI shared context.
DevTools listening on ws://127.0.0.1:60357/devtools/browser/36a1f861-d1ab-4cef-a5a9-3072bbada0fc
>>> driver.get('https://www.baidu.com')
[0608/163849.677:INFO:CONSOLE(715)] "A parser-blocking, cross site (i.e. different eTLD+1) script, https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/global/js/all_async_search_8d20902.js, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: https://www.baidu.com/ (715)
Here, https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/global/js/all_async_search_8d20902.js is written into the HTML through document.write() and then loaded. Headless Chrome may block the network request for such a parser-blocking, cross-site script, so the script is not executed and the warning above is reported.
Firefox does not have this problem, so I recommend using Firefox's headless mode, or PhantomJS, a headless browser.
Firefox example:
from selenium import webdriver

option = webdriver.FirefoxOptions()
option.add_argument('--headless')
driver = webdriver.Firefox(firefox_options=option)  # in Selenium 4+, pass options=option instead
driver.get('https://www.qichacha.com')
# ...
Of course, you need to install Firefox (and its geckodriver) before using it.