Project 1: Crawling Baidu Translate
Project purpose: Learn how to write a crawler that sends POST requests
The first step is to observe the webpage
Open the developer tools, type the text to be translated into the page, and watch the network requests. Each new input triggers a new sug request. Clicking on that request shows its URL and that the request method is POST. Every translation submits a form whose content is the text to be translated.
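To make the submitted form concrete, here is a minimal sketch of how a one-field form is serialized into a POST body. It uses only the standard library; the field name kw is the one used by the program in this project, and the helper name encode_form is hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def encode_form(fields):
    """Serialize a dict the way an HTML form POST body is encoded."""
    return urlencode(fields)


body = encode_form({"kw": "dog"})
print(body)            # kw=dog
print(parse_qs(body))  # {'kw': ['dog']}
```

This is the same kind of body that requests builds when a dict is passed through its data argument.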
The second step is to write the program
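# Import the required package
import requests
url = "https://fanyi.baidu.com/sug"  # the request URL
dat = {
    "kw": 'dog'
}  # try translating the word "dog"
resp = requests.post(url, data=dat)  # submit the request
print(resp.json())  # print the result
resp.close()  # close the response
# The output is: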
{'errno': 0, 'data': [{'k': 'dog', 'v': 'n. 狗; 蹩脚货; 丑女人; 卑鄙小人 v. 困扰; 跟踪'}, {'k': 'DOG', 'v': 'abbr. Data Output Gate 数据输出门'}, {'k': 'doge', 'v': 'n. 共和国总督'}, {'k': 'dogm', 'v': 'abbr. dogmatic 教条的; 独断的; dogmatism 教条主义; dogmatist'}, {'k': 'Dogo', 'v': '[地名] [马里、尼日尔、乍得] 多戈; [地名] [韩国] 道高'}]}
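The raw dictionary is hard to read, so a small helper can turn it into one line per suggestion. The sketch below is only an illustration: format_suggestions is a hypothetical name, the sample is a shortened copy of the output above, and treating errno == 0 as success is an assumption based on that output.

```python
def format_suggestions(payload):
    """Return 'word: meaning' lines from a sug response dict."""
    # Assumption: errno == 0 means the request succeeded
    if payload.get("errno") != 0:
        raise ValueError(f"request failed: errno={payload.get('errno')}")
    return [f"{item['k']}: {item['v']}" for item in payload.get("data", [])]


# A shortened copy of the response shown above
sample = {
    "errno": 0,
    "data": [
        {"k": "dog", "v": "n. 狗; 蹩脚货; 丑女人; 卑鄙小人 v. 困扰; 跟踪"},
        {"k": "doge", "v": "n. 共和国总督"},
    ],
}

for line in format_suggestions(sample):
    print(line)
```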
The third step is to optimize the integration process
So far, the word to translate has to be changed by hand in the submitted form. Next, we turn this into an input-and-output step by reading the word from the user. The complete program is as follows:
# Import the required package
import requests
url = "https://fanyi.baidu.com/sug"
s = input("请输入你要翻译的单词")  # prompt: "Enter the word you want to translate"
dat = {
    "kw": s
}
resp = requests.post(url, data=dat)
print(resp.json())
resp.close()  # close the response
When the input is dog, the following results are output:
请输入你要翻译的单词dog
{'errno': 0, 'data': [{'k': 'dog', 'v': 'n. 狗; 蹩脚货; 丑女人; 卑鄙小人 v. 困扰; 跟踪'}, {'k': 'DOG', 'v': 'abbr. Data Output Gate 数据输出门'}, {'k': 'doge', 'v': 'n. 共和国总督'}, {'k': 'dogm', 'v': 'abbr. dogmatic 教条的; 独断的; dogmatism 教条主义; dogmatist'}, {'k': 'Dogo', 'v': '[地名] [马里、尼日尔、乍得] 多戈; [地名] [韩国] 道高'}]}
The crawler works fine.
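For repeated lookups it can help to wrap the request in a function. The following is a sketch rather than a definitive version: lookup is a hypothetical name, the 5-second timeout is an arbitrary choice, and the post callable is passed in so the function can be tried with requests.post or with a stub.

```python
def lookup(word, post, url="https://fanyi.baidu.com/sug", timeout=5):
    """POST one word to the sug endpoint and return the decoded JSON."""
    resp = post(url, data={"kw": word}, timeout=timeout)
    try:
        resp.raise_for_status()  # raise an exception on HTTP 4xx/5xx
        return resp.json()
    finally:
        resp.close()  # always release the connection


# Usage (needs network access and the requests package):
#   import requests
#   print(lookup("dog", requests.post))
```

The timeout keeps the program from hanging if the server does not answer, and raise_for_status makes HTTP errors visible instead of failing later while decoding JSON.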
Project 2: Crawling Youdao Translation
Project purpose: Gain a preliminary understanding of website anti-crawling mechanisms
The first step is to observe the webpage
Open the developer tools, type the text to be translated into the page, and watch the network requests. Each new input triggers a new request to translate_o?smartresult=dict&smartresult=rule. Clicking on that request shows its URL and that the request method is POST. Every translation submits a form that contains the text to be translated. Besides i, which holds the text to be translated, three other fields change with every request: "salt", "sign", and "lts".
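One way to spot the changing fields is to copy two submitted forms out of the developer tools and compare them. The sketch below assumes the forms have been copied into dictionaries by hand; changing_fields is a hypothetical helper, and the values are placeholders, not real captured traffic.

```python
def changing_fields(form_a, form_b):
    """Return the sorted names of fields whose values differ between two forms."""
    keys = set(form_a) | set(form_b)
    return sorted(k for k in keys if form_a.get(k) != form_b.get(k))


# Placeholder values, not real captured traffic
first = {"i": "dog", "client": "fanyideskweb", "salt": "A1", "sign": "S1", "lts": "A"}
second = {"i": "cat", "client": "fanyideskweb", "salt": "B2", "sign": "S2", "lts": "B"}

print(changing_fields(first, second))  # ['i', 'lts', 'salt', 'sign']
```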
The second step is to explore the rules of form generation
Search for translate_o in the developer tools. The search leads to a source file named fanyi.min.js. Format (pretty-print) it so that the encryption logic is easier to read.
Search for salt in the formatted file to locate the part that computes these values locally.
The formatted file shows that the salt parameter is built from a timestamp and a random number, the sign parameter is the MD5 hash of "fanyideskweb" + the text to translate + salt + a fixed string ('Ygy_4c=r#e#4EX^NUGUc5'), and the lts parameter is salt without its last digit. Building these values requires a few standard-library modules:
- time gets the current timestamp
- random generates random numbers
- hashlib computes the MD5 hash of a string
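The rules above can be sketched as a small function with fixed inputs, which makes the relationship between the three values easy to check. This is one reading of the description, not the site's own code: build_keys is a hypothetical name, the timestamp and digit are arbitrary example values, and appending the random digit to the timestamp is an assumption chosen so that lts equals salt without its last digit. The fixed string is written as it appears in the program in the next step.

```python
import hashlib

SECRET = "Ygy_4c=r#e#4EX^NUGUc5"  # fixed string, as used in the program


def build_keys(word, timestamp_ms, rand_digit):
    """Build salt, sign and lts for one request (assumed scheme, see lead-in)."""
    lts = str(timestamp_ms)          # millisecond timestamp
    salt = lts + str(rand_digit)     # assumption: one random digit appended
    raw = "fanyideskweb" + word + salt + SECRET
    sign = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return {"salt": salt, "sign": sign, "lts": lts}


keys = build_keys("dog", 1700000000000, 7)  # arbitrary example inputs
print(keys["salt"])       # 17000000000007
print(keys["lts"])        # 1700000000000
print(len(keys["sign"]))  # 32
```

In the real program the timestamp would come from time.time() and the digit from random.randint.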
The third step is to write the code
import time
import random
import hashlib
import requests
# Define the request URL and read the text to translate
url = "https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
s = input("请输入你要翻译的单词")  # prompt: "Enter the word you want to translate"
# Build the form fields
u = 'fanyideskweb'
lts = str(int(time.time() * 1000))  # millisecond timestamp
salt = lts + str(random.randint(0, 9))  # timestamp plus one random digit, so lts is salt without its last digit
c = 'Ygy_4c=r#e#4EX^NUGUc5'
sign = hashlib.md5((u + s + salt + c).encode('utf-8')).hexdigest()
dat = {
    "i": s,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": salt,
    "sign": sign,
    "lts": lts,
    "bv": "6f1d3ad76bcde34b6b6745e8ab9dc20a",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_CLICKBUTTON"
}
# Send the request with browser-like headers
headers = {
    # 'Cookie': 'OUTFOX_SEARCH_USER_ID_NCOO=387409182.3548826; OUTFOX_SEARCH_USER_ID="[email protected]"; _ga=GA1.2.527249081.1606700101; _ntes_nnid=7d04aa5336af433ea9b89954bd6b05fe,1634886459403; P_INFO=jiangcong5055; _dd_s=logs=1&id=8d3b4893-9b9c-4d42-bfbd-58b7d8dc8d15&created=1654560604551&expire=1654561519417; ___rl__test__cookies=1654560619440',
    'Cookie': 'OUTFOX_SEARCH_USER_ID="[email protected]"',
    'Referer': 'https://fanyi.youdao.com/',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
}
resp = requests.post(url, headers=headers, data=dat)
# Print the result
print(resp.json())
resp.close()  # close the response
When the input is dog, the following results are output:
请输入你要翻译的单词dog
{'errorCode': 0, 'translateResult': [[{'tgt': '狗', 'src': 'dog'}]], 'type': 'en2zh-CHS', 'smartResult': {'entries': ['', 'n. 狗,犬;公狗,公狐,公狼;质量极差的东西;无吸引力的女子;卑鄙小人,无赖;<美,非正式>朋友\r\n', 'v. 困扰,纠缠;跟踪,尾随\r\n'], 'type': 1}}
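To pull just the translation out of this response, the nested translateResult list can be flattened. The helper below is a sketch: extract_translations is a hypothetical name, the sample is a shortened copy of the output above, and treating errorCode == 0 as success is an assumption based on that output.

```python
def extract_translations(payload):
    """Return (source, translation) pairs from a translate_o response dict."""
    # Assumption: errorCode == 0 means the request succeeded
    if payload.get("errorCode") != 0:
        raise ValueError(f"request failed: errorCode={payload.get('errorCode')}")
    return [
        (segment["src"], segment["tgt"])
        for paragraph in payload.get("translateResult", [])
        for segment in paragraph
    ]


# A shortened copy of the response shown above
sample = {
    "errorCode": 0,
    "translateResult": [[{"tgt": "狗", "src": "dog"}]],
    "type": "en2zh-CHS",
}

print(extract_translations(sample))  # [('dog', '狗')]
```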