1. 写在前面
这个网站也是作者最近接触到的一个APP应用市场类网站。讲实话,还是蛮适合新手朋友去动手学习的。毕竟爬虫领域要想进步,还是需要多实战、多分析!该网站中的一些小细节也是能够锻炼分析能力的,也有反爬虫处理。甚至是下载APP的话在Web端是无法拿到APK下载的直链,需要去APP端接口数据获取
2. 接口分析
需要抓取的内容为整个游戏板块(当然可以是所有板块甚至是关键词去搜素命中)。游戏板块包含了所有分类与子分类下APP信息,如下所示:文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
首先我们打开控制台发个包先,监测一下请求内容,如下所示:文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
这里可以直接把请求CURL出来,转换成Python代码,如下所示:文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
python
import requests
headers = {
"sec-ch-ua": ""Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"",
"Accept": "application/json, text/plain, */*",
"Referer": "https://appgallery.huawei.com/",
"Interface-Code": "6bc2bec970e616747dc0a99b57aa6730_1716973000358",
"sec-ch-ua-mobile": "?0",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"sec-ch-ua-platform": ""macOS""
}
url = "https://web-drcn.hispace.dbankcloud.com/edge/uowap/index"
params = {
"method": "internal.getTabDetail",
"serviceType": "20",
"reqPageNum": "1",
"uri": "b2b4752f0a524fe5ad900870f88c11ed",
"maxResults": "25",
"zone": "",
"locale": "en"
}
response = requests.get(url, headers=headers, params=params)
print(response.json())
这里直接请求,会失败!因为其中有一个细节的反爬虫检测,代码运行会提示接口验证失败,如下所示:文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
bash
{'rtnCode': 1002, 'rtnDesc': 'InterfaceCode Verification failed.'}
这个是什么原因?根据返回提示中接口检验的问题,回看请求头内携带有Interface-Code参数,大概率是这个参数的问题,重放失败!这个参数是动态的,不过好在并非算法加密生成文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
Interface-Code参数其实相当于一个动态注册的令牌,在我们请求接口的时候。需要去固定的接口请求并拿到值,然后携带这个参数的值进行后续的任何请求文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
注意!这个动态的值也是有时效性的,大概在一分钟时间不等~文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
文章源自灵鲨社区-https://www.0s52.com/bcjc/pythonjc/15616.html
如上图,动态获取Interface-Code参数的接口,可每次在提交页面请求之前,从这个接口拿动态值携带!
3. 爬虫开发
这里我们经过分析可以找到属于游戏板块全量分类的一个唯一ID,拿它可以获取到所有的子分类ID,如下所示:
完整链接地址如下:
bash
https://web-drcn.hispace.dbankcloud.com/edge/uowap/index?method=internal.getTabDetail&serviceType=20&reqPageNum=1&uri=33ef450cbac34770a477cfa78db4cf8c&maxResults=25&zone=&locale=en
这个接口请求之后,去解析拿到tabInfo下面的tabId字段即可!这是子分类板块的ID,用来后续请求细分栏目下的所有游戏类APP所需要的字段,同样放到上面链接替换uri,如下所示:
接下来就是详情页数据抓取,找到详情页面的接口,如下所示:
这里我们构造出对详情接口请求的具体实现功能,代码如下:
python
# appDetail是从子分类接口过来的结构化数据
appid = appDetail.get('appid', '')
if appid:
detail_link = f"https://appgallery.huawei.com/#/app/{appid}"
encoded_appid = quote(f"app|{appid}")
encoded_detail_link = quote(quote(detail_link))
detail_json_link = (
'https://web-drcn.hispace.dbankcloud.cn/uowap/index?'
'method=internal.getTabDetail&serviceType=20&reqPageNum=1&maxResults=25&'
f'uri={encoded_appid}&shareTo=¤tUrl={encoded_detail_link}&accessId=&'
f'appid={appid}&zone=&locale=zh'
)
response = requests.get(detail_json_link, headers=headers)
对详情页面接口构造完请求后同样将会得到JSON数据,需要对结构化数据进行解析,实现代码如下:
python
import json
import dateparser
item = {}
app_keys = [
"icoUintrori", "intro", "name", "sizeDesc", "versionName", "downurl",
"package", "commentCount", "appid", "md5", "kindName", "images",
"releaseDate", "developer", "appIntro", "authority", "portalUrl"
]
appdata = {}
response_data = json.loads(response)
layoutDatas = response_data.get('layoutData', [])
for layoutData in layoutDatas:
# 详情里面数据分支结构不一样,在3、6中
if layoutData.get('dataList-type') in (3, 6):
appinfo = layoutData.get('dataList', [])[0]
appdata.update({key: value for key, value in appinfo.items() if key in app_keys and key not in appdata})
item['apk_name'] = appdata.get('name')
item['developer'] = appdata.get("developer")
item["apk_size"] = appdata.get("sizeDesc")
item['apk_version_name'] = appdata.get("versionName")
pub_time = appdata.get("releaseDate")
item['apk_update_time'] = dateparser.parse(pub_time).strftime('%Y-%m-%d %H:%M:%S') if pub_time else None
item['apk_screenshots'] = appdata.get("images")
item['apk_introduction'] = appdata.get("appIntro")
item['apk_download_times'] = process_downloads_times(appdata.get('intro', ''))
如上,拿到完整的APP信息数据(包括名称、版本、发布时间、大小、下载量、开发者...)其中process_downloads_times方法是一个自定义方法,对下载量数据进行清洗,页面下载量如下:
所以需要一个对该下载量数据清洗的方法,实现代码如下所示:
python
import re
def process_downloads_times(data):
if not data:
return None
try:
return int(data)
except (ValueError, TypeError):
pass
data = re.sub(r"[++,|次|,|<|>]", "", data).strip()
pa = re.compile(r"[d+.]+")
units = {
"百亿": 10000000000,
"千万": 10000000,
"百万": 1000000,
"十万": 100000,
"亿": 100000000,
"十": 10,
"百": 100,
"千": 1000,
"万": 10000,
}
for unit, number in units.items():
if unit in data:
num = "".join(pa.findall(data))
try:
return str(int(float(num) * number))
except ValueError:
return data
return data
4. 下载链接获取
接下来,最重要的就是对APK包的下载了,在Web端可以看到有一个getAppDownloadUrl的接口,但是它并不是APK包的下载链接,它是华为应用商店的APP下载链接,所以注意不要被误导了,如下所示:
要获取APK包的下载链接,需要从APP端入手,Web网站内不再提供APK的下载直链!以前是可以的,早期的下载链接直接拼appid即可,早期链接如下所示:
bash
ttps://appgallery.cloud.huawei.com/appdl/C100210735
这里需要对手机端的华为应用商店APP进行抓包分析,拿到下载链接接口!最终APK在移动端的下载接口如下所示:
bash
https://store-drcn.hispace.dbankcloud.com/hwmarket/api/clientApi
这里下载接口的请求参数巨长,好在没有加密的参数字段,如下所示:
bash
data = 'apsid=1705374325915&brand=google&channelId=background&clientPackage=com.huawei.appmarket&cno=4010001&code=0200&contentPkg=&dataFilterSwitch=&deviceId=a7a6dcc273d12e3f63865060533dec04f24e5042e48a621c1b5e601953e1bb69&deviceIdRealType=4&deviceIdType=9&deviceSpecParams=%7B%22abis%22%3A%22arm64-v8a%2Carmeabi-v7a%2Carmeabi%22%2C%22deviceFeatures%22%3A%22U%2Ccom.verizon.hardware.telephony.lte%2Ccom.verizon.hardware.telephony.ehrpd%2CP%2CB%2C0c%2Ce%2C0J%2Cp%2Ca%2Cb%2C04%2Cm%2Candroid.hardware.wifi.rtt%2Ccom.google.android.feature.PIXEL_2017_EXPERIENCE%2C08%2C03%2CC%2CS%2C0G%2Cq%2Ccom.google.android.feature.PIXEL_2018_EXPERIENCE%2CL%2C2%2C6%2Ccom.google.android.feature.GOOGLE_BUILD%2CY%2C0M%2Candroid.hardware.vr.high_performance%2Cf%2C1%2C07%2C8%2C9%2Candroid.hardware.sensor.hifi_sensors%2Candroid.hardware.strongbox_keystore%2CO%2CH%2Ccom.google.android.feature.TURBO_PRELOAD%2Candroid.hardware.vr.headtracking%2CW%2Cx%2CG%2Co%2C06%2C0N%2Ccom.google.android.feature.PIXEL_EXPERIENCE%2C3%2CR%2Cd%2CQ%2Cn%2Candroid.hardware.telephony.carrierlock%2Cy%2CT%2Ci%2Cr%2Cu%2Ccom.google.android.feature.WELLBEING%2Cl%2C4%2C0Q%2CN%2Candroid.software.device_id_attestation%2CM%2C01%2C09%2CV%2C7%2C5%2C0H%2Cg%2Cs%2Cc%2CF%2Ct%2C0L%2C0W%2Ccom.google.hardware.camera.easel_2018%2Ccom.google.android.apps.dialer.SUPPORTED%2C0X%2Ck%2C00%2Ccom.google.android.feature.GOOGLE_EXPERIENCE%2Ccom.google.android.feature.EXCHANGE_6_2%2Ccom.google.android.apps.photos.PIXEL_2018_PRELOAD%2Candroid.hardware.sensor.assist%2Ccom.google.android.feature.DREAMLINER%2Candroid.hardware.audio.pro%2CK%2CE%2C02%2CI%2CJ%2Cj%2CD%2Ch%2Candroid.hardware.wifi.aware%2CX%2Cv%22%2C%22dpi%22%3A560%2C%22openglExts%22%3A%221%2C0y%2C1d%2C1e%2C1f%2C0c%2CM%2CO%2C0q%2C0r%2C0s%2CB%2CA%2C5%2C4%2C0p%2CD%2CE%2C0m%2C0n%2C7%2C0i%2C0j%2C0k%2C8%2C0t%2C0w%2C1g%2C1o%2C1m%2C1n%2CH%2C0u%2C1i%2C1l%22%2C%22preferLan%22%3A%22zh%22%2C%22usesLibrary%22%3A%225%2C6%2Ccom.vzw.apnlib%2C09%2C0D%2Ccom.google.android.camera.experimental2018%2CA%2C0E%2C0L%2Ccom.google.android.poweranomalydatamodeminterface%2C9%2C2%2Cb%2C0C%2Ccom.android.ims.rcsmanager%2CE%2C1%2Ccom.verizon.embms%2Ccom.qualcomm.qti.uim.uimservicelibrary%2Ccom.google.android.lowpowermonitordevicefactory%2Ccom.google.android.lowpowermonitordeviceinterface%2Ccom.google.android.poweranomalydatafactory%2C0H%2C08%2C7%2Ccom.google.android.dialer.support%2Ccom.google.vr.platform%2CF%2Ccom.verizon.provider%2Cd%2C0A%2Ccom.google.android.hardwareinfo%2CB%2C0K%2CC%2C07%22%7D&fid=0&globalTrace=null&gradeLevel=0&gradeType=&hardwareType=0&isSupportPage=1&manufacturer=Google&maxResults=25&method=client.getTabDetail&net=1&outside=0&recommendSwitch=1&reqPageNum=1&roamingTime=0&runMode=2&serviceType=0&shellApkVer=0&sid=1705386105840&sign=h9001090fp00010920000000000001000a0000000600100000011190000010000040240b0100001001000%40000F6C248B294FAD99F665A13B23795D&thirdPartyPkg=com.huawei.appmarket&translateFlag=1&ts={ts}&uri=app%7C{appid}__HiAd__fed345ef91e24079b46e93c92d7e7d4e__cds_111122701__21__null____null%3Faglocation%3D%257B%2522cres%2522%253A%257B%2522lPos%2522%253A0%252C%2522lid%2522%253A%2522905310%2522%252C%2522pos%2522%253A21%252C%2522relResId%2522%253A%2522{appid}%2522%252C%2522relType%2522%253A%2522app%2522%252C%2522resid%2522%253A%2522{appid}%2522%252C%2522rest%2522%253A%2522app%2522%252C%2522tid%2522%253A%2522dist_83ae03cda9994714a499347e25bbccd1%2522%257D%252C%2522ftid%2522%253A%2522dist_25f5040f852a48c1b3bb52c3a99f4cf3%2522%252C%2522pres%2522%253A%257B%2522lPos%2522%253A0%252C%2522pos%2522%253A1%252C%2522resid%2522%253A%252283ae03cda9994714a499347e25bbccd1%2522%252C%2522rest%2522%253A%2522tab%2522%252C%2522tid%2522%253A%2522dist_25f5040f852a48c1b3bb52c3a99f4cf3%2522%257D%257D%26templateId%3D1428744958c44b80aad5091114d7fce4%26maple%3D0%26trackId%3D0%26attribution%3D%257B%2522taskid%2522%253A%2522910297447%2522%252C%2522subTaskId%2522%253A%25229102974470000%2522%252C%2522RTAID%2522%253A%2522%2522%252C%2522callback%2522%253A%2522security%253A44D8410ED7A32A50E8AABFD6%253A03EC36808D57EC58A89C207497BB1CED2B6192ED09D44BAC9D825C50BF49E14A95A9E0E6281F83A1DA4C1F2E6D7A8A%2522%252C%2522channel%2522%253A%25220%2522%257D%26appoid%3Ddde9f2d763154f37acc76b8620d195a9%26listId%3Dcds_111122701%26listIdType%3Dcds%26adFlag%3DHiAd%26cdrInfo%3D20240116142810el21s692259%255E%257BopType%257D%255E910297447%255E{appid}%255Ecds_111122701%255E21%255E608743720197185117_p21%255E608743720197185117%255E14effd3e76d34d76d6aa4185d8d08e5bb340dfdcb8ce0227285d09b8c9001ce2f315fce6f842b5449b4eac5ebebbd78bf7f6f470baaadc081ae820e43b6653825f66a6e7b6f2892cd760d063df840dd9954254fb48fa6a9f19259d68c7170c54%255Ectr%255Efn5-Y29tLmh1YXdlaS5hcHBtYXJrZXR-LTF-MTcwNTMxMDUxODQ0OQ%255E2024-01-16%2B14%253A28%253A10%255E%255EU8800%255E0.000126%255E0%255E2.0%255E2.0%255E9102974470000%255E900011964%255E%255E%255E%255E%255E11.4.1.303%255E1705374325914%255E0%255E1%255E%255E%255E%255E9%255E00000000-0000-0000-0000-000000000000%255E1%255E%255E%255ECN%255ECNY%255E243728192482956951%255Eappgallery%255E%255E%255E%255E2ea67221765846d69b880f43a78a7ab3%255ECPD%255E69%255E%255E%255E243728192570833732%255EWIFI%255EMA%255E%255EZXhwU3RyYXRlZ3k9I2V4cFBhcmFtcz0xMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnkjYnVja2V0SWQ9I2V4cERlcz0xMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnkjZXhwSWQ9MTMxN19WMi5hZ18xMzE3X1YyX2FsbF90YXJnZXRfb3B0aW1pemF0aW9uX29uZWZvcmFsbHwxMzAxLmFncmVjX2Fkdl9leHRlbmRfdWJyX2Rvd25fYXBwXzF5X3dpdGhfcmVhbHRpbWVfcXVlcnl8MTMwN19WMi5hZ18xMzA3X1YyX2NvbnRyb2xsZXI%255E%255E%255E%255E1%255E1%255E4010001%255E%255E3G%255E%255E0%255E%255E%255E%255E%255E%255E0%255E1%255E1027162%255E%255E1%255EP8%255E%255E%255Ecom.huawei.appmarket%255E%255E%255E%255E%255E%255E%255E%255E%255Ecom.huawei.appmarket%255E0.000252%255E%255E%255E20240116142810el21s692259%255E20240116142810el21s692182%255E1%255E0%255E%255E%255Esign%253A987064c954dab704e2b30dfe17f5ca63dae4245392d4fff32e75670e44da8998%26version%3D2%26phase%3D0%26requestId%3D2ea67221765846d69b880f43a78a7ab3%26algExp%3D%257B%2522expStrategy%2522%253A%2522246353%2522%252C%2522expParams%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%252C%2522bucketId%2522%253A%252236%2522%252C%2522expDes%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%252C%2522expId%2522%253A%25221112.appstore_mtp_fulllist_new_baseline_dataops%2522%257D&ver=1.1'
这里我们拿一个APP的appid对移动端的详情进行请求,可以看到接口数据内包含APK的下载链接,如下所示:
最终,完整的APP类爬虫运行及数据抓取效果如下所示:
评论