踏上起點的旅途: Web Scraping Notes - Cookies and Consecutive Crawling

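The article content does not display normally on this page at first. After entering the board,

note the over18=1 cookie that appears.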
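As a rough illustration of what sending this cookie means, the sketch below builds a request with requests but never sends it, then prints the Cookie header that would go out. It only assumes the cookie name and value noted above; nothing is fetched from the site.

```python
import requests

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    'GET',
    'https://www.ptt.cc/bbs/Gossiping/index.html',
    cookies={'over18': '1'},  # the cookie observed after entering the board
)
prepared = request.prepare()

# The cookies dict is serialized into a single Cookie request header.
cookie_header = prepared.headers['Cookie']
print(cookie_header)
```

Passing cookies= to requests.get, as the full script does, produces the same header on the real request.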

The article summaries (titles) are in <div class='title'> elements.

The link to the previous page is in an <a class='btn wide'> element.
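The two selectors can be tried offline first. The sketch below runs them against a small made-up HTML fragment: the markup, post titles, and index number are assumptions for illustration, not a copy of the real page. It uses Python's built-in 'html.parser' so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure described above.
SAMPLE_HTML = """
<div class="paging">
  <a class="btn wide" href="/bbs/Gossiping/index1.html">最舊</a>
  <a class="btn wide" href="/bbs/Gossiping/index100.html">‹ 上頁</a>
  <a class="btn wide disabled">下頁 ›</a>
</div>
<div class="title"> <a href="/bbs/Gossiping/M.1.html">first post</a> </div>
<div class="title"> <a href="/bbs/Gossiping/M.2.html">second post</a> </div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Article summaries: the text of every <div class="title">, whitespace trimmed.
titles = [div.text.strip() for div in soup.find_all('div', class_='title')]

# Previous-page link: the <a class="btn wide"> whose text contains '上頁'.
# A two-word class_ string matches only when the class attribute is exactly
# "btn wide", so the disabled link is skipped.
prev_href = None
for link in soup.find_all('a', class_='btn wide'):
    if '上頁' in link.text:
        prev_href = link['href']

print(titles)
print(prev_href)
```

Matching on the link text is needed because several paging links share the same class.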

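import requests
from bs4 import BeautifulSoup  # the 'html5lib' parser needs the html5lib package installed

BASE_URL = 'https://www.ptt.cc'
START_URL = 'https://www.ptt.cc/bbs/Gossiping/index.html'
MAX_PAGES = 9  # number of index pages to crawl


def catchurl(url=START_URL):
    """
    Print the article summaries on the given URL and return the URL
    of the previous page (None if the page could not be read).
    """
    cookie = {'over18': '1'}  # send the over18 cookie
    header = {'user-agent': 'myapp/0.0.1'}  # request header with a User-Agent
    response = requests.get(url, cookies=cookie, headers=header, timeout=10)

    # Parse only if the response is OK and its content type is HTML
    content_type = response.headers.get('content-type', '')
    if response.status_code != 200 or 'text/html' not in content_type:
        return None

    # Take the response text and parse it with html5lib
    soup = BeautifulSoup(response.text, 'html5lib')
    # Print the title of the page currently being crawled
    print(soup.title.text)

    # Locate the elements that hold the article summaries
    for title in soup.find_all('div', class_='title'):
        # Print only the text, with surrounding whitespace removed
        print(title.text.strip())

    # Locate the links to the other pages
    for page in soup.find_all('a', class_='btn wide'):
        if '上頁' in page.text and page.has_attr('href'):
            # The "previous page" link is the next URL to crawl
            return BASE_URL + page['href']
    return None


if __name__ == '__main__':
    # Run only when this file is executed as a script
    url = START_URL
    # Crawl the article summaries of 9 consecutive pages
    for c in range(1, MAX_PAGES + 1):
        url = catchurl(url)
        print(c)  # which page has just been crawled
        if url is None:
            break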
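The paging logic can also be checked on its own, without touching the network. In the minimal sketch below, crawl and FAKE_LINKS are hypothetical names and the index numbers are invented; the dictionary stands in for the fetch-and-parse step, and urljoin from the standard library builds the absolute URL from the relative href.

```python
from urllib.parse import urljoin


def crawl(start_url, get_prev_href, max_pages=9):
    """Visit up to max_pages pages, following the previous-page href each time."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        href = get_prev_href(url)  # relative href, or None when there is no link
        url = urljoin(url, href) if href else None
    return visited


# Stand-in for fetching and parsing real pages (invented index numbers).
FAKE_LINKS = {
    'https://www.ptt.cc/bbs/Gossiping/index.html': '/bbs/Gossiping/index3.html',
    'https://www.ptt.cc/bbs/Gossiping/index3.html': '/bbs/Gossiping/index2.html',
    'https://www.ptt.cc/bbs/Gossiping/index2.html': None,
}

pages = crawl('https://www.ptt.cc/bbs/Gossiping/index.html', FAKE_LINKS.get)
for page_url in pages:
    print(page_url)
```

A loop with an explicit page limit stops cleanly both when the limit is reached and when a page has no previous-page link.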

Labels: web scraping

Friday, May 19, 2023