踏上起點的旅途: Web scraping notes - extracting tags from HTML

import requests

url='https://www.ptt.cc/bbs/movie/index.html'

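# Without browser-like headers such as User-Agent, the server may reject the request with 403 Forbidden.
# A User-Agent string is normally produced when a real user visits through a browser,
# so a request that lacks one is easy to identify as coming from a script.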

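response=requests.get(url,headers={'User-Agent':'paste the User-Agent value shown in the Network tab of the browser DevTools (F12)'})

# Use requests to send a GET request to the given URL and store the server's HTTP response
# in the response object. The headers argument sets the User-Agent so the request is not blocked as a malicious program.

data=response.text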

----------------------------------------------------

data=<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>看板 movie 文章列表 - 批踢踢實業坊</title> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print"> </head> <body> <div id="topbar-container"> <div id="topbar" class="bbs-content"> <a id="logo" href="/bbs/">批踢踢實業坊</a>

(remainder omitted)

-------------------------------------

# The body of the HTTP response is stored in data as a string; it is the page source shown above.
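As a hedged sketch of the request step above: the helper below sends the same kind of GET request, but also sets a timeout and raises an error on a 4xx/5xx status (such as the 403 mentioned earlier) instead of quietly handing an error page to the parser. The name `fetch_html`, the `session` parameter, and the placeholder User-Agent string are assumptions made for illustration; they are not part of the original notes.

```python
import requests

# Hypothetical placeholder: replace it with the User-Agent string your own browser sends.
USER_AGENT = "Mozilla/5.0 (placeholder; copy the real value from DevTools)"


def fetch_html(url, user_agent=USER_AGENT, session=None, timeout=10):
    """Return the page source of url, raising if the server rejects the request."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses, for example 403 Forbidden.
    response.raise_for_status()
    return response.text
```

It could be called as `data = fetch_html('https://www.ptt.cc/bbs/movie/index.html')`. Passing a `session` is optional and mainly makes the helper easy to exercise without touching the network.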

import bs4

# Inspecting the page source shows that it is HTML, so we use bs4 (Beautiful Soup), a module for parsing HTML.

root = bs4.BeautifulSoup(data,'html.parser')

# print(root.title) # the <title> tag inside root

# print(root.title.string) # the text inside root's <title> tag

"""

            <div class="title">

                <a href="/bbs/movie/M.1683726942.A.FA9.html">[討論] 絕地營救是不是沒有發行原聲帶？</a>

            </div>

"""

titles=root.find('div',class_='title') # find the first <div class="title"> tag in the page, which is where the title we want lives

# To get every entry in the list, use find_all instead
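A minimal sketch of the `find_all` variant mentioned above, run against a small inline HTML sample rather than the live page. The helper name `list_titles` and the sample markup are hypothetical, and the second sample entry assumes that some list entries may contain no `<a>` tag at all, in which case they are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the snippet shown above; not real page output.
SAMPLE_HTML = """
<div class="title">
    <a href="/bbs/movie/M.1683726942.A.FA9.html">[討論] 絕地營救是不是沒有發行原聲帶？</a>
</div>
<div class="title">
    (entry without a link)
</div>
"""


def list_titles(html):
    """Return (title, href) pairs for every <div class="title"> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="title"):
        link = div.a  # None when the div has no <a> child
        if link is None:
            continue
        results.append((link.get_text(strip=True), link["href"]))
    return results


for title, href in list_titles(SAMPLE_HTML):
    print(title, href)
```

`find_all` returns a list-like result even when nothing matches, so the loop simply does nothing on an empty page.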

"""Note!

titles=root.find('div',class_='title')

In this line, the title we want is inside the div tag whose class is 'title'.

class is a reserved word in Python, so Beautiful Soup expects the keyword argument class_ instead.

"""
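For comparison, here is a small sketch of three ways to express the same lookup in Beautiful Soup: the `class_` keyword, an `attrs` dictionary, and a CSS selector via `select_one`. The inline HTML is a made-up sample.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only.
html = '<div class="title"><a href="/example.html">Example title</a></div>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("div", class_="title")         # class_ avoids the reserved word
by_attrs = soup.find("div", attrs={"class": "title"})  # a dict key may be spelled "class"
by_css = soup.select_one("div.title")                  # CSS selector syntax

print(by_keyword == by_attrs == by_css)
```

Which form to use is mostly a matter of taste; the `attrs` dictionary is handy when the attribute name is not a valid Python identifier.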

print(titles,type(titles)) # <class 'bs4.element.Tag'> 

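# In bs4, .text converts a Tag into a str.

# You can also use titles.a directly, which is the a tag inside titles: <a href="/bbs/movie/M.1683726942.A.FA9.html">[討論] 絕地營救是不是沒有發行原聲帶？</a>

# Adding .string, as in titles.a.string, gives the string inside the a tag: [討論] 絕地營救是不是沒有發行原聲帶？

titles_str=titles.text  # converts the Tag object into a str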

print('-'*20)

print(titles_str) #[討論] 絕地營救是不是沒有發行原聲帶？

------------------------------------

<div class="title"> <a href="/bbs/movie/M.1683811357.A.A89.html">[新聞] 荷莉貝莉入選《小美人魚》嗨哭整天</a> </div> <class 'bs4.element.Tag'>

--------------------

[新聞] 荷莉貝莉入選《小美人魚》嗨哭整天

-------------------------------------
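The accessors discussed above (`.text`, `.a`, `.a.string`) behave slightly differently, which the following sketch tries to make visible on a made-up snippet. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up sample with the same shape as a title entry: whitespace around the link.
html = '<div class="title">\n  <a href="/example.html">Example title</a>\n</div>'
tag = BeautifulSoup(html, "html.parser").find("div", class_="title")

raw_text = tag.text                    # all nested text, including surrounding whitespace
clean_text = tag.get_text(strip=True)  # same text with whitespace stripped
div_string = tag.string                # None, because the div has more than one child node
link_string = tag.a.string             # the single string inside the <a> tag
link_href = tag.a["href"]              # attribute access works like a dictionary

print(repr(raw_text), repr(clean_text), div_string, link_string, link_href)
```

In practice `get_text(strip=True)` or `.a.string` tends to be more convenient than `.text` when the markup is indented, since `.text` keeps the line breaks.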

'''

import re

from bs4 import BeautifulSoup

html = """

<div class="title">

    <a href="/bbs/movie/M.1683637198.A.639.html">《哆啦A夢》有新貓了！偶像男團成員為他獻聲　守</a>

</div>

"""

soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.find('div', class_='title')

title_str = title_tag.text   # convert the Tag object into a string

pattern = r'《(.+)》'

title = re.findall(pattern, title_str)[0]

print(title)   # Output: 哆啦A夢

'''
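A slightly more defensive variant of the regex step above, offered as a sketch rather than a replacement: it uses a non-greedy pattern so that a title containing several 《…》 pairs yields only the first one, and it returns `None` instead of raising `IndexError` when nothing matches. The helper name `first_quoted_title` is hypothetical.

```python
import re

# Non-greedy: stop at the first closing bracket instead of the last one.
QUOTED_TITLE = re.compile(r"《(.+?)》")


def first_quoted_title(text):
    """Return the text inside the first 《...》 pair, or None if there is none."""
    match = QUOTED_TITLE.search(text)
    return match.group(1) if match else None


print(first_quoted_title("《哆啦A夢》有新貓了！"))
print(first_quoted_title("[討論] no brackets here"))
```

With the greedy pattern `《(.+)》`, an input such as `《A》和《B》` would capture `A》和《B`; the non-greedy form captures `A`.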

Labels: web scraping

Thursday, May 11, 2023
