踏上起點的旅途: Web Scraping Notes - html5lib (Fault Tolerance) & BeautifulSoup

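Beautiful Soup is a Python library for parsing HTML and XML documents, and it supports several parsers. Among them, html.parser is Python's built-in parser: it is fast, but it tolerates malformed markup poorly.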

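Another commonly used parser is html5lib. It follows the HTML5 standard and is highly fault tolerant: it copes with unusual HTML, correctly parses incomplete or malformed tags, and produces a consistent parse tree. html5lib is slower, especially on large documents, but pairing it with Beautiful Soup gives a more accurate parse of the HTML. Beautiful Soup in turn provides a simple, intuitive API that makes it easy to find and extract page elements for data processing and analysis.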
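To make the fault-tolerance difference concrete, here is a minimal sketch that feeds the same broken fragment to both parsers. The fragment `<a></p>` and the helper name `parse_with` are assumptions chosen for illustration; the helper skips any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Parse markup with the named parser and return the tree as a string.

    Hypothetical helper: returns None when that parser is not installed.
    """
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# A deliberately broken fragment: an unclosed <a> and a stray </p>.
broken = "<a></p>"

results = {name: parse_with(broken, name) for name in ("html.parser", "html5lib")}
for name, tree in results.items():
    print(f"{name:12} -> {tree}")
```

html.parser simply ignores the stray `</p>`, whereas html5lib (when installed) wraps the fragment in a complete `<html>` document and pairs the `</p>` with an opening `<p>`, much as a browser would.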

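import requests
from bs4 import BeautifulSoup

url = 'https://ithelp.ithome.com.tw/users/20134430/ironman/4307'
# Send an HTTP request to the URL with a custom User-Agent header
response = requests.get(url, headers={
    'User-Agent': 'my-app/0.0.1'
})
# html5lib is more fault tolerant than html.parser, but slower.
# Note: error recovery can make the parsed tree differ from the raw markup.
# If an element you expect cannot be found, consider switching parsers;
# html5lib is the most fault-tolerant one.
data = response.text
# Get the HTML body of the response
soup = BeautifulSoup(data, 'html5lib')
# Convert the HTML into a BeautifulSoup object,
# using the fault-tolerant but slow html5lib parser
# (html5lib is a separate package: pip install html5lib)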
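The advice in the comments above (switch parsers when an expected element cannot be found) can be sketched as a small fallback loop. The helper name `find_with_fallback`, its parser order, and the sample markup are assumptions for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_with_fallback(markup, tag_name, parsers=("html.parser", "html5lib")):
    """Try each parser in order; return (parser_name, tags) for the first hit.

    Hypothetical helper: returns (None, []) if no installed parser finds the tag.
    """
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # parser not installed, try the next one
        tags = soup.find_all(tag_name)
        if tags:
            return parser, tags
    return None, []


# Well-formed markup: the fast built-in parser is enough.
parser, items = find_with_fallback("<ul><li>one</li><li>two</li></ul>", "li")
print(parser, [li.get_text() for li in items])

# Broken markup: html.parser discards the stray </p>, so only a more
# lenient parser (if installed) can produce a <p> element.
parser, paragraphs = find_with_fallback("<a></p>", "p")
print(parser, paragraphs)
```

Trying the fast parser first keeps the common case cheap and pays the html5lib cost only when the quick parse comes up empty.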

Print the page title

print(soup.title)  # the <title> tag, including the tag itself and its text

print(soup.title.get_text())  # only the text inside the <title> tag

print(soup.title.text)  # same as above

print('-'*30)

-----------------------------------------------------------

<title>網路爬蟲，萬物皆爬 - 30 天搞懂並實戰網路爬蟲及應對反爬蟲技術 :: 2021 iThome 鐵人賽</title>
網路爬蟲，萬物皆爬 - 30 天搞懂並實戰網路爬蟲及應對反爬蟲技術 :: 2021 iThome 鐵人賽
網路爬蟲，萬物皆爬 - 30 天搞懂並實戰網路爬蟲及應對反爬蟲技術 :: 2021 iThome 鐵人賽
------------------------------
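The three ways of reading the title can be tried offline on a small inline document. This is a sketch under assumptions: the sample HTML and variable names are made up, and html.parser is swapped in so the snippet runs without html5lib installed. It also guards against pages that have no `<title>`, where `soup.title` is `None`.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # built-in parser, no html5lib needed

title_tag = soup.title          # a Tag object, or None if the page has no <title>
print(title_tag)                # the whole element, tags included
print(title_tag.get_text())     # text only
print(title_tag.text)           # property shorthand for get_text()
print(title_tag.string)         # the single string child of the tag

# Guard against pages without a <title> before reading its text.
untitled = BeautifulSoup("<p>No title here</p>", "html.parser")
page_title = untitled.title.get_text() if untitled.title else "(no title)"
print(page_title)
```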

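# Get the text of the first <li> element

print(soup.li.text)

print(soup.find('li').getText)  # no parentheses, so this prints the bound method; call getText() to get the text

print('尋找索地的<li>標籤文字')  # label printed before the text of all <li> tags

lis = soup.find_all('li')

# Find all <li> tags and store them in lis as a list-like ResultSet

print(len(lis))

for i in lis:

    print(i.text)

# Print the text inside each tag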

------------------------------
技術問答
<bound method PageElement.get_text of <li class="menu__item"> <a class="menu__item-link menu__item-link--pl" href="https://ithelp.ithome.com.tw/questions"> 技術問答</a> </li>>
尋找索地的<li>標籤文字
技術問答
技術文章
iT 徵才

... (omitted)
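The bound-method line in the output above comes from writing `getText` without parentheses. The sketch below reproduces that on a made-up menu (the HTML and names are assumptions) and shows `get_text(strip=True)` as one way to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="menu__item"><a href="/questions"> Questions</a></li>
  <li class="menu__item"><a href="/articles"> Articles</a></li>
  <li class="menu__item"><a href="/jobs"> Jobs</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")               # same element that soup.li returns
print(first.getText)                  # no parentheses: prints the bound method
print(first.getText())                # called: prints the text
print(first.get_text(strip=True))     # strip=True trims surrounding whitespace

items = soup.find_all("li")           # ResultSet, behaves like a list
print(len(items))
texts = [li.get_text(strip=True) for li in items]
print(texts)
```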

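# To read a tag's attributes,

# treat the tag like a dictionary

'''

 <div><a href="#" class="invitation-list__name">{{ result.label }}</a></div>

 soup.a["href"] returns the value of that attribute: #

 soup.a["class"] returns ['invitation-list__name'] (class is multi-valued, so the result is a list)

'''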

tes = soup.find_all('a')

# tes holds every <a> element

for i in tes:

    # Loop over the tags and check whether each one has an href attribute

    if 'href' in i.attrs:

        # If it does, print the attribute value i['href']

        print(i['href'])

---------------------------------------

https://ithelp.ithome.com.tw/questions
https://ithelp.ithome.com.tw/articles?tab=tech
https://ithelp.ithome.com.tw/articles?tab=job
https://ithelp.ithome.com.tw/tags
https://ithelp.ithome.com.tw/talks
/2022ironman?utm_source=ithelp&utm_medium=navbar&utm_campaign=ironman14
https://ithelp.ithome.com.tw/users/login
https://ithelp.ithome.com.tw/users/20134430/points
https://ithelp.ithome.com.tw/users/20134430/traced
https://ithelp.ithome.com.tw/messages/getGroup/20134430
https://ithelp.ithome.com.tw/users/login
https://ithelp.ithome.com.tw/users/20134430/profile
https://ithelp.ithome.com.tw/users/20134430/questions
https://ithelp.ithome.com.tw/users/20134430/articles
https://ithelp.ithome.com.tw/users/20134430/answers
https://ithelp.ithome.com.tw/users/20134430/invited
https://ithelp.ithome.com.tw/users/20134430/best_answers

... (omitted)
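The link list above mixes absolute URLs with a relative one (`/2022ironman?...`). As a sketch, assuming a made-up snippet of HTML and `https://ithelp.ithome.com.tw/` as the base URL, the following shows dictionary-style attribute access, the `href=True` filter as an alternative to checking `i.attrs`, and `urllib.parse.urljoin` for resolving relative links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://ithelp.ithome.com.tw/"  # assumed base for resolving relative links

html = """
<a class="menu__item-link" href="https://ithelp.ithome.com.tw/questions">Questions</a>
<a class="nav-link nav-link--active" href="/2022ironman">Ironman</a>
<a name="top">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(first["href"])      # dictionary-style access to one attribute
print(first["class"])     # class is multi-valued, so this is a list
print(first.get("id"))    # .get() returns None instead of raising KeyError

# href=True keeps only <a> tags that actually have an href attribute,
# which replaces the explicit "'href' in tag.attrs" check.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```

Using `.get()` is the safer choice when an attribute may be missing, since `tag["id"]` on a tag without that attribute raises `KeyError`.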

# Scenario: get the URL of the article "【Day 01】- 前言: 從 0 開始的網路爬蟲" (Preface: web scraping from zero)

work = soup.select('.qa-list__title-link')

print(type(work))

for i in work:

    if "01" in i.text and 'href' in i.attrs:

        print(i)

print('-----'*10)

link=soup.find('a',class_='qa-list__title-link')

print(link)

print('-----'*10)

print(link['href'].strip())

# str.strip() removes characters from both ends of a string; by default it removes whitespace

-----------------------------------

<class 'bs4.element.ResultSet'>
<a class="qa-list__title-link" href=" https://ithelp.ithome.com.tw/articles/10263628 "> 【Day 01】- 前言: 從 0 開始的網路爬蟲 </a>
我有跑嗎
--------------------------------------------------
<a class="qa-list__title-link" href=" https://ithelp.ithome.com.tw/articles/10263628 "> 【Day 01】- 前言: 從 0 開始的網路爬蟲 </a>
--------------------------------------------------
https://ithelp.ithome.com.tw/articles/10263628
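The same lookup can be written with a CSS selector or with `find_all` and `class_`. The sketch below runs on made-up markup that copies the padded `href` values seen above (the `example.com` URLs and the titles are assumptions). It matches on "Day 01" instead of just "01", which is a little less likely to hit an unrelated title.

```python
from bs4 import BeautifulSoup

html = """
<h3><a class="qa-list__title-link" href=" https://example.com/articles/1 "> [Day 01] Introduction </a></h3>
<h3><a class="qa-list__title-link" href=" https://example.com/articles/2 "> [Day 02] Next steps </a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

by_css = soup.select(".qa-list__title-link")                # every match (ResultSet)
by_find = soup.find_all("a", class_="qa-list__title-link")  # same elements via find_all
first_css = soup.select_one(".qa-list__title-link")         # first match, or None

# Pick the Day 01 entry and clean the whitespace-padded href.
day01 = next((a for a in by_css if "Day 01" in a.get_text()), None)
day01_url = day01["href"].strip() if day01 else None
print(day01_url)
```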

Labels: web scraping

Monday, May 15, 2023
