
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

These instructions illustrate all the major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

The examples in this documentation should work the same way in Python 2.7 and Python 3.2.

You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that Beautiful Soup 4 is recommended for all new projects. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.

This documentation has been translated into other languages by Beautiful Soup users:

This document is also available in Chinese.

This page is available in Japanese. (external link)

This document is also available in Korean. (external link)

Here’s an HTML document which will be used as an example throughout this document. It’s part of a story from Alice in Wonderland:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# [u'title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
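The lookups above all succeed because the tags exist. As a minimal sketch of what to expect when they do not, the snippet below uses a small made-up document (not the "three sisters" one) and assumes Beautiful Soup 4 is installed. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only for this illustration.
soup = BeautifulSoup('<p class="title"><b>The story</b></p>', "html.parser")

# find() returns None when nothing matches.
missing_tag = soup.find("table")

# find_all() returns an empty list-like result when nothing matches.
no_links = soup.find_all("a")

first_p = soup.find("p")

# .get() returns None for a missing attribute,
# whereas first_p["id"] would raise KeyError.
missing_attr = first_p.get("id")

# "class" is a multi-valued attribute, so it comes back as a list.
classes = first_p["class"]

print(missing_tag, len(no_links), missing_attr, classes)
```

Checking for None before chaining further attribute access (for example `soup.find("table").name`) avoids an AttributeError on documents that lack the tag.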

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
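Real pages often mix absolute links, relative links, and anchors with no href at all. The following is a sketch of one way to handle that, assuming Python 3 and an installed Beautiful Soup 4. `BASE_URL`, the sample markup, and the helper `extract_links` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup, used only for illustration.
BASE_URL = "http://example.com/stories/"
html = """
<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="lacie" id="link2">Lacie</a>
<a id="no-href">Tillie</a>
</p>
"""


def extract_links(markup, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True matches only tags that actually carry an href attribute,
    # so the third anchor above is skipped instead of yielding None.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(html, BASE_URL)
print(links)
```

Filtering with `href=True` avoids the `None` values that `link.get('href')` would produce for anchors without a target.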

Another common task is extracting all the text from a page:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
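The raw text above keeps the blank lines and line breaks of the source markup. If tidier output is wanted, `get_text()` accepts a separator and a `strip` flag, and `stripped_strings` yields the individual pieces. This sketch uses a short made-up document and assumes Beautiful Soup 4 is installed.

```python
from bs4 import BeautifulSoup

# A short, made-up document used only for this illustration.
html = (
    '<p class="story">Once upon a time <b>three</b> sisters lived in a well.</p>'
    "<p>The end.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Strip each text fragment, then join the fragments with a single space.
text = soup.get_text(" ", strip=True)

# The same fragments as a list, with surrounding whitespace removed.
pieces = list(soup.stripped_strings)

print(text)
print(pieces)
```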

Installing Beautiful Soup

If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4 (for Python 2)

$ apt-get install python3-bs4 (for Python 3)

Beautiful Soup 4 is published through PyPi, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3).

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.)

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.
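Whichever installation route is used, a quick sanity check is to try importing the package from the interpreter you intend to run. The helper below is a hypothetical sketch, not part of Beautiful Soup. It reports the installed version, or None when bs4 cannot be imported.

```python
def beautiful_soup_version():
    """Return the installed bs4 version string, or None if bs4 is not importable."""
    try:
        import bs4
    except ImportError:
        return None
    # Read __version__ defensively in case a build does not define it.
    return getattr(bs4, "__version__", "unknown")


version = beautiful_soup_version()
if version is None:
    print("bs4 is not installed for this interpreter")
else:
    print("bs4 version:", version)
```

Running it with both python and python3 shows which interpreters can actually see the package.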

Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it’s automatically converted to Python 3 code. If you don’t install the package, the code won’t be converted. There have also been reports on Windows machines of the wrong version being installed.

If you get the ImportError “No module named HTMLParser”, your problem is that you’re running the Python 2 version of the code under Python 3.

If you get the ImportError “No module named html.parser”, your problem is that you’re running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.
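Both ImportErrors come from the same cause: the standard-library parser module is named HTMLParser in Python 2 and html.parser in Python 3. The following sketch uses only the standard library and hypothetical helper names to show which name the running interpreter provides.

```python
import importlib
import sys


def parser_module_name():
    """Name of the standard-library HTML parser module for this interpreter."""
    # Python 2 ships the module as HTMLParser; Python 3 renamed it html.parser.
    return "html.parser" if sys.version_info[0] >= 3 else "HTMLParser"


def can_import(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


expected = parser_module_name()
other = "HTMLParser" if expected == "html.parser" else "html.parser"

print("This interpreter provides", expected, ":", can_import(expected))
print("Code for the other Python version would import", other, ":", can_import(other))
```

If the second line prints False, code written for the other Python version will fail with the matching ImportError described above.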

If you get the SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python’s 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4
