What is the purpose of the Beautifulsoup Python library 1 point?

Lazar Gugleta

Apr 8, 2020·5 min read

Top 5 Beautiful Soup Functions That Will Make Your Life Easier

Once you get into web scraping and data processing, you will find many tools that can do the job for you. One of them is Beautiful Soup, a Python library for pulling data out of HTML and XML files. It builds a parse tree from the document so that you can navigate it and extract data easily.
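To make the idea of a parse tree concrete, here is a minimal sketch that parses a small HTML string instead of a live page. The HTML snippet, the feature class name, and the variable names are made up for illustration and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html>
  <head><title>Demo shop</title></head>
  <body>
    <h1 id="productTitle">Example Laptop</h1>
    <ul>
      <li class="feature">16 GB RAM</li>
      <li class="feature">512 GB SSD</li>
    </ul>
  </body>
</html>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Walk the tree: the <title> tag, then every <li class="feature">.
page_title = soup.title.string
features = [li.get_text() for li in soup.find_all("li", class_="feature")]

print(page_title)
print(features)
```

Working on an inline string like this is a convenient way to try out selectors before pointing the script at a real page.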

Original photo by Joshua Sortino on Unsplash

The basic process goes something like this:

Get the data and then process it any way you want.

That is why today I want to show you some of the top functions that Beautiful Soup has to offer.

If you are also interested in other libraries like Selenium, I have written about Selenium and web scraping before. Before you begin with this tutorial, I recommend reading my article "Everything About Web Scraping", because it covers the setup process. If you are already more advanced with web scraping, try my advanced scripts, such as "How to Save Money with Python" and "How to Make an Analysis Tool with Python".

Also, a good example of setting up the environment for BeautifulSoup is in the article How to Save Money with Python.

Let's jump right into it!

Beautiful Soup Setup

Before we get into the top 5 functions, we have to set up our environment and the libraries that we are going to use to get the data.

In your terminal, install the libraries:

pip3 install requests

Requests lets you send HTTP requests with headers, form data, multipart files, and parameters through a simple Python interface. It also gives you access to the response data in the same way.

sudo pip3 install beautifulsoup4

This is our main library Beautiful Soup that we already mentioned above.

Also, at the beginning of your Python script, import the libraries we just installed:

import requests
from bs4 import BeautifulSoup

Now let's move on to the functions!

get()

This function, which comes from the requests library, is essential because it is how you download the web page you want to scrape. Let me show you.

First, we have to find a URL we want to scrape (get data) from:

URL = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=868d0edc56c291dbff697d1692708240'
headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

I took a random Amazon product, and with the get function we are going to access the data from its web page. The User-agent header simply tells the site which browser is making the request. You can check yours here.

Using the requests library, we request the desired URL with the headers we defined.
After that, we create an object called soup that we can use to find anything we want on the page.

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

BeautifulSoup(page.content, 'html.parser') creates a data structure representing a parsed HTML or XML document.
Most of the methods you'll call on a BeautifulSoup object are inherited from PageElement or Tag.
Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers.
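The download-and-parse step can be wrapped in a small helper. This is only a sketch: the function name fetch_soup and the 10-second timeout are assumptions added for illustration, not part of the original script. The calls it relies on (requests.get with headers and timeout, raise_for_status, and the BeautifulSoup constructor) are standard.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Download url and return its content parsed as a BeautifulSoup object.

    fetch_soup is a hypothetical helper name; the timeout value is an
    arbitrary choice so a slow server cannot hang the script forever.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Example usage (needs network access, so it is left commented out):
# soup = fetch_soup(URL, headers=headers)
```

Checking the status before parsing is worthwhile because a blocked or failed request would otherwise produce a soup object that simply contains none of the elements you are looking for.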

We can now move on to the next function, which actually searches the object we just created.

find()

With the find() function, we can search for an element anywhere in our web page. It returns the first element that matches.
Let's say we want to get the title and the price of the product based on their ids.

title = soup.find(id="productTitle").get_text()
price = soup.find(id="priceblock_ourprice").get_text()

You can find the ids of these web elements by pressing F12 on your keyboard or by right-clicking and choosing Inspect.
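One thing to keep in mind is that find() returns None when nothing matches, so calling get_text() directly on the result fails if the page layout changes. The sketch below shows one way to guard against that. The HTML snippet and the helper name text_by_id are hypothetical; the ids mirror the ones used above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the two elements we are looking for.
html = """
<div>
  <span id="productTitle">  Example Laptop  </span>
  <span id="priceblock_ourprice">1.299,00 €</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if it is missing."""
    element = soup.find(id=element_id)
    return element.get_text() if element is not None else None


title = text_by_id(soup, "productTitle")
price = text_by_id(soup, "priceblock_ourprice")
missing = text_by_id(soup, "doesNotExist")

print(repr(title), repr(price), missing)
```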

Let's look closely at what just happened there!

get_text()

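As you can see in the previous example, we used get_text() to extract the text of the elements we just found, the title and the price. It returns all the text inside an element, including the text of any nested tags, as a single string.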
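get_text() also accepts separator and strip arguments, which can save a cleanup step. Here is a short sketch on a made-up element with nested tags; the HTML is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical element with nested tags and stray whitespace.
html = "<div id='productTitle'>\n  Example <b>Laptop</b>\n  <i>Pro</i>\n</div>"
soup = BeautifulSoup(html, "html.parser")
element = soup.find(id="productTitle")

# Default call: text of all nested tags, whitespace kept as it is.
raw = element.get_text()

# strip=True trims each text fragment; separator joins the fragments.
clean = element.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```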

But before we get to the final results there are a few more things that we have to perform on our product in order to get perfect output.

strip()

The strip() method returns a copy of the string with both leading and trailing characters removed (based on the string argument passed).

We use this method to remove the whitespace around our title:
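A minimal sketch of that step, using a made-up title string in place of the scraped one:

```python
# Hypothetical raw title, padded with newlines and spaces as scraped text often is.
title = "\n   Example Laptop   \n"

# With no argument, strip() removes leading and trailing whitespace.
clean_title = title.strip()

# With an argument, it removes those characters from both ends instead.
trimmed = "--Example--".strip("-")

print(repr(clean_title))
print(repr(trimmed))
```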

strip() is a built-in Python string method, so you can use it anywhere, not just with Beautiful Soup. In my experience, it has come in handy so many times when working with text elements that I am putting it on this list.

split()

This is also a general-purpose Python string method, but I found it very useful as well.
It splits the string into parts so that we can use the ones we want.
You call it on a string and pass it the separator to split on.

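We use sep as the separator to cut the price string at the decimal comma, and then convert what is left to an integer (whole number). For a German-formatted price such as 1.299,00 €, the part before the comma is 1.299.

replace() just replaces . with an empty string, which removes the thousands separator.

sep = ','
con_price = price.split(sep, 1)[0]
converted_price = int(con_price.replace('.', ''))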
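The same steps can be collected into one small function. This is a sketch under the assumption that prices use a comma for decimals and a dot for thousands; the function name parse_price and the sample prices are hypothetical.

```python
def parse_price(price_text, decimal_sep=",", thousands_sep="."):
    """Return the whole-number part of a price string as an int.

    Assumes a format like '1.299,00 €'; the cents are dropped, not rounded.
    """
    # Keep only what comes before the decimal separator.
    whole_part = price_text.strip().split(decimal_sep, 1)[0]
    # Remove the thousands separator so int() can parse the digits.
    return int(whole_part.replace(thousands_sep, ""))


laptop_price = parse_price("1.299,00 €")
cable_price = parse_price("  49,99 €")

print(laptop_price, cable_price)
```

Because int() only sees the part before the comma, 49,99 becomes 49. If you need the cents, converting to a Decimal would be a better fit than int.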

Here are the final results:

I put the complete code for you in this Gist:

Just check your headers before you execute it.

If you want to run it, here is the terminal command:

python3 bs_tutorial.py

We are done!

Last words

As mentioned before, this is not my first time writing about Beautiful Soup, Selenium and Web Scraping in general. There are many more functions I would love to cover and many more to come. I hope you liked this tutorial and in order to keep up, follow me for more!

Thanks for reading!

Check out my other articles and follow me on Medium

Follow me on Twitter for info when I get a new article out
