Beautiful Soup Cheat Sheet

Python makes it simple to grab data from the web. This is a guide (or maybe cheat sheet) on how you can scrape the web easily with Requests and Beautiful Soup 4.

Beautiful Soup 4 Documentation

Getting started

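First, you need to install the right tools. Both are available from PyPI as requests and beautifulsoup4, so running pip install requests beautifulsoup4 covers them.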

Requests and Beautiful Soup are the two libraries we will use for scraping. Create a new Python file and import them at the top of your file.
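A minimal sketch of the top of that file, assuming both packages are installed under their usual import names:

```python
# Requests fetches pages; Beautiful Soup parses the HTML they contain.
import requests
from bs4 import BeautifulSoup
```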

If you have a NavigableString (but not a Tag), you can reference .string to get a str with the string's content. This is the same as calling str() on it. You can also access .string on a Tag, but the meaning in that case is convoluted, and I find it easier to just avoid it. str() and get_text() are enough anyway.

Fetch with Requests

The Requests library will be used to fetch the pages. To make a GET request, call requests.get() with the URL of the page.
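As a rough sketch, a GET request could be wrapped like this. The helper name fetch_page, the 10-second timeout, and the example URL in the comment are assumptions for illustration, not part of the original guide.

```python
import requests


def fetch_page(url, timeout=10):
    """Send a GET request and return the Response, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response


# Usage (performs a real network request, so it is left commented out):
# page = fetch_page("https://example.com")
```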

The response object carries a lot of information, such as the status code, headers, encoding, and body.
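The sketch below gathers a few commonly used fields of a response into a dict. The helper name summarize_response is hypothetical; the attributes it reads (status_code, ok, url, headers, encoding, content) are standard on requests.Response.

```python
import requests


def summarize_response(response):
    """Collect commonly used fields of a requests.Response into a dict."""
    return {
        "status_code": response.status_code,
        "ok": response.ok,  # True for status codes below 400
        "url": response.url,
        "content_type": response.headers.get("Content-Type"),
        "encoding": response.encoding,
        "body_bytes": len(response.content),
    }
```

Calling it on the result of requests.get() gives a quick overview before you start parsing.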

To be able to scrape your page, you need to use the Beautiful Soup library. Pass the response content to the BeautifulSoup constructor to turn it into a soup object.
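A minimal sketch of that step. The content bytes below are a stand-in for response.content so that the example runs without a network call, and the built-in html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content from a real request.
content = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Naming the parser explicitly avoids a warning and keeps results consistent.
soup = BeautifulSoup(content, "html.parser")
print(soup.title.string)
```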

You can see the HTML in a readable format with the prettify method.
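For instance, on a small made-up snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

# prettify() returns the markup as a string with one tag or text node per line.
pretty = soup.prettify()
print(pretty)
```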

Scrape with Beautiful Soup

Now to the actual scraping: getting the data out of the HTML code.

Using CSS Selector

The easiest way is probably to use a CSS selector, which you can copy from Chrome's developer tools.

Here, I selected the first Google result, inspected its HTML, right-clicked the element, and chose Copy > Copy selector.

The select method will, however, return a list. If you only want one object, you can use the select_one method instead.
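The difference can be sketched on a small made-up document. The markup and the selector are illustrative, not the Google result mentioned above.

```python
from bs4 import BeautifulSoup

html = (
    '<ul id="results">'
    '<li><a href="/first">First</a></li>'
    '<li><a href="/second">Second</a></li>'
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

links = soup.select("#results > li > a")      # list of every match
first = soup.select_one("#results > li > a")  # first match only
missing = soup.select_one("#nope")            # None when nothing matches

print(len(links), first.get_text(), missing)
```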

Using Tags

You can also scrape by tags (a, h1, p, div) with the following syntax.
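A short sketch on made-up markup. Accessing a tag name as an attribute of the soup returns the first tag with that name, or None if there is none.

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First paragraph</p><p>Second <a href='/x'>link</a></p></div>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1      # first <h1>
paragraph = soup.p     # first <p> only
anchor = soup.a        # first <a>, wherever it is nested
no_table = soup.table  # None, because there is no <table>

print(heading, paragraph, anchor, no_table)
```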

It is also possible to use the id or class attribute to scrape the HTML.
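For example, on made-up markup (note the trailing underscore in class_, needed because class is a reserved word in Python):

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main">'
    '<p class="intro">Hi</p>'
    '<p class="note">A</p>'
    '<p class="note">B</p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

main = soup.find(id="main")             # look up by id attribute
notes = soup.find_all(class_="note")    # look up by CSS class

print(main.name, [n.get_text() for n in notes])
```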

Using find_all

Another method you can use is find_all. It returns a list of every element that matches.

You can also use the find method, which returns only the first matching element (or None if nothing matches) instead of a list.
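The two methods side by side, on a made-up list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")

items = soup.find_all("li")               # every <li>
first_two = soup.find_all("li", limit=2)  # stop after two matches
first = soup.find("li")                   # first <li> only
nothing = soup.find("table")              # None when there is no match

print([li.get_text() for li in items], first.get_text(), nothing)
```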

Get the values

The most important part of scraping is getting the actual values (or text) from the element.

Get the inner text (the actual text printed on the page) with this method.
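A minimal sketch, assuming the method meant here is get_text() (the .text property is a shortcut for the same call):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Price: <b>42</b> USD</p>", "html.parser")

# get_text() joins the text of the tag and all of its descendants.
text = soup.p.get_text()
same_text = soup.p.text

print(text)
```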

If you want to get a specific attribute of an element, like the href, use this syntax:
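For example, on a made-up link. Square brackets raise KeyError for a missing attribute, while .get() returns None instead.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com" title="Example">Example</a>', "html.parser")
link = soup.a

href = link["href"]          # dictionary-style access
target = link.get("target")  # None, because the attribute is absent
all_attrs = link.attrs       # every attribute as a dict

print(href, target, all_attrs)
```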

parses HTML by default, can also parse XML (the XML parser requires lxml)

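from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

# use the line below to download a webpage
html = urlopen('web address').read()
soup = BeautifulSoup(html, 'html.parser')

# or parse a local file instead
with open('doc.html') as f:
    soup = BeautifulSoup(f, 'html.parser')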

soup.get_text() => all text

soup.get_text('|', strip=True) => all text as a str, pieces of text separated by |, whitespace stripped from each piece
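The effect of the separator and strip arguments can be sketched on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> One </p><p>Two</p></div>", "html.parser")

raw = soup.get_text()                    # text pieces joined as-is
joined = soup.get_text("|", strip=True)  # each piece stripped, then joined with |

print(repr(raw), repr(joined))
```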

Search

string
string, string
attr='text'
attrs={'data-foo': 'value'}
regex
list
True => all tags

----

for tag in soup.find_all(re.compile('t')):  # needs "import re"; matches tag names containing 't'
    print(tag.name)

----

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
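The filter types listed above can be tried side by side. This is a sketch on made-up markup; the helper names() exists only to keep the output short.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<table><tr><td class="x">c</td></tr></table>'
    '<p class="x" id="p1">p</p>'
    '<a data-foo="value">a</a>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return just the tag names of a result list."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("p"))                         # exact tag name
by_list = names(soup.find_all(["a", "p"]))                    # any name in the list
by_regex = names(soup.find_all(re.compile("t")))              # names containing "t"
by_true = names(soup.find_all(True))                          # every tag
by_attrs = names(soup.find_all(attrs={"data-foo": "value"}))  # attribute match
by_function = names(soup.find_all(has_class_but_no_id))       # custom predicate

print(by_regex, by_function)
```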

.a_tag.b_tag => get first b_tag in a_tag

.strings => text from the doc

.children

.next_element => different than children (the next element in parse order)

.next_sibling

.descendants

.next_elements

.next_siblings
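A small sketch of how these navigation attributes differ, using made-up markup written without whitespace between tags so that no stray newline strings appear:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>One <b>bold</b></p><p>Two</p></div>", "html.parser")
first_p = soup.div.p  # first <p> inside the first <div>

children = list(first_p.children)        # direct children: "One " and <b>
descendants = list(first_p.descendants)  # children plus the text inside <b>
next_parsed = first_p.next_element       # next thing parsed: the string "One "
next_sib = first_p.next_sibling          # next node at the same level: <p>Two</p>
all_strings = list(soup.strings)         # every string in the document

print(children, descendants, repr(next_parsed), next_sib, all_strings)
```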

Main

Tag
NavigableString
BeautifulSoup
Comment

Lesser - all subclass NavigableString

CData
ProcessingInstruction
Declaration
DocType

Tag Object

tag.name => name of the tag

tag.get('attr') => use if you don't know whether the attribute is defined (returns None instead of raising KeyError)

multivalued tag attributes => list
multivalued tag attributes - class, rev, accept-charset, headers, accesskey
‘id’ is not multivalued => string
you can change tag attributes
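The notes above can be sketched as follows, on a made-up tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="go" class="btn primary">Go</a>', "html.parser")
tag = soup.a

name = tag.name          # "a"
classes = tag["class"]   # multi-valued attribute, so a list
tag_id = tag["id"]       # single-valued attribute, so a string
href = tag.get("href")   # None, because the attribute is not defined

# Attributes can be added, changed, and deleted like dictionary entries.
tag["href"] = "/start"
del tag["id"]

print(name, classes, tag_id, href, tag)
```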

NavigableString Object

tag.string => text within a tag
tag.string.replace_with('any_text')
use outside of BeautifulSoup by converting to a plain string
str(tag.string)
supports all navigation except .contents .string .find()
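A brief sketch of working with the string inside a tag, on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>old text</b>", "html.parser")
tag = soup.b

original_type = type(tag.string).__name__  # "NavigableString"

# Strings cannot be edited in place, but they can be swapped out.
tag.string.replace_with("new text")

# str() gives a plain Python string with no reference back to the parse tree.
plain = str(tag.string)

print(original_type, plain)
```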

BeautifulSoup Object

whole document
soup.name => '[document]'
supports most navigation

Comment Object

NavigableString subclass
<!-- text -->
display with special formatting when prettified
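A minimal sketch showing that a comment is a special kind of string, on made-up markup:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p><!-- a note --></p>", "html.parser")
comment = soup.p.string

is_comment = isinstance(comment, Comment)
is_string = isinstance(comment, NavigableString)  # Comment subclasses NavigableString
pretty = soup.p.prettify()                        # comment markers are restored here

print(is_comment, is_string, repr(str(comment)))
print(pretty)
```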