Python Programming

How to read or parse data from Web Pages?

Sometimes we need to extract text data from blogs and other HTML web pages for our analysis.

Beautiful Soup is the library required for this recipe. Installing it on your computer is simple: use pip.

pip install beautifulsoup4

Blog or Web Page Data Collection for Analysis

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

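# Send a browser-like User-Agent; some sites reject urllib's default one.
# The part of the URL after '#' is a fragment: the browser handles it,
# and it is not sent to the server.
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})

# Download the raw HTML as bytes
webpage = urlopen(req).read()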

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(webpage, 'html.parser')

# Format the parsed HTML as an indented string
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract the page title and the value of the og:description meta tag
print(soup.title.string)
meta = soup.find('meta', attrs={'property': 'og:description'})
print(meta.get('content') if meta else None)

# Print the text of each anchor (link) tag;
# x.string is None when the tag has more than one child node
for x in soup.find_all('a'):
    print(x.string)

# Print the text of each paragraph tag
for x in soup.find_all('p'):
    print(x.text)
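
The recipe above can also be wrapped in two small functions, which makes it easier to reuse and to try out without a network connection. The following is only a minimal sketch: the names fetch_html, summarize and SAMPLE_HTML, the 10-second timeout, and the sample page itself are assumptions made for illustration and are not part of the original recipe.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page as bytes, or return None if the request fails.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urlopen(req, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print(f'Could not fetch {url}: {err}')
        return None


def summarize(html):
    """Return the title, og:description, links and paragraphs of a page."""
    soup = BeautifulSoup(html, 'html.parser')
    meta = soup.find('meta', attrs={'property': 'og:description'})
    return {
        'title': soup.title.get_text(strip=True) if soup.title else None,
        'description': meta.get('content') if meta else None,
        # href=True skips anchors that have no href attribute
        'links': [(a.get_text(strip=True), a['href'])
                  for a in soup.find_all('a', href=True)],
        'paragraphs': [p.get_text() for p in soup.find_all('p')],
    }


# Made-up sample page so the example runs without network access
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample Blog</title>
    <meta property="og:description" content="A short demo page">
  </head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="/about">link</a>.</p>
    <a>No destination</a>
  </body>
</html>
"""

summary = summarize(SAMPLE_HTML)
for key, value in summary.items():
    print(key, '->', value)
```

To run it against a live page, pass the result of fetch_html(url) to summarize, after checking that it is not None. Returning a dictionary instead of printing inside the loops keeps the extracted data available for later analysis.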