How to write data to new columns in csv when webscraping? - python

I am scraping the Billboard Hot R&B/Hip-Hop chart and I am able to get all of my data, but when I write it to a CSV the formatting comes out wrong: the Last Week Number, Peak Position and Weeks On Chart values all appear under the first three columns of my CSV instead of under their respective headers.
This is my current code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.billboard.com/charts/r-b-hip-hop-songs'

# Opens web connection and grabs page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# HTML parsing
page_soup = soup(page_html, "html.parser")

# Grabs song title, artist and picture
mainContainer = page_soup.findAll("div", {"class": "chart-row__main-display"})

# CSV filename creation
filename = "Billboard_Hip_Hop_Charts.csv"
f = open(filename, "w")

# Creating headers
headers = "Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position, Weeks On Chart\n"
f.write(headers)

# Get Billboard Number, Artist Name and Song Title
for container in mainContainer:
    # Gets billboard number
    billboard_number = container.div.span.text
    # Gets artist name
    artist_name_a_tag = container.findAll("", {"class": "chart-row__artist"})
    artist_name = artist_name_a_tag[0].text.strip()
    # Gets song title
    song_title = container.h2.text

    print("Billboard Number: " + billboard_number)
    print("Artist Name: " + artist_name)
    print("Song Title: " + song_title)
    f.write(billboard_number + "," + artist_name + "," + song_title + "\n")

# Grabs side container from main container
secondaryContainer = page_soup.findAll("div", {"class": "chart-row__secondary"})

# Get Last Week Number, Peak Position and Weeks On Chart
for container in secondaryContainer:
    # Gets last week number
    last_week_number_tag = container.findAll("", {"class": "chart-row__value"})
    last_week_number = last_week_number_tag[0].text
    # Gets peak position
    peak_position_tag = container.findAll("", {"class": "chart-row__value"})
    peak_position = peak_position_tag[1].text
    # Gets weeks on chart
    weeks_on_chart_tag = container.findAll("", {"class": "chart-row__value"})
    weeks_on_chart = weeks_on_chart_tag[2].text

    print("Last Week Number: " + last_week_number)
    print("Peak Position: " + peak_position)
    print("Weeks On Chart: " + weeks_on_chart)
    f.write(last_week_number + "," + peak_position + "," + weeks_on_chart + "\n")

f.close()
This is what my csv looks like with headers Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position and Weeks On Chart.
1 Drake Nice For What
2 Post Malone Featuring Ty Dolla $ign Psycho
3 Drake God's Plan
4 Post Malone Better Now
5 Post Malone Featuring 21 Savage Rockstar
6 BlocBoy JB Featuring Drake Look Alive
7 Post Malone Paranoid
8 Lil Dicky Featuring Chris Brown Freaky Friday
9 Post Malone Rich & Sad
10 Post Malone Featuring Swae Lee Spoil My Night
11 Post Malone Featuring Nicki Minaj Ball For Me
12 Migos Featuring Drake Walk It Talk It
13 Post Malone Featuring G-Eazy & YG Same Bitches
14 Cardi B| Bad Bunny & J Balvin I Like It
15 Post Malone Zack And Codeine
16 Post Malone Over Now
17 Cardi B Be Careful
18 Post Malone Takin' Shots
19 The Weeknd & Kendrick Lamar Pray For Me
20 Rich The Kid Plug Walk
21 The Weeknd Call Out My Name
22 Bruno Mars & Cardi B Finesse
23 Post Malone Candy Paint
24 Ella Mai Boo'd Up
25 Rae Sremmurd & Juicy J Powerglide
26 Post Malone 92 Explorer
27 J. Cole ATM
28 J. Cole KOD
29 Post Malone Otherside
30 Post Malone Blame It On Me
31 J. Cole Kevin's Heart
32 Kendrick Lamar & SZA All The Stars
33 Nicki Minaj Chun-Li
34 Lil Pump Esskeetit
35 Migos Stir Fry
36 Famous Dex Japan
37 Post Malone Sugar Wraith
38 Cardi B Featuring Migos Drip
39 XXXTENTACION Sad!
40 Jay Rock| Kendrick Lamar| Future & James Blake King's Dead
41 Rich The Kid Featuring Kendrick Lamar New Freezer
42 Logic & Marshmello Everyday
43 J. Cole Motiv8
44 YoungBoy Never Broke Again Outside Today
45 Post Malone Jonestown (Interlude)
46 Cardi B Featuring 21 Savage Bartier Cardi
47 YoungBoy Never Broke Again Overdose
48 J. Cole 1985 (Intro To The Fall Off)
49 J. Cole Photograph
50 Khalid| Ty Dolla $ign & 6LACK OTW
1 1 2
2 1 6
3 1 17
4 2 12
5 3 14
10 6 8
...
Any help on placing the data into the right columns helps!

Your code is unnecessarily messy and hard to read. You don't need two separate containers at all, because a single container per chart row is enough to fetch all the required data. Try the approach below and you will find the CSV filled in with every value under its proper column:
import requests, csv
from bs4 import BeautifulSoup

url = 'https://www.billboard.com/charts/r-b-hip-hop-songs'

with open('Billboard_Hip_Hop_Charts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Billboard Number', 'Artist Name', 'Song Title', 'Last Week Number', 'peak_position', 'weeks_on_chart'])

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for container in soup.find_all("article", class_="chart-row"):
        billboard_number = container.find(class_="chart-row__current-week").text
        artist_name_a_tag = container.find(class_="chart-row__artist").text.strip()
        song_title = container.find(class_="chart-row__song").text
        last_week_number_tag = container.find(class_="chart-row__value")
        last_week_number = last_week_number_tag.text
        peak_position_tag = last_week_number_tag.find_parent().find_next_sibling().find(class_="chart-row__value")
        peak_position = peak_position_tag.text
        weeks_on_chart_tag = peak_position_tag.find_parent().find_next_sibling().find(class_="chart-row__value").text
        print(billboard_number, artist_name_a_tag, song_title, last_week_number, peak_position, weeks_on_chart_tag)
        writer.writerow([billboard_number, artist_name_a_tag, song_title, last_week_number, peak_position, weeks_on_chart_tag])
The output looks like:
1 Childish Gambino This Is America 1 1 2
2 Drake Nice For What 2 1 6
3 Drake God's Plan 3 1 17
4 Post Malone Featuring Ty Dolla $ign Psycho 4 2 12
5 BlocBoy JB Featuring Drake Look Alive 5 3 14
6 Ella Mai Boo'd Up 10 6 8
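For completeness, the reason the original script misaligns the columns is that it writes two separate CSV rows per chart entry: the first loop writes the billboard number, artist and title, and the second loop later writes the last-week number, peak position and weeks on chart on rows of their own. If you wanted to keep the two-container structure instead of switching to a single container, a minimal sketch (reusing the class names from the question; the page layout may have changed since) is to zip the two container lists and emit one row with all six values per entry:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

page_soup = BeautifulSoup(urlopen('https://www.billboard.com/charts/r-b-hip-hop-songs').read(), "html.parser")
main = page_soup.find_all("div", {"class": "chart-row__main-display"})
secondary = page_soup.find_all("div", {"class": "chart-row__secondary"})

with open("Billboard_Hip_Hop_Charts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Billboard Number", "Artist Name", "Song Title",
                     "Last Week Number", "Peak Position", "Weeks On Chart"])
    # zip pairs the i-th main container with the i-th secondary container,
    # so every chart entry becomes exactly one CSV row with all six fields
    for main_c, side_c in zip(main, secondary):
        billboard_number = main_c.div.span.text.strip()
        artist_name = main_c.find(class_="chart-row__artist").text.strip()
        song_title = main_c.h2.text.strip()
        values = side_c.find_all(class_="chart-row__value")
        writer.writerow([billboard_number, artist_name, song_title,
                         values[0].text, values[1].text, values[2].text])

Using csv.writer here also quotes artist names that contain commas, which the plain f.write(value + ",") approach would otherwise split into extra columns.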

Related

scraping basketball results and associate related competition to each match

I want to scrape basketball results from this webpage:
http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29
I created the code using bs4 and requests:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')
The issue I face is how to add the competition to each row I scrape.
I want to create a table where each row is one match result (competition, home team, away team, score...).
Selenium
Try this (selenium):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')
for table in tables:
    if table.get('class'):
        competition = table.select_one('a b font').text
    else:
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([f"{competition}",
                        f"{home.select_one('td a').text}",
                        f"{away.select_one('td a').text}",
                        f"{home.select_one('td.red').text}",
                        f"{away.select_one('td.red').text}",
                        f"{home.select_one('td.odds1').text}",
                        f"{away.select_one('td.odds1').text}",
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        f"{home.select('td div a')[-1].get('href')}"])

df = pd.DataFrame(res, columns=['competition',
                                'home',
                                'away',
                                'home score',
                                'away score',
                                'home odds',
                                'away odds',
                                'home ht',
                                'away ht',
                                'odds'
                                ])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html
and so on....
And it saves a Res.csv containing the same rows.
Requests
Try this (requests):
import pandas as pd
from bs4 import BeautifulSoup
import requests

res = []
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('h')
for item in items:
    values = item.text.split('^')
    res.append([f'{values[1]}', f'{values[8]}', f'{values[10]}', f'{values[11]}', f'{values[12]}'])

df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76
And it saves a Res.csv containing the same rows.
If you do not want the index column, simply add index=False to the last call, i.e. df.to_csv('Res.csv', index=False).
Note on the selenium version: you need selenium and geckodriver, and in this code geckodriver is expected to be loaded from c:/program/geckodriver.exe.
The selenium version is slower, but it avoids having to find the XML endpoint with DevTools.
This page uses JavaScript to load its data, but requests/BeautifulSoup can't run JavaScript, so you have two options.
First: use Selenium to control a real web browser that can run JavaScript. This is the better choice when a page uses complex JavaScript to generate its data, but it is slower because it has to start a browser, render the page and execute the scripts.
Second: use DevTools in Firefox/Chrome (Network tab, XHR filter) to find the URL that the JavaScript/AJAX (XHR) code calls to get the data from the server, and request that URL directly. Often it returns JSON, which converts straight into Python lists/dictionaries, so you don't need BeautifulSoup at all. This is faster, but sometimes the page uses JavaScript that is hard to reproduce in Python.
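When an endpoint does return JSON, the whole flow is usually just requests plus r.json(); a generic sketch (the endpoint and field names here are hypothetical, not the ones used for this page, which returns XML):

import requests

# hypothetical XHR endpoint copied from the DevTools Network tab (XHR filter)
url = 'https://example.com/api/matches?date=2020-07-29'

r = requests.get(url)
data = r.json()  # parse the JSON body into Python lists/dicts
for match in data:  # the structure depends entirely on the endpoint
    print(match.get('home'), match.get('away'), match.get('score'))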
I chose the second method.
I found that the page reads its data from
http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000
but it returns XML, so it still needs BeautifulSoup (or lxml) to parse the data.
import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'

r = requests.get(url)
soup = BS(r.text, 'html.parser')

all_items = soup.find_all('h')
for item in all_items:
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])
    print(values[10], values[12])
    print('---')
Result:
Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---

Missing data not being scraped from Hansard

I'm trying to scrape data from Hansard, the official verbatim record of everything spoken in the UK House of Parliament. In a nutshell, I want to scrape every "mention" container on the search-results page for "civilian casualties" (the URL is in the code below) and the 50 pages after it.
But I find that when my scraper is "finished," it's only collected data on 990 containers and not the full 1010. Data on 20 containers is missing, as if it's skipping a page. When I only set the page range to (0,1), it fails to collect any values. When I set it to (0,2), it collects only the first page's values. Asking it to collect data on 52 pages does not help. I thought that this was perhaps due to the fact that I wasn't giving the URLs enough time to load, so I added some delays in the scraper's crawl. That didn't solve anything.
Can anyone provide me with any insight into what I may be missing? I'd like to make sure that my scraper is collecting all available data.
import numpy as np
import pandas as pd
import time
from random import randint
from requests import get
from bs4 import BeautifulSoup

topics, houses, names, dates = [], [], [], []

pages = np.arange(0, 52)
for page in pages:
    hansard_url = "https://hansard.parliament.uk/search/Contributions?searchTerm=%22civilian%20casualties%22&startDate=01%2F01%2F1988%2000%3A00%3A00&endDate=07%2F14%2F2020%2000%3A00%3A00"
    full_url = hansard_url + "&page=" + str(page) + "&partial=true"
    page = get(full_url)
    html_soup = BeautifulSoup(page.text, 'html.parser')
    mention_containers = html_soup.find_all('div', class_="result contribution")
    time.sleep(randint(2, 10))
    for mention in mention_containers:
        topic = mention.div.span.text
        topics.append(topic)

        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            houses.append("House of Lords")
        elif house == "Commons Portcullis":
            houses.append("House of Commons")
        else:
            houses.append("N/A")

        name = mention.find('div', class_="secondaryTitle").text
        names.append(name)

        date = mention.find('div', class_="").text
        dates.append(date)
    time.sleep(randint(2, 10))

hansard_dataset = pd.DataFrame(
    {'Date': dates, 'House': houses, 'Speaker': names, 'Topic': topics}
)

print(hansard_dataset.info())
print(hansard_dataset.isnull().sum())

hansard_dataset.to_csv('hansard.csv', index=False, sep="#")
Any help in helping me solve this problem is appreciated.
The server returns an empty container for page 48, so the total is 1000 results across pages 1 to 51 (inclusive):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://hansard.parliament.uk/search/Contributions'
params = {
    'searchTerm': 'civilian casualties',
    'startDate': '01/01/1988 00:00:00',
    'endDate': '07/14/2020 00:00:00',
    'partial': 'True',
    'page': 1,
}

all_data = []
for page in range(1, 52):
    params['page'] = page
    print('Page {}...'.format(page))

    soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')

    mention_containers = soup.find_all('div', class_="result contribution")
    if not mention_containers:
        print('Empty container!')

    for mention in mention_containers:
        topic = mention.div.span.text
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            house = "House of Lords"
        elif house == "Commons Portcullis":
            house = "House of Commons"
        else:
            house = "N/A"

        name = mention.find('div', class_="secondaryTitle").text
        date = mention.find('div', class_="").get_text(strip=True)

        all_data.append({'Date': date, 'House': house, 'Speaker': name, 'Topic': topic})

df = pd.DataFrame(all_data)
print(df)
Prints:
...
Page 41...
Page 42...
Page 43...
Page 44...
Page 45...
Page 46...
Page 47...
Page 48...
Empty container! # <--- here is the server error
Page 49...
Page 50...
Page 51...
Date House Speaker Topic
0 14 July 2014 House of Lords Baroness Warsi Gaza debate in Lords Chamber
1 3 March 2016 House of Lords Lord Touhig Armed Forces Bill debate in Grand Committee
2 2 December 2015 House of Commons Mr David Cameron ISIL in Syria debate in Commons Chamber
3 3 March 2016 House of Lords Armed Forces Bill debate in Grand Committee
4 27 April 2016 House of Lords Armed Forces Bill debate in Lords Chamber
.. ... ... ... ...
995 18 June 2003 House of Lords Lord Craig of Radley Defence Policy debate in Lords Chamber
996 7 September 2004 House of Lords Lord Rea Iraq debate in Lords Chamber
997 14 February 1994 House of Lords The Parliamentary Under-Secretary of State, Mi... Landmines debate in Lords Chamber
998 12 January 2000 House of Commons The Minister of State, Foreign and Commonwealt... Serbia And Kosovo debate in Westminster Hall
999 26 February 2003 House of Lords Lord Rea Iraq debate in Lords Chamber
[1000 rows x 4 columns]
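If you need those remaining rows, one option is to retry a page that comes back empty before moving on; whether that actually recovers them depends on whether the empty response is a transient server error. A small sketch of such a retry helper, reusing the url and params defined in the snippet above:

import time
import requests
from bs4 import BeautifulSoup

def get_page(page, tries=3, delay=5):
    """Fetch one results page, retrying a few times if it comes back empty."""
    for attempt in range(tries):
        params['page'] = page
        soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')
        containers = soup.find_all('div', class_="result contribution")
        if containers:
            return containers
        time.sleep(delay)
    return []  # still empty after retrying; the server genuinely has no data for this page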

Parsing html with correct encoding

I'm trying to use the basketball-reference API using python with the requests and bs4 libraries.
from requests import get
from bs4 import BeautifulSoup
Here's a minimal example of what I'm trying to do:
# example request
r = get(f'https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table')
It all works well, I can then feed this table to pandas with its read_html and get the data I need nicely packed into a dataframe.
The problem I have is the encoding.
In this particular request I got two NBA player names with non-ASCII characters: Ersan İlyasova and Jorge Gutiérrez. In the current code they come out as "Ersan Ä°lyasova" and "Jorge GutiÃ©rrez", which is obviously not what I want.
So the question is -- how do I fix it? This website seems to suggest they have the windows-1251 encoding, but I'm not sure how to use that information (in fact I'm not even sure if that's true).
I know I'm missing something fundamental here as I'm a bit confused how these encodings work at which point they are being "interpreted" etc, so I'll be grateful if you help me with this!
I don't know why you are using a format string there, and the question is a little unclear: you've copied the URL straight from the network traffic and are mixing up URL quoting with text encoding.
The following should get it done:
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/teams/MIL/2015.html")
print(df)
Output:
[ No. Player Pos ... Unnamed: 6 Exp College
0 34 Giannis Antetokounmpo SG ... gr 1 NaN
1 19 Jerryd Bayless PG ... us 6 Arizona
2 5 Michael Carter-Williams PG ... us 1 Syracuse
3 9 Jared Dudley SG ... us 7 Boston College
4 11 Tyler Ennis PG ... ca R Syracuse
5 13 Jorge Gutiérrez PG ... mx 1 California
6 31 John Henson C ... us 2 UNC
7 7 Ersan İlyasova PF ... tr 6 NaN
8 23 Chris Johnson SF ... us 2 Dayton
9 11 Brandon Knight PG ... us 3 Kentucky
10 5 Kendall Marshall PG ... us 2 UNC
11 6 Kenyon Martin PF ... us 14 Cincinnati
12 0 O.J. Mayo SG ... us 6 USC
13 22 Khris Middleton SF ... us 2 Texas A&M
14 3 Johnny O'Bryant PF ... us R LSU
15 27 Zaza Pachulia C ... ge 11 NaN
16 12 Jabari Parker PF ... us R Duke
17 21 Miles Plumlee C ... us 2 Duke
18 8 Larry Sanders C ... us 4 Virginia Commonwealth
19 6 Nate Wolters PG ... us 1 South Dakota State
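If you would rather keep the requests + BeautifulSoup route from the question, the mojibake ("Ä°", "Ã©") is the classic symptom of UTF-8 bytes being decoded as Latin-1/Windows-1252. A hedged sketch of one way to handle it, assuming the widget really is served as UTF-8 and parses the same way it does in the question:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = ('https://widgets.sports-reference.com/wg.fcgi'
       '?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')

r = requests.get(url)
# If the server does not declare a charset, requests falls back to ISO-8859-1
# for r.text; overriding the encoding (or letting requests guess it) avoids
# the "Ersan Ä°lyasova" style mojibake.
r.encoding = r.apparent_encoding  # or simply: r.encoding = 'utf-8'

soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table')
df = pd.read_html(str(table))[0]
print(df.head())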

Pandas - groupby where each row has multiple values stored in list

I'm working with last.fm listening data and have a DataFrame that looks like this:
Artist Plays Genres
0 John Coltrane 10 [jazz, modal jazz, hard bop]
1 Miles Davis 15 [jazz, cool jazz, modal jazz, hard bop]
2 Charlie Parker 20 [jazz, bebop]
I want to group the data by the genres and then aggregate by the sum of plays for each genre, to get something like this:
Genre Plays
0 jazz 45
1 modal jazz 25
2 hard bop 25
3 bebop 20
4 cool jazz 15
Been trying to figure this out for a while now but can't seem to find the solution. Do I need to change the way that the genre data is stored?
I was able to find this post which addresses a similar question, but that user was only looking to get the count of each list value. This gets me about halfway there, but I couldn't figure out how to use that to aggregate another column in the dataframe.
In general, you should not store lists in a DataFrame, so yes, probably best to change how they are stored. With this you can use some join + str.get_dummies + .multiply. Choose a sep that doesn't appear in any of your strings.
sep = '*'
df.Genres.apply(sep.join).str.get_dummies(sep=sep).multiply(df.Plays, axis=0).sum()
Output
bebop 20
cool jazz 15
hard bop 25
jazz 45
modal jazz 25
dtype: int64
An easier form to work with would be if your lists were split across lines as in:
import pandas as pd
df1 = pd.concat([pd.DataFrame(df.Genres.values.tolist()).stack().reset_index(1, drop=True).to_frame('Genres'),
                 df[['Plays', 'Artist']]], axis=1)
Genres Plays Artist
0 jazz 10 John Coltrane
0 modal jazz 10 John Coltrane
0 hard bop 10 John Coltrane
1 jazz 15 Miles Davis
1 cool jazz 15 Miles Davis
1 modal jazz 15 Miles Davis
1 hard bop 15 Miles Davis
2 jazz 20 Charlie Parker
2 bebop 20 Charlie Parker
Making it a simple sum within genres:
df1.groupby('Genres').Plays.sum()
Genres
bebop 20
cool jazz 15
hard bop 25
jazz 45
modal jazz 25
Name: Plays, dtype: int64
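In recent pandas versions (0.25 and later) the same long format can be produced more directly with DataFrame.explode, as an alternative to the stack-based reshape above; a short sketch using the data from the question:

import pandas as pd

df = pd.DataFrame({
    'Artist': ['John Coltrane', 'Miles Davis', 'Charlie Parker'],
    'Plays': [10, 15, 20],
    'Genres': [['jazz', 'modal jazz', 'hard bop'],
               ['jazz', 'cool jazz', 'modal jazz', 'hard bop'],
               ['jazz', 'bebop']],
})

# explode() gives each list element its own row, repeating Artist and Plays,
# after which a plain groupby-sum yields plays per genre
plays_per_genre = (df.explode('Genres')
                     .groupby('Genres')['Plays'].sum()
                     .sort_values(ascending=False))
print(plays_per_genre)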

How to read wikipedia table of 2018 in film using python pandas and BeautifulSoup

I was attempting to get the movies of January to March 2018 from the Wikipedia page using pandas read_html.
Here is my code:
import pandas as pd
import numpy as np
link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link)
jan_march = tables[5].iloc[1:]
jan_march.columns = ['Opening1','Opening2','Title','Studio','Cast','Genre','Country','Ref']
jan_march.head()
There is some error in reading the columns. If anybody who has already scraped Wikipedia tables can help me solve the problem, that would be great.
Thanks a lot.
Related links:
Scraping Wikipedia tables with Python selectively
https://roche.io/2016/05/scrape-wikipedia-with-python
Scraping paginated web table with python pandas & beautifulSoup
The table I am getting does not match the one I am expecting: the columns come out wrong.
Because of how the table is designed, it is not as simple as a single pd.read_html() call; that is a start, but you will need some manipulation to get it into a desirable format:
import pandas as pd

link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link, header=0)[5]

# find na values and shift cells right
i = 0
while i < 2:
    row_shift = tables[tables['Unnamed: 7'].isnull()].index
    tables.iloc[row_shift, :] = tables.iloc[row_shift, :].shift(1, axis=1)
    i += 1

# create new column names
tables.columns = ['Month', 'Day', 'Title', 'Studio', 'Cast and crew', 'Genre', 'Country', 'Ref.']

# forward fill values
tables['Month'] = tables['Month'].ffill()
tables['Day'] = tables['Day'].ffill()
out:
Month Day Title Studio Cast and crew Genre Country Ref.
0 JANUARY 5 Insidious: The Last Key Universal Pictures / Blumhouse Productions Adam Robitel (director); Leigh Whannell (scree... Horror, Thriller US [33]
1 JANUARY 5 The Strange Ones Vertical Entertainment Lauren Wolkstein (director); Christopher Radcl... Drama US [34]
2 JANUARY 5 Stratton Momentum Pictures Simon West (director); Duncan Falconer, Warren... Action, Thriller IT, UK [35]
3 JANUARY 10 Sweet Country Samuel Goldwyn Films Warwick Thornton (director); David Tranter, St... Drama AUS [36]
4 JANUARY 12 The Commuter Lionsgate / StudioCanal / The Picture Company Jaume Collet-Serra (director); Byron Willinger... Action, Crime, Drama, Mystery, Thriller US, UK [37]
5 JANUARY 12 Proud Mary Screen Gems Babak Najafi (director); John S. Newman, Chris... Action, Thriller US [38]
6 JANUARY 12 Acts of Violence Lionsgate Premiere Brett Donowho (director); Nicolas Aaron Mezzan... Action, Thriller US [39]
...
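Since the original goal was January to March only, once the Month column has been forward-filled you can filter on it; a small follow-up, continuing from the tables DataFrame above and assuming the month labels stay uppercase as in the output:

# keep only the January-March 2018 releases
jan_march = tables[tables['Month'].isin(['JANUARY', 'FEBRUARY', 'MARCH'])]
print(jan_march.head())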
