I've written a script in Python to parse some data from a webpage and write it to a CSV file via pandas. So far what I've written can parse all the tables available on that page, but when it comes to writing to a CSV file it only writes the last table from that page. The data are clearly being overwritten on each iteration of the loop. How can I fix this flaw so that my scraper writes all the data from the different tables instead of only the last one? Thanks in advance.
import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    df.to_csv("table_item.csv")  # overwrites the same file on every iteration
    print(df)
Btw, I'd like to write the data to a CSV file using pandas only. Thanks again.
You can use read_html, which returns a list of DataFrames from the webpage, and then concat them into one DataFrame:
dfs = pd.read_html('http://www.espn.com/nba/schedule/_/date/20171001')
df = pd.concat(dfs, ignore_index=True)
# rename columns if necessary
d = {'Unnamed: 1':'a', 'Unnamed: 7':'b'}
df = df.rename(columns=d)
print(df.head())
matchup a time (ET) nat tv away tv home tv \
0 Atlanta ATL Miami MIA NaN NaN NaN NaN
1 LA LAC Toronto TOR NaN NaN NaN NaN
2 Guangzhou Guangzhou Washington WSH NaN NaN NaN NaN
3 Charlotte CHA Boston BOS NaN NaN NaN NaN
4 Orlando ORL Memphis MEM NaN NaN NaN NaN
tickets b
0 2,401 tickets available from $6 NaN
1 284 tickets available from $29 NaN
2 2,792 tickets available from $2 NaN
3 2,908 tickets available from $6 NaN
4 1,508 tickets available from $3 NaN
And finally, to_csv for writing to a file:
df.to_csv("table_item.csv", index=False)
EDIT:
For learning purposes, it is also possible to append each DataFrame to a list and then concat:
res = requests.get('http://www.espn.com/nba/schedule/_/date/20171001').text
soup = BeautifulSoup(res, "lxml")

dfs = []
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(df)
df.to_csv("table_item.csv")
I'm trying to make a script that can read through a table of company names from one website, take each company's name, and put it into a URL (the URL exists, and it contains more data specific to each company; this data is what I want to analyze).
However, I cannot get the names into the URL without Python also putting in parts of the table, giving me the error below:
import numpy as np
import pandas as pd
import requests

url1 = "http://openinsider.com/latest-penny-stock-buys"
df1 = pd.read_html(url1)
table = df1[11]

# sorting
n = np.quantile(table["Qty"], [0.99])
print("99th percentile: ", n)
q = table.sort_values("Qty", ascending=False)
name = q["Ticker"].str.replace(r"\d+", "", regex=True)
page = requests.get(url1)
name = table["Ticker"]

# Buyers for the company
url = "http://openinsider.com/"
for entry in name:  # <- Question starts here
    name = entry + 1
    table2 = pd.read_html(url + str(name))
    df2 = table2[11]
    print(df2)
Error: InvalidURL: URL can't contain control characters. '/0 OPK\n1 VEII\n2 NGM\n3 STRR\n4
IMRA\n ... \n95 NaN\n96 CDXC\n97 PED\n98 FOA\n99 CAMP\nName:
Ticker, Length: 100, dtype: object' (found at least ' ')
Thanks!
In your for-loop:
remove name = entry + 1
replace url+str(name) with url + entry
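Applied to your snippet, the loop would look roughly like this (a minimal sketch; the NaN guard is my addition, since the error message shows NaN entries in the Ticker column):

url = "http://openinsider.com/"
for entry in name:
    if pd.isna(entry):  # skip missing tickers (the NaN values visible in the error above)
        continue
    table2 = pd.read_html(url + entry)  # entry is already the ticker string
    df2 = table2[11]
    print(df2)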
And so, you get the expected output printed:
X Filing Date Trade Date Ticker Insider Name Title \
0 NaN 2022-10-21 19:03:38 2022-10-19 MIST Wills Robert James Dir
1 NaN 2022-10-21 19:02:50 2022-10-20 MIST Pasternak Richard C Dir
2 M 2022-10-21 19:02:01 2022-10-20 MIST Liebert Debra K. Dir
3 NaN 2022-10-21 19:01:16 2022-10-19 MIST Tomsicek Michael John Dir
4 NaN 2022-09-09 16:15:34 2022-09-07 MIST Rtw Investments, LP 10%
5 NaN 2022-09-09 16:15:34 2022-09-07 MIST Rtw Investments, LP 10%
6 D 2022-06-01 21:32:38 2022-05-31 MIST Truex Paul F Dir
Trade Type Price Qty Owned ΔOwn Value 1d 1w 1m 6m
0 P - Purchase $4.93 15000 15000 New +$73,950 NaN NaN NaN NaN
1 P - Purchase $5.20 10000 10000 New +$52,000 NaN NaN NaN NaN
2 P - Purchase $5.28 14000 14127 >999% +$73,940 NaN NaN NaN NaN
3 P - Purchase $5.32 15000 15000 New +$79,800 NaN NaN NaN NaN
...
I am very new to web scraping and I am trying to understand how I can scrape all the zip files and regular files that are on this website. The end goal is to scrape all the data; I was originally thinking I could use pd.read_html, feed in a list of each link, and loop through each zip file.
I am very new to web scraping, so any help at all would be very useful. I have tried a few examples thus far; please see the code below.
import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")
So this is what I would like the output to look like, except each zip file would need to be its own DataFrame to work with/loop through. Currently, all it seems to be doing is retrieving the names of the zip files, not the actual data.
Thank you
To open a zip file and read the files inside it into a DataFrame, you can use the following example:
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"

dfs = []
with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
    for file in zf.namelist():
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",
            header=None,
        )
        dfs.append(df)

final_df = pd.concat(dfs)

# print first 10 rows:
print(final_df.head(10).to_markdown(index=False))
Prints:
|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |
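If the remaining archives follow the same naming pattern (an assumption; only marginalpdbc_2017.zip appears in the question), you could loop over the years and keep one concatenated DataFrame per archive:

import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

# assumed URL pattern: yearly archives named marginalpdbc_<year>.zip
base_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_{}.zip"

frames_by_year = {}
for year in range(2017, 2021):
    zip_url = base_url.format(year)
    with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
        dfs = [
            pd.read_csv(zf.open(name), sep=";", skiprows=1, skipfooter=1,
                        engine="python", header=None)
            for name in zf.namelist()
        ]
    frames_by_year[year] = pd.concat(dfs, ignore_index=True)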
I am trying to read a CSV file from my private Google Drive. The file is shared so that anyone with the link can view it. Here is the link: https://drive.google.com/file/d/12txcYHcO8aiwO9f948_nsaIE3wBGAuJa/view?usp=sharing
and here is a sample of the file:
email first_name last_name
uno#gmail.com Luca Rossi
due#gmail.com Daniel Bianchi
tre#gmail.com Gabriel Domeneghetti
qua#gmail.com Christian Bona
cin#gmail.com Simone Marsango
I need to read this file in order to parse this data into a program. I tried many ways, such as every possibility that has been suggested in this question: Pandas: How to read CSV file from google drive public?.
This is the code I wrote to do that so far:
import requests
import pandas as pd
from io import StringIO

csv_file_url = 'the file URL as copied in the drive UI'
file_id = csv_file_url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
url2 = requests.get(dwn_url).text
csv_raw = StringIO(url2)
df = pd.read_csv(csv_raw)
print(df.head())
And that should work, but returns only this table:
ÿþe Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
I think it is only a formatting matter, but I don't know how to get rid of it. Please, if you know how, help me.
Your data is UTF-16 encoded. You can read it by specifying the encoding:
pd.read_csv(dwn_url, encoding='utf16')
Result:
email first_name last_name
0 NaN NaN NaN
1 uno#gmail.com Luca Rossi
2 due#gmail.com Daniel Bianchi
3 tre#gmail.com Gabriel Domeneghetti
4 qua#gmail.com Christian Bona
5 cin#gmail.com Simone Marsango
(read_csv can read directly from a URL, so there is no need for requests and StringIO.)
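Putting it together with the download-URL construction from the question, a minimal end-to-end sketch (using the share link posted above) would be:

import pandas as pd

csv_file_url = "https://drive.google.com/file/d/12txcYHcO8aiwO9f948_nsaIE3wBGAuJa/view?usp=sharing"
file_id = csv_file_url.split("/")[-2]
dwn_url = "https://drive.google.com/uc?export=download&id=" + file_id

# read_csv fetches the URL itself; only the encoding needs to be specified
df = pd.read_csv(dwn_url, encoding="utf16")
print(df.head())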
I have an Excel file and the first two rows are:
Weekly Report
December 1-7, 2014
And after that comes the relevant table.
When I use
filename = r'excel.xlsx'
df = pd.read_excel(filename)
print(df)
I get
          Weekly Report Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0    December 1-7, 2014        NaN        NaN        NaN        NaN        NaN
1                   NaN        NaN        NaN        NaN        NaN        NaN
2                  Date        App   Campaign    Country       Cost   Installs
What I mean is that the column names are "Unnamed" because pandas takes the header from the first, irrelevant row.
If pandas read only the table, my columns would be Installs, Cost, etc., which is what I want.
How can I tell it to read starting from line 3?
Use skiprows to your advantage -
df = pd.read_excel(filename, skiprows=[0,1])
This should do it. pandas ignores the first two rows in this case -
skiprows : list-like
Rows to skip at the beginning (0-indexed)
More details are in the read_excel documentation.
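For reference, with the layout described in the question (two title rows before the real table), a quick sketch of two equivalent options:

import pandas as pd

filename = r'excel.xlsx'

# skip the two report-title rows so the table's own header row is used
df = pd.read_excel(filename, skiprows=[0, 1])

# equivalent alternative: use the third row (0-indexed 2) as the header
df = pd.read_excel(filename, header=2)

print(df.columns.tolist())  # expected with that layout: Date, App, Campaign, Country, Cost, Installs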
I am trying to read a content of a Wikipedia table in a pandas DataFrame.
In [110]: import pandas as pd
In [111]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]
However, this dataframe contains gibberish values in certain columns:
0 1 2 \
0 City/Metropolitan area Country Geographical zone[1]
1 Aberdeen United Kingdom Northern Europe
2 Abidjan Côte d'Ivoire (Ivory Coast) Africa
3 Abu Dhabi United Arab Emirates Western Asia
4 Addis Ababa Ethiopia Africa
3 \
0 Official est. Nominal GDP ($BN)
1 7001113000000000000♠11.3 (2008)[5]
2 NaN
3 7002119000000000000♠119 [6]
4 NaN
4 \
0 Brookings Institution[2] 2014 est. PPP-adjuste...
1 NaN
2 NaN
3 7002178300000000000♠178.3
4 NaN
5 \
0 PwC[3] 2008 est. PPP-adjusted GDP ($BN)
1 NaN
2 7001130000000000000♠13
3 NaN
4 7001120000000000000♠12
6 7
0 McKinsey[4] 2010 est. Nominal GDP ($BN) Other est. Nominal GDP ($BN)
1 NaN NaN
2 NaN NaN
3 7001671009999900000♠67.1 NaN
4 NaN NaN
For example, in the above dataframe, in the column for Official est. Nominal GDP the first entry is 11.3 (2008), but we see some big number before that. I thought this must be a problem with encoding, and I tried passing ASCII as well as UTF encodings:
In [113]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP", encoding = 'ASCII')[0]
However, even this doesn't solve the problem. Any ideas?
This is because of the invisible (in the browser) "sort key" elements:
<td style="background:#79ff76;">
  <span style="display:none" class="sortkey">7001130000000000000♠</span>
  13
</td>
Maybe there is a better way to clean it up, but here is a working solution based on finding these "sort key" elements and removing them from the table, then letting pandas parse the table HTML:
import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")
soup = BeautifulSoup(response.content, "html.parser")

table = soup.select_one("table.wikitable")
for span in table.select("span.sortkey"):
    span.decompose()

df = pd.read_html(str(table))[0]
print(df)
If you look at the HTML source of that page, you'll see that a lot of cells have a hidden <span> containing a "sortkey". These are the strange numbers you're seeing.
If you look at the documentation for read_html, you'll see this:
Expect to do some cleanup after you call this function. [...] We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.
Put them together and you have your answer: garbage in, garbage out. The table you're reading from has junk data in it, and you'll have to figure out how to handle that yourself.
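If you would rather clean up after read_html instead of pre-processing the HTML, one possible approach (a sketch; it assumes every sort key is a run of digits ending in the ♠ character, as in the rows shown above) is to strip that prefix from the string cells:

import re
import pandas as pd

df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]

def strip_sortkey(val):
    # remove a leading "<digits>♠" sort key from string cells; leave other values alone
    if isinstance(val, str):
        return re.sub(r"\d+♠", "", val)
    return val

df = df.apply(lambda col: col.map(strip_sortkey))
print(df.head())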