I have been trying to load the "Statewise" sheet from a public Google Sheets link as a Python dataframe.
The URL of this sheet differs in structure from the URLs in other examples on this site for reading a sheet into a dataframe.
The URL is:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd/pubhtml#
One standard way may be the following:
import pandas
googleSheetId = '<Google Sheets Id>'
worksheetName = '<Sheet Name>'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
googleSheetId,
worksheetName
)
df = pandas.read_csv(URL)
print(df)
But the present URL does not have the structure used here. Can someone help clarify? Thanks.
The published Google spreadsheet is actually an HTML page, so you should use read_html to load it into a list of pandas dataframes:
import pandas as pd
dfs = pd.read_html(url, encoding='utf8')
if lxml is available, or, if you use BeautifulSoup4:
dfs = pd.read_html(url, flavor='bs4', encoding='utf8')
You will get a list of dataframes and for example dfs[0] is:
0 1 2 3
0 1 id Banner Number_Of_Times
1 2 1 Don't Hoard groceries and essentials. Please e... 2
2 3 2 Be compassionate! Help those in need like the ... 2
3 4 3 Be considerate : While buying essentials remem... 2
4 5 4 Going out to buy essentials? Social Distancing... 2
5 6 5 Plan ahead! Take a minute and check how much y... 2
6 7 6 Plan and calculate your essential needs for th... 2
7 8 7 Help out the elderly by bringing them their gr... 2
8 9 8 Help out your workers and domestic help by not... 2
9 10 9 Lockdown means LOCKDOWN! Avoid going out unles... 1
10 11 10 Panic mode : OFF! ❌ESSENTIALS ARE ON! ✔️ 1
11 12 11 Do not panic! ❌ Your essential needs will be t... 1
12 13 12 Be a true Indian. Show compassion. Be consider... 1
13 14 13 If you have symptoms and suspect you have coro... 1
14 15 14 Stand Against FAKE News and WhatsApp Forwards!... 1
15 16 15 If you have any queries, Reach out to your dis... 1
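Note that for a published sheet, the real column names usually come through as the first data row. A minimal sketch of promoting that row to the header, using a made-up frame standing in for dfs[0]:

```python
import pandas as pd

# Toy stand-in for dfs[0]: read_html returns the sheet's real header
# ("id", "Banner", ...) as the first data row, under integer column labels.
raw = pd.DataFrame([["1", "id", "Banner"],
                    ["2", "1", "Don't hoard groceries"]])

raw.columns = raw.iloc[0]                        # promote first row to header
clean = raw.drop(index=0).reset_index(drop=True)
print(list(clean.columns))                       # ['1', 'id', 'Banner']
```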
You can use the following snippet (note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is the current equivalent):
from io import BytesIO
import requests
import pandas as pd
r = requests.get(URL)
data = r.content
df = pd.read_csv(BytesIO(data), index_col=0, on_bad_lines='skip')
My goal is to simply execute the pandas read_excel function. However, I'm running into a very strange situation: I am running pandas.read_excel on two very similar Excel files but getting substantially different results.
Update: I've been experimenting with the two files. I copied the data from the DOB column of the 2nd file into the 1st file to make the files look visually identical. However, I noticed some really interesting behaviour when using Ctrl+F in Microsoft Excel: when I leave the search box blank in the first file, it finds no matches, but repeating the same operation in the 2nd file finds 21 matches, one for each cell between E1 and G7. I suppose there are some blank/invisible cells in the 2nd file, and that's what's causing read_excel to behave differently.
Code
import pandas
data1 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_2.xlsx')
print(data2)
Output
Name DOB Class Year GPA
0 Redeye438 2008-09-22 Fresh 1
1 Redeye439 2009-09-20 Soph 2
2 Redeye440 2010-09-22 Junior 3
3 Redeye441 2011-09-20 Senior 4
4 Redeye442 2008-09-20 Fresh 4
5 Redeye443 2009-09-22 Soph 3
Name DOB Class Year GPA
Redeye438 2011-09-20 Fresh 1 NaN NaN NaN
Redeye439 2010-09-22 Soph 2 NaN NaN NaN
Redeye440 2009-09-20 Junior 3 NaN NaN NaN
Redeye441 2008-09-22 Senior 4 NaN NaN NaN
Redeye442 2011-09-22 Fresh 4 NaN NaN NaN
Redeye443 2010-09-20 Soph 3 NaN NaN NaN
Why are the columns mapped incorrectly for data2?
The excel files in question (the only difference is the data in the DOB column):
Excel file downloads
It looks like you're using an older version of Pandas because I can't reproduce the issue on the latest version.
import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
------------------
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
pandas : 1.5.0
As you can see below, the columns of data2 are mapped correctly.
import pandas as pd
data1 = pd.read_excel(r"C:\Users\abokey\Downloads\test_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
print(data2)
Name DOB Class Year GPA
0 Redeye438 2008-09-22 Fresh 1
1 Redeye439 2009-09-20 Soph 2
2 Redeye440 2010-09-22 Junior 3
3 Redeye441 2011-09-20 Senior 4
4 Redeye442 2008-09-20 Fresh 4
5 Redeye443 2009-09-22 Soph 3
Name DOB Class Year GPA
0 Redeye438 2011-09-20 Fresh 1
1 Redeye439 2010-09-22 Soph 2
2 Redeye440 2009-09-20 Junior 3
3 Redeye441 2008-09-22 Senior 4
4 Redeye442 2011-09-22 Fresh 4
5 Redeye443 2010-09-20 Soph 3
However, you're right that the formats of the two Excel files are not the same. In fact, looking at test_2.xlsx more carefully, it seems to carry a 7x3 block of blank cells. The latest version of pandas handles this kind of Excel file, since the empty cells are ignored when calling pandas.read_excel.
So in principle, upgrading your pandas version should fix the problem:
pip install --upgrade pandas
If the problem persists, clean your Excel file test_2.xlsx as follows:
Open up the file in Excel
Click on Find & Select
Choose Go To Special and Blanks
Click on Clear All
Save your changes
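If you prefer to clean it on the pandas side instead, dropping all-empty rows and columns after the read achieves the same effect. A small sketch with made-up data standing in for what read_excel returns from test_2.xlsx:

```python
import pandas as pd
import numpy as np

# Made-up frame mimicking the problem: real data plus an all-blank
# column and an all-blank row picked up from stray invisible cells.
df = pd.DataFrame({
    "Name": ["Redeye438", "Redeye439", np.nan],
    "DOB": ["2011-09-20", "2010-09-22", np.nan],
    "Unnamed: 2": [np.nan, np.nan, np.nan],
})

# Drop rows, then columns, that are entirely empty.
cleaned = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(cleaned.shape)                             # (2, 2)
```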
I am very new to web scraping and I am trying to understand how I can scrape all the zip files and regular files on this website. The end goal is to scrape all the data; I was originally thinking I could use pd.read_html, feed in a list of each link, and loop through each zip file.
I am very new to web scraping, so any help at all would be very useful. I have tried a few examples thus far; please see the code below:
import pandas as pd
pd.read_html("https://www.omie.es/en/file-access-list?parents%5B0%5D=/&parents%5B1%5D=Day-ahead%20Market&parents%5B2%5D=1.%20Prices&dir=%20Day-ahead%20market%20hourly%20prices%20in%20Spain&realdir=marginalpdbc",match="marginalpdbc_2017.zip")
This is what I would like the output to look like, except each zip file would need to be its own dataframe to work with/loop through. Currently, all it seems to be doing is reading the names of the zip files, not the actual data.
Thank you
To open a zip file and read the files inside it into a dataframe, you can use the following example:
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
zip_url = "https://www.omie.es/es/file-download?parents%5B0%5D=marginalpdbc&filename=marginalpdbc_2017.zip"
dfs = []
with ZipFile(BytesIO(requests.get(zip_url).content)) as zf:
    for file in zf.namelist():
        df = pd.read_csv(
            zf.open(file),
            sep=";",
            skiprows=1,
            skipfooter=1,
            engine="python",
            header=None,
        )
        dfs.append(df)

final_df = pd.concat(dfs)

# print first 10 rows:
print(final_df.head(10).to_markdown(index=False))
Prints:
|    0 |   1 |   2 |   3 |     4 |     5 |   6 |
|-----:|----:|----:|----:|------:|------:|----:|
| 2017 |   1 |   1 |   1 | 58.82 | 58.82 | nan |
| 2017 |   1 |   1 |   2 | 58.23 | 58.23 | nan |
| 2017 |   1 |   1 |   3 | 51.95 | 51.95 | nan |
| 2017 |   1 |   1 |   4 | 47.27 | 47.27 | nan |
| 2017 |   1 |   1 |   5 | 46.9  | 45.49 | nan |
| 2017 |   1 |   1 |   6 | 46.6  | 44.5  | nan |
| 2017 |   1 |   1 |   7 | 46.25 | 44.5  | nan |
| 2017 |   1 |   1 |   8 | 46.1  | 44.72 | nan |
| 2017 |   1 |   1 |   9 | 46.1  | 44.22 | nan |
| 2017 |   1 |   1 |  10 | 45.13 | 45.13 | nan |
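The download-and-parse pattern above can be exercised without the network by building a small zip in memory. The file body below imitates the marginalpdbc layout (one header line, ';'-separated rows, a '*' footer); the values are just the first two rows from the output above:

```python
from io import BytesIO
from zipfile import ZipFile
import pandas as pd

# Build an in-memory zip standing in for marginalpdbc_2017.zip.
buf = BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr(
        "marginalpdbc_20170101.1",
        "MARGINALPDBC;\n"
        "2017;1;1;1;58.82;58.82;\n"
        "2017;1;1;2;58.23;58.23;\n"
        "*",
    )

dfs = []
with ZipFile(BytesIO(buf.getvalue())) as zf:
    for name in zf.namelist():
        dfs.append(pd.read_csv(zf.open(name), sep=";", skiprows=1,
                               skipfooter=1, engine="python", header=None))

final_df = pd.concat(dfs)
print(final_df.shape)      # (2, 7): trailing ';' yields an empty 7th column
```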
I have the following dataframe:
ID mutex add atomic add cas add ys_add blocking ticket queued fifo
Cores
1 21.0 7.1 12.1 9.8 32.2 44.6
2 121.8 40.0 119.2 928.7 7329.9 7460.1
3 160.5 81.5 227.9 1640.9 14371.8 11802.1
4 188.9 115.7 347.6 1945.1 29130.5 15660.1
There is both a column index (ID) and a row index (Cores). When I use DataFrame.to_html(), I get a table like this:
Instead, I'd like a table with a single header row, composed of all the column names (but without the column index name ID) and with the row index name Cores in that same header row, like so:
I'm open to manipulating the dataframe prior to the to_html() call, or adding parameters to the to_html() call, but not messing around with the generated html.
Initial setup:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],
columns = ['attr_a', 'attr_b', 'attr_c', 'attr_c'])
df.columns.name = 'ID'
df.index.name = 'Cores'
df
ID attr_a attr_b attr_c attr_c
Cores
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Then set columns.name to 'Cores', and index.name to None. df.to_html() should then give you the output you want.
df.columns.name='Cores'
df.index.name = None
df.to_html()
Cores attr_a attr_b attr_c attr_c
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
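Putting the whole trick together, a quick probe of the generated markup confirms the header collapses to a single row (the string checks are just a sketch, not an HTML parser):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["attr_a", "attr_b"])
df.columns.name = "ID"
df.index.name = "Cores"

# Swap the names so 'Cores' lands in the lone header row.
df.columns.name = "Cores"
df.index.name = None
html = df.to_html()

thead = html.split("</thead>")[0]
print(thead.count("<tr"))          # 1: a single header row
```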
Good afternoon.
I am trying to solve this question using pandas data structures and related syntax from the Python scripting language. I have already graduated from a US university and am employed, and I am currently taking the "Python for Data Science" course just for professional development; it is offered online on Coursera's platform by the University of Michigan. I'm not sharing answers with anyone, as I abide by Coursera's Honor Code.
First, I was given this pandas dataframe concerning Olympic medals won by countries around the world:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ
Second, the question asked is, "Which country has won the most gold medals in summer games?"
Third, the hint I was given as to how to answer it is this:
"This function should return a single string value."
Fourth, I tried entering this as the answer:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
def answer_one():
if df.columns[:2]=='00':
df.rename(columns={col:'Country'+col[4:]}, inplace=True)
df_max = df[df[max('Gold')]]
return df_max['Country']
answer_one()
Fifth, I have tried various other answers like this in Coursera's auto-grader, but it keeps giving this error message:
There was a problem evaluating function answer_one, it threw an exception was thus counted as incorrect.
0.125 points were not awarded.
Could you please help me solve that question? Any hints/suggestions/comments are welcome for that.
Thanks, Kevin
You can use pandas' loc function to find the country name corresponding to the maximum of the "Gold" column:
import pandas as pd
data = [('Afghanistan', 13),
('Algeria', 12),
('Argentina', 23)]
df = pd.DataFrame(data, columns=['Country', 'Gold'])
df['Country'].loc[df['Gold'] == df['Gold'].max()]
The last line returns Argentina as answer.
Edit 1:
Edit 1: I just noticed you import the .csv file using pd.read_csv('olympics.csv', index_col=0, skiprows=1). If you leave out the skiprows argument, you will get a dataframe where the first line in the .csv file corresponds to the column names in the dataframe. This makes handling your dataframe in pandas much easier and is encouraged. Second, I see that with the index_col=0 argument you use the country names as indices in the dataframe. In this case you should use index rather than the loc function, as follows:
df.index[df['Gold'] == df['Gold'].max()][0]
import pandas as pd

def answer_one():
    max_gold = df['Gold'].max()
    df1 = df[df['Gold'] == max_gold]
    return df1.index[0]

answer_one()
The function idxmax() returns the index label of the maximum element in the column (in current pandas, argmax() returns its integer position instead):
return df['Gold'].idxmax()
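A minimal sketch on a made-up frame, with country names as the index as they would be after reading with index_col=0:

```python
import pandas as pd

# Toy data mirroring the Gold column of olympics.csv.
df = pd.DataFrame({"Gold": [13, 5, 18]},
                  index=["Afghanistan", "Algeria", "Argentina"])
print(df["Gold"].idxmax())       # Argentina
```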
I have a problem creating a pandas dataframe with a MultiIndex. In the data below, you will see that it's data for 2 banks; each bank has 2 assets and each asset has 3 features.
My data is similarly structured and I want to create a dataframe out of this.
Data = [[[2,4,5],[3,4,5]],[[6,7,8],[9,10,11]]]
Banks = ['Bank1', 'Bank2']
Assets = ['Asset1', 'Asset2']
Asset_feature = ['private','public','classified']
I have tried various ways to do this but I've always failed to create an accurate dataframe. The result should look something like this:
Asset1 Asset2
private public classified private public classified
Bank1 2 4 5 3 4 5
Bank2 6 7 8 9 10 11
Any help would be much appreciated.
import pandas as pd
import numpy as np
assets = ['Asset1', 'Asset2']
Asset_feature = ['private','public','classified']
Banks = ['Bank1', 'Bank2']
Data = [[[2,4,5],[3,4,5]],[[6,7,8],[9,10,11]]]
Data = np.array(Data).reshape(len(Banks),len(Asset_feature) * len(assets))
midx = pd.MultiIndex.from_product([assets, Asset_feature])
test = pd.DataFrame(Data, index=Banks, columns=midx)
test
which gives this output
Asset1 Asset2
private public classified private public classified
Bank1 2 4 5 3 4 5
Bank2 6 7 8 9 10 11
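Once the MultiIndex columns are in place, sub-frames and single cells come out naturally. A short sketch rebuilding the same frame and indexing into it:

```python
import numpy as np
import pandas as pd

Banks = ['Bank1', 'Bank2']
assets = ['Asset1', 'Asset2']
features = ['private', 'public', 'classified']
Data = np.array([[[2, 4, 5], [3, 4, 5]],
                 [[6, 7, 8], [9, 10, 11]]]).reshape(len(Banks), -1)
test = pd.DataFrame(Data, index=Banks,
                    columns=pd.MultiIndex.from_product([assets, features]))

print(test['Asset1'])                            # the bank-by-feature block for Asset1
print(test.loc['Bank2', ('Asset2', 'public')])   # a single cell: 10
```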