I'm trying to read a CSV file that's on GitHub with Python, using pandas. I have looked all over the web and tried some solutions I found on this site, but they don't work. What am I doing wrong?
I have tried this:
import pandas as pd
url = 'https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv'
df = pd.read_csv(url,index_col=0)
#df = pd.read_csv(url)
print(df.head(5))
You should provide the URL to the raw content. Try using this:
import pandas as pd
url = 'https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv'
df = pd.read_csv(url, index_col=0)
print(df.head(5))
Output:
alpha-2 ... intermediate-region-code
name ...
Afghanistan AF ... NaN
Åland Islands AX ... NaN
Albania AL ... NaN
Algeria DZ ... NaN
American Samoa AS ... NaN
Add ?raw=true at the end of the GitHub URL to get the raw file link.
In your case,
import pandas as pd
url = 'https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv?raw=true'
df = pd.read_csv(url, index_col=0)
print(df.head(5))
Output:
alpha-2 alpha-3 country-code iso_3166-2 region \
name
Afghanistan AF AFG 4 ISO 3166-2:AF Asia
Åland Islands AX ALA 248 ISO 3166-2:AX Europe
Albania AL ALB 8 ISO 3166-2:AL Europe
Algeria DZ DZA 12 ISO 3166-2:DZ Africa
American Samoa AS ASM 16 ISO 3166-2:AS Oceania
sub-region intermediate-region region-code \
name
Afghanistan Southern Asia NaN 142.0
Åland Islands Northern Europe NaN 150.0
Albania Southern Europe NaN 150.0
Algeria Northern Africa NaN 2.0
American Samoa Polynesia NaN 9.0
sub-region-code intermediate-region-code
name
Afghanistan 34.0 NaN
Åland Islands 154.0 NaN
Albania 39.0 NaN
Algeria 15.0 NaN
American Samoa 61.0 NaN
Note: This works only with GitHub links and not with GitLab or Bitbucket links.
You can copy/paste the URL and change two things:
Remove "blob"
Replace github.com with raw.githubusercontent.com
For instance this link:
https://github.com/mwaskom/seaborn-data/blob/master/iris.csv
Works this way:
import pandas as pd
pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
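Those two substitutions are mechanical enough to wrap in a tiny helper; the function name `to_raw_url` below is my own, not anything from pandas or GitHub:

```python
def to_raw_url(blob_url):
    """Turn a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    return (blob_url
            .replace('github.com', 'raw.githubusercontent.com', 1)
            .replace('/blob/', '/', 1))

url = 'https://github.com/mwaskom/seaborn-data/blob/master/iris.csv'
print(to_raw_url(url))
# https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv
```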
I recommend either using pandas, as you tried and as others here have explained, or, depending on the application, the Python CSV handler CommaSeperatedPython, which is a minimalistic wrapper around the native csv library.
That library returns the contents of a file as a two-dimensional string array. It's still at a very early stage, though, so for large-scale data analysis I would suggest pandas.
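I can't vouch for that package, but the same "file as a 2-D string array" behaviour takes only a few lines with the standard-library csv module (the inline CSV text below is invented sample data):

```python
import csv
import io

# Invented sample data; in practice you would open() a real file instead.
text = "name,code\nAfghanistan,AF\nAlbania,AL\n"

with io.StringIO(text) as f:
    rows = list(csv.reader(f))  # a list of rows, each a list of strings

print(rows[1])  # ['Afghanistan', 'AF']
```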
First convert the GitHub CSV file link to its raw form in order to access the data (the other answers here explain how).
import pandas as pd
url_data = 'https://raw.githubusercontent.com/oderofrancis/rona/main/Countries-Continents.csv'
data_csv = pd.read_csv(url_data)
data_csv.head()
I need to grab a table from a website by web scraping with the BeautifulSoup library in Python, from the URL https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html
When I run this code, I get an empty table:
import requests
from bs4 import BeautifulSoup
#
vaacineProgressResponse = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
vaacineProgressContent = BeautifulSoup(vaacineProgressResponse.content, 'html.parser')
vaacineProgressContentTable = vaacineProgressContent.find_all('table', class_="g-summary-table svelte-2wimac")
if vaacineProgressContentTable is not None and len(vaacineProgressContentTable) > 0:
vaacineProgressContentTable = vaacineProgressContentTable[0]
#
print ('the table =', vaacineProgressContentTable)
The output:
the table = []
Process finished with exit code 0
The screen shot below shows table in the web page (at left) and related Inspect element section (at right):
Very simple: it's because there's an extra space in the class you're searching for.
If you change the class to g-summary-table svelte-2wimac, the tags should be correctly returned.
The following code should work:
import requests
from bs4 import BeautifulSoup
#
url = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
soup = BeautifulSoup(url.content, 'html.parser')
table = soup.find_all('table', class_="g-summary-table svelte-2wimac")
print(table)
I've also done similar scraping on the NYTimes interactive site, and spaces can be very tricky: if you add an extra space or miss one, an empty result is returned.
If you cannot find the tags, I would recommend printing the entire document first using print(soup.prettify()) and find the desired tags you plan to scrape. Make sure you copy the exact text of the class name from the contents printed by BeautifulSoup.
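One way to sidestep exact-string matching altogether: BeautifulSoup treats `class_` as a match against any single class token, so you can search on just `g-summary-table` and ignore the svelte hash. A minimal sketch on stand-in HTML (not the live NYT page):

```python
from bs4 import BeautifulSoup

# Stand-in markup; the real page's class list also contains a svelte hash.
html = '<table class="g-summary-table svelte-2wimac"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# A single class token matches regardless of the element's other classes.
tables = soup.find_all('table', class_='g-summary-table')
print(len(tables))  # 1
```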
As an alternative, if you want to download the data in JSON format and then read it into pandas, you can do the following, reusing the same starting code from above and working off the soup object.
There are several APIs available (below are three), pulled out of the HTML like so:
import re
import pandas as pd
latest_dataset = soup.find(string=re.compile('latest')).splitlines()[2].split('"')[1]
requests.get(latest_dataset).json()
latest_timeseries = soup.find(string=re.compile('timeseries')).splitlines()[2].split('"')[3]
requests.get(latest_timeseries).json()
allwithrate = soup.find(string=re.compile('all_with_rate')).splitlines()[2].split('"')[1]
requests.get(allwithrate).json()
pd.DataFrame(requests.get(allwithrate).json())
Output of the last one:
geoid location last_updated total_vaccinations people_vaccinated display_name ... Region IncomeGroup country gdp_per_cap vaccinations_rate people_fully_vaccinated
0 MUS Mauritius 2021-02-17 3843.0 3843.0 Mauritius ... Sub-Saharan Africa High income Mauritius 11099.24028 0.3037 NaN
1 DZA Algeria 2021-02-19 75000.0 NaN Algeria ... Middle East & North Africa Lower middle income Algeria 3973.964072 0.1776 NaN
2 LAO Laos 2021-03-17 40732.0 40732.0 Laos ... East Asia & Pacific Lower middle income Lao PDR 2534.89828 0.5768 NaN
3 MOZ Mozambique 2021-03-23 57305.0 57305.0 Mozambique ... Sub-Saharan Africa Low income Mozambique 503.5707727 0.1943 NaN
4 CPV Cape Verde 2021-03-24 2184.0 2184.0 Cape Verde ... Sub-Saharan Africa Lower middle income Cabo Verde 3603.781793 0.4016 NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
243 GUF NaN NaN NaN NaN French Guiana ... NaN NaN NaN NaN NaN NaN
244 KOS NaN NaN NaN NaN Kosovo ... NaN NaN NaN NaN NaN NaN
245 CUW NaN NaN NaN NaN Curaçao ... Latin America & Caribbean High income Curacao 19689.13982 NaN NaN
246 CHI NaN NaN NaN NaN Channel Islands ... Europe & Central Asia High income Channel Islands 74462.64675 NaN NaN
247 SXM NaN NaN NaN NaN Sint Maarten ... Latin America & Caribbean High income Sint Maarten (Dutch part) 29160.10381 NaN NaN
[248 rows x 17 columns]
import pandas as pd
dane= pd.read_csv('WHO-COVID-19-global-data _2.csv')
dane
dane.groupby('Country')[['Cumulative_cases']].sum()
KeyError: 'Country'
I don't know why this code fails.
There are spaces at the beginning of the dane column names.
Remove them with the following line:
dane.rename(columns=lambda x: x.strip(), inplace=True)
dane.groupby('Country')[['Cumulative_cases']].sum()
Cumulative_cases
Country
Afghanistan 5702767
Albania 1300156
Algeria 5561691
American Samoa 0
Andorra 273756
... ...
Wallis and Futuna 14
Yemen 256353
Zambia 1323403
Zimbabwe 692447
occupied Palestinian territory, including east ... 4057017
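The effect is easy to reproduce on a toy frame; the leading spaces in the column names below are deliberate:

```python
import pandas as pd

# Note the leading spaces in both column names.
dane = pd.DataFrame({' Country': ['A', 'A', 'B'], ' Cumulative_cases': [1, 2, 3]})

# dane['Country'] would raise KeyError: the real label is ' Country'.
dane.rename(columns=lambda x: x.strip(), inplace=True)
totals = dane.groupby('Country')['Cumulative_cases'].sum()
print(totals['A'])  # 3
```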
I am following along with this project guide, and I reached a segment where I'm not exactly sure how the code works. Can someone please explain the following block of code:
to_drop = ['Edition Statement',
'Corporate Author',
'Corporate Contributors',
'Former owner',
'Engraver',
'Contributors',
'Issuance type',
'Shelfmarks']
df.drop(to_drop, inplace=True, axis=1)
This is the format of the csv file before the previous code is executed:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN
Corporate Contributors Former owner Engraver Issuance type \
0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
Which part of the code tells pandas to remove the columns and not rows? What does the inplace=True and axis=1 mean?
This is really basic pandas DataFrame usage, so I suggest working through a free tutorial. Anyway, this code block removes the columns stored in to_drop.
For a DataFrame named df, we remove columns with this command:
df.drop([...], axis=1, inplace=True)
where the list names the columns we want to drop, axis=1 means to drop them column-wise, and inplace=True makes the change permanent, i.e. it is applied to the original DataFrame rather than returned as a modified copy.
You can also write the above command as
df.drop(['Edition Statement',
'Corporate Author',
'Corporate Contributors',
'Former owner',
'Engraver',
'Contributors',
'Issuance type',
'Shelfmarks'], inplace=True, axis=1)
Here is a quite basic guide to pandas for your future queries: Introduction to pandas
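A toy frame (column names invented) makes the axis distinction concrete:

```python
import pandas as pd

df = pd.DataFrame({'keep': [1, 2], 'drop_me': [3, 4]})

# axis=1 drops by column label; axis=0 (the default) would drop by row label.
# inplace=True mutates df directly instead of returning a new frame.
df.drop(['drop_me'], axis=1, inplace=True)
print(list(df.columns))  # ['keep']
```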
I'm trying to read in a table using read_html
import requests
import pandas as pd
import numpy as np
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
resp = requests.get(url)
tables = pd.read_html(resp.text)
But I get this error
IndexError: list index out of range
Other Wiki pages work fine. What's up with this page and how do I solve the above error?
It seems the table can't be read because of the jQuery table sorter.
It's easy to read tables into a df with the selenium library when you're dealing with jQuery instead of plain HTML. You'll still need to do some cleanup, but this will get the table into a df.
You'll need to install the selenium library and download a web browser driver, too.
from selenium import webdriver
import pandas as pd

driver_path = r'C:\chromedriver_win32\chromedriver.exe'
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
driver = webdriver.Chrome(driver_path)
driver.get(url)
# Note the @id in the XPath; #id is CSS syntax and won't match anything here.
the_table = driver.find_element_by_xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr/td[2]/table')
data = the_table.text
df = pd.DataFrame([x.split() for x in data.split('\n')])
driver.close()
print(df)
0 1 2 3 4 5 \
0 Country (or dependent territory, None None
1 subnational area, etc.) Region Subregion Rate
2 listed Source None None None None
3 None None None None None None
4 Burundi Africa Eastern Africa 6.02 635
5 Comoros Africa Eastern Africa 7.70 60
6 Djibouti Africa Eastern Africa 6.48 60
7 Eritrea Africa Eastern Africa 8.04 390
8 Ethiopia Africa Eastern Africa 7.56 7,552
9 Kenya Africa Eastern Africa 5.00 2,466
10 Madagascar Africa Eastern Africa 7.69 1,863
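For what it's worth, when a table is plain HTML, pd.read_html alone parses it; the sketch below runs on an inline snippet with invented numbers rather than the live Wikipedia page, whose markup may have changed:

```python
from io import StringIO

import pandas as pd

# Invented miniature of the homicide-rate table.
html = """
<table>
  <tr><th>Country</th><th>Rate</th></tr>
  <tr><td>Burundi</td><td>6.02</td></tr>
  <tr><td>Kenya</td><td>5.00</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # returns a list of DataFrames
df = tables[0]
print(df)
```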
I have this:
df.loc['United Kingdom']
It is a series:
Rank 4.000000e+00
Documents 2.094400e+04
Citable documents 2.035700e+04
Citations 2.060910e+05
Self-citations 3.787400e+04
Citations per document 9.840000e+00
H index 1.390000e+02
Energy Supply NaN
Energy Supply per Capita NaN
% Renewable's NaN
2006 2.419631e+12
2007 2.482203e+12
2008 2.470614e+12
2009 2.367048e+12
2010 2.403504e+12
2011 2.450911e+12
2012 2.479809e+12
2013 2.533370e+12
2014 2.605643e+12
2015 2.666333e+12
Name: United Kingdom, dtype: float64
Now, I want to get
apply(lambda x: x['2015'] - x['2006'])
But it returned an error:
TypeError: 'float' object is not subscriptable
But if I get it separate:
df.loc['United Kingdom']['2015'] - df.loc['United Kingdom']['2006']
It worked okay.
How can I use apply with a lambda here?
Thanks.
PS: I want to apply it to a DataFrame:
Rank Documents Citable documents Citations Self-citations Citations per document H index Energy Supply Energy Supply per Capita % Renewable's ... 2008 2009 2010 2011 2012 2013 2014 2015 Citation Ratio Population
Country
China 1 127050 126767 597237 411683 4.70 138 NaN NaN NaN ... 4.997775e+12 5.459247e+12 6.039659e+12 6.612490e+12 7.124978e+12 7.672448e+12 8.230121e+12 8.797999e+12 0.689313 NaN
United States 2 96661 94747 792274 265436 8.20 230 NaN NaN NaN ... 1.501149e+13 1.459484e+13 1.496437e+13 1.520402e+13 1.554216e+13 1.577367e+13 1.615662e+13 1.654857e+13 0.335031 NaN
Japan 3 30504 30287 223024 61554 7.31 134 NaN NaN NaN ... 5.558527e+12 5.251308e+12 5.498718e+12 5.473738e+12 5.569102e+12 5.644659e+12 5.642884e+12 5.669563e+12 0.275997 NaN
United Kingdom 4 20944 20357 206091 37874 9.84 139 NaN NaN NaN ... 2.470614e+12 2.367048e+12 2.403504e+12 2.450911e+12 2.479809e+12 2.533370e+12 2.605643e+12 2.666333e+12 0.183773 NaN
If you want to apply it across your whole DataFrame, just calculate it directly:
df['2015'] - df['2006']
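If you specifically want apply with a lambda, the missing piece is axis=1, which makes each x a row Series instead of a column (toy numbers below):

```python
import pandas as pd

# Toy data standing in for the yearly GDP columns.
df = pd.DataFrame({'2006': [1.0, 2.0], '2015': [4.0, 7.0]}, index=['A', 'B'])

# With the default axis=0, each x is a column Series indexed by 'A'/'B',
# so x['2015'] fails; axis=1 passes one row at a time instead.
growth = df.apply(lambda x: x['2015'] - x['2006'], axis=1)
print(growth['A'])  # 3.0
```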