When I try to load a Google Spreadsheet in pandas
from StringIO import StringIO  # Python 2; in Python 3 use: from io import StringIO
import requests
import pandas as pd

r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)
I get the following:
CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386
Why? I would have thought one could identify the set of rows and columns that contain data and use the spreadsheet's rows and columns as the DataFrame index and columns respectively (with NaN for anything empty). Why does it fail?
This question of mine shows how: Getting Google Spreadsheet CSV into a Pandas Dataframe
As one of the commenters noted, you have not asked for the data in CSV format; you have the "edit" request at the end of the URL.
You can use this code and see it work on the spreadsheet (which, by the way, needs to be public). It is possible to read private sheets as well, but that is another topic.
from StringIO import StringIO  # moved to io in Python 3, if you're using that
import requests
import pandas as pd

r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.content
In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])
In [11]: df.head()
Out[11]:
City region Res_Comm \
0 Dothan South_Central-Montgomery-Auburn-Wiregrass-Dothan Residential
10 Foley South_Mobile-Baldwin Residential
12 Birmingham North_Central-Birmingham-Tuscaloosa-Anniston Commercial
38 Brent North_Central-Birmingham-Tuscaloosa-Anniston Residential
44 Athens North_Huntsville-Decatur-Florence Residential
mkt_type Quradate National_exp Alabama_exp Sales_exp \
0 Rural 2010-01-15 00:00:00 2 2 3
10 Suburban_Urban 2010-01-15 00:00:00 4 4 4
12 Suburban_Urban 2010-01-15 00:00:00 2 2 3
38 Rural 2010-01-15 00:00:00 3 3 3
44 Suburban_Urban 2010-01-15 00:00:00 4 5 4
The new Google spreadsheet url format for getting the csv output is
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id
Well, they changed the URL format slightly again; now you need:
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
I also found I needed a slight revision to the above to deal with Python 3:
from io import StringIO
and to get the file:
gid = 0  # for the 1st sheet
act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % gid)
dataact = act.content.decode('utf-8')  # convert bytes to str for StringIO
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=[0], thousands=',').sort_index()  # .sort() was removed in newer pandas
actdf is now a full pandas DataFrame with headers (column names).
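The decode-then-parse step can be exercised without any network call; a minimal sketch using made-up CSV bytes standing in for act.content:

```python
from io import StringIO

import pandas as pd

# Made-up bytes standing in for a requests response body (act.content)
raw = b"Quradate,City,National_exp\n2010-01-15,Dothan,2\n2010-02-15,Foley,4\n"

text = raw.decode("utf-8")  # bytes -> str, since read_csv needs a text buffer here
df = pd.read_csv(StringIO(text), index_col=0, parse_dates=[0])

print(df.shape)  # (2, 2)
```

The date column ends up as a proper datetime index, just as with the real sheet.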
Warning: this solution will make your data accessible by anyone.
In Google Sheets, click File > Publish to the web. Then select what you need to publish and choose the .csv export format. You'll get a link something like:
https://docs.google.com/spreadsheets/d/<your sheets key here>/pub?gid=1317664180&single=true&output=csv
Then simply:
import pandas as pd

pathtoCsv = r'https://docs.google.com/spreadsheets/d/<sheets key>/pub?gid=1317664180&single=true&output=csv'
dev = pd.read_csv(pathtoCsv)
print(dev)
Did you share the sheet?
Click the “Share” button in the top-right corner of your document.
Click on the “Get link” section and pick “Anyone with the link”.
This solved the problem for me.
If you didn't share, Google Sheets returns an error page, which causes the pandas error. (The URL works and returns a CSV when opened or pasted in the browser because you are logged in.)
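One way to fail fast on an unshared sheet is to check whether the response body looks like HTML before handing it to pandas; a hedged sketch (the parse_sheet_csv helper is mine, not a pandas or Google API):

```python
from io import StringIO

import pandas as pd

def parse_sheet_csv(text):
    # An unshared sheet comes back as Google's HTML error/login page,
    # which the pandas tokenizer then trips over with a confusing error.
    if text.lstrip().lower().startswith(("<!doctype", "<html")):
        raise ValueError("Got an HTML page, not CSV - is the sheet shared?")
    return pd.read_csv(StringIO(text))

df = parse_sheet_csv("a,b\n1,2\n")
print(df.shape)  # (1, 2)
```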
The current Google Drive URL to export as csv is:
https://drive.google.com/uc?export=download&id=EnterIDHere
So:
import pandas as pd
pathtocsv = r'https://drive.google.com/uc?export=download&id=EnterIDHere'
df = pd.read_csv(pathtocsv)
Related
I am new to Python/Bokeh/Pandas.
I am able to plot a line graph in pandas/bokeh using the parse_dates option.
However, I have come across a dataset (.csv) where the column is like below.
My code is below; it gives a blank graph when the column 'Year/Ports' is in YYYY-YY form, like 1952-53, 1953-54, 1954-55, etc.
Do I have to extract only the YYYY and plot? That works, but I am sure that is not how the data is meant to be visualized.
If I extract only the YYYY using CSV or Notepad++ tools, there is no issue: the dates are read perfectly and I get a good, meaningful line graph.
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3
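If the fiscal-year meaning of 1952-53 matters, the sliced years can also back a PeriodIndex rather than plain timestamps; a small sketch:

```python
import pandas as pd

s = pd.Series(['1952-53', '1953-54', '1954-55'])

# Label each fiscal year by its starting calendar year
periods = pd.PeriodIndex(s.str[:4], freq='Y')
print(list(periods.year))  # [1952, 1953, 1954]
```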
I am trying to scrape a table from a website using pandas. The code is shown below:
import pandas as pd
url = "http://mnregaweb4.nic.in/netnrega/state_html/empstatusnewall_scst.aspx?page=S&lflag=eng&state_name=KERALA&state_code=16&fin_year=2020-2021&source=national&Digest=s5wXOIOkT98cNVkcwF6NQA"
df1 = pd.read_html(url)[3]
df1.to_excel("combinedGP.xlsx", index=False)
In the resulting Excel file, the numbers are saved as text. Since I plan to build a file with around 1000 rows, I cannot manually change the data type. Is there another way to store them as actual values and not text? TIA
The website can be very unresponsive...
There are unwanted header rows, and two rows of column headers.
A simple way to manage this is a to_csv()/read_csv() round trip with appropriate parameters.
import pandas as pd
import io
url = "http://mnregaweb4.nic.in/netnrega/state_html/empstatusnewall_scst.aspx?page=S&lflag=eng&state_name=KERALA&state_code=16&fin_year=2020-2021&source=national&Digest=s5wXOIOkT98cNVkcwF6NQA"
df1 = pd.read_html(url)[3]
df1 = pd.read_csv(io.StringIO(df1.to_csv(index=False)), skiprows=3, header=[0,1])
# df1.to_excel("combinedGP.xlsx", index=False)
sample after cleaning up
S.No District HH issued jobcards No. of HH Provided Employment EMP. Provided No. of Persondays generated Families Completed 100 Days
S.No District SCs STs Others Total SCs STs Others Total No. of Women SCs STs Others Total Women SCs STs Others Total
0 1.0 ALAPPUZHA 32555 760 254085 287400 20237 565 132744 153546 157490 1104492 40209 6875586 8020287 7635748 1346 148 5840 7334
1 2.0 ERNAKULAM 36907 2529 212534 251970 15500 1517 68539 85556 82270 908035 104040 3788792 4800867 4467329 2848 301 11953 15102
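If any columns still arrive as strings after the round trip, they can be coerced explicitly with pd.to_numeric; a sketch on made-up values shaped like the table above:

```python
import pandas as pd

df = pd.DataFrame({"District": ["ALAPPUZHA", "ERNAKULAM"],
                   "Total": ["287400", "251970"]})

# errors="coerce" turns anything non-numeric into NaN instead of raising
df["Total"] = pd.to_numeric(df["Total"], errors="coerce")
print(df["Total"].sum())  # 539370
```

Writing this frame with to_excel then stores real numbers, not text.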
I want to download data from the USDA site with custom queries, so instead of manually selecting queries on the website I am wondering how to do this more handily in Python. I used requests to access the URL and read the content, but it is not intuitive to me how to pass the queries, make a selection, and download the data as CSV. Does anyone know an easy way to do this in Python? Is there a workaround to download the data from the URL with specific queries? Any idea?
this is my current attempt
here is the url that I am going to select data with custom queries.
import io
import requests
import pandas as pd
url="https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
so before reading the response into pandas, I need to pass the following queries for correct data selection:
Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"
It is not intuitive to me how to pass these queries with requests and then download the filtered data as CSV. Is there an efficient way of doing this in Python? Any thoughts? Thanks
A few details:
The simplest format is text rather than HTML. Got the URL from the HTML page's text-download link.
requests.get(params=) takes a dict. Built it up and passed it; no need to build the complete URL string by hand.
The text is clearly space-delimited, with a minimum of a double space between fields.
import io
import requests
import pandas as pd
url="https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType":"summary","species":"BEEF","portal":"ls","category":"Retail","format":"text"}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")
   Date        Region         Feature Rate  Outlets  Special Rate  Activity Index
0  02/05/2021  NATIONAL       69.40%        29,200   20.10%        81,650
1  02/05/2021  NORTHEAST      75.00%        5,500    3.80%         17,520
2  02/05/2021  SOUTHEAST      70.10%        7,400    28.00%        23,980
3  02/05/2021  MIDWEST        75.10%        6,100    19.90%        17,430
4  02/05/2021  SOUTH CENTRAL  57.90%        4,900    26.40%        9,720
5  02/05/2021  NORTHWEST      77.50%        1,300    2.50%         3,150
6  02/05/2021  SOUTHWEST      63.20%        3,800    27.50%        9,360
7  02/05/2021  ALASKA         87.00%        200      .00%          290
8  02/05/2021  HAWAII         46.70%        100      .00%          230
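The params dict above is just a structured way of building the query string; a stdlib sketch showing the URL that requests constructs, with no network call:

```python
from urllib.parse import urlencode

base = "https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType": "summary", "species": "BEEF", "portal": "ls",
     "category": "Retail", "format": "text"}

# requests does the equivalent of this encoding internally
full_url = f"{base}?{urlencode(p)}"
print(full_url)
```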
Just format the query data in the URL - it's actually a REST API:
To add more query data, as #mullinscr said, you can change the values on the left and press submit, then see the query's name in the URL (for example, the start date is called repDate).
If you hover on the Download as XML link, you will also discover you can specify the download format using format=<format_name>. Parsing the tabular data in XML using pandas might be easier, so I would append format=xml at the end as well.
category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"
# the website changes "-" to "%2F"
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")
url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}®ion={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"
# parse with pandas, etc...
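For the format=xml route, pandas (1.3+) can read record-style XML straight into a DataFrame with read_xml; a sketch on a tiny stand-in document, since I have not inspected the real response structure:

```python
import io

import pandas as pd

# Assumed shape of the XML records - the real element names may differ
xml = """<results>
  <record><Region>NATIONAL</Region><Outlets>29200</Outlets></record>
  <record><Region>NORTHEAST</Region><Outlets>5500</Outlets></record>
</results>"""

# parser="etree" uses the stdlib parser, avoiding the lxml dependency
df = pd.read_xml(io.StringIO(xml), parser="etree")
print(df.shape)  # (2, 2)
```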
I'm new to Python and I'm trying to analyse this CSV file. It has a lot of different countries (as an example below).
country iso2 iso3 iso_numeric g_whoregion year e_pop_num e_inc_100k e_inc_100k_lo
Afghanistan AF AFG 4 EMR 2000 20093756 190 123
American Samoa AS ASM 16 WPR 2003 59117 5.8 5 6.7 3 3 4
Gambia GM GMB 270 AFR 2010 1692149 178 115 254 3000 1900 4300
I want to try to obtain only specific data, so only specific countries and only specific columns (like "e_pop_num"). How would I go about doing that?
The only basic code I have is:
import csv
import itertools

f = csv.reader(open('TB_burden_countries_2018-03-06.csv'))
for row in itertools.islice(f, 0, 10):
    print(row)
Which just lets me choose specific rows I want, but not necessarily the country I want to look at, or the specific columns I want.
If you can help me or point me to a guide so I can do my own learning, I'd very much appreciate it! Thank you.
I recommend you use the pandas Python library. Please follow the article linked below; here is a snippet of code to illuminate your thoughts.
import pandas as pd

df1 = pd.read_csv("https://pythonhow.com/wp-content/uploads/2016/01/Income_data.csv")
# Label-based slicing needs matching row labels; the article sets the
# index first, e.g. df1 = df1.set_index("State")
df1.loc["Alaska":"Arkansas", "2005":"2007"]
source of this information: https://pythonhow.com/accessing-dataframe-columns-rows-and-cells/
Pandas will probably be the easiest way. https://pandas.pydata.org/pandas-docs/stable/
To get it run
pip install pandas
Then read the csv into a dataframe and filter it
import pandas as pd

df = pd.read_csv('TB_burden_countries_2018-03-06.csv')
df = df[df['country'] == 'Gambia']
print(df)
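read_csv can also drop unwanted columns at load time via usecols; a sketch with inline data standing in for the TB file:

```python
from io import StringIO

import pandas as pd

csv_text = ("country,iso2,year,e_pop_num\n"
            "Afghanistan,AF,2000,20093756\n"
            "Gambia,GM,2010,1692149\n")

# Only the named columns are materialised
df = pd.read_csv(StringIO(csv_text), usecols=["country", "e_pop_num"])
gambia = df[df["country"] == "Gambia"]
print(gambia["e_pop_num"].iloc[0])  # 1692149
```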
with open('file') as f:
    fields = f.readline().split("\t")
    print(fields)
If you supply more details about what you want to see, the answer would differ.
I am trying to load in a really messy text file into Python/Pandas. Here is an example of what the data in the file looks like
('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:36','viewed_search_results'),('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b','Madison, WI','2014-08-16 17:42:31','visit_start')
Here is my code
import pandas as pd
cols=['ID','Visit','Market','Event Time','Event Name']
table=pd.read_table(r'C:\Users\Desktop\Dump.txt', sep=',', header=None, names=cols, nrows=10)
But when I look at the table, it still does not read correctly.
All of the data is mainly on one row.
You could use ast.literal_eval to parse the data into a Python tuple of tuples, and then call pd.DataFrame on that:
import pandas as pd
import ast

cols = ['ID', 'Visit', 'Market', 'Event Time', 'Event Name']
with open(filename) as f:
    data = ast.literal_eval(f.read())
df = pd.DataFrame(list(data), columns=cols)
print(df)
yields
ID Visit \
0 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
1 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
2 41aa8fac-1bd8-4f95-918c-413879ed43f1 bcca257d-68d3-47e6-bc58-52c166f3b27b
Market Event Time Event Name
0 Seattle, WA 2014-08-05 10:06:24 viewed_home_page
1 Seattle, WA 2014-08-05 10:06:36 viewed_search_results
2 Madison, WI 2014-08-16 17:42:31 visit_start
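The same approach works on an in-memory string, which makes it easy to try without the dump file (the id/visit values below are made up):

```python
import ast

import pandas as pd

cols = ['ID', 'Visit', 'Market', 'Event Time', 'Event Name']
raw = ("('id-1','visit-1','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),"
       "('id-2','visit-2','Madison, WI','2014-08-16 17:42:31','visit_start')")

# literal_eval safely evaluates the text as a tuple of tuples
data = ast.literal_eval(raw)
df = pd.DataFrame(list(data), columns=cols)
print(len(df))  # 2
```

literal_eval only accepts Python literals, so unlike eval it cannot execute arbitrary code from the file.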