I am trying to read the last line from a CSV file stored in GCS.
My Code -
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/my_file.csv') as f:
    file = pd.read_csv(f)
    print(file.tail(1))
Output:
John Doe 120 jefferson st. Riverside NJ 08075
5 business-name Internal 6 NaN NaN NaN
Public Sample CSV file -
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
business-name,Internal,6
I just want to get the last line - business-name,Internal,6 - but that's not what I'm getting. I'm not sure why tail(1) is not working.
Can anyone please help me?
The pandas code below should solve your issue. You can use the pandas read_csv function to load the CSV data directly instead of opening the file yourself.
import pandas as pd
df = pd.read_csv('my-bucket/my_file.csv')
df.tail(1)
By the looks of it, the code is working correctly. By default pandas prints the header row as well. If you want to disable header printing, use the following.
print(file.tail(1).to_string(header=False))
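For reference, a minimal sketch of the full flow, assuming the file has no header row (as in the sample above), so that pandas does not consume the first data row as column names:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/my_file.csv') as f:
    # header=None stops pandas from treating the first data row as column names
    file = pd.read_csv(f, header=None)
print(file.tail(1))  # last row: business-name, Internal, 6, NaN, NaN, NaN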
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as a single column and parse it manually:
df = pd.read_csv(filename, sep='\t')  # no tabs in the data, so each line lands in one column
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'  # everything after the 4th comma goes into 'status'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
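One caveat with the extract approach: str.extract returns string (object) columns for every captured group, so you may want to convert the numeric fields afterwards, for example:
df[['latitude', 'longitude']] = df[['latitude', 'longitude']].astype(float)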
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # assuming there are 4 standard columns, and the 5th column has commas
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
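If you need the result in pandas afterwards, a minimal sketch building on the split approach (the filename 'data.csv' is just a placeholder):
import pandas as pd
cols = ['region', 'state', 'latitude', 'longitude', 'status']
rows = []
with open('data.csv') as f:   # placeholder filename
    next(f)                   # skip the header line
    for line in f:
        if not line.strip():
            continue          # skip blank lines
        # maxsplit=4 keeps any commas in the last (status) field intact
        rows.append(line.rstrip('\n').split(',', 4))
df = pd.DataFrame(rows, columns=cols)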
I'm new to Python and I'm trying to analyse this CSV file. It has a lot of different countries (as an example below).
country iso2 iso3 iso_numeric g_whoregion year e_pop_num e_inc_100k e_inc_100k_lo
Afghanistan AF AFG 4 EMR 2000 20093756 190 123
American Samoa AS ASM 16 WPR 2003 59117 5.8 5 6.7 3 3 4
Gambia GM GMB 270 AFR 2010 1692149 178 115 254 3000 1900 4300
I want to obtain only specific data, so only specific countries and only specific columns (like "e_pop_num"). How would I go about doing that?
The only basic code I have is:
import csv
import itertools
f = csv.reader(open('TB_burden_countries_2018-03-06.csv'))
for row in itertools.islice(f, 0, 10):
    print(row)
Which just lets me choose specific rows I want, but not necessarily the country I want to look at, or the specific columns I want.
If you can help me or point me to a guide so I can do my own learning, I'd very much appreciate it! Thank you.
I recommend using the pandas Python library. Please follow the article linked below; here is a code snippet to illustrate the idea.
import pandas as pd
df1 = pd.read_csv("https://pythonhow.com/wp-content/uploads/2016/01/Income_data.csv")
df1.loc["Alaska":"Arkansas", "2005":"2007"]
source of this information: https://pythonhow.com/accessing-dataframe-columns-rows-and-cells/
Pandas will probably be the easiest way. https://pandas.pydata.org/pandas-docs/stable/
To get it, run
pip install pandas
Then read the CSV into a DataFrame and filter it:
import pandas as pd
df = pd.read_csv('TB_burden_countries_2018-03-06.csv')
df = df[df['country'] == 'Gambia']
print(df)
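To also keep only specific columns (column names taken from the sample header in the question), a small follow-up on the filtered DataFrame:
subset = df[['country', 'year', 'e_pop_num']]
print(subset)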
with open('file') as f:
    fields = f.readline().split(',')  # the file is a CSV, so split on commas rather than tabs
print(fields)
If you supply more details about what you want to see, the answer would differ.
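If you want to stay with the csv module the question already uses, a minimal sketch (column names taken from the sample header) using csv.DictReader to filter rows by country and pick specific columns:
import csv
with open('TB_burden_countries_2018-03-06.csv', newline='') as f:
    for row in csv.DictReader(f):
        if row['country'] == 'Gambia':
            print(row['country'], row['year'], row['e_pop_num'])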
I have a csv file like below
a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand
I'm trying to get output like below, by removing repeated words in each row
a v s f
0 china usa and uk france
1 india australia usa uk
2 japan south africa new zealand
for which I'm doing
import pandas as pd
from io import StringIO
data="""a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand"""
df= pd.read_csv(StringIO(data).decode('UTF-8') )
from collections import Counter
def trans(x):
    d = [y for y in x]
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d
print df.apply(lambda x: trans(x), axis=1)
It works fine as long as I put the data into the variable 'data'. But if I want to import it from a CSV file by doing data = pd.read_csv("trial.csv"), it doesn't work; I get an error message saying 'DataFrame' object has no attribute 'decode'. How can I read the data from a CSV file and write the output to a CSV file using pandas? Where am I going wrong?
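For what it's worth, a minimal sketch of reading straight from the file and writing the cleaned result back out, assuming trial.csv has the same layout as the inline sample and reusing the trans() function above (the output filename is just an example); read_csv already returns a DataFrame, so no decode call is needed:
import pandas as pd
df = pd.read_csv('trial.csv')  # read_csv returns a DataFrame directly; no .decode() needed
# apply trans() row by row and rebuild a DataFrame with the original column names
cleaned = pd.DataFrame(df.apply(trans, axis=1).tolist(), columns=df.columns)
cleaned.to_csv('trial_cleaned.csv', index=False)  # example output filename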
I am trying to load a really messy text file into Python/Pandas. Here is an example of what the data in the file looks like
('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:36','viewed_search_results'),('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b','Madison, WI','2014-08-16 17:42:31','visit_start')
Here is my code
import pandas as pd
cols=['ID','Visit','Market','Event Time','Event Name']
table = pd.read_table(r'C:\Users\Desktop\Dump.txt', sep=',', header=None, names=cols, nrows=10)
But when I look at the table, it still does not read correctly.
All of the data is mainly on one row.
You could use ast.literal_eval to parse the data into a Python tuple of tuples, and then you could call pd.DataFrame on that:
import pandas as pd
import ast
cols=['ID','Visit','Market','Event Time','Event Name']
with open(filename, 'r') as f:  # text mode, so literal_eval receives a str (needed on Python 3)
    data = ast.literal_eval(f.read())
df = pd.DataFrame(list(data), columns=cols)
print(df)
yields
ID Visit \
0 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
1 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
2 41aa8fac-1bd8-4f95-918c-413879ed43f1 bcca257d-68d3-47e6-bc58-52c166f3b27b
Market Event Time Event Name
0 Seattle, WA 2014-08-05 10:06:24 viewed_home_page
1 Seattle, WA 2014-08-05 10:06:36 viewed_search_results
2 Madison, WI 2014-08-16 17:42:31 visit_start
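If you also want 'Event Time' as real timestamps rather than strings, a small follow-up (the timestamps in the sample are in a format pd.to_datetime handles directly):
df['Event Time'] = pd.to_datetime(df['Event Time'])  # e.g. '2014-08-05 10:06:24'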
When I try to load a Google Spreadsheet in pandas
from StringIO import StringIO
import pandas as pd
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)
I get the following:
CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386
Why? I would think that one could identify the spreadsheet's set of rows and columns with data and use the spreadsheet's rows and columns as the DataFrame index and columns respectively (with NaN for anything empty). Why does it fail?
This question of mine shows how: Getting Google Spreadsheet CSV into A Pandas Dataframe
As one of the commenters noted, you have not asked for the data in CSV format; you have the "edit" request at the end of the URL.
You can use this code and see it work on the spreadsheet (which, by the way, needs to be public). It is possible to do private sheets as well, but that is another topic.
from StringIO import StringIO  # moved to io in Python 3, if you're using that
import pandas as pd
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.content
In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])
In [11]: df.head()
Out[11]:
City region Res_Comm \
0 Dothan South_Central-Montgomery-Auburn-Wiregrass-Dothan Residential
10 Foley South_Mobile-Baldwin Residential
12 Birmingham North_Central-Birmingham-Tuscaloosa-Anniston Commercial
38 Brent North_Central-Birmingham-Tuscaloosa-Anniston Residential
44 Athens North_Huntsville-Decatur-Florence Residential
mkt_type Quradate National_exp Alabama_exp Sales_exp \
0 Rural 2010-01-15 00:00:00 2 2 3
10 Suburban_Urban 2010-01-15 00:00:00 4 4 4
12 Suburban_Urban 2010-01-15 00:00:00 2 2 3
38 Rural 2010-01-15 00:00:00 3 3 3
44 Suburban_Urban 2010-01-15 00:00:00 4 5 4
The new Google spreadsheet url format for getting the csv output is
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id
Well, they changed the URL format slightly again; now you need:
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
I also found I needed to do the following to deal with Python 3, a slight revision to the above:
from io import StringIO
and to get the file:
guid=0 #for the 1st sheet
act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
dataact = act.content.decode('utf-8') #To convert to string for Stringio
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=[0], thousands=',').sort_index()  # .sort() was removed in newer pandas; sort_index() is the equivalent
actdf is now a full pandas dataframe with headers (column names)
Warning: this solution will make your data accessible by anyone.
In Google Sheets, click File > Publish to the web. Then select what you need to publish and choose the .csv export format. You'll get a link something like:
https://docs.google.com/spreadsheets/d/<your sheets key yhere>/pub?gid=1317664180&single=true&output=csv
Then simply:
import pandas as pd
pathtoCsv = r'https://docs.google.com/spreadsheets/d/<sheets key>/pub?gid=1317664180&single=true&output=csv'
dev = pd.read_csv(pathtoCsv)
print(dev)
Did you share the sheet?
Click the “Share” button in the top-right corner of your document.
Click on the “Get link” section and pick “Anyone with the link”.
This solved the problem for me.
If you didn't share it, Google Sheets returns an error page, which causes the pandas error. (The fact that the URL works and returns a CSV when you open/paste it in the browser is because you are logged in.)
The current Google Drive URL to export as csv is:
https://drive.google.com/uc?export=download&id=EnterIDHere
So:
import pandas as pd
pathtocsv = r'https://drive.google.com/uc?export=download&id=EnterIDHere'
df = pd.read_csv(pathtocsv)