Problems importing text file into Python/Pandas - python

I am trying to load in a really messy text file into Python/Pandas. Here is an example of what the data in the file looks like
('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:36','viewed_search_results'),('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b','Madison, WI','2014-08-16 17:42:31','visit_start')
Here is my code
import pandas as pd
cols=['ID','Visit','Market','Event Time','Event Name']
table=pd.read_table(r'C:\Users\Desktop\Dump.txt',sep=',', header=None,names=cols,nrows=10)
But when I look at the table, it still does not read correctly.
All of the data is mainly on one row.

You could use ast.literal_eval to parse the data into a Python tuple of tuples, and then you could call pd.DataFrame on that:
import pandas as pd
import ast
cols=['ID','Visit','Market','Event Time','Event Name']
# filename is the path to the dump file; open it in text mode so literal_eval receives a str
with open(filename, 'r') as f:
    data = ast.literal_eval(f.read())
df = pd.DataFrame(list(data), columns=cols)
print(df)
yields
ID Visit \
0 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
1 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
2 41aa8fac-1bd8-4f95-918c-413879ed43f1 bcca257d-68d3-47e6-bc58-52c166f3b27b
Market Event Time Event Name
0 Seattle, WA 2014-08-05 10:06:24 viewed_home_page
1 Seattle, WA 2014-08-05 10:06:36 viewed_search_results
2 Madison, WI 2014-08-16 17:42:31 visit_start
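For a quick check without touching the file, the same approach can be tried on an inline string. This is only a minimal sketch: the sample string is abbreviated from the question's data (the IDs are shortened for readability), and it assumes the input holds at least two comma-separated tuples. With a single tuple, ast.literal_eval would return one flat tuple rather than a tuple of tuples.

```python
import ast
import pandas as pd

cols = ['ID', 'Visit', 'Market', 'Event Time', 'Event Name']

# Abbreviated sample in the same layout as the dump file (IDs shortened for illustration)
raw = ("('9ebabd77','9da3f80c','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),"
       "('41aa8fac','bcca257d','Madison, WI','2014-08-16 17:42:31','visit_start')")

# literal_eval parses the text as a Python literal: here, a tuple of tuples
data = ast.literal_eval(raw)
df = pd.DataFrame(list(data), columns=cols)
print(df)
```

Because the quoted strings are parsed as Python literals, the comma inside 'Seattle, WA' does not split the field, which is what goes wrong with sep=','.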

Related

How to extract only year(YYYY) from a CSV column with data like YYYY-YY

I am new to Python/Bokeh/Pandas.
I am able to plot line graph in pandas/bokeh using parse_date options.
However I have come across a dataset(.csv) where the column is like below
My code is as below which gives a blank graph if the column 'Year/Ports' is in YYYY-YY form like from 1952-53, 1953-54, 1954-55 etc.
Do I have to extract only the YYYY and plot because that works but I am sure that is not how the data is to be visualized.
If I extract only the YYYY using CSV or Notepad++ tools then there is no issue as the dates are read perfectly and I get a good meaningful line graph
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3
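As a sketch of how the two steps can be combined, the slice can be passed straight to pd.to_datetime with an explicit format, so the parsing does not depend on format inference. The 'OTHERS' values below are made up for illustration, and the new 'Year' column name is an assumption, not something from the original CSV.

```python
import pandas as pd

# Illustrative data only; the real file has more columns and rows
df = pd.DataFrame({'Year/Ports': ['1952-53', '1953-54', '1954-55'],
                   'OTHERS': [1, 2, 3]})

# Keep the first four characters (YYYY) and parse them as a year
df['Year'] = pd.to_datetime(df['Year/Ports'].str[:4], format='%Y')
print(df)
```

The 'Year' column can then be used as the x values for a datetime axis, while the original 'Year/Ports' strings stay available as labels.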

How to plot bar chart(s) from csv file using python?

I have this .csv file (that I can't edit):
,Denmark,Norway,Sweden
TotalCases,"78,354 ","35,546 ","243,129 "
Deaths,"823","328","6,681"
Recovered,"61,461","20,956",N/A
I want to make 3 separate bar charts/graphs , one for each section (TotalCases, Deaths, Recovered). However most guides I found online have the data presented the other way round, where the TotalCases are the columns instead of rows like in this scenario. What is the right way to do this?
Then just transpose your data frame and follow the examples!
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO("""
country,Denmark,Norway,Sweden
TotalCases,"78,354 ","35,546 ","243,129 "
Deaths,"823","328","6,681"
Recovered,"61,461","20,956",N/A
""")
df = pd.read_csv(s)
df_t = df.transpose()
df_t.columns = df_t.iloc[0, :]
df_t = df_t.iloc[1:, :]
df_t['country'] = df_t.index
Then use df_t and follow those examples.
In [45]: df_t
Out[45]:
country TotalCases Deaths Recovered country
Denmark 78,354 823 61,461 Denmark
Norway 35,546 328 20,956 Norway
Sweden 243,129 6,681 NaN Sweden
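Note that the values in df_t are still strings such as "78,354 ", so they need to be converted before plotting. The following is a minimal sketch of one way to do that; it reads the first column as the index instead of naming it, and the helper name to_number is hypothetical.

```python
from io import StringIO
import pandas as pd

csv_text = """,Denmark,Norway,Sweden
TotalCases,"78,354 ","35,546 ","243,129 "
Deaths,"823","328","6,681"
Recovered,"61,461","20,956",N/A
"""

# Use the first column (TotalCases, Deaths, Recovered) as the index, then transpose
df = pd.read_csv(StringIO(csv_text), index_col=0)
df_t = df.T

def to_number(col):
    # Remove thousands separators and stray spaces, then convert; bad values become NaN
    cleaned = col.str.replace(',', '', regex=False).str.strip()
    return pd.to_numeric(cleaned, errors='coerce')

df_num = df_t.apply(to_number)
print(df_num)
```

With matplotlib installed, something like df_num['TotalCases'].plot.bar() should then draw one bar chart per column, with the countries on the x axis.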

Read the last line from the CSV file through Pandas

I am trying to read the last line from a CSV file stored in GCS.
My Code -
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/my_file.csv') as f:
file = pd.read_csv(f)
print(file.tail(1))
Output:
John Doe 120 jefferson st. Riverside NJ 08075
5 business-name Internal 6 NaN NaN NaN
Public Sample CSV file -
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
business-name,Internal,6
I just want to get the last line - business-name,Internal,6 but that's not I'm getting. I'm not sure why tail(1) is not working.
Can anyone please help me?
The pandas code below should solve your issue. You can use the pandas read_csv function to load the CSV data instead of reading the file yourself.
import pandas as pd
df = pd.read_csv('my-bucket/my_file.csv')
df.tail(1)
By the looks of it, the code is working correctly. By default it prints the header row. If you want to disable header printing, use the following.
print(file.tail(1).to_string(header=False))
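The sample file has no header row, so read_csv treats the first record ("John, Doe, ...") as column names, which is why it appears above the last row in the output. A minimal sketch, assuming the file really has no header: pass header=None. The sample below is abbreviated from the public CSV in the question and read from a string rather than from GCS.

```python
from io import StringIO
import pandas as pd

# Abbreviated copy of the sample file; the last record has only three fields
sample = '''John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
,Blankman,,SomeTown, SD, 00298
business-name,Internal,6
'''

# header=None keeps the first record as data instead of using it as column names
df = pd.read_csv(StringIO(sample), header=None)
last = df.tail(1)
print(last.to_string(header=False))
```

The same header=None argument can be added to the pd.read_csv(f) call in the question. The short last record is padded with NaN for the missing fields.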

Pandas, read in file without a separator between columns

I want to read in a file that looks like this:
 1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
 1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
 4.99993537E+01 9.99987075E+01  .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09
So, each column is denoted by a 15 character sequence, but there's no official separator. Does pandas have a way of doing this?
Yes! It's called pd.read_fwf:
from io import StringIO
import pandas as pd
txt = """ 1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
 1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
 4.99993537E+01 9.99987075E+01  .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09"""
pd.read_fwf(StringIO(txt), widths=[15] * 5, header=None)
0 1 2 3 4
0 1.499981e-01 2.499968e-01 3.999948e-01 5.999922e-01 9.999871e-01
1 1.499981e+00 2.499968e+00 5.999922e+00 9.999871e+00 1.999974e+01
2 4.999935e+01 9.999871e+01 0.000000e+00 -2.706363e+03 -6.370275e+03
3 -1.975213e+04 -4.649283e+04 -1.094354e+05 -3.393231e+05 -7.987023e+05
4 -1.879993e+06 -5.829214e+06 -1.372079e+07 -2.263858e+07 -4.254295e+07
5 -7.601675e+07 -1.254220e+08 -2.356903e+08 -3.888620e+08 -7.307020e+08
6 -1.305466e+09 -2.153480e+09 -4.044550e+09 -4.548962e+09 -5.325339e+09
Let's look at using pd.read_fwf:
df = pd.read_fwf(csv_file,widths=[15]*5,header=None)
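Alternatively, if the columns in your file are separated by whitespace (for example, housing.data), you can split on it with a regular expression:
dataset = pd.read_csv('c:/1/housing.data', engine='python', sep=r'\s+', header=None)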
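To see why fixed widths matter for data like the question's, here is a small sketch that compares whitespace splitting with pd.read_fwf on two records where negative numbers run together. The records are rebuilt from 15-character fields taken from the question; the variable names are only for illustration.

```python
from io import StringIO
import pandas as pd

# Two records, each made of five 15-character fields with no separator
fields = [
    [" 4.99993537E+01", " 9.99987075E+01", "  .00000000E+00",
     "-2.70636350E+03", "-6.37027451E+03"],
    ["-1.97521328E+04", "-4.64928272E+04", "-1.09435407E+05",
     "-3.39323088E+05", "-7.98702345E+05"],
]
lines = ["".join(row) for row in fields]
txt = "\n".join(lines)

# Whitespace splitting cannot separate fields that touch each other
tokens_first = lines[0].split()
tokens_second = lines[1].split()
print(len(tokens_first), len(tokens_second))

# Fixed-width parsing cuts every line at 15-character boundaries
df = pd.read_fwf(StringIO(txt), widths=[15] * 5, header=None)
print(df)
```

So a whitespace separator is only a fit when every field is guaranteed to be followed by at least one space; otherwise widths (or colspecs) in read_fwf is the safer choice.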

Loading a generic Google Spreadsheet in Pandas

When I try to load a Google Spreadsheet in pandas
from StringIO import StringIO
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)
I get the following:
CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386
Why? I would think that one could identify the spreadsheet set of rows and columns with data and use the spreadsheets rows and columns as the dataframe index and columns respectively (with NaN for anything empty). Why does it fail?
This question of mine shows how: Getting Google Spreadsheet CSV into A Pandas Dataframe
As one of the commenters noted, you have not asked for the data in CSV format; you have the "edit" request at the end of the URL.
You can use this code and see it work on the spreadsheet (which by the way needs to be public..) It is possible to do private sheets as well but that is another topic.
from StringIO import StringIO # got moved around in python3 if you're using that.
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.content
In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])
In [11]: df.head()
Out[11]:
City region Res_Comm \
0 Dothan South_Central-Montgomery-Auburn-Wiregrass-Dothan Residential
10 Foley South_Mobile-Baldwin Residential
12 Birmingham North_Central-Birmingham-Tuscaloosa-Anniston Commercial
38 Brent North_Central-Birmingham-Tuscaloosa-Anniston Residential
44 Athens North_Huntsville-Decatur-Florence Residential
mkt_type Quradate National_exp Alabama_exp Sales_exp \
0 Rural 2010-01-15 00:00:00 2 2 3
10 Suburban_Urban 2010-01-15 00:00:00 4 4 4
12 Suburban_Urban 2010-01-15 00:00:00 2 2 3
38 Rural 2010-01-15 00:00:00 3 3 3
44 Suburban_Urban 2010-01-15 00:00:00 4 5 4
The new Google spreadsheet url format for getting the csv output is
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id
They have since changed the URL format slightly again; now you need:
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
I also found that, as a slight revision to the above, I needed the following import for Python 3:
from io import StringIO
and to get the file:
guid=0 #for the 1st sheet
act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
dataact = act.content.decode('utf-8') #To convert to string for Stringio
actdf = pd.read_csv(StringIO(dataact),index_col=0,parse_dates=[0], thousands=',').sort_index() # DataFrame.sort() was removed from pandas; sort_index() sorts by the index
actdf is now a full pandas dataframe with headers (column names)
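Since the export URL differs only in the sheet key and the gid, it can be built with a small helper. This is a sketch based on the URL format quoted above; the function name sheet_csv_url is hypothetical, and Google may change the format again.

```python
def sheet_csv_url(sheet_key, gid=0):
    """Build the CSV export URL for a Google Sheet (gid=0 is the first sheet)."""
    return ("https://docs.google.com/spreadsheets/d/{}/export?format=csv&gid={}"
            .format(sheet_key, gid))

# Example with the sheet key used in the answer above
url = sheet_csv_url('177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ')
print(url)

# pandas can read directly from a URL, so with network access and a shared sheet:
# df = pd.read_csv(url, index_col=0)
```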
Warning: this solution will make your data accessible by anyone.
In Google Sheets, click File > Publish to web. Then select what you need to publish and choose .csv as the export format. You'll get a link like:
https://docs.google.com/spreadsheets/d/<your sheets key here>/pub?gid=1317664180&single=true&output=csv
Then simply:
import pandas as pd
pathtoCsv = r'https://docs.google.com/spreadsheets/d/<sheets key>/pub?gid=1317664180&single=true&output=csv'
dev = pd.read_csv(pathtoCsv)
print(dev)
Did you share the sheet?
Click the “Share” button in the top-right corner of your document.
Click on the “Get link” section and pick “Anyone with the link”.
This solved the problem for me.
If you didn't share the sheet, Google Sheets returns an error page, which causes the pandas error. (The URL works and returns a CSV when you open or paste it in the browser only because you are logged in.)
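One way to catch this case early is to check whether the downloaded body looks like an HTML page before handing it to read_csv. The helper below is only a rough sketch: looks_like_html is a hypothetical name, and the check is a simple heuristic, not something Google documents.

```python
def looks_like_html(text):
    """Return True if the body appears to be an HTML page rather than CSV data."""
    head = text.lstrip()[:200].lower()
    return head.startswith('<!doctype html') or head.startswith('<html')

# Illustrative inputs: an error/login page versus real CSV content
html_body = '<!DOCTYPE html><html><head><title>Sign in</title></head></html>'
csv_body = 'City,region,Res_Comm\nDothan,South_Central,Residential\n'

print(looks_like_html(html_body), looks_like_html(csv_body))

# Possible use with requests (needs network access):
# r = requests.get(url)
# if looks_like_html(r.text):
#     raise ValueError('Got an HTML page instead of CSV; is the sheet shared?')
```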
The current Google Drive URL to export as csv is:
https://drive.google.com/uc?export=download&id=EnterIDHere
So:
import pandas as pd
pathtocsv = r'https://drive.google.com/uc?export=download&id=EnterIDHere'
df = pd.read_csv(pathtocsv)
