pandas read_csv ignore separator in last column - python

I have a file with the following structure (first row is the header, filename is test.dat):
ID_OBS LAT LON ALT TP TO LT_min LT_max STATIONNAME
ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0 Alert, Nunavut, Canada
How do I instruct pandas to read the entire station name (in this example, Alert, Nunavut, Canada) as a single element? I use delim_whitespace=True in my code, but that does not work, since the station name contains whitespace characters.
Running:
import pandas as pd
test = pd.read_csv('./test.dat', delim_whitespace=True, header=1)
print(test.to_string())
Produces:
ID_OBS LAT LON ALT TP TO LT_min LT_max STATIONNAME
ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0 Alert, Nunavut, Canada
Quickly reading through the tutorials did not help. What am I missing here?

I often approach these by writing my own little parser. In general there are ways to bend pandas to your will, but I find this way is often easier:
Code:
import re
import pandas as pd

def parse_my_file(filename):
    with open(filename) as f:
        for line in f:
            yield re.split(r'\s+', line.strip(), maxsplit=8)

# build the generator
my_parser = parse_my_file('test.dat')
# first element returned is the column names
columns = next(my_parser)
# build the data frame
df = pd.DataFrame(my_parser, columns=columns)
print(df)
Results:
ID_OBS LAT LON ALT TP TO LT_min LT_max \
0 ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0
STATIONNAME
0 Alert, Nunavut, Canada

Your pasted sample file is a bit ambiguous: it's impossible to tell by eye whether something that looks like a few spaces is actually a tab, for example.
In general, though, note that plain old Python is more expressive than pandas or the csv module (pandas's strength is elsewhere). There are even Python modules for recursive-descent parsers, which pandas obviously lacks. You can use regular Python to manipulate the file into a form that is easier for pandas to parse. For example:
>>> import re
>>> ['#'.join(re.split(r'[ \t]+', l.strip(), maxsplit=8)) for l in open('stuff.tsv') if l.strip()]
['ID_OBS#LAT#LON#ALT#TP#TO#LT_min#LT_max#STATIONNAME',
'ALT_NOA_000#82.45#-62.52#210.0#FM#0#0.0#24.0#Alert, Nunavut, Canada']
This changes the delimiter to '#'; if you write the result back to a file, you can then parse it with read_csv(..., sep='#').
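Putting the two steps together, a minimal end-to-end sketch (file names test.dat and test_hash.dat are placeholders; it assumes, as in the question, that STATIONNAME is the only column containing spaces):

```python
import re
import pandas as pd

# Recreate the sample file from the question.
with open('test.dat', 'w') as f:
    f.write("ID_OBS LAT LON ALT TP TO LT_min LT_max STATIONNAME\n")
    f.write("ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0 Alert, Nunavut, Canada\n")

# Rewrite with '#' as the delimiter; only the first 8 splits are taken,
# so spaces inside the last column (STATIONNAME) survive intact.
with open('test.dat') as src, open('test_hash.dat', 'w') as dst:
    for line in src:
        if line.strip():
            dst.write('#'.join(re.split(r'[ \t]+', line.strip(), maxsplit=8)) + '\n')

df = pd.read_csv('test_hash.dat', sep='#')
```

After this, df has nine columns and the station name is a single value.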

Related

How to extract only year(YYYY) from a CSV column with data like YYYY-YY

I am new to Python/Bokeh/Pandas.
I am able to plot a line graph in pandas/bokeh using the parse_dates option.
However, I have come across a dataset (.csv) where the column is as described below.
My code is below; it gives a blank graph when the column 'Year/Ports' is in YYYY-YY form, like 1952-53, 1953-54, 1954-55, etc.
Do I have to extract only the YYYY and plot that? It works, but I am sure that is not how the data is meant to be visualized.
If I extract only the YYYY using CSV or Notepad++ tools, there is no issue: the dates are read perfectly and I get a meaningful line graph.
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3

Replace double backslash in Pandas csv file

Python 3.9.5/Pandas 1.1.3
I have a very large csv file with values that look like:
Ac\\Nme Products Inc.
and all the values are different company names with double backslashes in random places throughout.
I'm attempting to get rid of all the double backslashes. It's not working in Pandas, but a simple test against the standalone value using str.replace does work.
Example:
org = "Ac\\Nme Products Inc."
result = org.replace("\\","")
print(result)
returns AcNme Products Inc. as the output, as I would expect.
However, using Pandas with the names in a csv file:
import pandas as pd
csv_input = pd.read_csv('/Users/me/file.csv')
csv_input.replace("\\", "")
csv_input.to_csv('/Users/me/file_revised.csv', index=False)
When I open the new file_revised.csv file, the value still shows as Ac\\Nme Products Inc.
EDIT 1:
Here is a snippet of file.csv as requested:
id,company_name,address,country
1000566,A1 Comm\\Nodity Traders,LEVEL 28 THREE PACIFIC PLACE 1 QUEEN'S RD EAST HK,TH
1000579,"A2 A Mf\\g. Co., Ltd.",53 YONG-AN 2ND ST. TAINAN TAIWAN,CA
1000585,"A2 Z Logisitcs Indi\\Na Pvt., Ltd.",114A/1 1ST FLOOR SOUTH RAJA ST TUTICORIN - 628 001 TAMILNADU - INDIA,PE
Pandas doesn't have a DataFrame-level string operation, but it can be updated per-column:
for col in csv_input.columns:
    if col == 'that_int_column':
        continue
    csv_input[col] = csv_input[col].str.replace(r"\\N", "")

First time organizing columns in text file

I am a first time python user. I have a text file of star data that I need to sort the columns and then just take the data from the V band. I have no idea how to start. Can someone please help even if it's to just get me started?
If you can install Pandas from here then sorting on any column can be done like this:
#!/usr/bin/python
# read_stars.py
import sys
import pandas as pd

filename = sys.argv[1]  # or 'star_data.txt'
sep = '\t'  # or ',' or ' ', etc.
df = pd.read_csv(filename, sep=sep)
print(df.sort_values('Band'))
Change the commented lines to better suit your needs. From your comment it seems the separator may be tabs (so first try '\t' and change it until parsing succeeds). sys.argv[1] uses the file passed as a command-line argument, like so:
$ python read_stars.py star_data.txt
JD Magnitude Uncertainty HQuncertainty Band Observer Code \
28 2.456420e+06 16.400 0.073 NaN V PSD
29 2.456421e+06 16.09 0.090 NaN V DKS
... (etc) ...
42 STD NaN NaN NaN
0 STD NaN NaN NaN
[58 rows x 23 columns]
Hope this helps!
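To then keep just the V-band rows, as the question asks, boolean indexing works; a sketch with a toy frame ('Band' and 'Magnitude' are assumed column names, taken from the sample output above):

```python
import pandas as pd

# Toy frame standing in for the star data.
df = pd.DataFrame({'Band': ['V', 'B', 'V'], 'Magnitude': [16.4, 15.1, 16.09]})

# Boolean indexing keeps only the rows measured in the V band.
v_band = df[df['Band'] == 'V']
```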

Problems importing text file into Python/Pandas

I am trying to load in a really messy text file into Python/Pandas. Here is an example of what the data in the file looks like
('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:36','viewed_search_results'),('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b','Madison, WI','2014-08-16 17:42:31','visit_start')
Here is my code
import pandas as pd
cols=['ID','Visit','Market','Event Time','Event Name']
table = pd.read_table(r'C:\Users\Desktop\Dump.txt', sep=',', header=None, names=cols, nrows=10)
But when I look at the table, it still does not read correctly.
All of the data is mainly on one row.
You could use ast.literal_eval to parse the data into a Python tuple of tuples, and then call pd.DataFrame on that:
import pandas as pd
import ast

cols = ['ID', 'Visit', 'Market', 'Event Time', 'Event Name']
with open(filename) as f:
    data = ast.literal_eval(f.read())

df = pd.DataFrame(list(data), columns=cols)
print(df)
yields
ID Visit \
0 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
1 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
2 41aa8fac-1bd8-4f95-918c-413879ed43f1 bcca257d-68d3-47e6-bc58-52c166f3b27b
Market Event Time Event Name
0 Seattle, WA 2014-08-05 10:06:24 viewed_home_page
1 Seattle, WA 2014-08-05 10:06:36 viewed_search_results
2 Madison, WI 2014-08-16 17:42:31 visit_start
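The same approach can be checked without a file by feeding literal_eval the sample string directly (trimmed here to a single record; the trailing comma keeps it a tuple of tuples):

```python
import ast
import pandas as pd

cols = ['ID', 'Visit', 'Market', 'Event Time', 'Event Name']

# One record from the question's sample data.
raw = ("(('9ebabd77-45f5-409c-b4dd-6db7951521fd',"
       "'9da3f80c-6bcd-44ae-bbe8-760177fd4dbc',"
       "'Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),)")

# literal_eval safely parses the tuple-of-tuples syntax, commas
# inside quoted values ('Seattle, WA') included.
data = ast.literal_eval(raw)
df = pd.DataFrame(list(data), columns=cols)
```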

Loading a generic Google Spreadsheet in Pandas

When I try to load a Google Spreadsheet in pandas
from StringIO import StringIO
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)
I get the following:
CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386
Why? I would think that one could identify the spreadsheet set of rows and columns with data and use the spreadsheets rows and columns as the dataframe index and columns respectively (with NaN for anything empty). Why does it fail?
This question of mine shows how: Getting Google Spreadsheet CSV into A Pandas Dataframe.
As one of the commenters noted, you have not asked for the data in CSV format; you have the "edit" request at the end of the URL.
You can use this code and see it work on the spreadsheet (which, by the way, needs to be public). It is possible to do private sheets as well, but that is another topic.
from StringIO import StringIO # got moved around in python3 if you're using that.
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.content
In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])
In [11]: df.head()
Out[11]:
City region Res_Comm \
0 Dothan South_Central-Montgomery-Auburn-Wiregrass-Dothan Residential
10 Foley South_Mobile-Baldwin Residential
12 Birmingham North_Central-Birmingham-Tuscaloosa-Anniston Commercial
38 Brent North_Central-Birmingham-Tuscaloosa-Anniston Residential
44 Athens North_Huntsville-Decatur-Florence Residential
mkt_type Quradate National_exp Alabama_exp Sales_exp \
0 Rural 2010-01-15 00:00:00 2 2 3
10 Suburban_Urban 2010-01-15 00:00:00 4 4 4
12 Suburban_Urban 2010-01-15 00:00:00 2 2 3
38 Rural 2010-01-15 00:00:00 3 3 3
44 Suburban_Urban 2010-01-15 00:00:00 4 5 4
The new Google spreadsheet url format for getting the csv output is
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id
Well, they changed the URL format slightly again; now you need:
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
I also found the following slight revisions to the above were needed to deal with Python 3:
from io import StringIO
and to get the file:
guid=0 #for the 1st sheet
act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
dataact = act.content.decode('utf-8') #To convert to string for Stringio
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=[0], thousands=',').sort_index()
actdf is now a full pandas dataframe with headers (column names)
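The decode step matters in Python 3 because requests returns bytes while StringIO needs str. A minimal offline sketch of that round-trip, with made-up CSV bytes standing in for act.content:

```python
from io import StringIO
import pandas as pd

# Made-up bytes payload, as requests' .content attribute would return.
payload = b"City,Quradate,National_exp\nDothan,2010-01-15,2\nFoley,2010-01-15,4\n"

text = payload.decode('utf-8')  # bytes -> str, as act.content.decode('utf-8') does above
df = pd.read_csv(StringIO(text), parse_dates=['Quradate'])
```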
Warning: this solution will make your data accessible by anyone.
In google sheet click file>publish to web. Then select what do you need to publish and select export format .csv. You'll have the link something like:
https://docs.google.com/spreadsheets/d/<your sheets key yhere>/pub?gid=1317664180&single=true&output=csv
Then simply:
import pandas as pd
pathtoCsv = r'https://docs.google.com/spreadsheets/d/<sheets key>/pub?gid=1317664180&single=true&output=csv'
dev = pd.read_csv(pathtoCsv)
print(dev)
Did you share the sheet?
Click the “Share” button in the top-right corner of your document.
Click on the “Get link” section and pick “Anyone with the link”.
This solved for me the problem.
If you didn't share it, Google Sheets returns an error page, which causes the pandas error. (The fact that the URL works and returns a CSV when opened/pasted in the browser is because you are logged in.)
The current Google Drive URL to export as csv is:
https://drive.google.com/uc?export=download&id=EnterIDHere
So:
import pandas as pd
pathtocsv = r'https://drive.google.com/uc?export=download&id=EnterIDHere'
df = pd.read_csv(pathtocsv)
