Pandas skipping malformed line in csv - python

I am trying to read a csv file with pandas. The file is very long and malformed in the middle, like so:
Date,Received Date,Tkr,Theta,Wid,Per
2007-08-03,2017/02/13 05:30:G,F,B,A,1
2007-08-06,2017/02/13 05:30:G,F,A,B,1
2007-08-07,2017/02/13 05:30:G,F,A,B,1
2007-08-,nan,,,,
2000-05-30 00:00:00,2017/02/14 05:30:F,D,10,1,1
2000-05-31 00:00:00,2017/02/14 05:30:F,D,10,1,1
The line that is failing is this:
full_frame = pd.read_csv(path, parse_dates=["Date"],error_bad_lines=False).set_index("Date").sort_index()[:date]
with the error
TypeError: unorderable types: str() > datetime.datetime()
File "/A/B/C.py", line 236, in load_ex
full_frame = pd.read_csv(path, parse_dates=["Date"],error_bad_lines=False).set_index("Date").sort_index()[:date]
date is just a variable that holds a given input date.
This is happening because of the broken line in the middle. I have tried
error_bad_lines=False, but that won't prevent my script from failing.
When I take out the bad line from my csv and run it, it works fine. This csv will be used as an input and I can't modify it at source, so I was wondering if there is a way in pandas to skip a line based on its length, or something else I can do to make it work without duplicating/modifying the file.
UPDATE
The bad line is stored in my data frame if I simply do a
read_csv
as 2007-08- NaN NaN NaN NaN NaN
UPDATE 2:
If I try to just do
full_frame = pd.read_csv(path, parse_dates=["Date"],error_bad_lines=False)
full_frame = full_frame.dropna(how="any")
# this drops the NaN row for sure
full_frame = full_frame.set_index("Date").sort_index()[:date]
it still gives the same error :(

So I gave this a quick shot. Your data has inconsistencies that may be of concern for your analysis, and you should investigate them; an analysis is only as good as its data quality. The slicing fails because parse_dates cannot convert the broken "2007-08-" entry, so pandas leaves the whole Date column as strings, and comparing a string index against your datetime raises the TypeError.
Here's some code (not the best, but it mostly gets the job done).
First, since your data needs some work, I read it in as raw text. Then I write a function to parse the dates. I collect the column names in one list and the rest of the data in another.
For all the data that needs to have dates, I loop over the data one line at a time and pass it through parse_dates.
parse_dates works by taking a list, grabbing its first item (the date part) and trying to convert it from a plain string to a date. Since not all rows have a date plus a time, I only grab the first 10 characters, i.e. just the date.
Once I have cleaner data, I pass it through pandas and obtain a dataframe. Then I set the date as the index. This could be improved upon, but given that this is not my job, I'll let you do the rest.
import pandas as pd
import datetime as dt

rawdata = []
with open("test.dat", "r") as stuff:
    for line in stuff:
        line1 = line[:-1]  # strip the trailing newline
        rawdata.append(line1.split(","))

def parse_dates(line):
    ## get the date-time and, since not all rows have date + time,
    ## cut it down to just the date part
    datepart = line[0][:10]
    try:
        result = dt.datetime.strptime(datepart, "%Y-%m-%d")  ## try converting to date
    except ValueError:
        result = None  ## malformed dates such as "2007-08-" become None
    line[0] = result  ## update
    return line

cols = rawdata[0]
data = rawdata[1:]
print(data)
data = [parse_dates(line) for line in data]
print(data)

df = pd.DataFrame(data=data, columns=cols)
print(df)
df.index = df['Date']
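As an aside, a shorter route (just a sketch, not tested against your full file) is to skip the manual parsing and let pandas coerce the unparseable dates to NaT, then drop them before indexing and slicing:
import pandas as pd

full_frame = pd.read_csv(path)
# errors="coerce" turns unparseable values such as "2007-08-" into NaT
full_frame["Date"] = pd.to_datetime(full_frame["Date"], errors="coerce")
full_frame = full_frame.dropna(subset=["Date"])
full_frame = full_frame.set_index("Date").sort_index()[:date]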
Also, a simple Google search shows plenty of ways of handling dates with Python+pandas. Here is one link I found:
https://chrisalbon.com/python/strings_to_datetime.html

Related

In Pandas, how can I extract a certain value using a key from a dataframe imported from a csv file?

Using Pandas, I'm trying to extract a value using its key, but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: the dataframe created from the csv file]
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because of this, the pandas dataframe has a single column, value, which holds entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df with row titles and columns of data. One option is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file, 'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
    ## remove obfuscating characters
    for c in '"{}\n':
        line = line.replace(c, '')
    line = line.split(',')
    ## extract values to two lists
    id_vals.append(line[0][3:])
    currency.append(line[1][9:])
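From there, if you want a dataframe rather than two lists, a minimal sketch (assuming pandas is imported as pd) is:
import pandas as pd

df = pd.DataFrame({'id': id_vals, 'currency': currency})
print(df['id'])  # 1234, 5678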
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read it as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# create a new csv file and save the cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# create a pandas dataframe from the cleaned file
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to parse each row of your dataframe using json.loads() or eval(),
something like this:
import json
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR
    print(eval(row.value)["id"])
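If you want whole columns rather than printed values, a vectorized variant (a sketch, assuming the JSON strings sit in a column named value as above) is:
import json
import pandas as pd

parsed = df['value'].apply(json.loads).apply(pd.Series)
print(parsed['id'])        # 1234, 5678
print(parsed['currency'])  # USD, EUR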

Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^ delimited). The issue is that there are characters that appear to be thrown in; long lines of spaces, for example, appear to separate the headers from the rest of the data points in the file. (An example file is located at https://www.chicagofed.org/applications/bhc/bhc-home; my example was referencing Q3 1999.)
Issues:
1) There are too many headers to create them manually, and I need to do this for many files that may have new headers as we move forward or backward through the time series.
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) The unreported fields by company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below simply tries to isolate the headers, but I'm really stuck on the larger issue of the way the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd

def main():
    # Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^', header=headers_array)
I tried this with the file you linked and one I downloaded, I think from 2015:
import pandas as pd

df = pd.read_csv('bhcf9909.txt', sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt', sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than the first one

# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()

new_df = df.append(df_more_actual)
print(new_df.shape)
The final dataframe has the rows of both csvs and the union of their columns.
You can do this for the csv of each quarter, appending as you go, so that you end up with all of the rows and the union of all of the columns.
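If you have many quarters to combine, here is a loop sketch (assuming the quarterly files are extracted and follow the bhcf*.txt naming used above; pd.concat is the more scalable equivalent of repeated append):
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('bhcf*.txt')):
    quarter_df = pd.read_csv(path, sep='^')
    quarter_df.columns = quarter_df.columns.str.upper()  # normalize names to avoid duplicate columns
    frames.append(quarter_df)

new_df = pd.concat(frames, ignore_index=True, sort=False)
print(new_df.shape)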

Reading bad csv files with garbage values

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines=False will remove the important stuff and leave only the garbage values.
These are the possible column names, as given below:
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How can I achieve this?
Open the csv file and find the row where the column names start:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name', 'NAME')),
        enumerate(fp)
    ))[0]
The row index will be stored in the skip variable, which you can then pass to read_csv:
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to #RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all of your listed possibilities, in any order, but it will also currently skip trash lines containing spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use the skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.
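For instance, skiprows also accepts a list of row indices (or a callable), so for the exact sample above you could keep the header row and also drop the two garbage lines that follow it; a sketch, assuming whitespace-separated values:
import pandas as pd

# rows 0-2 are the garbage lines before the header (row 3), rows 4-5 the garbage after it
df = pd.read_csv('your.csv', sep=r'\s+', skiprows=[0, 1, 2, 4, 5])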

Filling data into a pandas dataframe using a loop with a text file that has missing data

I am working on large log files (4 GB) with 1000s of variables (ABCD, GFHTI, AAAA, BBBB, ...), but I am only interested in 50 of these variables (ABCD, GFHTI, ...). The structure of the log file is as follows:
20100101_00:01:33.436-92.451 BLACKBOX ABCD ref 2183 value 24
20100101_00:01:33.638-92.651 BLACKBOX GFHTI ref 2183 value 25
20100101_00:01:33.817-92.851 BLACKBOX AAAA ref 2183 value 26 (Not interested in this one)
20100101_00:01:34.017-93.051 BLACKBOX BBBB ref 2183 value 27 (Not interested in this one)
I am trying to make a pandas data frame out of this log file that looks like this:
Time ABCD GFHTI
20100101_00:01:33.436-92.451 24 NaN
20100101_00:01:33.638-92.651 NaN 25
I could do this by using a loop and appending to a pandas data frame, but that is not very efficient. I can find the values and dates of the variables of interest in the log files, but I don't know how to put NaN for the rest of the variables for that specific date and time, and then at the end convert it all to a data frame.
I really appreciate if anyone can help.
Here is part of my code
ListOfData = {}
fruit = {ABCD, GFHTI}
for file in FileList:
    i = i + 1
    thefile = open('CleanLog' + str(i) + '.txt', 'w')
    with open(file, 'rt') as in_file:
        i = 0
        for linenum, line in enumerate(in_file):  # Keep track of line numbers.
            if fruit.search(line) != None:  # If substring search finds a match,
                i = i + 1
                Loc = (fruit.search(line))
                d = [{'Time': line[0:17], Loc.group(0): line[Loc.span()[1]:-1]}]
                for word in Key:
                    if word == Loc.group(0):
                        ListOfData.append(d)
You can parse the log file and only return the information of interest to the DataFrame constructor.
To parse the log lines I'm using a regex here, but the actual parsing function should depend on your log format; I also assume the log file is at the path log.txt relative to where this script is run.
import pandas as pd
import re

def parse_line(line):
    code_pattern = r'(?<=BLACKBOX )\w+'
    value_pattern = r'(?<=value )\d+'
    code = re.findall(code_pattern, line)[0]
    value = re.findall(value_pattern, line)[0]
    ts = line.split()[0]
    return ts, code, value

def parse_filter_logfile(fname):
    with open(fname) as f:
        for line in f:
            data = parse_line(line)
            if data[1] in ['ABCD', 'GFHTI']:
                # only yield rows that match the filter
                yield data
Then feed that generator to construct a data frame
logparser = parse_filter_logfile('log.txt')
df = pd.DataFrame(logparser, columns = ['Time', 'Code', 'Value'])
Finally, pivot the data frame using either of the two statements below:
df.pivot(index='Time', columns='Code')
df.set_index(['Time', 'Code']).unstack(-1)
outputs the following:
                             Value
Code                          ABCD GFHTI
Time
20100101_00:01:33.436-92.451    24  None
20100101_00:01:33.638-92.651  None    25
Hopefully you have enough information to tackle your log file. The tricky part here is dealing with the log line parsing, and you'd have to adapt my example function to get it right.
When you work with pandas, there is no need to read the file by hand in a loop:
data = pd.read_csv('CleanLog.txt', sep='\s+', header=None)
Use time (#0) and variable name (#2) as index, keep the column with variable values (#6).
columns_of_interest = ['ABCD','GFHTI']
data.set_index([0,2])[6].unstack()[columns_of_interest].dropna(how='all')
#2 ABCD GFHTI
#0
#20100101_00:01:33.436-92.451 24.0 NaN
#20100101_00:01:33.638-92.651 NaN 25.0
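Since the logs are around 4 GB, you may not want to load a whole file at once; here is a sketch using read_csv's chunksize parameter to filter each chunk down to the variables of interest before concatenating:
import pandas as pd

columns_of_interest = ['ABCD', 'GFHTI']
filtered = []
for chunk in pd.read_csv('CleanLog.txt', sep=r'\s+', header=None, chunksize=1_000_000):
    filtered.append(chunk[chunk[2].isin(columns_of_interest)])  # column #2 holds the variable name
data = pd.concat(filtered, ignore_index=True)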

Reading in header information from csv file using Pandas

I have a data file that has 14 lines of header. In the header, there is the metadata for the latitude-longitude coordinates and time. I am currently using
pandas.read_csv(filename, delimiter=",", header=14)
to read in the file but this just gets the data and I can't seem to get the metadata. Would anyone know how to read in the information in the header? The header looks like:
CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78
You have to parse your metadata header yourself, yet you can do it in an elegant manner in one pass, even using it on the fly so that you can extract data out of it, check the correctness of the file, etc.
First, open the file yourself:
f = open(filename)
Then, do the work to parse each metadata line and extract data out of it. For the sake of the explanation, I'm just skipping these rows:
for i in range(13):  # skip the first 13 lines that are useless for the columns definition
    f.readline()     # use the resulting string for metadata extraction
Now you have the file pointer ready on the unique header line you want to use to load the DataFrame. The cool thing is that read_csv accepts file objects! Thus you start loading your DataFrame right away now:
pandas.read_csv(f, sep=",")
Note that I don't use the header argument, as I consider from your description that only that one last header line is useful for your dataframe. You can build and adjust header parsing values / rows to skip from that example.
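For example, here is a sketch of that kind of on-the-fly metadata parsing, assuming the layout shown above (one title line, eleven "KEY = value" lines per NUMBER_HEADERS, then the column-name line followed by a units line):
import pandas as pd

metadata = {}
with open(filename) as f:
    f.readline()                      # skip the first line, e.g. "CSD,20160315SSIO"
    for _ in range(11):               # the "KEY = value" metadata lines
        key, _, value = f.readline().partition('=')
        metadata[key.strip()] = value.strip()
    df = pd.read_csv(f, sep=',', skiprows=[1])  # row 0 is the column-name line; skip the units row

print(metadata['LATITUDE'], metadata['LONGITUDE'])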
Although the following method does not use pandas, I was able to extract the header information using the csv module.
import csv

with open(fname) as csvfile:
    forheader_IO2016 = csv.reader(csvfile, delimiter=',')
    header_IO2016 = []
    for row in forheader_IO2016:
        header_IO2016.append(row[0])
    date = header_IO2016[7].split(" ")[2]
    time = header_IO2016[8].split(" ")[2]
    lat = float(header_IO2016[9].split(" ")[2])
    lon = float(header_IO2016[10].split(" ")[4])
