Reading bad csv files with garbage values - python

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines=False will remove the important rows and leave only the garbage values.
These are the possible column names, as given below:
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?

Open the csv file and find the row where the column names start:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name', 'NAME')),
        enumerate(fp)
    ))[0]
The row index is now stored in skip.
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
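Combined into one runnable snippet (a sketch; the case-insensitive check is a slight generalization so that any of the header variants from the question would match):
import pandas as pd

with open(r'data.csv') as fp:
    # index of the first line that starts with 'name' in any casing
    skip = next(i for i, line in enumerate(fp)
                if line.casefold().startswith('name'))

df = pd.read_csv('data.csv', skiprows=skip)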

I would like to suggest a slight modification/simplification to @RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function handles all your possibilities, in any order, but it will also skip trash lines that happen to contain spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
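Putting the pieces together, the whole read could look like this (a sketch that assumes the data rows are whitespace-delimited like the sample; trash lines after the header would still come through as partly empty rows):
import pandas as pd

with open(r'data.csv') as fp:
    names = next(line for line in fp if isheader(line))
    df = pd.read_csv(fp, names=names.strip().split(), sep=r'\s+')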

If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.
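For instance, skiprows also accepts a callable that is evaluated against each row index and returns True for rows to skip. A small sketch (the every-tenth-row rule is just for illustration):
import pandas as pd

# skip the first three rows and, hypothetically, every tenth row after that
df = pd.read_csv('your.csv', skiprows=lambda i: i < 3 or i % 10 == 0)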

Related

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract a value using its key, but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: dataframe from a csv file]
However, when I tried to extract the value using a key (e.g. df["id"]), I got an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because of this, the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data by manually processing each row, though there may be slicker options.
file = 'test.dat'
f = open(file, 'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
    ## remove obfuscating characters
    for c in '"{}\n':
        line = line.replace(c, '')
    line = line.split(',')
    ## extract values to two lists
    id_vals.append(line[0][3:])
    currency.append(line[1][9:])
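If you then want the result back as a dataframe, the two lists can be combined; a small follow-up sketch:
import pandas as pd

df = pd.DataFrame({'id': id_vals, 'currency': currency})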
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval(), something like this:
import json
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR
    print(eval(row.value)["id"])
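If you want the parsed fields as proper columns instead of just printing them, one sketch is to apply json.loads across the value column:
import json
import pandas as pd

parsed = df['value'].apply(json.loads)  # each cell becomes a dict
df['id'] = parsed.apply(lambda d: d['id'])
df['currency'] = parsed.apply(lambda d: d['currency'])
json.loads is also the safer choice here; eval will execute whatever expression happens to be in the cell.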

how to read data (using pandas?) so that it is correctly formatted?

I have a txt file with following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved in one line; a CSV file is not available.
I would like to have it as a data frame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep=',')
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips would be appreciated, thank you very much.
Is a pandas dataframe even suitable for this? the text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd
# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f
data = json.load(f)
# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
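If the files really run to 2 GB, as the question mentions, a single json.load call may not fit in memory. A streaming sketch with ijson (the prefix string below is an assumption based on the structure shown above) could look like:
import ijson
import pandas as pd

rows = []
with open("test.json", "rb") as f:
    # stream each row of the "values" array without loading the whole file
    for row in ijson.items(f, "results.item.series.item.values.item"):
        rows.append(row)

# column names would still need to come from a separate, smaller pass
df = pd.DataFrame(rows)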

Parsing Dirty Text File with Pandas Header Issue

I am trying to parse a text file created back in '99 that is slightly difficult to deal with. The headers are in the first row and are delimited by '^' (the entire file is ^ delimited). The issue is that there are characters that appear to be thrown in (long lines of spaces, for example, appear to separate the headers from the rest of the data points in the file). An example file is located at https://www.chicagofed.org/applications/bhc/bhc-home (my example references Q3 1999).
Issues:
1) Too many headers to manually create them and I need to do this for many files that may have new headers as we move forward or backwards throughout the time series
2) I need to recreate the headers from the file and then remove them so that I don't pollute my entire first row with header duplicates. I realize I could probably slice the dataframe [1:] after the fact and just get rid of it, but that's sloppy and I'm sure there's a better way.
3) the unreported fields by company appear to show up as "^^^^^^^^^", which is fine, but will pandas automatically populate NaNs in that scenario?
My attempt below simply tries to isolate the headers, but I'm really stuck on the larger issue of how the text file is structured. Any recommendations or obvious easy tricks I'm missing?
from zipfile import ZipFile
import pandas as pd

def main():
    # Driver
    FILENAME_PREFIX = 'bhcf'
    FILE_TYPE = '.txt'
    field_headers = []
    with ZipFile('reg_data.zip', 'r') as zip:
        with zip.open(FILENAME_PREFIX + '9909' + FILE_TYPE) as qtr_file:
            headers_df = pd.read_csv(qtr_file, sep='^', header=None)
            headers_df = headers_df[:1]
            headers_array = headers_df.values[0]
            parsed_data = pd.read_csv(qtr_file, sep='^', header=headers_array)
I tried this with the file you linked and another one I downloaded, I think from 2015:
import pandas as pd
df = pd.read_csv('bhcf9909.txt',sep='^')
first_headers = df.columns.tolist()
df_more_actual = pd.read_csv('bhcf1506.txt',sep='^')
second_headers = df_more_actual.columns.tolist()
print(df.shape)
print(df_more_actual.shape)
# df_more_actual has more columns than first one
# Normalize column names to avoid duplicate columns
df.columns = df.columns.str.upper()
df_more_actual.columns = df_more_actual.columns.str.upper()
new_df = df.append(df_more_actual)
print(new_df.shape)
The final dataframe has the rows of both csvs and the union of their columns.
You can do this for each quarter's csv, appending as you go, so that in the end you have all the rows and the union of all the columns.
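As a sketch of that loop (note that df.append is deprecated in recent pandas; pd.concat does the same job), assuming one ^-delimited file per quarter:
import pandas as pd

frames = []
for name in ['bhcf9909.txt', 'bhcf1506.txt']:  # one entry per quarter
    df_q = pd.read_csv(name, sep='^')
    df_q.columns = df_q.columns.str.upper()    # normalize to avoid duplicate columns
    frames.append(df_q)

# union of columns; fields missing in a quarter become NaN, which also answers
# issue 3: empty ^^^^ fields are read in as NaN automatically
new_df = pd.concat(frames, ignore_index=True)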

Creating dictionary from CSV file using lists

I have a csv file which contains four columns and many rows, each representing different data, e.g.
OID DID HODIS BEAR
1 34 67 98
I have already opened and read the csv file; however, I am unsure how I can make each column into a key. I believe the following format I have used in the code is best for the task I am creating.
Please see my code below; sorry if the explanation is a bit confusing.
Note that the #Values in column 1 is what I am stuck on, I am unsure how I can define each column.
for line in file_2:
    the_dict = {}
    OID = line.strip().split(',')
    DID = line.strip().split(',')
    HODIS = line.strip().split(',')
    BEAR = line.strip().split(',')
    the_dict['KeyOID'] = OID
    the_dict['KeyDID'] = DID
    the_dict['KeyHODIS'] = HODIS
    the_dict['KeyBEAR'] = BEAR
    dictionary_list.append(the_dict)
print(dictionary_list)
[screenshot of the code]
There is a great Python function for strings that will split strings based on a delimiter, .split(delim) where delim is the delimiter, and returns them as a list.
From the code that you have in your screenshot, you can use the following code to split on a ,, which I assume is your delimiter because you said that your file is a CSV.
...
for line in file_contents_2:
    the_dict = {}
    values = line.strip().split(',')
    OID = values[0]
    DID = values[1]
    HODIS = values[2]
    BEAR = values[3]
...
Also, in case you ever need to split a string based on whitespace, that is the default argument for .split() (the default argument is used when no argument is provided).
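For example, a quick illustration of both forms:
values = '1,34,67,98'.split(',')        # -> ['1', '34', '67', '98']
tokens = 'OID DID\tHODIS  BEAR'.split() # no argument: splits on any whitespace run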
I would do it like this, as whole code:
lod = []
with open(file, 'r') as f:
    l = f.readlines()
    for i in l[1:]:
        lod.append(dict(zip(l[0].rstrip().split(), i.split())))
split() doesn't need a parameter here; just use a simple for loop inside with open, and there is no need to know the keys in advance.
And if you care about empty dictionaries, do:
lod=list(filter(None,lod))
print(lod)
Output:
[{'OID': '1', 'DID': '34', 'HODIS': '67', 'BEAR': '98'}]
If you want integers:
lod=[{k:int(v) for k,v in i.items()} for i in lod]
print(lod)
Output:
[{'OID': 1, 'DID': 34, 'HODIS': 67, 'BEAR': 98}]
Another way to do it is using libraries like Pandas, that is powerful in working with tabular data. It is fast as we avoid loops. In the example below you only need Pandas and the name of the CSV file. I used io just to transform string data to mimic csv.
import pandas as pd
from io import StringIO
data=StringIO('''
OID,DID,HODIS,BEAR\n
1,34,67,98''') #mimic csv file
df = pd.read_csv(data,sep=',')
print(df.T.to_dict()[0])
In the end you need only a one-liner that chains commands: read the csv, transpose, and transform to a dictionary:
import pandas as pd
csv_dict = pd.read_csv('mycsv.csv',sep=',').T.to_dict()[0]
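Note that .T.to_dict()[0] only returns the dictionary for the first row; for a list of per-row dictionaries like the lod output above, .to_dict('records') works too (a sketch, assuming the same mycsv.csv):
import pandas as pd

lod = pd.read_csv('mycsv.csv', sep=',').to_dict('records')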

Python CSV - Check if index is equal on different rows

I'm trying to create code that checks if the value in the index column of a CSV is equivalent in different rows, and if so, find the most occurring values in the other columns and use those as the final data. Not a very good explanation, basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one is output?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC, it seems to be in the wrong variable format. I want each column to be outputted as just the 3 digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.
What about something like this? This is not using Pandas though, I am not a Pandas expert.
from collections import Counter

dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file
with open('out.csv', 'w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this:
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
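For comparison, a pandas sketch of the same idea, which also sidesteps the .0 and leading-zero issues from the question by reading everything as strings (dtype=str is an assumption about how you want the digits treated):
import pandas as pd

df = pd.read_csv('data.csv', dtype=str)  # strings keep leading zeros like '07'
# most common value per column; mode() sorts ties, so [0] picks the
# alphabetically first value when counts are equal
res = df.groupby('customer_ID').agg(lambda s: s.mode()[0])
res['ABC'] = res['A'] + res['B'] + res['C']
res[['month', 'ABC']].to_csv('answer.csv')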
