Good evening,
May I please get advice on a section of code that I have? Here is the code:
import pandas as pd
import numpy as np
import os
import logging
#logging.getLogger('tensorflow').setLevel(logging.ERROR)
os.chdir('/media/Maps/test_run/')
coasts_data='new_coastline_combined.csv'
coasts = pd.read_csv(coasts_data, header=None, encoding='latin1')
#Combine all data into a single frame
frames = [coasts]
df = pd.concat(frames)
n = 1000 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
l = []
for index, frame in enumerate(list_df):
    vals = frame.iloc[:, 2].values
    # if any values in this part of the frame are wrong, store index for deletion
    if np.any(vals < -100000):
        l.append(index)
for x in sorted(l, reverse=True):
    del list_df[x]
df1 = pd.concat(list_df)
df1.to_csv(r'/media/test_run_6/test.csv', index=False, header=False)
The code takes data from a CSV file, breaks it into groups of 1,000 rows each, determines whether erroneous values are present, and deletes any group that contains them. It works fine on most CSV files. However, on one CSV file, I'm getting the following error:
TypeError: '<' not supported between instances of 'str' and 'int'
The error begins at this section of code:
if np.any(vals < -100000):
I suspect (and please correct me if I'm wrong) that there are null (empty) cells in this particular column of values inside the csv (the csv is 6,000,000 rows deep, btw).
May I please get help on finding out what the problem is and how to fix it? Thank you,
I recommend catching the error to find out which line is causing it, and then manually checking what problem has occurred in your data.
Something like this:
try:
    if np.any(vals < -100000):
        l.append(index)
except TypeError:
    print(vals)
    print(index)
If it's really just empty cells, you could check for that and ignore those cells.
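For example, here is a minimal sketch (assuming the offending entries are empty cells or stray strings) that coerces the column to numeric before comparing, so those entries become NaN and no longer raise the TypeError:
# Coerce column 2 to numeric; empty cells and stray strings become NaN,
# and NaN < -100000 evaluates to False, so the check is safe.
vals = pd.to_numeric(frame.iloc[:, 2], errors='coerce').values
if np.any(vals < -100000):
    l.append(index)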
As a beginner to Python and coding altogether, I have been trying for a while to make a script to clean up a CSV file I have for training.
Thanks to df.dropna I have successfully taken out all the blank cells, but from time to time there are wrong 'text' inputs in my CSV that I just can't find a way to fix! Take this sample row from my file as an example:
0,0.0,0.0,0.0,zero,0.0,0.0
I want my script to be able to scan my CSV file, find those string errors, and erase the whole row they are in.
I kinda made up this code in an effort to go row by row through each column, search for my errors, and erase them, but nothing I do works:
Totalrows = len(df.index)
TotalColumns = df.shape[1]
CurrentColumn = 0
while CurrentColumn < TotalColumns:
    CurrentRow = 1
    while CurrentRow < Totalrows:
        if df.loc[CurrentRow, CurrentColumn] == str:
            df.drop(CurrentRow, axis=0, inplace=True)
        CurrentRow + 1
    CurrentColumn + 1
The solution I find most of the time is df['Date'] = pd.to_datetime(df['Date']), but that doesn't work for me, as my cells don't have a date value in them. So what should I do?
OK, I think I managed to do it using a simple try/except block.
I loop over the dataframe rows and convert each one of them to a float array. If the conversion raises an exception, the row is deleted from the df.
At the end, I write the resulting df to a .csv file.
import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/aless/Desktop/test.csv')  # your initial csv file
print(df.values)

for id, row in enumerate(df.values):
    try:
        y = row.astype(np.float64)  # fails if the row contains a non-numeric string
        print("Row " + str(id) + " is OK")
    except ValueError:
        df = df.drop(axis=0, index=id)

print(df.values)
df.to_csv('C:/Users/aless/Desktop/testRepaired.csv', index=False)  # file path to write results
(Screenshots of the initial and final CSV files.)
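A vectorized alternative is also possible (a sketch, not tested on the actual file): coerce every column to numeric and keep only the rows that convert cleanly, which avoids the explicit Python loop:
import pandas as pd

df = pd.read_csv('test.csv')  # hypothetical input path
# Coerce all columns to numeric; text like 'zero' becomes NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')
# Keep only the rows where every cell converted cleanly.
df_clean = df[numeric.notna().all(axis=1)]
df_clean.to_csv('testRepaired.csv', index=False)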
I have a data set that contains information from an experiment about particles. You can find it here (hope links are OK; if not, let me know and I'll remove it immediately):
http://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
Trying to read this set into pandas, I'm encountering the problem of pandas reading this txt as a data frame with 130,064 lines, which is correct, but only 1 column. If you check the txt file in the link, you will see that it is in a weird shape, with spaces at the beginning and then 2 spaces between each column.
I tried the command
df = pd.read_csv("path/file.txt", header = None)
and also
df = pd.read_csv("path/file.txt", sep = " ", header = None)
where I set 2 spaces as the separator. Nothing works. Also, the first line of the file has 2 numbers that just represent the number of rows, which I deleted. For anyone who can't or doesn't want to open the link or the data set, here is a picture of some columns.
This is just a portion of it and not the whole data. On the leftmost side there are 2 spaces between the edge of the window and the first column, as I said. When reading it using pandas, this is what I get (screenshot).
Any advice/help would be appreciated. Thanks
EDIT
I tried doing the following, and I think it worked. First I imported the .txt file using NumPy, after deleting the first row of the file, which contains the two irrelevant numbers.
df1 = np.loadtxt("path/file.txt")
This, for some reason, worked and the resulting array was correct. Then I converted this array to a data frame using the commands
df = pd.DataFrame(df1)
df.columns = ['X' + str(x) for x in range(50) ]
And yeah, I think it works; check the following picture (screenshot).
I think it's correct, but if you find something wrong, let me know.
Edited
columns = ['Obs' + str(i) for i in range(1, 51)]  # Obs1 ... Obs50
df = pd.read_csv("path/file.txt", sep=r'\s+', names=columns, skiprows=1)
Note that read_csv takes the column labels through names, not columns, and sep=r'\s+' treats any run of whitespace as one delimiter, which also absorbs the leading spaces.
You could try creating the dataframe from lists instead of the txt file, something like the following:
# We put all the lines in a list
data = []
with open("dataset.txt") as fp:
    lines = fp.read()
    data = lines.split('\n')

df_data = []
for item in data:
    df_data.append(item.split(' '))  # I can't see if 1 space or 2 separate the values

# df_data should be something like [[row1col1,row1col2,row1col3],[row2col1,row2col2,row2col3]]
# List to dataframe
df = pd.DataFrame(df_data)
Doing this from memory, so watch out for syntax. Hope this helps!
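One tweak worth considering (also untested): str.split() with no argument collapses any run of whitespace and strips leading/trailing spaces, so the exact separator width stops mattering:
df_data.append(item.split())  # splits on any whitespace run, ignores leading spaces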
I am reading in a large csv in pandas with:
features = pd.read_csv(filename, header=None, names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort','DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'], usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])
I get:
sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
How can I find the first line in the input which is causing this warning? I need to do this to debug the problem with the input file, which shouldn't have mixed types.
Once Pandas has finished reading the file, you can NOT figure out which lines were problematic (see this answer to know why).
This means you should find a way while you are reading the file. For example, read the file line by line and check the types in each line; if any of them doesn't match the expected type, you've found the line you want.
To achieve this with Pandas, you can pass chunksize=1 to pd.read_csv() to read the file in chunks (dataframes with size N, 1 in this case). See the documentation if you want to know more about this.
The code goes something like this:
# read the file in chunks of size 1. This returns a reader rather than a DataFrame
reader = pd.read_csv(filename, chunksize=1)

# get the first chunk (DataFrame), to calculate the "true" expected types
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]]  # a list of the expected types

i = 1  # the current index. Start from 1 because we've already read the first row.
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i)  # this row is the wanted one
        break
    i += 1
Note that this code makes an assumption that the first row has "true" types.
This code is really slow, so I recommend that you actually only check the columns which you think are problematic (though this does not give much performance gain).
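A quicker alternative sketch (assuming the warning's "Columns (6)" really is the only suspect column): read just that column as strings and flag the values that fail numeric conversion.
import pandas as pd

# Read only column 6 as strings, then try converting it to numbers.
col = pd.read_csv(filename, header=None, usecols=[6], dtype=str).iloc[:, 0]
# True where a non-empty value failed the numeric conversion.
bad = pd.to_numeric(col, errors='coerce').isna() & col.notna()
print(bad[bad].index[:10])  # positions of the first few offending values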
for endrow in range(1000, 4000000, 1000):
    startrow = endrow - 1000
    rows = 1000
    try:
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=rows, header=None,
                    names=['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort',
                           'DstPort', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes'],
                    usecols=['Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                             'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")
Split the file into multiple dataframes of 1000 rows each to see in which range of rows the mixed-type value that causes this problem sits.
I have a CSV file that I want to work with using Python. For that, I run this code:
import csv
import collections
col_values = collections.defaultdict(list)
with open('list.csv', 'rU') as f:
    reader = csv.reader(f)
    data = list(reader)
    row_count = len(data)
    print("Number of rows", row_count)
As a result I get 4357, but the file has only 2432 rows. I've tried changing the delimiter in the reader() function; it didn't change the result.
So my question is: does anybody have an explanation for why I get this value?
Thanks in advance.
UPDATE
Since the number of columns is also too big, here is the output of the last row and the start of the non-existent rows for one column (screenshot).
Opening the file with Excel, the end looks like this (screenshot).
I hope it helps.
Try using pandas:
import pandas as pd
df = pd.read_csv('list.csv')
df.count()
Check whether you are getting the proper row count now. (Note that df.count() reports non-null values per column; len(df) gives the total number of rows.)
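If the counts still disagree, one common culprit (a guess, without seeing the file) is stray carriage returns inside cells: in 'rU' universal-newline mode, every lone \r is treated as a line break and inflates the row count. A quick diagnostic sketch:
import csv

# Compare the parsed record count with the raw newline counts.
with open('list.csv', newline='') as f:
    records = sum(1 for _ in csv.reader(f))
with open('list.csv', 'rb') as f:
    raw = f.read()
print(records, raw.count(b'\n'), raw.count(b'\r'))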
I'm really stuck on some code. I've been working on it for a while now, so any guidance is appreciated. I'm trying to map many files' related columns into one final column. The logic of my code is to:
1. identify my desired final column names,
2. read the incoming file,
3. identify the top row as my column headers/names,
4. identify all rows underneath as data for those columns,
5. based on the column header, add data from that column to the most closely related column, and
6. have an exit condition (if no more data, end the program).
If anyone could help me with steps 3 and 4, I would greatly appreciate it because that is where I'm currently stuck.
It says I have a KeyError: 0 on the line columnHeader = row[i]. Does anyone know how to solve this particular problem?
#!/usr/bin/env python
import sys #used for passing in the argument
import csv, glob
SLSDictionary = {}
fieldMap = {'zipcode': ['Zip5', 'zip4'],
            'firstname': [],
            'lastname': [],
            'cust_no': [],
            'user_name': [],
            'status': [],
            'cancel_date': [],
            'reject_date': [],
            'streetaddr': ['address2', 'servaddr'],
            'city': [],
            'state': [],
            'phone_home': ['phone_work'],
            'email': []
            }

CSVreader = csv.DictReader(open('N:/Individual Files/Jerry/2013 customer list qc, cr, db, gb 9-19-2013_JerrysMessingWithVersion.csv', "rb"), dialect='excel', delimiter=',')

i = 0
for row in CSVreader:
    if i == 0:
        columnHeader = row[i]
    else:
        columnData = row[i]
    i += 1
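Note that csv.DictReader yields one dict per data row, keyed by the header names it reads from the first line itself, so indexing a row with an integer like row[i] raises KeyError: 0. Here is a minimal sketch of the intended access pattern, using a hypothetical two-column file:
import csv

# 'example.csv' is a hypothetical file whose header row is: zipcode,city
with open('example.csv', newline='') as f:
    for row in csv.DictReader(f):
        # each row is a dict such as {'zipcode': '12345', 'city': 'Austin'}
        print(row['zipcode'], row['city'])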