Data cleanup in Python, removing CSV rows based on a condition - python

I've come across a bit of a challenge where I need to sanitize data in a CSV file based on the following criteria:
If the data exists with a date, remove the row with an NA value from the file;
If it is a duplicate, remove it; and
If the data exists only on its own, leave it alone.
I am currently able to do both 2 and 3; however, I am struggling to write a condition that captures criterion 1.
Sample CSV File
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Server_B,Test,NA,NA
Current Code So Far
import csv
input_file = 'sample.csv'
output_file = 'completed_output.csv'
with open(input_file, 'r') as inputFile, open(output_file, 'w') as outputFile:
    seen = set()
    for line in inputFile:
        if line in seen:
            continue
        seen.add(line)
        outputFile.write(line)
Currently, this handles the duplicates and keeps the unique values. However, I cannot work out the best way to remove the row with an NA date when the same server also appears with a real date.
This approach may not work well anyway, because the set type is unordered, so I wasn't sure of the best way to compare on a particular column and then filter down from there.
Any suggestions or solutions that could help me would be greatly appreciated.
Current Output So Far
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Expected Output
Name,Environment,Available,Date
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA

You can use pandas instead of doing all of that manually. I have written a short function called custom_filter that applies the criteria.
One potential source of bugs is NA handling: pandas reads the literal 'NA' strings as NaN, so the checks below use pd.notna; if your data behaves differently, try comparing against np.nan or None instead.
import pandas as pd

df = pd.read_csv('sample.csv')
df = df.drop_duplicates()

data_present = []

def custom_filter(row):
    # Dated rows are seen first because of the sort below.
    if pd.notna(row['Date']):
        # Remember that this server has a dated entry and keep the row.
        data_present.append(row['Name'])
        return True
    elif row['Name'] not in data_present:
        # No dated entry exists for this server, so keep its lone NA row.
        return True
    else:
        # An NA row for a server that already has dated entries: drop it.
        return False

df = df.sort_values('Date')  # NaN dates sort last, so dated rows are processed first
df = df[df.apply(custom_filter, axis=1)]
df.to_csv('completed_output.csv', index=False, na_rep='NA')
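If you'd rather avoid a row-wise apply, the same rule can be expressed as a vectorized mask. This is a minimal sketch assuming read_csv parses the literal NA strings as NaN (the default behaviour):
import pandas as pd

df = pd.read_csv('sample.csv').drop_duplicates()

# keep dated rows, plus NA rows only for servers that never appear with a date
has_date = df['Date'].notna()
names_with_date = set(df.loc[has_date, 'Name'])
df = df[has_date | ~df['Name'].isin(names_with_date)]

df.to_csv('completed_output.csv', index=False, na_rep='NA')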

Related

Comparing and updating CSV files using lists

I'm writing something that will take two CSVs: #1 is a list of emails with the number received for each, and #2 is a catalog of every email address on record, with a count of received emails per reporting period and the date annotated at the top of each column.
import csv
from datetime import datetime
datestring = datetime.strftime(datetime.now(), '%m-%d')
storedEmails = []
newEmails = []
sortedList = []
holderList = []
with open('working.csv', 'r') as newLines, open('archive.csv', 'r') as oldLines: #readers to make lists
    f1 = csv.reader(newLines, delimiter=',')
    f2 = csv.reader(oldLines, delimiter=',')
    print ('Processing new data...')
    for row in f2:
        storedEmails.append(list(row)) #add archived data to a list
    storedEmails[0].append(datestring) #append header row with new date column
    for col in f1:
        if col[1] == 'email' and col[2] == 'To Address': #new list containing new email data
            newEmails.append(list(col))

counter = len(newEmails)
n = len(storedEmails[0]) #using header row len to fill zeros if no email received
print(storedEmails[0])
print (n)
print ('Updating email lists and tallies, this could take a minute...')

with open ('archive.csv', 'w', newline='') as toWrite: #writer to overwrite old csv
    writer = csv.writer(toWrite, delimiter=',')
    for i in newEmails:
        del i[:3] #strip useless identifiers from data
        if int(i[1]) > 30: #only keep emails with sufficient traffic
            sortedList.append(i) #add these emails to new sorted list
    for i in storedEmails:
        for entry in sortedList: #compare stored emails with the new emails, on match append row with new # of emails
            if i[0] == entry[0]:
                i.append(entry[1])
                counter -=1
            else:
                holderList.append(entry) #if no match, it is a new email that meets criteria to land itself on the list
            break #break inner loop after iteration of outer email, to move to next email and avoid multiple entries
    storedEmails = storedEmails + holderList #combine lists for archived csv rewrite
    for i in storedEmails:
        if len(i) < n:
            i.append('0') #if email on list but didnt have any activity this period, append with 0 to keep records intact
        writer.writerow(i)

print('SortedList', sortedList)
print (len(sortedList))
print('storedEmails', storedEmails)
print(len(storedEmails))
print('holderList', holderList)
print(len(holderList))
print ('There are', counter, 'new emails being added to the list.')
print ('All done!')
The CSVs will look similar to this.
working.csv:
1,asdf#email.com,'to address',31
2,fsda#email.com,'to address',19
3,zxcv#email.com,'to address',117
4,qwer#gmail.com,'to address',92
5,uiop#fmail.com,'to address',11
archive.csv:
date,01-sep
asdf#email.com,154
fsda#email.com,128
qwer#gmail.com,77
ffff#xmail.com,63
What I want after processing is:
date,01-sep,27-sep
asdf#email.com,154,31
fsda#email.com,128,19
qwer#gmail.com,77,92
ffff#xmail.com,63,0
zxcv#email.com,0,117
I'm not sure where I've gone wrong, but it keeps producing duplicate entries. Some of the functionality is there, but I've been at it for too long and I'm getting tunnel vision trying to figure out what I have done wrong with my loops.
I know my zero-filler section at the end is wrong as well, as it will append onto the end of a newly created record instead of populating zeros up to its first appearance.
I'm sure there are far more efficient ways to do this; I'm new to programming, so it's probably overly complicated and messy. Initially I tried to compare CSV to CSV and realized that wasn't possible since you can't read and write at the same time, so I attempted to convert to using lists, which I also know won't work forever due to memory limitations when the lists get big.
-EDIT-
Using Trenton's pandas solution:
I ran a script on working.csv so it instead produces the following:
asdf#email.com,1000
bsdf#gmail.com,500
xyz#fmail.com,9999
I have modified your solution to reflect this change:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
# filter original list to grab only emails of interest
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'email': 'date'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1]) # I assume usecols isn't necessary anymore, but I'm not sure
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
I apparently still have no idea how this works, because I'm now getting:
Traceback (most recent call last):
File "---/agsdga.py", line 29, in <module>
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
File "---\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 5130, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'email'
Process finished with exit code 1
-UPDATE-
It is working now as:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1])
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
The errors above were caused because I changed
arch.rename(columns={'date': 'email'}, inplace=True)
to
arch.rename(columns={'email': 'date'}, inplace=True)
I ran into further complications because I stripped the header row from the test archive, since I didn't think the header mattered; even with header=None I still got issues. I'm still not clear why the header is so important when we are assigning our own values to the columns for the purposes of the dataframe, but it's working now. Thanks for all the help!
I'd load the data with pandas.read_csv
.rename some columns
Renaming the columns in working is dependent upon the column index, since working.csv has no column headers.
When the working dataframe is created, look at the dataframe to verify the correct columns have been loaded, and the correct column index is being used for renaming.
The date column of arch should really be email, because headers identify what's below them, not the other column headers.
Once the column name has been changed in archive.csv, then rename won't be required any longer.
pandas.merge on the email column.
Since both dataframes have a column renamed with email, the merged result will only have one email column.
If the merge occurs on two different column names, then the result will have two columns containing email addresses.
pandas: Merge, join, concatenate and compare
As long as the columns in the files are consistent, this should work without modification:
import pandas as pd
from datetime import datetime
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('working.csv', header=None, usecols=[1, 3])
# rename columns
working.rename(columns={1: 'email', 3: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
# display(arch_updated)
email 01-sep 27-Aug
asdf#email.com 154.0 31.0
fsda#email.com 128.0 19.0
qwer#gmail.com 77.0 92.0
ffff#xmail.com 63.0 0.0
zxcv#email.com 0.0 117.0
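One small note on the output: fillna(0) promotes the count columns to floats, which is why 154.0 appears above. If integer output matters, a hedged tweak that can sit between the merge and the to_csv call (the column selection is an assumption):
# cast every column except 'email' back to int so the csv shows 154, not 154.0
count_cols = arch_updated.columns.drop('email')
arch_updated[count_cols] = arch_updated[count_cols].astype(int)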
So, the problem is you have two sets of data. Both store the data with a "key" entry (the emails) plus an additional piece of data that you want condensed down into one place. Recognizing that both sets of data share the same kind of "key" simplifies this greatly.
Imagine each key as being the name of a bucket. Each bucket needs two pieces of info, one piece from one csv and the other piece from the other csv.
Now, I must take a small detour to explain a dictionary in python. Here is a definition stolen from here
A dictionary is a collection which is unordered, changeable and indexed.
A collection is a container, like a list, that holds data. Unordered and indexed mean that the dictionary is not accessed like a list, where the data is reached by a positional index; instead, the dictionary is accessed using keys, which can be anything like a string or a number (technically the key must be hashable, but that's too in-depth here). And finally, changeable means that the dictionary can have its stored data changed (once again, oversimplified).
Example:
dictionary = dict()
key = "Something like a string or a number!"
dictionary[key] = "any kind of value can be stored here! Even lists and other dictionaries!"
print(dictionary[key]) # Would print the above string
Here is the structure that I suggest you use instead of most of your lists:
dictionary[email] = [item1, item2]
This way, you can avoid using multiple lists and massively simplify your code. If you are still iffy on the usage of dictionaries, there are a lot of articles and videos on them. Good luck!
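To make that concrete, here is a hedged sketch of the bucket idea applied to the email tallies from the question. It follows the original working.csv layout shown above (id, email, address, count) and the >30 rule; treat it as an illustration of the dictionary structure rather than a drop-in replacement:
import csv
from datetime import datetime

datestring = datetime.strftime(datetime.now(), '%d-%b')

tallies = {}                                   # email -> list of counts, one per period

# load the archive: first row is the header, the rest become email buckets
with open('archive.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader) + [datestring]       # add the new period's column
    for row in reader:
        tallies[row[0]] = row[1:]

n_old = len(header) - 2                        # number of previously archived periods

# fold the new period in: layout assumed to be id,email,address,count
with open('working.csv', newline='') as f:
    for row in csv.reader(f):
        email, count = row[1], row[3]
        if int(count) > 30 or email in tallies:
            # brand-new emails get zeros for the old periods they missed
            tallies.setdefault(email, ['0'] * n_old).append(count)

# rewrite the archive with every bucket padded to the full width
with open('archive.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for email, counts in tallies.items():
        writer.writerow([email] + counts + ['0'] * (len(header) - 1 - len(counts)))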

Count and compare occurrences across different columns in different spreadsheets

I would like to know (in Python) how to count occurrences and compare values from different columns in different spreadsheets. After counting, I would need to know whether those values fulfill a condition, i.e. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new to Python, but I have tried getting the .values() after using the Counter from collections. However, I am not sure the actual value Ana is being matched when iterating over the results of the Counter. All in all, I need to iterate over each element in spreadsheet one and see if that element appears one time in the second spreadsheet and five times in the third spreadsheet; if that happens, the variable X is increased by one.
def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_ocurrences = number_of_occurrences_in_list_1.values()

    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_ocurrences = number_of_occurrences_in_list_2.values()

    X = 0
    for x, y in zip(list1_ocurrences, list2_ocurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested with small spreadsheets, but this only works for pre-ordered values. If Ana appears after 100000 rows, everything breaks. I think I need to iterate over each value (Ana) and check it simultaneously in all the spreadsheets, then sum into the variable X.
I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest you to try using pandas: a real super-useful tool to quickly and efficiently manage data frames.
You can easily import a .csv spreadsheet with the read_csv method:
import pandas as pd
df = pd.read_csv()
and then perform almost any kind of operation on it.
Check this answer out; I only had a little time to read it, but I hope it helps:
what is the most efficient way of counting occurrences in pandas?
UPDATE: then try with this
# not tested but should work
import os
import pandas as pd

# read all csv sheets from folder - I assume your folder is named "CSVs"
for files in os.walk("CSVs"):
    files = files[-1]

# here it's generated a list of dataframes
df_list = []
for file in files:
    df = pd.read_csv("CSVs/" + file)
    df_list.append(df)

name_i_wanna_count = "" # this will be your query
columun_name = "" # here insert the column you wanna analyze
count = 0
for df in df_list:
    # retrieve a series matching your query and then counts the elements inside
    matching_serie = df.loc[df[columun_name] == name_i_wanna_count]
    partial_count = len(matching_serie)
    count = count + partial_count
print(count)
I hope it helps
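As a follow-up, here is a hedged sketch that implements the stated condition with collections.Counter directly: for every name in the first sheet, X grows by one when that name appears exactly once in the second sheet and exactly five times in the third. The file names and the column index are assumptions based on the question:
import csv
from collections import Counter

def column_counter(path, col=1):
    # count how many times each value appears in the given column of a csv
    with open(path, newline='') as f:
        return Counter(row[col] for row in csv.reader(f) if len(row) > col)

first = column_counter('file1.csv')   # the file names here are placeholders
second = column_counter('file2.csv')
third = column_counter('file3.csv')

X = sum(1 for name in first if second[name] == 1 and third[name] == 5)
print(X)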

Reading bad csv files with garbage values

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines=False will remove the important stuff and leave only the garbage values.
These are the possible column names, as given below:
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?
Open the csv file and find the row where the column names start:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name','NAME')),
        enumerate(fp)
    ))[0]
The row index will be stored in the skip variable.
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to #RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all your possibilities, in any order, but it will also currently skip trash lines that contain spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.
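For reference, skiprows also accepts a callable over the row indices, which can help when the junk lines are not all at the top. A small sketch; the index set and the whitespace separator are assumptions based on the sample file above:
import pandas as pd

# rows 0-2 and 4-5 are garbage in the sample; row 3 holds the 'Name Roll' header
junk = {0, 1, 2, 4, 5}
df = pd.read_csv('data.csv', sep=r'\s+', skiprows=lambda i: i in junk)
print(df)  # columns: Name, Roll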

How to split a log file into several csv files with python

I'm pretty new to Python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several *.csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. If the keyword appears, it should start copying the two desired columns into the new file until the keyword appears again. When finished, there need to be as many csv files as there are keywords.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
...
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
...
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
...
The program creates as many run*.csv files as needed, but I can't store the desired columns in my new files. When finished, all I get are empty csv files. If I shift the counter variable to == 1, it creates just one big file with the desired columns.
Thanks again!
import csv

QUERY = 'MYLOG'
with open('localizer.log', 'rt') as log_input:
    i = 0
    for line in log_input:
        if QUERY in line:
            i = i + 1
            with open('run' + str(i) + '.csv', 'w') as output:
                reader = csv.reader(log_input, delimiter = ' ')
                writer = csv.writer(output)
                content_column_A = [0]
                content_column_B = [1]
                for row in reader:
                    content_A = list(row[j] for j in content_column_A)
                    content_B = list(row[k] for k in content_column_B)
                    writer.writerow(content_A)
                    writer.writerow(content_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file handler, not a single line.
the reader delimiter should not be a single space character as it looks like the actual delimiter in your logs is a variable number of multiple space characters.
the looping logic seems to be a bit off, confusing files/lines/rows a bit.
You may be looking at something like the code below (pending clarification in the question):
import csv

NEW_LOG_DELIMITER = 'MYLOG'

def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (ex a list of lists)
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    with open(filename, 'w') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)

current_buffer = []
_index = 1

with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if not NEW_LOG_DELIMITER in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty

if current_buffer:
    # We are now out of the loop,
    # if there's an unwritten buffer then write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in log file.
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set Flag where third column == 'MYLOG'
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file
for i in range(1, df.flag.max()+1):
    df.loc[df.flag == i, 'event'].to_csv('run{0}.csv'.format(i))
EDIT:
Looks like your format is different than I originally assumed, so I changed to use pd.read_fwf. My localizer.log file was a copy and paste of your original data; I hope this works for you. I assumed from the original post that it did not have headers. If it does have headers, then remove header=None and df.columns = ['onset', 'type', 'event'].
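If you want each run*.csv to contain the onset and type columns from the expected output rather than the raw event text, a hedged variation on the final loop (using the df and flag column built above) could be:
# write onset and type per run instead of the event column
for i in range(1, df.flag.max() + 1):
    run = df.loc[df.flag == i].reset_index()[['onset', 'type']]
    run.to_csv('run{0}.csv'.format(i), index=False)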

Python CSV - Check if index is equal on different rows

I'm trying to create code that checks whether the value in the index column of a CSV is the same across different rows, and if so, finds the most frequently occurring values in the other columns and uses those as the final data. Not a very good explanation; basically, I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one gets output?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC; it seems to be the wrong data type. I want each column to be output as just the 3-digit number.
Edit: I'm also having an issue where, if the value in column A is 0, the output becomes 2 digits and loses the leading 0.
What about something like this? This is not using Pandas though, I am not a Pandas expert.
from collections import Counter

dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file
with open('out.csv', 'w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this:
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
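For completeness, a pandas take on the same question: group by customer_ID and keep the most frequent value per column. This is a minimal sketch, assuming A, B and C are read as strings so a leading 0 survives; ties resolve to the lexicographically smallest mode (e.g. Feb over Jul), and you can pick a different element of s.mode() or write your own tie-breaker if you want a different winner:
import pandas as pd

# read A/B/C as strings so a value of 0 keeps its digit and any leading zeros
df = pd.read_csv('data.csv', dtype={'A': str, 'B': str, 'C': str})

most_common = lambda s: s.mode().iloc[0]   # ties: the first (smallest) mode wins

res = df.groupby('customer_ID').agg(
    {'month': most_common, 'A': most_common, 'B': most_common, 'C': most_common})
res['ABC'] = res['A'] + res['B'] + res['C']
res[['month', 'ABC']].to_csv('answer.csv')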
