Comparing and updating CSV files using lists - python

I'm writing something that will take two CSVs: #1 is a list of emails with the number received for each, and #2 is a catalog of every email address on record, with the number of emails received per reporting period and the date annotated at the top of each column.
import csv
from datetime import datetime
datestring = datetime.strftime(, '%m-%d')
storedEmails = []
newEmails = []
sortedList = []
holderList = []
with open('working.csv', 'r') as newLines, open('archive.csv', 'r') as oldLines: #readers to make lists
f1 = csv.reader(newLines, delimiter=',')
f2 = csv.reader(oldLines, delimiter=',')
print ('Processing new data...')
for row in f2:
storedEmails.append(list(row)) #add archived data to a list
storedEmails[0].append(datestring) #append header row with new date column
for col in f1:
if col[1] == 'email' and col[2] == 'To Address': #new list containing new email data
counter = len(newEmails)
n = len(storedEmails[0]) #using header row len to fill zeros if no email received
print (n)
print ('Updating email lists and tallies, this could take a minute...')
with open ('archive.csv', 'w', newline='') as toWrite: #writer to overwrite old csv
writer = csv.writer(toWrite, delimiter=',')
for i in newEmails:
del i[:3] #strip useless identifiers from data
if int(i[1]) > 30: #only keep emails with sufficient traffic
sortedList.append(i) #add these emails to new sorted list
for i in storedEmails:
for entry in sortedList: #compare stored emails with the new emails, on match append row with new # of emails
if i[0] == entry[0]:
counter -=1
holderList.append(entry) #if no match, it is a new email that meets criteria to land itself on the list
break #break inner loop after iteration of outer email, to move to next email and avoid multiple entries
storedEmails = storedEmails + holderList #combine lists for archived csv rewrite
for i in storedEmails:
if len(i) < n:
i.append('0') #if email on list but didnt have any activity this period, append with 0 to keep records intact
print('SortedList', sortedList)
print (len(sortedList))
print('storedEmails', storedEmails)
print ('There are', counter, 'new emails being added to the list.')
print ('All done!')
CSV's will look similar to this.
1,,'to address',31
2,,'to address',19
3,,'to address',117
4,,'to address',92
5,,'to address',11
What I want after processing is:
I'm not sure where I've gone wrong, but it keeps producing duplicate entries. Some of the functionality is there, but I've been at it for too long and I'm getting tunnel vision trying to figure out what I have done wrong with my loops.
I know my zero-filler section at the end is wrong as well, as it will append onto the end of a newly created record instead of populating zeros up to its first appearance.
I'm sure there are far more efficient ways to do this. I'm new to programming, so it's probably overly complicated and messy. Initially I tried to compare CSV to CSV and realized that wasn't possible since you can't read and write at the same time, so I attempted to convert to using lists, which I also know won't work forever due to memory limitations when the list gets big.
Using Trenton's pandas solution:
I ran a script on working.csv so it instead produces the following:,1000,500,9999
I have modified your solution to reflect this change:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(, '%d-%b')
# filter original list to grab only emails of interest
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'email': 'date'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1]) # I assume usecols isnt necessery anymore, but I'm not sure
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
I apparently still have no idea how this works because I'm now getting :
Traceback (most recent call last):
File "---/", line 29, in <module>
working = working[(working[datestring] > 30) | (]
File "---\Python\Python38-32\lib\site-packages\pandas\core\", line 5130, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'email'
Process finished with exit code 1
It is working now as:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(, '%d-%b')
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1])
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
The errors above were caused because I changed
arch.rename(columns={'date': 'email'}, inplace=True)
to
arch.rename(columns={'email': 'date'}, inplace=True)
I ran into further complications because I stripped the header row from the test archive, since I didn't think the header mattered; even with header=None I still got issues. I'm still not clear why the header is so important when we are assigning our own values to the columns for purposes of the dataframe, but it's working now. Thanks for all the help!

I'd load the data with pandas.read_csv
.rename some columns
Renaming the columns in working depends on the column index, since working.csv has no column headers.
When the working dataframe is created, look at the dataframe to verify the correct columns have been loaded, and the correct column index is being used for renaming.
The date column of arch should really be email, because headers identify what's below them, not the other column headers.
Once the column name has been changed in archive.csv, then rename won't be required any longer.
pandas.merge on the email column.
Since both dataframes have a column renamed with email, the merged result will only have one email column.
If the merge occurs on two different column names, then the result will have two columns containing email addresses.
pandas: Merge, join, concatenate and compare
As long as the columns in the files are consistent, this should work without modification
import pandas as pd
from datetime import datetime
# get the date string
datestring = datetime.strftime(, '%d-%b')
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('working.csv', header=None, usecols=[1, 3])
# rename columns
working.rename(columns={1: 'email', 3: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
# display(arch_updated)
email 01-sep 27-Aug 154.0 31.0 128.0 19.0 77.0 92.0 63.0 0.0 0.0 117.0

So, the problem is you have two sets of data. Both have the data stored with a "key" entry (the emails) and additional piece of data that you want condensed down to one storage. Identifying that there is a similar "key" for both of these sets of data simplifies this greatly.
Imagine each key as being the name of a bucket. Each bucket needs two pieces of info, one piece from one csv and the other piece from the other csv.
Now, I must take a small detour to explain a dictionary in Python. Here is a definition borrowed from elsewhere:
A dictionary is a collection which is unordered, changeable and indexed.
A collection is a container, like a list, that holds data. Unordered and indexed means that the dictionary is not accessed like a list, where the data is reached by its position. Instead, the dictionary is accessed using keys, which can be anything like a string or a number (technically the key must be hashable, but that's too in-depth for now). Since Python 3.7 a dictionary does remember insertion order, but you still look items up by key rather than by position. And finally, changeable means that the data stored in the dictionary can be changed (once again, oversimplified).
dictionary = dict()
key = "Something like a string or a number!"
dictionary[key] = "any kind of value can be stored here! Even lists and other dictionaries!"
print(dictionary[key]) # Would print the above string
Here is the structure that I suggest you use instead of most of your lists:
dictionary[email] = [item1, item2]
This way, you can avoid using multiple lists and massively simplify your code. If you are still iffy on the usage of dictionaries, there are a lot of articles and videos about them. Good luck!
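To make the bucket idea concrete, here is a minimal sketch of how the archive update could look with dictionaries. The function name merge_counts, the sample addresses and the in-memory dictionaries are assumptions for illustration; reading and writing the actual CSV files is left out. It keeps the "more than 30" rule from the question for new addresses and back-fills zeros for periods before an address first appeared.

```python
def merge_counts(archive, new_counts, threshold=30):
    """Return a new archive with one extra period appended.

    archive: dict mapping email -> list of counts, one per past period
    new_counts: dict mapping email -> count for the current period
    """
    # Number of periods already recorded (0 if the archive is empty)
    periods = max((len(counts) for counts in archive.values()), default=0)
    merged = {email: list(counts) for email, counts in archive.items()}

    # Known addresses always get a value for the new period, 0 if silent
    for email, counts in merged.items():
        counts.append(new_counts.get(email, 0))

    # New addresses are added only above the threshold, padded with zeros
    for email, count in new_counts.items():
        if email not in merged and count > threshold:
            merged[email] = [0] * periods + [count]
    return merged


# Hypothetical sample data
archive = {"a@example.com": [154], "b@example.com": [63]}
new_counts = {"a@example.com": 31, "c@example.com": 117, "d@example.com": 11}
merged = merge_counts(archive, new_counts)
print(merged)
```

Because each address is a key, a duplicate entry cannot appear: a second sighting of the same address updates the existing bucket instead of adding a row.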


Load csv files with multiple columns into several dataframe

I am trying to load some large csv files which appear to have multiple columns and I am struggling with it.
I don't know who designed these csv files, but they appear to have event data as well as log data in each csv. At the start of each csv file there are some initial status lines as well.
Everything is in separate rows.
The Event data uses 2 columns (Date and Event comment).
The Log data has multiple columns (Date and 20+ columns).
I give an example of the type of data setup below:
Initial; [Status] The Zoo is Closed;
Initial; Status] The Sun is Down;
Initial; [Status] Monkeys ar sleeping;
06:00; 5;5;0;10
07:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
08:30; [Event] Monkey Doors open and Zoo Opens
09:00; 5;5;0;10
08:30; [Event] Monkey Goes out
09:00; 5;4;1;10
08:30; [Event] Monkey Eats Banana
09:00; 5;4;1;9
08:30; [Event] Monkey Goes out
09:00; 5;5;2;9
Now what I want to do is to put the Log data into one data frame and the Initial and Event data into another.
Now I can read the csv files with csv_reader and go row by row, but this is proving very slow, especially when going through multiple files, each containing about 40k rows.
Below is the code I am using:
csv_files = [f for f in os.listdir('.') if f.endswith('.log')]
for file in csv_files:
    # Open the CSV file in read mode
    with open(file, 'r') as csv_file:
        # Use the csv module to parse the file
        csv_reader = csv.reader(csv_file, delimiter=';')
        # Loop through the rows of the file
        for row in csv_reader:
            # If the row has event data
            if len(row) == 2:
                # Add the row to the Eventlog
                EventLog = EventLog.append(pd.Series(row), ignore_index=True)
            # If the row is separated by a single separator
            elif len(row) > 2:
                #First row entered into data log will be the column headers
                if DataLog.empty:
                    # Add the row to the single_separator_df DataFrame
                    DataLog = DataLog.append(pd.Series(row), ignore_index=True)
Is there a better way to do this, preferably faster?
If I use pandas read_csv it seems to only load the Initial data, i.e. the first 3 lines of my data above.
I can use skiprows to skip down to where the data is and then it will load the rest, but I can't seem to figure out how to separate out the event and log data from there.
So I'm looking for ideas before I lose what little hair I have left.
If I understood your data format correctly, I would do something like this:
# simply read data as one column data without headers and indexes
df = pd.read_csv("your_file_name.log", header=None, sep=',')
# split values in this column by ; (in each row will be list of values)
tmp_df = df[0].str.split(";")
# delete empty values in the first 3 rows (because we have ; in the end of these rows)
tmp_df = x: [y for y in x if y != ''])
# those rows which have 2 values we insert in one dataframe
EventLog = pd.DataFrame(tmp_df[tmp_df.str.len() == 2].to_list())
# the other ones we insert in another dataframe (the first row holds the column names)
data_log_tmp = tmp_df[tmp_df.str.len() != 2].to_list()
DataLog = pd.DataFrame(data_log_tmp[1:], columns=data_log_tmp[0])
Here is an example of loading a CSV file. It assumes that the Monkeys_inside field is always NaN in Event data and always set in Log data, because I use it as the condition to retrieve the event data:
df = pd.read_csv('huge_data.csv', skiprows=3, sep=';')
log_df = df.dropna().reset_index(drop=True)
event_df = df[~df['Monkeys_inside'].notnull()].reset_index(drop=True)
This also assumes that all your CSV files contain those 3 Status lines.
Keep in mind that the dataframe will hold duplicated rows if your csv files contain some. To remove them, just call the drop_duplicates function and you're good:
event_df = event_df.drop_duplicates()
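Much of the slowness in the original loop likely comes from growing a DataFrame one row at a time. As a rough sketch of the alternative, the rows can be sorted into two plain lists in a single pass and turned into dataframes only once at the end. The function name split_log and the tiny sample are assumptions; the rule "two fields means event, more means log" is taken from the question, and a trailing empty field (from a line ending in ;) is dropped before counting.

```python
import csv
import io


def split_log(lines, delimiter=";"):
    """Split rows into (events, log_rows) based on their field count."""
    events, log_rows = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        fields = [field.strip() for field in row]
        # Lines ending in the delimiter produce empty trailing fields
        while fields and fields[-1] == "":
            fields.pop()
        if not fields:
            continue
        if len(fields) == 2:
            events.append(fields)
        else:
            log_rows.append(fields)
    return events, log_rows


# Hypothetical sample in the layout shown in the question
sample = io.StringIO(
    "Initial; [Status] The Zoo is Closed;\n"
    "06:00; 5;5;0;10\n"
    "07:10;[Event] Sun is up\n"
    "07:00; 5;5;0;10\n"
)
events, log_rows = split_log(sample)
print(events)
print(log_rows)
```

With a real file, an open file object can be passed in place of the StringIO, and the two lists can each be handed to pd.DataFrame once after the loop.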

Changing Headers in .csv files

Right now I am trying to read in data which is provided in a messy to read-in format. Here is an example
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The excel files are in a folder one path lower than the file that is supposed to read them (i.e. folder 1 contains set of code below, and folder 2 contains the excel files, folder 1 contains folder 2)
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader (sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
   x  y
0  1  2
1  3  4
2  5  6
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:
# glob.glob returns a list of matches for each pattern, so flatten the results
files = [path for set_num in sets for path in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names=["x", "y"]) for file in files])
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows and skipfooter arguments to skip header and footer rows:
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"], engine="python")
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately I also have very unfortunate files where the number of headings change.
In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row
This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:
data = df
if(first_col not in df.columns):
    # Skip rows until we find the first col header
    for i, row in df.iterrows():
        if row[0] == first_col:
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Read the column names
            series = df.iloc[i]
            series = series.str.strip()
            data.columns = list(series)
# Use only existing column types
types = {k: v for k, v in dtype.items() if k in data.columns}
# Apply the column types again
data = data.astype(dtype=types)
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adapted to use different conditions, e.g. looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
    if row[0].isnumeric():
        # Keep this row and everything after it, then stop searching
        data = df.iloc[i:].reset_index(drop=True)
        break
# Apply names and types
data.columns = columns
data = data.astype(dtype=dtypes)
return data
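The same "scan until the first numeric row" idea can also be sketched without a dataframe, which may be handy for checking what a messy file actually contains before handing it to pandas. Everything here is an assumption for illustration: the helper names, the two-column x, y layout, and the sample text with a #[DATA] marker and a trailing footer line.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def read_numeric_block(lines, names=("x", "y")):
    """Collect the first run of fully numeric rows, skipping any header."""
    rows = []
    started = False
    for row in csv.reader(lines):
        cells = [cell.strip() for cell in row]
        numeric = len(cells) == len(names) and all(is_number(c) for c in cells)
        if numeric:
            started = True
            rows.append(dict(zip(names, map(float, cells))))
        elif started:
            # First non-numeric row after the data is treated as a footer
            break
    return rows


# Hypothetical file content: header lines, data, then a footer line
sample = io.StringIO("#[DATA]\nsome header text\n1,2\n3,4\n5,6\nEND\n")
rows = read_numeric_block(sample)
print(rows)
```

The names argument plays the role of the new header, so the same function could be reused with more descriptive column names.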

How to group csv in python without using pandas

I have a CSV file with 3 columns: "Username", "Date", "Energy saved", and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022 Total Energy saved = 1001 , Date: 24/2/2022 Total Energy saved = 700)
I am a beginner at python and typically, I would use pandas to resolve this issue but it is not allowed for this project so I am at a complete loss on where to even begin. I would appreciate any help and guidance. Thank you.
My alternative for opening CSV files is to use the csv module from the Python standard library. You read them as a "file" and just extract the values that you need. I filter on the first column and keep only the values in the same rows from the column of interest (the third column, index 2).
import csv
energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
    file = csv.reader(csvfile)
    for row in file:
        if row[0]=="merrytan":
            # keep the energy value of every matching row
            energy_saved.append(row[2])
energy_saved = sum(map(int, energy_saved))
Now you have a list of just concerned values, and you can sum them afterwards.
Edit - So, I just realized that I left out the time part of your request completely lol. Here's the update.
import csv
my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0]=="merrytan":
            my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need to get the date column of the file as well. We need to present two columns, but since pandas is prohibited, we use a dictionary with dates as keys and energy as values.
But your date column has repeated values (intended or not) and dictionaries require keys to be unique. So, we use a loop. You add one date after another as a key, with the corresponding energy as the value, and when a date is already present you add to the existing value instead.
I would turn your CSV file into a two-level dictionary, with username and then date as the keys
infile = open("data.csv", "r").readlines()
savings = dict()
# Skip the first line of the CSV, since that has the column names
# not data
for row in infile[1:]:
    username, date_col, saved = row.strip().split(",")
    saved = int(saved)
    if username in savings:
        if date_col in savings[username]:
            # Known user and known date: add to the running total
            savings[username][date_col] = savings[username][date_col] + saved
        else:
            # Known user, first time we see this date
            savings[username][date_col] = saved
    else:
        # First time we see this user
        savings[username] = {date_col: saved}
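For comparison, here is a short sketch of the same grouping using csv.DictReader and collections.defaultdict, both from the standard library, so no pandas is involved. The column names come from the question; the function name energy_by_date and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def energy_by_date(rows, username):
    """Sum "Energy saved" per date for one user."""
    totals = defaultdict(int)
    for row in rows:
        if row["Username"] == username:
            totals[row["Date"]] += int(row["Energy saved"])
    return dict(totals)


# Hypothetical sample data with the header row from the question
sample = io.StringIO(
    "Username,Date,Energy saved\n"
    "merrytan,24/2/2022,1000\n"
    "merrytan,24/2/2022,1\n"
    "merrytan,25/2/2022,700\n"
    "other,24/2/2022,50\n"
)
totals = energy_by_date(csv.DictReader(sample), "merrytan")
for date, total in totals.items():
    print(f"Date: {date} Total Energy saved = {total}")
```

DictReader uses the header row for the keys, so the code refers to columns by name rather than by position, and defaultdict(int) removes the need for the "is this key already present" checks.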

How to split a log file into several csv files with python

I'm pretty new to python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several *.csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. When the keyword appears, it should start copying the two desired columns into a new file until the keyword appears again. When finished, there need to be as many csv files as there are keywords.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
The program creates as much run*.csv as needed, but i can't store the desired columns in my new files. When finished, all I get are empty csv files. If I shift the counter variable to == 1 it creates just one big file with the desired columns.
Thanks again!
import csv
with open('localizer.log', 'rt') as log_input:
    i = 0
    for line in log_input:
        if QUERY in line:
            i = i + 1
            with open('run' + str(i) + '.csv', 'w') as output:
                reader = csv.reader(log_input, delimiter = ' ')
                writer = csv.writer(output)
                content_column_A = [0]
                content_column_B = [1]
                for row in reader:
                    content_A = list(row[j] for j in content_column_A)
                    content_B = list(row[k] for k in content_column_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file handle, not a single line.
the reader delimiter should not be a single space character, as it looks like the actual delimiter in your logs is a variable number of space characters.
the looping logic seems to be a bit off, confusing files/lines/rows a bit.
You may be looking at something like the code below (pending clarification in the question):
import csv

# keyword that marks the start of a new run
NEW_LOG_DELIMITER = 'MYLOG'

def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (ex a list of lists)
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    with open(filename, 'w', newline='') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)

current_buffer = []
_index = 1
with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if not NEW_LOG_DELIMITER in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty
if current_buffer:
    # We are now out of the loop,
    # if there's an unwritten buffer then write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in log file.
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set Flag where third column == 'MYLOG'
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file
for i in range(1, df.flag.max()+1):
    # onset is the index, so this writes the onset and type columns
    df.loc[df.flag == i, 'type'].to_csv('run{0}.csv'.format(i))
Looks like your format is different than I originally assumed. Changed to use pd.read_fwf. my localizer.log file was a copy and paste of your original data, hope this works for you. I assumed by the original post that it did not have headers. If it does have headers then remove header=None and df.columns = ['onset', 'type', 'event'].
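To check the splitting logic without touching the disk, the grouping step can be separated from the file writing. This is only a sketch under a few assumptions: the function name split_runs is made up, lines before the first keyword are ignored, and only the first two whitespace-separated fields (onset and type) are kept.

```python
def split_runs(lines, keyword="MYLOG"):
    """Group log lines into runs; each run starts at a line with the keyword."""
    runs = []
    for line in lines:
        # split() with no argument handles any amount of whitespace
        fields = line.split()[:2]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if keyword in line:
            runs.append([])  # a keyword line opens a new run
        if runs:
            runs[-1].append(fields)
    return runs


# A few lines in the format shown in the question
sample = [
    "53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv",
    "53.2589 EXP TextStim: autoDraw = None",
    "55.2257 DATA Keypress: t",
    "89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]",
    "89.2336 EXP Imported specs/run03_block01.csv as conditions",
]
runs = split_runs(sample)
for number, run in enumerate(runs, start=1):
    print(f"run{number}.csv would get {len(run)} rows")
```

Each inner list can then be written with csv.writer, one file per run, after a header row of onset and type.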

Python CSV - Check if index is equal on different rows

I'm trying to create code that checks if the value in the index column of a CSV is equivalent in different rows, and if so, find the most occurring values in the other columns and use those as the final data. Not a very good explanation, basically I want to take this data.csv:
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
I'd also like to learn that if there are values with the same number of occurrences (Month and B for customer 1004) how can I choose which one I want to be outputted?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print df
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
Note: I don't know why it is outputting the .0 at the end of the ABC, it seems to be in the wrong variable format. I want each column to be outputted as just the 3 digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.
What about something like this? This is not using Pandas though, I am not a Pandas expert.
from collections import Counter
dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0])>0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month':[entry[1]],
            # customer_id is already in dataDict, add values
# Now write the output file
with open('out.csv','w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv with one line per customer, holding the most common value found in each column.
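On the question of ties: Counter.most_common() lists equally frequent values in the order they were first seen, so the code above effectively picks the earliest one. If a different rule is wanted, the tied values can be collected and chosen from explicitly. The sketch below is one possible approach, not part of the original answer; the function name, the rank parameter and the MONTHS list are assumptions, and the input list is assumed to be non-empty.

```python
from collections import Counter


def most_common_with_tiebreak(values, rank=None):
    """Return the most frequent value; on a tie, the lowest rank wins.

    With rank=None the first value encountered among the tied ones is used.
    """
    counts = Counter(values)
    top = max(counts.values())
    # Counter keeps insertion order, so tied is in first-seen order
    tied = [value for value in counts if counts[value] == top]
    if rank is None:
        return tied[0]
    return min(tied, key=rank)


# Hypothetical preference: on a tie, take the later month
MONTHS = ["Jan", "Feb", "Mar"]
months = ["Jan", "Feb", "Feb", "Jan"]
first_seen = most_common_with_tiebreak(months)
latest = most_common_with_tiebreak(months, rank=lambda m: -MONTHS.index(m))
print(first_seen, latest)
```

In the answer's code this would replace each Counter(...).most_common()[0][0] expression, with a rank function chosen per column.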
