Count and compare occurrences across different columns in different spreadsheets - python

I would like to know how to count occurrences and compare values from different columns in different spreadsheets in Python. After counting, I need to know whether those counts fulfill a condition, e.g. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new to Python, but I have tried getting the .values() after using Counter from collections. However, I am not sure the actual value Ana is being considered when iterating over the results of the Counter. All in all, I need to iterate over each element in the first spreadsheet and check whether it appears exactly once in the second spreadsheet and five times in the third; if it does, X is incremented by one.
import csv
from collections import Counter

def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_occurrences = number_of_occurrences_in_list_1.values()
    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_occurrences = number_of_occurrences_in_list_2.values()
    X = 0
    for x, y in zip(list1_occurrences, list2_occurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested this with small spreadsheets, but it only works for pre-ordered values: zip pairs the counts by position, not by name, so if Ana first appears after 100,000 rows everything breaks. I think I need to iterate over each value (e.g. Ana), check it in all the spreadsheets simultaneously, and increment X.
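Is something like this untested sketch, which looks the counts up by key instead of pairing them by position, the right direction? (file3 here stands for the third spreadsheet, by analogy with file1 and file2 above.)
import csv
from collections import Counter

def count_column(path):
    # count the values in the second column of a csv file
    with open(path, 'r') as fr:
        return Counter(row[1] for row in csv.reader(fr))

c1 = count_column(file1)
c2 = count_column(file2)
c3 = count_column(file3)  # file3: the third spreadsheet (assumed name)

# a Counter returns 0 for missing keys, so absent names are handled
X = sum(1 for name in c1 if c2[name] == 1 and c3[name] == 5)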

I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest trying pandas: a really useful tool for quickly and efficiently managing data frames.
You can easily import a .csv spreadsheet with
import pandas as pd
df = pd.read_csv(filename)
and then perform almost any kind of operation on it.
Check out this answer; I only had a little time to read it, but I hope it helps:
what is the most efficient way of counting occurrences in pandas?
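For instance, counting how many times each value appears in a column is a one-liner with value_counts (a minimal sketch; the file name and the "name" column label are assumed placeholders):
import pandas as pd

df = pd.read_csv("file1.csv")
# value_counts returns a Series mapping each distinct value to its frequency
counts = df["name"].value_counts()
print(counts.get("Ana", 0))  # occurrences of "Ana", or 0 if absent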
UPDATE: then try this
# not tested but should work
import os
import pandas as pd

# read all csv sheets from folder - I assume your folder is named "CSVs"
files = next(os.walk("CSVs"))[-1]

# here a list of dataframes is generated, one per file
df_list = []
for file in files:
    df = pd.read_csv(os.path.join("CSVs", file))
    df_list.append(df)

name_i_wanna_count = ""  # this will be your query
column_name = ""  # here insert the column you want to analyze

count = 0
for df in df_list:
    # retrieve the rows matching your query, then count them
    matching_rows = df.loc[df[column_name] == name_i_wanna_count]
    count = count + len(matching_rows)

print(count)
I hope it helps

Related

Data cleanup in Python, removing CSV rows based on a condition

I've come across a bit of a challenge where I need to sanitize data in a CSV file based on the following criteria:
If the data exists with a date, remove the row with an NA value from the file;
If it is a duplicate, remove it; and
If the data exists only on its own, leave it alone.
I am currently able to do both 2 and 3; however, I am struggling to write a condition that captures criterion 1.
Sample CSV File
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Server_B,Test,NA,NA
Current Code So Far
import csv
input_file = 'sample.csv'
output_file = 'completed_output.csv'
with open(input_file, 'r') as inputFile, open(output_file, 'w') as outputFile:
    seen = set()
    for line in inputFile:
        if line in seen:
            continue
        seen.add(line)
        outputFile.write(line)
Currently, this handles the duplicates and keeps the unique rows. However, I cannot work out the best way to remove the NA row for a server that also appears with a date.
A set is also unordered, so I wasn't sure of the best way to compare on one column and then filter down from there.
Any suggestions or solutions that could help me would be greatly appreciated.
Current Output So Far
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Expected Output
Name,Environment,Available,Date
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
You can use pandas instead of doing all of that manually. I have written a short function called custom_filter which takes the criteria into consideration.
One area of potential bugs is the NA handling: pandas reads the literal "NA" strings in the sample as missing values, so the rows are tested with pd.isna() (comparing directly against pd.NA, np.nan or None does not work reliably).
import pandas as pd

df = pd.read_csv('sample.csv')
df = df.drop_duplicates()

data_present = []

def custom_filter(x):
    global data_present
    if not pd.isna(x['Date']):
        # row has a real date: remember the server and keep the row
        data_present.append(x['Name'])
        return True
    elif x['Name'] not in data_present:
        # NA row for a server that never appears with a date: keep it
        return True
    else:
        return False

# sort so dated rows are processed before NA rows (NaNs sort last)
df = df.sort_values('Date')
df = df[df.apply(custom_filter, axis=1)]
df.to_csv('completed_output.csv', index=False, na_rep='NA')
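If you prefer to avoid apply, an equivalent vectorized sketch (same assumption that pandas parses the "NA" strings as missing values) would be:
import pandas as pd

df = pd.read_csv('sample.csv').drop_duplicates()

# servers that appear at least once with a real date
dated_names = set(df.loc[df['Date'].notna(), 'Name'])

# keep dated rows, and NA rows only for servers that never have a date
keep = df['Date'].notna() | ~df['Name'].isin(dated_names)
df[keep].to_csv('completed_output.csv', index=False, na_rep='NA')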

Changing Headers in .csv files

Right now I am trying to read in data which is provided in a format that is messy to read in. Here is an example:
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the [Data] header to ['x', 'y'] and am able to read the data in just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders, and I am trying to figure out the best way to read them in and change the header of each file from [Data] to ['x', 'y'].
The csv files live in a folder one level below the script that is supposed to read them (i.e. folder 1 contains the code below and also contains folder 2, which holds the csv files).
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader(sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName + sets[i] + ".csv"), sep=r'\s*,', skiprows=skip, engine='python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]], errors='coerce')
        df.append(df_temp)
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
x y
0 1 2
1 3 4
2 5 6
Expanding Kraigolas's answer, to do this with multiple files you can use a comprehension. Note that glob.glob returns a list, so the matches need to be flattened into a single list of paths:
files = [f for set_num in sets for f in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names=["x", "y"]) for file in files])
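Since the files are split between two folders, a hedged sketch (the folder names here are placeholders for your actual layout) that gathers both sets before concatenating:
import glob
import pandas as pd

# "folder_a" and "folder_b" are placeholder names -- adjust to your layout
files = glob.glob("folder_a/*.csv") + glob.glob("folder_b/*.csv")
df = pd.concat(
    [pd.read_csv(f, comment="#", names=["x", "y"]) for f in files],
    ignore_index=True,
)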
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows and skipfooter arguments to skip header and footer rows:
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"])
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
I also have files where the number of heading rows changes from file to file.
In that case I used the following code to iterate until the first "good" row is found, then build a new dataframe from the remaining rows. The column names are taken from that first "good" row, and the types come from a dtype mapping supplied separately.
This is certainly not fast; it's a last-resort solution. If I had a better one, I'd use it:
def skip_to_header(df, first_col, dtype):
    # df is read with header=None; first_col is the expected name of the
    # first column; dtype maps column names to types
    data = df
    if first_col not in df.columns:
        # Skip rows until we find the first col header
        for i, row in df.iterrows():
            if row[0] == first_col:
                data = df.iloc[(i + 1):].reset_index(drop=True)
                # Read the column names from the header row
                series = df.iloc[i]
                series = series.str.strip()
                data.columns = list(series)
                # Use only existing column types
                types = {k: v for k, v in dtype.items() if k in data.columns}
                # Apply the column types again
                data = data.astype(dtype=types)
                break
    return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adapted to use different conditions, e.g. looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
if row[0].isnumeric():
data = df.iloc[(i + 1):].reset_index(drop=True)
# Apply names and types
data.columns = columns
data = data.astype(dtype=dtypes)
break
return data
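For completeness, a hedged usage sketch (the wrapper names above and the file name are placeholders):
import pandas as pd

# read everything as raw strings, with no header inference
raw = pd.read_csv("messy_file.csv", header=None, dtype=str)

data = skip_to_header(raw, first_col="x",
                      dtype={"x": "float64", "y": "float64"})
# or, when hunting for the first numeric row instead:
# data = skip_to_numeric(raw)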

Pandas - Overwrite single column with new values, retain additional columns; overwrite original files

Fairly new to Python. I have a csv with 2 columns; I need the code to perform a simple calculation on the first column while retaining the information in the second. The code currently performs the calculation (albeit only on the first csv in the list, and there are numerous), but I haven't figured out how to overwrite the values in each file while leaving the second column unchanged. I'd like it to save over the original files with the new calculations. Additionally, the originals have no header, so pandas automatically assigns numeric column labels.
import os
import pandas as pd

def find_csv(topdir, suffix='.csv'):
    filenames = os.listdir(topdir)
    csv_list = [name for name in filenames if name.endswith(suffix)]
    fp_list = []
    for csv in csv_list:
        fp = os.path.join(topdir, csv)
        fp_list.append(fp)
    return fp_list

def wn_to_um(wn):
    um = 10000 / wn
    return um

for f in find_csv('C:/desktop/test'):
    readit = pd.read_csv(f, usecols=[0])
    convert = wn_to_um(readit)
    df = pd.DataFrame(convert)
    df.to_csv('C:/desktop/test/whatever.csv')
I suppose you just have to make minor changes to your code.
def wn_to_um(wn):
    wn.iloc[:, 0] = 10000 / wn.iloc[:, 0]  # performing the operation on the first column
    return wn

for f in find_csv('C:/desktop/test'):
    readit = pd.read_csv(f, header=None)  # here read the whole file; the originals have no header
    convert = wn_to_um(readit)  # the second column passes through untouched
    os.remove(f)  # to replace the existing file with the updated calculation, simply delete and rewrite
    convert.to_csv(f, header=False, index=False)  # write back to the original path
Say you have a column named 'X' which you want to divide by 10,000. You can store it as X and then divide each element in X like so:
X = df['X']
new_x = [i / 10000 for i in X]
From here, rewriting the column in the dataframe is very simple:
df['X'] = new_x
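That said, the list isn't strictly necessary: pandas arithmetic is vectorized, so the whole column can be transformed in one line, e.g.
df['X'] = df['X'] / 10000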
Just update your second function as:
def wn_to_um(wn):
    wn.iloc[:, 0] = 10000 / wn.iloc[:, 0]
    return wn

Checking for Regular Expressions within a CSV

I'm currently trying to run through my csv file and identify the rows in a column.
The output should be something like "This column contains alpha characters only".
My code currently:
Within a method I have:
print('\nREGULAR EXPRESSIONS\n' +
      '----------------------------------')
for x in range(0, self.tot_col):
    print('\n' + self.file_list[0][x] +
          '\n--------------')  # Prints the column name
    for y in range(0, self.tot_rows + 1):
        if regex.re_alpha(self.file_list[y][x]) is True:
            true_count += 1
        else:
            false_count += 1
    if true_count > false_count:
        percentage = (true_count / self.tot_rows) * 100
        print(str(percentage) + '% chance that this column is alpha only')
    true_count = 0
    false_count = 0
self.file_list is the csv file in list format.
self.tot_rows & self.tot_col are the total rows and total columns respectively which has been calculated earlier within the program.
regex.re_alpha has been imported from a file and the method looks like:
def re_alpha(column):
    # Checks alpha characters
    alpha_valid = alpha.match(column)
    if alpha_valid:
        return True
    else:
        return False
This currently works; however, I am unable to add my other regex checks (numeric, etc.).
I have tried duplicating the if statement with a different regex check, but it doesn't work.
I've also tried doing the counts in the regex.py file, but the count stops at 1 and returns the wrong information.
I thought creating a class in the regex.py file would help, but to no avail.
Summary:
I would like to run multiple regex checks against my csv file and have the results organized by column.
Thanks in advance.
From the code above, the first line of the CSV contains the column names. This means you could make a dictionary to contain your result where the keys are the column names.
from csv import DictReader

reader = DictReader(open(filename))  # filename is the name of the CSV file
results = {}
for row in reader:
    for col_name, value in row.items():
        results.setdefault(col_name, []).append(regex.re_alpha(value))
Now you have a dictionary called results which stores the output of the regex checks by column name, and you can compute statistics from it. Alternatively, save the rows in a list as you read them; once you decide on an order, you can go back and write them to a new CSV file using the keys in the new order:
csv_writer = csv.writer(open(output_filename, 'w'))
new_order = [...]  # list of key names in the right order
for row in saved_data:
    new_row = map(row.get, new_order)
    csv_writer.writerow(new_row)
Admittedly this is a bit of a sketch but it should get you going.
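To extend this to several checks at once, here is a hedged sketch (the patterns and file name are placeholders, not from the original post) that keeps a table of compiled regexes and tallies each one per column:
import re
from csv import DictReader

# placeholder patterns -- add whatever checks you need
checks = {
    'alpha': re.compile(r'[A-Za-z]+'),
    'numeric': re.compile(r'[0-9]+'),
}

with open('data.csv', newline='') as f:
    rows = list(DictReader(f))

for col_name in rows[0]:
    values = [row[col_name] for row in rows]
    for check_name, pattern in checks.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        print(f'{col_name}: {hits / len(values) * 100:.0f}% {check_name}')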

Python CSV - Check if index is equal on different rows

I'm trying to create code that checks whether the value in the index column of a CSV is the same across different rows, and if so, finds the most frequently occurring values in the other columns and uses those as the final data. That's not a very good explanation; basically, I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one gets output?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print df
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn not only how to find the mode of a column of numbers, but also the most frequently occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC; it seems to be in the wrong variable format. I want each column to be output as just the 3-digit number.
Edit: I'm also having an issue where, if the value in column A is 0, the output becomes 2 digits and loses the leading 0.
What about something like this? It doesn't use pandas, though; I am not a pandas expert.
from collections import Counter

dataDict = {}

# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    next(dataFile)  # skip the header row so it is not counted as data
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.strip().split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file, header first
with open('out.csv', 'w') as f:
    f.write('customer_ID,month,time,ABC\n')
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC]) + '\n')
It generates a file called out.csv that looks like this:
customer_ID,month,time,ABC
1003,Jan,2:00,114
1004,Feb,8:00,251
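Since the question started in pandas, here is a hedged groupby sketch as well (untested; reading A, B and C as strings preserves leading zeros, and mode() returns ties sorted, so taking the first element gives a deterministic tie-break):
import pandas as pd

df = pd.read_csv('data.csv', dtype={'A': str, 'B': str, 'C': str})
# most frequent value of every column within each customer;
# mode() sorts ties, so iloc[0] picks the first deterministically
modes = df.groupby('customer_ID').agg(lambda s: s.mode().iloc[0])
modes['ABC'] = modes['A'] + modes['B'] + modes['C']
modes[['month', 'ABC']].to_csv('answer.csv')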
