Checking for Regular Expressions within a CSV - python

I'm currently trying to run through my CSV file and identify what kind of values each column contains.
The output should be something like "This column contains alpha characters only".
My current code, inside a method, is:
print('\nREGULAR EXPRESSIONS\n' +
      '----------------------------------')
for x in range(0, self.tot_col):
    print('\n' + self.file_list[0][x] +
          '\n--------------')  # Prints the column name
    for y in range(0, self.tot_rows + 1):
        if regex.re_alpha(self.file_list[y][x]) is True:
            true_count += 1
        else:
            false_count += 1
    if true_count > false_count:
        percentage = (true_count / self.tot_rows) * 100
        print(str(percentage) + '% chance that this column is alpha only')
    true_count = 0
    false_count = 0
self.file_list is the csv file in list format.
self.tot_rows & self.tot_col are the total rows and total columns respectively, which have been calculated earlier in the program.
regex.re_alpha has been imported from a file and the method looks like:
def re_alpha(column):
    # Checks alpha characters
    alpha_valid = alpha.match(column)
    if alpha_valid:
        return True
    else:
        return False
This currently works; however, I am unable to add my other regex checks, such as numeric, etc.
I have tried to duplicate the if statement with a different regex check, but it doesn't work.
I've also tried to do the counts in the regex.py file, however the count stops at '1' and returns the wrong information.
I thought creating a class in the regex.py file would help, but to no avail.
Summary:
I would like to run multiple regex checks against my csv file and have the results organised by column.
Thanks in advance.

From the code above, the first line of the CSV contains the column names. This means you could make a dictionary to contain your result where the keys are the column names.
from csv import DictReader

reader = DictReader(open(filename))  # filename is the name of the CSV file
results = {}
for row in reader:
    for col_name, value in row.items():
        results.setdefault(col_name, []).append(regex.re_alpha(value))
Now you have a dictionary called results which holds the output of the regex checks stored by column name, and you can compute statistics from it. Alternatively, you could save the rows in a list as you read them; once you decide on an order, you can go back and write the rows to a new CSV file by outputting each row's items using the keys in the new order:
import csv

csv_writer = csv.writer(open(output_filename, 'w'))
new_order = [...]  # list of key names in the desired order
for row in saved_data:
    new_row = map(row.get, new_order)
    csv_writer.writerow(new_row)
Admittedly this is a bit of a sketch but it should get you going.
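For instance, to turn those per-column lists of True/False results into the kind of percentage the question prints, a minimal sketch building on the results dictionary above:

for col_name, checks in results.items():
    percentage = 100 * sum(checks) / len(checks)
    print(f'{percentage}% chance that the {col_name} column is alpha only')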

Related

Finding a logic to map the values of two .tsv files

I am working on a problem where I have 2 .tsv files and one has been arranged wrongly with respect to the other one.
When I scanned the files, I noticed a pattern that I am unable to express in code. The pattern that I observed was:
For every increase of one row in the metadata file, I need to move 8 rows forward in the flipped_metadata.tsv file to find the same values as in the metadata file.
For every increase of one row in the flipped_metadata file, I need to move 12 rows forward in the metadata.tsv file to find the same values as in the flipped_metadata file.
For more clarity I have attached the 2 .tsv files along with this:
Metadata.tsv file and Flipped_metadata.tsv file
The openpyxl library has good functions for dealing with Excel cell locations. These can be used to convert A1 to a proper row and column.
Read each row in and convert the cell reference to a simple numeric row and column value. Use a dictionary to store each cell found with the two values for that cell. e.g. cells[(1,1)] = "123 456"
While reading in, keep track of the largest row and column numbers seen.
Create an empty array (list of lists) to allow each cell to be assigned into.
Iterate over all of the dictionary items and assign each value into the array.
Finally save the array to a new CSV file.
For example:
from openpyxl.utils.cell import coordinate_from_string, column_index_from_string
import csv

def flip(input_filename, output_filename):
    cells = {}
    max_row = 0
    max_col = 0

    with open(input_filename) as f_input:
        for cell, v1, v2 in csv.reader(f_input, delimiter='\t'):
            col_letter, row_number = coordinate_from_string(cell)
            col_number = column_index_from_string(col_letter)
            cells[(row_number, col_number)] = f"{v1} {v2}"

            if row_number > max_row:
                max_row = row_number
            if col_number > max_col:
                max_col = col_number

    output = [[''] * max_col for _ in range(max_row)]

    for (row_number, col_number), values in cells.items():
        output[row_number - 1][col_number - 1] = values

    with open(output_filename, 'w', newline='') as f_output:
        csv.writer(f_output).writerows(output)

flip('metadata.tsv', 'output_metadata.csv')
flip('flipped_metadata.tsv', 'output_flipped_metadata.csv')
This would give you the reconstructed grid written out as a CSV file for each input.
Note: this approach correctly handles all cell references, e.g. FK42. It would also handle holes in the data: if A2 were deleted, the rest would still align correctly, since it is not 100% clear whether cells can be missing.

How to group csv in python without using pandas

I have a CSV file with 3 columns: "Username", "Date", "Energy saved", and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022 Total Energy saved = 1001, Date: 24/2/2022 Total Energy saved = 700)
I am a beginner at python and typically, I would use pandas to resolve this issue but it is not allowed for this project so I am at a complete loss on where to even begin. I would appreciate any help and guidance. Thank you.
My alternative to pandas for opening CSV files is Python's built-in csv module. You read the file and extract just the values that you need. Here I filter on the first column and keep the values from the column of interest (the third column, index 2).
import csv

energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
    file = csv.reader(csvfile)
    for row in file:
        if row[0] == "merrytan":
            energy_saved.append(row[2])

energy_saved = sum(map(int, energy_saved))
This collects just the values you care about into a list and then sums them.
Edit: I just realised that I left out the date part of your request completely. Here's the update.
import csv

my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0] == "merrytan":
            my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need to get the date column of the file as well. Since pandas is off the table, we represent the result as a dictionary with dates as keys and energy totals as values.
Your date column has repeated values (whether intended or not), and dictionary keys must be unique, so we build the dictionary up in a loop: each date is added as a key with its corresponding energy value, and when the date is already present, the new value is added to the existing total instead.
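A minimal sketch of printing those per-date totals from my_dict as built above:

for date, total in my_dict.items():
    print(f"Date: {date} Total Energy saved = {total}")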
I would turn your CSV file into a two-level dictionary, with username and then date as the keys
infile = open("data.csv", "r").readlines()
savings = dict()

# Skip the first line of the CSV, since that has the column names, not data
for row in infile[1:]:
    username, date_col, saved = row.strip().split(",")
    saved = int(saved)
    if username in savings:
        if date_col in savings[username]:
            savings[username][date_col] = savings[username][date_col] + saved
        else:
            savings[username][date_col] = saved
    else:
        savings[username] = {date_col: saved}
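With that structure in place, reading the totals back out is just nested dictionary iteration; a small sketch:

# print every user's per-date totals
for username, by_date in savings.items():
    for date_col, total in by_date.items():
        print(username, date_col, total)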

Count and compare occurrences across different columns in different spreadsheets

I would like to know (in Python) how to count occurrences and compare values from different columns in different spreadsheets. After counting, I would need to know whether those values fulfil a condition, i.e. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new in Python, but I have tried getting the .values() after using the Counter from collections. However, I am not sure if the real value Ana is being considered when iterating over the results of the Counter. All in all, I need to iterate over each element in spreadsheet one and check whether it appears one time in the second spreadsheet and five times in the third spreadsheet; if so, the variable X should be incremented by one.
def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_ocurrences = number_of_occurrences_in_list_1.values()

    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_ocurrences = number_of_occurrences_in_list_2.values()

    X = 0
    for x, y in zip(list1_ocurrences, list2_ocurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested with small spreadsheets, but this only works for pre-ordered values. If Ana appears after 100000 rows, everything breaks. I think I need to iterate over each value (e.g. Ana) and check it in all the spreadsheets simultaneously, incrementing the variable X.
I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest you try using pandas: a really useful tool for quickly and efficiently managing data frames.
You can easily import a .csv spreadsheet with the

import pandas as pd
df = pd.read_csv("your_spreadsheet.csv")  # placeholder file name

method, then perform almost any kind of operation.
Check this answer out: I only had a little time to read it, but I hope it helps:
what is the most efficient way of counting occurrences in pandas?
UPDATE: then try with this
# not tested but should work
import os
import pandas as pd

# read all csv sheets from folder - I assume your folder is named "CSVs"
for files in os.walk("CSVs"):
    files = files[-1]

# here a list of dataframes is generated
df_list = []
for file in files:
    df = pd.read_csv("CSVs/" + file)
    df_list.append(df)

name_i_wanna_count = ""  # this will be your query
columun_name = ""        # here insert the column you wanna analyze
count = 0
for df in df_list:
    # retrieve the rows matching your query and then count the elements inside
    matching_serie = df.loc[df[columun_name] == name_i_wanna_count]
    partial_count = len(matching_serie)
    count = count + partial_count

print(count)
I hope it helps
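For the specific "1 occurrence in the second sheet, 5 in the third" condition from the question, a hedged sketch along the same pandas lines might look like this (the file names and the column name "user" are assumptions, not taken from the question):

import pandas as pd

# one count per distinct name in each sheet (assumed column "user")
counts2 = pd.read_csv("sheet2.csv")["user"].value_counts()
counts3 = pd.read_csv("sheet3.csv")["user"].value_counts()

X = 0
for name in pd.read_csv("sheet1.csv")["user"].unique():
    if counts2.get(name, 0) == 1 and counts3.get(name, 0) == 5:
        X += 1
print(X)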

Write list to specific column in csv

I'm trying to write the data from my list to just column 4
namelist = ['PEAR']

for name in namelist:
    for man_year in yearlist:
        for man_month in monthlist:
            with open('{2}\{0}\{1}.csv'.format(man_year, man_month, name), 'w') as filename:
                writer = csv.writer(filename)
                writer.writerow(name)
                time.sleep(0.01)
it outputs to a csv like this
P E A R
4015854 234342 2442343 234242
How can I get it to go in just the 4th column?
PEAR
4015854 234342 2442343 234242
Replace the line writer.writerow(name) with:
writer.writerow(['', '', '', name])
When you pass name to csv.writer it treats the string as an iterable and writes each character to its own column.
So, to get rid of this problem, change the following line:
writer.writerow(name)
With:
writer.writerow([''] * (len(other_row)-1) + [name])
Here other_row can be any one of the other rows, but if you are sure about the length you can do something like:
writer.writerow([''] * (length-1) + [name])
Instead of writing '' to cells you don't want to touch, you could use df.at. For example, you could write df.at[index, ColumnName] = 10, which would change only the value of that specific cell.
You can read more about it here: Set value for particular cell in pandas DataFrame using index
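As a rough illustration of that df.at approach (the frame, column names, and values here are made up for the example, not taken from the question):

import pandas as pd

df = pd.DataFrame([[4015854, 234342, 2442343, None]],
                  columns=['A', 'B', 'C', 'D'])
df.at[0, 'D'] = 'PEAR'  # change only that single cell
df.to_csv('output.csv', index=False)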

Python CSV - Check if index is equal on different rows

I'm trying to create code that checks whether the value in the index column of a CSV is the same across different rows, and if so, finds the most frequently occurring values in the other columns and uses those as the final data. Not a very good explanation; basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one is output?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create the output below (I was ignoring month previously, but now I'd like to incorporate it, so that I can learn not only how to find the mode of a column of numbers but also the most frequently occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it outputs the .0 at the end of the ABC value; it seems to be the wrong data type. I want each row to be output as just the 3-digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.
What about something like this? This is not using Pandas though, as I am not a Pandas expert.
from collections import Counter

dataDict = {}

# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # strip the newline and split by ',' since it is a csv file
        entry = line.strip().split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file
with open('out.csv', 'w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this (note that the header row of data.csv is also treated as data, and after sorting it ends up as the last line):
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
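Since the question started from pandas, here is a hedged pandas sketch of the same per-customer aggregation; reading every column as a string sidesteps the '.0' and leading-zero problems mentioned above, and ties are simply broken by taking the first value that mode() returns (the smallest), which you can change to whichever rule you prefer:

import pandas as pd

# read everything as strings so leading zeros are kept and no '.0' appears
df = pd.read_csv('data.csv', dtype=str)

# most frequent value per column within each customer
modes = df.groupby('customer_ID').agg(lambda s: s.mode().iloc[0])
modes['ABC'] = modes['A'] + modes['B'] + modes['C']
modes[['month', 'ABC']].to_csv('answer.csv')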