Arranging columns in Python

I'm really stuck on some code. I've been working on it for a while now, so any guidance is appreciated. I'm trying to map many files' related columns into one final set of columns. The logic of my code is to:
1. identify my desired final column names,
2. read the incoming file,
3. identify the top row as my column headers/names,
4. identify all rows underneath as data for those columns,
5. based on each column header, add that column's data to the most closely related final column, and
6. have an exit condition (if no more data, end the program).
If anyone could help me with steps 3 and 4, I would greatly appreciate it, because that is where I'm currently stuck.
I'm getting KeyError: 0 on the line columnHeader = row[i]. Does anyone know how to solve this particular problem?
#!/usr/bin/env python
import sys  # used for passing in the argument
import csv, glob

SLSDictionary = {}
fieldMap = {'zipcode': ['Zip5', 'zip4'],
            'firstname': [],
            'lastname': [],
            'cust_no': [],
            'user_name': [],
            'status': [],
            'cancel_date': [],
            'reject_date': [],
            'streetaddr': ['address2', 'servaddr'],
            'city': [],
            'state': [],
            'phone_home': ['phone_work'],
            'email': []
            }
CSVreader = csv.DictReader(open('N:/Individual Files/Jerry/2013 customer list qc, cr, db, gb 9-19-2013_JerrysMessingWithVersion.csv', "rb"), dialect='excel', delimiter=',')
i = 0
for row in CSVreader:
    if i == 0:
        columnHeader = row[i]
    else:
        columnData = row[i]
    i += 1
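For what it's worth, csv.DictReader already does steps 3 and 4 for you: it consumes the first row as headers and yields every subsequent row as a dict keyed by those header names, which is exactly why indexing with an integer (row[i]) raises KeyError: 0. A minimal sketch of step 5 under that reading, reusing the fieldMap and SLSDictionary above (Python 3 shown; the file path is shortened for the example):

import csv

# Each row is a dict keyed by column name, so step 5 reduces to
# checking which known header (final name or alias) is present.
with open('customers.csv', newline='') as f:
    for row in csv.DictReader(f):
        for final_col, aliases in fieldMap.items():
            for source_col in [final_col] + aliases:
                if source_col in row:
                    SLSDictionary.setdefault(final_col, []).append(row[source_col])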

Related

How to group csv in python without using pandas

I have a CSV file with 3 columns: "Username", "Date", "Energy saved", and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022 Total Energy saved = 1001, Date: 24/2/2022 Total Energy saved = 700)
I am a beginner at Python, and typically I would use pandas to solve this, but it is not allowed for this project, so I am at a complete loss on where to even begin. I would appreciate any help and guidance. Thank you.
My alternative to pandas for opening csv files is the csv module in the standard library. You read the file like any other and just extract the values that you need. Here I filter on the first column and keep only the matching values from the column of interest (the third column, index 2).
import csv

energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
    file = csv.reader(csvfile)
    for row in file:
        if row[0] == "merrytan":
            energy_saved.append(row[2])

energy_saved = sum(map(int, energy_saved))
Now you have a list of just the values you care about, and you can sum them afterwards.
Edit - So, I just realized that I left out the date part of your request completely lol. Here's the update.
import csv

my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0] == "merrytan":
            my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need to get the date column of the file as well. We want to present two "columns" of output, and since pandas has been prohibited, a dictionary with dates as keys and energy totals as values does the job.
But your date column has repeated values (whether intended or not), and dictionary keys must be unique. So we use a loop: add each date as a key with its corresponding energy as the value, and when the date is already present, add the new energy to the existing total instead.
I would turn your CSV file into a two-level dictionary, with username and then date as the keys:
infile = open("data.csv", "r").readlines()

savings = dict()
# Skip the first line of the CSV, since that has the column names,
# not data
for row in infile[1:]:
    username, date_col, saved = row.strip().split(",")
    saved = int(saved)
    if username in savings:
        if date_col in savings[username]:
            savings[username][date_col] += saved
        else:
            savings[username][date_col] = saved
    else:
        savings[username] = {date_col: saved}
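A slightly tidier variant of the same idea (a sketch, assuming the same three-column layout with a header row) uses collections.defaultdict so the "is the key already present?" branching disappears:

import csv
from collections import defaultdict

# Missing usernames/dates start at 0, so we can always just add.
savings = defaultdict(lambda: defaultdict(int))
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for username, date_col, saved in reader:
        savings[username][date_col] += int(saved)

# e.g. savings["merrytan"] is a {date: total_energy_saved} mapping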

Python and Excel Formula

Complete beginner here but have a specific need to try and make my life easier with automating Excel.
I have a weekly report that contains a lot of useless columns; using the Python code below, I can delete these and rename the rest.
from openpyxl import Workbook, load_workbook
wb = load_workbook('TestExcel.xlsx')
ws = wb.active
ws.delete_cols(1,3)
ws.delete_cols(3,8)
ws.delete_cols(4,3)
ws.insert_cols(3,1)
ws['A1'].value = "Full Name"
ws['C1'].value = "Email Address"
ws['C2'].value = '=B2&"#testdomain.com"'
wb.save('TestExcelUpdated.xlsx')
This does the job, but I would like the formula to continue from B2 downwards (since the top row contains headings).
ws['C2'].value = '=B2&"#testdomain.com"'
Obviously, in Excel it is just a case of dragging the formula down to the end of the column but I'm at a loss to get this working in Python. I've seen similar questions asked but the answers are over my head.
Would really appreciate a dummies guide.
Example of Excel report after Python code
One way to do this is by iterating over the rows in your worksheet.
for row in ws.iter_rows(min_row=2):  # min_row ensures you skip your header row
    row[2].value = '=B' + str(row[0].row) + '&"#testdomain.com"'
row[2].value selects the third column due to zero-based indexing; row[0].row gets the number of the current row.
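Putting that together with the original script (a sketch; the file names and column layout are taken from the question):

from openpyxl import load_workbook

wb = load_workbook('TestExcel.xlsx')
ws = wb.active

# Write the formula into column C for every data row, skipping the header.
for row in ws.iter_rows(min_row=2):
    row[2].value = '=B{}&"#testdomain.com"'.format(row[0].row)

wb.save('TestExcelUpdated.xlsx')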

Possible Null Value Cells In csv File

Good evening,
May I please get advice on a section of code that I have? Here is the code:
import pandas as pd
import numpy as np
import os
import logging

#logging.getLogger('tensorflow').setLevel(logging.ERROR)

os.chdir('/media/Maps/test_run/')
coasts_data = 'new_coastline_combined.csv'
coasts = pd.read_csv(coasts_data, header=None, encoding='latin1')

# Combine all data into a single frame
frames = [coasts]
df = pd.concat(frames)

n = 1000  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

l = []
for index, frame in enumerate(list_df):
    vals = frame.iloc[:, 2].values
    # if any values in this part of the frame are wrong, store index for deletion
    if np.any(vals < -100000):
        l.append(index)

for x in sorted(l, reverse=True):
    del list_df[x]

df1 = pd.concat(list_df)
df1.to_csv(r'/media/test_run_6/test.csv', index=False, header=False)
What the code does is take data from a csv file, break it into groups of 1000 rows each, determine whether erroneous values are present, and delete any group that has them. It works fine on most csv files. However, on one csv file I get the following error:
TypeError: '<' not supported between instances of 'str' and 'int'
The error occurs at this line:
if np.any(vals < -100000):
I suspect (and please correct me if I'm wrong) that there are null (empty) cells in this particular column of the csv (the csv is 6,000,000 rows deep, btw).
May I please get help finding out what the problem is and how to fix it? Thank you.
I recommend catching the error to find out which chunk is causing it, and then manually checking what has gone wrong in your data.
Something like this:
try:
    if np.any(vals < -100000):
        l.append(index)
except TypeError:
    print(vals)
    print(index)
If it's really just empty cells, you could check for that and ignore them.
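If the column is supposed to be numeric, one way to cope with stray strings or blanks (a sketch using pandas' own coercion; anything unparseable becomes NaN, and NaN comparisons are simply False) is:

import pandas as pd

# Coerce the third column to numeric; blanks and strings become NaN.
vals = pd.to_numeric(frame.iloc[:, 2], errors='coerce')

# NaNs fail the comparison, so only genuine numeric outliers flag the chunk.
if (vals < -100000).any():
    l.append(index)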

Difficulty creating annotation script in python

I was trying to make an annotation tool in Python to edit a large csv file and generate a JSON output, but being a newer programmer, I have been facing a lot of difficulties.
I have two csv files that I generated, and from one of them I pulled a list of the rows and columns that I want to match:
filtered list of columns and rows
The other file has a list of outputs that I want to match against to gather the specified data entries: raw outputs
For example, I would want to print the data in row 6 from column Q5.3 alongside the original question, and then mark it as good or bad. If it is bad, I want to be able to add a comment.
I would like to generate a json file that compiles all of this at the end. I tried to write the code, but it came out as complete garbage; I was hoping to understand how to implement it properly and just became really confused.
Any help would be really appreciated!
The output should go through all the specified data and print:
Question Number,
Question,
Response,
Annotate as Good or Bad,
If Bad then able to add comment,
Continue to next data piece,
When done generate a json for data
Thank you :)
My attempt:
import csv
import json

csv_results_path = 'results.csv'
categories = {'1': 'Unacceptable', '2': 'Edit'}

def get_annotation_input():
    # Keep asking until the user types 1 or 2; only prompt for an
    # edited response when '2' (Edit) is chosen.
    while True:
        try:
            annotation = int(input("Annotation: "))
            if annotation not in range(1, 3):
                raise ValueError
            edited_comment = input("Edited response: ") if annotation == 2 else ''
            return annotation, edited_comment
        except ValueError:
            print("Enter 1 or 2")

def annotate(row_number, question_number):
    annotator = input("What is your name? ")
    print("-" * 50)
    print("Annotate the following answer as 1-Unacceptable, 2-Edit")
    annotation, edited_comment = get_annotation_input()
    with open("annotations.json", mode='a') as annotation_file:
        annotation_data = {'annotator': annotator,
                           'row_number': row_number,
                           'question_number': question_number,
                           'annotation': categories[str(annotation)],
                           'edited_response': edited_comment}
        json.dump(annotation_data, annotation_file)
        annotation_file.write('\n')

if __name__ == "__main__":
    # Acquire the lists of rows and question numbers from the filtered file
    with open('annotations_full.csv', newline='') as infile:
        response_reader = csv.DictReader(infile)
        responses = {}
        for row in response_reader:
            for header, response in row.items():
                try:
                    responses[header].append(response)
                except KeyError:
                    responses[header] = [response]
    rows = responses['row_number']
    columns = responses['question_number']
    print(rows)
    print(columns)
    # Annotate each (row, column) pair in turn
    for row_number, question_number in zip(rows, columns):
        annotate(row_number, question_number)
I was successfully able to get a list of the rows and columns that I wanted; however, I am having difficulty accessing the data in the other csv file using the row and corresponding column to display and annotate it. Also, when I attempted to write code allowing a field for an edited response when '2' is specified, I got many output errors.
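For the lookup part, one approach (a sketch; results.csv, its header names, and 1-based row numbering are assumptions based on the screenshots) is to load the results file once with csv.DictReader and then index into it:

import csv

# Load every data row of the results file into memory once.
with open('results.csv', newline='') as f:
    results = list(csv.DictReader(f))

def get_response(row_number, question_number):
    # row_number is assumed to be 1-based and to count data rows only,
    # so row 6 is results[5]; question_number is a header like 'Q5.3'.
    return results[int(row_number) - 1][question_number]

print(get_response(6, 'Q5.3'))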

CSV in python getting wrong number of rows

I have a csv file that I want to work with using Python. To do that, I run this code:
import csv
import collections

col_values = collections.defaultdict(list)
with open('list.csv', 'rU') as f:
    reader = csv.reader(f)
    data = list(reader)
    row_count = len(data)
    print(" Number of rows ", row_count)
As a result I get 4357, but the file has only 2432 rows. I've tried changing the delimiter in the reader() call; it didn't change the result.
So my question is: does anybody have an explanation for why I get this value?
Thanks in advance.
UPDATE
Since the number of columns is also too big, here is the output of the last row and the start of the non-existent rows for one column:
Opening the file with Excel, the end looks like:
I hope it helps.
Try using pandas:
import pandas as pd

df = pd.read_csv('list.csv')
df.count()
Check whether you are getting the proper number of rows now.
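One common cause of an inflated row count (an assumption, since we can't see the file) is fields that contain embedded line breaks: opening the file in 'rU' universal-newline mode splits those fields apart, whereas newline='' hands line-ending handling to the csv module, so a quoted multi-line field still counts as one row:

import csv

# newline='' lets the csv module keep quoted fields with embedded
# line breaks intact instead of splitting them into extra rows.
with open('list.csv', newline='') as f:
    data = list(csv.reader(f))

print("Number of rows", len(data))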
