Difficulty creating annotation script in python

I was trying to make an annotation tool in Python to edit a large CSV file and generate JSON output, but as a newer programmer I have run into a lot of difficulties.
I generated two CSV files. From one of them I extracted the list of rows and columns that I want to match:
filtered list of columns and rows
And the other file has a list of outputs that I wanted to match it with and gather the specified data entry: raw outputs
For example, I would want to print the data in row 6 from column Q5.3 alongside the original question and then specify if it is good or bad. If it is bad I want to be able to add a comment.
In the end I would like to generate a JSON file that compiles all of this. I tried to write the code, but it did not work. I was hoping to understand how to implement this properly and just became really confused.
Any help would be really appreciated!
The output should go through all the specified data and print:
Question Number,
Question,
Response,
Annotate as Good or Bad,
If Bad then able to add comment,
Continue to next data piece,
When done generate a json for data
Thank you :)
My attempt:
import csv
from csv import reader
import json
csv_results_path = 'results.csv'
categories = { '1':'Unacceptable', '2':'Edit'}
# Checking the outputs (comment out)
''''
print(rows)
print(columns)
'''
'''
# Acquire data from specifed row, column in responses
# Example row 6, column '12.2'
'''
def get_annotation_input():
    while True:
        try:
            annotation = int(input("Annotation: "))
            if annotation not in range (1,3):
                raise ValueError
            return annotation, edited_comment
        except ValueError:
            print("Enter 1 or 2")
def annotate():
    annotator = input("What is your name? ")
    print(''.join(["-"]*50))
    print("Annotate the following answer as 1-Unacceptable, 2-Edit")
    with open("annotations.json", mode='a') as annotation_file:
        annotation_data = ('annotator': annotator, 'row_number':, 'question_number': , 'annotation' : categories[str(annotation)], 'edited_response': }
        json.dump(annotation_data, annotation_file)
        annotation_file.write('\n')
if __name__ == "__main__":
    annotate()
with open('annotations_full.csv', 'rU') as infile:
    response_reader = csv.DictReader(infile)
    responses = {}
    for row in response_reader:
        for header, response in row.items():
            try:
                responses[header].append(response)
            except KeyError:
                responses[header] = [response]
rows = responses['row_number']
columns = responses['question_number']
print(rows)
print(columns)
I was able to get a list of the rows and columns that I wanted. However, I am having difficulty using each row and its corresponding column to access the data in the other CSV file, so that I can display and annotate it. Also, when I attempted to write code to allow a field for an edited response if '2' is specified, I ran into many errors.
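A minimal sketch of one way the flow described above could be structured, offered as an illustration rather than a definitive implementation. It assumes the filtered file has row_number and question_number columns (as in the attempt), that the raw-output file has one column per question number, and that row numbers count data rows starting at 1. The function names, the ask parameter, and the sample data are all hypothetical. The original question text is left out because the post does not show where it is stored.

```python
import csv
import io
import json

CATEGORIES = {"1": "Good", "2": "Bad"}


def load_targets(filtered_file):
    """Return (row_number, question_number) pairs from the filtered CSV."""
    return [(int(r["row_number"]), r["question_number"])
            for r in csv.DictReader(filtered_file)]


def load_responses(raw_file):
    """Read the raw-output CSV into a list of dicts, one per data row."""
    return list(csv.DictReader(raw_file))


def annotate(targets, responses, ask=input):
    """Show each selected response, collect a label, and return the records."""
    records = []
    for row_number, question in targets:
        # Assumption: row_number counts data rows starting at 1.
        response = responses[row_number - 1][question]
        print(f"Question {question}, row {row_number}: {response}")
        choice = ""
        while choice not in CATEGORIES:
            choice = ask("Annotation (1-Good, 2-Bad): ").strip()
        record = {"row_number": row_number, "question_number": question,
                  "response": response, "annotation": CATEGORIES[choice]}
        if CATEGORIES[choice] == "Bad":
            record["comment"] = ask("Comment: ")
        records.append(record)
    return records


# Tiny in-memory stand-ins for the two CSV files (hypothetical data).
filtered = io.StringIO("row_number,question_number\n2,Q5.3\n1,Q1.1\n")
raw = io.StringIO("Q1.1,Q5.3\nyes,blue\nno,green\n")
answers = iter(["2", "off topic", "1"])  # scripted replies instead of typing
records = annotate(load_targets(filtered), load_responses(raw),
                   ask=lambda prompt: next(answers))
annotations_json = json.dumps(records, indent=2)
```

With real files, the two io.StringIO objects would be replaced by files opened with open(path, newline=''), ask would be left at its default of input, and annotations_json would be written to disk once at the end. Writing a single JSON list at the end, rather than appending one object per annotation, keeps the output file valid JSON.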

Related

Comparing and updating CSV files using lists

I'm writing something that will take two CSVs: #1 is a list of emails with the number received for each; #2 is a catalog of every email address on record, with the number of emails received per reporting period and the date annotated at the top of each column.
import csv
from datetime import datetime
datestring = datetime.strftime(datetime.now(), '%m-%d')
storedEmails = []
newEmails = []
sortedList = []
holderList = []
with open('working.csv', 'r') as newLines, open('archive.csv', 'r') as oldLines: #readers to make lists
    f1 = csv.reader(newLines, delimiter=',')
    f2 = csv.reader(oldLines, delimiter=',')
    print ('Processing new data...')
    for row in f2:
        storedEmails.append(list(row)) #add archived data to a list
    storedEmails[0].append(datestring) #append header row with new date column
    for col in f1:
        if col[1] == 'email' and col[2] == 'To Address': #new list containing new email data
            newEmails.append(list(col))
    counter = len(newEmails)
    n = len(storedEmails[0]) #using header row len to fill zeros if no email received
    print(storedEmails[0])
    print (n)
    print ('Updating email lists and tallies, this could take a minute...')
with open ('archive.csv', 'w', newline='') as toWrite: #writer to overwrite old csv
    writer = csv.writer(toWrite, delimiter=',')
    for i in newEmails:
        del i[:3] #strip useless identifiers from data
        if int(i[1]) > 30: #only keep emails with sufficient traffic
            sortedList.append(i) #add these emails to new sorted list
    for i in storedEmails:
        for entry in sortedList: #compare stored emails with the new emails, on match append row with new # of emails
            if i[0] == entry[0]:
                i.append(entry[1])
                counter -=1
            else:
                holderList.append(entry) #if no match, it is a new email that meets criteria to land itself on the list
                break #break inner loop after iteration of outer email, to move to next email and avoid multiple entries
    storedEmails = storedEmails + holderList #combine lists for archived csv rewrite
    for i in storedEmails:
        if len(i) < n:
            i.append('0') #if email on list but didnt have any activity this period, append with 0 to keep records intact
        writer.writerow(i)
    print('SortedList', sortedList)
    print (len(sortedList))
    print('storedEmails', storedEmails)
    print(len(storedEmails))
    print('holderList',holderList)
    print(len(holderList))
    print ('There are', counter, 'new emails being added to the list.')
    print ('All done!')
CSV's will look similar to this.
working.csv:
1,asdf@email.com,'to address',31
2,fsda@email.com,'to address',19
3,zxcv@email.com,'to address',117
4,qwer@gmail.com,'to address',92
5,uiop@fmail.com,'to address',11
archive.csv:
date,01-sep
asdf@email.com,154
fsda@email.com,128
qwer@gmail.com,77
ffff@xmail.com,63
What I want after processing is:
date,01-sep,27-sep
asdf@email.com,154,31
fsda@email.com,128,19
qwer@gmail.com,77,92
ffff@xmail.com,63,0
zxcv@email.com,0,117
I'm not sure where I've gone wrong, but it keeps producing duplicate entries. Some of the functionality is there, but I've been at it for too long and I'm getting tunnel vision trying to figure out what I have done wrong with my loops.
I know my zero-filler section at the end is wrong as well, as it will append onto the end of a newly created record instead of populating zeros up to its first appearance.
I'm sure there are far more efficient ways to do this. I'm new to programming, so it's probably overly complicated and messy. Initially I tried to compare CSV to CSV and realized that wasn't possible, since you can't read and write at the same time, so I switched to using lists, which I also know won't work forever due to memory limitations when the lists get big.
-EDIT-
Using Trenton's pandas solution:
I ran a script on working.csv so it instead produces the following:
asdf@email.com,1000
bsdf@gmail.com,500
xyz@fmail.com,9999
I have modified your solution to reflect this change:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
# filter original list to grab only emails of interest
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'email': 'date'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1]) # I assume usecols isn't necessary anymore, but I'm not sure
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
I apparently still have no idea how this works because I'm now getting :
Traceback (most recent call last):
  File "---/agsdga.py", line 29, in <module>
    working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
  File "---\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 5130, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'email'
Process finished with exit code 1
-UPDATE-
It is working now as:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1])
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
The errors above were caused because I changed
arch.rename(columns={'date': 'email'}, inplace=True)
to
arch.rename(columns={'email': 'date'}, inplace=True)
I ran into further complications because I stripped the header row from the test archive, since I didn't think the header mattered; even with header=None I still got issues. I'm still not clear why the header is so important when we are assigning our own values to the columns for purposes of the dataframe, but it's working now. Thanks for all the help!

I'd load the data with pandas.read_csv,
then .rename some columns.
Renaming the columns in working depends on the column index, since working.csv has no column headers.
When the working dataframe is created, look at the dataframe to verify the correct columns have been loaded, and the correct column index is being used for renaming.
The date column of arch should really be email, because headers identify what's below them, not the other column headers.
Once the column name has been changed in archive.csv, then rename won't be required any longer.
Then pandas.merge on the email column.
Since both dataframes have a column renamed to email, the merged result will only have one email column.
If the merge occurs on two different column names, then the result will have two columns containing email addresses.
pandas: Merge, join, concatenate and compare
As long as the columns in the files are consistent, this should work without modification
import pandas as pd
from datetime import datetime
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('working.csv', header=None, usecols=[1, 3])
# rename columns
working.rename(columns={1: 'email', 3: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
# display(arch_updated)
         email  01-sep  27-Aug
asdf@email.com   154.0    31.0
fsda@email.com   128.0    19.0
qwer@gmail.com    77.0    92.0
ffff@xmail.com    63.0     0.0
zxcv@email.com     0.0   117.0

So, the problem is you have two sets of data. Both have the data stored with a "key" entry (the emails) and additional piece of data that you want condensed down to one storage. Identifying that there is a similar "key" for both of these sets of data simplifies this greatly.
Imagine each key as being the name of a bucket. Each bucket needs two pieces of info, one piece from one csv and the other piece from the other csv.
Now, I must take a small detour to explain a dictionary in python. Here is a definition stolen from here
A dictionary is a collection which is unordered, changeable and indexed.
A collection is a container, like a list, that holds data. Unordered and indexed means that a dictionary is not accessed by position the way a list is; instead it is accessed using keys, which can be things like strings or numbers (technically the key must be hashable, but that's too in-depth here). Note that since Python 3.7 dictionaries do preserve insertion order, so "unordered" is best read as "not indexed by position". And finally, changeable means that the data stored in the dictionary can be modified (once again, oversimplified).
Example:
dictionary = dict()
key = "Something like a string or a number!"
dictionary[key] = "any kind of value can be stored here! Even lists and other dictionaries!"
print(dictionary[key]) # Would print the above string
Here is the structure that I suggest you use instead of most of your lists:
dictionary[email] = [item1, item2]
This way, you can avoid using multiple lists and massively simplify your code. If you are still iffy on the usage of dictionaries, there are a lot of articles and videos on how to use them. Good luck!
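As a rough sketch of the bucket idea applied to the sample rows from the question (an illustration under assumptions, not the only way to do it): each email is a key, and its value is the list of counts per period. The new_period label and the in-memory strings standing in for the two files are hypothetical. The threshold of 30 for brand-new emails is taken from the question's code.

```python
import csv
import io

archive_text = """date,01-sep
asdf@email.com,154
fsda@email.com,128
qwer@gmail.com,77
ffff@xmail.com,63
"""
working_text = """1,asdf@email.com,'to address',31
2,fsda@email.com,'to address',19
3,zxcv@email.com,'to address',117
4,qwer@gmail.com,'to address',92
5,uiop@fmail.com,'to address',11
"""
new_period = "27-sep"  # label for the new column (hypothetical value)

archive_rows = list(csv.reader(io.StringIO(archive_text)))
header = archive_rows[0] + [new_period]
old_width = len(archive_rows[0]) - 1  # number of periods already recorded

# One bucket per email: email -> list of counts, one per period.
buckets = {row[0]: row[1:] for row in archive_rows[1:]}

for _, email, _, count in csv.reader(io.StringIO(working_text)):
    if email in buckets:
        buckets[email].append(count)
    elif int(count) > 30:
        # New email with enough traffic: zero-fill the periods it missed.
        buckets[email] = ["0"] * old_width + [count]

# Emails already on record that had no activity this period get a 0.
for counts in buckets.values():
    if len(counts) < len(header) - 1:
        counts.append("0")

merged = [header] + [[email] + counts for email, counts in buckets.items()]
```

Because each email is looked up directly by key, there is no nested loop over two lists, which is where the duplicate entries in the original attempt came from. The rows in merged can then be written back with csv.writer.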

Can't read .txt file with pandas because it's in a weird shape

I have a data set that contains information from an experiment about particles. You can find it here (I hope links are OK; if not, let me know and I'll remove it immediately):
http://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
I'm trying to read this data set with pandas, and the problem is that pandas reads the .txt file as a data frame with 130.064 lines, which is correct, but only 1 column. If you check the .txt file in the link, you will see that it is in a weird shape, with spaces at the beginning of each line and then 2 spaces between each column.
I tried the command
df = pd.read_csv("path/file.txt", header = None)
and also
df = pd.read_csv("path/file.txt", sep = " ", header = None)
where I set 2 spaces as the separator. Nothing works. The file also, in the 1st line, has 2 numbers that just represent the number of rows, which I deleted. For someone who can't/doesn't want to open the link or the data set, here is a picture of some columns.
This is just a portion of it and not the whole data. In the leftmost side, there are 2 spaces between the edge of the window and the first column, as I said. When reading it using pandas this is what I get
Any advice/help would be appreciated. Thanks
EDIT
I tried doing the following and I think it worked. First I imported the .txt file using NumPy, after deleting the first row from the data frame which contains the two irrelevant numbers.
df1 = np.loadtxt("path/file.txt")
This, for some reason, worked and the resulting array was correct. Then I converted this array to data frame using the command
df = pd.DataFrame(df1)
df.columns = ['X' + str(x) for x in range(50) ]
And yeah, I think it works. Check the following picture.
I think it's correct, but if you find something wrong, let me know.

Edited
columns = ['Obs' + str(i) for i in range(1, 51)]  # 'Obs1' ... 'Obs50'
# read_csv takes the labels via names= (it has no columns= parameter);
# sep=r"\s+" splits on any run of whitespace, so leading and double spaces are handled
df = pd.read_csv("path/file.txt", sep=r"\s+", names=columns, skiprows=1)

You could try creating the dataframe from lists instead of the txt file, something like the following:
import pandas as pd
# We put all the lines in a list
with open("dataset.txt") as fp:
    data = fp.read().splitlines()
df_data = []
for item in data:
    if item.strip():  # skip blank lines
        df_data.append(item.split())  # split() with no argument handles one or more spaces
# df_data should be something like [[row1col1,row1col2,row1col3],[row2col1,row2col2,row2col3]]
# List to dataframe
df = pd.DataFrame(df_data)
Doing this by memory so watch out for syntax, hope this helps!
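A small standard-library sketch of why the choice of separator matters for a file shaped like this one. The sample values are made up for illustration. The layout (two leading spaces, two spaces between columns) follows the description in the question.

```python
# Hypothetical sample laid out as the question describes.
sample = "  2.59  0.89  7.2\n  3.86  1.02  8.1\n"

# Splitting on exactly one space leaves empty strings wherever spaces repeat.
rows_single_space = [line.split(" ") for line in sample.splitlines()]

# split() with no argument splits on any run of whitespace and
# ignores leading and trailing whitespace.
rows_any_whitespace = [line.split() for line in sample.splitlines()]

# The fields are still strings at this point, so convert them to numbers.
values = [[float(x) for x in row] for row in rows_any_whitespace]
```

Here rows_single_space[0] comes out as ['', '', '2.59', '', '0.89', '', '7.2'], while values holds clean rows of floats that can be passed to pd.DataFrame.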

Defining user input values and appending to specific CSV columns in Python 3

I'm working on a program that allows a user to add new books to their collection and then generate a random suggestion. Right now I'm stuck trying to add the new books to the csv file. I've looked at other answers on writing and appending to csv's, but they all seemed to want to just produce output organized by columns. My goal is to let a user input a title and an author, then have those inputs added into a new row within the csv file.
Here's where I am:
import csv
import random
class Book(object):
    def __init__(self):
        self.csvfile = r'/Users/anthonymandelli/Repos/nextbook/Books.csv'
    def add_book(self):
        with open(self.csvfile, 'ab') as library:
            writer = csv.writer(library, delimiter = ',')
            tcolumn = [column for column in writer if column == 'title'.lower()]
            acolumn = [column for column in writer if column == 'author'.lower()]
            writer.writerows(zip(nbt, author)
        return "Added {}".format(newbook)
NewBook = Book ()
if __name__ == '__main__':
    while True:
        print("\nAdd a new book to the library:")
        print()
        nbt = [input("Title: ").title()]
        author = [input("Author: ").title()]
        newbook = '{} by {}'.format(nbt, author)
        print()
        print(NewBook.add_book())
        break
Ideally, a new row will be created with nbt in the title column and author in the author column. This answer seems to be closest to what I want, but I'm missing something and can't connect the dots between those answers and my problem. This is only part of the code, what I assume to be the relevant part, but you can see the whole program here.
Any tips would be greatly appreciated!
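A minimal sketch of appending one row per book, assuming the file has just two columns in the order title, author (the real Books.csv layout is not shown in the post). A csv.writer cannot be iterated to find columns; it only writes, so the row is built in column order and passed to writerow. The add_book signature and the temporary file are hypothetical and used only for the demonstration.

```python
import csv
import os
import tempfile


def add_book(csv_path, title, author):
    """Append one (title, author) row to the CSV at csv_path."""
    # Text mode with newline='' is what the csv module expects in Python 3;
    # the binary mode 'ab' is a Python 2 idiom and fails with csv.writer.
    with open(csv_path, "a", newline="") as library:
        csv.writer(library).writerow([title, author])
    return "Added {} by {}".format(title, author)


# Demonstration against a temporary file with an assumed title,author header.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Books.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(["title", "author"])
    message = add_book(path, "Dune", "Frank Herbert")
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
```

Passing the title and author as plain strings, rather than wrapping each in a one-element list, also keeps the confirmation message free of square brackets.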

Arranging columns python

I'm really stuck on some code. I've been working on it for a while now, so any guidance is appreciated. I'm trying to map related columns from many files into one final column. The logic of my code will start by
1. identifying my desired final column names
2. reading the incoming file
3. identifying the top rows as my column headers/names
4. identifying all underneath rows as data for that column
5. based on the column header, adding data from that column to the most closely related column, and
6. having an exit condition (if no more data, end program).
If anyone could help me with steps 3 and 4, I would greatly appreciate it because that is where I'm currently stuck.
It says I have a KeyError:0 where it says columnHeader=row[i]. Does anyone know how to solve this particular problem?
#!/usr/bin/env python
import sys #used for passing in the argument
import csv, glob
SLSDictionary={}
fieldMap = {'zipcode':['Zip5', 'zip4'],
            'firstname':[],
            'lastname':[],
            'cust_no':[],
            'user_name':[],
            'status':[],
            'cancel_date':[],
            'reject_date':[],
            'streetaddr':['address2', 'servaddr'],
            'city':[],
            'state':[],
            'phone_home':['phone_work'],
            'email':[]
            }
CSVreader = csv.DictReader(open( 'N:/Individual Files/Jerry/2013 customer list qc, cr, db, gb 9-19-2013_JerrysMessingWithVersion.csv', "rb"),dialect='excel', delimiter=',')
i=0
for row in CSVreader:
    if i==0:
        columnHeader = row[i]
    else:
        columnData = row[i]
    i += 1
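The KeyError: 0 comes from the fact that csv.DictReader yields each row as a dictionary keyed by column header, not by integer position, and it consumes the header row itself. A short sketch of steps 3 and 4 under that model follows; the two-column sample is hypothetical and stands in for the real customer file.

```python
import csv
import io

# Hypothetical two-column file standing in for the customer list.
sample = io.StringIO("Zip5,firstname\n90210,Ann\n10001,Bob\n")
reader = csv.DictReader(sample)

# Step 3: the header row is available as reader.fieldnames.
column_headers = reader.fieldnames

# Step 4: every remaining row is data, stored as a dict keyed by header name.
columns = {name: [] for name in column_headers}
for row in reader:
    for name, value in row.items():
        columns[name].append(value)
```

From here, mapping an incoming header onto a final column name is a lookup against fieldMap, and the loop ends on its own when the reader runs out of rows, which covers the exit condition.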

Python CSV - Check if index is equal on different rows

I'm trying to create code that checks if the value in the index column of a CSV is equivalent in different rows, and if so, find the most occurring values in the other columns and use those as the final data. Not a very good explanation, basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn how, if there are values with the same number of occurrences (month and B for customer 1004), I can choose which one is output.
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print df
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC, it seems to be in the wrong variable format. I want each column to be outputted as just the 3 digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.

What about something like this? This is not using Pandas though, I am not a Pandas expert.
from collections import Counter
dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0])>0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month':[entry[1]],
                                      'time':[entry[2]],
                                      'ABC':[''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))
# Now write the output file
with open('out.csv','w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this:
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
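A variant sketch of the same Counter approach that takes the mode of each column separately, as the question asks, and lets csv.DictReader handle the header row so it does not end up in the output as data. The mode helper and the in-memory data string are illustrative assumptions. On ties, Counter.most_common returns the value encountered first (documented behaviour since Python 3.7), which is one simple way to control the tie-break: the order of the rows decides.

```python
import csv
import io
from collections import Counter, defaultdict

data = """customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
"""


def mode(values):
    """Most common value; ties go to the value seen first (Python 3.7+)."""
    return Counter(values).most_common(1)[0][0]


# Group the rows by customer; DictReader consumes the header row itself.
groups = defaultdict(list)
for row in csv.DictReader(io.StringIO(data)):
    groups[row["customer_ID"]].append(row)

answer = [["customer_ID", "month", "ABC"]]
for customer, rows in groups.items():
    month = mode(r["month"] for r in rows)
    # Values stay strings, so a leading 0 in column A is preserved
    # and no ".0" suffix appears.
    abc = "".join(mode(r[col] for r in rows) for col in "ABC")
    answer.append([customer, month, abc])
```

For the sample data this yields the rows 1003,Jan,114 and 1004,Feb,251. To prefer a different value on ties, the rows can be sorted into the desired priority order before grouping.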
