Write last three entries per name in a file - python

I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there were three later rows. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing - thank you so much. The final code, which seems to have been deleted from this post, is:
import collections

with open("Class2.txt", mode="r", encoding="utf-8") as fp:
    count = collections.defaultdict(int)
    rev = reversed(fp.readlines())
    rev_out = []
    for line in rev:
        # strip the trailing newline so the value field is clean
        name, value = line.strip().split(',')
        if count[name] >= 3:
            continue
        count[name] += 1
        rev_out.append((name, value))
out = list(reversed(rev_out))
print(out)
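This prints a list of (name, value) tuples. To write the result back out in the original one-name-per-line form instead, a minimal sketch (the output filename here is just an example):
with open("Class2_out.txt", "w", encoding="utf-8") as fp:  # hypothetical output filename
    for name, value in out:
        fp.write("{},{}\n".format(name, value))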

Since this looks like csv data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with each row so that the output can preserve the input order. Use a bounded deque (maxlen=3) to keep only the last three rows for each name. Finally, sort the rows and write them out.
import csv
from collections import defaultdict, deque

# keep only the last three (line_number, row) pairs seen per name
by_name = defaultdict(lambda: deque(maxlen=3))

with open('my_data.csv', newline='') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# sort the surviving rows by line number, then discard the number
rows = [row for _, row in sorted(pair for pairs in by_name.values() for pair in pairs)]

with open('out_data.csv', 'w', newline='') as f_out:
    csv.writer(f_out).writerows(rows)
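Opening both files with newline='' is the csv module's documented way of avoiding spurious blank lines on Windows. Run on the sample data from the question, out_data.csv ends up with exactly the five rows shown in the expected output.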

Related

How to print csv rows in ascending order Python

I am trying to read a csv file, parse the data, and return a row (start_date) only if the date is before September 6, 2010, then print the corresponding values from the words column in ascending order. I can accomplish the first half using the following:
import csv
with open('sample_data.csv', 'rb') as f:
    read = csv.reader(f, delimiter=',')
    for row in read:
        if row[13] <= '1283774400':
            print(row[13] + "\t \t" + row[16])
It returns the correct start_date range and the corresponding word column values, but they are not returned in ascending order (which would display a message if done correctly).
I have tried to use the sort() and sorted() functions, after creating an empty list to populate and appending the rows to it, but I am just not sure where or how to incorporate that into the existing code, and have been terribly unsuccessful. Any help would be greatly appreciated.
Just read the list, filter it according to the < date criterion, and sort it according to the field at index 13 as an integer.
Note that the common mistake would be to compare the values as strings (which may appear to work), but integer conversion is really required to avoid sort problems.
import csv

with open('sample_data.csv', 'r') as f:
    read = csv.reader(f, delimiter=',')
    # csv has a title, we have to skip it (comment out if no title)
    title_row = next(read)
    # read csv and filter to keep only the earlier rows
    lines = filter(lambda row: int(row[13]) < 1283774400, read)
    # sort the filtered list according to the field at index 13, numerically
    slist = sorted(lines, key=lambda row: int(row[13]))

# print the result, including the title line
for row in [title_row] + slist:
    #print(row[13] + "\t \t" + row[16])
    print(row)
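Note that title_row is itself a single row (a list of field strings), so it must be wrapped in a list before being concatenated with slist; adding it directly would splice its individual fields in as if each were a row.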

Remove duplicate rows in CSV comparing data in only two columns with Python

There are likely many ways to go about this, but here's the gist when it comes down to it:
I have two databases full of people, both exported into csv files. One of the databases is being decommissioned. I need to compare each csv file (or a combined version of the two) and filter out all non-unique people in the soon-to-be decommissioned server. This way I can import only unique people from the decommissioned database into the current database.
I only need to compare FirstName and LastName (which are two separate columns). Part of the problem is that they are not precise duplicates: the names are all capitalized in one database and vary in the other.
Here is an example of the data when I combine the two csv files into one. The all CAPS names are from the current database (which is how the csv is currently formatted):
FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor ,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023
Would be parsed into:
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
Parsing the other columns is irrelevant, but obviously the data is very relevant, so it must remain untouched. (There are actually dozens of other columns, not just three).
To get an idea of how many duplicates I actually had, I ran this script (taken from a previous post):
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        seen.add(line)
        out_file.write(line)
Too simple for my needs though.
Using a set is no good here: a set keeps one line per recurring value, but you want to keep only the lines that are unique across the whole file. That requires looking through the entire file first to find which name pairs occur exactly once, which a Counter dict will do:
with open("test.csv", encoding="utf-8") as f, open("file_out.csv", "w") as out:
from collections import Counter
from csv import reader, writer
wr = writer(out)
header = next(f) # get header
# get count of each first/last name pair lowering each string
counts = Counter((a.lower(), b.lower()) for a, b, *_ in reader(f))
f.seek(0) # reset counter
out.write(next(f)) # write header ?
# iterate over the file again, only keeping rows which have
# unique first and second names
wr.writerows(row for row in reader(f)
if counts[row[0].lower(),row[1].lower()] == 1)
Input:
FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023
file_out:
FirstName,LastName,id,id2,id3
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
counts records how many times each name pair appears after being lowercased. We then reset the file pointer and write only the lines whose first two column values are seen exactly once in the whole file.
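As a small illustration of the counting step (hypothetical name pairs, not taken from the files above):
from collections import Counter
counts = Counter([('john', 'doe'), ('jacob', 'smith'), ('john', 'doe')])
print(counts[('john', 'doe')])     # 2 -> the pair recurs, so its rows are dropped
print(counts[('jacob', 'smith')])  # 1 -> unique, so its row is kept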
Or without the csv module, which may be faster if you have many columns:
with open("test.csv") as f, open("file_out.csv","w") as out:
from collections import Counter
header = next(f) # get header
next(f) # skip blank line
counts = Counter(tuple(map(str.lower,line.split(",", 2)[:2])) for line in f)
f.seek(0) # back to start of file
next(f), next(f) # skip again
out.write(header) # write original header ?
out.writelines(line for line in f
if counts[map(str.lower,line.split(",", 2)[:2])] == 1)
You could use the pandas package for this:
import pandas as pd
from io import StringIO
Replace the StringIO objects below with the paths to your csv files:
df1 = pd.read_table(StringIO('''FirstName LastName id id2 id3
John Doe 123 432 645
Jacob Smith 456 372 383
Susy Saucy 9999 12 8r83
Contractor #1 8dh 28j 153s
Testing2 Contrator 7463 99999 0283'''), delim_whitespace=True)

df2 = pd.read_table(StringIO('''FirstName LastName id id2 id3
JOHN DOE 999 888 999
SUSY SAUCY 8373 08j 9023'''), delim_whitespace=True)
Concatenate and uppercase the names
df1['name'] = (df1.FirstName + df1.LastName).str.upper()
df2['name'] = (df2.FirstName + df2.LastName).str.upper()
Select rows from df1 that do not match names from df2
df1[~df1.name.isin(df2.name)]
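To write the surviving rows back out, a sketch (dropping the helper column and the output filename are illustrative choices, not from the original answer):
unique = df1[~df1.name.isin(df2.name)].drop(columns=['name'])  # drop the helper column
unique.to_csv('unique.csv', index=False)  # hypothetical output filename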
You can keep the idea of using a set. Just define a function that will return what you are interested in:
def name(line):
    line = line.split(',')
    n = ' '.join(line[:2])
    return n.lower()
Without concatenating the two databases, read the names in the current database into a set.
with open('current.csv') as f:
    next(f)  # skip the header
    current_db = {name(line) for line in f}
Check the names in the decommissioned db and write them if not seen.
with open('decommissioned.csv') as old, open('unique.csv', 'w') as out:
    next(old)  # skip the header
    for line in old:
        if name(line) not in current_db:
            out.write(line)
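This streams the decommissioned file one line at a time, so only the set of names from the current database has to fit in memory.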
You need to operate on a case-insensitive concatenation of the names. For instance:
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        field_list = line.split(',')  # the data is comma-separated
        key_name = (field_list[0] + "_" + field_list[1]).lower()
        if key_name in seen: continue # skip duplicate
        seen.add(key_name)
        out_file.write(line)
Changed since the data is in csv format:
from collections import defaultdict
import re

dd = defaultdict(list)
d = {}
with open("data") as f:
    for line in f:
        line = line.strip().lower()
        mobj = re.match(r'(\w+),(\w+|#\d),(.*)', line)
        firstf, secondf, rest = mobj.groups()
        key = firstf + "_" + secondf
        d[key] = rest         # keeps only the last row seen for each name
        dd[key].append(rest)  # keeps every row seen for each name

for k, v in d.items():
    print(k, v)
output
jacob_smith 456,372,383
testing2_contrator 7463,99999,0283
john_doe 999,888,999
susy_saucy 8373,08j,9023
contractor_#1 8dh,28j,153s
for k, v in dd.items():
    print(k, v)
output
jacob_smith ['456,372,383']
testing2_contrator ['7463,99999,0283']
john_doe ['123,432,645', '999,888,999']
susy_saucy ['9999,12,8r83', '8373,08j,9023']
contractor_#1 ['8dh,28j,153s']
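To go one step further and keep only the unique people, one could filter on the lengths of the dd lists (a sketch building on the dicts above):
# names that appeared exactly once across both databases are unique
unique = {k: v[0] for k, v in dd.items() if len(v) == 1}
for k, v in unique.items():
    print(k, v)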

Add to Values in An Array in a CSV File

I imported my CSV file and made the data into an array. Now I was wondering: what can I do to print a specific value in the array, for instance the value in the 2nd row, 2nd column?
Also how would I go about adding the two values together? Thanks.
import csv
import numpy as np

f = open("Test.csv")
csv_f = csv.reader(f)
for row in csv_f:
    print(np.array(row))
f.close()
There is no need to use the csv module.
This code reads the csv file and prints the value of the cell in the second row and second column. I am assuming that fields are separated by commas.
with open("Test.csv") as fo:
table = [row.split(",") for row in fo.read().replace("\r", "").split("\n")]
print table[1][1]
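Note that if the file ends with a trailing newline, the last element of table will be a list containing a single empty string, so guard your indexing accordingly.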
So, I grabbed a dataset ("Company Funding Records") from here. Then, I just rewrote a little...
#!/usr/bin/python
import csv
#import numpy as np

csvaslist = []
f = open("TechCrunchcontinentalUSA.csv")
csv_f = csv.reader(f)
for row in csv_f:
    # print(np.array(row))
    csvaslist.append(row)
f.close()

# Now your data is in a list of lists. Everything past this point is just playing
# Add together a couple of arbitrary values...
print int(csvaslist[2][7]) + int(csvaslist[11][7])

# Add using a conditional...
print "\nNow let's see what Facebook has received..."
fbsum = 0
for sublist in csvaslist:
    if sublist[0] == "facebook":
        print sublist
        fbsum += int(sublist[7])
print "Facebook has received", fbsum
I've commented out lines at a couple of points to show what's being used and what was unneeded. Notice at the end that referring to a particular datapoint is simply a matter of referencing what is, effectively, original_csv_file[line_number][field_on_that_line], and then recasting it as int, float, or whatever you need. This is because the csv file has been turned into a list of lists.
To get specific values within your array/file, and add together:
import csv
f = open("Test.csv")
csv_f = list(csv.reader(f))
#returns the value in the second row, second column of your file
print csv_f[1][1]
#returns sum of two specific values (in this example, value of second row, second column and value of first row, first column
sum = int(csv_f[1][1]) + int(csv_f[0][0])
print sum

Calculation then insert results into a csv in python

this is my first post, but I am hoping you can tell me how to perform a calculation and insert the resulting value into a csv data file.
For each row I want to take each 'uniqueclass' and sum the scores in column 12. See the example data below:
text1,Data,Class,Uniqueclass1,data1,data,2,data2,data3,data4,data5,175,12,data6,data7
text1,Data,Class,Uniqueclass1,data1,data,2,data2,data3,data4,data5,171,18,data6,data7
text1,Data,Class,Uniqueclass2,data1,data,4,data2,data3,data4,data5,164,5,data6,data7
text1,Data,Class,Uniqueclass2,data1,data,4,data2,data3,data4,data5,121,21.5,data6,data7
text2,Data,Class,Uniqueclass2,data1,data,4,data2,data3,data4,data5,100,29,data6,data7
text2,Data,Class,Uniqueclass2,data1,data,4,data2,data3,data4,data5,85,21.5,data6,data7
text3,Data,Class,Uniqueclass3,data1,data,3,data2,data3,data4,data5,987,35,data6,data7
text3,Data,Class,Uniqueclass3,data1,data,3,data2,data3,data4,data5,286,18,data6,data7
text3,Data,Class,Uniqueclass3,data1,data,3,data2,data3,data4,data5,003,5,data6,data7
So for instance the first uniqueclass spans the first two rows. I would therefore like to insert a subsequent value on those rows: '346' (the sum of both 175 and 171). The result would look like this:
text1,Data,Class,Uniqueclass1,data1,data,2,data2,data3,data4,data5,175,12,data6,data7,346
text1,Data,Class,Uniqueclass1,data1,data,2,data2,data3,data4,data5,171,18,data6,data7,346
I would like to be able to do this for each of the uniqueclasses.
Thanks SMNALLY
I always like the defaultdict class for this type of thing.
Here would be my attempt:
from collections import defaultdict

class_col = 3
data_col = 11

# Read in the data
with open('path/to/your/file.csv', 'r') as f:
    # if you have a header on the file:
    # header = f.readline().strip().split(',')
    data = [line.strip().split(',') for line in f]

# Sum the data for each unique class.
# assuming integers; replace int with float if needed
count = defaultdict(int)
for row in data:
    count[row[class_col]] += int(row[data_col])

# Append the relevant sum to the end of each row
for row in data:
    row.append(str(count[row[class_col]]))

# Write the results to a new csv file
with open('path/to/your/new_file.csv', 'w') as nf:
    nf.write('\n'.join(','.join(row) for row in data))
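Run against the sample data in the question, this appends 346 (175 + 171) to both Uniqueclass1 rows, matching the desired output.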

How to find min/max values from rows and columns in Python?

I was wondering how I can find the minimum and maximum values in a dataset, which is basically a text file with 50 rows and 50 columns.
I know I can set up a control loop (a for loop, to be specific) to read each row and column and determine the min/max values, but I'm not sure how to do that.
I think the rows and columns need to be converted to a list first, and then I need to use the split() function. I tried setting something up as follows, but it doesn't seem to work:
for x in range(4,50): # using that range as an example
    x.split()
    max(4,50)
    print x
New to Python. Please excuse my mistakes.
Try something like this:
data = []
with open('data.txt') as f:
    for line in f:                    # loop over the rows
        fields = line.split()         # parse the columns
        rowdata = map(float, fields)  # convert text to numbers
        data.extend(rowdata)          # accumulate the results
print 'Minimum:', min(data)
print 'Maximum:', max(data)
Note that split() takes an optional argument if you want to split on something other than whitespace (commas for example).
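For example, to parse comma-separated rows instead, the split call above would become:
fields = line.split(',')  # split on commas instead of whitespace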
If the file contains a regular (rectangular) matrix, and you know how many lines of header info it contains, then you can skip over the header info and use NumPy to do this particularly easily:
import numpy as np
f = open("file.txt")
# skip over header info
X = np.loadtxt(f)
max_per_col = X.max(axis=0)
max_per_row = X.max(axis=1)
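np.loadtxt can also skip the header itself via its skiprows argument (e.g. skiprows=2 for two header lines), instead of reading the header lines off the file handle manually.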
Hmmm...are you sure that homework doesn't apply here? ;) Regardless:
You need to not only split the input lines but also convert the text values into numbers.
So assuming you've read the input line into in_line, you'd do something like this:
...
row = [float(each) for each in in_line.split()]
rows.append(row) # assuming you have a list called rows
...
Once you have a list of rows, you need to get columns:
...
columns = zip(*rows)
Then you can just iterate through each row and each column calling max():
...
for each in rows:
    print max(each)
for each in columns:
    print max(each)
Edit: Here's more complete code showing how to open a file, iterate through the lines of the file, close the file, and use the above hints:
in_file = open('thefile.txt', 'r')
rows = []
for in_line in in_file:
    row = [float(each) for each in in_line.split()]
    rows.append(row)
in_file.close() # this'll happen at the end of the script / function / method anyhow
columns = zip(*rows)
for index, row in enumerate(rows):
    print "In row %s, Max = %s, Min = %s" % (index, max(row), min(row))
for index, column in enumerate(columns):
    print "In column %s, Max = %s, Min = %s" % (index, max(column), min(column))
Edit: For new-school goodness, don't use my old, risky file handling. Use the new, safe version:
rows = []
with open('thefile.txt', 'r') as in_file:
    for in_line in in_file:
        row = ....
Now you've got a lot of assurances that you don't accidentally do something bad like leave that file open, even if you throw an exception while reading it. Plus, you can entirely skip in_file.close() without feeling even a little guilty.
Will this work for you?
infile = open('my_file.txt', 'r')
file_lines = file.readlines(infile)
for line in file_lines[6:]:
items = [int(x) for x in line.split()]
max_item = max(items)
min_item = min(items)
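Note that max_item and min_item are overwritten on every pass through the loop; collect them in lists, or track running extremes, if you want overall values across all rows.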
