I am trying to write code that will handle my input file of numbers and then perform various operations on them. The first column is a name, the second is an hourly rate, and the third is hours. The file looks like this:
John 15 8
Sam 10 4
Mike 16 10
John 19 15
I want to go through and, if a name is a duplicate (John in the example), average the 2nd number (hourly rate), sum the 3rd number (hours), and delete the duplicate, leaving one John with the average wage and total hours. If it is not a duplicate, just output the original entry.
I cannot figure out how to keep track of the duplicates and then move on to the next line in the file. Is there any way to do this without using line.split()?
This problem is easier if you break it up into parts.
First, you want to read through the file and parse each line into three variables, the name, the hourly rate, and the hours.
Second, you need to handle the matching on the first value (the name). You need some kind of data structure to store values in; a dict is probably the right thing here.
Third, you need to compute the average at the end (you can't compute it along the way, because you need the final count of values).
Putting it together, I would do something like this:
class PersonRecord:
def __init__(self, name):
self.name = name
self.hourly_rates = []
self.total_hours = 0
def add_record(self, hourly_rate, hours):
self.hourly_rates.append(hourly_rate)
self.total_hours += hours
def get_average_hourly_rate(self):
return sum(self.hourly_rates) / len(self.hourly_rates)
def compute_person_records(data_file_path):
person_records = {}
with open(data_file_path, 'r') as data_file:
for line in data_file:
            parts = line.split()  # split on any whitespace; also drops the trailing newline
name = parts[0]
hourly_rate = int(parts[1])
hours = int(parts[2])
person_record = person_records.get(name)
if person_record is None:
person_record = PersonRecord(name)
person_records[name] = person_record
person_record.add_record(hourly_rate, hours)
return person_records
def main():
    person_records = compute_person_records('data.txt')  # hypothetical path; point this at your input file
for person_name, person_record in person_records.items():
print('{name} {average_hourly_rate} {total_hours}'.format(
name=person_name,
average_hourly_rate=person_record.get_average_hourly_rate(),
total_hours=person_record.total_hours))
if __name__ == '__main__':
main()
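Against the sample input above (saved as data.txt, the hypothetical path passed in main), this prints one merged line per person:

John 17.0 23
Sam 10.0 4
Mike 16.0 10

(Dict insertion order is preserved on Python 3.7+, so names appear in first-seen order.)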
Here we go. Just group by the name and aggregate on the rate and hours, taking the mean and the sum, as shown below.
# assume d is the name of your DataFrame
d.groupby(by=['name']).agg({'rate': 'mean', 'hours': 'sum'})
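For completeness, a minimal sketch of how the DataFrame could be built from the file first; the file name and column names here are assumptions, not from the original post:

import pandas as pd

d = pd.read_csv('data.txt', sep=r'\s+', header=None,
                names=['name', 'rate', 'hours'])
print(d.groupby(by=['name']).agg({'rate': 'mean', 'hours': 'sum'}))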
Here's a version that's not particularly efficient. I wouldn't run it on lots of data, but it's easy to read and returns your data to its original form, which is apparently what you want...
from statistics import mean

# named 'text' rather than 'input' to avoid shadowing the built-in input()
text = '''John 15 8
Sam 10 4
Mike 16 10
John 19 15'''

lines = text.splitlines()
data = [line.split(' ') for line in lines]
names = set(item[0] for item in data)
processed = [(name,
              str(mean([int(i[1]) for i in data if i[0] == name])),
              str(sum([int(i[2]) for i in data if i[0] == name])))
             for name in names]
joined = [' '.join(p) for p in processed]
line_joined = '\n'.join(joined)
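The snippet builds line_joined but never shows it; printing it returns the data in the original format:

print(line_joined)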
a=[] # list to store all the input rows
while True: # keep reading lines until end of input
    try:
        l=input().split()
        a.append(l)
    except EOFError:
        break
i=0
while i<len(a):
    m=[a[i]] # temporary list which will contain duplicate values
    j=i+1
    while j<len(a):
        if a[i][0]==a[j][0]:
            m.append(a[j]) # appending duplicates
            a.pop(j) # popping duplicates from the main list
        else:
            j+=1 # only advance when nothing was popped
    if len(m)>1:
        hr=0 # initializing hourly rate and hours with 0
        hrs=0
        for k in m:
            hr+=int(k[1])
            hrs+=int(k[2]) # totalling hourly rate and hours
        a[i][1]=hr/len(m) # average hourly rate
        a[i][2]=hrs # total hours (summed, not averaged)
    i+=1
for row in a:
    print(row[0],row[1],row[2]) # printing the final list
Read the comments in the code for an explanation.
You can do:
from collections import defaultdict
with open('file_name') as fd:
data = fd.read().splitlines()
line_elems = []
for line in data:
line_elems.append(line.split())
a_dict = defaultdict(list)
for e in line_elems:
a_dict[e[0]].append((e[1], e[2]))
final_dict = {}
for key in a_dict:
if len(a_dict[key]) > 1:
hour_rates = [float(x[0]) for x in a_dict[key]]
hours = [float(x[1]) for x in a_dict[key]]
ave_rate = sum(hour_rates) / len(hour_rates)
total_hours = sum(hours)
final_dict[key] = (ave_rate, total_hours)
    else:
        rate, hours = a_dict[key][0]
        final_dict[key] = (float(rate), float(hours))  # same shape as the merged entries
print(final_dict)
# write to file or do whatever
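If you want the merged rows back in the original text format rather than printed as a dict, a minimal sketch (the output file name is an assumption):

with open('merged.txt', 'w') as out:
    for name, (rate, hours) in final_dict.items():
        out.write('%s %g %g\n' % (name, rate, hours))

This relies on final_dict holding a (rate, hours) tuple for every key, which is why the else branch above converts single entries to the same shape.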
I have a file with a list of positions (columns 1 + 2) and values associated with those positions:
File1.txt:
1 20 A G
4 400 T C
1 12 A T
2 500 G C
And another file with some of the same positions. There may be multiple rows with the same positions as in File1.txt
File2.txt
#CHR POS Count_A Count_C Count_G Count_T
1 20 0 18 2 0
4 400 0 0 0 1
1 12 0 7 0 40
4 400 0 1 0 1
5 50 16 0 0 0
2 500 9 0 4 0
I need to output a version of File1.txt excluding any rows that meet both of these conditions:
1: The positions (columns 1+2) match in File1.txt and File2.txt.
2: The count is > 0 in the File2.txt column that matches the letter (A, G, C, T) in column 4 of File1.txt for that position.
So for the example above, the first row of File1.txt would not be output: its matching File2.txt row (based on the first 2 columns: 1 20) has Count_G > 0, and G is the letter in column 4 of File1.txt.
The only line that would be output for this example would be:
2 500 G C
To me the particularly tricky part is that there can be multiple matching rows in File2.txt, and I want to exclude a row in File1.txt if the appropriate column in File2.txt is > 0 in even just one of them. Meaning that in the example above, line 2 of File1.txt would not be included, because Count_C is > 0 the second time that position appears in File2.txt (Count_C = 1).
I am not sure if that kind of filtering is possible in a single step. Would it be easier to first output a list of rows in File1.txt where the count in File2.txt for the letter in the 4th column of File1.txt is > 0, and then compare that list to File1.txt and remove any rows that appear in both?
I've filtered one file based on values in another before, with the code below, but that was for when there was only one column of values to filter on in File2.txt. I am not sure how to do the conditional filtering so that I check the right column based on the letter in column 4 of File1.txt.
My current code is in Python, but any solution is welcome:
f2 = open('file2.txt', 'r')
d2 = {}
for line in f2:
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    d2[key] = int(fields[2])
f1 = open('file1.txt', 'r')
for line in f1:
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    if d2[key] > 1000:
        print(line)
I think my previous solution is already very verbose and feel there might be a simple tool for this kind of problem of which I am not aware.
I used Perl to solve the problem. First, it loads File2 into a hash table keyed by the chr, pos, and nucleotide; the value is the count for that nucleotide, accumulated across repeated positions. Then File1 is processed: if there's a non-zero value in the hash table for a line's chr, pos, and column-4 nucleotide, the line is not printed.
#!/usr/bin/perl
use warnings;
use strict;
my %gt0;
open my $in2, '<', 'File2.txt' or die $!;
<$in2>; # Skip the header.
while (<$in2>) {
my %count;
    (my ($chr, $pos), @count{qw{ A C G T }}) = split;
    $gt0{$chr}{$pos}{$_} += $count{$_} for qw( A C G T );  # += accumulates repeated positions
}
open my $in1, '<', 'File1.txt' or die $!;
while (<$in1>) {
my ($chr, $pos, undef, $c4) = split;
print unless $gt0{$chr}{$pos}{$c4};
}
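Saved as, say, filter.pl (the name is arbitrary) and run in the directory containing both files, this prints the single surviving row from the example:

2 500 G C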
Your code seems pretty good to me. You can perhaps edit
d2[key] = int(fields[2])
and
if d2[key] > 1000:
    print(line)
as they puzzle me a little bit.
I would do it like this:

d2 = {}
with open('file2.txt', 'r') as f2:
    next(f2)  # skip the header line
    for line in f2:
        fields = line.rstrip().split("\t")
        key = (fields[0], fields[1])
        counts = {'A': int(fields[2]), 'C': int(fields[3]),
                  'G': int(fields[4]), 'T': int(fields[5])}
        if key in d2:
            # the same position can appear on several rows; accumulate
            for letter in counts:
                d2[key][letter] += counts[letter]
        else:
            d2[key] = counts

with open('file1.txt', 'r') as f1:
    for line in f1:
        line = line.rstrip()
        fields = line.split("\t")
        key = (fields[0], fields[1])
        # keep rows with no match in File2, or a zero count for the
        # letter in column 4
        if key not in d2 or d2[key][fields[3]] == 0:
            print(line)
Edit:
If you have an arbitrary number of letters (and columns in File2), just generalize the dictionary inside d2, which I hard-coded above. Easy. Let's add 2 letters:

col_names = ['A', 'C', 'G', 'T', 'K', 'L']
counts = {}
for i, count in enumerate(fields[2:]):
    counts[col_names[i]] = int(count)
I have a data file with a special structure similar to the one below:
#F A 1 1 1 3 3 2
2 1 0.002796 0.000005 0.000008 -4.938531 1.039083
3 1 0.002796 0.000005 0.000007 -4.938531 1.039083
4 0 0.004961 -0.000008 -0.000002 -4.088534 0.961486
5 0 0.004961 0.000006 -0.000002 -4.079798 0.975763
The first column is only a description (it need not be considered). I want to (1) separate all data lines whose second column is 1 from the ones whose second column is 0, and then (2) extract the data lines whose 5th number (in the first data line, 0.000008) is in a specific range, take the 6th number of each such line (in our example, -4.938531), average all of the captured 6th values, and finally write the result to a new file. The code I wrote for this (which does not yet include the first task) is not working. Could anyone please help me with debugging, or suggest another method?
A=0.0 #to be used for separating data as mentioned in the first task
B=0.0 #to be used for separating data as mentioned in the first task
with open('inputdatafile') as fin, open('outputfile','w') as fout:
for line in fin:
if line.startswith("#"):
continue
else:
col = line.split()
6th_val=float(col[-2])
2nd_val=int(col[1])
if (str(float(col[6])) > 0.000006 and str(float(col[6])) < 0.000009):
fout.write(" ".join(col) + "\n")
else:
del line
Variable names in Python can't start with a number, so change 6th_val to val_6 and 2nd_val to val_2.
str(float(col[6])) produces a string, which can't be compared with the float 0.000006, so change every str(float(...)) > xxx to float(...) > xxx.
You don't have to delete line; the garbage collector does it for you, so remove 'del line'.
A=0.000006
B=0.000009
S=0.0
C=0
with open('inputdatafile') as fin, open('outputfile','w') as fout:
    for line in fin:
        if line.startswith("#"):
            continue
        col = line.split()
        if col[1] == '1':  # task (1): keep only lines whose 2nd column is 1
            val_6 = float(col[-2])
            val_5 = float(col[-3])  # the 5th value is a float, so int() would fail
            if A < val_5 < B:
                fout.write(" ".join(col) + "\n")
                S += val_6
                C += 1
    if C > 0:  # guard against an empty selection
        fout.write("Average 6th: %f\n" % (S/C))
I am reading a file which is in the format below:
0.012281001 00:1c:c4:c2:1f:fe 1 30
0.012285001 00:1c:c4:c2:1f:fe 3 40
0.012288001 00:1c:c4:c2:1f:fe 2 50
0.012292001 00:1c:c4:c2:1f:fe 4 60
0.012295001 24:1c:c4:c2:2f:ce 5 70
I intend to make the column 2 entries keys, and columns 3 and 4 two separate values. For each line I encounter, the values for that key must add up (value 1 and value 2 aggregated separately). For the example above, I need output like this:
'00:1c:c4:c2:1f:fe': 10 : 180, '24:1c:c4:c2:2f:ce': 5 : 70
The program I have written for the simple 1-key, 1-value case is below:
#!/usr/bin/python
import collections
result = collections.defaultdict(int)
clienthash = dict()
with open("luawrite", "r") as f:
for line in f:
hashes = line.split()
ckey = hashes[1]
val1 = float(hashes[2])
result[ckey] += val1
print result
How can I extend this to 2 values, and how can I print them as the output mentioned above? I am not getting any ideas. Please help! BTW, I am using Python 2.6.
You can store all of the values in a single dictionary, using a tuple as the stored value:
with open("luawrite", "r") as f:
for line in f:
hashes = line.split()
ckey = hashes[1]
val1 = int(hashes[2])
val2 = int(hashes[3])
a,b = result[ckey]
result[ckey] = (a+val1, b+val2)
print result
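To get the exact output format you mentioned, you could then loop over the dictionary (a sketch that also works on Python 2.6):

for ckey, (val1, val2) in result.items():
    print("'%s': %d : %d" % (ckey, val1, val2))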
My data set is a list of people either working together or alone.
I have a row for each project and columns with the names of all the people who worked on that project. If column 2 is the first empty column in a given row, it was a solo job; if column 4 is the first empty column, then three people worked together.
My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.
From this, a square N x N matrix would be created, with every actor labeling a column and a row, and cells (A,B) and (B,A) would hold how many times that pair worked together, for every pair.
I know a pretty quick way to do it in Excel, but I want it automated, hopefully in Stata or Python, so that if projects are added or removed I can just re-run it with one click instead of redoing it every time.
An example of the data, in comma-delimited form:
A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A
F,D
B
F
F,X,C
C,F,D
Hope that helps!
Brice.
Maybe something like this would get you started?
import csv
import collections
import itertools
grid = collections.Counter()
with open("connect.csv", "r", newline="") as fp:
reader = csv.reader(fp)
for line in reader:
# clean empty names
line = [name.strip() for name in line if name.strip()]
# count single works
if len(line) == 1:
grid[line[0], line[0]] += 1
# do pairwise counts
for pair in itertools.combinations(line, 2):
grid[pair] += 1
grid[pair[::-1]] += 1
actors = sorted(set(pair[0] for pair in grid))
with open("connection_grid.csv", "w", newline="") as fp:
writer = csv.writer(fp)
writer.writerow([''] + actors)
for actor in actors:
line = [actor,] + [grid[actor, other] for other in actors]
writer.writerow(line)
[edit: modified to work under Python 3.2]
The key modules are: (1) csv, which makes reading and writing CSV files much simpler; (2) collections, which provides an object called a Counter, a dictionary that automatically generates default values so you don't have to (here the default count is 0); if your Python doesn't have Counter, a defaultdict(int) works the same way; and (3) itertools, whose combinations function yields all the pairs.
which produces
,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0
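For instance, combinations yields each unordered pair exactly once, which is why the loop above also increments the reversed pair:

import itertools
print(list(itertools.combinations(['A', 'B', 'C'], 2)))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]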
You could use itertools.product to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.
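For what it's worth, a sketch of that variant, reusing grid and actors from above:

import itertools
# every (row, column) cell in one pass; a Counter returns 0 for pairs
# that never worked together, so no special-casing is needed
cells = {(a, b): grid[a, b] for a, b in itertools.product(actors, repeat=2)}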
If I were to keep this project around for a while, I'd implement a database and then create the matrix you're talking about from a query against that database.
You have a Project table (let's say) with one record per project, an Actor table with one row per person, and a Participant table with a record per project for each actor that was in that project. (Each record would have an ID, a ProjectID, and an ActorID.)
From your example, you'd have 14 Project records, 7 Actor records (A through F, and X), and 31 Participant records.
Now, with this set up, each cell is a query against this database.
To reconstruct the matrix, first you'd add/update/remove the appropriate records in your database, and then rerun the query.
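A minimal sqlite3 sketch of that design (all table and column names here are illustrative, not from the original post):

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript('''
    CREATE TABLE Project (ID INTEGER PRIMARY KEY);
    CREATE TABLE Actor (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Participant (
        ID INTEGER PRIMARY KEY,
        ProjectID INTEGER REFERENCES Project(ID),
        ActorID INTEGER REFERENCES Actor(ID)
    );
''')

# Cell (A, B) of the matrix: count the projects the two actors share.
PAIR_COUNT = '''
    SELECT COUNT(DISTINCT p1.ProjectID)
    FROM Participant p1
    JOIN Participant p2 ON p1.ProjectID = p2.ProjectID
    WHERE p1.ActorID = ? AND p2.ActorID = ?
'''
# usage: cur.execute(PAIR_COUNT, (actor_a_id, actor_b_id)).fetchone()[0]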
I guess that you don't have thousands of people working together in these projects. This implementation is pretty simple.
# counts how many times each pair worked together
pairs = {}
with open('projects.csv') as fp:
    # each element of `project` is a person
    for project in (p.strip().split(',') for p in fp):
        project.sort()
        # someone is alone here
        if len(project) == 1:
            continue
        # iterate over each pair
        for i in range(len(project)):
            for j in range(i+1, len(project)):
                pair = (project[i], project[j])
                # increase `pairs` counter
                pairs[pair] = pairs.get(pair, 0) + 1

from pprint import pprint
pprint(pairs)
It outputs:
{('A', 'B'): 2,
 ('A', 'C'): 1,
 ('A', 'D'): 1,
 ('A', 'X'): 1,
 ('B', 'C'): 3,
 ('B', 'D'): 1,
 ('B', 'E'): 2,
 ('B', 'F'): 1,
 ('C', 'D'): 1,
 ('C', 'E'): 2,
 ('C', 'F'): 2,
 ('C', 'X'): 1,
 ('D', 'F'): 3,
 ('D', 'X'): 1,
 ('F', 'X'): 1}
I suggest using Python Pandas for this. It enables a slick solution for formatting your adjacency matrix, and it will make any statistical calculations much easier too. You can also directly extract the matrix of values into a NumPy array, for doing eigenvalue decompositions or other graph-theoretical procedures on the group clusters if needed later.
I assume the example data you listed is saved into a file called projects_data.csv (it doesn't actually need to be a .csv file, though). I also assume no blank lines between observations, but this is all just file-organization detail.
Here's my code for this:
# File I/O part
import itertools
import numpy as np
import pandas

with open("projects_data.csv") as tmp:
    lines = tmp.readlines()
lines = [line.strip().split(',') for line in lines]

# Unique letters
s = set(itertools.chain(*lines))

# Actual work.
df = pandas.DataFrame(
    np.zeros((len(s), len(s)), dtype=int),  # int so the table prints cleanly
    columns=sorted(s),
    index=sorted(s)
)
for line in lines:
    if len(line) == 1:
        df.loc[line[0], line[0]] += 1  # Single-person projects
    elif len(line) > 1:
        # Get all pairs in a multi-person project.
        tmp_pairs = list(itertools.combinations(line, 2))
        # Append pair reversals to update (i,j) and (j,i) for each pair.
        tmp_pairs = tmp_pairs + [pair[::-1] for pair in tmp_pairs]
        for pair in tmp_pairs:
            df.loc[pair[0], pair[1]] += 1
            # Uncomment below if you don't want the list
            # comprehension method for getting the reversals.
            #df.loc[pair[1], pair[0]] += 1

# Final product
print(df.to_string())
A B C D E F X
A 1 2 1 1 0 0 1
B 2 1 3 1 2 1 0
C 1 3 0 1 2 2 1
D 1 1 1 0 0 3 1
E 0 2 2 0 0 0 0
F 0 1 2 3 0 1 1
X 1 0 1 1 0 1 0
Now you can do a lot of stuff for free, like see the total number of project partners (repeats included) for each participant:
>>> df.sum()
A 6
B 10
C 10
D 7
E 4
F 8
X 4
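And, as mentioned at the top, you can pull the raw matrix out as a NumPy array whenever you need it for eigenvalue decompositions or other graph work:

>>> mat = df.to_numpy()  # df.values on older pandas versions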