I have this data in a CSV:
first, middle, last, id, fte
Alexander,Frank,Johnson,460700,1
Ashley,Jane,Smith,470000,.5
Ashley,Jane,Smith,470000,.25
Ashley,Jane,Smith,470000,.25
Steve,Robert,Brown,460001,1
I need to find the rows of people with the same ID numbers and then combine the FTEs of those rows onto the same line. I'll also need to pad with 0s for people who have fewer than four FTEs. For example (using the data above):
first, middle, last, id, fte1, fte2, fte3, fte4
Alexander,Frank,Johnson,460700,1,0,0,0
Ashley,Jane,Smith,470000,.5,.25,.25,0
Steve,Robert,Brown,460001,1,0,0,0
Basically, we're looking at the jobs people hold. Some people work one 40-hour per week job (1.0 FTE), some work two 20-hour per week jobs (.5 and .5 FTEs), some might work 4 10-hour per week jobs (.25, .25, .25, and .25 FTEs), and some may have other combinations. We only get one row of data per employee, so we need the FTEs on the same row.
This is what we have so far. Right now, our current code only works if they have two FTEs. If they have three or four, it just overwrites them with the last two (so if they have 3, it gives us 2 and 3. If they have 4, it gives us 3 and 4).
import csv

f = open('data.csv')
csv_f = csv.reader(f)

dataset = []
for row in csv_f:
    dictionary = {}
    dictionary["first"] = row[2]
    dictionary["middle"] = row[3]
    dictionary["last"] = row[4]
    dictionary["id"] = row[10]
    dictionary["fte"] = row[12]
    dataset.append(dictionary)
def is_match(dict1, dict2):
    return (dict1["id"] == dict2["id"])

def find_match(dictionary, dict_list):
    for index in range(0, len(dict_list)):
        if is_match(dictionary, dict_list[index]):
            return index
    return -1

def process_data(dataset):
    result = []
    for index in range(1, len(dataset)):
        data_dict = dataset[index]
        match_index = find_match(data_dict, result)
        id = str(data_dict["id"])
        if match_index == -1:
            result.append(data_dict)
        else:
            (result[match_index])["fte2"] = data_dict["fte"]
    return result

f.close()

for row in process_data(dataset):
    print(row)
Any help would be extremely appreciated! Thanks!
I would suggest using the pandas library to make this simple. You can use groupby along with aggregation. Below is an example following the aggregation example provided here: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm.
import pandas as pd
import numpy as np
df = pd.read_csv('filename.csv')
grouped = df.groupby('id')
print(grouped['fte'].agg(np.sum))
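If you also need the FTEs spread into separate fte1–fte4 columns (as in your desired output), a rough sketch with groupby plus pivot_table might look like the following. This assumes the file has just the five columns shown in the sample; your real export clearly has more, so adjust the names accordingly:

import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)

# Number each FTE within an id: 1 for the first row, 2 for the second, ...
df['n'] = df.groupby('id').cumcount() + 1

wide = (df.pivot_table(index=['first', 'middle', 'last', 'id'],
                       columns='n', values='fte', fill_value=0)
          .reindex(columns=range(1, 5), fill_value=0)  # pad out to four FTE slots with 0s
          .add_prefix('fte')
          .reset_index())
print(wide)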
I have two dataframes: one, allprice_df, is a large dataset containing price time series for all stocks; the other, init_df, contains selected stocks and their trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
import sys
import datetime
import itertools
import pandas as pd

# Initial call
init_df = init_df.assign(HighestHigh = lambda x: highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))

# HighestHigh function in lambda call
def highestHigh(date1, ticker, allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns = ['DateIdentified', 'Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x, y, allp_df, next(counter)) for x, y in zip(temp_df['DateIdentified'], temp_df['Ticker'])]
    return high_list

# Getting high for each ticker
def getHigh(dateidentified, ticker, allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
    hh = allpm_df.iloc[:,0].max()
    hd = allpm_df.loc[(allpm_df['high']==hh),'date']
    hh = round(hh, 2)
    h_list = [hh, hd]
    return h_list

# Split the list into 2 columns, one with the price and the other with the corresponding date
init_df = split_columns(init_df, "HighestHigh")

# The function to split the list elements into different columns
def split_columns(orig_df, col):
    split_df = pd.DataFrame(orig_df[col].tolist(), columns=[col+"Mod", col+"Date"])
    split_df[col+"Date"] = split_df[col+"Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df, split_df], axis=1)
    orig_df = orig_df.drop(col, axis=1)
    orig_df = orig_df.rename(columns={col+"Mod": col})
    return orig_df
There are a couple of obvious solutions that would help reduce your runtime.
First, in your getHigh function, instead of using loc to get the date associated with the maximum value for high, use idxmax to get the index of the row associated with the high and then access that row:
hh, hd = allpm_df.loc[allpm_df['high'].idxmax()]
This will replace two O(N) operations (finding the maximum in a list, and doing a list lookup using a comparison) with one O(N) operation and one O(1) operation.
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
In order to make this faster, I would setup your data frame to include a multi-index when you first create the data frame:
index = pd.MultiIndex.from_arrays(arrays = [ticker_symbols, dates], names = ['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index = index)
allp_df = allp_df.sort_index(level = 0, sort_remaining = True)
This should create a dataframe with a sorted, multi-level index associated with your ticker symbol and date. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date-range by doing this:
allp_df.loc[(ticker, dateidentified):(ticker, currDate)]
which should return your data much more quickly. For more information on multi-indexing, check out this helpful Pandas tutorial.
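As a rough, self-contained sketch of the multi-index idea (the data and the 'AAA'/'BBB' tickers are made up; only the column names mirror the question):

import pandas as pd

# Hypothetical price data; 'Ticker'/'date'/'high' mirror the column names in the question.
data = pd.DataFrame({
    'Ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
    'date': pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-04', '2021-01-05']),
    'high': [10.0, 12.5, 20.0, 19.0],
})

# Sorted (Ticker, date) MultiIndex, so a ticker/date-range slice is a fast index lookup.
allp_df = data.set_index(['Ticker', 'date']).sort_index()

# All rows for one ticker within a date range:
window = allp_df.loc[('AAA', pd.Timestamp('2021-01-04')):('AAA', pd.Timestamp('2021-01-05'))]

hd = window['high'].idxmax()            # (ticker, date) label of the highest high
hh = round(window.loc[hd, 'high'], 2)   # the highest high itself
print(hh, hd)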
I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','order call','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['order','call'],['new','taxi']]}
df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])
I have calculated the Jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare the overlapping words between every possible pair of rows and created a dataframe out of it:
from itertools import combinations
from pandas import DataFrame

overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")

# creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list, columns=['overlapping_list'])
@gold_cy's answer helped me, and I made some changes to it to get the output I'd like:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ['intent','key_words','Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = rows
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
The issue is that when there are more instances of the same intent, I run into the error:
ValueError: too many values to unpack (expected 2)
and I do not know how to handle that for the many more examples I have in my dataset.
Do you want this?
from itertools import combinations
from operator import itemgetter

items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set, df.key_words.values)), 2)):
    keywords = (list(map(itemgetter(1), item)))
    intersect = keywords[0].intersection(keywords[1])
    if len(intersect) > 0:
        str_list = list(map(itemgetter(0), item))
        str_list.append(intersect)
        items_to_consider.append(str_list)

for i in items_to_consider:
    for item in i[2]:
        if item in i[0] and item in i[1]:
            print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")
I need to iterate over the rows of a pandas dataframe. However, I am stuck with looping because it takes too much time on millions of rows. I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)

loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
I want to use 400 ft as the step and increase it gradually: the first row changes by 400, the second by 800, and so on in multiples of 400. It's like:
row[1] = 27800+400
row[2] = 27775+800
etc....
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use Pandas .add option:
df['alt'].add((df.index+1)*400)
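For instance, with a small made-up frame (the alt values are just for illustration):

import pandas as pd

df = pd.DataFrame({'alt': [27800, 27775, 27900]})
df['alt_anomaly'] = df['alt'] + (df.index + 1) * 400
print(df)
# which prints something like:
#      alt  alt_anomaly
# 0  27800        28200
# 1  27775        28575
# 2  27900        29100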
I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential piece of code, and that's actually my problem. My goal is to create a somewhat automated script, probably including a for loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes the original dataset, which looks like this:
[dataset screenshot]
From this dataset, the script should pick rows randomly, one by one, and save them to another Excel sheet. The point is that the values in columns position01 and position02 of each pick should never match either of those two column values from the previous pick. This should eventually produce an Excel sheet with the rows in randomized order, where each row shares no position01/position02 value with the row before it: row 2 should not include any of the position01/position02 values of row 1, row 3 should not contain the values of row 2, and so on. It should also iterate over the whole list, whose indices run from 0 to 11. The Excel output also matters, since I need the rest of the columns; I just need to shuffle the row order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting the raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which does not include the row's values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
So in the end I've used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forum. In case anyone is interested, here is a link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam this thread too much.
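For anyone who lands here later, this is roughly how that condition sits inside the selection loop. It is only a sketch of my adaptation, with guessed names (`choices` as a dict of row-dicts, `order` for the collected rows, and a made-up output file name):

import random
import pandas as pd

dataset = pd.read_excel("DataInput_Training01.xlsx")
choices = dataset.loc[0:11].to_dict('index')   # {row index: {column: value}}

order = []
key = random.choice(list(choices))
last = choices.pop(key)
order.append(last)

while choices:
    # keep only rows that share no position01/position02 value with the previous pick
    remaining = [choices[x] for x in choices
                 if last['position01'] != choices[x]['position01']
                 and last['position01'] != choices[x]['position02']
                 and last['position02'] != choices[x]['position01']
                 and last['position02'] != choices[x]['position02']]
    if not remaining:
        break   # no compatible row left; in practice you could restart the shuffle
    last = random.choice(remaining)
    choices = {k: v for k, v in choices.items() if v is not last}
    order.append(last)

pd.DataFrame(order).to_excel("Randomized_Training01.xlsx", index=False)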
import itertools as it
import random
import pandas as pd
# list of pair of numbers
tmp1 = [x for x in it.permutations(list(range(6)),2)]
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i],ignore_index = True)
df = df.drop(index = i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty:  # looped 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
EDIT: See end of my post for working code, obtained from zeekay here.
I have a CSV file with two columns (voltage and current). Because the voltage is recorded to many significant digits and the current only has 2, there are many identical current values as the value of the voltage changes. This isn't important to the programming but I'm just explaining how the data is physically obtained. I want to perform the following action:
For as long as the value of the second column (current) does not change, collect the values of the first column (voltage) into a list and average them. Then write a row into a new CSV file with this averaged voltage in the first column and the constant current value in the second column. In other words, if there are 20 rows over which the current did not change (say it is 6 uA), the 20 corresponding voltage values are averaged (say this average comes out to be 600 mV) and a row is generated in the new CSV file which reads ('0.6','0.000006'). Then I want to continue iterating through the CSV being read, repeating the above procedure for each run of constant current.
I've got the following code so far, but I'm not sure if I'm on the right track:
import sys, csv

with open('filetowriteto.csv','w') as avg:
    loadeddata = open('filetoreadfrom.csv','r')
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]
    for row in readloaded:
        newcurrent = row[1]
        biaslist = []
        if newcurrent == oldcurrent:
            biaslist.append(row[0])
        else:
            biasavg = float(sum(biaslist))/len(biaslist)
            writer.writerow([biasavg, newcurrent])
            newcurrent = row[1]
and then I'm not sure where to go.
Edit: It seems that zeekay is on the right track for what I want to do. I'm trying to implement his itertools.groupby() method but I'm currently getting a blank file generated. Here's my new code so far:
import sys, csv, itertools

with open('VI_avg(12).csv','w') as avg:    # this is the file which gets written
    loadeddata = open('VI(12).csv','r')    # this is the file which is read
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]          # looks like this is no longer required
    for current, row in itertools.groupby(readloaded, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist))/len(biaslist)
        # write it out
        writer.writerow(biasavg, current)
Suppose the CSV file being opened is something like this (shortened example):
0.595417,0.000065
0.595177,0.000065
0.594937,0.000065
0.594697,0.000065
0.594457,0.000065
0.594217,0.000065
0.593977,0.000065
0.593737,0.000065
0.593497,0.000064
0.593017,0.000064
0.592777,0.000064
0.592537,0.000064
0.592297,0.000064
0.587018,0.000064
0.586778,0.000064
0.586538,0.000063
0.586299,0.000063
0.586059,0.000063
0.585579,0.000063
0.585339,0.000063
0.585099,0.000063
0.584859,0.000063
0.584619,0.000063
0.584379,0.000063
0.584139,0.000063
0.583899,0.000063
0.583659,0.000063
Final update: Here's the working version, obtained from zeekay:
import csv
import itertools

with open('VI(12).csv') as input, open('VI_avg(12).csv','w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist))/len(biaslist)
        writer.writerow([biasavg, current])
You can use itertools.groupby to group results as you read through the csv, which would simplify things a lot. Given your updated example:
import csv
import itertools

with open('VI(12).csv') as input, open('VI_avg(12).csv','w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist))/len(biaslist)
        writer.writerow([biasavg, current])
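One point worth noting (my addition, not part of the original answer): itertools.groupby only groups consecutive rows with equal keys, which is exactly the behaviour asked for; if the same current shows up again later in the file, it starts a new group and gets its own averaged row. A tiny illustration:

import itertools

rows = [('0.59', '0.000065'), ('0.58', '0.000065'),
        ('0.57', '0.000064'), ('0.56', '0.000065')]
print([(current, [r[0] for r in group])
       for current, group in itertools.groupby(rows, lambda x: x[1])])
# [('0.000065', ['0.59', '0.58']), ('0.000064', ['0.57']), ('0.000065', ['0.56'])]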
Maybe you can try using pandas:
import pandas

voltage = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
current = [1.0, 1.0, 1.1, 1.3, 1.2, 1.3]

df = pandas.DataFrame({'voltage': voltage, 'current': current})
result = df.groupby('current').mean()

# Output:
#          voltage
# current
# 1.0         1.15
# 1.1         1.30
# 1.2         2.20
# 1.3         2.20

result.to_csv('grouped_data.csv')
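Note that `df.groupby('current')` merges every row with the same current, even when the runs are not consecutive. If only consecutive runs should be averaged (as in the itertools.groupby approach above), one common idiom, sketched here as an assumption rather than part of the original answer, is to build a run id from consecutive changes and group on that as well:

# Label each consecutive run of identical current values, then average per run.
df['run'] = (df['current'] != df['current'].shift()).cumsum()
result = df.groupby(['run', 'current'], as_index=False)['voltage'].mean()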
One way:
curDict = {}
for row in readloaded:                          # the csv.reader rows from your code
    if row[1] not in curDict:                   # if not already there, create key/value pair
        curDict[row[1]] = [float(row[0])]
    else:                                       # already exists, add to key/value pair
        curDict[row[1]].append(float(row[0]))

# You'll end up with:
# {'0.000065': [0.595417, 0.595177, ...], ...}

# write the rows
for k, v in curDict.items():
    avgValue = sum(v) / len(v)                  # calculate the avg of the voltages
    writer.writerow([avgValue, k])              # average voltage first, then the current
This version will do what you describe, but it will average all voltages with the same current, regardless of whether they are consecutive or not. Apologies if that's not what you want, but maybe it can help you along the way:
import csv
from collections import defaultdict
from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

def f(acc, row):
    acc[row[1]].append(float(row[0]))   # key: current, value: list of voltages
    return acc

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    data = open('in.csv', 'r')
    r = csv.reader(data)
    reduced = reduce(f, r, defaultdict(list))
    for current, voltages in reduced.items():
        writer.writerow([sum(voltages) / len(voltages), current])   # average voltage first, then current
Yet another way using some very small test data (haven't included the csv stuff as you appear to have a handle on that):
#!/usr/bin/python3

test_data = [            # Only 3 currents in the test data:
    (0.00030, 5),        # 5 : Only one entry, total 0.00030 - so should give 0.00030 as the average
    (0.00012, 6),        # 6 : Two entries, total 0.00048 - so should give 0.00024 as the average
    (0.00036, 6),
    (0.00001, 7),        # 7 : Four entries, total 0.00010 - so should give 0.000025 as the average
    (0.00001, 7),
    (0.00001, 7),
    (0.00007, 7)]

currents = dict()

for row in test_data:
    if not row[1] in currents:
        matching_currents = list((each[0] for each in test_data if each[1] == row[1]))
        current_average = sum(matching_currents) / len(matching_currents)
        currents[row[1]] = current_average

print("There were {0} unique currents found:\n".format(len(currents)))

for current, bias in currents.items():
    print("Current: {0:2d} ( Average: {1:1.5f} )".format(current, bias))