I have a pivot table that I have created (pivotTable) using:
pivotTable= dayData.pivot_table(index=['sector'], aggfunc='count')
which has produced the following pivot table:
                sector  id
broad_sector
Communications       2   2
Utilities            3   3
Media                3   3
Could someone let me know if there is a way to loop through the pivot table, assigning the index value and sector total to the respective variables sectorName and sectorCount?
I have tried:
i=0
while i <= lenPivotTable:
    sectorName = sectorPivot.index.get_level_values(0)
    sectorNumber = sectorPivot.index.get_level_values(1)
    i=i+1
to return for the first loop iteration:
sectorName = 'Communications'
sectorCount = 2
for the second loop iteration:
sectorName = 'Utilities'
sectorCount = 3
for the third loop iteration:
sectorName = 'Media'
sectorCount = 3
But can't get it to work.
This snippet will get you the values as asked.
for sector_name, sector_count, _ in pivotTable.to_records():
    print(sector_name, sector_count)
I don't understand why you need this (looping through a DataFrame is very slow), but you can do it this way:
In [403]: for idx, row in pivotTable.iterrows():
   .....:     sectorName = idx
   .....:     sectorCount = row['sector']
   .....:     print(sectorName, sectorCount)
   .....:
Communications 2
Utilities 3
Media 3
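If only the name and the count are needed, iterating over a single column with `Series.items()` avoids building a full row object on every step. The sketch below is only an illustration: `dayData` is a made-up stand-in (the real frame is not shown in the question), and it assumes the index column is called `broad_sector`, as the printed table suggests.

```python
import pandas as pd

# Hypothetical stand-in for dayData; the real data is not shown in the question.
dayData = pd.DataFrame({
    "broad_sector": ["Communications", "Communications",
                     "Utilities", "Utilities", "Utilities",
                     "Media", "Media", "Media"],
    "sector": list("abcdefgh"),
    "id": range(8),
})

# Count the non-null entries of every remaining column per broad_sector.
pivotTable = dayData.pivot_table(index=["broad_sector"], aggfunc="count")

# Series.items() yields (index label, value) pairs, one per row.
pairs = []
for sectorName, sectorCount in pivotTable["sector"].items():
    pairs.append((sectorName, int(sectorCount)))
    print(sectorName, sectorCount)
```

Note that `pivot_table` sorts the index, so the rows may come out in a different order than in the original data.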
I need to sum the value contained in a column (column 9) if a condition is satisfied: the condition is that it needs to be a pair of individuals (column 1 and column 3), whether they are repeated or not.
My input file is made this way:
Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418
For instance, I need the values in column 9 to be summed for each pair. Some of these pairs appear multiple times; in this case I need the sum of the column 9 values between IBS_HG01768 and Moroccan_MRA46, and the sum of the values between Sindhi_HGDP00183 and Sindhi_HGDP00206. Some pairs are not repeated, but I still need them to appear in the final results.
What I have managed so far is to sum by group (population), so I sum the column 9 values by pair of populations, such as Sindhi and Tunisian. I need to do the sum by pairs of individuals.
My script is this:
import pandas as pd
import numpy as np
import itertools
# defines columns names
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
# loads data (the file needs to be in the same folder where the script is)
data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep = '\t', names = cols)
# removes the sample ID for ID1/ID2 columns and places it in two dedicated columns
data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand = True)
data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand = True)
# gets the groups list from both ID columns...
groups_id1 = list(data.ID1.unique())
groups_id2 = list(data.ID2.unique())
groups = list(set(groups_id1 + groups_id2))
# ... and all the possible pairs
group_pairs = [i for i in itertools.combinations(groups, 2)]
# subsets the pairs having Roma
group_pairs_roma = [x for x in group_pairs if ('Roma' in x[0] and x[0] != 'Romanian') or
                    ('Roma' in x[1] and x[1] != 'Romanian')]
# prepares output df
result = pd.DataFrame(columns = ['ID1', 'ID2', 'IBD_sum'])
# loops all the possible pairs and computes the sum of IBD length
for idx, group_pair in enumerate(group_pairs_roma):
    id1 = group_pair[0]
    id2 = group_pair[1]
    ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
                             ((data['ID1'] == id2) & (data['ID2'] == id1)), 'IBDLENGTH'].sum(),3)
    result.loc[idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]
# saves results
result.to_csv("./groups_pairs_sum_IBD.txt", sep = '\t', index = False)
My current output is something like this:
ID1 ID2 IBD_sum
Sindhi IBS 3.275
Sindhi Moroccan 74.201
Sindhi Sindhi 119.359
While I need something like:
ID1 ID2 IBD_sum
Sindhi_individual1 Moroccan_individual1 3.275
Sindhi_individual2 Moroccan_individual2 5.275
Sindhi_individual3 IBS_individual1 4.275
I have tried by substituting one line in my code, by writing
groups_id1 = list(data.ID1_samples.unique())
groups_id2 = list(data.ID2_samples.unique())
and later
ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)), 'IBDLENGTH'].sum(),3)
Which in theory should work because I set the individuals as pairs instead of populations as pairs, but the output was empty. What could I do to edit the code for what I need?
I have solved the problem on my own but using R language.
This is the code:
library(dplyr)  # provides %>%, group_by, summarise, n
ibd <- read.delim("input.txt", sep='\t')
ibd_sum_indv <- ibd %>%
  group_by(ID1, ID2) %>%
  summarise(SIBD = sum(IBDLENGTH),
            NIBD = n()) %>%
  ungroup()
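For anyone who wants to stay in Python, the same grouping can probably be done with a pandas `groupby` on the unsplit `ID1` and `ID2` columns. This is a sketch rather than a tested replacement for the script above: it uses the six sample rows from the question in place of the input file, and the output column names `SIBD` and `NIBD` are borrowed from the R code.

```python
import pandas as pd

cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']

# The six sample rows from the question stand in for the real input file.
rows = [
    ("Sindhi_HGDP00171", 0, "Tunisian_39T", 0, 1, 120437718, 147097266, 3.02, 7.111),
    ("Sindhi_HGDP00183", 1, "Sindhi_HGDP00206", 2, 1, 242708729, 244766624, 7.41, 3.468),
    ("Sindhi_HGDP00183", 1, "Sindhi_HGDP00206", 2, 1, 242708729, 244766624, 7.41, 4.468),
    ("IBS_HG01768", 2, "Moroccan_MRA46", 1, 1, 34186193, 36027711, 30.46, 3.108),
    ("IBS_HG01710", 1, "Sardinian_HGDP01065", 2, 1, 246117191, 249120684, 7.53, 3.258),
    ("IBS_HG01768", 2, "Moroccan_MRA46", 2, 1, 34186193, 37320967, 43.4, 4.418),
]
data = pd.DataFrame(rows, columns=cols)

# Group by the full individual IDs (do not split off the population first),
# then sum IBDLENGTH and count the segments for every pair.
ibd_sum_indv = (
    data.groupby(['ID1', 'ID2'], as_index=False)
        .agg(SIBD=('IBDLENGTH', 'sum'), NIBD=('IBDLENGTH', 'count'))
)
ibd_sum_indv['SIBD'] = ibd_sum_indv['SIBD'].round(3)
print(ibd_sum_indv)
```

Like the R version, this treats (A, B) and (B, A) as different pairs. If the same two individuals can appear in either order, the two ID columns would need to be sorted within each row before grouping.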
I have 2 columns in a dataframe, one named "day_test" and one named "Temp Column". Some of my values in Temp Column are negative, and I want them to be 1 or 2. I've made a for loop with 2 if statements:
for (i,j) in zip(df['day_test'].astype(int), df['Temp Column'].astype(int)):
    if i == 2 and j < 0:
        j = 2
    if i == 1 and j < 0:
        j = 1
I tried printing j so I know the loops are working properly, but the values that I want to change in the dataframe are staying negative.
Thanks
Your code doesn't change the values inside the dataframe, it only changes the j value temporarily.
One way to do it is this:
df['day_test'] = df['day_test'].astype(int)
df['Temp Column'] = df['Temp Column'].astype(int)
df.loc[(df['day_test']==1) & (df['Temp Column']<0),'Temp Column'] = 1
df.loc[(df['day_test']==2) & (df['Temp Column']<0),'Temp Column'] = 2
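Since the replacement value is the same as the value in day_test, the two assignments can also be folded into one. The following is a small sketch on made-up data (the real frame is not shown in the question), assuming both columns already hold integers:

```python
import pandas as pd

# Illustrative data only; the real frame is not shown in the question.
df = pd.DataFrame({'day_test': [1, 2, 1, 2], 'Temp Column': [-5, -3, 7, 4]})

# Rows where the temperature is negative and day_test is 1 or 2.
mask = (df['Temp Column'] < 0) & df['day_test'].isin([1, 2])

# Write the day_test value into 'Temp Column' for those rows only.
# Assigning through df.loc changes the frame itself, unlike rebinding
# a loop variable.
df.loc[mask, 'Temp Column'] = df.loc[mask, 'day_test']
print(df)
```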
I want to build a scheduling app in python using pandas.
The following DataFrame is initialised where 0 denotes if a person is busy and 1 if a person is available.
import pandas as pd
df = pd.DataFrame({'01.01.': [1,1,0], '02.01.': [0,1,1], '03.01.': [1,0,1]}, index=['Person A', 'Person B', 'Person C'])
>>> df
          01.01.  02.01.  03.01.
Person A       1       0       1
Person B       1       1       0
Person C       0       1       1
I now want to randomly schedule n number of people per day if they are available. In other words, for every day, if people are available (1), randomly set n number of people to scheduled (2).
I tried something as follows:
# Required number of people across time / columns
required_number = [0, 1, 2]
# Iterate through time / columns
for col in range(len(df.columns)):
    # Current number of scheduled people
    current_number = (df.iloc[:, [col]].values==2).sum()
    # Iterate through indices / rows / people
    for ind in range(len(df.index)):
        # Check if they are available (1) and
        # if the required number of people has not been met yet
        if (df.iloc[ind, col]==1 and
                current_number<required_number[col]):
            # Change "free" / 1 person to "scheduled" / 2
            df.iloc[ind, col] = 2
            # Increment scheduled people by one
            current_number += 1
>>> df
          01.01.  02.01.  03.01.
Person A       1       0       2
Person B       1       2       0
Person C       0       1       2
This works as intended, but because I'm simply looping, I have no way of adding randomness (i.e. that Person A / B / C are randomly selected so long as they are available). Is there a way of doing so directly in pandas?
Thanks. BBQuercus
You can randomly choose proper indices in a series and then change values corresponding to the chosen indices:
import numpy as np
for i in range(len(df.columns)):
    if sum(df.iloc[:,i] == 1) >= required_number[i]:
        column = df.iloc[:,i].reset_index(drop=True)
        # We are going to store the positions of the available people in a list
        a = [j for j in column.index if column[j] == 1]
        random_indexes = np.random.choice(a, required_number[i], replace = False)
        df.iloc[:,i] = [column[j] if j not in random_indexes else 2 for j in column.index]
Now df is the wanted result.
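A variant that works with the row labels directly is to sample from the available people with `Series.sample`. This is only a sketch under one assumption that differs from the loop above: when fewer people are available than required, it schedules everyone who is available instead of skipping the day.

```python
import pandas as pd

df = pd.DataFrame({'01.01.': [1, 1, 0], '02.01.': [0, 1, 1], '03.01.': [1, 0, 1]},
                  index=['Person A', 'Person B', 'Person C'])
required_number = [0, 1, 2]

for col, n in zip(df.columns, required_number):
    # Labels of the people who are available (1) on this day.
    available = df.index[df[col] == 1]
    # Assumption: never ask for more people than are available.
    n = min(n, len(available))
    if n == 0:
        continue
    # Draw n distinct people at random and mark them as scheduled (2).
    chosen = available.to_series().sample(n=n).index
    df.loc[chosen, col] = 2

print(df)
```

Passing `random_state` to `sample` would make the draw reproducible.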
I'm having a hard time figuring out how to create a data frame within a for loop.
df = pd.DataFrame()
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        df['trader'] = lp
        df['bid'] = snapshot[sym][lp][":b"]["LUC"]["price"] if ":b" in snapshot[sym][lp] else "0"
        df['ask'] = snapshot[sym][lp][":a"]["LUC"]["price"] if ":a" in snapshot[sym][lp] else "0"
print df
print df['trader']
Printing 'df' results in Columns: [trader, bid, ask] Index: []
Printing 'df['trader'] results in Series([], Name: bid, dtype: object)
If I change the df[column headings] to assignments, everything prints fine.
I'm trying to create a df that look like this:
  trader   bid   ask
0    MM2  1.25  1.26
1    MM5  1.23  1.27
2    MM3  1.25  1.28
....
Thanks for all the help
It's hard to understand from your question what's going on and what data you have. However, from your code, you are overwriting your columns in each step of the for loop. You could use loc with a running row index to avoid that:
df = pd.DataFrame()
idx = 0  # running row number across all symbols
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        df.loc[idx, 'trader'] = lp
        df.loc[idx, 'bid'] = snapshot[sym][lp][":b"]["LUC"]["price"] if ":b" in snapshot[sym][lp] else "0"
        df.loc[idx, 'ask'] = snapshot[sym][lp][":a"]["LUC"]["price"] if ":a" in snapshot[sym][lp] else "0"
        idx += 1
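An alternative to growing the frame one row at a time is to collect one dict per row and build the DataFrame once at the end, which is usually faster. The sketch below is written for Python 3 and rests on assumptions: the `snapshot` dict is made up to match the lookups in the question (the symbol name "EURUSD" is hypothetical), and a numeric 0 is used for a missing side instead of the string "0".

```python
import pandas as pd

# Hypothetical snapshot, shaped to match the lookups used in the question.
snapshot = {
    "EURUSD": {
        "MM2": {":b": {"LUC": {"price": 1.25}}, ":a": {"LUC": {"price": 1.26}}},
        "MM5": {":b": {"LUC": {"price": 1.23}}, ":a": {"LUC": {"price": 1.27}}},
        "MM3": {":b": {"LUC": {"price": 1.25}}},  # no ask side quoted
    }
}

rows = []
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        quotes = snapshot[sym][lp]
        # One dict per output row; missing sides fall back to 0.
        rows.append({
            "trader": lp,
            "bid": quotes[":b"]["LUC"]["price"] if ":b" in quotes else 0,
            "ask": quotes[":a"]["LUC"]["price"] if ":a" in quotes else 0,
        })

# Build the frame in one go; the columns argument fixes the column order.
df = pd.DataFrame(rows, columns=["trader", "bid", "ask"])
print(df)
```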
I have a homework assignment I just finished, but it looks pretty horrendous, and I know there's a much simpler and more efficient way to get the correct output; I just can't seem to figure it out.
Here's the objective of the assignment.
Write a program that stores the following values in a 2D list (these will be hardcoded):
2.42 11.42 13.86 72.32
56.59 88.52 4.33 87.70
73.72 50.50 7.97 84.47
The program should determine the maximum and average of each column
Output looks like
2.42 11.42 13.86 72.32
56.59 88.52 4.33 87.70
73.72 50.50 7.97 84.47
============================
73.72 88.52 13.86 87.70 column max
44.24 50.15 8.72 81.50 column average
The printing of the 2d list was done below, my problem is calculating the max, and averages.
data = [ [ 2.42, 11.42, 13.86, 72.32],
[ 56.59, 88.52, 4.33, 87.70],
[ 73.72, 50.50, 7.97, 84.47] ]
emptylist = []
r = 0
while r < 3:
    c = 0
    while c < 4 :
        print "%5.2f" % data[r][c] ,
        c = c + 1
    r = r + 1
    print
print "=" * 25
This prints the top half, but the code I wrote to calculate the max and average is bad. For the max I basically compared all the entries in each column to each other with if/elif statements, and for the average I added the entries of each column together, averaged them, then printed. Is there any way to calculate the bottom rows with some sort of loop? Maybe something like the following:
for numbers in data:
    r = 0 #row index
    c = 0 #column index
    emptylist= []
    while c < 4 :
        while r < 3 :
            sum = data[r][c]
            totalsum = totalsum + sum
            avg = totalsum / float(rows)
            emptylist.append(avg) #not sure if this would work? here im just trying to
            r = r + 1             #dump averages into an emptylist to print the values
        c = c + 1                 #in it later?
or something like that, where I'm not manually adding each index number for each column and row. I have no clue how to do the max in a loop. Also, NO LIST METHODS can be used; only append and len() can be used. Any help?
Here is what you're looking for:
num_rows = len(data)
num_cols = len(data[0])
max_values = [0]*num_cols # Assuming the numbers in the array are all positive
avg_values = [0]*num_cols
for row_data in data:
    for col_idx, col_data in enumerate(row_data):
        max_values[col_idx] = max(max_values[col_idx], col_data) # Max of two values
        avg_values[col_idx] += col_data
for i in range(num_cols):
    avg_values[i] /= num_rows
Then the max_values will contain the maximum for each column, while avg_values will contain the average for each column. Then you can print it like usual:
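for num in max_values:
    print num,
print
for num in avg_values:
    print num,
or simply (if allowed), converting each number to a string first because join only accepts strings:
print ' '.join(str(num) for num in max_values)
print ' '.join(str(num) for num in avg_values)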
I would suggest making two new lists, each the same size as one of your rows, and keeping a running max in one and a running average in the other:
maxes = [0] * 4 # equivalent to [0, 0, 0, 0]
avgs = [0] * 4
for row in data: # this gives one row at a time
    for c in range(4): # equivalent to for c in [0,1,2,3]:
        #first, check if the max is big enough:
        if row[c] > maxes[c]:
            maxes[c] = row[c]
        # next, add this row's share of the column average (value / number of rows):
        avgs[c] += row[c] / float(len(data))
You can print them like so:
for m in maxes:
    print "%5.2f" % m,
print
for a in avgs:
    print "%5.2f" % a,
If you are allowed to use the enumerate function, this can be done a little more nicely:
for i, val in enumerate(row):
    print i, val
0 2.42
1 11.42
2 13.86
3 72.32
So it gives us the values and the index, so we can use it like this:
maxes = [0] * 4
sums = [0] * 4
for row in data:
    for c, val in enumerate(row):
        #first, check if the max is big enough:
        if val > maxes[c]:
            maxes[c] = val
        # next, add that value to the sum:
        sums[c] += val
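Putting the pieces together, here is a complete sketch of the assignment that uses only `len()` and `append`, with plain while loops walking column by column. It is written for Python 3 (so `print` is a function, unlike the Python 2 snippets in this thread), and `print_row` is a helper name made up for this example. Starting each column's max from the first row avoids having to assume the numbers are all positive.

```python
data = [[2.42, 11.42, 13.86, 72.32],
        [56.59, 88.52, 4.33, 87.70],
        [73.72, 50.50, 7.97, 84.47]]

num_rows = len(data)
num_cols = len(data[0])

maxes = []
avgs = []
c = 0
while c < num_cols:
    col_max = data[0][c]   # start from the first row, so negatives also work
    total = 0.0
    r = 0
    while r < num_rows:
        value = data[r][c]
        if value > col_max:
            col_max = value
        total = total + value
        r = r + 1
    maxes.append(col_max)
    avgs.append(total / num_rows)  # average = column sum / number of rows
    c = c + 1


def print_row(values, label=""):
    # Hypothetical helper: format every value to two decimals on one line.
    line = ""
    for v in values:
        line = line + "%6.2f " % v
    print(line + label)


for row in data:
    print_row(row)
print("=" * 28)
print_row(maxes, "column max")
print_row(avgs, "column average")
```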