I am looking for a faster way to apply a function across multiple columns of a grouped DataFrame.
My dataframe looks like this:
import numpy as np
import pandas as pd
c = 12*1000
b = int(c/2)
d = int(b/2)
newdf = {'Class': ['c1']*c+['c2']*c+['c3']*c,
         'Section': ['A']*b+['B']*b+['C']*b+['D']*b+['E']*b+['F']*b,
         'Time': [1,2,3,4,5,6]*d+[3,1,3,4,5,7]*d}
test = pd.DataFrame(newdf)
test['f_x'] = test['Time']**2/5
test['f_x_2'] = test['Time']**2/5+test['f_x']
cols = ['f_x','f_x_2']
I know how to compute a per-group value for a single column, for example:
test['section_mean'] = test.groupby(['Class','Section'])['f_x'].transform(lambda x: x.mean())
Or do simple row-wise operations across several columns:
test['two_col_sum'] = test[['Time','f_x']].apply(lambda x: x.Time+x.f_x,axis=1)
However, what I'm trying to do is run a computation over the full columns of each group, here a linear fit of each column against Time:
%%time
slopes_df = pd.DataFrame()
grouped = test.groupby(['Class','Section'])
for name, group in grouped:
    nd = []
    for col in cols:
        ntest = group[['Time', col]]
        x = ntest.Time
        y = ntest[col]
        f = np.polyfit(x, y, deg=1).round(2)
        data = [name[0], name[1], col, f[0], f[1]]
        nd.append(data)
    slopes_df = pd.concat([slopes_df, pd.DataFrame(nd)])
slopes_df.columns=['Class','Section','col','slope','intercept']
slopes_df_p = pd.pivot_table(data=slopes_df,index=['Class','Section'], columns=['col'], values=['slope','intercept']).reset_index()
slopes_df_p.columns = pd.Index(e[0] if e[0] in ['Class','Section'] else e[0]+'_'+e[1] for e in slopes_df_p.columns)
fdf = pd.merge(test, slopes_df_p, on=['Class','Section'])
I tried the proposed solution like this:
%%time
for col in cols:
    df1 = (test.groupby(['Class','Section'])
               .apply(lambda x: np.polyfit(x['Time'], x[col], deg=1).round(2)[0])
               .rename('slope_'+str(col)))
    df2 = (test.groupby(['Class','Section'])
               .apply(lambda x: np.polyfit(x['Time'], x[col], deg=1).round(2)[1])
               .rename('intercept_'+str(col)))
    df1['col'] = col
    df2['col'] = col
    test = pd.merge(test, df1, on=['Class','Section'])
    test = pd.merge(test, df2, on=['Class','Section'])
But it seems slower: on my PC the first loop takes about 150 ms and the second version about 300 ms.
Andrea
Your loop solution does not work on each group's data directly, so I think you need GroupBy.apply:
def f(x):
    for col in cols:
        x[f'slope_{col}'], x[f'intercept_{col}'] = np.polyfit(x['Time'], x[col], deg=1).round(2)
    return x
df1 = test.groupby(['Class','Section']).apply(f)
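As a variation on the GroupBy.apply idea, here is a minimal sketch that returns one row of fit results per group and merges it back, rather than writing the values into every row inside the function. The small frame is a stand-in for the question's data, and `fit_group`, `fits`, and `keys` are names made up for this example. Whether it is faster on the real data has not been measured here.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's `test` frame (values are illustrative).
test = pd.DataFrame({
    "Class":   ["c1"] * 6 + ["c2"] * 6,
    "Section": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Time":    [1, 2, 3] * 4,
})
test["f_x"] = test["Time"] ** 2 / 5
test["f_x_2"] = test["Time"] ** 2 / 5 + test["f_x"]
cols = ["f_x", "f_x_2"]
keys = ["Class", "Section"]

def fit_group(group):
    """Return one row of slope/intercept values for a single group."""
    out = {}
    for col in cols:
        slope, intercept = np.polyfit(group["Time"], group[col], deg=1)
        out[f"slope_{col}"] = round(slope, 2)
        out[f"intercept_{col}"] = round(intercept, 2)
    return pd.Series(out)

# One row per (Class, Section): each fit runs once per group and column.
fits = test.groupby(keys)[["Time"] + cols].apply(fit_group).reset_index()

# Broadcast the per-group results back onto the original rows.
fdf = test.merge(fits, on=keys, how="left")
```

Selecting the needed columns after `groupby` keeps the grouping keys out of the frame passed to `fit_group`, and a single merge replaces the two merges per column.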
I have a dataframe and need to drop every column in which more than 55% of the values are repeated/duplicate values.
Would anyone be able to assist me on how to do this?
Please try this:
Let df1 be your dataframe:
drop_columns = []
drop_threshold = 0.55  # fraction of rows above which a column is dropped
for col in df1.columns:
    # share of rows taken by each distinct value in this column
    value_share = df1[col].value_counts(normalize=True)
    # drop the column if any single value exceeds the threshold
    if (value_share > drop_threshold).any():
        drop_columns.append(col)
df1 = df1.drop(columns=drop_columns)
Let's use pd.Series.duplicated:
cols_to_keep=df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
df[cols_to_keep]
If you're referring to columns in which the most common value is repeated in more than 55% of rows, here's a solution:
from collections import Counter
# assuming some DataFrame named df
bool_idx = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
df = df.loc[:, bool_idx]
If you're talking about non-unique values, this works:
bool_idx = df.apply(
    lambda x: sum(
        y for y in Counter(x).values() if y > 1
    ) < .55 * len(x),
    axis=0
)
df = df.loc[:, bool_idx]
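For the "most common value" reading of the question, the same filter can be sketched with pandas alone. The frame and the names `top_share` and `filtered` are assumptions made for this example. Note that `value_counts` ignores NaN by default, so shares are taken over non-missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "mostly_same": [1, 1, 1, 1, 2],            # top value covers 80% of rows
    "all_unique":  [1, 2, 3, 4, 5],            # top value covers 20% of rows
    "half_half":   ["a", "a", "b", "b", "c"],  # top value covers 40% of rows
})

threshold = 0.55

# Share of rows taken by the most frequent value in each column.
# value_counts sorts in descending order, so the first entry is the largest.
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Keep only the columns whose dominant value stays at or below the threshold.
filtered = df.loc[:, top_share <= threshold]
```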
So I have some dataframes (df0, df1, df2) with various numbers of rows. I want to split any dataframe that has more than 30 rows into several dataframes of at most 30 rows each. For example, my dataframe df0 has 156 rows, so I would split it like this:
if len(df0) > 30:
    # slices are end-exclusive, so each piece starts where the previous one ended
    df0_A = df0[0:30]
    df0_B = df0[30:60]
    df0_C = df0[60:90]
    df0_D = df0[90:120]
    df0_E = df0[120:150]
    df0_F = df0[150:180]
else:
    df0 = df0
The problem with this code is that I then have to repeat every later step for each piece, like this:
df0= pd.DataFrame(df0)
df0_A = pd.DataFrame(df0_A)
df0_B = pd.DataFrame(df0_B)
df0_C = pd.DataFrame(df0_C)
df0_D = pd.DataFrame(df0_D)
df0_E = pd.DataFrame(df0_E)
df0_F = pd.DataFrame(df0_F)
df0= df0.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_A = df0_A.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_B = df0_B.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_C = df0_C.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_D = df0_D.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_E = df0_E.to_string(header=False,
index=False,
index_names=False).split('\n')
df0_F = df0_F.to_string(header=False,
index=False,
index_names=False).split('\n')
df0= [','.join(ele.split()) for ele in df0]
df0_A = [','.join(ele.split()) for ele in df0_A]
df0_B = [','.join(ele.split()) for ele in df0_B]
df0_C = [','.join(ele.split()) for ele in df0_C]
df0_D = [','.join(ele.split()) for ele in df0_D]
df0_E = [','.join(ele.split()) for ele in df0_E]
df0_F = [','.join(ele.split()) for ele in df0_F]
Now imagine I have ten dataframes, each of which I need to split into five dataframes. Then I would need to write the same code 50 times!
I'm quite new to Python. Can anyone help me simplify this code, maybe with a simple for loop? Thanks.
You could probably automate it a little bit more, but this should be enough!
import numpy as np
import pandas as pd
df0 = pd.DataFrame({'Test' : np.random.randint(100000,999999,size=180)})
df_dict = {}
if len(df0) > 30:
    x = 0
    y = 30
    for df_letter in ['A','B','C','D','E','F']:
        df_name = f'df0_{df_letter}'
        # slice 30 rows, render them as text, and split into one string per row
        df_dict[df_name] = df0[x:y].to_string(header=False, index=False, index_names=False).split('\n')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 30
        y += 30
for df in df_dict:
    print(df)
    print('--------------------------------------------------------------------')
    print(f'length: {len(df_dict[df])}')
    print('--------------------------------------------------------------------')
    print(df_dict[df])
    print('--------------------------------------------------------------------')
Assuming you have one column for identification,
import numpy as np
import pandas as pd
def split_df(idf, idcol, nsize):
    g = idf.groupby(idcol)
    # Compute the size for each value of identification column
    size = g.size()
    dflist = []
    for _id, _idcount in size.items():
        if _idcount > nsize:
            idx = idf[idf[idcol].eq(_id)].index
            # split the index into parts of at most `nsize` rows
            # e.g. [1,2,3,4,5] with nsize = 2 will split into ([1,2], [3,4], [5])
            ilist = np.array_split(idx, int(np.ceil(idx.shape[0] / nsize)))
            dflist += ilist
    return [idf.loc[idx].copy(deep=True) for idx in dflist]
df = pd.DataFrame(data=np.hstack((np.random.choice(np.arange(1,3), 10).reshape(10, -1), np.random.rand(10,3))), columns=['id', 'a', 'b', 'c'])
df = df.astype({'id': np.int64})
split_df(df, 'id', 2)
This is a great problem, you can use this (data is the DataFrame here):
# Create subsets of size 30 for the DataFrame
subsets = list(range(0, len(data), 30))
# Create start cutoffs for subsets of the DataFrame
start_cutoff = subsets
# Create end cutoffs for subsets of the DataFrame
end_cutoff = subsets[1:] + [len(data)]
# Zip the start cutoffs and end cutoffs into a List of Cutoffs
cutoffs = list(zip(start_cutoff, end_cutoff))
# List containing Splitted Dataframes
list_dfs = [data.iloc[cutoff[0]: cutoff[-1]] for cutoff in cutoffs]
# convert each chunk to a list of row strings
string_dfs = [df.to_string(header=False, index=False, index_names=False).split('\n') for df in list_dfs]
# one list of comma-joined rows per chunk
final_df_list = [[','.join(ele.split()) for ele in string_df] for string_df in string_dfs]
Now you can access the converted chunks by index:
print(final_df_list[0])
print(final_df_list[1])
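The same idea can be wrapped in two small helpers so that it applies to any number of dataframes. This is only a sketch: `split_into_chunks` and `rows_as_csv_lines` are hypothetical names, and the row rendering uses `itertuples` with `str` instead of the `to_string` round trip used above, so number formatting may differ slightly.

```python
import numpy as np
import pandas as pd

def split_into_chunks(frame, chunk_size=30):
    """Return consecutive slices of `frame`, each at most `chunk_size` rows."""
    return [frame.iloc[start:start + chunk_size]
            for start in range(0, len(frame), chunk_size)]

def rows_as_csv_lines(frame):
    """Render each row as one comma-joined string, without header or index."""
    return [",".join(map(str, row)) for row in frame.itertuples(index=False)]

# Stand-in for df0 from the question: 156 rows in a single column.
df0 = pd.DataFrame({"Test": np.arange(156)})

chunks = split_into_chunks(df0, 30)
lines_per_chunk = [rows_as_csv_lines(chunk) for chunk in chunks]
```

With several dataframes, a dict comprehension such as `{name: split_into_chunks(frame) for name, frame in frames.items()}` (where `frames` is an assumed dict of dataframes) avoids creating one variable per piece.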
data = {"index":{"0":1692,"1":1771,"2":1007,"3":2915,"4":1416},
"item_number":{"0":"123","1":"123","2":"124","3":"124","4":"125"},
"brand":{"0":"brand1","1":"brand1","2":"brand2","3":"brand2","4":"brand3"},
"price":{"0":20.00,"1":20.00,"2":25.00,"3":25.00,"4":30.00},
"comp_id":{"0":1,"1":2,"2":1,"3":3,"4":2},
"comp":{"0":"comp1","1":"comp2","2":"comp1","3":"comp3","4":"comp2"},
"comp_price":{"0":21.00,"1":20.99,"2":16.00,"3":15.99,"4":29.99}}
df1 = pd.DataFrame(data=data)
g = df1.groupby('brand')
v = df1[df1['price']>df1['comp_price']].groupby('brand')
#number of skus within each brand
brand_sku_count = g.apply(lambda x: len(x['item_number'].unique()))
#number of skus violated within each brand
brand_vio_count = v.apply(lambda x: len(x['item_number'].unique()))
#number of sellers within each brand
total_sellers = g.apply(lambda x: len(x['comp_id'].unique()))
#number of violators within each brand
total_violators = v.apply(lambda x: len(x['comp_id'].unique()))
brand_report = pd.concat([brand_sku_count, brand_vio_count,
total_sellers, total_violators], axis=1)
brand_report.columns = ['sku_count','vio_count','total_comps','total_vios']
The above is my old code. I recently discovered the transform and agg functions. I'm trying to learn how to avoid computing these statistics one at a time and then piecing them back together with concat. I feel there's an opportunity to greatly reduce the number of lines of code here.
I've read questions showing that you can do the following:
df1.groupby('brand')['item_number'].agg(['sum','count'])
I've tried doing:
f1 = lambda x: len(x['item_number'].unique())
f2 = lambda x: len(x['comp_id'].unique())
f = {'item_number':f1, 'comp_id':f2}
df1.groupby('brand').agg(f)
This returns:
KeyError: 'item_number'
So I tried:
f1 = lambda x: len(x.get_group('item_number').unique())
f2 = lambda x: len(x.get_group('comp_id').unique())
f = {'item_number':f1, 'comp_id':f2}
df1.groupby('brand').agg(f)
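This returned an error saying a Series object does not have get_group.
Try this. With a dict, agg passes each selected column to its function as a Series, so the function should not index by column name: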
f1 = lambda x: len(x.unique())
f = {'item_number':f1, 'comp_id':f1}
df1.groupby('brand').agg(f)
Out[881]:
        item_number  comp_id
brand
brand1            1        2
brand2            1        2
brand3            1        1
Pivot tables will also work:
lst= ['brand', 'comp_id','item_number']
df1Pivot = df1[lst].pivot_table(index = 'brand',aggfunc = lambda x: len(x.unique()))
df1Pivot.rename(columns ={'item_number':'sku_count','comp_id':'total_comps'}, inplace = True)
df2Pivot = df1[df1['price']>df1['comp_price']][lst].pivot_table(index = 'brand',aggfunc = lambda x: len(x.unique()))
df2Pivot.rename(columns ={'item_number':'vio_count','comp_id':'total_vios'}, inplace = True)
df3 = df1Pivot.join(df2Pivot)
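Another way to build the whole report, sketched here with named aggregation and the built-in "nunique" aggregation (available in pandas 0.25 and later). The data is the question's, minus the unused index column. Filling brands without violations with 0 is an assumption about the desired output, since the original concat would leave NaN there.

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_number": ["123", "123", "124", "124", "125"],
    "brand":       ["brand1", "brand1", "brand2", "brand2", "brand3"],
    "price":       [20.00, 20.00, 25.00, 25.00, 30.00],
    "comp_id":     [1, 2, 1, 3, 2],
    "comp_price":  [21.00, 20.99, 16.00, 15.99, 29.99],
})

# Distinct SKUs and sellers over all rows of each brand.
totals = df1.groupby("brand").agg(
    sku_count=("item_number", "nunique"),
    total_comps=("comp_id", "nunique"),
)

# Same counts restricted to rows where price exceeds comp_price.
violations = df1[df1["price"] > df1["comp_price"]].groupby("brand").agg(
    vio_count=("item_number", "nunique"),
    total_vios=("comp_id", "nunique"),
)

# Brands with no violating rows get 0 instead of NaN (an assumption).
brand_report = totals.join(violations).fillna(0).astype(int)
brand_report = brand_report[["sku_count", "vio_count", "total_comps", "total_vios"]]
```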
I have data in a pandas DataFrame that requires considerable clean up with functions applied to the DataFrame's 'ID' groups. How does one apply any arbitrary function to manipulate Pandas DataFrame groups? A simplified example of the DataFrame is below:
import pandas as pd
import numpy as np
waypoint_time_string = ['0.5&3.0&6.0' for x in range(10)]
moving_string = ['0 0 0&0 0.1 0&1 1 1.2' for x in range(10)]
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,2,2], 'time':[1,2,3,4,5,1,2,3,4,5],
'X':[0,0,0,0,0,1,1,1,1,1],'Y':[0,0,0,0,0,1,1,1,1,1],'Z':[0,0,0,0,0,1,1,1,1,1],
'waypoint_times':waypoint_time_string,
'moving':moving_string})
I would like to apply the function set_group_positions (defined below) to each 'ID' group of df. I have only been successful looping through the DataFrame. It seems that there must be a more 'Pandas.groupby' way of doing this. Here is an example of my implementation that I'm looking to replace:
sub_frames = []
unique_IDs = df['ID'].unique()
for unique_ID in unique_IDs:
    working_df = df.loc[df['ID']==unique_ID]
    working_df = set_group_positions(working_df)
    sub_frames.append(working_df)
final_df = pd.concat(sub_frames)
And to complete a working example, here are additional helper functions:
def set_x_vel(row):
    return(row['X'] + row['x_movement'])
def set_y_vel(row):
    return(row['Y'] + row['y_movement'])
def set_z_vel(row):
    return(row['Z'] + row['z_movement'])
output_time_list = df['time'].unique().tolist()
#main function to apply to each ID group in the data frame:
def set_group_positions(df): #pass the combined df here
    working_df = df
    times_string = working_df['waypoint_times'].iloc[0]
    times_list = times_string.split('&')
    times_list = [float(x) for x in times_list]
    points_string = working_df['moving']
    points_string = points_string.iloc[0]
    points_list = points_string.split('&')
    points_x = []
    points_y = []
    points_z = []
    for point in points_list:
        point_list = point.split(' ')
        points_x.append(point_list[0])
        points_y.append(point_list[1])
        points_z.append(point_list[2])
    #get corresponding positions for HPAC times,
    #since there could be mismatches
    #each axis accumulates its own movements
    points_x = np.cumsum([float(x) for x in points_x])
    points_y = np.cumsum([float(y) for y in points_y])
    points_z = np.cumsum([float(z) for z in points_z])
    x_interp = np.interp(output_time_list,times_list,points_x).tolist()
    y_interp = np.interp(output_time_list,times_list,points_y).tolist()
    z_interp = np.interp(output_time_list,times_list,points_z).tolist()
    working_df.loc[:,('x_movement')] = x_interp
    working_df.loc[:,('y_movement')] = y_interp
    working_df.loc[:,('z_movement')] = z_interp
    working_df.loc[:,'x_pos'] = working_df.apply(set_x_vel, axis = 1)
    working_df.loc[:,'y_pos'] = working_df.apply(set_y_vel, axis = 1)
    working_df.loc[:,'z_pos'] = working_df.apply(set_z_vel, axis = 1)
    return(working_df)
While my current implementation works, it takes about 20 minutes to run on my real data set, whereas a simple groupby.apply lambda call on the same DataFrame takes only seconds to a minute.
Instead of looping, you can use apply with groupby and a function call:
df = df.groupby('ID').apply(set_group_positions)
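To show the pattern in isolation, here is a minimal sketch with a much simpler per-group function. The frame, `add_offset`, and the interpolation waypoints are all made up for illustration and stand in for set_group_positions. The function copies the group before adding a column, and column arithmetic replaces the row-wise apply helpers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "X":    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

def add_offset(group):
    """Hypothetical per-group step: shift X by a movement interpolated over time."""
    group = group.copy()  # work on a copy rather than the slice handed over by groupby
    movement = np.interp(group["time"], [1, 3], [0.0, 1.0])
    group["x_pos"] = group["X"] + movement  # vectorized, no row-wise apply needed
    return group

# group_keys=False keeps the original row index instead of adding an ID level.
result = df.groupby("ID", group_keys=False)[["time", "X"]].apply(add_offset)

# Attach the new column to the original frame by index.
final_df = df.join(result[["x_pos"]])
```

Selecting the working columns after `groupby` and joining the result back by index keeps the sketch independent of how a given pandas version treats the grouping column inside `apply`.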
I want to make a pandas dataframe with two columns: read_id and score.
I am using the following code:
import operator
import pandas as pd
import Bio.SeqIO
from Bio import pairwise2
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta","fasta"):
    reads_array.append(x)
columns = ["read_id","score"]
df = pd.DataFrame(columns = columns)
df = df.fillna(0)
for x in reads_array:
    alignments=pairwise2.align.globalms("ACTTGAT",str(x.seq),2,-1,-.5,-.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2),reverse = True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df['read_id'] = read_id
    df['score'] = score
But this does not work. Can you suggest a way of generating the dataframe df?
At the top make sure you have
import numpy as np
Then replace the code you shared with
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta", "fasta"):
    reads_array.append(x)
df = pd.DataFrame(np.zeros((len(reads_array), 2)), columns=["read_id", "score"])
for index, x in enumerate(reads_array):
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df.loc[index, 'read_id'] = read_id
    df.loc[index, 'score'] = score
There were two main problems with your original code:
1) Your dataframe had 0 rows
2) df['column_name'] refers to the entire column, not a single cell, so when you execute df['column_name'] = value, all cells in that column get set to that value
df['read_id'] and df['score'] are Series. So if you want to iterate over reads_array, calculate some value, and assign it to df's columns, try the following (df.ix was removed in pandas 1.0, so use df.loc with the row and column together):
for i, x in enumerate(reads_array):
    ...
    df.loc[i, 'read_id'] = read_id
    df.loc[i, 'score'] = score
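An alternative that avoids assigning cell by cell is to collect one record per read and build the DataFrame in a single call. The sketch below uses plain (name, sequence) tuples and a made-up `toy_score` function as stand-ins for the Biopython records and the pairwise2 alignment score, so it runs without Biopython installed.

```python
import pandas as pd

# Stand-ins for parsed reads: (name, sequence) pairs instead of Biopython records.
reads = [("read_1", "ACTTGAT"), ("read_2", "ACTTGTT"), ("read_3", "GGGGGGG")]

def toy_score(query, target):
    """Hypothetical scorer: count of matching positions (not a real alignment)."""
    return sum(a == b for a, b in zip(query, target))

# Collect one dict per read, then build the frame in a single call.
records = [{"read_id": name, "score": toy_score("ACTTGAT", seq)}
           for name, seq in reads]
df = pd.DataFrame(records, columns=["read_id", "score"])
```

In the real loop, the dict for each read would hold x.name and the best alignment score in place of the stand-in values.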