simplify splitting a dataframe into several dataframes - python

So I have some dataframes (df0, df1, df2) with various numbers of rows. I want to split any dataframe that has more than 30 rows into several dataframes of 30 rows each. For example, my dataframe df0 has 156 rows, so I would separate it into several dataframes like this:
if len(df0) > 30:
    df0_A = df0[0:30]
    df0_B = df0[31:60]
    df0_C = df0[61:90]
    df0_D = df0[91:120]
    df0_E = df0[121:150]
    df0_F = df0[151:180]
else:
    df0 = df0
The problem with this approach is that I have to repeat the code exhaustively many times for each subsequent step, like this:
df0 = pd.DataFrame(df0)
df0_A = pd.DataFrame(df0_A)
df0_B = pd.DataFrame(df0_B)
df0_C = pd.DataFrame(df0_C)
df0_D = pd.DataFrame(df0_D)
df0_E = pd.DataFrame(df0_E)
df0_F = pd.DataFrame(df0_F)
df0 = df0.to_string(header=False,
                    index=False,
                    index_names=False).split('\n')
df0_A = df0_A.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_B = df0_B.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_C = df0_C.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_D = df0_D.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_E = df0_E.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_F = df0_F.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0 = [','.join(ele.split()) for ele in df0]
df0_A = [','.join(ele.split()) for ele in df0_A]
df0_B = [','.join(ele.split()) for ele in df0_B]
df0_C = [','.join(ele.split()) for ele in df0_C]
df0_D = [','.join(ele.split()) for ele in df0_D]
df0_E = [','.join(ele.split()) for ele in df0_E]
df0_F = [','.join(ele.split()) for ele in df0_F]
Now imagine I have ten dataframes, each of which needs to be split into five dataframes: I would have to write the same code 50 times!
I'm quite new to Python, so can anyone help me simplify this code, maybe with a simple for loop? Thanks.

You could probably automate it a little bit more, but this should be enough!
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'Test': np.random.randint(100000, 999999, size=180)})
len(df0)

if len(df0) > 30:
    df_dict = {}
    x = 0
    y = 30
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df0_{df_letter}'
        # take the next block of 30 rows and turn it into a list of comma-joined row strings
        df_dict[df_name] = pd.DataFrame(df0[x:y]).to_string(header=False, index=False, index_names=False).split('\n')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 30
        y += 30
else:
    df0
for df in df_dict:
    print(df)
    print('--------------------------------------------------------------------')
    print(f'length: {len(df_dict[df])}')
    print('--------------------------------------------------------------------')
    print(df_dict[df])
    print('--------------------------------------------------------------------')
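Once the loop has run, each chunk can be pulled out of the dictionary by the name that was generated for it, for example (the variable names on the left are just for illustration):
chunk_a = df_dict['df0_A']   # rows 0-29 of df0, as a list of comma-joined row strings
chunk_b = df_dict['df0_B']   # rows 30-59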

Assuming you have one column for identification:
import numpy as np
import pandas as pd

def split_df(idf, idcol, nsize):
    g = idf.groupby(idcol)
    # Compute the size for each value of the identification column
    size = g.size()
    dflist = []
    for _id, _idcount in size.items():
        if _idcount > nsize:
            # print(_id, ' = ', _idcount)
            idx = idf[idf[idcol].eq(_id)].index
            # print(idx)
            # let's split the index into equal parts of `nsize`
            # e.g. [1,2,3,4,5] with nsize = 2 will split into ([1,2], [3,4], [5])
            ilist = np.array_split(idx, round(idx.shape[0] / nsize + 0.5))
            dflist += ilist
    return [idf.loc[idx].copy(deep=True) for idx in dflist]

df = pd.DataFrame(data=np.hstack((np.random.choice(np.arange(1, 3), 10).reshape(10, -1),
                                  np.random.rand(10, 3))),
                  columns=['id', 'a', 'b', 'c'])
df = df.astype({'id': np.int64})
split_df(df, 'id', 2)
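The function returns a plain Python list of DataFrame chunks, so the call above can be captured and iterated; a small usage sketch:
chunks = split_df(df, 'id', 2)
for i, chunk in enumerate(chunks):
    print(i, len(chunk))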

This is a great problem. You can use this (data is the DataFrame here):
# Create subsets of size 30 for the DataFrame
subsets = list(range(0, len(data), 30))
# Create start cutoffs for subsets of the DataFrame
start_cutoff = subsets
# Create end cutoffs for subsets of the DataFrame
end_cutoff = subsets[1:] + [len(data)]
# Zip the start cutoffs and end cutoffs into a List of Cutoffs
cutoffs = list(zip(start_cutoff, end_cutoff))
# List containing Splitted Dataframes
list_dfs = [data.iloc[cutoff[0]: cutoff[-1]] for cutoff in cutoffs]
# convert list to string DFs
string_dfs = [df.to_string(header=False, index=False, index_names=False).split('\n') for df in list_dfs]
final_df_list = [[','.join(ele.split()) for ele in string_df] for string_df in string_dfs]
Now you can access each split (a list of comma-joined row strings) by:
print(final_df_list[0])
print(final_df_list[1])
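If plain DataFrame chunks are wanted rather than the comma-joined string form, list_dfs built above already holds them:
print(list_dfs[0])   # first chunk of up to 30 rows, still a DataFrame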


Best Practice for Adding Lots of Columns to Pandas DataFrame

I am trying to add many columns to a pandas dataframe as follows:
def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] + ...
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)

for col in sum_cols_list:
    create_sum_rounds(df, col)
Where sum_cols_list is a list of ~200 base column names (e.g. "foo"), and df is a pandas dataframe which includes the base columns extended with 1 through 5 (e.g. "foo_1", "foo_2", ..., "foo_5").
I'm getting a performance warning when I run this snippet:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
I believe this is because creating a new column is actually calling an insert operation behind the scenes. What's the right way to use pd.concat in this case?
You can use your same approach, but instead of operating directly on the DataFrame, you'll need to store each output as its own pd.Series. Then when all of the computations are done, use pd.concat to glue everything back to your original DataFrame.
(untested, but should work)
import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] + ...
    '''
    out = pd.Series(0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)
    return out

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))

new_df = pd.concat([df, *col_sums], axis=1)
Additionally, you can simplify your existing code (if you're willing to forego your logging)
import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] + ...
    '''
    return df.filter(regex=rf'{col_name_base}_\d+').sum(axis=1)

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))

new_df = pd.concat([df, *col_sums], axis=1)
Simplify :-)
def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] + ...
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = df.loc[:, [x for x in df.columns if x.startswith(col_name_base)]].sum(axis=1)
Would this get you the results you are expecting?
df = pd.DataFrame({
    'Foo_1': [1, 2, 3, 4, 5],
    'Foo_2': [10, 20, 30, 40, 50],
    'Something': ['A', 'B', 'C', 'D', 'E']
})
df['Foo_Sum'] = df.filter(like='Foo_').sum(axis=1)
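For the small frame above, this should produce the following (expected output, roughly formatted):
print(df)
#    Foo_1  Foo_2 Something  Foo_Sum
# 0      1     10         A       11
# 1      2     20         B       22
# 2      3     30         C       33
# 3      4     40         D       44
# 4      5     50         E       55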

Is there a pandas way to compute a function between 2 columns?

I am looking for a faster way to compute some kind of function across multiple columns.
My dataframe looks like:
c = 12 * 1000
b = int(c / 2)
d = int(b / 2)
newdf = {'Class': ['c1'] * c + ['c2'] * c + ['c3'] * c,
         'Section': ['A'] * b + ['B'] * b + ['C'] * b + ['D'] * b + ['E'] * b + ['F'] * b,
         'Time': [1, 2, 3, 4, 5, 6] * d + [3, 1, 3, 4, 5, 7] * d}
test = pd.DataFrame(newdf)
test['f_x'] = test['Time'] ** 2 / 5
test['f_x_2'] = test['Time'] ** 2 / 5 + test['f_x']
# working with 1 column
test['section_mean'] = test.groupby(['Class', 'Section'])['f_x'].transform(lambda x: x.mean())
test['two_col_sum'] = test[['Time', 'f_x']].apply(lambda x: x.Time + x.f_x, axis=1)
cols = ['f_x', 'f_x_2']
I know how to calculate, for example, a per-group value for a single column:
test['section_mean'] = test.groupby(['Class','Section'])['f_x'].transform(lambda x: x.mean())
Or do simple operations between two or more columns:
test['two_col_sum'] = test[['Time','f_x']].apply(lambda x: x.Time+x.f_x,axis=1)
However, what I'm trying to do is a computation that uses the full column of each group:
%%time
slopes_df = pd.DataFrame()
grouped = test.groupby(['Class', 'Section'])
for name, group in grouped:
    nd = []
    for col in cols:
        ntest = group[['Time', col]]
        x = ntest.Time
        y = ntest[col]
        f = np.polyfit(x, y, deg=1).round(2)
        data = [name[0], name[1], col, f[0], f[1]]
        nd.append(data)
    slopes_df = pd.concat([slopes_df, pd.DataFrame(nd)])
slopes_df.columns = ['Class', 'Section', 'col', 'slope', 'intercept']
slopes_df_p = pd.pivot_table(data=slopes_df, index=['Class', 'Section'],
                             columns=['col'], values=['slope', 'intercept']).reset_index()
slopes_df_p.columns = pd.Index([e[0] if e[0] in ['Class', 'Section'] else e[0] + '_' + e[1]
                                for e in slopes_df_p.columns])
fdf = pd.merge(test, slopes_df_p, on=['Class', 'Section'])
I tried the proposed solution this way:
%%time
for col in cols:
    df1 = (test.groupby(['Class', 'Section'])
           .apply(lambda x: np.polyfit(x['Time'], x[col], deg=1).round(2)[0])
           .rename('slope_' + str(col)))
    df2 = (test.groupby(['Class', 'Section'])
           .apply(lambda x: np.polyfit(x['Time'], x[col], deg=1).round(2)[1])
           .rename('intercept_' + str(col)))
    df1['col'] = col
    df2['col'] = col
    test = pd.merge(test, df1, on=['Class', 'Section'])
    test = pd.merge(test, df2, on=['Class', 'Section'])
but it seems slower: on my PC the first loop takes 150 ms and the second takes 300 ms.
Andrea
Your loop solution does not work on the data of each group, so I think you need GroupBy.apply:
def f(x):
    for col in cols:
        x[f'slope_{col}'], x[f'intercept_{col}'] = np.polyfit(x['Time'], x[col], deg=1).round(2)
    return x

df1 = test.groupby(['Class', 'Section']).apply(f)
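Since f writes the fitted coefficients onto every row of its group, one coefficient table per (Class, Section) can be read back afterwards, for example (using the column names generated above):
coeffs = (df1[['Class', 'Section',
               'slope_f_x', 'intercept_f_x',
               'slope_f_x_2', 'intercept_f_x_2']]
          .drop_duplicates())
print(coeffs)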

Apply function to manipulate Python Pandas DataFrame group

I have data in a pandas DataFrame that requires considerable clean up with functions applied to the DataFrame's 'ID' groups. How does one apply any arbitrary function to manipulate Pandas DataFrame groups? A simplified example of the DataFrame is below:
import pandas as pd
import numpy as np

waypoint_time_string = ['0.5&3.0&6.0' for x in range(10)]
moving_string = ['0 0 0&0 0.1 0&1 1 1.2' for x in range(10)]
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'time': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'X': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                   'Y': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                   'Z': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                   'waypoint_times': waypoint_time_string,
                   'moving': moving_string})
I would like to apply the function set_group_positions (defined below) to each 'ID' group of df. I have only been successful looping through the DataFrame. It seems that there must be a more 'Pandas.groupby' way of doing this. Here is an example of my implementation that I'm looking to replace:
sub_frames = []
unique_IDs = df['ID'].unique()
for unique_ID in unique_IDs:
    working_df = df.loc[df['ID'] == unique_ID]
    working_df = set_group_positions(working_df)
    sub_frames.append(working_df)
final_df = pd.concat(sub_frames)
And to complete a working example, here are additional helper functions:
def set_x_vel(row):
    return row['X'] + row['x_movement']

def set_y_vel(row):
    return row['Y'] + row['y_movement']

def set_z_vel(row):
    return row['Z'] + row['z_movement']

output_time_list = df['time'].unique().tolist()

# main function to apply to each ID group in the data frame:
def set_group_positions(df):  # pass the combined df here
    working_df = df
    times_string = working_df['waypoint_times'].iloc[0]
    times_list = times_string.split('&')
    times_list = [float(x) for x in times_list]
    points_string = working_df['moving']
    points_string = points_string.iloc[0]
    points_list = points_string.split('&')
    points_x = []
    points_y = []
    points_z = []
    for point in points_list:
        point_list = point.split(' ')
        points_x.append(point_list[0])
        points_y.append(point_list[1])
        points_z.append(point_list[2])
    # get corresponding positions for HPAC times,
    # since there could be mismatches
    points_x = np.cumsum([float(x) for x in points_x])
    points_y = np.cumsum([float(x) for x in points_y])
    points_z = np.cumsum([float(x) for x in points_z])
    x_interp = np.interp(output_time_list, times_list, points_x).tolist()
    y_interp = np.interp(output_time_list, times_list, points_y).tolist()
    z_interp = np.interp(output_time_list, times_list, points_z).tolist()
    working_df.loc[:, 'x_movement'] = x_interp
    working_df.loc[:, 'y_movement'] = y_interp
    working_df.loc[:, 'z_movement'] = z_interp
    working_df.loc[:, 'x_pos'] = working_df.apply(set_x_vel, axis=1)
    working_df.loc[:, 'y_pos'] = working_df.apply(set_y_vel, axis=1)
    working_df.loc[:, 'z_pos'] = working_df.apply(set_z_vel, axis=1)
    return working_df
While my current implementation works, on my real data set, it takes about 20 minutes for me to run, where a simple groupby.apply lambda call on my DataFrame takes only seconds to a minute.
Instead of looping, you can use apply with groupby and a function call:
df = df.groupby('ID').apply(set_group_positions)
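One caveat, depending on the pandas version: groupby().apply() may add the group key as an extra index level on the result. If the original flat index should be kept, group_keys=False is one way to do it (a small sketch):
df = df.groupby('ID', group_keys=False).apply(set_group_positions)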

Summarizing data in pandas dataframe

I have a dataframe that looks like:
respondent_id,group_number,member_id
1,1,3
1,1,4
1,2,1
....
My goal is to output two counts for each respondent ID; the number of groups that include themselves as a member ID, and those which don't.
For example, the above table would output:
respondent_id,my_groups,other_groups
1,1,1
My best guess is to do something like:
rg_g = df.groupby(['respondent_id','group_number'])
rg_g.apply(lambda g: g.respondent_id in g.id.values)
But I don't know where to go from there.
Updated answer (it is not the best code, but it works):
Initialization:
test_data = pd.DataFrame(np.random.randint(5, size=(10, 3)),
                         columns=['respondent_id', 'group_number', 'member_id'])
# blank out member_id for a few rows (use .loc to avoid chained assignment)
test_data.loc[[3, 5, 7, 8, 9], 'member_id'] = None
Code:
# calculate the groups where the respondent has a member_id
d_nn = test_data[test_data.member_id.notnull()]
# or for example: test_data[test_data.member_id != 0]
d_is_n = test_data[test_data.member_id.isnull()]
d_nn = pd.DataFrame({'count': d_nn.groupby(["respondent_id", "group_number"]).size()}).reset_index()
d_is_n = pd.DataFrame({'count': d_is_n.groupby(["respondent_id", "group_number"]).size()}).reset_index()
d_nn['is_member'] = 1
d_is_n['is_member'] = 0
# merge
result = d_nn.copy()
for idx1 in range(len(d_is_n)):
    merge = True
    for idx2 in range(len(d_nn)):
        if d_nn.iloc[idx2].respondent_id == d_is_n.iloc[idx1].respondent_id and \
           d_nn.iloc[idx2].group_number == d_is_n.iloc[idx1].group_number:
            merge = False
    if merge:
        temp_d = d_is_n.iloc[idx1]
        result = pd.concat([result, temp_d.to_frame().T], ignore_index=True)
# group by respondent_id and is_member
result = pd.DataFrame({'group_number': result.groupby(["respondent_id", "is_member"]).size()}).reset_index()
print(result)
So, here's what I ended up doing. Maybe not ideal, but it seems to work. :)
import pandas as pd
rg = pd.read_csv('./in_file.csv')
rg_g = rg.groupby(['respondent_id','group_number'])
in_g = rg_g.filter(lambda g: g.respondent_id in g.id.values)
out_g = rg_g.filter(lambda g: g.respondent_id not in g.id.values)
my_count = in_g.groupby('respondent_id').group_number.nunique()
other_count = out_g.groupby('respondent_id').group_number.nunique()
pd.concat([my_count,other_count], axis=1).to_csv('./out_file.csv')
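A more vectorized sketch of the same idea, assuming the member column is called member_id as in the sample rows (the code above refers to it as id): flag each row as belonging to the respondent's own group, reduce per group, then count distinct groups per flag.
flags = (rg.assign(is_self=rg['respondent_id'] == rg['member_id'])
           .groupby(['respondent_id', 'group_number'])['is_self'].any()
           .reset_index())
counts = (flags.groupby(['respondent_id', 'is_self'])['group_number']
               .nunique()
               .unstack(fill_value=0)
               .rename(columns={True: 'my_groups', False: 'other_groups'}))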

appending to a pandas dataframe

I want to make a pandas dataframe with two columns: read_id and score.
I am using the following code:
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta", "fasta"):
    reads_array.append(x)

columns = ["read_id", "score"]
df = pd.DataFrame(columns=columns)
df = df.fillna(0)

for x in reads_array:
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df['read_id'] = read_id
    df['score'] = score
But this does not work. Can you suggest a way of generating the dataframe df?
At the top make sure you have
import numpy as np
Then replace the code you shared with
reads_array = []
for x in Bio.SeqIO.parse("inp.fastq", "fastq"):
    reads_array.append(x)

df = pd.DataFrame(np.zeros((len(reads_array), 2)), columns=["read_id", "score"])

for index, x in enumerate(reads_array):
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df.loc[index, 'read_id'] = read_id
    df.loc[index, 'score'] = score
There were two main problems with your original code:
1) Your dataframe had 0 rows
2) df['column_name'] refers to the entire column, not a single cell, so when you execute df['column_name'] = value, all cells in that column get set to that value
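A two-line illustration of that broadcasting behaviour (made-up values, just to show the effect):
demo = pd.DataFrame({'read_id': ['a', 'b'], 'score': [0, 0]})
demo['score'] = 5   # every row of 'score' becomes 5, not just one cell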
df['read_id'] and df['score'] are Series. So if you want to iterate over reads_array, calculate some value, and then assign it to df's columns, try the following:
for i, x in enumerate(reads_array):
    ...
    df.loc[i, 'read_id'] = read_id
    df.loc[i, 'score'] = score
