I am trying to add many columns to a pandas dataframe as follows:
def create_sum_rounds(df, col_name_base):
    """Add a 'sum_<base>' column to df holding the row-wise sum of
    df['<base>_1'] .. df['<base>_5'].

    For example, create_sum_rounds(df, 'foo') computes
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + ... + df['foo_5'].
    Missing round columns are logged and skipped. Mutates df in place.
    """
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0
    for i in range(1, 6):
        # The base names come without a separator ('foo' -> columns 'foo_1'..),
        # so the underscore must be inserted here; the original
        # col_name_base + str(i) built 'foo1' and never matched a real column.
        col_name = f'{col_name_base}_{i}'
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            # Lazy %-args: only formatted if the record is actually emitted.
            logger.error('Col %s not in df', col_name)
# Driver: build one 'sum_<base>' column per entry of sum_cols_list
# (~200 base names); each call inserts a new column into df in place,
# which is what triggers the fragmentation warning discussed below.
for col in sum_cols_list:
    create_sum_rounds(df, col)
Where sum_cols_list is a list of ~200 base column names (e.g. "foo"), and df is a pandas dataframe which includes the base columns extended with 1 through 5 (e.g. "foo_1", "foo_2", ..., "foo_5").
I'm getting a performance warning when I run this snippet:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
I believe this is because creating a new column is actually calling an insert operation behind the scenes. What's the right way to use pd.concat in this case?
You can use your same approach, but instead of operating directly on the DataFrame, you'll need to store each output as its own pd.Series. Then when all of the computations are done, use pd.concat to glue everything back to your original DataFrame.
(untested, but should work)
import pandas as pd
def create_sum_rounds(df, col_name_base):
    """Return a pd.Series named 'sum_<base>' holding the row-wise sum of
    df['<base>_1'] .. df['<base>_5'].

    Nothing is written to df here, so no per-column insert (and no
    fragmentation) happens; callers collect the Series and concat once.
    Missing round columns are logged and skipped.
    """
    # Float zeros so the accumulator dtype matches the in-place version.
    out = pd.Series(0.0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        # Insert the '_' separator: base 'foo' must match column 'foo_1'.
        col_name = f'{col_name_base}_{i}'
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df', col_name)
    return out
# Compute every summed column as a standalone Series first...
col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
# ...then glue them onto df in one concat instead of ~200 inserts.
new_df = pd.concat([df, *col_sums], axis=1)
Additionally, you can simplify your existing code (if you're willing to forego your logging)
import pandas as pd
def create_sum_rounds(df, col_name_base):
    """Return a pd.Series named 'sum_<base>': the row-wise sum of every
    column whose name matches '<base>_<digits>' exactly.

    The pattern is a raw string (bare '\\d' in a normal string is a
    SyntaxWarning on Python 3.12+), anchored and escaped so that base
    'foo' cannot match 'barfoo_1' or 'foo_1x' (df.filter applies
    re.search, which is unanchored by default).
    """
    import re  # local import keeps this snippet self-contained
    pattern = rf'^{re.escape(col_name_base)}_\d+$'
    # Name the result so a later pd.concat produces a labelled column
    # instead of integer labels 0..N.
    return df.filter(regex=pattern).sum(axis=1).rename('sum_' + col_name_base)
# Same pattern as above: gather the per-base sums, then one concat.
col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
new_df = pd.concat([df, *col_sums], axis=1)
Simplify :-)
def create_sum_rounds(df, col_name_base):
    """Add df['sum_<base>'] = row-wise sum of every '<base>_*' column.

    For example, df['sum_foo'] = df['foo_1'] + ... + df['foo_5'].
    Mutates df in place.
    """
    out_name = 'sum_' + col_name_base
    # Match on '<base>_' rather than bare startswith(col_name_base):
    # base 'foo' must not pick up unrelated columns such as 'foobar_1'.
    prefix = col_name_base + '_'
    round_cols = [x for x in df.columns if x.startswith(prefix)]
    df[out_name] = df.loc[:, round_cols].sum(axis=1)
Would this get you the results you are expecting?
# Demo frame: two numeric 'Foo_*' columns plus one unrelated text column.
data = {
    'Foo_1': [1, 2, 3, 4, 5],
    'Foo_2': [10, 20, 30, 40, 50],
    'Something': list('ABCDE'),
}
df = pd.DataFrame(data)
# Row-wise total across every column whose name contains 'Foo_'.
df['Foo_Sum'] = df.filter(like='Foo_').sum(axis=1)
Related
I'm having trouble with my code. I just want to create new columns by dividing each column by survival_time, and then add the new columns (named clv_mean_xxx) to the main dataframe.
Here is my code:
list_ib = ['actor_masterylevel', 'churn_yn', 'old_value2_num', 'old_value3_num','old_value4_num', 'time']
for i in list_ib:
    for j in list_ib:
        # On the first inner iteration i == j, so the loop breaks immediately
        # for the first feature and only runs the else-branch for later ones.
        if i == j:
            break
        else:
            # BUG: multiplies (the stated goal was to divide by survival_time)
            # and rebinds the name df to a Series, destroying the DataFrame
            # for every subsequent iteration.
            df = df[i] * df['survival_time']
            # BUG: df is now a Series, so this no longer adds a column to a
            # DataFrame.
            df['clv_' + str(i) + '_' + str(j)] = df
If I understand the requirement, this should work
# One derived column per feature: its per-row ratio to survival_time.
for feature in list_ib:
    df['clv_mean_' + feature] = df[feature] / df['survival_time']
I'm trying to apply Pandas style to my dataset and add a column with a string with the matching result.
This is what I want to achieve:
Link
Below is my code. An expert from Stack Overflow helped me apply df.style, so I believe the df.style part is correct based on my testing. However, how can I iterate over the rows, check the cell in each column, and return/store a string in the new column 'check'? Thank you so much. I'm trying to debug but I'm not able to display what I want.
# Toy frame: three numeric columns to style, plus a 'check' column that
# will hold a per-row description of which thresholds were exceeded.
df = pd.DataFrame([[10,3,1], [3,7,2], [2,4,4]], columns=list("ABC"))
df['check'] = None
def highlight(x):
    # Style callback for df.style.apply(..., axis=None): x is the whole frame.
    c1 = 'background-color: yellow'
    # Column-wise boolean mask of the three threshold conditions (A/B/C only).
    m = pd.concat([(x['A'] > 6), (x['B'] > 2), (x['C'] < 3)], axis=1)
    # Empty-style frame, then paint the cells where m is True.
    # NOTE(review): m lacks the 'check' column while df1 has it; mask() will
    # align the two and the handling of the unmatched column is pandas-version
    # dependent -- confirm it leaves 'check' unstyled rather than raising.
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    return df1.mask(m, c1)
def check(v):
    # BUG: A, B and C are bare names, not the strings 'A'/'B'/'C' -- the first
    # call raises NameError ("name 'A' is not defined"), exactly the error
    # reported below.
    # BUG: each `return` exits after the first matching condition, so at most
    # one of the three messages can ever be produced per row.
    # NOTE(review): with axis=1, v is a single row (a Series); v[['A']] would
    # be a one-element Series, so iterrows() is the wrong iterator here.
    for index, row in v[[A]].iterrows():
        if row[A] > 6:
            A_check = f'row:{index},' + '{0:.1f}'.format(row[A]) + ">6"
            return A_check
    for index, row in v[[B]].iterrows():
        if row[B] > 2:
            B_check = f'row:{index}' + '{0:.1f}'.format(row[B]) + ">2"
            return B_check
    for index, row in v[[C]].iterrows():
        if row[C] < 3:
            C_check = f'row:{index}' + '{0:.1f}'.format(row[C]) + "<3"
            return C_check
# Fill 'check' row by row, then render the frame with the highlight styles.
df['check'] = df.apply(lambda v: check(v), axis=1)
df.style.apply(highlight, axis=None)
This is the error message I got:
NameError: name 'A' is not defined
My understanding is that the following produces what you are trying to achieve with the check function:
def check(v):
    """Describe which of the A/B/C thresholds row v violates.

    v is one row of the frame (a Series; v.name is the row label).
    Returns the matching descriptions joined with newlines, '' if none.
    """
    prefix = 'row:{}, '.format(v.name)
    rules = [('A', '>', 6), ('B', '>', 2), ('C', '<', 3)]
    notes = []
    for col, op, bound in rules:
        val = v[col]
        hit = (val > bound) if op == '>' else (val < bound)
        if hit:
            notes.append(prefix + '{:.1f}'.format(val) + op + str(bound))
    return '\n'.join(notes)
df['check'] = df.apply(check, axis=1)
Result (print(df)):
A B C check
0 10 3 1 row:0, 10.0>6\nrow:0, 3.0>2\nrow:0, 1.0<3
1 3 7 2 row:1, 7.0>2\nrow:1, 2.0<3
2 2 4 4 row:2, 4.0>2
(Replace \n with ' ' if you don't want the line breaks in the result.)
The axis=1 option in apply gives the function check one row of df as a Series with the column names of df as index (-> v). With v.name you'll get the corresponding row index. Therefore I don't see the need to use .iter.... Did I miss something?
There are a few mistakes in the program, which we will fix one by one.
Import pandas
import pandas as pd
In function check(v): var A, B, C are not defined, replace them with 'A', 'B', 'C'. Then v[['A']] will become a series, and to iterate in series we use iteritems() and not iterrows, and also index will be column name in series. Replacing will give
def check(v):
    """Collect threshold-violation messages for row v ('A' > 6, 'B' > 2,
    'C' < 3) and return them newline-joined ('' if nothing matched)."""
    truth = []
    # Series.iteritems() was deprecated in pandas 1.5 and removed in 2.0;
    # .items() yields the same (label, value) pairs.
    for index, row in v[['A']].items():
        if row > 6:
            A_check = f'row:{index},' + '{0:.1f}'.format(row) + ">6"
            truth.append(A_check)
    for index, row in v[['B']].items():
        if row > 2:
            B_check = f'row:{index}' + '{0:.1f}'.format(row) + ">2"
            truth.append(B_check)
    for index, row in v[['C']].items():
        if row < 3:
            C_check = f'row:{index}' + '{0:.1f}'.format(row) + "<3"
            truth.append(C_check)
    return '\n'.join(truth)
This should give the expected output, although you also need to add extra logic so that the check column doesn't get the yellow color. This answer has minimal changes, but I recommend trying axis=1 to apply the style column-wise, as it seems more convenient. Also, you can refer to the style guide.
So I have some dataframes (df0, df1, df2) with various numbers of rows. I want to split any dataframe that has more than 30 rows into several dataframes of 30 rows each. For example, my dataframe df0 has 156 rows, so I would separate it into several dataframes like this:
# Manually slice df0 into 30-row pieces.
# NOTE(review): pandas/Python slices are half-open, so df0[0:30] is rows
# 0-29 and the next piece should start at 30; starting at 31 (then 61, 91,
# 121, 151) silently drops rows 30, 60, 90, 120 and 150.
if len(df0) > 30:
    df0_A = df0[0:30]
    df0_B = df0[31:60]
    df0_C = df0[61:90]
    df0_D = df0[91:120]
    df0_E = df0[121:150]
    df0_F = df0[151:180]
else:
    df0= df0  # no-op branch kept from the original
The problem with this code is that I need to repeat the code exhaustively many times for the next code like this:
# Re-wrap each slice as a DataFrame (redundant -- the slices already are
# DataFrames -- but kept to preserve the original flow).
df0 = pd.DataFrame(df0)
df0_A = pd.DataFrame(df0_A)
df0_B = pd.DataFrame(df0_B)
df0_C = pd.DataFrame(df0_C)
df0_D = pd.DataFrame(df0_D)
df0_E = pd.DataFrame(df0_E)
df0_F = pd.DataFrame(df0_F)
# Render each piece as a list of one string per row.
df0 = df0.to_string(header=False, index=False, index_names=False).split('\n')
df0_A = df0_A.to_string(header=False, index=False, index_names=False).split('\n')
df0_B = df0_B.to_string(header=False, index=False, index_names=False).split('\n')
df0_C = df0_C.to_string(header=False, index=False, index_names=False).split('\n')
df0_D = df0_D.to_string(header=False, index=False, index_names=False).split('\n')
df0_E = df0_E.to_string(header=False, index=False, index_names=False).split('\n')
# Fixed: the source read `idUGS0_F.to_string(...)` -- a typo for df0_F that
# raised NameError.
df0_F = df0_F.to_string(header=False, index=False, index_names=False).split('\n')
# Collapse the column padding of each rendered row into comma separators.
df0 = [','.join(ele.split()) for ele in df0]
df0_A = [','.join(ele.split()) for ele in df0_A]
df0_B = [','.join(ele.split()) for ele in df0_B]
df0_C = [','.join(ele.split()) for ele in df0_C]
df0_D = [','.join(ele.split()) for ele in df0_D]
df0_E = [','.join(ele.split()) for ele in df0_E]
df0_F = [','.join(ele.split()) for ele in df0_F]
Now imagine I have ten dataframes and need to split each into five dataframes — then I would have to repeat the same code 50 times!
I'm quite new to Python. So, can anyone help me with how to simplify this code, maybe with simple for loop? thanks
You could probably automate it a little bit more, but this should be enough!
import copy  # kept from the original file; no longer used by this snippet
import numpy as np

# Demo input: one integer column, 180 rows (six full chunks of 30).
df0 = pd.DataFrame({'Test' : np.random.randint(100000,999999,size=180)})

# Always define the dict so the report loop below cannot hit a NameError
# when df0 has 30 rows or fewer.
df_dict = {}
if len(df0) > 30:
    x = 0
    y = 30
    for df_letter in ['A','B','C','D','E','F']:
        df_name = f'df0_{df_letter}'
        # The original copy.deepcopy(df_letter) of a one-character string was
        # dead code: the entry was immediately overwritten on the next line.
        # Split on '\n' (not '\n ') so every rendered row becomes its own
        # entry; the stray space made the split miss most line breaks.
        rows = pd.DataFrame(df0[x:y]).to_string(header=False, index=False, index_names=False).split('\n')
        df_dict[df_name] = [','.join(ele.split()) for ele in rows]
        x += 30
        y += 30

for df in df_dict:
    print(df)
    print('--------------------------------------------------------------------')
    print(f'length: {len(df_dict[df])}')
    print('--------------------------------------------------------------------')
    print(df_dict[df])
    print('--------------------------------------------------------------------')
Assuming you have one column for identification,
def split_df(idf, idcol, nsize):
    """Split idf into per-id chunks of at most nsize rows.

    Groups rows by the identification column idcol; any group with more
    than nsize rows is cut into near-equal pieces of at most nsize rows.
    Returns a list of independent DataFrame copies.
    """
    g = idf.groupby(idcol)
    # Row count for each value of the identification column.
    size = g.size()
    dflist = []
    # Series.iteritems() was removed in pandas 2.0; .items() is equivalent.
    for _id, _idcount in size.items():
        idx = idf[ idf[idcol].eq(_id) ].index
        if _idcount > nsize:
            # Ceil division: e.g. 6 rows with nsize=2 -> 3 parts of 2.
            # (The old round(count/nsize + 0.5) hit banker's rounding on
            # exact .5 values -- round(3.5) == 4 -- producing extra,
            # undersized parts.)
            nparts = -(-_idcount // nsize)
            dflist += list(np.array_split(idx, nparts))
        else:
            # Small groups were silently dropped before; keep them whole.
            dflist.append(idx)
    return [idf.loc[idx].copy(deep=True) for idx in dflist]
# Demo: 10 rows with a random id in {1, 2} and three random float columns.
df = pd.DataFrame(data=np.hstack((np.random.choice(np.arange(1,3), 10).reshape(10, -1), np.random.rand(10,3))), columns=['id', 'a', 'b', 'c'])
df = df.astype({'id': np.int64})
# The helper defined above is split_df; the original called the undefined
# name `split`, which raised NameError.
split_df(df, 'id', 2)
This is a great problem, you can use this (data is the DataFrame here):
# Create subset start positions, one every 30 rows of the DataFrame
subsets = list(range(0, len(data), 30))
# Start cutoffs for the subsets of the DataFrame
start_cutoff = subsets
# End cutoffs: each start's successor, with len(data) closing the last chunk
end_cutoff = subsets[1:] + [len(data)]
# Pair the start and end cutoffs into (start, end) tuples
cutoffs = list(zip(start_cutoff, end_cutoff))
# List containing the split DataFrames
list_dfs = [data.iloc[start:end] for start, end in cutoffs]
# Render each DataFrame as a list of its row strings
string_dfs = [df.to_string(header=False, index=False, index_names=False).split('\n') for df in list_dfs]
# Keep one entry per DataFrame (a list of comma-joined rows). The original
# single flat comprehension merged every row of every frame into one list,
# so indexing it no longer corresponded to whole DataFrames.
final_df_list = [[','.join(ele.split()) for ele in string_df] for string_df in string_dfs]
Now you can access the DataFrames by:
# Inspect the first two entries of the split output.
print(final_df_list[0])
print(final_df_list[1])
Hi there. I am working on an application and I am using this piece of code to create new columns in a data frame so I can make some calculations. However, it is really slow, so I would like to try a new approach.
I have read about Multiprocessing, but I am not sure how and where to use it, so I am asking for your help.
def create_exposed_columns(df):
    """Add one indicator column per calendar month ('YYYY-MM') covered by
    each row's [INITIAL_DATE, FINAL_DATE] span, valued 1 where the row is
    exposed in that month.

    Also adds MONTH_INITIAL_DATE / MONTH_FINAL_DATE / Diff helper columns.
    Returns a new DataFrame (index reset) with the original columns plus
    the month indicators; months a row does not cover are left as NaN.
    """
    df['MONTH_INITIAL_DATE'] = df['INITIAL_DATE'].dt.to_period('M')
    df['MONTH_FINAL_DATE'] = df['FINAL_DATE'].dt.to_period('M')
    # Period subtraction yields a month offset; its .n is the integer count.
    df['Diff'] = df['MONTH_FINAL_DATE'] - df['MONTH_INITIAL_DATE']
    list_1 = []
    for index, row in df.iterrows():
        valor = 1
        initial_date = row['INITIAL_DATE']
        # The original iterated range(meses_iterables + 1) over an undefined
        # global and never used the computed difference; the row's own month
        # span is what the loop should cover.
        n_months = row['Diff'].n
        temporal_list = {}
        list_1.append(temporal_list)
        for i in range(n_months + 1):
            # pd.DateOffset(months=i) advances by calendar months, matching
            # the original dateutil relativedelta(months=+1 * i).
            date = initial_date + pd.DateOffset(months=i)
            # Zero-pad the month so keys are always 'YYYY-MM'.
            temporal_list[f'{date.year}-{date.month:02d}'] = valor
    df_2 = pd.DataFrame(list_1)
    df = df.reset_index()
    df = pd.concat([df, df_2], axis=1)
    return df
I have no idea where to start, so any kind of help will be useful.
Thanks
I want to make a pandas dataframe with two columns: read_id and score.
I am using the following code :
reads_array = []
# Load every FASTA record into memory.
for x in Bio.SeqIO.parse("inp.fasta","fasta"):
    reads_array.append(x)
columns = ["read_id","score"]
# BUG: this frame is created with zero rows; fillna(0) on an empty frame is
# a no-op and there are no cells for the loop below to fill.
df = pd.DataFrame(columns = columns)
df = df.fillna(0)
for x in reads_array:
    # Align the read against "ACTTGAT"; element 2 of each tuple is the score.
    alignments=pairwise2.align.globalms("ACTTGAT",str(x.seq),2,-1,-.5,-.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2),reverse = True)
    read_id = x.name
    score = sorted_alignments[0][2]
    # BUG: df['col'] = scalar assigns the whole column, not one cell, so each
    # iteration overwrites the entire (still empty) column with one value.
    df['read_id'] = read_id
    df['score'] = score
But this does not work. Can you suggest a way of generating the dataframe df
At the top make sure you have
import numpy as np
Then replace the code you shared with
# Materialize the FASTQ records so the row count is known up front.
reads_array = list(Bio.SeqIO.parse("inp.fastq", "fastq"))
# Pre-sized frame: one row per read, so .loc[row, col] targets real cells.
df = pd.DataFrame(np.zeros((len(reads_array), 2)), columns=["read_id", "score"])
for index, record in enumerate(reads_array):
    alignments = pairwise2.align.globalms("ACTTGAT", str(record.seq), 2, -1, -.5, -.1)
    # Highest-scoring alignment; the score sits at position 2 of each tuple.
    best = max(alignments, key=operator.itemgetter(2))
    df.loc[index, 'read_id'] = record.name
    df.loc[index, 'score'] = best[2]
The main problem with your original code was two things:
1) Your dataframe had 0 rows
2) df['column_name'] refers to the entire column, not a single cell, so when you execute df['column_name'] = value, all cells in that column get set to that value
df['read_id'] and df['score'] is Series. So if you want to iterate reads_array and calculate some value, then assign it to df's columns, try following:
for i, x in enumerate(reads_array):
    ...
    # .ix was deprecated in pandas 0.20 and removed in 1.0; worse,
    # df.ix[i]['col'] = value is chained assignment that writes to a
    # temporary copy, never to the frame. .loc[row, col] assigns in place.
    df.loc[i, 'read_id'] = read_id
    df.loc[i, 'score'] = score