Pandas: use apply to create 2 new columns - python

I have a dataset where column a holds the number of values contained in each of the columns e, i, d and t, which are strings of values separated by "-":
   a              e                i            d        t
0  4  40-80-120-150  0.5-0.3-0.2-0.2  30-32-30-32  1-1-1-1
1  4    40-40-40-40  0.1-0.1-0.1-0.1  18-18-18-18  1-2-3-4
3  4  40-80-120-150  0.5-0.3-0.2-0.2  30-32-30-32  1-1-1-1
5  4    40-40-40-40  0.1-0.1-0.1-0.1  18-18-18-18  1-2-3-4
I want to create 8 new columns: 4 holding the element-wise sum of e, i, d and t, and 4 holding the product.
For example:
def funct_two_outputs(E, I, d, t, d_calib=50):
    return E + I + d + t, E * I * d * t
The first two SUM values for row 0 would be:
sum-0 = 40 + 0.5 + 30 + 1    sum-1 = 80 + 0.3 + 32 + 1
The sum and product are example functions standing in for my real functions, which are a bit more complicated.
I have written a function **expand_on_col** that splits all the e, i, d, t values into new columns:
def expand_on_col(df_, col_to_split="namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split;
    return a df with the col split with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep, expand=True).add_prefix(prefix)
    df1 = pd.concat([df_, df1], axis=1).replace(np.nan, '-')
    return df1
Now I need to create 4 new columns that are the sum of e, i, d and t, and 4 that are the product.
Example output for SUM:
index a e i d t a-0 e-0 e-1 e-2 e-3 i-0 i-1 i-2 i-3 d-0 d-1 d-2 d-3 t-0 t-1 t-2 t-3 sum-0 sum-1 sum-2 sum-3
0 0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
1 1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
2 3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
3 5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
If I run the code with funct_one_outputs (which only returns the sum) it works, but with funct_two_outputs (sum and product) I get an error.
Here is the code:
import pandas as pd
import numpy as np

def expand_on_col(df_, col_to_split="namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split;
    return a df with the col split with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep, expand=True).add_prefix(prefix)
    df1 = pd.concat([df_, df1], axis=1).replace(np.nan, '-')
    return df1

def funct_two_outputs(E, I, d, t, d_calib=50):  # the function I want to apply
    return E + I + d + t, E * I * d * t

def funct_one_outputs(E, I, d, t, d_calib=50):  # for now I can only use this one; can't use 2 return values
    return E + I + d + t

columns = ['e', 'i', 'd', 't']  # (assumed: the list of string columns to split)
for col in columns:
    df = expand_on_col(df_=df, col_to_split=col, sep='-', prefix=f"{col}-")

cols_ = df.columns.drop(columns)
df[cols_] = df[cols_].apply(pd.to_numeric, errors="coerce")
df["a"] = df["a"].apply(pd.to_numeric, errors="coerce")
df.reset_index(inplace=True)

for i in range(max(df["a"])):
    name_1, name_2 = f"sum-{i}", f"mult-{i}"
    df[name_1] = df.apply(lambda row: funct_one_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
    # if I try to fill 2 outputs it won't work
    df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
OUT:
ValueError Traceback (most recent call last)
<ipython-input-306-85157b89d696> in <module>()
68 df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
69 #if i try and fill 2 outputs it wont work
---> 70 df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
71
72
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __setitem__(self, key, value)
3039 self._setitem_frame(key, value)
3040 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3041 self._setitem_array(key, value)
3042 else:
3043 # set column
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3074 )[1]
3075 self._check_setitem_copy()
-> 3076 self.iloc._setitem_with_indexer((slice(None), indexer), value)
3077
3078 def _setitem_frame(self, key, value):
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
1751 if len(ilocs) != len(value):
1752 raise ValueError(
-> 1753 "Must have equal len keys and value "
1754 "when setting with an iterable"
1755 )
ValueError: Must have equal len keys and value when setting with an iterable
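The direct cause of the error: with axis=1, df.apply here returns a single Series whose elements are 2-tuples, which cannot be assigned to two columns at once. A minimal fix, as a sketch, is apply's documented result_type='expand' option, which expands each returned tuple into its own row of a two-column result:
# result_type='expand' expands each returned tuple into columns,
# so the shape matches the two target columns [name_1, name_2]
df[[name_1, name_2]] = df.apply(
    lambda row: funct_two_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'],
                                  d=row[f'd-{i}'], t=row[f"t-{i}"]),
    axis=1, result_type='expand')
That said, the answer below avoids row-wise apply entirely, which is both cleaner and faster.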

Don't Use apply
If you can help it
s = pd.to_numeric(
    df[['e', 'i', 'd', 't']]
    .stack()
    .str.split('-', expand=True)
    .stack()
)
sums = s.sum(level=[0, 2]).rename('Sum')
prods = s.prod(level=[0, 2]).rename('Prod')
sums_prods = pd.concat([sums, prods], axis=1).unstack()
sums_prods.columns = [f'{o}-{i}' for o, i in sums_prods.columns]
df.join(sums_prods)
a e i d t Sum-0 Sum-1 Sum-2 Sum-3 Prod-0 Prod-1 Prod-2 Prod-3
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
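A version note: Series.sum(level=...) and Series.prod(level=...) were deprecated in pandas 1.3 and removed in 2.0. On current pandas the same aggregations are written with an explicit groupby over the index levels:
sums = s.groupby(level=[0, 2]).sum().rename('Sum')
prods = s.groupby(level=[0, 2]).prod().rename('Prod')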

Related

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

I have a pandas data frame that looks like this:
id age weight group
1 12 45 [10-20]
1 18 110 [10-20]
1 25 25 [20-30]
1 29 85 [20-30]
1 32 49 [30-40]
1 31 70 [30-40]
1 37 39 [30-40]
I am looking for a data frame that would look like this: (sd=standard deviation)
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
[10-20]
[20-30]
[30-40]
Here the second and third columns are the mean and SD for that group; the fourth and fifth columns are the mean and SD for all the remaining groups combined.
Here's a way to do it:
res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group == group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight', 'group_sd_weight', 'rest_mean_weight', 'rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
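As an aside, the per-group half of this table can also be produced in one call with named aggregation (a sketch; the rest_* columns still need the complement masks as above):
grp = df.groupby('group')['weight'].agg(group_mean_weight='mean', group_sd_weight='std')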
An alternative way to get the same result is:
res = (pd.DataFrame(
           df.group.drop_duplicates().to_frame()
           .apply(lambda x: [
               df.loc[df.group == x.group, 'weight'].mean(),
               df.loc[df.group == x.group, 'weight'].std(),
               df.loc[df.group != x.group, 'weight'].mean(),
               df.loc[df.group != x.group, 'weight'].std()], axis=1, result_type='expand')
           .to_numpy(),
           index=list(df.group.drop_duplicates()),
           columns=['group_mean_weight', 'group_sd_weight', 'rest_mean_weight', 'rest_sd_weight'])
       .reset_index().rename(columns={'index': 'group'}))
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"
To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.
wgtCols = ['weight', 'weight2']
res = (pd.concat([pd.DataFrame(
           df.group.drop_duplicates().to_frame()
           .apply(lambda x: [
               df.loc[df.group == x.group, wgtCol].mean(),
               df.loc[df.group == x.group, wgtCol].std(),
               df.loc[df.group != x.group, wgtCol].mean(),
               df.loc[df.group != x.group, wgtCol].std()], axis=1, result_type='expand')
           .to_numpy(),
           index=list(df.group.drop_duplicates()),
           columns=[f'group_mean_{wgtCol}', f'group_sd_{wgtCol}', f'rest_mean_{wgtCol}', f'rest_sd_{wgtCol}'])
       for wgtCol in wgtCols], axis=1)
       .reset_index().rename(columns={'index': 'group'}))
Input:
id age weight weight2 group
0 1 12 45 55 [10-20]
1 1 18 110 120 [10-20]
2 1 25 25 35 [20-30]
3 1 29 85 95 [20-30]
4 1 32 49 59 [30-40]
5 1 31 70 80 [30-40]
6 1 37 39 49 [30-40]
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight group_mean_weight2 group_sd_weight2 rest_mean_weight2 rest_sd_weight2
0 [10-20] 77.500000 45.961941 53.60 24.016661 87.500000 45.961941 63.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411 65.000000 42.426407 72.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596 62.666667 15.821926 76.25 38.378596

Merging computed file contents and display previous computed data in output

I am working with 2 files, oldFile.txt and newFile.txt, and computing some changes between them. newFile.txt is updated constantly, and any updates are written back to oldFile.txt.
I am trying to improve the snippet below by saving previously computed values and adding them to finalOutput.txt. Any idea would be very helpful to accomplish the needed output. Thank you in advance.
import pandas as pd
from time import sleep

def read_file(fn):
    data = {}
    with open(fn, 'r') as f:
        for lines in f:
            line = lines.rstrip()
            pname, cnt, cat = line.split(maxsplit=2)
            data.update({pname: {'pname': pname, 'cnt': int(cnt), 'cat': cat}})
    return data

def process_data(oldfn, newfn):
    old = read_file(oldfn)
    new = read_file(newfn)
    u_data = {}
    for ko, vo in old.items():
        if ko in new:
            n = new[ko]
            old_cnt = vo['cnt']
            new_cnt = n['cnt']
            u_cnt = old_cnt + new_cnt
            tmp_old_cnt = 1 if old_cnt == 0 else old_cnt
            cnt_change = 100 * (new_cnt - tmp_old_cnt) / tmp_old_cnt
            u_data.update({ko: {'pname': n['pname'], 'cnt': new_cnt, 'cat': n['cat'],
                                'curr_change%': round(cnt_change, 0)}})
    for kn, vn in new.items():
        if kn not in old:
            old_cnt = 1
            new_cnt = vn['cnt']
            cnt_change = 0
            vn.update({'cnt_change': round(cnt_change, 0)})
            u_data.update({kn: vn})
    pd.options.display.float_format = "{:,.0f}".format
    mydata = []
    for _, v in u_data.items():
        mydata.append(v)
    df = pd.DataFrame(mydata)
    df = df.sort_values(by=['cnt'], ascending=False)
    # Save to text file.
    with open('finalOutput.txt', 'w') as w:
        w.write(df.to_string(header=None, index=False))
    # Overwrite oldFile.txt
    with open('oldFile.txt', 'w') as w:
        w.write(df.to_string(header=None, index=False))
    # Print in console.
    df.insert(0, '#', range(1, 1 + len(df)))
    print(df.to_string(index=False, header=True))

while True:
    oldfn = './oldFile.txt'
    newfn = './newFile.txt'
    process_data(oldfn, newfn)
    sleep(60)
oldFile.txt
e6c76e4810a464bc 1 Hello(HLL)
65b66cc4e81ac81d 2 CryptoCars (CCAR)
c42d0c924df124ce 3 GoldNugget (NGT)
ee70ad06df3d2657 4 BabySwap (BABY)
e5b7ebc589ea9ed8 8 Heroes&E... (HE)
7e7e9d75f5da2377 3 Robox (RBOX)
newfile.txt #-- content during 1st reading
e6c76e4810a464bc 34 Hello(HLL)
65b66cc4e81ac81d 43 CryptoCars (CCAR)
c42d0c924df124ce 95 GoldNugget (NGT)
ee70ad06df3d2657 15 BabySwap (BABY)
e5b7ebc589ea9ed8 37 Heroes&E... (HE)
7e7e9d75f5da2377 23 Robox (RBOX)
755507d18913a944 49 CharliesFactory
newfile.txt #-- content during 2nd reading
924dfc924df1242d 35 AeroDie (ADie)
e6c76e4810a464bc 34 Hello(HLL)
65b66cc4e81ac81d 73 CryptoCars (CCAR)
c42d0c924df124ce 15 GoldNugget (NGT)
ee70ad06df3d2657 5 BabySwap (BABY)
e5b7ebc589ea9ed8 12 Heroes&E... (HE)
7e7e9d75f5da2377 19 Robox (RBOX)
755507d18913a944 169 CharliesFactory
newfile.txt # content during 3rd reading
924dfc924df1242d 45 AeroDie (ADie)
e6c76e4810a464bc 2 Hello(HLL)
65b66cc4e81ac81d 4 CryptoCars (CCAR)
c42d0c924df124ce 7 GoldNugget (NGT)
ee70ad06df3d2657 5 BabySwap (BABY)
e5b7ebc589ea9ed8 3 Heroes&E... (HE)
7e7e9d75f5da2377 6 Robox (RBOX)
755507d18913a944 9 CharliesFactory
oldFile.txt #-- Current output that needs improvement
# pname cnt cat curr_change%
1 924dfc924df1242d 35 AeroDie (ADie) 29
2 755507d18913a944 9 CharliesFactory -95
3 c42d0c924df124ce 7 GoldNugget (NGT) -53
4 7e7e9d75f5da2377 6 Robox (RBOX) -68
5 ee70ad06df3d2657 5 BabySwap (BABY) 0
6 65b66cc4e81ac81d 4 CryptoCars (CCAR) -95
7 e5b7ebc589ea9ed8 3 Heroes&E... (HE) -75
8 e6c76e4810a464bc 2 Hello(HLL) -94
finalOutput.txt #-- Needed Improved Output with additional columns r1, r2 and so on depending on how many update readings
# curr_change% is the latest 3rd reading
# r2% is based on the 2nd reading
# r1% is based on the 1st reading
# pname cnt cat curr_change% r2% r1%
1 924dfc924df1242d 35 AeroDie (ADie) 29 0 0
2 755507d18913a944 9 CharliesFactory -95 245 0
3 c42d0c924df124ce 7 GoldNugget (NGT) -53 -84 3,067
4 7e7e9d75f5da2377 6 Robox (RBOX) -68 -17 667
5 ee70ad06df3d2657 5 BabySwap (BABY) 0 -67 275
6 65b66cc4e81ac81d 4 CryptoCars (CCAR) -95 70 2,050
7 e5b7ebc589ea9ed8 3 Heroes&E... (HE) -75 -68 362
8 e6c76e4810a464bc 2 Hello(HLL) -94 0 3,300
Update based on feedback: I made adjustments so that it handles data fed to it live. Whenever new data is loaded, pass the file name to the process_new_file() function and it will update finalOutput.txt.
For simplicity, I named the different files file1, file2, file3, and file4.
I'm doing most of the operations using pandas DataFrames. I think working with pandas DataFrames will make the task a lot easier for you.
Overall, I created one function to read the file and return a properly formatted DataFrame. I created a second function that compares the old and the new file and does the calculation you were looking for. I merge together the results of these calculations. Finally, I merge all of these calculations with the last file's data to get the output you're looking for.
import pandas as pd

global_old_df = None
results_df = pd.DataFrame()
count = 0

def read_file(file_name):
    rows = []
    with open(file_name) as f:
        for line in f:
            rows.append(line.split(" ", 2))
    df = pd.DataFrame(rows, columns=['pname', 'cnt', 'cat'])
    df['cat'] = df['cat'].str.strip()
    df['cnt'] = df['cnt'].astype(float)
    return df

def compare_dfs(df_old, df_new, count):
    df_ = df_old.merge(df_new, on=['pname', 'cat'], how='outer')
    df_['r%s' % count] = (df_['cnt_y'] / df_['cnt_x'] - 1) * 100
    df_ = df_[['pname', 'r%s' % count]]
    df_ = df_.set_index('pname')
    return df_

def process_new_file(file):
    global global_old_df
    global results_df
    global count
    df_new = read_file(file)
    if global_old_df is None:
        global_old_df = df_new
        return
    else:
        count += 1
        r_df = compare_dfs(global_old_df, df_new, count)
        results_df = pd.concat([r_df, results_df], axis=1)
        global_old_df = df_new
    output_df = df_new.merge(results_df, left_on='pname', right_index=True)
    output_df.to_csv('finalOutput.txt')
    pd.options.display.float_format = "{:,.1f}".format
    print(output_df.to_string())

files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt']
for file in files:
    process_new_file(file)
This gives the output:
pname cnt cat r3 r2 r1
0 924dfc924df1242d 45.0 AeroDie (ADie) 28.6 NaN NaN
1 e6c76e4810a464bc 2.0 Hello(HLL) -94.1 0.0 3,300.0
2 65b66cc4e81ac81d 4.0 CryptoCars (CCAR) -94.5 69.8 2,050.0
3 c42d0c924df124ce 7.0 GoldNugget (NGT) -53.3 -84.2 3,066.7
4 ee70ad06df3d2657 5.0 BabySwap (BABY) 0.0 -66.7 275.0
5 e5b7ebc589ea9ed8 3.0 Heroes&E... (HE) -75.0 -67.6 362.5
6 7e7e9d75f5da2377 6.0 Robox (RBOX) -68.4 -17.4 666.7
7 755507d18913a944 9.0 CharliesFactory -94.7 244.9 NaN
So, to run it live, you'd just replace that last section with:
from time import sleep

while True:
    newfn = './newFile.txt'
    process_new_file(newfn)
    sleep(60)
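One caveat: output_df.to_csv('finalOutput.txt') writes comma-separated text, whereas the finalOutput.txt shown in the question is space-aligned. If that layout is required, a to_string-based write could be substituted (a sketch, mirroring the question's original formatting choices):
# assumption: the space-aligned layout of the target finalOutput.txt is wanted
with open('finalOutput.txt', 'w') as w:
    w.write(output_df.to_string(index=False))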

Fastest most efficient way to apply groupby that references multiple columns

Suppose we have a dataset:
tmp = pd.DataFrame({'hi': [1, 2, 3, 3, 5, 6, 3, 2, 3, 2, 1],
                    'bye': [12, 23, 35, 35, 53, 62, 31, 22, 33, 22, 12],
                    'yes': [12, 2, 32, 3, 5, 6, 23, 2, 32, 2, 21],
                    'no': [1, 92, 93, 3, 95, 6, 33, 2, 33, 22, 1],
                    'maybe': [91, 2, 32, 3, 95, 69, 3, 2, 93, 2, 1]})
In python we can easily do tmp.groupby('hi').agg(total_bye=('bye', sum)) to get the sum of bye for each group. But if I want to reference multiple columns, what would be the fastest, most efficient, and most cleanly written (easily readable) way to do this in python? In particular, can I do this using df.groupby(my_cols).agg()? What are the fastest alternatives? I'm open to (and actually prefer) using libraries faster than pandas, such as dask or vaex.
For example, in R data.table we can do this pretty easily, and it's super fast
# In R, assume this object is a data.table.
# The single line below groups by 'hi' and creates my_new_col:
# the sum of 'no' over rows where bye > 5 and yes < 20, within each group.
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = 'hi']
# output 1
hi my_new_col
1: 1 1
2: 2 116
3: 3 3
4: 5 95
5: 6 6
# Similarly, we can even group by a rule instead of creating a new col to group by. See below
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = .(new_rule = ifelse(hi > 3, 1, 0))]
# output 2
new_rule my_new_col
1: 0 120
2: 1 101
# We can even apply multiple aggregate functions in parallel using data.table
agg_fns <- function(x) list(sum = sum(as.double(x), na.rm = T),
                            mean = mean(as.double(x), na.rm = T),
                            min = min(as.double(x), na.rm = T),
                            max = max(as.double(x), na.rm = T))
tmp[,
    unlist(
      list(N = .N,  # add a N column (row count) to the summary
           unlist(mclapply(.SD, agg_fns, mc.cores = 12), recursive = F)),  # apply all agg_fns over all .SDcols
      recursive = F),
    .SDcols = !unique(c(names('hi'), as.character(unlist('hi'))))]
output 3:
N bye.sum bye.mean bye.min bye.max yes.sum yes.mean yes.min yes.max no.sum no.mean no.min
1: 11 340 30.90909 12 62 140 12.72727 2 32 381 34.63636 1
no.max maybe.sum maybe.mean maybe.min maybe.max
1: 95 393 35.72727 1 95
Do we have this same flexibility in python?
You can use agg on all wanted columns and add a prefix:
tmp.groupby('hi').agg('sum').add_prefix('total_')
output:
total_bye total_yes total_no total_maybe
hi
1 24 33 2 92
2 67 6 116 6
3 134 90 162 131
5 53 5 95 95
6 62 6 6 69
You can even combine columns and operations flexibly with a dictionary:
tmp.groupby('hi').agg(**{'%s_%s' % (label, c): (c, op)
                         for c in tmp.columns
                         for (label, op) in [('total', 'sum'), ('average', 'mean')]})
output:
total_hi average_hi total_bye average_bye total_yes average_yes total_no average_no total_maybe average_maybe
hi
1 2 1 24 12.000000 33 16.5 2 1.000000 92 46.00
2 6 2 67 22.333333 6 2.0 116 38.666667 6 2.00
3 12 3 134 33.500000 90 22.5 162 40.500000 131 32.75
5 5 5 53 53.000000 5 5.0 95 95.000000 95 95.00
6 6 6 62 62.000000 6 6.0 6 6.000000 69 69.00
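For the conditional aggregations in the R examples (outputs 1 and 2), a pandas equivalent is also short. This is a sketch using Series.where to zero out rows that fail the condition before grouping:
# mask 'no' to 0 wherever the condition fails
masked = tmp['no'].where((tmp['bye'] > 5) & (tmp['yes'] < 20), 0)
# output 1: conditional sum per 'hi' group
tmp.assign(my_new_col=masked).groupby('hi', as_index=False)['my_new_col'].sum()
# output 2: group by a derived rule instead of an existing column
tmp.assign(my_new_col=masked).groupby((tmp['hi'] > 3).astype(int).rename('new_rule'))['my_new_col'].sum()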

Entering values at each index of dataframe

I have a pandas dataframe which I am storing information about different objects in a video.
For each frame of the video I'm saving the positions of the objects in a dataframe with columns 'x', 'y' and 'particle', with the frame number in the index:
x y particle
frame
0 588 840 0
0 260 598 1
0 297 1245 2
0 303 409 3
0 307 517 4
This works fine but I want to save information about each frame of the video, e.g. the temperature at each frame.
I'm currently doing this by creating a series with the values for each frame and the frame number in the index, then adding the series to the dataframe:
prop = pd.Series(temperature_values,
                 index=pd.Index(np.arange(len(temperature_values)), name='frame'))
df['temperature'] = prop
This works but produces duplicates of the data in every row of the column:
x y particle temperature
frame
0 588 840 0 12
0 260 598 1 12
0 297 1245 2 12
0 303 409 3 12
0 307 517 4 12
Is there any way of saving this information without duplicates in the current dataframe, so that when I access the temperature column I just receive the original series that I created?
If there isn't any way of doing this, my plan is either to deal with the duplicates using drop_duplicates or to create a second dataframe with just the per-frame data, which I can then merge into my first dataframe, but I'd like to avoid that if possible.
Here is the current code, with the Jupyter outputs formatted as best I can:
import pandas as pd
import numpy as np

df = pd.DataFrame()
frames = list(range(5))
for f in frames:
    x = np.random.randint(10, 100, size=10)
    y = np.random.randint(10, 100, size=10)
    particle = np.arange(10)
    data = {
        'x': x,
        'y': y,
        'particle': particle,
        'frame': f}
    df_to_append = pd.DataFrame(data)
    df = df.append(df_to_append)  # DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
print(df.head())
Output:
x y particle frame
0 61 97 0 0
1 49 73 1 0
2 48 72 2 0
3 59 37 3 0
4 39 64 4 0
Input
df = df.set_index('frame')
print(df.head())
Output
x y particle
frame
0 61 97 0
0 49 73 1
0 48 72 2
0 59 37 3
0 39 64 4
Input:
example_data = [10*f for f in frames]
# Current method
prop = pd.Series(example_data, index=pd.Index(np.arange(len(example_data)), name='frame'))
df['data1'] = prop
print(df.head())
print(df.tail())
Output:
x y particle data1
frame
0 61 97 0 0
0 49 73 1 0
0 48 72 2 0
0 59 37 3 0
0 39 64 4 0
x y particle data1
frame
4 25 93 5 40
4 28 17 6 40
4 39 15 7 40
4 28 47 8 40
4 12 56 9 40
Input:
# Proposed method
df['data2'] = example_data
Output:
ValueError Traceback (most recent call last)
<ipython-input-12-e41b12bbe1cd> in <module>
1 # Proposed method
----> 2 df['data2'] = example_data
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3368 else:
3369 # set column
-> 3370 self._set_item(key, value)
3371
3372 def _setitem_slice(self, key, value):
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3443
3444 self._ensure_valid_index(value)
-> 3445 value = self._sanitize_column(key, value)
3446 NDFrame._set_item(self, key, value)
3447
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3628
3629 # turn me into an ndarray
-> 3630 value = sanitize_index(value, self.index, copy=False)
3631 if not isinstance(value, (np.ndarray, Index)):
3632 if isinstance(value, list) and len(value) > 0:
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
517
518 if len(data) != len(index):
--> 519 raise ValueError('Length of values does not match length of index')
520
521 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
I am afraid you cannot. All columns in a DataFrame share the same index and are required to have the same length. And coming from the database world, I try to avoid indexes with duplicate values as much as possible.
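That said, the OP's own fallback plan works well in practice: keep a second, per-frame DataFrame and join it in only when a flat view is needed. A sketch (frame_props is a made-up name, reusing example_data from the question):
# per-frame properties stored once, indexed by frame number
frame_props = pd.DataFrame({'temperature': example_data},
                           index=pd.Index(range(len(example_data)), name='frame'))
df_with_temp = df.join(frame_props)  # duplicated values appear only in this joined view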

Python: print function is not giving expected output

I have written the function below in Python:
import shutil
import pandas as pd

df = pd.DataFrame({'age': [32, 33, 33, 34, 44]})

def PROC_FREQ(dataset, arg1):
    x = dataset.groupby(arg1)[arg1[0]].agg({'Frequency': 'count'})
    nombre = x.columns.tolist()[0]
    x.rename(columns={nombre: 'Freq'}, inplace=True)
    x['Pct'] = round((x['Freq'] / x.Freq.sum()) * 100, 2)
    x['Freq Acum'], x['Cumm Percent'] = x.Freq.cumsum(), x.Pct.cumsum()
    x.sort_values(arg1, ascending=[1], inplace=True)
    pd.set_option('display.max_columns', 500)
    x = x.reset_index()
    string_repr = x.to_string(index=False, justify='center').splitlines()
    string_repr.insert(1, "-" * len(string_repr[0]))
    out = '\n'.join(string_repr)
    df_split = out.split('\n')
    columns = shutil.get_terminal_size().columns
    for i in range(len(df_split)):
        print(df_split[i].center(columns))
and below is the code to call the function:
PROC_FREQ(df,['age'])
and below is the output of the function:
age Freq Pct Freq Acum Cumm Percent
-----------------------------------------
32 1 16.67 1 16.67
33 2 33.33 3 50.00
34 1 16.67 4 66.67
44 2 33.33 6 100.00
The last line of the output is not aligned correctly.
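No answer is shown here, but one plausible cause (an assumption, not verified against the OP's terminal) is that str.center pads each line independently, so lines of unequal length end up shifted by different amounts. Computing one shared left margin from the widest line and indenting every line by that same amount keeps the block aligned:
# hedged sketch: one shared margin instead of per-line centering
lines = out.split('\n')
term_width = shutil.get_terminal_size().columns
margin = max((term_width - max(len(l) for l in lines)) // 2, 0)
for line in lines:
    print(' ' * margin + line)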
