I have a large dataset (4GB) like this:
userID date timeofday seq
0 1000014754 20211028 20 133669542676:1:148;133658378700:1:16;133650937891:1:85
1 1000019906 20211028 6 508420199:0:0;133669581685:1:19
2 1000019906 20211028 22 133665269544:0:0
From this, I would like to first split "seq" by ";" and create a new dataset with renamed columns, like this:
userID date timeofday seq1 seq2 seq3 ... seqN
0 1000014754 20211028 20 133669542676:1:148 133658378700:1:16 133650937891:1:85
1 1000019906 20211028 6 508420199:0:0 133669581685:1:19 None None
2 1000019906 20211028 22 133665269544:0:0 None None None
Then I want to split seq1, seq2, ..., seqN by ":" and create a new dataset with renamed columns, like this:
userID date timeofday name1 click1 time1 name2 click2 time2 ....nameN clickN timeN
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85 None None None
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 None None None None None None
2 1000019906 20211028 22 133665269544 0 0 None None None None None None None None None
I know pandas.Series.str.split can split the columns, but I don't know how to do it efficiently. Thank you!
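For the intermediate seq1..seqN step alone, a minimal sketch with str.split(expand=True), assuming df holds the data (the column names seq1..seqN follow the layout above):
import pandas as pd

# split on ';'; expand=True returns a DataFrame padded with None
expanded = df['seq'].str.split(';', expand=True)
expanded.columns = [f'seq{i+1}' for i in range(expanded.shape[1])]
df_step1 = df.drop(columns='seq').join(expanded)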
A clean solution is to use a regex with extractall, then reshape with unstack, rename the columns, and join the result back to the original dataframe.
Assuming df is the dataframe name:
import pandas as pd  # assuming df is already loaded

# capture the name:click:time triplets; extractall yields one row per match
df2 = (df['seq'].str.extractall(r'(?P<name>[^:]+):(?P<click>[^:]+):(?P<time>[^;]+);?')
       .unstack('match')                                   # one column per match number
       .sort_index(level=1, axis=1, sort_remaining=False)  # group name/click/time per match
      )
df2.columns = df2.columns.map(lambda x: f'{x[0]}{x[1]+1}')  # flatten to name1, click1, time1, ...
df2 = df.drop(columns='seq').join(df2)
output:
userID date timeofday name1 click1 time1 name2 click2 time2 name3 click3 time3
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 NaN NaN NaN
2 1000019906 20211028 22 133665269544 0 0 NaN NaN NaN NaN NaN NaN
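Since the source data is 4GB, it may not all fit in memory at once; a hedged sketch of running the same transform per chunk (the file name 'data.tsv', the tab separator, and the chunk size are assumptions, and chunks that produce different numbers of seqN columns simply align with NaN on concat):
import pandas as pd

pattern = r'(?P<name>[^:]+):(?P<click>[^:]+):(?P<time>[^;]+);?'
parts = []
for chunk in pd.read_csv('data.tsv', sep='\t', chunksize=100_000):
    wide = chunk['seq'].str.extractall(pattern).unstack('match')
    wide = wide.sort_index(level=1, axis=1, sort_remaining=False)
    wide.columns = wide.columns.map(lambda x: f'{x[0]}{x[1]+1}')
    parts.append(chunk.drop(columns='seq').join(wide))
result = pd.concat(parts)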
Try this; it should get you the result:
import pandas as pd

A = pd.DataFrame({1: [2, 3, 4], 2: ['as:d', 'asd', 'a:sd']})
print(A)
for i in A.index:
    split = str(A[2][i]).split(':', 1)
    A.at[i, 3] = split[0]
    if len(split) > 1:
        A.at[i, 4] = split[1]
print(A)
It's probably slow, since the dataframe is updated cell by cell. Alternatively, you can collect the new columns in separate lists and merge them into one table later, as sketched below.
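A rough sketch of that list-based alternative, using the same toy frame as above:
import pandas as pd

A = pd.DataFrame({1: [2, 3, 4], 2: ['as:d', 'asd', 'a:sd']})

left, right = [], []
for value in A[2].astype(str):
    parts = value.split(':', 1)
    left.append(parts[0])
    right.append(parts[1] if len(parts) > 1 else None)

# attach the collected lists as new columns in one go
A[3] = left
A[4] = right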
I have a data frame:
id parentid score body
1 10 10 abc
2 10 0 xyz
3 10 1 efg
4 23 3 afd
5 23 2 asfagr
6 34 1 wrqqw
I need to groupby(parentid), then aggregate score by mean and join body. The id field is not relevant; it can be aggregated with min or max.
The result should be:
id parentid score body
1 10 3 abc xyz efg
4 23 2 afd asfagr
6 34 1 wrqqw
I have tried:
def f(x):
    x['Id'] = x['Id']
    x['ParentId'] = x['ParentId']
    x['Score'] = x['Score'].min()  # change to max/min/mean to get different results!
    x['Body'] = " ".join(x['Body'])
    return x

temp = temp.groupby("ParentId").apply(f)
temp = temp.reset_index()
It gives the correct result, but since the dataset size is >1.8 GB, the system becomes unresponsive.
I have tried it in Google Colab too; it has crashed 3 times.
Please suggest a faster method, such as lambda functions or anything else.
Try this using groupby with agg and a dictionary to define aggregations for each column:
df.groupby('parentid', as_index=False)[['score', 'body']]\
.agg({'score':'mean', 'body':' '.join})
Output:
parentid score body
0 10 3.666667 abc xyz efg
1 23 2.500000 afd asfagr
2 34 1.000000 wrqqw
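On pandas 0.25+, named aggregation gives the same result with explicit output column names; a minimal sketch:
df.groupby('parentid', as_index=False).agg(
    score=('score', 'mean'),
    body=('body', ' '.join),
)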
Try:
import numpy as np

temp.groupby("ParentId").agg({"Score": np.mean, "Body": " ".join})
I have a dataframe like this:
df = pd.DataFrame({"flag":["1","0","1","0"],
"val":["111","111","222","222"], "qwe":["","11","","12"]})
It gives:
flag qwe val
0 1 111
1 0 11 111
2 1 222
3 0 12 222
Then I filter the first dataframe like this:
dff = df.loc[df["flag"]=="1"]
**was:**
dff.loc["qwe"] = "123"
**edited:** (setting all rows in column "qwe" to "123")
dff["qwe"] = "123"
Now I need to merge/join df and dff in such a way that I get:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
That is, changes in 'qwe' should be taken from dff only where the value in df is empty.
Something like this:
pd.merge(df, dff, left_index=True, right_index=True, how="left")
gives
flag_x qwe_x val_x flag_y qwe_y val_y
0 1 111 1 111
1 0 11 111 NaN NaN NaN
2 1 222 1 222
3 0 12 222 NaN NaN NaN
So after that I need to drop flag_y and val_y, rename the _x columns, and manually merge qwe_x and qwe_y. Is there any way to make this easier?
pd.merge has an on argument that you can use to join columns with the same name in different dataframes.
Try:
pd.merge(df, dff, how="left", on=['flag', 'qwe', 'val'])
However, I don't think you need to do that at all. You can produce the same result using df.loc to conditionally assign a value (note that the sample data marks missing values with empty strings rather than NaN, so test for "" instead of isnull):
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
After the edit, this code works for me:
c1 = dff.combine_first(df)
It produces:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
Which is exactly what I was looking for.
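For reference, a one-line sketch that skips building dff entirely, using Series.mask (assuming, as in the sample data, that empty strings mark the missing values):
df['qwe'] = df['qwe'].mask((df['flag'] == "1") & (df['qwe'] == ""), "123")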
I'm trying to add a column, 'C_End', to a DataFrame in Pandas that looks something like this:
df = pd.DataFrame({'ID':[123,123,123,456,456,789],
'C_ID':[8,10,35,36,40,7],
'C_Type':['New','Renew','Renew','New','Term','New'],
'Rank':[1,2,3,1,2,1]})
The new column needs to be the next 'C_Type' for each ID based on 'Rank', resulting in a DataFrame that looks like this:
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 None
3 456 36 New 1 Term
4 456 40 Term 2 None
5 789 7 New 1 None
Essentially, I want to find the row where the ID matches and Rank = Rank + 1, and assign its C_Type to the new column C_End. I've tried creating a function and using apply (below), but it took forever and eventually gave me an error. I'm still new to Pandas and Python in general, but I feel like there has to be an easy solution that I'm not seeing.
def get_next_c_type(row):
    return df.loc[(df['ID'] == row['ID']) & (df['Rank'] == row['Rank'] + 1), 'C_Type']

df['C_End'] = df.apply(get_next_c_type, axis=1)
Try:
df['C_End'] = df.sort_values('Rank').groupby('ID')['C_Type'].transform('shift',-1)
Or, as @W-B suggests:
df['C_End'] = df.sort_values('Rank').groupby('ID')['C_Type'].shift(-1)
Output:
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 NaN
3 456 36 New 1 Term
4 456 40 Term 2 NaN
5 789 7 New 1 NaN
Here's one way using np.where (this assumes the frame is sorted by ID and Rank, as in the example):
import numpy as np

dfs = df.shift(-1)             # the next row
m1 = df.ID == dfs.ID           # next row has the same ID
m2 = df.Rank + 1 == dfs.Rank   # and the next Rank
df.loc[:, 'C_End'] = np.where(m1 & m2, dfs.C_Type, None)
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 None
3 456 36 New 1 Term
4 456 40 Term 2 None
5 789 7 New 1 None
So I have a data-set with 4 IDs; each ID has 70 values, marked present or absent. I counted the number of present and absent values with the following code:
df = pd.pivot_table(df, index=["ID", 'status'], values=["Sem1"], aggfunc=[len]).reset_index()
df['ID'] = df['ID'].mask(df['ID'].duplicated(), '')  # blank out repeated IDs for display
df
ID Status len
Sem1
4234 Present 45
Absent 25
4235 Present 40
Absent 30
4236 Present 35
Absent 35
4237 Present 50
Absent 20
In:  df.columns
Out: MultiIndex(levels=[['len', 'status', 'ID'], ['sem1', '']],
                labels=[[2, 1, 0], [1, 1, 0]])
I need to take the columns separately to add them to two different data frames.
Is there any way to take the columns separately?
Also, I need to know if it can be changed into the following data-set:
ID Status Sem1
4234 Present 45
Absent 25
4235 Present 40
Absent 30
4236 Present 35
Absent 35
4237 Present 50
Absent 20
In:  df.columns
Out: Index(['ID', 'Status', 'Sem1'], dtype='object')
Can this be done from the previous data-set?
Your solution works fine for me.
df = pd.DataFrame({'Sem1':[1,3,5,7,1,0],
'Sem2':[5,3,6,9,2,4],
'ID':list('aaabbb')})
print (df)
Sem1 Sem2 ID
0 1 5 a
1 3 3 a
2 5 6 a
3 7 9 b
4 1 2 b
5 0 4 b
df1 = df.groupby('ID').mean().reset_index()
print (df1)
ID Sem1 Sem2
0 a 3.000000 4.666667
1 b 2.666667 5.000000
EDIT:
Remove the [] wrappers around values and aggfunc (using 'size' instead of [len]):
df = pd.pivot_table(df,index=["ID",'status'], values="Sem1", aggfunc='size').reset_index()
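To also get the flat ID / Status / Sem1 layout asked about above, one hedged sketch is to start from the raw frame, pivot as before, then drop the extra column level (column names and casing follow the pivot call above):
df2 = pd.pivot_table(df, index=['ID', 'status'], values=['Sem1'], aggfunc=[len])
df2.columns = df2.columns.get_level_values(1)  # keep only the 'Sem1' level
df2 = df2.reset_index()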
I have four pandas dataframes that can be generated with the below code:
import pandas

#df 1
time1=pandas.Series([0,20,40,60,120])
pPAK2=pandas.Series([0,3,15,21,23])
cols=['time','pPAK2']
df=pandas.DataFrame([time1,pPAK2])
df=df.transpose()
df.columns=cols
df.to_csv('pPAK2.csv',sep='\t')
pak2_df=df
#df2
time2=pandas.Series([0,15,30,60,120])
cAbl=pandas.Series([0,15,34,10,0])
df=pandas.DataFrame([time2,cAbl])
df=df.transpose()
cols=['time','pcAbl']
df.columns=cols
df.to_csv('pcAbl.csv',sep='\t')
pcAbl_df=df
#df 3
time7=pandas.Series([0,60,120,240,480,960,1440])
pSmad3_n=pandas.Series([0,16,14,12,8,7.5,6])
scale_factor=40
pSmad3_n=pSmad3_n*scale_factor
#plt.plot(time7,pSmad3)
df=pandas.DataFrame([time7,pSmad3_n])
df=df.transpose()
cols=['time','pSmad3_n']
df.columns=cols
df.to_csv('pSmad3_n.csv',sep='\t')
smad3_df=df
#df4
time8=pandas.Series([0,240,480,1440])
PAI1_mRNA=pandas.Series([0,23,25,5])
scale_factor=5
PAI1_mRNA=PAI1_mRNA*scale_factor
df=pandas.DataFrame([time8,PAI1_mRNA])
df=df.transpose()
cols=['time','PAI1_mRNA']
df.columns=cols
df.to_csv('PAI1_mRNA.csv',sep='\t')
PAI1_df=df
#print dataframes
print(PAI1_df)
print(pak2_df)
print(pcAbl_df)
print(smad3_df)
I want to concatenate these dataframes on the time column with the pandas concat function, but I can't get the output right. The output should look something like this if I were to just concatenate PAI1_df and pak2_df:
time PAI1_mRNA pPAK2
0 0 0 0
1 20 'NaN' 3
2 40 'NaN' 15
3 60 'NaN' 21
4 120 'NaN' 23
5 240 115 'NaN'
6 480 125 'NaN'
7 1440 25 'NaN'
I think it should be easy, but there are a lot of features in the docs. Does anybody know how to do this?
You can concat like this, setting time as the index so the frames align on it:
import pandas
df = pandas.concat([pak2_df.set_index('time'), pcAbl_df.set_index('time')], axis=1).reset_index()
print(df)
Prints:
time pPAK2 pcAbl
0 0 0 0
1 15 NaN 15
2 20 3 NaN
3 30 NaN 34
4 40 15 NaN
5 60 21 10
6 120 23 0
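The same pattern extends to all four frames, since each has a time column; a minimal sketch:
frames = [pak2_df, pcAbl_df, smad3_df, PAI1_df]
# outer-join on the time index; missing time points become NaN
combined = pandas.concat([f.set_index('time') for f in frames], axis=1).reset_index()
print(combined)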