Split columns twice using Python - python

I have a large dataset (4GB) like this:
userID date timeofday seq
0 1000014754 20211028 20 133669542676:1:148;133658378700:1:16;133650937891:1:85
1 1000019906 20211028 6 508420199:0:0;133669581685:1:19
2 1000019906 20211028 22 133665269544:0:0
From this, I would like to split "seq" by ";" first and create a new dataset with renamed columns. It should look like this:
userID date timeofday seq1 seq2 seq3 ... seqN
0 1000014754 20211028 20 133669542676:1:148 133658378700:1:16 133650937891:1:85
1 1000019906 20211028 6 508420199:0:0 133669581685:1:19 None None
2 1000019906 20211028 22 133665269544:0:0 None None None
Then I want to split seq1, seq2, ..., seqN by ":" and create a new dataset with renamed columns. It should look like this:
userID date timeofday name1 click1 time1 name2 click2 time2 ....nameN clickN timeN
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85 None None None
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 None None None None None None
2 1000019906 20211028 22 133665269544 0 0 None None None None None None None None None
I know pandas' str.split can split the columns, but I don't know how to do it efficiently. Thank you!

A clean solution is to use a regex with extractall, then reshape with unstack, rename the columns, and join back to the original dataframe.
Assuming df is the dataframe name:
df2 = (df['seq'].str.extractall(r'(?P<name>[^:]+):(?P<click>[^:]+):(?P<time>[^;]+);?')  # one row per name:click:time match
       .unstack('match')                                    # one column per (field, match) pair
       .sort_index(level=1, axis=1, sort_remaining=False)   # group the columns by match number
      )
df2.columns = df2.columns.map(lambda x: f'{x[0]}{x[1]+1}')  # flatten to name1, click1, time1, ...
df2 = df.drop(columns='seq').join(df2)
output:
userID date timeofday name1 click1 time1 name2 click2 time2 name3 click3 time3
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 NaN NaN NaN
2 1000019906 20211028 22 133665269544 0 0 NaN NaN NaN NaN NaN NaN
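The question also asked for the intermediate seq1...seqN table. A minimal sketch of that two-step split using str.split with expand=True (the seq1/name1-style column names are illustrative, and it assumes every seqN column has at least one non-null value):
# Step 1: split "seq" on ";" into one column per segment
parts = df['seq'].str.split(';', expand=True)
parts.columns = [f'seq{i+1}' for i in range(parts.shape[1])]

# Step 2: split each seqN column on ":" into a name/click/time triple
pieces = []
for i, col in enumerate(parts.columns, start=1):
    triple = parts[col].str.split(':', expand=True)
    triple.columns = [f'name{i}', f'click{i}', f'time{i}']
    pieces.append(triple)

result = df.drop(columns='seq').join(pd.concat(pieces, axis=1))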

Try this; it should get you the result:
import pandas as pd

A = pd.DataFrame({1: [2, 3, 4], 2: ['as:d', 'asd', 'a:sd']})
print(A)
for i in A.index:
    split = str(A[2][i]).split(':', 1)   # split on the first ":" only
    A.at[i, 3] = split[0]
    if len(split) > 1:
        A.at[i, 4] = split[1]
print(A)
It's probably slow since the dataframe is updated often. Alternatively, you can write the new columns into separate lists and merge them into one table later.
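A hedged sketch of that list-based alternative: collect the pieces in plain Python lists and assign each column once at the end, which avoids the repeated .at writes:
col3, col4 = [], []
for val in A[2].astype(str):
    split = val.split(':', 1)
    col3.append(split[0])
    col4.append(split[1] if len(split) > 1 else None)
A[3] = col3
A[4] = col4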

Related

pandas aggregate and join dataframe during group by

I have a data frame:
id parentid score body
1 10 10 abc
2 10 0 xyz
3 10 1 efg
4 23 3 afd
5 23 2 asfagr
6 34 1 wrqqw
I need to groupby(parentid), then aggregate score by mean and join body. The id field is not relevant; it can be aggregated with min or max.
The result should be:
id parentid score body
1 10 3 abc xyz efg
4 23 2 afd asfagr
6 34 1 wrqqw
I have tried:
def f(x):
    x['Id'] = x['Id']
    x['ParentId'] = x['ParentId']
    x['Score'] = x['Score'].min()  # change to max/min/mean to get different results!
    x['Body'] = " ".join(x['Body'])
    return x
temp = temp.groupby("ParentId").apply(f)
temp = temp.reset_index()
It gives the correct result, but since the dataset size is >1.8 GB, the system becomes unresponsive.
I have tried it in Google Colab too; it has crashed 3 times.
Please suggest a faster method, such as lambda functions or anything else.
Try this using groupby with agg and a dictionary to define aggregations for each column:
df.groupby('parentid', as_index=False)[['score', 'body']]\
  .agg({'score': 'mean', 'body': ' '.join})
Output:
parentid score body
0 10 3.666667 abc xyz efg
1 23 2.500000 afd asfagr
2 34 1.000000 wrqqw
Or try (this assumes numpy is imported as np):
temp.groupby("ParentId").agg({"Score": np.mean, "Body": " ".join})
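On pandas 0.25+, named aggregation expresses the same thing with explicit output names (a sketch, assuming the lowercase column names shown in the question):
df.groupby('parentid', as_index=False).agg(
    score=('score', 'mean'),    # mean of the score column
    body=('body', ' '.join),    # concatenate body strings with spaces
)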

Merge two Dataframes with same columns with overwrite

I have a dataframe like this:
df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})
It gives:
flag qwe val
0 1 111
1 0 11 111
2 1 222
3 0 12 222
Then I filter the first dataframe like this:
dff = df.loc[df["flag"]=="1"]
Was:
dff.loc["qwe"] = "123"
Edited (setting all rows in column "qwe" to "123"):
dff["qwe"] = "123"
And now I need to merge/join df and dff in such a way as to get:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
I want to add the changes in 'qwe' from dff only where the df value is empty.
Something like this:
pd.merge(df, dff, left_index=True, right_index=True, how="left")
gives
flag_x qwe_x val_x flag_y qwe_y val_y
0 1 111 1 111
1 0 11 111 NaN NaN NaN
2 1 222 1 222
3 0 12 222 NaN NaN NaN
So after that I need to drop flag_y and val_y, rename the _x columns, and manually merge qwe_x and qwe_y. Is there any way to make this easier?
pd.merge has an on argument that you can use to join columns with the same name in different dataframes.
Try:
pd.merge(df, dff, how="left", on=['flag', 'qwe', 'val'])
However, I don't think you need to do that at all. You can produce the same result using df.loc to conditionally assign a value (note the blanks in 'qwe' are empty strings, so test for "" rather than null):
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
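If you would rather work with real missing values, a sketch converting the blanks to NaN first:
import numpy as np

df['qwe'] = df['qwe'].replace('', np.nan)   # turn empty strings into NaN
df.loc[(df['flag'] == '1') & (df['qwe'].isna()), 'qwe'] = '123'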
After the edit, this code works for me:
c1 = dff.combine_first(df)
It produces:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
Which is exactly what I was looking for.
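An in-place alternative worth knowing (a sketch): DataFrame.update overwrites df with the non-NA values from dff, aligned on the index, which gives the same table here:
df.update(dff)   # modifies df in place; rows 0 and 2 pick up qwe = "123"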

How to assign a different row's value to a new column

I'm trying to add a column, 'C_End', to a DataFrame in Pandas that looks something like this:
df = pd.DataFrame({'ID': [123, 123, 123, 456, 456, 789],
                   'C_ID': [8, 10, 35, 36, 40, 7],
                   'C_Type': ['New', 'Renew', 'Renew', 'New', 'Term', 'New'],
                   'Rank': [1, 2, 3, 1, 2, 1]})
The new column needs to be the next 'C_Type' for each ID based on 'Rank', resulting in a DataFrame that looks like this:
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 None
3 456 36 New 1 Term
4 456 40 Term 2 None
5 789 7 New 1 None
Essentially, I want to find the row where ID = ID and Rank = Rank+1 and assign C_Type to new column C_End. I've tried creating a function and using Apply (below), but that took forever and eventually gave me an error. I'm still new to Pandas and Python in general, but I feel like there has to be an easy solution that I'm not seeing.
def get_next_c_type(row):
    return df.loc[(df['ID'] == row['ID']) & (df['Rank'] == row['Rank'] + 1), 'C_Type']
df['C_End'] = df.apply(get_next_c_type, axis=1)
Try:
df['C_End'] = df.sort_values('Rank').groupby('ID')['C_Type'].transform('shift',-1)
Or, as @W-B suggests:
df['C_End'] = df.sort_values('Rank').groupby('ID')['C_Type'].shift(-1)
Output:
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 NaN
3 456 36 New 1 Term
4 456 40 Term 2 NaN
5 789 7 New 1 NaN
Here's one way using np.where (this assumes the rows are already sorted by ID and Rank, and that numpy is imported as np):
dfs = df.shift(-1)             # each row's potential successor
m1 = df.ID == dfs.ID           # same ID
m2 = df.Rank + 1 == dfs.Rank   # consecutive rank
df.loc[:, 'C_End'] = np.where(m1 & m2, dfs.C_Type, None)
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 None
3 456 36 New 1 Term
4 456 40 Term 2 None
5 789 7 New 1 None
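The literal lookup from the question ("find the row where ID = ID and Rank = Rank + 1") can also be written as a self-merge. A sketch, starting from the original df before any C_End column has been added:
# build a lookup keyed on (ID, Rank) of the preceding row
lookup = df[['ID', 'Rank', 'C_Type']].copy()
lookup['Rank'] -= 1                                   # shift ranks down by one
lookup = lookup.rename(columns={'C_Type': 'C_End'})
df = df.merge(lookup, on=['ID', 'Rank'], how='left')  # unmatched rows get NaN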

Is it possible to change the MultiIndex to a normal index in Python

So I have a dataset with 4 IDs; each ID has 70 values (present and absent). I counted the number of present and absent values with the following code:
df=pd.pivot_table(df,index=["ID",'status'], values=["Sem1"], aggfunc=[len]).reset_index()
df['ID'] = df['ID'].mask(df['ID'].duplicated(), '')
df
ID Status len
Sem1
4234 Present 45
Absent 25
4235 Present 40
Absent 30
4236 Present 35
Absent 35
4237 Present 50
Absent 20
In: df.columns
Out: MultiIndex(levels=[['len', 'status', 'ID'], ['sem1', '']],
                labels=[[2, 1, 0], [1, 1, 0]])
I need to take the columns separately, to be added to two different data frames. Is there any way to take the columns separately?
Also, I need to know if it can be changed into the following dataset:
ID Status Sem1
4234 Present 45
Absent 25
4235 Present 40
Absent 30
4236 Present 35
Absent 35
4237 Present 50
Absent 20
In: df.columns
Out: Index(['ID', 'Status', 'Sem1'], dtype='object')
Can this be done from the previous dataset?
Your solution works fine for me:
df = pd.DataFrame({'Sem1': [1, 3, 5, 7, 1, 0],
                   'Sem2': [5, 3, 6, 9, 2, 4],
                   'ID': list('aaabbb')})
print(df)
Sem1 Sem2 ID
0 1 5 a
1 3 3 a
2 5 6 a
3 7 9 b
4 1 2 b
5 0 4 b
df1 = df.groupby('ID').mean().reset_index()
print (df1)
ID Sem1 Sem2
0 a 3.000000 4.666667
1 b 2.666667 5.000000
EDIT:
Remove the [] around "Sem1" so the values column comes out flat:
df = pd.pivot_table(df,index=["ID",'status'], values="Sem1", aggfunc='size').reset_index()
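And to answer the title question directly: once the columns are a MultiIndex, you can flatten them back to a normal Index. A sketch, assuming df.columns is a MultiIndex of tuples, joining the non-empty level values with an underscore:
df.columns = ['_'.join(filter(None, map(str, col))) for col in df.columns]
# e.g. ('len', 'Sem1') -> 'len_Sem1', ('ID', '') -> 'ID'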

Concatenate pandas dataframes via a column, filling in blanks with NaN

I have four pandas dataframes that can be generated with the below code:
import pandas

#df 1
time1=pandas.Series([0,20,40,60,120])
pPAK2=pandas.Series([0,3,15,21,23])
cols=['time','pPAK2']
df=pandas.DataFrame([time1,pPAK2])
df=df.transpose()
df.columns=cols
df.to_csv('pPAK2.csv',sep='\t')
pak2_df=df
#df2
time2=pandas.Series([0,15,30,60,120])
cAbl=pandas.Series([0,15,34,10,0])
df=pandas.DataFrame([time2,cAbl])
df=df.transpose()
cols=['time','pcAbl']
df.columns=cols
df.to_csv('pcAbl.csv',sep='\t')
pcAbl_df=df
#df 3
time7=pandas.Series([0,60,120,240,480,960,1440])
pSmad3_n=pandas.Series([0,16,14,12,8,7.5,6])
scale_factor=40
pSmad3_n=pSmad3_n*scale_factor
#plt.plot(time7,pSmad3)
df=pandas.DataFrame([time7,pSmad3_n])
df=df.transpose()
cols=['time','pSmad3_n']
df.columns=cols
df.to_csv('pSmad3_n.csv',sep='\t')
smad3_df=df
#df4
time8=pandas.Series([0,240,480,1440])
PAI1_mRNA=pandas.Series([0,23,25,5])
scale_factor=5
PAI1_mRNA=PAI1_mRNA*scale_factor
df=pandas.DataFrame([time8,PAI1_mRNA])
df=df.transpose()
cols=['time','PAI1_mRNA']
df.columns=cols
df.to_csv('PAI1_mRNA.csv',sep='\t')
PAI1_df=df
#print dataframes
print(PAI1_df)
print(pak2_df)
print(pcAbl_df)
print(smad3_df)
I want to concatenate these dataframes by the time column with the pandas concat function, but I can't get the output right. The output should look something like this if I were to concatenate just PAI1_df and pak2_df:
time PAI1_mRNA pPAK2
0 0 0 0
1 20 'NaN' 3
2 40 'NaN' 15
3 60 'NaN' 21
4 120 'NaN' 23
5 240 115 'NaN'
6 480 125 'NaN'
7 1440 25 'NaN'
I think it should be easy, but there are a lot of features in the docs. Does anybody know how to do this?
You can concat them like this:
import pandas
df = pandas.concat([pak2_df.set_index('time'), pcAbl_df.set_index('time')], axis=1).reset_index()
print(df)
Prints:
time pPAK2 pcAbl
0 0 0 0
1 15 NaN 15
2 20 3 NaN
3 30 NaN 34
4 40 15 NaN
5 60 21 10
6 120 23 0
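The same pattern extends to all four frames at once; a sketch that outer-joins them on the shared time index:
df_all = pandas.concat(
    [d.set_index('time') for d in (pak2_df, pcAbl_df, smad3_df, PAI1_df)],
    axis=1).reset_index()
print(df_all)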
