Match certain column values with multiple other columns - python

I have a data frame, df, where I would like to take certain values from multiple columns and append them to other columns given certain criteria.
Data
id   value  type_a       pos  date   stat  type_b       id   date2
aaa  10     aaa_q2.25_1  30   q1.22  aaa   NaN          NaN  NaN
aaa  20     aaa_q3.25_2  30   q1.22  aaa   NaN          NaN  NaN
aaa  500    aaa_q1.22_3  30   q1.22  aaa   aaa_q1.22_3  aaa  NaN
bbb  20     bbb_q1.22_1  20   q1.22  bbb   bbb_q1.22_1  bbb  NaN
bbb  10     bbb_q3.25_4  20   q2.22  bbb   NaN          NaN  NaN
aaa  5      aaa_q2.22_3  30   q2.22  aaa   aaa_q2.22_3  aaa  NaN
ccc  15     ccc_q3.22_1  50   q3.22  ccc   ccc_q3.22_1  ccc  NaN
NaN  NaN    NaN          NaN  NaN    ccc   ccc_q4.26_2  ccc  q1.22
NaN  NaN    NaN          NaN  NaN    aaa   ccc_q2.22_2  aaa  q1.22
NaN  NaN    NaN          NaN  NaN    ccc   ccc_q2.22_3  ccc  q1.22
Desired
Logic: for the empty cells in 'id', 'type_a', and 'date', take the values from the 'stat', 'type_b', and 'date2' columns and apply them to 'id', 'type_a', and 'date'.
id   value  type_a       pos  date   stat  type_b       id   date2
aaa  10     aaa_q2.25_1  30   q1.22  aaa   NaN          NaN  NaN
aaa  20     aaa_q3.25_2  30   q1.22  aaa   NaN          NaN  NaN
aaa  500    aaa_q1.22_3  30   q1.22  aaa   aaa_q1.22_3  aaa  NaN
bbb  20     bbb_q1.22_1  20   q1.22  bbb   bbb_q1.22_1  bbb  NaN
bbb  10     bbb_q3.25_4  20   q2.22  bbb   NaN          NaN  NaN
aaa  5      aaa_q2.22_3  30   q2.22  aaa   aaa_q2.22_3  aaa  NaN
ccc  15     ccc_q3.22_1  50   q3.22  ccc   ccc_q3.22_1  ccc  NaN
ccc  NaN    ccc_q4.26_2  50   q1.22  ccc   ccc_q4.26_2  ccc  q1.22
aaa  NaN    aaa_q2.22_2  30   q1.22  aaa   aaa_q2.22_2  aaa  q1.22
ccc  NaN    ccc_q2.22_3  50   q1.22  ccc   ccc_q2.22_3  ccc  q1.22
Doing
An SO member has pointed me in the right direction and the code works well; however, on this particular problem I want to manipulate more than one column.
df['id'] = df['id'].fillna(df['stat'].str[:3])
df['type_a'] = df['type_a'].fillna(df['type_b'].str[:11])
df['date'] = df['date'].fillna(df['date2'].str[:5])
Any suggestion is appreciated

Try via fillna():
df.iloc[:, 0] = df.iloc[:, 0].fillna(df['stat'])
# you have to use iloc since there are two columns named 'id'
df['type_a'] = df['type_a'].fillna(df['type_b'])
df['date'] = df['date'].fillna(df['date2'])
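If more paired columns ever need the same treatment, a small loop keeps it compact. This is just a sketch: the ('type_a', 'type_b') and ('date', 'date2') pairs come from the question, and positional access handles the duplicated 'id' label.
# fill the duplicated 'id' column (position 0) from 'stat'
df.iloc[:, 0] = df.iloc[:, 0].fillna(df['stat'])

# fill each remaining target column from its paired source column
for target, source in [('type_a', 'type_b'), ('date', 'date2')]:
    df[target] = df[target].fillna(df[source])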

Related

How can I use the equivalent of pandas pivot_table() in pyspark?

I have a PySpark DataFrame that I want to pivot.
input_dataframe:
mdn    top_protocol_by_vol  top_vol  rank
55555  AAA                  30       1
55555  BBB                  20       2
55555  DDD                  10       3
9898   JJJ                  30       1
9898   CCC                  20       2
9898   FFF                  10       3
2030   PPP                  30       1
2030   KKK                  20       2
2030   FFF                  10       3
and I want to have something like this
and I want to have something like this
output_dataframe:
mdn    top_protocol_by_vol_1  top_protocol_by_vol_2  top_protocol_by_vol_3  top_vol_1  top_vol_2  top_vol_3
2030   PPP                    KKK                    FFF                    30         20         10
9898   JJJ                    CCC                    FFF                    30         20         10
55555  AAA                    BBB                    DDD                    30         20         10
I know for sure that I can do something like this with Pandas using the code:
output_dataframe = input_dataframe.pivot_table(index='mdn', columns=['rank'],
                                               aggfunc=lambda x: ''.join(x) if isinstance(x, str) else x,
                                               dropna=True).reset_index()
output_dataframe.columns = ['_'.join([str(c) for c in col if c != ""]) for col in output_dataframe.columns.values]
How can I achieve the same results with PySpark, without converting to Pandas?
You can use the pivot function with first as the aggregate.
from pyspark.sql import functions as F

df = (df.groupby('mdn')
        .pivot('rank')
        .agg(F.first('top_protocol_by_vol').alias('top_protocol_by_vol'),
             F.first('top_vol').alias('top_vol')))
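With two aggregations, the pivot produces columns named like 1_top_protocol_by_vol and 1_top_vol. If the top_protocol_by_vol_1 style from the desired output matters, a follow-up rename is one option; this is a sketch that assumes that rank-prefixed naming pattern:
# hypothetical cleanup: turn "1_top_protocol_by_vol" into "top_protocol_by_vol_1"
for col in df.columns:
    prefix, _, rest = col.partition('_')
    if prefix.isdigit():
        df = df.withColumnRenamed(col, f'{rest}_{prefix}')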

DataFrame MultiIndex - find column by value

I have a multiindex dataframe with two layers of indices and roughly 100 columns. I would like to get groups of values (organized in columns) based on the presence of a certain value, but I am still struggling with the indexing mechanics.
Here is some example data:
import numpy as np
import pandas as pd

index_arrays = [np.array(["one"]*5 + ["two"]*5),
                np.array(["aaa", "bbb", "ccc", "ddd", "eee"]*2)]
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                   [10, 11, 12], [13, 14, 15], [16, 1, 17],
                   [18, 19, 20], [21, 22, 23], [24, 25, 26],
                   [27, 28, 29]], index=index_arrays)
Gives
         0   1   2
one aaa  1   2   3
    bbb  4   5   6
    ccc  7   8   9
    ddd  10  11  12
    eee  13  14  15
two aaa  16  1   17
    bbb  18  19  20
    ccc  21  22  23
    ddd  24  25  26
    eee  27  28  29
Now, for each level_0 index (one and two), I want to return the entire column in which the level_1 index of aaa equals a certain value, for example 1.
What I got so far is this:
df[df.loc[(slice(None), "aaa"),:]==1].any(axis=1)
>
one aaa True
bbb False
ccc False
ddd False
eee False
two aaa True
bbb False
ccc False
ddd False
eee False
Instead of the boolean values, I would like to retrieve the actual values. The expected output would be:
expected:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
two aaa 1
bbb 19
ccc 22
ddd 25
eee 28
I would appreciate your help.
Bonus question: Additionally, it would be great to know which column contains the values in question. For the example above, this would be column 0 (for index one) and column 1 (for index two). Is there a way to do this?
Thanks!
This might be what you're looking for:
df.loc[df.index.get_level_values(0) == 'one', df.loc[('one', 'aaa')] == 1]
This outputs:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
To combine the results for all of the different values of the first index, generate these DataFrames and concatenate them:
pieces = []
for level_0_val in df.index.get_level_values(0).unique():
    pieces.append(df.loc[df.index.get_level_values(0) == level_0_val,
                         df.loc[(level_0_val, 'aaa')] == 1])
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate them
output_df = pd.concat(pieces)
Here is output_df:
0 1
one aaa 1.0 NaN
bbb 4.0 NaN
ccc 7.0 NaN
ddd 10.0 NaN
eee 13.0 NaN
two aaa NaN 1.0
bbb NaN 19.0
ccc NaN 22.0
ddd NaN 25.0
eee NaN 28.0
You can then generate your desired output from this.
Let's try with DataFrame.xs:
m = df.xs('aaa', level=1).eq(1).any()
Or with pd.IndexSlice:
m = df.loc[pd.IndexSlice[:, 'aaa'], :].eq(1).any()
Result:
df.loc[:, m]
0 1
one aaa 1 2
bbb 4 5
ccc 7 8
ddd 10 11
eee 13 14
two aaa 16 1
bbb 18 19
ccc 21 22
ddd 24 25
eee 27 28
df.columns[m]
Int64Index([0, 1], dtype='int64')
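If the goal is the single-column expected output from the question (plus the matching column label per level-0 group, for the bonus part), a small per-group loop also works. This is a sketch, assuming exactly one column matches per group:
import pandas as pd

parts = []
matching_cols = {}
for key, sub in df.groupby(level=0):
    # boolean mask over columns: which column equals 1 in this group's 'aaa' row
    mask = sub.xs('aaa', level=1).iloc[0].eq(1)
    col = mask.idxmax()                  # first matching column label
    matching_cols[key] = col             # bonus: remember which column it was
    parts.append(sub[col])
result = pd.concat(parts).to_frame(0)    # single column named 0, as in the expected output
matching_cols gives {'one': 0, 'two': 1} for the example data.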

Groupby names, replace values with their max value in all columns pandas

I have this DataFrame
lst = [['AAA', 15, 'BBB', 20], ['BBB', 16, 'AAA', 12], ['BBB', 22, 'CCC', 15],
       ['CCC', 11, 'AAA', 31], ['DDD', 25, 'EEE', 35]]
df = pd.DataFrame(lst, columns=['name1', 'val1', 'name2', 'val2'])
which looks like this
name1 val1 name2 val2
0 AAA 15 BBB 20
1 BBB 16 AAA 12
2 BBB 22 CCC 15
3 CCC 11 AAA 31
4 DDD 25 EEE 35
I want this
name1 val1 name2 val2
0 AAA 31 BBB 22
1 BBB 22 AAA 31
2 BBB 22 CCC 15
3 CCC 15 AAA 31
4 DDD 25 EEE 35
i.e. all values replaced with the maximum value for that name, where the maximum is taken across both val1 and val2.
If I do this I will only get the maximum from val1:
df["val1"] = df.groupby("name1")["val1"].transform("max")
Try using pd.wide_to_long to melt that dataframe into long form, then use groupby to find the max value per name. Map that max value back onto 'name' and reshape back to a four-column (wide) dataframe:
df_long = pd.wide_to_long(df.reset_index(), ['name', 'val'], 'index', j='num', sep='', suffix=r'\d+')
mapper = df_long.groupby('name')['val'].max()
df_long['val'] = df_long['name'].map(mapper)
df_new = df_long.unstack()
df_new.columns = [f'{i}{j}' for i, j in df_new.columns]
df_new
Output:
name1 name2 val1 val2
index
0 AAA BBB 31 22
1 BBB AAA 22 31
2 BBB CCC 22 15
3 CCC AAA 15 31
4 DDD EEE 25 35
Borrowing Scott's setup:
df_long = pd.wide_to_long(df.reset_index(), ['name', 'val'], 'index', j='num', sep='', suffix=r'\d+')
d = df_long.groupby('name')['val'].max()
df.loc[:, df.columns.str.startswith('val')] = df.loc[:, df.columns.str.startswith('name')].replace(d).values
df
Out[196]:
name1 val1 name2 val2
0 AAA 31 BBB 22
1 BBB 22 AAA 31
2 BBB 22 CCC 15
3 CCC 15 AAA 31
4 DDD 25 EEE 35
You can use lreshape (undocumented, and it's unclear whether it is tested or will remain) to get the long DataFrame, then map each pair of columns using the max.
names = df.columns[df.columns.str.startswith('name')]
vals = df.columns[df.columns.str.startswith('val')]
s = (pd.lreshape(df, groups={'name': names, 'val': vals})
       .groupby('name')['val'].max())

for n in names:
    df[n.replace('name', 'val')] = df[n].map(s)
name1 val1 name2 val2
0 AAA 31 BBB 22
1 BBB 22 AAA 31
2 BBB 22 CCC 15
3 CCC 15 AAA 31
4 DDD 25 EEE 35
This builds off @ScottBoston's answer:
res = pd.wide_to_long(df.reset_index(), ["name", "val"], "index", j="num")
res.update(res.groupby(["name"]).val.transform("max"))
res = res.unstack()
res.columns = [f"{first}{last}" for first, last in res.columns]
res.rename_axis(index=None)
name1 name2 val1 val2
0 AAA BBB 31 22
1 BBB AAA 22 31
2 BBB CCC 22 15
3 CCC AAA 15 31
4 DDD EEE 25 35
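For comparison, a compact alternative sketch: it builds the same name-to-max mapping without reshaping to long form, by stacking the two (name, val) pairs with pd.concat and mapping the result back.
import pandas as pd

# combine both (name, val) pairs into one Series indexed by name, then take the max per name
s = pd.concat([df.set_index('name1')['val1'],
               df.set_index('name2')['val2']]).groupby(level=0).max()

df['val1'] = df['name1'].map(s)
df['val2'] = df['name2'].map(s)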

collating multiple rows of a column in a pandas DataFrame to one row while maintaining the data type of the column

I have a pandas DataFrame with a few columns like this
username A time place
AAA B 1 YYY
AAA C 2 YYY
AAA D 1 YYY
AAA B 3 ZZZ
AAA C 4 ZZZ
AAA B 3 ZZZ
BBB B 1 YYY
BBB C 2 YYY
BBB D 1 YYY
BBB B 7 ZZZ
BBB C 8 ZZZ
BBB B 9 ZZZ
CCC B 6 YYY
CCC C 5 YYY
CCC D 8 YYY
CCC B 7 ZZZ
CCC C 8 ZZZ
CCC B 9 ZZZ
In the above DataFrame, all the columns except time are strings; time is a float column.
I am trying to create a sequence such that, for every username, all of that username's rows are collated into one row. The output DataFrame should look like this.
username A time place
AAA B+C+D+B+C+B 1+2+1+3+4+3 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
BBB B+C+D+B+C+B 1+2+1+7+8+9 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
CCC B+C+D+B+C+B 6+5+8+7+8+9 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
I am using '+' as a separator, but it can be any character generally used as a separator (like ',', '/', '\', etc.).
I have been able to do that for all the columns using
df.groupby('username')['A'].apply('+'.join).reset_index()
and the same for the other columns. I am finally merging all the individual DataFrames to get the form I want.
For the time column I can do the same, but I would like to keep it as a column of floats, and I am having difficulty doing that. Hoping somebody more knowledgeable can guide me here.
I have even tried changing the output column after the fact with
df['time'].astype(float)
but I am getting all NaNs.
I believe you need to convert all columns to strings and then aggregate with agg:
df = df.astype(str).groupby('username', as_index=False).agg('+'.join)
print (df)
username A time place
0 AAA B+C+D+B+C+B 1.0+2.0+1.0+3.0+4.0+3.0 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
1 BBB B+C+D+B+C+B 1.0+2.0+1.0+7.0+8.0+9.0 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
2 CCC B+C+D+B+C+B 6.0+5.0+8.0+7.0+8.0+9.0 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
If you need to sum the numeric columns and join the string columns with '+':
import numpy as np

df = (df.groupby('username', as_index=False)
        .agg(lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else '+'.join(x)))
print (df)
username A time place
0 AAA B+C+D+B+C+B 14.0 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
1 BBB B+C+D+B+C+B 28.0 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
2 CCC B+C+D+B+C+B 43.0 YYY+YYY+YYY+ZZZ+ZZZ+ZZZ
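If the real goal is to keep time numeric rather than sum it or turn it into a string, one option is to collect it into a list of floats per user while still joining the string columns. A sketch; which aggregation you actually want for time is an assumption:
import pandas as pd

out = (df.groupby('username', as_index=False)
         .agg({'A': '+'.join, 'place': '+'.join, 'time': list}))
# out['time'] now holds a list of floats per username, e.g. [1.0, 2.0, 1.0, 3.0, 4.0, 3.0] for AAA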

count elements in pandas dataframe with groupby and attach them to existing dataframe

I have a pandas dataframe with the following structure:
date ticker Name
2/1/10 aaa zzz
2/1/10 aaa yyy
2/5/10 bbb xxx
2/5/10 ccc www
2/5/10 ccc qqq
2/5/10 ddd vvv
2/6/10 aaa zzz
I would like to add a column to every row with the number of times the same ticker appears on the same date. So the output would look like this:
date ticker Name count
2/1/10 aaa zzz 2
2/1/10 aaa yyy 2
2/5/10 bbb xxx 1
2/5/10 ccc www 2
2/5/10 ccc qqq 2
2/5/10 ddd vvv 1
2/6/10 aaa zzz 1
At the moment I am able to get the number of times each ticker appears on a given date, but only in a reduced DataFrame, so I can't elegantly fit it back into the original DataFrame.
This is what I was trying:
grpby2 = df2.groupby(['Date','Ticker'])
tmp = grpby2.agg({'Ticker':'max','Name':'count'}).reset_index(1,drop=True).reset_index(drop=False)
Thanks
Using groupby + transform with 'count':
df['count'] = df.groupby(['date', 'ticker']).transform('count')
print(df)
date ticker Name count
0 2/1/10 aaa zzz 2
1 2/1/10 aaa yyy 2
2 2/5/10 bbb xxx 1
3 2/5/10 ccc www 2
4 2/5/10 ccc qqq 2
5 2/5/10 ddd vvv 1
6 2/6/10 aaa zzz 1
This also works with len, but that option is significantly slower as it does not use the optimized functions selected by passing a string.
np.bincount and pd.factorize
f, u = pd.factorize(list(zip(df.date, df.ticker)))
df.assign(Count=np.bincount(f)[f])
date ticker Name Count
0 2/1/10 aaa zzz 2
1 2/1/10 aaa yyy 2
2 2/5/10 bbb xxx 1
3 2/5/10 ccc www 2
4 2/5/10 ccc qqq 2
5 2/5/10 ddd vvv 1
6 2/6/10 aaa zzz 1
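As a side note, transforming with 'size' behaves the same here but counts rows rather than non-null values, which can be safer if 'Name' ever contains NaN. A sketch of that variant:
df['count'] = df.groupby(['date', 'ticker'])['Name'].transform('size')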
