I have a dataframe called products. It looks like this:
Cust_ID Prod Time_of_Sale
A Bat 1
A Ball 2
A Lego 3
B Lego 3
B Lego 9
B Ball 11
B Bat 11
B Bat 13
C Bat 2
C Lego 2
I want to change it so that it becomes like this:
Cust_ID Bat Bat Ball Lego Lego
A 1 NaN 2 3 NaN
B 11 13 11 3 9
C 2 NaN NaN 2 NaN
I have been playing around with products.groupby() and it is not really leading me anywhere. Any help is appreciated.
The aim is to 'visualize' the order in which each item was purchased by each customer. I have more than 1000 unique customers.
Edit:
I see that a user suggested I go through "How to pivot a dataframe", but that doesn't work here because my columns have duplicate values.
This is a little tricky with duplicates on Prod. Basically you need a cumcount and pivot:
new_df = (df.set_index(['Cust_ID', 'Prod',
                        df.groupby(['Cust_ID', 'Prod']).cumcount()])  # per-customer occurrence counter for each Prod
            ['Time_of_Sale']
            .unstack(level=(1, 2))  # move Prod and the counter into the columns
            .sort_index(axis=1)     # group repeated products next to each other
         )
new_df.columns = [prod for prod, _ in new_df.columns]  # drop the counter level, leaving duplicate names
new_df = new_df.reset_index()
Output:
Cust_ID Ball Bat Bat Lego Lego
0 A 2.0 1.0 NaN 3.0 NaN
1 B 11.0 11.0 13.0 3.0 9.0
2 C NaN 2.0 NaN 2.0 NaN
Note: duplicated column names, although supported, should be avoided in pandas.
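If you would rather keep the column names unique, you can keep the cumcount level as a suffix instead of dropping it. A minimal sketch that would replace the column-flattening line above (the name format is just one choice):
new_df.columns = [f'{prod}_{i}' for prod, i in new_df.columns]  # e.g. Bat_0, Bat_1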
I have df as shown below
df:
player goals_oct goals_nov
messi 2 4
neymar 2 NaN
ronaldo NaN 3
salah NaN NaN
levenoski 2 2
I would like to calculate the average number of goals scored by each player: the mean of goals_oct and goals_nov when both are available, the single available value when only one is, and NaN when both are missing.
Expected output
player goals_oct goals_nov avg_goals
messi 2 4 3
neymar 2 NaN 2
ronaldo NaN 3 3
salah NaN NaN NaN
levenoski 2 2 2
I tried the code below, but it did not work:
conditions_g = [(df['goals_oct'].isnull() and df['goals_nov'].notnull()),
(df['goals_oct'].notnull() and df['goals_nov'].isnull())]
choices_g = [df['goals_nov'], df['goals_oct']]
df['avg_goals']=np.select(conditions_g, choices_g, default=(df['goals_oct']+df['goals_nov'])/2)
Simply use mean(axis=1); it skips NaNs by default. (Your np.select attempt fails because Python's and cannot combine boolean Series element-wise; you would need & with parentheses.)
columns = df.columns[1:] # all columns except the first
df['avg_goal'] = df[columns].mean(axis=1)
Output:
>>> df
player goals_oct goals_nov avg_goal
0 messi 2.0 4.0 3.0
1 neymar 2.0 NaN 2.0
2 ronaldo NaN 3.0 3.0
3 salah NaN NaN NaN
4 levenoski 2.0 2.0 2.0
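If you would rather name the columns explicitly instead of slicing by position, this is equivalent:
df['avg_goal'] = df[['goals_oct', 'goals_nov']].mean(axis=1)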
Try this; it will work:
df["avg_goals"] = np.where(df.goals_oct.isnull(),
np.where(df.goals_nov.isnull(), np.NaN, df.goals_nov),
np.where(df.goals_nov.isnull(), df.goals_oct, (df.goals_oct + df.goals_nov) / 2))
If you want to treat 0 as an empty value, convert the zeros to np.nan first and then apply the statement above.
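For example, assuming only the two goal columns should be affected:
df[['goals_oct', 'goals_nov']] = df[['goals_oct', 'goals_nov']].replace(0, np.nan)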
I was extracting tables from a PDF with tabula-py, but where a single table row spanned more than one line in the PDF, tabula-py converted it into multiple DataFrame rows. I'm giving a sample here:
Serial No. Name Type Total
0 1 Easter Multiple 19
1 2 Costeri Roundabout 16
2 3 Zhiop Tee 16
3 4 Nesss Cross 10
4 5 Uoar Lhahara Tee 10
5 6 Trino Nishra (KX) Tee 9
6 7 Old-FX Box Cross 8
7 8 Gardeners Roundabout 8
8 9 Max Detter Roundabout 7
9 NaN Others (Asynco, NaN NaN
10 10 D+ E, Cross 7
11 NaN etc) NaN NaN
If you look at the sample you will see that the rows at indices 9, 10, and 11 are actually a single row that spanned multiple lines in the PDF table. The table has more than 100 rows, and this issue occurs in at least 12 places; sometimes it is 2 consecutive rows and sometimes 3. How can I merge those rows using the index values?
You can try:
df['Serial No.'] = df['Serial No.'].bfill().ffill()  # spread each serial number onto its continuation rows
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)  # stringify Total so it can be joined, restoring real NaNs
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))  # glue the split fragments back together
df_out['Total'] = df_out['Total'].replace('', np.nan).astype(float)  # back to numeric
Result:
print(df_out)
Serial No. Name Type Total
0 1.0 Easter Multiple 19.0
1 2.0 Costeri Roundabout 16.0
2 3.0 Zhiop Tee 16.0
3 4.0 Nesss Cross 10.0
4 5.0 Uoar Lhahara Tee 10.0
5 6.0 Trino Nishra (KX) Tee 9.0
6 7.0 Old-FX Box Cross 8.0
7 8.0 Gardeners Roundabout 8.0
8 9.0 Max Detter Roundabout 7.0
9 10.0 Others (Asynco,D+ E,etc) Cross 7.0
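If the merged fragments should keep a space between them (e.g. 'Others (Asynco, D+ E, etc)'), a space-separated join is one option:
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ' '.join(x.dropna()))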
I have two dataframes that look like this:
import pandas as pd
df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
('A','b',5),('A','b',6),('B','c',7),
('B','c',9),('B','d',10),('B','d',11),
], names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.
Of course a loop over the dataframe would work as well, though that is not feasible for large dataframes.
EDIT:
It appears from the comments below that I should add: the indices big and small should be inferred for the t values of df1 based on the ordering of t.
Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an outer merge, sort the values, and then re-create the MultiIndex using ffill logic (we need to go through a Series here, since Index has no ffill method).
res = (df2.reset_index()
          .merge(df1, on='t', how='outer')  # 't' matches df1's index level name
          .set_index(df2.index.names)
          .sort_index(level='t'))
res.index = pd.MultiIndex.from_arrays(
    [pd.Series(res.index.get_level_values(i)).ffill()  # forward-fill the gaps the merge left in 'big' and 'small'
     for i in range(res.index.nlevels)],
    names=res.index.names)
print(res)
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
Try extracting the level values and reindexing df1 (note that this fills values only for rows already present in df2):
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
0
big small t
A a 1 NaN
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
9 NaN
d 10 10.5
11 NaN
For more columns in df1, we can just merge:
(df2.reset_index()
.merge(df1, on='t', how='left')
.set_index(df2.index.names)
)
Suppose I have the following contrived example:
ids types values
1 a 10
1 b 11
1 c 12
2 a -10
2 b -11
3 a 100
Is there a way to use pandas.pivot() to get the following table?
ids a b c
1 10 11 12
2 -10 -11 NaN
3 100 NaN NaN
You could try something like this:
df.pivot(index='ids', columns='types', values='values')
types a b c
ids
1 10.0 11.0 12.0
2 -10.0 -11.0 NaN
3 100.0 NaN NaN
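Note that pivot raises a ValueError when an (ids, types) pair appears more than once. With duplicates, pivot_table with an explicit aggregation is one way out; the aggfunc (here 'first') is an assumption about which duplicate you want to keep:
df.pivot_table(index='ids', columns='types', values='values', aggfunc='first')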
I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of the row with the previous row, i.e. the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
df['e'] = df['d'].shift(-1)  # next row's value
df['f'] = df['d'] + df['e']  # pairwise sum
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)  # the rolling sum is labeled at the window's last row; shift(-1) pairs each row with the next
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
df.cumsum() is the cumulative counterpart: .diff() and .cumsum() are inverse operations (up to the first row). Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
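As a round-trip check on the frame above: differencing and then cumulatively summing recovers the original values once the leading NaN row is filled back in with the first row:
df.diff().fillna(df.iloc[0]).cumsum()  # reproduces df (as floats)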
If you cannot use rolling, due to a MultiIndex or for some other reason, you can use .cumsum() and then .diff(2) to subtract the cumulative sum from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, .cumsum().diff(n) gives the sum of n consecutive values, which is the same as .rolling(n).sum(); the opposite of .diff() shown above is therefore .cumsum().diff(2). The trade-off is that the first n results come out as NaN, one more leading NaN than rolling produces.
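A quick check of that equivalence, using the series from the example above:
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
n = 2
a = s.rolling(n).sum()   # NaN for the first n-1 rows
b = s.cumsum().diff(n)   # NaN for the first n rows
print(a.iloc[n:].equals(b.iloc[n:]))  # True: identical from row n onward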