I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the following:
a) sort the column names based on the quarters (e.g. Q1, Q2, Q3, Q4, Q5 .. Q100 .. Q1000) for each column pattern
b) by column pattern, I mean the keyword before the underscore, i.e. rev and tx.
So I tried the below, but it doesn't work and it also shifts the ID column to the back:
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be as below. In my real data, there are more than 100 columns with more than 30 patterns like rev, tx, etc. I want my ID column to stay in the first position, as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or, using manual sorting with np.lexsort:
import numpy as np

# Split each column name on '_Q' into (prefix, quarter number); 'ID' gets a NaN quarter.
idx = df.columns.str.split('_Q', expand=True, n=1)
# The last key passed to lexsort is the primary one, so this sorts by prefix first, then by quarter.
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
Something like:
new_order = list(df.columns)
new_order.remove("ID")                 # list.remove() mutates in place and returns None
new_order = ['ID'] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort what remains.
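Note that plain sorted() will place rev_Q10 before rev_Q2 once the quarters go past Q9. A minimal sketch of a sort key that parses the quarter number (assuming all non-ID columns follow the prefix_Q<number> pattern; the sample columns are taken from the question):
import pandas as pd

df = pd.DataFrame(columns=['ID', 'rev_Q1', 'rev_Q5', 'rev_Q4', 'rev_Q3', 'rev_Q2',
                           'tx_Q3', 'tx_Q5', 'tx_Q2', 'tx_Q1', 'tx_Q4'])

def quarter_key(col):
    # Split 'rev_Q12' into ('rev', 12) so Q10 sorts after Q9, not after Q1.
    prefix, q = col.split('_Q')
    return prefix, int(q)

rest = [c for c in df.columns if c != 'ID']
df = df[['ID'] + sorted(rest, key=quarter_key)]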
The idea is to create a dataframe from the column names, with two columns: one for the variable and another for the quarter number. Finally, sort this dataframe by those values and extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5
I am sorry for being a noob, but I can't find a solution to my problem even after hours of searching.
import pandas as pd
df1 = pd.read_excel('df1.xlsx')
df1.set_index('time')
print(df1)
df2 = pd.read_excel('df2.xlsx')
df2.set_index('time')
print(df2)
new_df = pd.merge(df1, df2,how='outer')
print(new_df)
df1
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
df2
time bought
0 3 0
1 4 0
2 5 0
3 6 0
4 7 0
new_df
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 5 0
6 6 0
7 7 0
What I want is:
updating df1 (existing data) with df2 (new data feed); when it comes to the bought value, df1's data should come first
the new_df should have all unique time values from df1 and df2, without duplicates
I tried every method I found, but none produced my desired outcome, or they created unnecessary duplicates as above (two rows with a time value of 5).
The merge method created _x/_y suffixes or duplicates.
join() didn't work either.
What I desire should look like:
new_df
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 0
6 7 0
Thank you in advance
If you perform the merge as you have done, all you need to do is remove the duplicate rows, keeping only the more recent data.
drop_duplicates() takes the kwarg subset, which takes a list of columns, and keep, which sets which row to keep if there are duplicates.
In this case we only need to check for duplicates in the time column, and we keep the first row.
import pandas as pd
df1 = pd.read_excel('df1.xlsx')
df1.set_index('time')
print(df1)
df2 = pd.read_excel('df2.xlsx')
df2.set_index('time')
print(df2)
new_df = pd.merge(df1, df2, how='outer')
new_df = new_df.drop_duplicates(subset=['time'], keep='first').reset_index(drop=True)
print(new_df)
Output:
time bought
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 0
6 7 0
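For reference, here is a self-contained sketch of the same idea, with the two frames built inline instead of read from Excel (the inline data mirrors the frames shown in the question):
import pandas as pd

# Inline versions of df1 (existing data) and df2 (new data feed) from the question.
df1 = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'bought': [0, 0, 0, 0, 1]})
df2 = pd.DataFrame({'time': [3, 4, 5, 6, 7], 'bought': [0, 0, 0, 0, 0]})

# The outer merge keeps every (time, bought) pair from both frames; here df1's rows
# come first (as in the question's new_df), so keep='first' retains df1's value
# for time 5 when dropping duplicate time values.
new_df = (pd.merge(df1, df2, how='outer')
            .drop_duplicates(subset=['time'], keep='first')
            .reset_index(drop=True))
print(new_df)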
I have the following dataframe df:
names   status
John    Completed
James   To Do
Jill    To Do
Robert  In Progress
Jill    To Do
Jill    To Do
Marina  Completed
Evy     Completed
Evy     Completed
Now I want the count of each type of status for each user. I can get it like this for all types of statuses.
df = pd.crosstab(df.names,df.status).reset_index("names")
So now the resulting df is
status   names  Completed  In Progress  To Do
0        James          0            0      1
1       Robert          0            1      0
2         John          1            0      0
3       Marina          1            0      0
4         Jill          0            0      3
5          Evy          2            0      0
So my problem is: how can I specify only particular status values to be counted? For example, I want only the values of In Progress and Completed, and not To Do. And how can I add an extra column, called Total Statuses, that is the total number of rows for each name in the original dataframe?
Desired Dataframe:
status   names  Completed  In Progress  Total
0        James          0            0      1
1       Robert          0            1      1
2         John          1            0      1
3       Marina          1            0      1
4         Jill          0            0      3
5          Evy          2            0      2
Another way:
Pass the margins and margins_name parameters to pd.crosstab():
df = (pd.crosstab(df.names, df.status, margins=True, margins_name='Total').iloc[:-1]
        .reset_index().drop(columns='To Do'))
OR
via crosstab() + assign():
df = (pd.crosstab(df.names, df.status).assign(Total=lambda x: x.sum(axis=1))
        .reset_index().drop(columns='To Do'))
OR
in 2 steps:
df = pd.crosstab(df.names, df.status)
df = df.assign(Total=df.sum(axis=1)).drop(columns='To Do').reset_index()
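For reference, a self-contained sketch of the margins variant, with the sample data rebuilt from the question:
import pandas as pd

df = pd.DataFrame({'names':  ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
                   'status': ['Completed', 'To Do', 'To Do', 'In Progress', 'To Do',
                              'To Do', 'Completed', 'Completed', 'Completed']})

# margins=True appends a 'Total' row and column; iloc[:-1] drops the total row,
# and 'To Do' is removed only after its counts have been folded into 'Total'.
out = (pd.crosstab(df.names, df.status, margins=True, margins_name='Total')
         .iloc[:-1]
         .reset_index()
         .drop(columns='To Do'))
print(out)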
You can create the total from the addition of the three previous columns:
df['Total'] = (df['Completed'] + df['In Progress'] + df['To Do'])
Then you can drop the 'To Do' column from your new dataframe as follows:
df = df.drop(columns=['To Do'])
df = pd.DataFrame({'names': ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
'status':['Completed', 'To Do', 'To Do', 'In Progress', 'To Do', 'To Do', 'Completed', 'Completed', 'Completed']})
df = pd.crosstab(df.names,df.status).reset_index("names")
df['Total'] = df['Completed'] + df['In Progress'] + df['To Do']
df = df.drop(columns=['To Do'])
print(df)
Output:
status names Completed In Progress Total
0 Evy 2 0 2
1 James 0 0 1
2 Jill 0 0 3
3 John 1 0 1
4 Marina 1 0 1
5 Robert 0 1 1
I can't tell what row ordering you are using in your desired output, but I think you will manage that part yourself.
For example, I have a dataframe like this:
data = {'id': [1, 1, 1, 2, 2],
        'value': ['red', 'red and blue', 'yellow', 'oak', 'oak wood']}
df = pd.DataFrame(data, columns=['id', 'value'])
I want :
id value count
1 red 2
1 blue 1
1 yellow 1
2 oak 2
2 wood 1
Many thanks!
Solution for pandas 0.25+ with DataFrame.explode on lists created by Series.str.split, followed by GroupBy.size:
df1 = (df.assign(value = df['value'].str.split())
.explode('value')
.groupby(['id','value'], sort=False)
.size()
.reset_index(name='count'))
print (df1)
id value count
0 1 red 2
1 1 and 1
2 1 blue 1
3 1 yellow 1
4 2 oak 2
5 2 wood 1
For lower pandas versions, use DataFrame.set_index with Series.str.split and expand=True to get a DataFrame, reshape it with DataFrame.stack, create columns from the MultiIndex Series, and then use the same solution as above:
df1 = (df.set_index('id')['value']
.str.split(expand=True)
.stack()
.reset_index(name='value')
.groupby(['id','value'], sort=False)
.size()
.reset_index(name='count')
)
print (df1)
id value count
0 1 red 2
1 1 and 1
2 1 blue 1
3 1 yellow 1
4 2 oak 2
5 2 wood 1
I want to group by Date and Hour with a count aggregation function and split the result per distinct ID (as columns) in the output.
df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
'HR': [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
'Date_': [1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]
})
The output format should be like this:
df_out = pd.DataFrame({ 'HR': [1,2,3,1,2,3],
'Date_': [1,1,1,2,2,2],
'GpID_0': [1,2,5,1,4,2],
'GpID_1': [1,2,5,1,4,2],
'GpID_2': [4,2,5,1,4,2],
})
Tried:
# 1st try
df_g = df.groupby(["Hr", "Date_"], observed=False).count().fillna(0).unstack()
# 2nd try
df_g = df.groupby(["Hr", "Date_","GpId"], observed=False).count().fillna(0).unstack(-1)
# 3rd try
df_g = df.groupby(["Hr", "Date_"], observed=False).count().fillna(0).unstack()
Nothing accurate yet
I believe you were trying to do something like this:
In [1]:
import pandas as pd
df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
'HR': [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
'Date_': [1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]
})
df.loc[:,'Count']=1
pd.pivot_table(df, values='Count', index=['Date_', 'HR'], columns=['GpID'], aggfunc='count').fillna(0).reset_index()
Out [1]:
Date_ HR 0 1 2
0 1 1 1 2 0
1 2 1 1 3 0
2 2 2 0 2 2
3 3 1 0 2 1
4 3 2 1 2 1
5 3 3 1 2 0
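As a side note, a similar table can be produced with pd.crosstab without the helper Count column; a minimal sketch (the GpID_0/GpID_1/GpID_2 column names mirror the desired output in the question):
import pandas as pd

df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
                   'HR':   [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
                   'Date_':[1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]})

# Count rows per (Date_, HR, GpID), spread GpID into columns,
# then rename those columns to GpID_0, GpID_1, ... as in the desired output.
out = (pd.crosstab([df['Date_'], df['HR']], df['GpID'])
         .add_prefix('GpID_')
         .reset_index()
         .rename_axis(columns=None))
print(out)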