Sorting dataframe by specific column names in Pandas

How can I sort a pandas dataframe by specific column names?
My dataframe columns look like this:
+-------+-------+-----+------+------+----------+
|movieId| title |drama|horror|action| comedy |
+-------+-------+-----+------+------+----------+
| |
+-------+-------+-----+------+------+----------+
I would like to sort only the columns listed in columns = ['drama','horror','sci-fi','comedy'] by name, leaving movieId and title in place, so that I get the following dataframe:
+-------+-------+------+------+------+----------+
|movieId| title |action|comedy|drama | horror |
+-------+-------+------+------+------+----------+
| |
+-------+-------+------+------+------+----------+
I tried df = df.sort_index(axis=1) but it sorts all columns:
+-------+-------+------+------+-------+----------+
|action | comedy|drama |horror|movieId| title |
+-------+-------+------+------+-------+----------+
| |
+-------+-------+------+------+-------+----------+

Sort all columns after the second one, then prepend the first two columns:
c = df.columns[:2].tolist() + sorted(df.columns[2:].tolist())
print (c)
['movieId', 'title', 'action', 'comedy', 'drama', 'horror']
Finally, reorder the columns using this list:
df1 = df[c]
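Another idea is to use DataFrame.sort_index on every column except the first two (selected with DataFrame.iloc) and join the result back onto the first two with pd.concat. Assigning through df.iloc[:, 2:] = ... only writes values and leaves the column labels where they are, so build a new frame instead:
df = pd.concat([df.iloc[:, :2], df.iloc[:, 2:].sort_index(axis=1)], axis=1)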
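The list-based reordering from this answer can be tried end to end with a small frame. This is only a sketch: the column names come from the question, while the single row of values is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame(
    [[1, 'Movie A', 1, 0, 0, 1]],
    columns=['movieId', 'title', 'drama', 'horror', 'action', 'comedy'],
)

# Keep the first two columns in place and sort the remaining ones by name.
fixed = df.columns[:2].tolist()
ordered = fixed + sorted(df.columns[2:])
df1 = df[ordered]

print(df1.columns.tolist())
```

Selecting with a list of labels returns a new frame, so the values travel with their column names.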

You can explicitly rearrange the columns like so:
df[['movieId','title','drama','horror','sci-fi','comedy']]
If you have a lot of columns to sort alphabetically, keep the first two as separate labels and sort the rest:
df[np.concatenate([['movieId','title'],df.drop(['movieId','title'],axis=1).columns.sort_values()])]

Related

Use rows values from a pandas dataframe as new columns label

If I have a pandas dataframe, is it possible to take values from its rows and use them as labels for new columns?
I have something like this:
| Team| DateTime| Score
| Red| 2021/03/19 | 5
| Red| 2021/03/20 | 10
| Blue| 2022/04/10 | 20
I would like to write this data to a new dataframe that has:
a Team column
one SumScore column per year/month
So I would have one row per team, with one new column for each month of each year containing the sum of the scores for that month.
It should be like this:
| Team| 2021/03 | 2022/04
| Red| 15 | 0
| Blue| 0 | 20
The date format is YYYY/MM/DD.
I hope I was clear

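You can build a year/month column by stripping the day from DateTime, then use pivot_table (n is passed by keyword because newer pandas versions no longer accept it positionally):
df = (df.assign(YM=df['DateTime'].str.rsplit('/', n=1).str[0])
        .pivot_table(index='Team', columns='YM', values='Score', aggfunc='sum', fill_value=0)
        .reset_index())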
print(df)
YM  Team  2021/03  2022/04
0   Blue        0       20
1    Red       15        0

We can use pd.crosstab, which, in the words of its documentation, lets us "compute a simple cross tabulation of two (or more) factors".
Below I've changed df['DateTime'] to contain year/month only.
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime('%Y/%m')
pd.crosstab(
    df['Team'],
    df['DateTime'],
    values=df['Score'],
    aggfunc='sum'
).fillna(0)
If you don't want the extra DateTime label above the column headers, call reset_index() on the crosstab result and then clear the column-axis name, for example with rename_axis(None, axis=1).
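Putting the crosstab steps together on the sample data from the question gives the following sketch. The names ym and out are illustrative, and the cast to int assumes the scores are whole numbers.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Red', 'Red', 'Blue'],
    'DateTime': ['2021/03/19', '2021/03/20', '2022/04/10'],
    'Score': [5, 10, 20],
})

# Reduce each date to its year/month label.
ym = pd.to_datetime(df['DateTime'], format='%Y/%m/%d').dt.strftime('%Y/%m')

out = (
    pd.crosstab(df['Team'], ym, values=df['Score'], aggfunc='sum')
    .fillna(0)                    # months without any score become 0
    .astype(int)
    .rename_axis(None, axis=1)    # clear the 'DateTime' name on the column axis
    .reset_index()                # turn 'Team' back into a regular column
)
print(out)
```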

Pandas Merge two tables with the second tables' one column transposed

Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief Explanation on how the Result table needs to be created:
I have two data frames and I want to merge them on df1_id, but the date column from the second table should be transposed into columns of the resulting table.
The date columns for the result table will be a range between the min date and max date from the second table
The column values for the dates in the result table will be from the data column of the second table.
Also the test column from the second table will only take its value of the latest date for the result table
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried pivoting the second table and then merging the pivoted table with df1, but it's not working. I do not know how to get only one row holding the latest value of test.
Note: I am trying to solve this problem using vectorization and do not want to serially parse through each row

You need to pivot df2 into two separate tables, one for the data values and one for the test values, and then merge both resulting pivot tables with df1:
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
test_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['test'])
data_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['data'])
max_test = test_piv['test'].ffill(axis=1).iloc[:,-1].rename('test')
final = df1.merge(data_piv['data'],left_on=df1.df1_id, right_index=True, how='left')
final = final.merge(max_test,left_on=df1.df1_id, right_index=True, how='left')
The resulting final dataframe looks like this:
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |

Here is a solution for the question:
First I sort df2 by df1_id and date so that the entries are in order. The dates are dd-mm-yyyy strings, so this order is only chronological within a single month and year; convert them with pd.to_datetime for the general case.
Then I drop duplicates on df1_id, keeping the last row, so that I have the latest values of test and test2.
Then I pivot df2 to get each date as a column with data as its value.
Then I merge df2_pivoted with the latest values of test and test2.
Finally I merge the result with df1 to get the resulting table.
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
df2=df2.sort_values(by=['df1_id','date'])
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'],keep='last')
df2_pivoted = df2.pivot_table(index=['df1_id'],columns=['date'],values=['data'])
df2_pivoted = df2_pivoted.droplevel(0,axis=1).reset_index()
df2_pivoted = pd.merge(df2_pivoted,df2_latest_vals,on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date','data'])
result = pd.merge(df1,df2_pivoted,on='df1_id',how='left')
result
Note: I have not been able to figure out how to get the entire date range between 01-05-2021 and 05-05-2021 and show the empty values as NaN. If anyone can help please edit the answer
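One possible way to fill in the full date range mentioned in the note is to parse the dates, build every day between the minimum and maximum with pd.date_range, and reindex the pivoted columns against it. This is a sketch under the assumption that the dates are day-first (dd-mm-yyyy) and that a daily frequency is wanted; the names pivoted and full_range are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    'df1_id': ['1', '2', '1', '1'],
    'date': ['01-05-2021', '03-05-2021', '05-05-2021', '03-05-2021'],
    'data': [12, 13, 16, 9],
})

# Parse the day-first strings so min/max and ordering are chronological.
df2['date'] = pd.to_datetime(df2['date'], format='%d-%m-%Y')

pivoted = df2.pivot_table(index='df1_id', columns='date', values='data')

# Build every calendar day between the earliest and latest date, then
# reindex the columns so that days without data show up as NaN.
full_range = pd.date_range(df2['date'].min(), df2['date'].max(), freq='D')
pivoted = pivoted.reindex(columns=full_range)

# Restore the original dd-mm-yyyy labels.
pivoted.columns = pivoted.columns.strftime('%d-%m-%Y')
print(pivoted)
```

The reindexed frame can then be merged with the latest test values and with df1 in the same way as in the answer.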

Groupby id and create boolean columns

I have a dataframe of transactions:
id | type | date
453| online | 08-12-19
453| instore| 08-12-19
453| return | 10-5-19
There are 4 possible types: online, instore, return, other. I want to create boolean columns that show, for each unique customer, whether they ever had a given transaction type.
I tried the following code but it was not giving me what I wanted.
transactions.groupby('id')['type'].transform(lambda x: x == 'online') == 'online'

Use get_dummies with an aggregated max to get indicator columns per group, then add DataFrame.reindex to set a custom order and to add any missing types, filled with 0 (dtype=int keeps the indicators as 0/1 on pandas versions where get_dummies returns booleans by default):
t = ['online', 'instore', 'return', 'other']
df = pd.get_dummies(df['type'], dtype=int).groupby(df['id']).max().reindex(t, axis=1, fill_value=0)
print (df)
     online  instore  return  other
id
453       1        1       1      0
Another idea is to join the types per group with '|' and then use Series.str.get_dummies:
t = ['online', 'instore', 'return', 'other']
df.groupby('id')['type'].agg('|'.join).str.get_dummies().reindex(t, axis=1, fill_value=0)
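Since the question asks for boolean columns, the indicator approach can be finished with a cast to bool. The following is a sketch on the three sample rows from the question; the variable names types and flags are illustrative.

```python
import pandas as pd

transactions = pd.DataFrame({
    'id': [453, 453, 453],
    'type': ['online', 'instore', 'return'],
    'date': ['08-12-19', '08-12-19', '10-5-19'],
})

types = ['online', 'instore', 'return', 'other']

flags = (
    pd.get_dummies(transactions['type'])    # one indicator column per type seen
    .groupby(transactions['id'])
    .max()                                  # "ever had this type" per customer
    .reindex(types, axis=1, fill_value=0)   # add missing types, fix the order
    .astype(bool)
)
print(flags)
```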

Split Pivoted Index Column Pandas

I have a pivoted data frame that looks like this:
|Units_sold | Revenue
-------------------------------------
California_2015 | 10 | 600
California_2016 | 15 | 900
There are additional columns, but basically what I'd like to do is unstack the index column, and have my table look like this:
|State |Year |Units_sold |Revenue
-------------------------------------
California |2015 | 10 |600
California |2016 | 15 |900
Basically I had two data frames that I needed to merge, on the state and year, but I'm just not sure how to split the index column/ if that's possible. Still pretty new to Python, so I really appreciate any input!!

df = pd.DataFrame({'Units_sold':[10,15],'Revenue':[600,900]}, index=['California_2015','California_2016'])
df = df.reset_index()
df['State'] = df['index'].str.split("_").str.get(0)
df['Year'] = df['index'].str.split("_").str.get(1)
df = df.set_index('State')[['Year','Units_sold','Revenue']]
df
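A variation on the same idea splits the index once with expand=True instead of splitting twice. This is a sketch, not the answerer's code; splitting from the right with n=1 is an assumption made here so that state names containing an underscore would stay intact.

```python
import pandas as pd

df = pd.DataFrame(
    {'Units_sold': [10, 15], 'Revenue': [600, 900]},
    index=['California_2015', 'California_2016'],
)

# Split on the last underscore only and expand the pieces into two columns.
parts = df.index.to_series().str.rsplit('_', n=1, expand=True)
parts.columns = ['State', 'Year']

# Put the new columns in front and drop the combined index labels.
out = pd.concat([parts, df], axis=1).reset_index(drop=True)
print(out)
```

Year stays a string here; convert it with astype(int) if the other frame stores years as integers before merging on State and Year.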

Grouping dataframe based on column similarities in Python

I have a dataframe with commonalities in groups of column names:
Sample1.Feature1 | Sample1.Feature2 | ... | Sample99.Feature1 | Sample99.Feature2
And I'd like to reorder this as
|Sample1 ......................... | Sample99
|Feature1, Feature2 | ..... | Feature1, Feature2 |
I'd then have summary stats, e.g. mean, for Feature1, Feature2, grouped by Sample#. I've played with df.groupby() with no luck so far.
I hope my lack of table formatting skills doesn't distract from the question.
Thanks in advance.

Consider the dataframe df:
df = pd.DataFrame(
    np.ones((1, 6)),
    columns='s1.f1 s1.f2 s1.f3 s2.f1 s2.f2 s2.f3'.split())
df
Split the column labels on '.' to get a two-level column index (sample, feature):
df.columns = df.columns.str.split('.', expand=True)
df
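Once the columns are a two-level index, the summary statistics from the question can be computed per sample. This is a sketch that goes one step beyond the answer: the level names 'sample' and 'feature' and the example values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(2, 6),
    columns='s1.f1 s1.f2 s1.f3 s2.f1 s2.f2 s2.f3'.split(),
)

# Turn "sample.feature" labels into a two-level column index.
df.columns = df.columns.str.split('.', expand=True)
df.columns.names = ['sample', 'feature']

# Column means come back as a Series keyed by (sample, feature);
# unstacking the feature level gives one row per sample.
means = df.mean().unstack('feature')
print(means)
```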
