Use row values from a pandas dataframe as new column labels - python

If I have a pandas dataframe, is it possible to get values from a row and use them as labels for a new column?
I have something like this:
| Team | DateTime   | Score |
| Red  | 2021/03/19 | 5     |
| Red  | 2021/03/20 | 10    |
| Blue | 2022/04/10 | 20    |
I would like to write this data to a new dataframe that has:
a Team column
one SumScore column per Year/Month
So I would have one row per team, with a new column for each month of a year containing the sum of the scores for that month.
It should be like this:
| Team | 2021/03 | 2022/04 |
| Red  | 15      | 0       |
| Blue | 0       | 20      |
The date format is YYYY/MM/DD.
I hope I was clear.

You can use:
df = (df.assign(YM=df['DateTime'].str.rsplit('/', n=1).str[0])
.pivot_table(index='Team', columns='YM', values='Score', aggfunc='sum', fill_value=0)
.reset_index())
print(df)
YM Team 2021/03 2022/04
0 Blue 0 20
1 Red 15 0
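A self-contained sketch of the same approach, using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Red', 'Red', 'Blue'],
    'DateTime': ['2021/03/19', '2021/03/20', '2022/04/10'],
    'Score': [5, 10, 20],
})

# Derive a Year/Month key by dropping the day part, then pivot
out = (df.assign(YM=df['DateTime'].str.rsplit('/', n=1).str[0])
         .pivot_table(index='Team', columns='YM', values='Score',
                      aggfunc='sum', fill_value=0)
         .reset_index())
print(out)
```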

We can use pd.crosstab, which allows us to "compute a simple cross tabulation of two (or more) factors".
Below I've changed df['DateTime'] to contain year/month only.
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime('%Y/%m')
pd.crosstab(
df['Team'],
df['DateTime'],
values=df['Score'],
aggfunc='sum'
).fillna(0)
If you don't want multiple levels in the index, just call reset_index on your crosstab and then drop DateTime.
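Put together as a runnable sketch, again assuming the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Red', 'Red', 'Blue'],
    'DateTime': ['2021/03/19', '2021/03/20', '2022/04/10'],
    'Score': [5, 10, 20],
})

# Reduce the dates to year/month, then cross-tabulate Team x month
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime('%Y/%m')
out = pd.crosstab(df['Team'], df['DateTime'],
                  values=df['Score'], aggfunc='sum').fillna(0)
print(out)
```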


Sorting dataframe by specific column names in Pandas

How to sort a pandas dataframe by specific column names?
My dataframe columns look like this:
+-------+-------+-----+------+------+----------+
|movieId| title |drama|horror|action| comedy |
+-------+-------+-----+------+------+----------+
| |
+-------+-------+-----+------+------+----------+
I would like to sort the dataframe only by columns = ['drama','horror','sci-fi','comedy']. So I get the following dataframe:
+-------+-------+------+------+------+----------+
|movieId| title |action|comedy|drama | horror |
+-------+-------+------+------+------+----------+
| |
+-------+-------+------+------+------+----------+
I tried df = df.sort_index(axis=1) but it sorts all columns:
+-------+-------+------+------+-------+----------+
|action | comedy|drama |horror|movieId| title |
+-------+-------+------+------+-------+----------+
| |
+-------+-------+------+------+-------+----------+
Sort all columns after the second column, then prepend the first 2 columns:
c = df.columns[:2].tolist() + sorted(df.columns[2:].tolist())
print (c)
['movieId', 'title', 'action', 'comedy', 'drama', 'horror']
Last, change the order of the columns by this list:
df1 = df[c]
Another idea is to use DataFrame.sort_index, but only for the columns after the first 2, selected by DataFrame.iloc:
df.iloc[:, 2:] = df.iloc[:, 2:].sort_index(axis=1)
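A minimal runnable sketch of the slicing approach, with a made-up one-row movie table matching the question's layout:

```python
import pandas as pd

# Hypothetical data; only the column layout matters here
df = pd.DataFrame([[1, 'A', 3, 1, 2, 5]],
                  columns=['movieId', 'title', 'drama', 'horror', 'action', 'comedy'])

# Keep the first two columns in place, sort the rest alphabetically
cols = df.columns[:2].tolist() + sorted(df.columns[2:])
df1 = df[cols]
print(df1.columns.tolist())
# ['movieId', 'title', 'action', 'comedy', 'drama', 'horror']
```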
You can explicitly rearrange columns like so:
df[['movieId','title','drama','horror','sci-fi','comedy']]
If you have a lot of columns to sort alphabetically:
df[np.concatenate([['movieId', 'title'], df.drop(['movieId', 'title'], axis=1).columns.sort_values()])]

Add column of row numbers for each group of successively increasing dates

I have a DataFrame with a Date column and other columns with values. Let's say the first 100 rows are in order by date, rows 101 to 200 repeat the same dates with different values, and so on. I would like to add a column which counts rows from 1 to 100 and restarts at 1 when the dates repeat.
Example
Date | Value | RowNum
2000-01-01 | 2 | 1
2000-02-01 | 10 | 2
.
.
.
2003-12-01 | 11 | 100
2000-01-01 | 32 | 1
2000-02-01 | 14 | 2
.
.
.
2003-12-01 | 4 | 100
I need this in order to pivot the table, where the columns are the dates, the values are the values, and RowNum will be the index.
Thank you for the help.
If the exact same dates repeat, your problem becomes a very simple cumsum and cumcount problem:
m = df.Date.eq(df.at[df.index[0], 'Date']).cumsum()
df['RowNum'] = df.groupby(m).cumcount() + 1
If not, you can check the diff:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
m = df['Date'].diff().dt.total_seconds().fillna(-1).lt(0).cumsum()
df['RowNum'] = df.groupby(m).cumcount() + 1
Or, similarly, by converting the underlying NumPy array to float and then diffing:
s = pd.Series(df['Date'].values.astype(float), index=df.index)
df['RowNum'] = df.groupby(s.fillna(-1).lt(0).cumsum()).cumcount() + 1
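The diff-based grouping can be sketched end to end on a small sample (three increasing dates repeated twice, standing in for the question's blocks of 100):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2000-01-01', '2000-02-01', '2000-03-01',
             '2000-01-01', '2000-02-01', '2000-03-01'],
    'Value': [2, 10, 11, 32, 14, 4],
})
df['Date'] = pd.to_datetime(df['Date'])

# A new group starts whenever the date decreases (diff < 0);
# fillna(-1) makes the very first row start group 1
m = df['Date'].diff().dt.total_seconds().fillna(-1).lt(0).cumsum()
df['RowNum'] = df.groupby(m).cumcount() + 1
print(df['RowNum'].tolist())
# [1, 2, 3, 1, 2, 3]
```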
Explanation
Create a new column, iterate through the data frame, and simply use the row position modulo 100. This works fine if each block has exactly 100 dates, as you mentioned above.
Code
df['RowNum'] = 1
for i, row in df.iterrows():
    df.at[i, 'RowNum'] = i % 100 + 1
(df.set_value is deprecated; df.at is the modern equivalent.)
Resources
https://www.geeksforgeeks.org/python-pandas-dataframe-set_value/
https://www.tutorialspoint.com/python_pandas/python_pandas_iteration.htm

Pandas: How to sort dataframe rows by date of one column

So I have two different dataframes, and I concatenated both. All columns are the same; however, the date column has all sorts of different dates in the M/D/YR format.
The dates get shuffled around later in the sequence.
Is there a way to keep the whole dataframe and just sort the rows based on the dates in the date column? I also want to keep the format the date is in.
so basically
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other posts on this, but they do not work.
Trying to elaborate on what can be done:
Initialize/merge the dataframe and convert the column into datetime type:
df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'], 'people': [1, 2, 0]})
df.date=pd.to_datetime(df.date,format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort on the basis of date
df=df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Restore the original format:
df['date']=df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
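If you are on pandas 1.1 or newer, the round trip through datetime and back can be avoided entirely with the key parameter of sort_values, which sorts by the parsed dates while leaving the original strings untouched:

```python
import pandas as pd

df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'],
                   'people': [1, 2, 0]})

# Sort by parsed dates; the column itself keeps its M/D/YYYY strings
out = df.sort_values('date', key=lambda s: pd.to_datetime(s, format='%m/%d/%Y'))
print(out)
```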
Try changing the 'date' column to pandas Datetime and then sort
import pandas as pd
df = pd.DataFrame({'date': ['4/12/1961', '5/5/1961', '7/21/1961', '8/6/1961'],
                   'people': [1, 1, 1, 2]})
df['date'] = pd.to_datetime(df.date)
df = df.sort_values(by='date')
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To get back the initial format:
df['date']=df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
Why not simply:
df.sort_values('date')
Can you provide what you tried, or how your data is structured?
In case you need to sort in reverse order:
df.sort_values('date', ascending=False)

Split Pivoted Index Column Pandas

I have a pivoted data frame that looks like this:
                | Units_sold | Revenue
-------------------------------------
California_2015 | 10         | 600
California_2016 | 15         | 900
There are additional columns, but basically what I'd like to do is unstack the index column, and have my table look like this:
State      | Year | Units_sold | Revenue
-------------------------------------
California | 2015 | 10         | 600
California | 2016 | 15         | 900
Basically I had two data frames that I needed to merge, on the state and year, but I'm just not sure how to split the index column/ if that's possible. Still pretty new to Python, so I really appreciate any input!!
df = pd.DataFrame({'Units_sold': [10, 15], 'Revenue': [600, 900]},
                  index=['California_2015', 'California_2016'])
df = df.reset_index()  # moves 'California_2015' etc. into an 'index' column
df['State'] = df['index'].str.split('_').str.get(0)
df['Year'] = df['index'].str.split('_').str.get(1)
df = df.set_index('State')[['Year', 'Units_sold', 'Revenue']]
df
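A slightly shorter variant of the same idea: str.split with expand=True splits the index into both columns in one step (column names here follow the answer above):

```python
import pandas as pd

df = pd.DataFrame({'Units_sold': [10, 15], 'Revenue': [600, 900]},
                  index=['California_2015', 'California_2016'])

# Split 'California_2015' into 'California' and '2015' in one go
df[['State', 'Year']] = df.index.to_series().str.split('_', expand=True)
out = df.reset_index(drop=True)[['State', 'Year', 'Units_sold', 'Revenue']]
print(out)
```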

Combining MultiIndex columns with similar root names in Pandas/Python

I have a MultiIndex dataframe with the top level columns named:
Col1_1 | Col1_2 | Col2_1 | Col2_2 | ... |
I'm looking to combine Col1_1 with Col1_2 as Col1. I could also do this before creating the MultiIndex, but the original data is more drawn out as:
Col1_1.aspect1 | Col1_1.aspect 2 | Col1_2.aspect1 | Col1_2.aspect2 | ... |
where 'aspect1' and 'aspect2' become subcolumns in the MultiIndex.
Please let me know if I can clarify anything, and many thanks in advance.
The expected result combines the two as just Col1; any number of ways is fine, including stacking/concatenating the data, outputting a summary stat, e.g. mean(), etc.
You can use groupby and apply an aggregation function such as mean against it.
You must group along axis 1 (columns) and on level 1 (the lower level of the MultiIndex columns). This applies the grouping across all samples. Then simply take the mean, if that's what you want to achieve:
df.groupby(level=1, axis=1).mean()
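A self-contained sketch with made-up data (the Col1_1/Col1_2 names and the two aspects are assumptions from the question). Note that groupby(..., axis=1) is deprecated in recent pandas, so this version transposes, groups on the row level, and transposes back, which gives the same result:

```python
import pandas as pd

# Two top-level columns, each with aspect1/aspect2 subcolumns
cols = pd.MultiIndex.from_product([['Col1_1', 'Col1_2'], ['aspect1', 'aspect2']])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)

# Average across the top-level columns for each aspect
out = df.T.groupby(level=1).mean().T
print(out)
```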
