So I have the DataFrame below:
Name Tues Mon Tues Mon
col 0 0 1 1 <-
bill 2 1 2 1
jon 4 3 4 3
I want to reorder the DataFrame columns according to the "col" row, so that the 0's and 1's are grouped
in order, and within each group the columns follow the order of the days of the week. The desired result is below.
Name Mon Tues Mon Tues
col 0 0 1 1
bill 1 2 1 2
jon 3 4 3 4
It's easier to sort by rows, so you can use .T to transpose the dataframe, run the sort, and then use .T again to transpose it back.
The first thing you need is a way to sort by day of the week. You can create a temporary column that maps the abbreviated weekday strings to numbers, sort on it together with col, and then drop that column after sorting.
df = df.set_index('Name').T.reset_index()
df['day'] = df['index'].replace(['Su.*', 'M.*', 'Tu.*', 'W.*', 'Th.*', 'F.*', 'Sa.*'],
                                [1, 2, 3, 4, 5, 6, 7], regex=True)
df = df.sort_values(['col', 'day']).drop('day', axis=1).set_index('index').T.reset_index()
df
Out[1]:
index Name Mon Tues Mon.1 Tues.1
0 col 0 0 1 1
1 bill 1 2 1 2
2 jon 3 4 3 4
You can strip the ".1" suffixes from the duplicated column names with:
df.columns = [col.split('.')[0] for col in df.columns]
Name Mon Tues Mon Tues
0 col 0 0 1 1
1 bill 1 2 1 2
2 jon 3 4 3 4
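You can also sort df with the following one-liner, assuming Name has been set as the index. Note that sorting on bill puts the days in the right order only because bill's values happen to increase from Mon to Tues: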
df_sorted = df.T.sort_values(['col', 'bill']).T
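If you would rather not transpose, the same ordering can be sketched by sorting column positions directly, which also copes with the duplicated column names. This is only an illustration: the day_rank mapping is an assumption (only "Mon" and "Tues" appear in the question; the other abbreviations are made up), and it assumes Name is the index with col as the first row.

```python
import pandas as pd

# Rebuild the question's frame with Name as the index (duplicate column names are allowed).
df = pd.DataFrame(
    [[0, 0, 1, 1], [2, 1, 2, 1], [4, 3, 4, 3]],
    index=["col", "bill", "jon"],
    columns=["Tues", "Mon", "Tues", "Mon"],
)

# Hypothetical weekday ranking; only "Mon" and "Tues" come from the question.
day_rank = {"Mon": 0, "Tues": 1, "Wed": 2, "Thurs": 3, "Fri": 4, "Sat": 5, "Sun": 6}

# Sort column positions by (value in the "col" row, weekday rank).
# Row 0 is assumed to be the "col" row.
order = sorted(
    range(df.shape[1]),
    key=lambda i: (df.iloc[0, i], day_rank[df.columns[i]]),
)

# Positional selection avoids any ambiguity from the repeated labels.
result = df.iloc[:, order]
print(result)
```

Working with positions rather than labels means no suffix such as ".1" is ever introduced, so there is nothing to rename afterwards.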
I have a pandas dataframe that contains dates in a column Date. I need to add another column Days which contains the difference in days from the previous cell, so the value in the i-th row should be the difference between the i-th date and the (i-1)-th date. The first difference should be 0.
Date Days
08-01-1997 0
09-01-1997 1
10-01-1997 1
13-01-1997 3
14-01-1997 1
15-01-1997 1
01-03-1997 45
03-03-1997 2
04-03-1997 1
05-03-1997 1
13-06-1997 100
I tried this, but it did not help.
First convert the Date column to pandas datetime objects, then subtract the shifted column to get the differences as timedelta values. From there, take the days through Series.dt and fill the first (missing) value with 0:
>>> df['Date']=pd.to_datetime(df['Date'], dayfirst=True)
>>> df['Days']=(df['Date']-df['Date'].shift()).dt.days.fillna(0).astype(int)
OUTPUT
df
Date Days
0 1997-01-08 0
1 1997-01-09 1
2 1997-01-10 1
3 1997-01-13 3
4 1997-01-14 1
5 1997-01-15 1
6 1997-03-01 45
7 1997-03-03 2
8 1997-03-04 1
9 1997-03-05 1
10 1997-06-13 100
You can use diff as well:
df['date_up'] = pd.to_datetime(df['Date'],dayfirst=True)
df['date_diff'] = df['date_up'].diff()
df['date_diff_num_days'] = df['date_diff'].dt.days.fillna(0).astype(int)
df.head()
Date Days date_up date_diff date_diff_num_days
0 08-01-1997 0 1997-01-08 NaT 0
1 09-01-1997 1 1997-01-09 1 days 1
2 10-01-1997 1 1997-01-10 1 days 1
3 13-01-1997 3 1997-01-13 3 days 3
4 14-01-1997 1 1997-01-14 1 days 1
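Both answers can be condensed into one small, self-contained sketch. It uses the first seven dates from the question and passes an explicit format string instead of dayfirst=True; that choice is an assumption about the data (every value is strictly DD-MM-YYYY), made so that parsing cannot silently fall back to month-first.

```python
import pandas as pd

# First seven dates from the question, stored as DD-MM-YYYY strings.
df = pd.DataFrame({
    "Date": ["08-01-1997", "09-01-1997", "10-01-1997", "13-01-1997",
             "14-01-1997", "15-01-1997", "01-03-1997"]
})

# Parse with an explicit format so day and month cannot be swapped.
dates = pd.to_datetime(df["Date"], format="%d-%m-%Y")

# diff() gives a timedelta to the previous row (NaT for the first row);
# .dt.days extracts whole days, and the first row is filled with 0.
df["Days"] = dates.diff().dt.days.fillna(0).astype(int)
print(df)
```

Keeping the parsed dates in a separate variable leaves the original Date strings untouched, which may or may not be what you want.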
I am trying to remove the rows with NaN values from a pandas dataframe, and when I do so, I want the row identifiers to be renumbered so that they start from 0 and increase by one in the new dataframe. By identifiers I mean the numbers at the left of the following example. Note that this is not an actual column of my df; it is the index that every dataframe gets by default.
If my Df is like:
name toy born
0 a 1 2020
1 na 2 2020
2 c 5 2020
3 na 1 2020
4 asf 1 2020
This is what I want after dropna():
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
This is what I get instead, which I don't want:
name toy born
0 a 1 2020
2 c 5 2020
4 asf 1 2020
You can simply add df.reset_index(drop=True).
By default, df.dropna and df.reset_index return a new dataframe rather than modifying df in place, so the result has to be assigned back. The complete answer is therefore as follows.
df = df.dropna().reset_index(drop=True)
Results and Explanations
The above code yields the following result.
>>> df = df.dropna().reset_index(drop=True)
>>> df
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
We use the argument drop=True to discard the old index instead of keeping it as a new column named index. Otherwise, the result would look like this.
>>> df = df.dropna().reset_index()
>>> df
index name toy born
0 0 a 1 2020
1 2 c 5 2020
2 4 asf 1 2020
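For a version you can run from scratch, here is a minimal sketch. It assumes the "na" entries in the question are real missing values (NaN) rather than the literal string "na"; if they are strings, dropna will not remove them.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame, with NaN standing in for the "na" entries.
df = pd.DataFrame({
    "name": ["a", np.nan, "c", np.nan, "asf"],
    "toy": [1, 2, 5, 1, 1],
    "born": [2020] * 5,
})

# dropna() keeps the original labels 0, 2, 4;
# reset_index(drop=True) renumbers them 0, 1, 2 and discards the old labels.
clean = df.dropna().reset_index(drop=True)
print(clean)
```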
How can I use groupby on the index values (1, 2, 3, which always repeat in the same order) and get the sum of the score column for each run of index values? Basically I have this:
index score
1 2
2 2
3 2
1 3
2 3
3 3
What I want:
index score sum
1 2 6
2 2 9
3 2
1 3
2 3
3 3
I understand it has to be something like this:
df = df.groupby(['Year'])['Score'].sum()
but how do I group by the index instead of by Year?
Per the comments, you can group by the index and store the result of cumcount() in a new object s. Then you can group by s and take the sum(). I am assuming index is the dataframe's index in your example and not a column called index. If it is a column called index, then first run df = df.set_index('index'):
s = df.groupby(level=0).cumcount()
df.groupby(s)['score'].sum()
0 6
1 9
Name: score, dtype: int64
If you print out s, then s looks like this:
index
1 0
2 0
3 0
1 1
2 1
3 1
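The same idea as a runnable sketch, assuming index really is the dataframe's index. The only deliberate difference is that the group labels are passed as a plain array (to_numpy()), so the grouping is purely positional and does not rely on how pandas aligns a Series whose index contains repeated labels.

```python
import pandas as pd

# The question's data: index values 1, 2, 3 repeat in the same order.
df = pd.DataFrame(
    {"score": [2, 2, 2, 3, 3, 3]},
    index=pd.Index([1, 2, 3, 1, 2, 3], name="index"),
)

# cumcount() numbers each occurrence of an index value: first pass -> 0, second pass -> 1.
s = df.groupby(level=0).cumcount()

# Group by those pass numbers (as a positional array) and sum the scores.
totals = df.groupby(s.to_numpy())["score"].sum()
print(totals)
```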
I am new to Python.
I have a dataframe with two columns. One is ID column and the other is the
year and count information related to the ID.
I want to convert this format into multiple rows with the same ID.
The current dataframe looks like:
ID information
1 2014:Total:0, 2015:Total:1, 2016:Total:2
2 2017:Total:3, 2018:Total:1, 2019:Total:2
I expect the converted dataframe to look like this:
ID Year Value
1 2014 0
1 2015 1
1 2016 2
2 2017 3
2 2018 1
2 2019 2
I tried to use the str.split method of pandas dataframe, but no luck.
Any suggestions would be appreciated.
Let's use explode :-) (new in pandas 0.25.0)
df.information=df.information.str.split(', ')
Yourdf=df[['ID']].join(df.information.explode().str.split(':',expand=True).drop(1,axis=1))
Yourdf
ID 0 2
0 1 2014 0
0 1 2015 1
0 1 2016 2
1 2 2017 3
1 2 2018 1
1 2 2019 2
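The output above still has the default column names 0 and 2 and a repeated index. A slightly longer sketch of the same explode approach that produces the requested ID / Year / Value layout might look like this; the variable names long_df, parts and result are illustrative only, and it assumes every entry has exactly the year:label:value shape shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    "information": [
        "2014:Total:0, 2015:Total:1, 2016:Total:2",
        "2017:Total:3, 2018:Total:1, 2019:Total:2",
    ],
})

# Split each cell into a list of "year:Total:value" strings, then give every
# list element its own row (the ID is repeated automatically).
long_df = df.assign(information=df["information"].str.split(", ")).explode("information")

# Split each piece on ":" into three columns: year, label, value.
parts = long_df["information"].str.split(":", expand=True)

# Assemble the final frame from plain arrays so the index is a fresh 0..n-1.
result = pd.DataFrame({
    "ID": long_df["ID"].to_numpy(),
    "Year": parts[0].astype(int).to_numpy(),
    "Value": parts[2].astype(int).to_numpy(),
})
print(result)
```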
Try the code below; unlike #WenYoBen's answer, it also works on much older pandas versions. It assumes every row holds the same number of entries:
df2 = pd.DataFrame(df['information'].str.split(', ', expand=True).apply(lambda x: x.str.split(':')).values.flatten().tolist(), columns=['Year', '', 'Value']).iloc[:, [0, 2]]
print(pd.DataFrame(sorted(df['ID'].tolist() * (len(df2) // len(df))), columns=['ID']).join(df2))
Output:
ID Year Value
0 1 2014 0
1 1 2015 1
2 1 2016 2
3 2 2017 3
4 2 2018 1
5 2 2019 2
As a follow-up to my previous question, I need some more help.
The dataframe looks like this:
time eve_id sub_id flag
0 5 2 0
1 5 2 0
2 5 2 1
3 5 2 1
4 5 2 0
5 4 25 0
6 4 30 0
7 5 2 1
I need to count the eve_id rows in each stretch where flag stays at 0 (until it changes to 1),
and likewise count the eve_id rows in each stretch where flag stays at 1.
The output should look like this:
time flag count
0 0 2
2 1 2
4 0 3
Can someone help me here?
First we build a grouper: we check whether the difference in flag between consecutive rows is not equal to 0, which marks the start of a new run, and take the cumulative sum so that every run gets its own number.
Then we group by this indicator and use agg. Since pandas 0.25.0 we have named aggregations:
s = df['flag'].diff().ne(0).cumsum()
grpd = df.groupby(s).agg(time=('time', 'first'),
                         flag=('flag', 'first'),
                         count=('flag', 'size')).reset_index(drop=True)
Output
time flag count
0 0 0 2
1 2 1 2
2 4 0 3
3 7 1 1
If time is your index, use:
grpd = df.assign(time=df.index).groupby(s).agg(time=('time', 'first'),
                                               flag=('flag', 'first'),
                                               count=('flag', 'size')).reset_index(drop=True)
Note: the extra row appears because flag also changes between the last row and the row before it.
Change the aggregate function from sum to GroupBy.size:
df1 = (df.groupby([df['flag'].ne(df['flag'].shift()).cumsum(), 'flag'])
         .size()
         .reset_index(level=0, drop=True)
         .reset_index(name='count'))
print (df1)
flag count
0 0 2
1 1 2
2 0 3
3 1 1
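Both answers rely on the same run-labelling trick, which can be tried end to end with the sketch below. It rebuilds the question's dataframe with time as an ordinary column; the names run_id and runs are illustrative, and renaming the grouper to "run" is just a precaution so that it cannot be confused with the flag column.

```python
import pandas as pd

df = pd.DataFrame({
    "time": list(range(8)),
    "eve_id": [5, 5, 5, 5, 5, 4, 4, 5],
    "sub_id": [2, 2, 2, 2, 2, 25, 30, 2],
    "flag": [0, 0, 1, 1, 0, 0, 0, 1],
})

# A new run starts wherever flag differs from the previous row;
# the cumulative sum turns those change points into run numbers 1, 2, 3, ...
run_id = df["flag"].ne(df["flag"].shift()).cumsum().rename("run")

# One output row per run: its first time, its flag value, and its length.
runs = (
    df.groupby(run_id)
      .agg(time=("time", "first"), flag=("flag", "first"), count=("flag", "size"))
      .reset_index(drop=True)
)
print(runs)
```

If the trailing single-row run is not wanted, it would have to be filtered out explicitly, since from the data alone it is a legitimate run of length 1.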