I have the following resulting pandas DataFrame:
How can I get this to sort properly? For example, have the sort so that Day 2 comes after Day 1, not after Day 11, as seen in Group 2 below.
set_levels + sort_index
The issue is that your day labels are being sorted lexicographically as strings rather than numerically. First convert the 'Day' index level to integers, then sort by the index:
# split by whitespace, take last split, convert to integers
new_index_values = df.index.levels[1].str.split().str[-1].astype(int)
# set 'Day' level
df.index = df.index.set_levels(new_index_values, level='Day')
# sort by index
df = df.sort_index()
print(df)
Value
Group Day
A 0 1
2 3
11 2
B 5 5
7 6
10 4
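Alternatively, if you'd rather keep the original 'Day N' strings in the index, sort_index accepts a key callable (pandas >= 1.1) that is applied per index level, so you can sort numerically without rewriting the level. A minimal sketch on the same setup:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
                   'Value': [1, 2, 3, 4, 5, 6]}).set_index(['Group', 'Day'])

# the key is called once per level; map only the 'Day' level to its numeric part
df = df.sort_index(
    key=lambda idx: idx.str.split().str[-1].astype(int) if idx.name == 'Day' else idx)
print(df)
```

This keeps 'Day 0', 'Day 2', 'Day 11' visible in the output while sorting them numerically.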
Setup
The above demonstration uses this example setup:
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
'Value': [1, 2, 3, 4, 5, 6]}).set_index(['Group', 'Day'])
print(df)
Value
Group Day
A Day 0 1
Day 11 2
Day 2 3
B Day 10 4
Day 5 5
Day 7 6
You need to sort integers instead of strings:
import pandas as pd
x = pd.Series([1,2,3,4,6], index=[3,2,1,11,12])
x.sort_index()
1 3
2 2
3 1
11 4
12 6
dtype: int64
y = pd.Series([1,2,3,4,5], index=['3','2','1','11','12'])
y.sort_index()
1 3
11 4
12 5
2 2
3 1
dtype: int64
I would suggest keeping only numbers in the column instead of strings like 'Day ..'.
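Following that suggestion, a minimal sketch (reusing the setup data from the answer above) that strips the 'Day ' prefix into a plain integer column before indexing, after which sorting just works:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
                   'Value': [1, 2, 3, 4, 5, 6]})

# keep only the numeric part of 'Day' as an integer
df['Day'] = df['Day'].str.split().str[-1].astype(int)
df = df.set_index(['Group', 'Day']).sort_index()
print(df)
```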
Related
I have data that looks like this
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'DATE': ['1/1/2015','1/2/2015', '1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015','1/8/2015',
'1/9/2016','1/2/2015','1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015'],
'CD': ['A','A','A','A','B','B','A','A','C','A','A','A','A','A','A']})
What I would like to do is group by ID and CD and get the start and stop date for each change. I tried using groupby with agg, but it groups all the A rows together even though they need to be kept separate, since there is a B run between two A runs.
df1 = df.groupby(['ID', 'CD'])
df1 = df1.agg(
    Start_Date=('DATE', np.min),
    End_Date=('DATE', np.max)
).reset_index()
What I get is :
I was hoping someone could help me get the result I need. What I am looking for is:
First make a grouper that starts a new group id every time CD changes from the previous row:
grouper = df['CD'].ne(df['CD'].shift(1)).cumsum()
grouper:
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 5
14 5
Name: CD, dtype: int32
then use groupby with the grouper:
df.groupby(['ID', grouper, 'CD'])['DATE'].agg(['min', 'max']).droplevel(1)
output:
min max
ID CD
1 A 1/1/2015 1/4/2015
B 1/5/2015 1/6/2015
A 1/7/2015 1/8/2015
C 1/9/2016 1/9/2016
2 A 1/2/2015 1/7/2015
Then change the column names, use reset_index, and so on for your desired output:
(df.groupby(['ID', grouper, 'CD'])['DATE'].agg(['min', 'max']).droplevel(1)
.set_axis(['Start_Date', 'End_Date'], axis=1)
.reset_index()
.assign(CD=lambda x: x.pop('CD')))
result
ID Start_Date End_Date CD
0 1 1/1/2015 1/4/2015 A
1 1 1/5/2015 1/6/2015 B
2 1 1/7/2015 1/8/2015 A
3 1 1/9/2016 1/9/2016 C
4 2 1/2/2015 1/7/2015 A
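For completeness, the named-aggregation style from the question also works once the run grouper is in place; a self-contained sketch (the grouper is renamed to 'run' here so it does not clash with the CD column name):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'DATE': ['1/1/2015', '1/2/2015', '1/3/2015', '1/4/2015', '1/5/2015',
                            '1/6/2015', '1/7/2015', '1/8/2015', '1/9/2016', '1/2/2015',
                            '1/3/2015', '1/4/2015', '1/5/2015', '1/6/2015', '1/7/2015'],
                   'CD': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'C',
                          'A', 'A', 'A', 'A', 'A', 'A']})

# new group id every time CD changes from the previous row
grouper = df['CD'].ne(df['CD'].shift()).cumsum().rename('run')

out = (df.groupby(['ID', grouper, 'CD'], sort=False)
         .agg(Start_Date=('DATE', 'min'), End_Date=('DATE', 'max'))
         .droplevel('run')
         .reset_index()
         [['ID', 'Start_Date', 'End_Date', 'CD']])
print(out)
```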
I have 2 columns:
A          B
1          ABCSD
2          SSNFs
3 CVY KIP
4 MSSSQ
5          ABCSD
6 MMS LLS
7          QQLL
This is an example; the actual file contains these types of cases in 1000+ rows.
I want to separate all the letters from column A and get them as output in column B:
Expected Output:
A  B
1  ABCSD
2  SSNFs
3  CVY KIP
4  MSSSQ
5  ABCSD
6  MMS LLS
7  QQLL
So far I have tried this, which works, but I am looking for a better way:
df['B2'] = df['A'].str.split(' ').str[1:]

def try_join(l):
    try:
        return ' '.join(map(str, l))
    except TypeError:
        return np.nan

df['B2'] = [try_join(l) for l in df['B2']]
df = df.replace('', np.nan)
append = df['B2']
df['B'] = df['B'].combine_first(append)
df['A'] = [str(x).split(' ')[0] for x in df['A']]
df.drop(['B2'], axis=1, inplace=True)
df
You could try as follows.
Either use str.extractall with two named capture groups (generic: (?P<name>...)) named A and B: the first for the digit(s) at the start of the string, the second for the rest. (You can easily adjust these patterns if your actual strings are less straightforward.) Finally, drop the added index level (1) using df.droplevel.
Or use str.split with n=1 and expand=True and rename the columns (0 and 1 to A and B).
Either option can be placed inside df.update with overwrite=True to get the desired outcome.
import pandas as pd
import numpy as np
data = {'A': {0: '1', 1: '2', 2: '3 CVY KIP', 3: '4 MSSSQ',
4: '5', 5: '6 MMS LLS', 6: '7'},
'B': {0: 'ABCSD', 1: 'SSNFs', 2: np.nan, 3: np.nan,
4: 'ABCSD', 5: np.nan, 6: 'QQLL'}
}
df = pd.DataFrame(data)
df.update(df.A.str.extractall(r'(?P<A>^\d+)\s(?P<B>.*)').droplevel(1),
overwrite=True)
# or in this case probably easier:
# df.update(df.A.str.split(pat=' ', n=1, expand=True)\
# .rename(columns={0:'A',1:'B'}),overwrite=True)
df['A'] = df.A.astype(int)
print(df)
A B
0 1 ABCSD
1 2 SSNFs
2 3 CVY KIP
3 4 MSSSQ
4 5 ABCSD
5 6 MMS LLS
6 7 QQLL
You can split on ' ', since the numeric value always appears at the beginning and the text follows after a space.
split = df.A.str.split(' ', n=1)
df.loc[df.B.isnull(), 'B'] = split.str[1]
df.loc[:, 'A'] = split.str[0]
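A runnable version of this approach on the question's data (using the keyword n=1, since passing the split count positionally is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3 CVY KIP', '4 MSSSQ', '5', '6 MMS LLS', '7'],
                   'B': ['ABCSD', 'SSNFs', np.nan, np.nan, 'ABCSD', np.nan, 'QQLL']})

# split off at most one leading token; fill B only where it is missing
split = df.A.str.split(' ', n=1)
df.loc[df.B.isnull(), 'B'] = split.str[1]
df.loc[:, 'A'] = split.str[0]
print(df)
```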
You could use str.split() if your number appears first.
df['A'].str.split(n=1,expand=True).set_axis(df.columns,axis=1).combine_first(df)
or
df['A'].str.extract(r'(?P<A>\d+) (?P<B>[A-Za-z ]+)').combine_first(df)
Output:
A B
0 1 ABCSD
1 2 SSNFs
2 3 CVY KIP
3 4 MSSSQ
4 5 ABCSD
5 6 MMS LLS
6 7 QQLL
How can I merge columns Year, Month, and Day into one column of months?
import pandas as pd
data = {'Subject': ['A', 'B', 'C', 'D'],
'Year':[1, 0, 0, 2],
'Month':[5,2,8,8],
'Day': [3,22,5,12]}
df = pd.DataFrame(data)
print(df)
My example gives the resulting dataframe:
Subject Year Month Day
0 A 1 5 3
1 B 0 2 22
2 C 0 8 5
3 D 2 8 12
I would like it to look like this:
*note: I rounded this so these numbers are not 100% accurate
Subject Months
0 A 17
1 B 3
2 C 8
3 D 32
Assuming the Gregorian calendar:
365.2425 days/year
30.436875 days/month.
day_year = 365.2425
day_month = 30.436875
df['Days'] = df.Year.mul(day_year) + df.Month.mul(day_month) + df.Day
# You could also skip this step and just do:
# df['Months'] = (df.Year.mul(day_year) + df.Month.mul(day_month) + df.Day).div(day_month)
df['Months'] = df.Days.div(day_month)
print(df.round(2))
Output:
Subject Year Month Day Days Months
0 A 1 5 3 520.43 17.10
1 B 0 2 22 82.87 2.72
2 C 0 8 5 248.50 8.16
3 D 2 8 12 985.98 32.39
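Since 365.2425 / 30.436875 is exactly 12, the intermediate Days column can be skipped entirely: the same Months value is Year * 12 + Month + Day / 30.436875. A sketch on the question's data:

```python
import pandas as pd

day_month = 30.436875  # average Gregorian month length in days

df = pd.DataFrame({'Subject': ['A', 'B', 'C', 'D'],
                   'Year': [1, 0, 0, 2],
                   'Month': [5, 2, 8, 8],
                   'Day': [3, 22, 5, 12]})

# a year is exactly 12 average months, so only days need converting
df['Months'] = df['Year'] * 12 + df['Month'] + df['Day'] / day_month
print(df[['Subject', 'Months']].round(2))
```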
What is the best way in Python to get the column number of the max value of each row in a pandas DataFrame (without the index column)? E.g. if I had
Date        Company 1  Company 2  Company 3
01.01.2020  23         21         14
02.01.2020  22         12         22
03.01.2020  11         11         12
...         ...        ...        ...
02.01.2020  2          14         3
The output should be the vector:
[1, 1, 3, ..., 2]
Use idxmax:
In [949]: cols = df.iloc[:, 1:].idxmax(axis=1).tolist()
In [950]: cols
Out[950]: ['Company 1', 'Company 1', 'Company 3']
If you want column index:
In [951]: [df.columns.get_loc(i) for i in cols]
Out[951]: [1, 1, 3]
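The list comprehension can also be vectorized with Index.get_indexer, which looks up all the positions at once:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['01.01.2020', '02.01.2020', '03.01.2020'],
                   'Company 1': [23, 22, 11],
                   'Company 2': [21, 12, 11],
                   'Company 3': [14, 22, 12]})

# column label of the row-wise max, ignoring the Date column
cols = df.iloc[:, 1:].idxmax(axis=1)
# positions of those labels within df.columns
positions = df.columns.get_indexer(cols)
print(positions.tolist())
```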
I have a pandas DataFrame with 2 levels of index. For each level 1 index value, I want to select the records of its first level 2 index value.
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year': ['2020', '2020', '2019', '2019', '2019', '2018',
                            '2019', '2018', '2017'],
                   'class': list('AISAAIASS'),
                   'val': np.random.randint(0, 10, 9)})
df
Person Year class val
0 1 2020 A 8
1 1 2020 I 7
2 1 2019 S 6
3 2 2019 A 8
4 2 2019 A 1
5 2 2018 I 2
6 3 2019 A 0
7 3 2018 S 6
8 3 2017 S 8
I want the 2020 records for Person 1 (2 records), the 2019 records for Person 2 (2 records), and the 2019 record for Person 3 (1 record).
I have looked into a lot of code but am still unable to get the answer. Is there a simple way of doing it?
Use Index.get_level_values with Index.duplicated to get the first MultiIndex values, then filter by Index.isin:
np.random.seed(2020)
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'Year': ['2020','2020', '2019','2019','2019','2018', '2019','2018','2017'],
'class':list('AISAAIASS'),
'val': np.random.randint(0, 10, 9)}).set_index(['Person','Year'])
idx = df.index[~df.index.get_level_values(0).duplicated()]
df1 = df[df.index.isin(idx)]
Or get the first index values per person with GroupBy.head on the first level:
df1 = df[df.index.isin(df.groupby(['Person']).head(1).index)]
print (df1)
class val
Person Year
1 2020 A 0
2020 I 8
2 2019 A 6
2019 A 3
3 2019 A 7
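An equivalent filter keeps every row whose Year matches the first Year seen for that Person; a sketch under the same setup:

```python
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year': ['2020', '2020', '2019', '2019', '2019', '2018',
                            '2019', '2018', '2017'],
                   'class': list('AISAAIASS'),
                   'val': np.random.randint(0, 10, 9)}).set_index(['Person', 'Year'])

# first Year seen for each Person, broadcast back to every row
first_year = (df.index.to_frame(index=False)
                .groupby('Person')['Year'].transform('first'))
df1 = df[df.index.get_level_values('Year') == first_year.values]
print(df1)
```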