What is the best way in Python to get the column number of the max value of each row in a pandas DataFrame (ignoring the index column)? E.g. if I had
Date        Company 1  Company 2  Company 3
01.01.2020         23         21         14
02.01.2020         22         12         22
03.01.2020         11         11         12
...               ...        ...        ...
02.01.2020          2         14          3
The output should be the vector:
[1, 1, 3, ..., 2]
Use idxmax:
In [949]: cols = df.iloc[:, 1:].idxmax(axis=1).tolist()
In [950]: cols
Out[950]: ['Company 1', 'Company 1', 'Company 3']
If you want column index:
In [951]: [df.columns.get_loc(i) for i in cols]
Out[951]: [1, 1, 3]
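If you want the positional column numbers directly, a NumPy-based sketch (reconstructing the example frame above, with Date as the first column) is:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['01.01.2020', '02.01.2020', '03.01.2020'],
                   'Company 1': [23, 22, 11],
                   'Company 2': [21, 12, 11],
                   'Company 3': [14, 22, 12]})

# argmax over the value columns gives 0-based positions;
# add 1 to account for the skipped Date column
col_nums = df.iloc[:, 1:].to_numpy().argmax(axis=1) + 1
print(col_nums.tolist())  # [1, 1, 3]
```

Like idxmax, argmax returns the first position on ties, so the 22/12/22 row maps to column 1.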
I have a df as such:
       column A  column B  column C  ...  ColumnZ
index
X             1         4         7  ...       10
Y             2         5         8  ...       11
Z             3         6         9  ...       12
For the life of me I can't figure out how to sum the rows for each column, to arrive at a summation df:
       column A  column B  column C  ...  ColumnZ
index
total         6        15        24  ...       33
Any thoughts?
You can use:
df.loc['total'] = df.sum(numeric_only=True, axis=0)
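A small sketch of why numeric_only=True is useful here when the frame also contains non-numeric columns (example data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['x', 'y'], 'a': [1, 2], 'b': [3, 4]})

# only the numeric columns 'a' and 'b' are summed;
# 'name' gets NaN in the total row instead of raising or concatenating strings
df.loc['total'] = df.sum(numeric_only=True, axis=0)
print(df)
```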
Try this:
import pandas as pd
df = pd.DataFrame({'column A': [1, 2, 3], 'column B': [4, 5, 6], 'column C': [7, 8, 9]})
df.loc['total'] = df.sum()
print(df)
Output:
       column A  column B  column C
0             1         4         7
1             2         5         8
2             3         6         9
total         6        15        24
I have a pandas DataFrame with a 2-level index. For each level-1 index value, I want to select the records for its first level-2 index value.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year': ['2020', '2020', '2019', '2019', '2019', '2018', '2019', '2018', '2017'],
                   'class': list('AISAAIASS'),
                   'val': np.random.randint(0, 10, 9)})
df
   Person  Year class  val
0       1  2020     A    8
1       1  2020     I    7
2       1  2019     S    6
3       2  2019     A    8
4       2  2019     A    1
5       2  2018     I    2
6       3  2019     A    0
7       3  2018     S    6
8       3  2017     S    8
I want the 2020 records for Person 1 (2 records), the 2019 records for Person 2 (2 records), and the 2019 record for Person 3 (1 record).
I have looked at a lot of code but am still unable to get the answer. Is there a simple way of doing it?
Use Index.get_level_values with Index.duplicated to find the first MultiIndex values, then filter with Index.isin:
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year': ['2020', '2020', '2019', '2019', '2019', '2018', '2019', '2018', '2017'],
                   'class': list('AISAAIASS'),
                   'val': np.random.randint(0, 10, 9)}).set_index(['Person', 'Year'])
idx = df.index[~df.index.get_level_values(0).duplicated()]
df1 = df[df.index.isin(idx)]
Or get the first index values with GroupBy.head on the first level:
df1 = df[df.index.isin(df.groupby(['Person']).head(1).index)]
print (df1)
            class  val
Person Year
1      2020     A    0
       2020     I    8
2      2019     A    6
       2019     A    3
3      2019     A    7
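As a further sketch, the same filter can be expressed with a groupby transform on the index level values (self-contained, same seeded data):

```python
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Year': ['2020', '2020', '2019', '2019', '2019', '2018', '2019', '2018', '2017'],
                   'class': list('AISAAIASS'),
                   'val': np.random.randint(0, 10, 9)}).set_index(['Person', 'Year'])

persons = df.index.get_level_values('Person')
years = pd.Series(df.index.get_level_values('Year'), index=persons)

# first Year seen for each Person, broadcast back to every row
first_year = years.groupby(level=0).transform('first')

# keep only rows whose Year equals their Person's first Year
df1 = df[df.index.get_level_values('Year') == first_year.to_numpy()]
print(df1)
```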
I have a problem with counting the number of rows between two indices taken from another DataFrame. Let me explain with an example:
The index of DF2 is the reference vector, and I want to count the number of rows/entries in DF1 that fall between each pair of consecutive indices.
DF1:           DF2:
index  data    index  data
3         1    2         1
9         1    11        1
15        0    33        1
21        0    34        1
23        0
30        1
34        0
Now I want to count all rows that lie between each consecutive pair of indices in DF2.
The reference vector is the index vector of DF2: [2, 11, 33, 34]
Between index 2 and 11 of DF2 is index: 3 and 9 of DF1 -> result 2
Between index 11 and 33 of DF2 is index: 15, 21, 23, 30 of DF1 -> result 4
Between index 33 and 34 of DF2 is index: 34 of DF1 -> result 1
Therefore the result vector should be: [2, 4, 1]
I am really struggling, so I hope you can help me.
I would first build a dataframe giving the min and max indexes from df2:
limits = pd.DataFrame({'mn': np.roll(df2.index, 1), 'mx': df2.index}).iloc[1:]
It gives:
   mn  mx
1   2  11
2  11  33
3  33  34
It is then easy to use a comprehension to get the expected list:
result = [len(df1[(i[0]<=df1.index)&(df1.index<=i[1])]) for i in limits.values]
and obtain as expected:
[2, 4, 1]
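If performance matters, the same counts can be computed without a Python loop using np.searchsorted, assuming df1's index is sorted ascending (a sketch with the example data):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'data': [1, 1, 0, 0, 0, 1, 0]},
                   index=[3, 9, 15, 21, 23, 30, 34])
df2 = pd.DataFrame({'data': [1, 1, 1, 1]}, index=[2, 11, 33, 34])

bounds = df2.index.to_numpy()
# count df1 indices in each inclusive interval [bounds[i-1], bounds[i]]:
# 'left'/'right' sides make both endpoints inclusive
lo = np.searchsorted(df1.index, bounds[:-1], side='left')
hi = np.searchsorted(df1.index, bounds[1:], side='right')
result = (hi - lo).tolist()
print(result)  # [2, 4, 1]
```

Note that, like the comprehension above, both interval endpoints are inclusive, so an index that equals a boundary is counted in both adjacent intervals.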
I have a pandas DataFrame whose index has a 'Day' level containing strings like 'Day 1', 'Day 11', 'Day 2'.
How can I get this to sort properly, so that Day 2 comes after Day 1, not after Day 11?
set_levels + sort_index
The issue is that your strings are being sorted lexicographically rather than numerically. First convert the 'Day' index level to integers, then sort by index:
# split by whitespace, take last split, convert to integers
new_index_values = df.index.levels[1].str.split().str[-1].astype(int)
# set 'Day' level
df.index = df.index.set_levels(new_index_values, level='Day')
# sort by index
df = df.sort_index()
print(df)
           Value
Group Day
A     0        1
      2        3
      11       2
B     5        5
      7        6
      10       4
Setup
The above demonstration uses this example setup:
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
'Value': [1, 2, 3, 4, 5, 6]}).set_index(['Group', 'Day'])
print(df)
              Value
Group Day
A     Day 0       1
      Day 11      2
      Day 2       3
B     Day 10      4
      Day 5       5
      Day 7       6
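If you would rather keep the original 'Day N' string labels and still sort numerically, one sketch is to sort on a temporary numeric column and re-set the index:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Day': ['Day 0', 'Day 11', 'Day 2', 'Day 10', 'Day 5', 'Day 7'],
                   'Value': [1, 2, 3, 4, 5, 6]}).set_index(['Group', 'Day'])

df_sorted = (df.reset_index()
               # extract the number from 'Day N' into a helper column
               .assign(day_num=lambda d: d['Day'].str.split().str[-1].astype(int))
               .sort_values(['Group', 'day_num'])
               .drop(columns='day_num')
               .set_index(['Group', 'Day']))
print(df_sorted)
```

The labels stay as 'Day 0', 'Day 2', 'Day 11', but the rows are now in numeric order within each Group.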
You need to sort integers instead of strings:
import pandas as pd
x = pd.Series([1,2,3,4,6], index=[3,2,1,11,12])
x.sort_index()
1 3
2 2
3 1
11 4
12 6
dtype: int64
y = pd.Series([1,2,3,4,5], index=['3','2','1','11','12'])
y.sort_index()
1 3
11 4
12 5
2 2
3 1
dtype: int64
I would suggest keeping only numbers in the column instead of strings like 'Day ...'.
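Following that suggestion, a minimal sketch that converts the string index to integers before sorting:

```python
import pandas as pd

y = pd.Series([1, 2, 3, 4, 5], index=['3', '2', '1', '11', '12'])

# cast the string index to int so sorting is numeric, not lexicographic
y.index = y.index.astype(int)
print(y.sort_index())  # index order: 1, 2, 3, 11, 12
```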
I have a pandas DataFrame containing one column with multiple JSON data items as list of dicts. I want to normalize the JSON column and duplicate the non-JSON columns:
import json
import pandas as pd

# creating dataframe
df_actions = pd.DataFrame(columns=['id', 'actions'])
rows = [[12, json.loads('[{"type": "a","value": "17"},{"type": "b","value": "19"}]')],
        [15, json.loads('[{"type": "a","value": "1"},{"type": "b","value": "3"},{"type": "c","value": "5"}]')]]
df_actions.loc[0] = rows[0]
df_actions.loc[1] = rows[1]
>>>df_actions
id actions
0 12 [{'type': 'a', 'value': '17'}, {'type': 'b', '...
1 15 [{'type': 'a', 'value': '1'}, {'type': 'b', 'v...
I want
>>>df_actions_parsed
id  type  value
12     a     17
12     b     19
15     a      1
15     b      3
15     c      5
I can normalize JSON data using:
pd.concat([pd.json_normalize(x) for x in df_actions['actions']], ignore_index=True)
but I don't know how to join that back to the id column of the original DataFrame.
You can use concat with a dict comprehension, using pop to extract the 'actions' column, then drop the second index level and join back to the original:
df1 = (pd.concat({i: pd.DataFrame(x) for i, x in df_actions.pop('actions').items()})
.reset_index(level=1, drop=True)
.join(df_actions)
.reset_index(drop=True))
This is the same as:
df1 = (pd.concat({i: pd.json_normalize(x) for i, x in df_actions.pop('actions').items()})
.reset_index(level=1, drop=True)
.join(df_actions)
.reset_index(drop=True))
print (df1)
  type value  id
0    a    17  12
1    b    19  12
2    a     1  15
3    b     3  15
4    c     5  15
Another solution if performance is important:
L = [{'i': k, **y} for k, v in df_actions.pop('actions').items() for y in v]
df_actions = df_actions.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print (df_actions)
   id type value
0  12    a    17
1  12    b    19
2  15    a     1
3  15    b     3
4  15    c     5
Here's another solution that uses explode and json_normalize:
exploded = df_actions.explode("actions")
pd.concat([exploded["id"].reset_index(drop=True), pd.json_normalize(exploded["actions"])], axis=1)
Here's the result:
   id type value
0  12    a    17
1  12    b    19
2  15    a     1
3  15    b     3
4  15    c     5
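pd.json_normalize can also perform the flattening and the id duplication in a single call through its record_path and meta parameters; a sketch with the same data:

```python
import json
import pandas as pd

df_actions = pd.DataFrame(columns=['id', 'actions'])
df_actions.loc[0] = [12, json.loads('[{"type": "a","value": "17"},{"type": "b","value": "19"}]')]
df_actions.loc[1] = [15, json.loads('[{"type": "a","value": "1"},{"type": "b","value": "3"},{"type": "c","value": "5"}]')]

# treat each row as a record: expand the 'actions' list, carry 'id' along
df_parsed = pd.json_normalize(df_actions.to_dict('records'),
                              record_path='actions', meta='id')
print(df_parsed)
```

This avoids building intermediate frames per row; note the 'id' column lands last and can be reordered if needed.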