Iteratively Capture Value Counts in Single DataFrame - python

I have a pandas dataframe that looks something like this:
id group gender age_grp status
1 1 m over21 active
2 4 m under21 active
3 2 f over21 inactive
I have over 100 columns and thousands of rows. I am trying to create a single pandas dataframe of the value_counts of each of the columns. So I want something that looks like this:
group1
gender m 100
f 89
age over21 98
under21 11
status active 87
inactive 42
Does anyone know a simple way to iteratively concat the value_counts from each of the 100+ columns in the original dataset while capturing the column names as a hierarchical index?
Eventually I want to be able to merge with another dataframe of a different group to look like this:
group1 group2
gender m 100 75
f 89 92
age over21 98 71
under21 11 22
status active 87 44
inactive 42 13
Thanks!

This should do it:
df.stack().groupby(level=1).value_counts()
id 1 1
2 1
3 1
group 1 1
2 1
4 1
gender m 2
f 1
age_grp over21 2
under21 1
status active 2
inactive 1
dtype: int64
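For the eventual goal of one column of counts per group, a minimal sketch follows. It assumes the groups are identified by the "group" column and uses a small made-up frame; the helper name column_value_counts is hypothetical, not part of pandas.

```python
import pandas as pd

# Made-up sample data, not the asker's real dataset
df = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "gender": ["m", "f", "m", "m"],
    "status": ["active", "active", "inactive", "active"],
})

def column_value_counts(frame):
    """Stack the value_counts of every column under a (column, value) index."""
    # Passing a dict to pd.concat uses the keys as the outer index level
    return pd.concat({col: frame[col].value_counts() for col in frame.columns})

# One Series of counts per group, placed side by side as columns
counts = (
    pd.concat(
        {
            f"group{g}": column_value_counts(sub.drop(columns="group"))
            for g, sub in df.groupby("group")
        },
        axis=1,
    )
    .fillna(0)    # a value that never occurs in a group counts as 0
    .astype(int)
    .sort_index()
)
print(counts)
```

Building every group in one pass avoids a separate merge step, because pd.concat with axis=1 already aligns the groups on the (column, value) index.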

Related

python sum values in columns taken from another dataframe

I have a dataframe ("MUNg") like this:
MUN_id Col1
1-2 a
3 b
4-5-6 c
...
And another dataframe ("ppc") like this:
id population
0 1 20
1 2 25
2 3 4
3 4 45
4 5 100
5 6 50
...
I need to create a column in "MUNg" that contains the total population, obtained by summing the population values from "ppc" for each id listed in MUN_id.
Expected result:
MUN_id Col1 total_population
1-2 a 45
3 b 4
4-5-6 c 195
...
I have not included an attempt, because I am new to Python and do not know where to start.
MUNg['total_population']=?
Many thanks!
You can split the string and explode it into new rows, map the population data onto the ids, and then use GroupBy.sum on the original row index to get the totals:
MUNg['total_population'] = (MUNg['MUN_id']
.str.split('-')
.explode()
.astype(int) # required if "id" in "ppc" is an integer, comment if string
.map(ppc.set_index('id')['population'])
.groupby(level=0).sum()
)
output:
MUN_id Col1 total_population
0 1-2 a 45
1 3 b 4
2 4-5-6 c 195

How to combine dataframes based on index column name

Hello, I am new to Python. I have two DataFrames and a list of tickers, and I would like to combine the two DataFrames based on that list. My second df had the tickers imported from an Excel sheet, so the tickers appear in a different column order; I am not sure whether that changes anything.
df1 looks like
df1
index ABC DEF XYZ
avg 2 6 12
std 1 2 3
var 24 25 35
max 56 66 78
df2
index 10 40 96
ticker XYZ ABC DEF
Sector Auto Tech Mining
I would like to combine them based on their ticker names in a third df with all the information so it looks something like this
df3
index ABC DEF XYZ
avg 2 6 12
std 1 2 3
var 24 25 35
max 56 66 78
Sector Tech Mining Auto
I have tried this
df3= pd.concat([df1,df2], ignore_index=True)
but it made a df where they were side by side instead of in one combined df. Any help would be appreciated.
You need to set the index first, so that the ticker names in df2 become its column labels:
df2 = df2.set_index('index').T.set_index('ticker').T
out = pd.concat([df1,df2])
ABC DEF XYZ
index
avg 2 6 12
std 1 2 3
var 24 25 35
max 56 66 78
Sector Tech Mining Auto
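To make the answer reproducible, here is a self-contained sketch that rebuilds both frames from the question. It assumes df1 is already indexed by its row labels and that "index" is an ordinary column in df2; the intermediate name sector is hypothetical.

```python
import pandas as pd

# df1: statistics as rows, tickers as columns (assumed to be indexed by "index")
df1 = pd.DataFrame(
    {"ABC": [2, 1, 24, 56], "DEF": [6, 2, 25, 66], "XYZ": [12, 3, 35, 78]},
    index=pd.Index(["avg", "std", "var", "max"], name="index"),
)

# df2: tickers stored as a row of values under arbitrary column labels
df2 = pd.DataFrame({
    "index": ["ticker", "Sector"],
    10: ["XYZ", "Auto"],
    40: ["ABC", "Tech"],
    96: ["DEF", "Mining"],
})

# Promote the "ticker" row to column labels so both frames share columns
sector = df2.set_index("index").T.set_index("ticker").T

# concat aligns on column labels, so the differing ticker order is harmless
out = pd.concat([df1, sector])
print(out)
```

Because pd.concat matches columns by label rather than by position, the order in which the tickers came out of the Excel sheet does not matter once they are column labels.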

Python: Randomly select a subgroup in a group

I have a dataframe that looks like:
patient_id note_id lines
A 10 1
A 10 2
A 10 3
A 29 1
A 29 2
B 12 1
B 95 1
B 95 2
B 95 3
C......
D......
E 14 1
E 55 1
E 87 1
......
Each patient can have multiple notes, and each note may contain more than one line. Say that I have 20 patients, 50 notes and 150 lines. How can I randomly select 3 patients and then pick one random note for each of them, keeping all the lines of that note? For example, I would get:
patient_id note_id lines
A 29 1
A 29 2
B 12 1
E 55 1
I'd suggest creating a temporary dataset without the lines column. Then .drop_duplicates() to get one line per note. Then invoke .sample() to choose your random subset, then .merge() to rejoin the sample to the original dataset on patient_id and note_id. There may well be a quicker way as I'm not a pandas expert.
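The steps described above can be sketched as follows. The sample data is made up (the rows for patients C and D are invented for illustration), and random_state is fixed only to make the run repeatable. This variant samples patients first and then one note per patient, using DataFrameGroupBy.sample, which requires pandas 1.1 or later.

```python
import pandas as pd

# Made-up data in the shape of the question; C and D rows are invented
df = pd.DataFrame({
    "patient_id": ["A"] * 5 + ["B"] * 4 + ["C"] * 2 + ["D"] + ["E"] * 3,
    "note_id": [10, 10, 10, 29, 29, 12, 95, 95, 95, 33, 40, 61, 14, 55, 87],
    "lines": [1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1],
})

# One row per (patient, note), ignoring the lines column
notes = df[["patient_id", "note_id"]].drop_duplicates()

# Choose 3 distinct patients at random
patients = notes["patient_id"].drop_duplicates().sample(n=3, random_state=0)

# Choose one random note for each chosen patient
picked = (
    notes[notes["patient_id"].isin(patients)]
    .groupby("patient_id")
    .sample(n=1, random_state=0)
)

# Join back to recover every line of the chosen notes
result = df.merge(picked, on=["patient_id", "note_id"])
print(result)
```

Sampling patients before notes gives every patient the same chance of selection; sampling directly from the note table would favour patients with more notes.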

New dataframe from a groupby

I have the following dataframe:
teste.head(5)
card_id feature_1 feature_2
0 C_ID_92a2005557 5 2
1 C_ID_3d0044924f 4 1
2 C_ID_d639edf6cd 2 2
3 C_ID_186d6a6901 4 3
4 C_ID_cdbd2c0db2 1 3
And I have this other dataframe:
historical.head(5)
authorized_flag card_id city_id category_1 installments category_3 merchant_category_id merchant_id
0 Y C_ID_cdbd2c0db2 88 N 0 A 80 M_ID_e020e9b302
1 Y C_ID_92a2005557 88 N 0 A 367 M_ID_86ec983688
2 Y C_ID_d639edf6cd 88 N 0 A 80 M_ID_979ed661fc
3 Y C_ID_d639edf6cd 88 N 0 A 560 M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N 0 A 80 M_ID_e020e9b302
Comments:
The first dataframe has only some information about the card_id and the value I want to predict (target)
The second dataframe looks like the history of each card_id containing the columns I need to merge to the first dataframe (giving more information / columns for each card_id)
The card_id in the second dataframe is repeated several times, so I need to build a new dataframe from it with one row per card_id, so that merging does not multiply the rows.
I can use:
historical.groupby('card_id').size()
and create a new column with the number of times that card_id was used.
But I'm not able to do this with the rest of the columns, because I need to sum all the values in each column per card_id and then merge the result with the first dataframe.
Can you help me create the new columns in the best way?
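One possible shape for this, as a minimal sketch: aggregate the history to one row per card_id with named aggregation, then left-merge onto the first dataframe. The frames below are cut-down stand-ins for teste and historical, and the output column names (n_transactions, installments_sum, n_merchant_categories) are hypothetical. Which columns to sum and which to count is an assumption; summing id-like columns such as merchant_category_id would not be meaningful, so this sketch counts distinct values instead.

```python
import pandas as pd

# Cut-down stand-ins for the two frames in the question
teste = pd.DataFrame({
    "card_id": ["C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_cdbd2c0db2"],
    "feature_1": [5, 2, 1],
})
historical = pd.DataFrame({
    "card_id": ["C_ID_cdbd2c0db2", "C_ID_92a2005557", "C_ID_d639edf6cd",
                "C_ID_d639edf6cd", "C_ID_92a2005557"],
    "installments": [0, 0, 0, 0, 0],
    "merchant_category_id": [80, 367, 80, 560, 80],
})

# Collapse the history to exactly one row per card_id
per_card = (
    historical.groupby("card_id")
    .agg(
        n_transactions=("installments", "size"),        # rows per card
        installments_sum=("installments", "sum"),       # numeric total
        n_merchant_categories=("merchant_category_id", "nunique"),
    )
    .reset_index()
)

# per_card is unique on card_id, so a left merge cannot duplicate rows
merged = teste.merge(per_card, on="card_id", how="left")
print(merged)
```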

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1':['a','a','a','b','b','b'],
'Name2':[1,2,4,2,4,8],
'present':[1,1,3,1,5,1]})
df.set_index(['Name1','Name2'], inplace=True)
df2 = pd.DataFrame({'Data1':[80,61,45,30],
'Data2':[6,8,7,3]},
index=pd.Series([1,2,4,8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
# DataFrame.sort() was removed from pandas; sort_index() sorts by the index labels
print(result.reorder_levels([1,0],axis=0).sort_index(axis=0))
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
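Since the question also asks whether merge() can do this, here is a sketch of an equivalent approach under the same sample data: turn the index levels into ordinary columns, merge on Name2, then restore the MultiIndex. This is an alternative to the join above, not the answer's own method.

```python
import pandas as pd

df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]}).set_index(['Name1', 'Name2'])
df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))

result = (
    df.reset_index()                                    # Name1, Name2 become columns
      .merge(df2.reset_index(), on='Name2', how='left') # one lookup row per Name2
      .set_index(['Name1', 'Name2'])                    # restore the MultiIndex
)
print(result)
```

A left merge keeps every dyad row of the original frame in its original order and repeats the per-individual data wherever the same Name2 value appears.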
