I have a dataframe ("MUNg") like this:
MUN_id Col1
1-2 a
3 b
4-5-6 c
...
And another dataframe ("ppc") like this:
id population
0 1 20
1 2 25
2 3 4
3 4 45
4 5 100
5 6 50
...
I need to create a column in "MUNg" that contains the total population, obtained by summing the populations from "ppc" for the ids present in MUN_id.
Expected result:
MUN_id Col1 total_population
1-2 a 45
3 b 4
4-5-6 c 195
...
I haven't included an attempt, because I am new to Python and don't know how to approach this.
MUNg['total_population']=?
Many thanks!
You can split and explode your strings into new rows, map the population data, then group by the original index and sum:
MUNg['total_population'] = (MUNg['MUN_id']
    .str.split('-')
    .explode()
    .astype(int)  # required if "id" in "ppc" is an integer; comment out if it is a string
    .map(ppc.set_index('id')['population'])
    .groupby(level=0).sum()
)
output:
MUN_id Col1 total_population
0 1-2 a 45
1 3 b 4
2 4-5-6 c 195
Hello, I am new to Python. I have two dataframes and a list of tickers, and I would like to combine the two dataframes based on that list. My second dataframe had the tickers imported from an Excel sheet, so its rows are in a different order; I am not sure if that changes anything.
df1 looks like:

       ABC  DEF  XYZ
index
avg      2    6   12
std      1    2    3
var     24   25   35
max     56   66   78
df2 looks like:

   index    10    40      96
0  ticker   XYZ   ABC     DEF
1  Sector  Auto  Tech  Mining
I would like to combine them based on their ticker names into a third df with all the information, so it looks something like this:

df3

         ABC     DEF   XYZ
index
avg        2       6    12
std        1       2     3
var       24      25    35
max       56      66    78
Sector  Tech  Mining  Auto
I have tried this:
df3 = pd.concat([df1, df2], ignore_index=True)
but it produced a df where the frames sat side by side instead of one combined df. Any help would be appreciated.
You need to reshape df2 so it shares df1's column layout before concatenating: set "index" as the index, transpose, promote the "ticker" row to the index, and transpose back:
df2 = df2.set_index('index').T.set_index('ticker').T
out = pd.concat([df1, df2])
         ABC     DEF   XYZ
index
avg        2       6    12
std        1       2     3
var       24      25    35
max       56      66    78
Sector  Tech  Mining  Auto
I have a dataframe that looks like:
patient_id note_id lines
A 10 1
A 10 2
A 10 3
A 29 1
A 29 2
B 12 1
B 95 1
B 95 2
B 95 3
C......
D......
E 14 1
E 55 1
E 87 1
......
Each patient can have multiple notes, and each note may contain more than one line. Say that I have 20 patients, 50 notes and 150 lines. How can I randomly select one note for each of 3 randomly selected patients? That is, with one random note per randomly selected patient_id, I would get:
patient_id note_id lines
A 29 1
A 29 2
B 12 1
E 55 1
I'd suggest creating a temporary dataset without the lines column. Then use .drop_duplicates() to get one row per note, .sample() to choose your random subset, and .merge() to rejoin the sample to the original dataset on patient_id and note_id. There may well be a quicker way, as I'm not a pandas expert.
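A minimal sketch of that approach, using toy stand-in data shaped like the example above (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'patient_id': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'E', 'E', 'E'],
    'note_id':    [10, 10, 10, 29, 29, 12, 95, 95, 95, 14, 55, 87],
    'lines':      [1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 1, 1],
})

# One row per (patient_id, note_id), ignoring the lines column
notes = df[['patient_id', 'note_id']].drop_duplicates()

# Pick 3 random patients, then one random note per chosen patient
sampled_patients = notes['patient_id'].drop_duplicates().sample(3)
sampled_notes = (notes[notes['patient_id'].isin(sampled_patients)]
                 .groupby('patient_id')
                 .sample(1))

# Rejoin to recover every line of the sampled notes
result = df.merge(sampled_notes, on=['patient_id', 'note_id'])
print(result)
```

Note that `GroupBy.sample` requires pandas 1.1+; on older versions you could replace it with `groupby('patient_id').apply(lambda g: g.sample(1))`.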
I have the following dataframe:
teste.head(5)
card_id feature_1 feature_2
0 C_ID_92a2005557 5 2
1 C_ID_3d0044924f 4 1
2 C_ID_d639edf6cd 2 2
3 C_ID_186d6a6901 4 3
4 C_ID_cdbd2c0db2 1 3
And I have this other dataframe:
historical.head(5)
authorized_flag card_id city_id category_1 installments category_3 merchant_category_id merchant_id
0 Y C_ID_cdbd2c0db2 88 N 0 A 80 M_ID_e020e9b302
1 Y C_ID_92a2005557 88 N 0 A 367 M_ID_86ec983688
2 Y C_ID_d639edf6cd 88 N 0 A 80 M_ID_979ed661fc
3 Y C_ID_d639edf6cd 88 N 0 A 560 M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N 0 A 80 M_ID_e020e9b302
Comments:
The first dataframe has only some information about the card_id and the value I want to predict (target)
The second dataframe looks like the history of each card_id containing the columns I need to merge to the first dataframe (giving more information / columns for each card_id)
Obviously the card_id in the second dataframe is repeated several times, so from it I need to build a new dataframe with only one row per card_id.
I can use:
historical.groupby('card_id').size()
and create a new column with the number of times each card_id was used.
But I'm not able to do this for the rest of the columns, because I need to sum the values in each column per card_id and then merge the result with the first dataframe.
Can you help me create the new columns in the best way?
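One common pattern for this is to aggregate the history down to one row per card_id with groupby and named aggregation, then merge the result onto the first frame. A minimal sketch with toy stand-ins for `teste` and `historical` (the aggregation choices and new column names are assumptions, not from the question):

```python
import pandas as pd

# toy stand-ins with only the relevant columns
teste = pd.DataFrame({'card_id': ['C1', 'C2'], 'feature_1': [5, 4]})
historical = pd.DataFrame({
    'card_id':      ['C1', 'C1', 'C2'],
    'installments': [0, 2, 1],
    'city_id':      [88, 88, 17],
})

# Collapse the history to one row per card_id:
# a transaction count plus a sum for each numeric column of interest
agg = (historical
       .groupby('card_id')
       .agg(n_transactions=('card_id', 'size'),
            installments_sum=('installments', 'sum')))

# Merge the aggregated features back onto the first frame
out = teste.merge(agg.reset_index(), on='card_id', how='left')
print(out)
```

`how='left'` keeps every card_id from the first frame even if it has no history; non-numeric columns would need a different aggregation (e.g. `'first'` or `'nunique'`).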
I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd

df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]})
df.set_index(['Name1', 'Name2'], inplace=True)

df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1, 0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3