I have the following dataframe:
teste.head(5)
card_id feature_1 feature_2
0 C_ID_92a2005557 5 2
1 C_ID_3d0044924f 4 1
2 C_ID_d639edf6cd 2 2
3 C_ID_186d6a6901 4 3
4 C_ID_cdbd2c0db2 1 3
And I have this other dataframe:
historical.head(5)
authorized_flag card_id city_id category_1 installments category_3 merchant_category_id merchant_id
0 Y C_ID_cdbd2c0db2 88 N 0 A 80 M_ID_e020e9b302
1 Y C_ID_92a2005557 88 N 0 A 367 M_ID_86ec983688
2 Y C_ID_d639edf6cd 88 N 0 A 80 M_ID_979ed661fc
3 Y C_ID_d639edf6cd 88 N 0 A 560 M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N 0 A 80 M_ID_e020e9b302
Comments:
The first dataframe has only some information about the card_id and the value I want to predict (target)
The second dataframe looks like the history of each card_id containing the columns I need to merge to the first dataframe (giving more information / columns for each card_id)
Obviously, card_id is repeated many times in the second dataframe, so from it I need to build a new dataframe with a single row per card_id (not letting card_id multiply).
I can use:
historical.groupby('card_id').size()
and create a new column with the number of times each card_id was used.
But I'm not able to do this with the rest of the columns: I need to sum the values in each column per card_id so that I can then merge the result with the first dataframe.
Can you help me create the new columns in the best way?
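For what it's worth, a minimal sketch of one way to do this is groupby.agg followed by a merge. The output column names (n_transactions, total_installments, n_authorized) are made up for illustration, and the aggregation choices are assumptions: sums only make sense for numeric columns, while categorical ones usually need a count or a mode instead.
import pandas as pd

# aggregate the history so there is exactly one row per card_id
agg = (historical
       .groupby('card_id')
       .agg(n_transactions=('card_id', 'size'),
            total_installments=('installments', 'sum'),
            n_authorized=('authorized_flag', lambda s: (s == 'Y').sum()))
       .reset_index())

# bring the per-card aggregates over to the first dataframe
teste = teste.merge(agg, on='card_id', how='left')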
I have a dataframe ("MUNg") like this:
MUN_id Col1
1-2 a
3 b
4-5-6 c
...
And another dataframe ("ppc") like this:
id population
0 1 20
1 2 25
2 3 4
3 4 45
4 5 100
5 6 50
...
I need to create a column in "MUNg" that contains the total population, obtained by summing the population values from "ppc" for the ids present in MUN_id.
Expected result:
MUN_id Col1 total_population
1-2 a 45
3 b 4
4-5-6 c 195
...
I'm not showing what I tried, because I am new to Python and don't know how to approach this.
MUNg['total_population']=?
Many thanks!
You can split and explode your string into new rows, map the population data, and group by the original index to get the sum:
MUNg['total_population'] = (MUNg['MUN_id']
.str.split('-')
.explode()
.astype(int) # required if "id" in "ppc" is an integer, comment if string
.map(ppc.set_index('id')['population'])
.groupby(level=0).sum()
)
output:
MUN_id Col1 total_population
0 1-2 a 45
1 3 b 4
2 4-5-6 c 195
I have multiple dataframes that I scraped from a website using pandas.read_html(). The tables have different numbers of rows and columns, but they belong to a single entity, so I want to add each of those dataframes to an individual cell of the same row.
Here is an example.
I have the following dataframes:
df1 = pd.DataFrame([[1, 2, 3]] * 2)
df1
df2 = pd.DataFrame([['a', 'b']] * 3)
df2
df3 = pd.DataFrame([[23, 565, 34, 67, 34]] * 1)
df3
df4 = pd.DataFrame([['df', 'grgrd', 'weddv', 'dfdf', 're', 45, 93]] * 5)
df4
and this is how I am trying to do it:
d = {}
d['a'] = df1
d['b'] = df2
d['c'] = df3
d['d'] = df4
df_out = pd.DataFrame([d])
but the result looks like this:
a b c d
0 0 1 2 0 1 2 3 1 1 2 3 0 1 0 a b 1 a b 2 a b 0 1 2 3 4 0 23 565 34 67 34 0 1 2 3 4 5 6 0 df g...
Looks like the index counters are also added as values to the cells.
How do I remove indices values?
Is there a way that they are stored in a way that they would appear as a table within individual cells?
Is there a better way to do it?
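A minimal sketch of one workaround, assuming you only need the raw cell values rather than DataFrame objects, is to store each table as a plain list of rows, which drops the index and column labels from the cells:
# v.values.tolist() keeps only the data, so the scraped row/column
# labels no longer appear inside the cells
d = {'a': df1, 'b': df2, 'c': df3, 'd': df4}
df_out = pd.DataFrame([{k: v.values.tolist() for k, v in d.items()}])
Each cell then holds a nested list such as [[1, 2, 3], [1, 2, 3]] instead of a full DataFrame repr.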
I basically have two related columns in a dataframe in Python. One column is binary (1, 0, 0, 1, 0, etc.) and the other holds a related value (200, 34, 124, etc.). I want to take all the zero rows with their corresponding values in the adjacent column and create a new dataframe, and do the same for all the ones. An illustration of the columns is below:
Location Price
1 24
0 200
0 56
0 89
1 101
1 94
1 3
You can make two new dataframes with just ones and zeros like this, IIUC:
df[df.Location == 0]
# Location Price
#1 0 200
#2 0 56
#3 0 89
df[df.Location == 1]
# Location Price
#0 1 24
#4 1 101
#5 1 94
#6 1 3
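If you want to keep working with the two subsets, a small follow-up (the variable names df_zeros and df_ones are just illustrative) is to assign each filter to its own dataframe:
# reset_index(drop=True) gives each subset a fresh 0..n-1 index and
# returns a new dataframe, so the subsets are safe to modify later
df_zeros = df[df.Location == 0].reset_index(drop=True)
df_ones = df[df.Location == 1].reset_index(drop=True)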
I have a dataframe like this:
userid itemid timestamp
1 1 50
1 2 50
1 3 50
1 4 60
2 1 40
2 2 50
I want to drop all rows whose userid occurs more than 2 times and get a new dataframe as follows. Can someone help me? Thanks.
userid itemid timestamp
2 1 40
2 2 50
You can use pd.Series.value_counts to find the userid values that occur more than twice, then use pd.Series.isin to filter them out of your original dataframe.
c = df['userid'].value_counts()
idx = c[c > 2].index
res = df[~df['userid'].isin(idx)]
print(res)
userid itemid timestamp
4 2 1 40
5 2 2 50
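As an aside, the same filter can be written in one line with GroupBy.transform, which broadcasts each group's size back onto the original rows (same result, different idiom):
# keep only rows whose userid appears at most twice
res = df[df.groupby('userid')['userid'].transform('size') <= 2]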
I have a pandas dataframe that looks something like this:
id group gender age_grp status
1 1 m over21 active
2 4 m under21 active
3 2 f over21 inactive
I have over 100 columns and thousands of rows. I am trying to create a single pandas dataframe of the value_counts of each of the columns. So I want something that looks like this:
group1
gender m 100
f 89
age_grp over21 98
under21 11
status active 87
inactive 42
Anyone know a simple way to iteratively concat the value_counts from each of the 100+ columns in the original dataset while capturing the name of the columns as a hierarchical index?
Eventually I want to be able to merge with another dataframe of a different group to look like this:
group1 group2
gender m 100 75
f 89 92
age_grp over21 98 71
under21 11 22
status active 87 44
inactive 42 13
Thanks!
This should do it:
df.stack().groupby(level=1).value_counts()
id 1 1
2 1
3 1
group 1 1
2 1
4 1
gender m 2
f 1
age_grp over21 2
under21 1
status active 2
inactive 1
dtype: int64
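For the eventual two-group layout, one sketch is to compute the same counts per group and concatenate them side by side; df_g1 and df_g2 here are assumed stand-ins for the two groups' dataframes:
import pandas as pd

g1 = df_g1.stack().groupby(level=1).value_counts().rename('group1')
g2 = df_g2.stack().groupby(level=1).value_counts().rename('group2')

# align on the (column, value) index; combinations missing from one
# group show up as NaN
combined = pd.concat([g1, g2], axis=1)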