Split Two Related DataFrame Columns into Two New DataFrames - python

I have two related columns in a DataFrame in Python. One column is binary, i.e. 1, 0, 0, 1, 0, etc., and the adjacent column holds a related value, e.g. 200, 34, 124, etc. I want to take all the zero rows with their corresponding values in the adjacent column and create a new DataFrame, and do the same for all the ones. An illustration of the columns is below:
Location Price
1 24
0 200
0 56
0 89
1 101
1 94
1 3

You can make two new dataframes, one with just the zeros and one with just the ones, using boolean indexing, if I understand correctly:
df[df.Location == 0]
# Location Price
#1 0 200
#2 0 56
#3 0 89
df[df.Location == 1]
# Location Price
#0 1 24
#4 1 101
#5 1 94
#6 1 3
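For completeness, here is a self-contained version of the answer above, as a sketch that rebuilds the example frame from the question:

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({"Location": [1, 0, 0, 0, 1, 1, 1],
                   "Price": [24, 200, 56, 89, 101, 94, 3]})

# Boolean masks select the matching rows; each result is a new DataFrame
zeros = df[df["Location"] == 0]
ones = df[df["Location"] == 1]
```

If you want each new frame to restart its row numbering from 0, append `.reset_index(drop=True)` to either expression.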

Related

How to add multiple dataframes to individual cells inside a new dataframe

I have multiple dataframes that I have scraped from a website using pandas.read_html(). The tables have different shapes (different numbers of rows and columns), but they belong to a single entity, so I want to add all those dataframes to individual cells of the same row.
Here is an example.
I have the following dataframes:
df1=pd.DataFrame([[1,2,3]]*2)
df1
df2=pd.DataFrame([['a','b']]*3)
df2
df3=pd.DataFrame([[23,565,34,67,34]]*1)
df3
df4=pd.DataFrame([['df','grgrd','weddv','dfdf','re',45,93]]*5)
df4
and this is how I am trying to do it:
d = {}
d['a'] = df1
d['b'] = df2
d['c'] = df3
d['d'] = df4
df_out = pd.DataFrame([d])
but the result looks like this:
a b c d
0 0 1 2 0 1 2 3 1 1 2 3 0 1 0 a b 1 a b 2 a b 0 1 2 3 4 0 23 565 34 67 34 0 1 2 3 4 5 6 0 df g...
Looks like the index counters are also added as values to the cells.
How do I remove the index values?
Is there a way that they are stored in a way that they would appear as a table within individual cells?
Is there a better way to do it?
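One approach that seems to work (a sketch, not the only option): wrap each DataFrame in a one-element list when building the new frame, so pandas stores the object itself in a single cell instead of trying to expand its rows and columns:

```python
import pandas as pd

# Two of the example frames from the question
df1 = pd.DataFrame([[1, 2, 3]] * 2)
df2 = pd.DataFrame([['a', 'b']] * 3)

# A one-element list per key makes each cell hold the DataFrame object itself
df_out = pd.DataFrame({'a': [df1], 'b': [df2]})
```

The cells then contain full DataFrame objects (`df_out.loc[0, 'a']` is `df1`), though when printed they will still render via their text repr rather than as nested tables.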

Binary Vectorization Encoding for categorical variable grouped by date issue

I'm trying to binary-encode a categorical variable, aggregating the rows that share the same ids and date (the categories are non-exclusive, so one group can have several of them) while not merging across different dates (Python and pandas).
Let's say this is the data
id1   id2   type         month.measure
105   50    growing      04-2020
105   50    advancing    04-2020
44    29    advancing    04-2020
105   50    retreating   05-2020
105   50    shrinking    05-2020
It would have to end like this
id1   id2   growing   shrinking   advancing   retreating   month.measure
105   50    1         0           1           0            04-2020
44    29    0         0           1           0            04-2020
105   50    0         1           0           1            05-2020
I've been trying with transformations of all kinds, lambda functions, pandas get_dummies and trying to aggregate them grouped by the 2 ids and the date but I couldn't find a way.
Hope we can sort it out! Thanks in advance! :)
This solution uses pandas get_dummies to one-hot encode the "TYPE" column, concatenates the one-hot encoded dataframe back onto the original, and then groups by the ID columns and "MONTH.MEASURE", summing the dummies:
# Set up the dataframe
ID1 = [105,105,44,105,105]
ID2 = [50,50,29,50,50]
TYPE = ['growing','advancing','advancing','retreating','shrinking']
MONTH = ['04-2020','04-2020','04-2020','05-2020','05-2020']
df = pd.DataFrame({'ID1':ID1,'ID2':ID2, 'TYPE':TYPE, 'MONTH.MEASURE':MONTH})
# Apply get_dummies and groupby operations
df = pd.concat([df.drop('TYPE',axis=1),pd.get_dummies(df['TYPE'])],axis=1)\
.groupby(['ID1','ID2','MONTH.MEASURE']).sum().reset_index()
# These bits are just cosmetic to get the output to look more like your required output
df.columns = [c.upper() for c in df.columns]
col_order = ['GROWING','SHRINKING','ADVANCING','RETREATING','MONTH.MEASURE']
df[['ID1','ID2']+col_order]
# ID1 ID2 GROWING SHRINKING ADVANCING RETREATING MONTH.MEASURE
# 0 44 29 0 0 1 0 04-2020
# 1 105 50 1 0 1 0 04-2020
# 2 105 50 0 1 0 1 05-2020
This is a job for crosstab (note this answer uses the question's lower-case column names):
pd.crosstab([df['id1'], df['id2'], df['month.measure']], df['type']).reset_index()
Output:
type id1 id2 month.measure advancing growing retreating shrinking
0 44 29 04-2020 1 0 0 0
1 105 50 04-2020 1 1 0 0
2 105 50 05-2020 0 0 1 1
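One caveat with the crosstab approach: it counts occurrences, so if a (id1, id2, month) group could ever contain the same type more than once, a cell could exceed 1. Clipping keeps the encoding strictly binary; a runnable sketch with the question's data and lower-case names:

```python
import pandas as pd

df = pd.DataFrame({
    'id1': [105, 105, 44, 105, 105],
    'id2': [50, 50, 29, 50, 50],
    'type': ['growing', 'advancing', 'advancing', 'retreating', 'shrinking'],
    'month.measure': ['04-2020', '04-2020', '04-2020', '05-2020', '05-2020'],
})

# Cross-tabulate, then clip counts down to 0/1 to force a binary encoding
out = (pd.crosstab([df['id1'], df['id2'], df['month.measure']], df['type'])
         .clip(upper=1)
         .reset_index())
```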

New dataframe from a groupby

I have the following dataframe:
teste.head(5)
card_id feature_1 feature_2
0 C_ID_92a2005557 5 2
1 C_ID_3d0044924f 4 1
2 C_ID_d639edf6cd 2 2
3 C_ID_186d6a6901 4 3
4 C_ID_cdbd2c0db2 1 3
And I have this other dataframe:
historical.head(5)
authorized_flag card_id city_id category_1 installments category_3 merchant_category_id merchant_id
0 Y C_ID_cdbd2c0db2 88 N 0 A 80 M_ID_e020e9b302
1 Y C_ID_92a2005557 88 N 0 A 367 M_ID_86ec983688
2 Y C_ID_d639edf6cd 88 N 0 A 80 M_ID_979ed661fc
3 Y C_ID_d639edf6cd 88 N 0 A 560 M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N 0 A 80 M_ID_e020e9b302
Comments:
The first dataframe has only some information about each card_id and the value I want to predict (the target).
The second dataframe is the history of each card_id, containing the columns I need to merge into the first dataframe (giving more information / columns for each card_id).
The card_id in the second dataframe is obviously repeated several times, so from the second dataframe I need to build a new dataframe with one row per card_id.
I can use:
historical.groupby('card_id').size()
to create a new column with the number of times each card_id appears.
But I can't do the same for the rest of the columns, because I need to sum the values in each column per card_id and then merge the result with the first dataframe.
Can you help me create the new columns in the best way?
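A sketch of one way to do this (the columns beyond card_id are illustrative, loosely taken from the question): aggregate historical once per card_id with named aggregations, which gives both the count and the per-column sums in one pass, then merge the result into the first dataframe:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question
historical = pd.DataFrame({
    'card_id': ['C_ID_92a2005557', 'C_ID_d639edf6cd', 'C_ID_92a2005557'],
    'installments': [0, 1, 2],
    'city_id': [88, 88, 88],
})
teste = pd.DataFrame({'card_id': ['C_ID_92a2005557', 'C_ID_d639edf6cd'],
                      'feature_1': [5, 2]})

# One row per card_id: count the rows and sum the numeric columns
agg = (historical.groupby('card_id')
                 .agg(n_transactions=('card_id', 'size'),
                      installments_sum=('installments', 'sum'))
                 .reset_index())

# Attach the aggregates to the first dataframe
merged = teste.merge(agg, on='card_id', how='left')
```

Named aggregation (the `new_name=('column', 'func')` form) needs pandas 0.25 or later; with `how='left'` any card_id missing from historical simply gets NaN aggregates.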

DataFrame transform column values to new columns

I have following series:
project id type
First 130403725 PRODUCT 68
EMPTY 2
Six 130405706 PRODUCT 24
132517244 PRODUCT 33
132607436 PRODUCT 87
How can I transform the type values into new columns:
PRODUCT EMPTY
project id
First 130403725 68 2
Six 130405706 24 0
132517244 33 0
132607436 87 0
This is a classic pivot table:
df_pivoted = df.pivot(index=["project", "id"], columns=["type"], values=[3])
I've used 3 as the label of the unnamed value column; it would be clearer if you had named it.
Use unstack, because this is a Series with a MultiIndex:
s1 = s.unstack(fill_value=0)
print(s1)
type EMPTY PRODUCT
project id
First 130403725 2 68
Six 130405706 0 24
132517244 0 33
132607436 0 87
For a DataFrame, add reset_index and drop the columns name:
df = s.unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)
project id EMPTY PRODUCT
0 First 130403725 2 68
1 Six 130405706 0 24
2 Six 132517244 0 33
3 Six 132607436 0 87
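To check the unstack answer end to end, the MultiIndex series can be rebuilt from the question's data (a sketch):

```python
import pandas as pd

# Reconstruct the MultiIndex Series shown in the question
idx = pd.MultiIndex.from_tuples(
    [('First', 130403725, 'PRODUCT'), ('First', 130403725, 'EMPTY'),
     ('Six', 130405706, 'PRODUCT'), ('Six', 132517244, 'PRODUCT'),
     ('Six', 132607436, 'PRODUCT')],
    names=['project', 'id', 'type'])
s = pd.Series([68, 2, 24, 33, 87], index=idx)

# Pivot the innermost level ('type') into columns, filling gaps with 0
s1 = s.unstack(fill_value=0)
```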

Python Pandas: Select Multiple Cell Values of one column based on the Value of another Column

So my data, in Pandas, looks like this:
values variables
134 1
12 2
43 1
54 3
16 2
And I want to create a new column that, for each row, holds the sum of values over all rows whose variables entry differs from the current row's. For example, for the first row I would sum all the values where variables != 1. The result would look like this:
values variables result
134 1 82
12 2 231
43 1 82
54 3 205
16 2 231
I've tried a couple of things, like enumerate, but I can't seem to get a good handle on this. Thanks!
Instead of finding the sum of all values that aren't equal to the current variable, you can equivalently subtract the sum of all values that are equal to the current variable from the total sum without any filters:
df['result'] = df['values'].sum()
df['result'] -= df.groupby('variables')['values'].transform('sum')
Or in a single line if you want to be terse:
df['result'] = df['values'].sum() - df.groupby('variables')['values'].transform('sum')
The resulting output:
values variables result
0 134 1 82
1 12 2 231
2 43 1 82
3 54 3 205
4 16 2 231
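Put together as a runnable sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'values': [134, 12, 43, 54, 16],
                   'variables': [1, 2, 1, 3, 2]})

# Total sum minus the per-group sum equals the sum over all *other* groups
df['result'] = df['values'].sum() - df.groupby('variables')['values'].transform('sum')
```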
