Smart pandas merge - python

Hi everyone!
I have a problem. I want to merge two pandas DataFrames on a shared column, where the 1st DataFrame's column contains all the values of the 2nd DataFrame's column. In the result I want to keep the values from the 2nd DataFrame where they exist, and the values from the 1st where they don't. Like this:
1st:
_ col_1 col_2
0 123 100
1 124 200
2 125 150
3 126 250
4 127 300
2nd:
_ col_1 col_2
0 123 10
1 125 20
2 127 30
And I want to get the following:
_ col_1 col_2
0 123 10
1 124 200
2 125 20
3 126 250
4 127 30

Use concat with DataFrame.drop_duplicates and DataFrame.sort_values:
df = (pd.concat([df2, df1], ignore_index=True)
        .drop_duplicates('col_1')
        .sort_values('col_1'))
print (df)
col_1 col_2
0 123 10
4 124 200
1 125 20
6 126 250
2 127 30
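The order inside concat matters here: drop_duplicates keeps the first occurrence of each col_1 by default, so listing df2 first gives its values priority. A minimal sketch, with the sample frames rebuilt from the question:
import pandas as pd

# sample data rebuilt from the question
df1 = pd.DataFrame({'col_1': [123, 124, 125, 126, 127],
                    'col_2': [100, 200, 150, 250, 300]})
df2 = pd.DataFrame({'col_1': [123, 125, 127],
                    'col_2': [10, 20, 30]})

# df2 first, so its rows survive drop_duplicates (keep='first' is the default)
df = (pd.concat([df2, df1], ignore_index=True)
        .drop_duplicates('col_1')
        .sort_values('col_1')
        .reset_index(drop=True))   # optional: restore a clean 0..n index
print (df)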

Related

How to groupby multiple columns in dataframe, except one in python

I have the following dataframe:
ID Code Color Value
-----------------------------------
0 111 AAA Blue 23
1 111 AAA Red 43
2 111 AAA Green 4
3 121 ABA Green 45
4 121 ABA Green 23
5 121 ABA Red 75
6 122 AAA Red 52
7 122 ACA Blue 24
8 122 ACA Blue 53
9 122 ACA Green 14
...
I want to group this dataframe by the columns "ID" and "Code", and sum the values from the "Value" column, while excluding the "Color" column from this grouping. Or in other words, I want to group by all non-Value columns except for the "Color" column, and then sum the values from the "Value" column. I am using python for this.
What I am thinking of doing is creating a list of all column names that are not "Color" or "Value", calling it "column_list", and then simply running:
df.groupby['column_list'].sum()
Though this will not work. How might I augment this code so that I can properly group by as intended?
EDIT:
This code works:
bins = df.groupby([df.columns[0],
                   df.columns[1],
                   df.columns[2]]).count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
Full code that is not working:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list, as_index=False)['Value'].count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
You can pass a list to groupby and specify the column to aggregate with sum:
column_list = [c for c in df.columns if c not in ['Color','Value']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
Or:
column_list = list(df.columns.difference(['Color','Value'], sort=False))
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
With your sample data this is equivalent to:
df1 = df.groupby(['ID','Code'], as_index=False)['Value'].sum()
EDIT: Yes, this also works:
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
The reason is that sum drops non-numeric columns by default, and because Value is not specified, it sums all remaining columns (note that recent pandas versions no longer drop non-numeric columns silently, so selecting the aggregation column explicitly, as above, is the safer pattern).
So if Color is numeric, it gets summed too:
print (df)
ID Code Color Value
0 111 AAA 1 23
1 111 AAA 2 43
2 111 AAA 3 4
3 121 ABA 1 45
4 121 ABA 1 23
5 121 ABA 2 75
6 122 AAA 1 52
7 122 ACA 2 24
8 122 ACA 1 53
9 122 ACA 2 14
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
print (df1)
ID Code Value Color
0 111 AAA 4 3
1 111 AAA 23 1
2 111 AAA 43 2
3 121 ABA 23 1
4 121 ABA 45 1
5 121 ABA 75 2
6 122 AAA 52 1
7 122 ACA 14 2
8 122 ACA 24 2
9 122 ACA 53 1
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
print (df1)
ID Code Value
0 111 AAA 4
1 111 AAA 23
2 111 AAA 43
3 121 ABA 23
4 121 ABA 45
5 121 ABA 75
6 122 AAA 52
7 122 ACA 14
8 122 ACA 24
9 122 ACA 53
EDIT: If you need a MultiIndex in bins, remove as_index=False and the column selection after groupby:
bins = df.groupby([df.columns[0],
                   df.columns[1],
                   df.columns[2]]).count()
should be changed to:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list).count()
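Putting it together, one way the full weight computation might look, as a sketch under the assumption that each bin's weight is its count divided by the total count for its ID (sample data rebuilt from the question):
import pandas as pd

# sample data rebuilt from the question
df = pd.DataFrame({'ID':    [111, 111, 111, 121, 121, 121, 122, 122, 122, 122],
                   'Code':  ['AAA','AAA','AAA','ABA','ABA','ABA','AAA','ACA','ACA','ACA'],
                   'Color': ['Blue','Red','Green','Green','Green','Red','Red','Blue','Blue','Green'],
                   'Value': [23, 43, 4, 45, 23, 75, 52, 24, 53, 14]})

column_list = [c for c in df.columns if c not in ['Value']]      # ['ID', 'Code', 'Color']
bins = df.groupby(column_list).count()                           # MultiIndex, counts in 'Value'
# weight of each (ID, Code, Color) bin within its ID
bins['Weight'] = (bins['Value']
                  / bins.groupby(level=0)['Value'].transform('sum')).round(4)
bins = bins.reset_index()
print (bins)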

Python-How to bin positive and negative values to get counts for time series plot

I'm trying to recreate a time-series plot similar to the one below (not including the 'HLMA Flashes' data).
This is what my datafile looks like; the polarity is in the "Charge" column. I used pandas to load the file and set up the table in a Jupyter notebook. The value of the charge does not matter, only whether it is positive or negative.
Once I get the counts of totals/negatives/positives, I know how to plot them against time, but I'm not sure how to approach the binning to get the counts (or whatever is needed) to make the time series. Preferably I need this in 5-minute bins over the timeframe of my dataframe (0000-0700 UTC). Apologies if this question is worded poorly, but any leads would be appreciated.
Link to .txt file: https://drive.google.com/file/d/13XEc74LO3cZQhylAdSfhLeUn7GFgtiKT/view?usp=sharing
Here's a way to do what I believe you are asking:
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime(df.agg(lambda x: f"{x['Date']} {x['Time']}", axis=1)),
    'Neg': df.Charge < 0,
    'Pos': df.Charge > 0,
    'Tot': [1] * len(df)})
df2['minutes'] = (df2.Datetime.dt.hour * 60 + df2.Datetime.dt.minute) // 5 * 5
df3 = df2[['minutes', 'Neg', 'Pos', 'Tot']].groupby('minutes').sum()
Output:
Neg Pos Tot
minutes
45 0 1 1
55 0 1 1
65 0 2 2
85 0 2 2
90 0 2 2
95 0 1 1
100 0 3 3
105 1 4 5
110 2 11 13
115 0 10 10
120 0 6 6
125 1 13 14
130 3 70 73
135 2 20 22
140 1 5 6
165 0 2 2
170 3 1 4
175 2 5 7
180 2 12 14
185 3 26 29
190 1 11 12
195 0 4 4
200 1 14 15
205 1 4 5
210 0 1 1
215 0 1 1
220 0 1 1
225 3 0 3
230 1 5 6
235 0 4 4
240 1 2 3
245 0 3 3
260 0 1 1
265 0 1 1
Explanation:
create a 'Datetime' column from 'Date' and 'Time' columns using to_datetime()
create Neg and Pos columns based on sign of Charge, and create Tot column equal to 1 for each row
create minutes column to bin the rows into 5 minute intervals
use groupby() and sum() to aggregate Neg, Pos and Tot for each interval with at least one row.
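If you would rather keep real timestamps on the axis (instead of minutes after midnight), here is a sketch of an alternative using resample with a 5-minute frequency; it assumes the same 'Date', 'Time' and 'Charge' columns as above, and unlike groupby it also emits empty 5-minute bins as zeros, which is convenient for a continuous time series:
import pandas as pd

# assumes df has 'Date', 'Time' and 'Charge' columns as in the question's file
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime(df.agg(lambda x: f"{x['Date']} {x['Time']}", axis=1)),
    'Neg': df.Charge < 0,
    'Pos': df.Charge > 0,
    'Tot': [1] * len(df)})
counts = df2.set_index('Datetime').resample('5min').sum()   # one row per 5-minute bin
print (counts)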

Edit columns based on duplicate values found in Pandas

I have below dataframe:
No: Fee:
111 500
111 500
222 300
222 300
123 400
If the data in No is duplicated, I want to keep only one Fee and remove the others.
Should look like below:
No: Fee:
111 500
111
222 300
222
123 400
I actually have no idea where to start, so please guide here.
Thanks.
Use DataFrame.duplicated and set an empty string with DataFrame.loc:
# if you need to test duplicates by both columns
mask = df.duplicated(['No','Fee'])
df.loc[mask, 'Fee'] = ''
print (df)
No Fee
0 111 500
1 111
2 222 300
3 222
4 123 400
But then the numeric column is lost, because numbers are mixed with strings:
print (df['Fee'].dtype)
object
A possible solution is to use missing values if you need a numeric column:
df.loc[mask, 'Fee'] = np.nan
print (df)
No Fee
0 111 500.0
1 111 NaN
2 222 300.0
3 222 NaN
4 123 400.0
print (df['Fee'].dtype)
float64
Or, if you prefer a nullable integer dtype:
df.loc[mask, 'Fee'] = np.nan
df['Fee'] = df['Fee'].astype('Int64')
print (df)
No Fee
0 111 500
1 111 <NA>
2 222 300
3 222 <NA>
4 123 400
print (df['Fee'].dtype)
Int64
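An equivalent sketch with Series.mask, which replaces the values where the condition is True with NaN in one step:
mask = df.duplicated(['No','Fee'])
df['Fee'] = df['Fee'].mask(mask).astype('Int64')
print (df)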

Extract corresponding df value with reference from another df

There are 2 dataframes with a 1-to-1 correspondence between rows. I can retrieve the idxmax from all the value columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df=df1.loc[df1.iloc[:,1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref max value corresponding value
12 78 180
10 76 90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
    .merge(df2.rename(columns={'value1_pair': 'value1',
                               'value2_pair': 'value2'})
              .melt('ref'),
           on=['ref', 'variable'])
    .sort_values('value_x')
    .groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90
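Because the frames correspond row by row, another sketch simply reuses the idxmax positions from df1 to index into df2 (this assumes the default 0..n RangeIndex shown in the question):
idx = df1.iloc[:, 1:].idxmax()            # row label of each value column's max
out = pd.DataFrame({
    'ref': df1.loc[idx, 'ref'].to_numpy(),
    'max value': df1.iloc[:, 1:].max().to_numpy(),
    'corresponding value': [df2.iloc[i, j + 1] for j, i in enumerate(idx)]})
print (out)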

How can I create counts of terms in one column and append the counts as additional columns in a pandas data frame

I have a pandas data frame that looks like this:
ID Key
101 A
102 A
205 A
101 B
105 A
605 A
200 A
102 B
I would like to make a new table that counts the number of occurrences of "A" and "B" in the Key column and uses them as two new column headers. The table would then look like this:
ID A B
101 1 1
102 1 1
205 1 0
105 1 0
605 1 0
200 1 0
I have tried grouping by 'ID' and 'Key' and getting the sizes like this:
df.groupby(['ID', 'Key']).size().transform('A', 'B')
But it says the series doesn't have the attribute 'transform', and actually, I am not even sure if I can pass two arguments to 'transform'.
You are close, you just need unstack:
df = df.groupby(['ID', 'Key']).size().unstack(fill_value=0)
print (df)
Key A B
ID
101 1 1
102 1 1
105 1 0
200 1 0
205 1 0
605 1 0
Or crosstab:
df = pd.crosstab(df['ID'], df['Key'])
print (df)
Key A B
ID
101 1 1
102 1 1
105 1 0
200 1 0
205 1 0
605 1 0
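If you want ID back as an ordinary column, as in your desired output, a small follow-up sketch (the same two calls also work on the unstack result):
df = (pd.crosstab(df['ID'], df['Key'])
        .reset_index()
        .rename_axis(columns=None))   # drop the leftover 'Key' columns name
print (df)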
