Removing rows based on the combined value of other rows - python

I want to remove the row with "SubAggregate" = 'All' if the rows with the same "Month" and "MainAggregate" sum ("ValueTraded") to the same value as the corresponding "SubAggregate" = 'All' row.
My idea was to group by "MainAggregate" and "Month": if the group sum was equal to two times the value in the 'All' row, that row would be deleted. I only got to the point where I grouped the data:
Tester = data.groupby(["Month", "MainAggregate"], as_index=False)["ValueTraded"].sum()
Tester["ValueTraded"] = Tester["ValueTraded"] / 2
Below is an example of the data and the desired output:
[The original post shows the example data and the desired output as screenshots.]

You can compare the halved per-group sum from GroupBy.transform('sum') with the first 'All' value per group from GroupBy.transform('first'); adding DataFrame.duplicated keeps any repeated 'All' rows, so only the first one is removed:
mask = df['SubAggregate'].eq('All')
s = df.groupby(["Month", "MainAggregate"])["ValueTraded"].transform('sum').div(2)
s1 = (df.assign(new = df['ValueTraded'].where(mask))
        .groupby(["Month", "MainAggregate"])["new"].transform('first'))
out = df[s.ne(s1) | ~mask | df.duplicated(['Month','MainAggregate','SubAggregate'])]
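A minimal runnable sketch of this approach; the column names follow the question, but the sample values are made up:
import pandas as pd

# Hypothetical data: in group A the sub-rows (100 + 200) match the 'All'
# row (300), so that 'All' row is dropped; in group B they do not (500),
# so its 'All' row is kept.
df = pd.DataFrame({
    'Month': ['Jan'] * 6,
    'MainAggregate': ['A', 'A', 'A', 'B', 'B', 'B'],
    'SubAggregate': ['x', 'y', 'All', 'x', 'y', 'All'],
    'ValueTraded': [100, 200, 300, 100, 200, 500],
})

mask = df['SubAggregate'].eq('All')
s = df.groupby(["Month", "MainAggregate"])["ValueTraded"].transform('sum').div(2)
s1 = (df.assign(new=df['ValueTraded'].where(mask))
        .groupby(["Month", "MainAggregate"])["new"].transform('first'))
print(df[s.ne(s1) | ~mask | df.duplicated(['Month', 'MainAggregate', 'SubAggregate'])])
# group A's 'All' row is removed; every other row survives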
Alternatively, filter only the 'All' rows with boolean indexing, remove duplicates with DataFrame.drop_duplicates, and use the remaining indices to drop rows from the original DataFrame:
# unique index values are necessary here; reset first if needed
#df = df.reset_index(drop=True)
mask = df['SubAggregate'].eq('All')
s = df.groupby(["Month", "MainAggregate"])["ValueTraded"].sum().div(2).rename('new')
df1 = df[mask].drop_duplicates().join(s, on=['Month','MainAggregate'])
out = df.drop(df1.index[df1['ValueTraded'].eq(df1['new'])])

Related

How to split one row into multiple rows in a generic way in pandas?

name text group
a|b a test m|l|n
I have a DataFrame like the above. If there is a delimiter in a column value, I want to split it and put each part on a separate row.
columns = ['name', 'text', 'group']
for column in columns:
    if column == 'name' and column in df:
        df = df.assign(name=df.name.str.split(delimiter)).explode(column)
The problem with this code is that I have to use a separate if for each literal column name, i.e. 'name'. I want a generic way, like below:
if column in df:
    df = df.assign(column=df.column.str.split(delimiter)).explode(column)
But this is invalid. Is there a workaround?
Use [] instead of dot notation:
delimiter = '|'
column = 'group'
if column in df:
    df = df.assign(**{column:df[column].str.split(delimiter)}).explode(column)
print(df)
name text group
0 a|b a test m
0 a|b a test l
0 a|b a test n
Another idea, if you need to explode multiple columns:
# keep only the columns from the list that actually exist in df.columns
cols = df.columns.intersection(columns)
print(cols)
# assign the split columns back via a dict comprehension, then explode all columns in cols
# (explode needs a plain list of labels, not an Index)
df = df.assign(**{x: df[x].str.split(delimiter) for x in cols}).explode(list(cols))
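A small runnable sketch of the multi-column variant, with made-up values whose parts line up (since pandas 1.3, exploding several columns at once requires each row's lists to have matching lengths):
import pandas as pd

delimiter = '|'
columns = ['name', 'text', 'group']
df = pd.DataFrame({'name': ['a|b'], 'text': ['t1|t2'], 'group': ['m|l']})

cols = df.columns.intersection(columns)
df = df.assign(**{x: df[x].str.split(delimiter) for x in cols}).explode(list(cols))
print(df)
#   name text group
# 0    a   t1     m
# 0    b   t2     l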

Comparing values from different rows in groupby

I would like to print each inconsistency where a row's start differs from the previous row's end, grouped by the 'id' column. In the following data, the last row would be a case of inconsistency.
start,end,id
0,2,1
1,5,2
2,10,1
5,7,2
7,9,2
11,13,1
I have managed to do this using a for loop:
def check_consistency(df):
    grouped_df = df.groupby('id')
    for key, group in grouped_df:
        df = pd.DataFrame()
        df['start'] = group['start'].iloc[1:]
        df['end'] = group['end'].shift().iloc[1:]
        consistent = df['start'] == df['end']
        if not all(consistent):
            print(key)
            print(df[consistent == False])
Is there a way to achieve the same goal without using a for loop and creating an auxiliary DataFrame?
Edit: following is the expected output.
DataFrame:
df = pd.DataFrame({'start': [0,1,2,5,7,11], 'end': [2,5,10,7,9,13], 'id': [1,2,1,2,2,1]})
Expected output:
1
start end
5 11 10.0
First, sort by id. Then build a mask comparing each start with the previous row's end, grouped by id.
Within each group, the first entry of the mask defaults to True, since it has no previous row and should not be selected.
Finally, select the rows where the mask is False (start not equal to the previous row's end) using .loc with the negated boolean mask.
df1 = df.sort_values('id', kind='mergesort')  # merge sort is stable, so the original order is kept within each id
mask = (df1['start']
        .eq(df1['end'].shift())
        .groupby(df1['id'])
        .transform(lambda x: [True] + x.iloc[1:].tolist())
        )
df1.loc[~mask]
Output:
start end id
5 11 13 1
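A shorter alternative (a sketch, not part of the answer above): shift 'end' within each id group, so no pre-sorting is needed; the first row of each group gets NaN, which counts as consistent:
# previous 'end' within each id group; NaN for each group's first row
prev_end = df.groupby('id')['end'].shift()
print(df[prev_end.notna() & df['start'].ne(prev_end)])
#    start  end  id
# 5     11   13   1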

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 CSVs (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers, but they still don't give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work; it works for a single column but not for multiple columns.
Answer 2:
df = pd.concat([old, new])  # concat dataframes
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # indices of rows that appear only once
final = df.reindex(idx)
This takes specific columns as input but also outputs only those columns. I want the whole record, not just the specified columns.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis = 1)
    elif '_y' in column:
        new = new.rename(columns = {column: column[:column.find('_y')]})
Tell me if it works.
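A common alternative (a sketch, not taken from the answer above) is an anti-join with merge(..., indicator=True); the key columns and sample frames here are hypothetical:
import pandas as pd

A = pd.DataFrame({'column1': [1, 2], 'column2': ['x', 'y'], 'extra': [10, 20]})
B = pd.DataFrame({'column1': [1, 3], 'column2': ['x', 'z'], 'extra': [30, 40]})
columns = ['column1', 'column2']

# left-join B against A's keys; 'left_only' rows exist in B but not in A
out = B.merge(A[columns].drop_duplicates(), on=columns, how='left', indicator=True)
out = out[out['_merge'] == 'left_only'].drop(columns='_merge')
print(out)  # keeps all of B's columns, e.g. the (3, 'z', 40) row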

Pandas - How to insert a new column with the count when there are multiple clauses

I have the following excel sheet, which I've imported into pandas using read_csv
df
Order ID  Platform  Media Source  Campaign  1st order  Order fulfilled  Date
1         Web       Google        Cmp1      TRUE       TRUE             1/1/2019
2         Web       Facebook      FBCmp     FALSE      TRUE             2/1/2019
3         Web       Google        Cmp1      TRUE       FALSE            1/1/2019
4         Web       Facebook      FBCmp     TRUE       FALSE            1/1/2019
5         Mobile    Google        Cmp1      FALSE      TRUE             2/1/2019
6         Web       Google        Cmp2      TRUE       FALSE            1/1/2019
7         Mobile    Facebook      FBCmp     TRUE       TRUE             1/1/2019
8         Web       Google        Cmp2      FALSE      FALSE            2/1/2019
9         Mobile    Google        Cmp1      TRUE       TRUE             1/1/2019
10        Mobile    Google        Cmp1      TRUE       TRUE             1/1/2019
I want to add a new column NewOrderForDate which gives me a count of all orders for that campaign on that date where 1st order = TRUE.
Here's how the dataframe should look after adding this column
Order ID  Platform  Media Source  Campaign  1st order  Order fulfilled  Date      NewOrderForDate
1         Web       Google        Cmp1      FALSE      TRUE             1/1/2019  5
2         Web       Facebook      FBCmp     FALSE      TRUE             2/1/2019  2
3         Web       Google        Cmp1      TRUE       FALSE            1/1/2019  5
4         Web       Facebook      FBCmp     TRUE       FALSE            1/1/2019  5
5         Mobile    Google        Cmp1      TRUE       TRUE             2/1/2019  2
6         Web       Google        Cmp2      TRUE       FALSE            1/1/2019  5
7         Mobile    Facebook      FBCmp     TRUE       TRUE             1/1/2019  5
8         Web       Google        Cmp2      TRUE       FALSE            2/1/2019  2
9         Mobile    Google        Cmp1      TRUE       TRUE             1/1/2019  5
10        Mobile    Google        Cmp1      FALSE      TRUE             1/1/2019  5
If I had to do this in Excel, I'd probably use
=COUNTIFS(G$2:G$11,G2,E$2:E$11,"TRUE")
Basically, I want to group by campaign and date, count the orders where 1st order = TRUE, and write these counts to a new column.
Group by 'Campaign', count the '1st order' values, and add the 'NewOrderForDate' column for each group:
def udf(grp_df):
    grp_df['NewOrderForDate'] = len(grp_df[grp_df['1st order']==True])
    return grp_df

result = df.groupby('Campaign', as_index=False, group_keys=False).apply(udf)
Use transform to keep the index shape, and sum the boolean values of '1st order':
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform(lambda x: x.sum())
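A minimal runnable sketch with made-up rows (booleans sum as 1/0, so transform('sum') works just as well as the lambda):
import pandas as pd

df = pd.DataFrame({
    'Campaign': ['Cmp1', 'Cmp1', 'Cmp1', 'FBCmp'],
    '1st order': [True, True, False, True],
    'Date': ['1/1/2019', '1/1/2019', '1/1/2019', '2/1/2019'],
})
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform('sum')
print(df)
# Cmp1 on 1/1/2019 has two TRUE rows, so all three of its rows get 2;
# FBCmp on 2/1/2019 gets 1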

Sorting a grouped dataframe

I have a dataframe with columns ['name', 'sex', 'births', 'year']. I then group the dataframe on the basis of name to create 2 new columns "max" and "total".
trendy_names['max'] = trendy_names.groupby(['name'], as_index = False)['births'].transform('max')
trendy_names['total'] = trendy_names.groupby(['name'], as_index = False)['births'].transform('sum')
Using these 2 columns, I create a calculated column "trendiness".
trendy_names['trendiness'] = trendy_names['max']/trendy_names['total']
Then, I segregate those that have a total number of births greater than 1000.
trendy_names = trendy_names[trendy_names.total >= 1000]
Now, I want to sort the dataframe on the basis of "trendiness" column. Any thoughts?
To sort the dataframe by the "trendiness" column:
1. trendy_names = trendy_names.reset_index(drop=True)
reset_index(drop=True) restores a regular RangeIndex after the filtering step above.
2. trendy_names = trendy_names.sort_values(by = 'trendiness')
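The two steps can be chained; a minimal sketch, assuming trendy_names is the filtered frame built above (pass ascending=False if the trendiest names should come first):
trendy_names = (trendy_names
                .reset_index(drop=True)
                .sort_values(by='trendiness', ascending=False))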
