I have a df that is in chronological order (oldest to newest) of UFC fights. Each row contains both fighters. How would I create two new columns:
col_a = a running count of the fights R_fighter has appeared in,
col_b = a running count of the fights B_fighter has appeared in
So, as an example, in row 1000 of the df I'd like a count of the number of times R_fighter occurs in rows 0 through 999.
I'm struggling to wrap my head around this without using a for loop of some kind.
Let's assume your dataframe is called df and is indexed 0 to n. Then you can use apply and value_counts to add the cumcount columns as follows.
def myct(row, col):
    return df[col][:row.name + 1].value_counts()[row[col]]

df['col_a'] = df.apply(lambda row: myct(row, 'R_fighter'), axis=1)
df['col_b'] = df.apply(lambda row: myct(row, 'B_fighter'), axis=1)
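If the row-wise apply is too slow on a large frame, a vectorized sketch of the same idea uses groupby().cumcount(), which counts earlier occurrences within each column (add 1 if the current fight should be included, as myct does):
df['col_a'] = df.groupby('R_fighter').cumcount()  # prior fights of this R_fighter
df['col_b'] = df.groupby('B_fighter').cumcount()  # prior fights of this B_fighter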
You can use .value_counts():
df['R_fighter'].value_counts()
Or .groupby() with .size():
df.groupby('R_fighter').size()
Note: .value_counts() sorts the resulting Series by count in descending order, while .groupby().size() returns it sorted by the group keys instead.
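A small sketch with made-up names to show the difference:
import pandas as pd

df = pd.DataFrame({'R_fighter': ['Silva', 'Jones', 'Silva', 'Silva', 'Jones', 'McGregor']})
print(df['R_fighter'].value_counts())   # sorted by count, descending
print(df.groupby('R_fighter').size())   # sorted alphabetically by fighter name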
Please consider a pandas dataframe final_df with 142457 rows, correctly indexed:
0, 1, 2, 3, 4, ..., 142452, 142453, 142454, 142455, 142456
I create / sample a new df data_test_for_all_models from this one:
data_test_for_all_models = final_df.copy().sample(frac=0.1, random_state=786)
A few indexes:
2235
118727
23291
Now I drop the rows from final_df whose indexes appear in data_test_for_all_models:
final_df = final_df.drop(data_test_for_all_models.index)
If I check one of the dropped indexes against final_df:
final_df.iloc[2235]
it wrongly returns a row.
I think it's a problem of reset indexes, but which function resets them: drop() or sample()?
Thanks.
You are using .iloc, which does integer-position based indexing: you are getting row number 2235 by position, not the row whose index label is 2235.
For that, you should use .loc:
final_df.loc[2235]
And you should get a KeyError.
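A small sketch with made-up data to show the difference:
import pandas as pd

df = pd.DataFrame({'x': range(5)}).drop(index=[2, 3])
# remaining index labels: 0, 1, 4
print(df.iloc[2])   # third row by position -> the row labeled 4
print(df.loc[4])    # the row whose index label is 4
df.loc[2]           # raises KeyError: label 2 was dropped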
I have a pandas dataframe containing tuples of booleans (real value, predicted value) and want to create new columns containing the amount of true/false positives/negatives.
I know I could loop through the indices and set the column value for each index after looping through the entire row, but I believe that's a pandas anti-pattern.
Is there a cleaner and more efficient way to do this?
Another alternative would be to check every cell of the dataframe for (True, False) values and sum the matches along the columns axis (sum(axis=1)). Note that the comparison has to run cell by cell with applymap; df.apply would compare whole columns to the tuple and fail:
df['false_positives'] = df.applymap(lambda v: v == (True, False)).sum(axis=1)
This seems to work fine:
def count_false_positives(row):
    count = 0
    for el in df.columns:
        if row[el][0] and not row[el][1]:
            count += 1
    return count

df['false_positives'] = df.apply(count_false_positives, axis=1)
Note that the result has to be assigned with df['false_positives'] = ...; attribute-style assignment (df.false_positives = ...) does not create a new column.
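For all four counts at once, a vectorized sketch on made-up data, keeping the convention from the snippets above where each cell holds a (real, predicted) tuple and (True, False) is counted as a false positive:
import pandas as pd

df = pd.DataFrame({
    'model_a': [(True, False), (True, True), (False, False)],
    'model_b': [(False, True), (True, True), (True, False)],
})

pairs = {
    'true_positives':  (True, True),
    'false_positives': (True, False),
    'true_negatives':  (False, False),
    'false_negatives': (False, True),
}
for name, pair in pairs.items():
    # compare every cell to the pair, then count matches per row
    df[name] = df[['model_a', 'model_b']].applymap(lambda v: v == pair).sum(axis=1)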
Objective: starting from a dataframe with 5 columns, return a dataframe with 3 columns, one of which is the count, sorted from largest count to smallest.
What I have tried:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count'])
df = df.sort_values(by='NumInstances', ascending=False)
print(df)
Error:
ValueError: The column label 'NumInstances' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
Before this gets marked as a duplicate: I have gone through all the suggested duplicates, and they all seem to suggest the same code as I have above.
Is there something small that I am doing incorrectly?
Thanks!
I guess you need to remove the multi-index.
Try this:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count']).reset_index()
or:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year'], as_index=False).agg(['count'])
Found the issue. Adding an agg to the NumInstances column made the NumInstances column name a tuple, ('NumInstances', 'count'), so I just updated the sort code to:
df = df.sort_values(by=('NumInstances', 'count'), ascending=False)
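Another way to sidestep the tuple column entirely is named aggregation (pandas >= 0.25), which yields a flat column name; a sketch with made-up data:
import pandas as pd

df = pd.DataFrame({
    'Country': ['US', 'US', 'FR', 'FR', 'US'],
    'Year': [2019, 2019, 2019, 2020, 2020],
    'NumInstances': [3, 5, 2, 7, 1],
})

out = (df.groupby(['Country', 'Year'], as_index=False)
         .agg(count=('NumInstances', 'count'))
         .sort_values('count', ascending=False))
print(out)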
I have the following Excel sheet, which I've imported into pandas using read_csv
df
| Order ID | Platform | Media Source | Campaign | 1st order | Order fulfilled | Date |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Web | Google | Cmp1 | TRUE | TRUE | 1/1/2019 |
| 2 | Web | Facebook | FBCmp | FALSE | TRUE | 2/1/2019 |
| 3 | Web | Google | Cmp1 | TRUE | FALSE | 1/1/2019 |
| 4 | Web | Facebook | FBCmp | TRUE | FALSE | 1/1/2019 |
| 5 | Mobile | Google | Cmp1 | FALSE | TRUE | 2/1/2019 |
| 6 | Web | Google | Cmp2 | TRUE | FALSE | 1/1/2019 |
| 7 | Mobile | Facebook | FBCmp | TRUE | TRUE | 1/1/2019 |
| 8 | Web | Google | Cmp2 | FALSE | FALSE | 2/1/2019 |
| 9 | Mobile | Google | Cmp1 | TRUE | TRUE | 1/1/2019 |
| 10 | Mobile | Google | Cmp1 | TRUE | TRUE | 1/1/2019 |
I want to add a new column NewOrderForDate which gives me a count of all the orders for that campaign for that date AND 1st Order = TRUE
Here's how the dataframe should look after adding this column
| Order ID | Platform | Media Source | Campaign | 1st order | Order fulfilled | Date | NewOrderForDate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Web | Google | Cmp1 | FALSE | TRUE | 1/1/2019 | 5 |
| 2 | Web | Facebook | FBCmp | FALSE | TRUE | 2/1/2019 | 2 |
| 3 | Web | Google | Cmp1 | TRUE | FALSE | 1/1/2019 | 5 |
| 4 | Web | Facebook | FBCmp | TRUE | FALSE | 1/1/2019 | 5 |
| 5 | Mobile | Google | Cmp1 | TRUE | TRUE | 2/1/2019 | 2 |
| 6 | Web | Google | Cmp2 | TRUE | FALSE | 1/1/2019 | 5 |
| 7 | Mobile | Facebook | FBCmp | TRUE | TRUE | 1/1/2019 | 5 |
| 8 | Web | Google | Cmp2 | TRUE | FALSE | 2/1/2019 | 2 |
| 9 | Mobile | Google | Cmp1 | TRUE | TRUE | 1/1/2019 | 5 |
| 10 | Mobile | Google | Cmp1 | FALSE | TRUE | 1/1/2019 | 5 |
If I had to do this in Excel, I'd probably use
=COUNTIFS(G$2:G$11,G2,E$2:E$11,"TRUE")
Basically, I want to group by campaign and date, get a count of all the orders where 1st order = TRUE, and write these values to a new column.
GroupBy 'Campaign', count the rows where '1st order' is True, and add the 'NewOrderForDate' column to each group:
def udf(grp_df):
    grp_df['NewOrderForDate'] = len(grp_df[grp_df['1st order'] == True])
    return grp_df

result = df.groupby('Campaign', as_index=False, group_keys=False).apply(udf)
Use transform to keep the index shape, and sum the boolean values of '1st order':
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform('sum')
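Put together on a few rows of the sample data, a minimal sketch (assuming '1st order' was parsed as booleans on import):
import pandas as pd

df = pd.DataFrame({
    'Campaign': ['Cmp1', 'FBCmp', 'Cmp1', 'Cmp1'],
    '1st order': [True, False, True, True],
    'Date': ['1/1/2019', '2/1/2019', '1/1/2019', '2/1/2019'],
})

# per (Date, Campaign) group, count True values and broadcast back to every row
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform('sum')
print(df)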
I'm trying to manipulate a dataframe using a cumsum function.
My data looks like this:
To perform my cumsum, I use
df = pd.read_excel(excel_sheet, sheet_name='Sheet1').drop(columns=['Material']) # Dropping material column
I run the rest of my code, and get my expected outcome of a dataframe cumsum without the material listed:
df2 = df.to_numpy()  # .as_matrix() was removed in pandas 1.0; .to_numpy() returns the array
new = df2.cumsum(axis=1)
print(new)
However, at the end I need to put the Material column back, and I'm unsure how to add it back at the beginning of the dataframe.
IIUC, then you can just set the material column to the index, then do your cumsum, and put it back in at the end:
df2 = df.set_index('Material').cumsum(axis=1).reset_index()
An alternative would be to do your cumsum on all but the first column:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
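A minimal sketch of the set_index approach on made-up data:
import pandas as pd

df = pd.DataFrame({
    'Material': ['A', 'B'],
    'Jan': [1, 4],
    'Feb': [2, 5],
    'Mar': [3, 6],
})

# cumsum across the month columns, with Material kept safely in the index
df2 = df.set_index('Material').cumsum(axis=1).reset_index()
print(df2)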