How to display grouped variable count in each row - python

I want to show the count of a given rating for a given month in each rating's row for the data below. This would mean that on rows 0 and 3 there would be a 2 since there are two 10 ratings given in month 1.
test = {'Rating': [10,9,8,10,8,6,4,3,0,7,2,5], 'Month': [1,2,3,1,3,2,1,2,3,1,2,3]}
test_df = pd.DataFrame(data=test)
I have tried the following but it didn't help much:
test_df['Rating_totals'] = test_df.groupby(['Month'])['Rating'].count()
Is there a way to do this?

Use value_counts():
test_df.value_counts()
To sort the results by month and rating:
total_rankings = test_df[['Month', 'Rating']].value_counts().sort_index()
Use pandas apply to add the total count for each row as a new column to test_df:
test_df['total_rankings'] = test_df.apply(lambda row: total_rankings.loc[row['Month'], row['Rating']], axis=1)
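As an alternative to apply, groupby with transform('count') produces the same per-row totals in one step; a minimal sketch using the test_df defined above:
import pandas as pd

test = {'Rating': [10, 9, 8, 10, 8, 6, 4, 3, 0, 7, 2, 5],
        'Month': [1, 2, 3, 1, 3, 2, 1, 2, 3, 1, 2, 3]}
test_df = pd.DataFrame(data=test)

# Count the rows sharing each (Month, Rating) pair and broadcast that count back onto every row
test_df['Rating_totals'] = test_df.groupby(['Month', 'Rating'])['Rating'].transform('count')

print(test_df.head(4))  # rows 0 and 3 both show 2, since rating 10 appears twice in month 1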

Extract the first number from a string number range

I have a dataset with a price column of type string, and some of the values are in the form of a range (15000-20000).
I want to extract the first number and convert the entire column to integers.
I tried this:
df['price'].apply(lambda x: x.split('-')[0])
The code just returns the original column unchanged.
Try one of the following options:
Data
import pandas as pd
data = {'price': ['0','100-200','200-300']}
df = pd.DataFrame(data)
print(df)
     price
0        0   # adding a str without `-`, to show that this one will be included too
1  100-200
2  200-300
Option 1
Use Series.str.split with expand=True and select the first column from the result.
Next, chain Series.astype, and assign the result to df['price'] to overwrite the original values.
df['price'] = df.price.str.split('-', expand=True)[0].astype(int)
print(df)
   price
0      0
1    100
2    200
Option 2
Use Series.str.extract with a regex pattern, r'(\d+)-?':
\d matches a digit.
+ matches the digit 1 or more times.
the match stops when we hit -, and the trailing ? makes the hyphen optional ("if present at all").
data = {'price': ['0','100-200','200-300']}
df = pd.DataFrame(data)
df['price'] = df.price.str.extract(r'(\d+)-?').astype(int)
# same result
Here is one way to do this:
df['price'] = df['price'].str.split('-', expand=True)[0].astype('int')
This stores only the first number from each range. For example, from 15000-20000 only 15000 will be stored in the price column.
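If the column may also contain missing or otherwise non-numeric entries, a hedged variant with pd.to_numeric(..., errors='coerce') avoids the integer cast failing; a sketch assuming the same price layout as above:
import pandas as pd

df = pd.DataFrame({'price': ['0', '100-200', '200-300', None]})

# Keep the text before the first '-', then coerce anything unparseable (including missing values) to NaN
df['price'] = pd.to_numeric(df['price'].str.split('-').str[0], errors='coerce')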

Pandas - using group by and including value counts which are larger than n

I have a table which includes salary and company_location.
I was trying to calculate the mean salary per country, and it works:
wage = df.groupby('company_location').mean()['salary']
However, many company_location values have fewer than 5 entries, and I would like to exclude them from the report.
I know how to find the 5 countries with the most entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having trouble connecting those two variables and making a graph out of the result...
Thank you.
You can remove rows whose value occurrence is below a threshold:
df = df[df.groupby('company_location')['company_location'].transform('size') > 5]
You can do the following to only apply the groupby and aggregation to those with more than 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) > 5)
wage = df[mask].groupby('company_location')['salary'].mean()
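If you also want the graph mentioned in the question, here is a sketch building on the same idea; the frame below is made up and only stands in for a table with company_location and salary columns:
import pandas as pd

# Hypothetical data: one location with more than 5 rows, one with fewer
df = pd.DataFrame({
    'company_location': ['US'] * 6 + ['DE'] * 3,
    'salary': [100, 110, 120, 130, 140, 150, 80, 90, 100],
})

# Drop locations with 5 or fewer rows, then take the mean salary per location and plot it
wage = (df.groupby('company_location')
          .filter(lambda g: len(g) > 5)
          .groupby('company_location')['salary']
          .mean())
wage.plot.bar()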

How to convert rows into columns (as value but not header) in Python

In the following dataset, I need to convert each "description" row under the "name" column (for example, inventory1, inventory2 and inventory3) into two separate columns (namely description1 and description2, respectively). If I use either pivot_table or groupby, the description values become headers instead of values under a column. What would be the way to generate the desired output? Thanks
import pandas as pd
df1 = {'item': ['item1','item2','item3','item4','item5','item6'],
       'name': ['inventory1','inventory1','inventory2','inventory2','inventory3','inventory3'],
       'code': [1,1,2,2,3,3],
       'description': ['sales number decrease compared to last month', 'Sales number decreased',
                       'sales number increased', 'Sales number increased, need to keep kpi',
                       'no sales this month', 'item out of stock']}
df1 = pd.DataFrame(df1)
desired output as below:
You can actually use pd.concat:
new_df = pd.concat([
    (
        df1.drop_duplicates('name')
           .drop('description', axis=1)
           .reset_index(drop=True)
    ),
    (
        pd.DataFrame([pd.Series(l) for l in df1.groupby('name')['description'].agg(list).tolist()])
          .add_prefix('description')
    ),
], axis=1)
Output:
>>> new_df
item name code description0 description1
0 item1 inventory1 1 sales number decrease compared to last month Sales number decreased
1 item3 inventory2 2 sales number increased Sales number increased, need to keep kpi
2 item5 inventory3 3 no sales this month item out of stock
One-liner version of the above, in case you want it:
pd.concat([df1.drop_duplicates('name').drop('description', axis=1).reset_index(drop=True), pd.DataFrame([pd.Series(l) for l in df1.groupby('name')['description'].agg(list).tolist()]).add_prefix('description')], axis=1)
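A slightly different sketch reaches the same wide shape with cumcount plus pivot (using the df1 built in the question); merge the result back onto the first row per name if you also need item and code:
# Number the descriptions within each name (0, 1, ...), then pivot them into columns
wide = (df1.assign(n=df1.groupby('name').cumcount())
           .pivot(index='name', columns='n', values='description')
           .add_prefix('description')
           .reset_index())
print(wide)  # columns: name, description0, description1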

How to find customers who made 2nd purchase within 30 days?

I need your quick help. I want to find a list of customer_id's and first purchase_date for customers who have made their second purchase within 30 days of their first purchase.
i.e. customer_id's 1, 2, 3 have made their 2nd purchase within 30 days.
I need customer_id's 1, 2, 3 and their respective first purchase_date.
I have more than 100k customer_id's.
How can I achieve this in pandas?
You can do it with groupby (this assumes the purchase dates are already in order and every customer has at least two purchases, since it compares the first two rows per customer):
s = df.groupby('Customer_id')['purchase_date'].apply(lambda x: (x.iloc[1] - x.iloc[0]).days < 30)
out = df.loc[df.Customer_id.isin(s.index[s])].drop_duplicates('Customer_id')
Here is a way:
df2 = (df.loc[df['purchase_date']
              .lt(df['Customer_id']
                  .map((df.sort_values('purchase_date').groupby('Customer_id').first() + pd.to_timedelta(30,'d'))
                       .squeeze()))])
df2 = (df2.loc[df2.duplicated('Customer_id', keep=False)]
          .groupby('Customer_id').first())
You can set a boolean mask to filter the groups of customers who have made their second purchase within 30 days, as follows:
# Pre-processing to sort the data and convert date to the required date format
df = df.sort_values(['Customer_id', 'purchase_date'])
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
# Set boolean mask
mask = (((df['purchase_date'] - df['purchase_date'].groupby(df['Customer_id']).shift()).dt.days <= 30)
        .groupby(df['Customer_id'])
        .transform('any')
        )
Then we can filter the transaction records of customers with a second purchase within 30 days using the following code:
df[mask]
To further show the customer_id's and their respective first purchase_date, you can use:
df[mask].groupby('Customer_id', as_index=False).first()
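For completeness, here is a small made-up frame run through the mask approach above; the column names match the question, but the dates are invented:
import pandas as pd

# Hypothetical purchases: customers 1-3 buy again within 30 days, customer 4 does not
df = pd.DataFrame({
    'Customer_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'purchase_date': ['2021-01-01', '2021-01-10', '2021-02-01', '2021-02-20',
                      '2021-03-01', '2021-03-15', '2021-04-01', '2021-06-01'],
})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df = df.sort_values(['Customer_id', 'purchase_date'])

# Gap in days between consecutive purchases per customer, then flag customers with any gap <= 30
gap = df.groupby('Customer_id')['purchase_date'].diff().dt.days
mask = (gap <= 30).groupby(df['Customer_id']).transform('any')

print(df[mask].groupby('Customer_id', as_index=False).first())  # customers 1, 2, 3 with their first dates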

Pandas how to filter previous rows based on later row

I've got a dataframe like this
Day,Minute,Second,Value
1,1,0,1
1,2,1,2
1,3,1,2
1,2,6,0
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Sometimes the sensor records an incorrect value and the reading is added again with the correct value. For example, here we should delete the second and third rows, since they are overridden by row four, which comes from a timestamp before them. How do I filter out unnecessary 'bad' rows like those? For the example, the expected output should be:
Day,Minute,Second,Value
1,1,0,1
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Here's the pseudocode for an iterative solution:
for row in dataframe:
    for previous_row in rows in dataframe before row:
        if previous_row > row:
            delete previous_row
I think there should be a vectorized solution, especially for the second loop. I also don't want to modify what I'm iterating over but I'm not sure there is another option other than duplicating the dataframe.
Here is some starter code to work with the example dataframe
import pandas as pd
data = [{'Day':1, 'Minute':1, 'Second':0, 'Value':1},
        {'Day':1, 'Minute':2, 'Second':1, 'Value':2},
        {'Day':1, 'Minute':2, 'Second':6, 'Value':2},
        {'Day':1, 'Minute':3, 'Second':1, 'Value':0},
        {'Day':1, 'Minute':2, 'Second':1, 'Value':1},
        {'Day':1, 'Minute':2, 'Second':5, 'Value':1},
        {'Day':2, 'Minute':0, 'Second':1, 'Value':1},
        {'Day':2, 'Minute':0, 'Second':5, 'Value':2}]
df = pd.DataFrame(data)
If you have multiple rows for the same combination of Day, Minute, Second but a different Value, I am assuming you want to retain the last recorded value and discard all the previous ones considering they are "bad".
You can do this simply by using drop_duplicates:
df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
UPDATE v2:
If you need to retain the last group of ['Minute', 'Second'] combinations for each day, identify monotonically increasing Minute groups (since it's the bigger time unit of the two) and select the group with the max value of Group_Id for each ['Day']:
res = pd.DataFrame()
for _, g in df.groupby(['Day']):
    g['Group_Id'] = (g.Minute.diff() < 0).cumsum()
    res = pd.concat([res, g[g['Group_Id'] == max(g['Group_Id'].values)]])
OUTPUT:
Day Minute Second Value Group_Id
1 2 1 1 1
1 2 5 1 1
2 0 1 1 0
2 0 5 2 0
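And here is a sketch of a fully vectorized variant of the same idea, without the per-day loop (using the df from the starter code above):
# A new group starts whenever Minute drops within a Day; keep only the last group of each Day
grp = (df.groupby('Day')['Minute'].diff() < 0).groupby(df['Day']).cumsum()
res = df[grp == grp.groupby(df['Day']).transform('max')]
print(res)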
