Python pandas wide data - finding earliest value across time series fields - python

I am working with a data frame that is written in wide format. Each book has a number of sales, but some quarters have null values because the book was not released before that quarter.
import pandas as pd
data = {'Book Title': ['A Court of Thorns and Roses', 'Where the Crawdads Sing', 'Bad Blood', 'Atomic Habits'],
'Metric': ['Book Sales','Book Sales','Book Sales','Book Sales'],
'Q1 2022': [100000,0,0,0],
'Q2 2022': [50000,75000,0,35000],
'Q3 2022': [25000,150000,20000,45000],
'Q4 2022': [25000,20000,10000,65000]}
df1 = pd.DataFrame(data)
What I want to do is find the first two available quarters for each title, and create a new dataframe with those values in two columns (e.g. 'First Quarter' and 'Second Quarter').
I've had a hard time finding what function would do this - can you point me in the right direction? Thank you!

Some version of melt or stack would probably be the easiest way to go.
import pandas as pd
import numpy as np

data = {'Book Title': ['A Court of Thorns and Roses', 'Where the Crawdads Sing', 'Bad Blood', 'Atomic Habits'],
        'Metric': ['Book Sales', 'Book Sales', 'Book Sales', 'Book Sales'],
        'Q1 2022': [100000, np.nan, np.nan, np.nan],
        'Q2 2022': [50000, 75000, np.nan, 35000],
        'Q3 2022': [25000, 150000, 20000, 45000],
        'Q4 2022': [25000, 20000, 10000, 65000]}
df1 = pd.DataFrame(data)

key_cols = ['Book Title', 'Metric']
new_cols = ['First Quarter', 'Second Quarter']

# Stack the quarter columns (dropping NaNs), take the first two values per
# title, then reshape back into two new columns.
df1[new_cols] = (df1.set_index(key_cols)
                    .stack()
                    .groupby(level=0)
                    .head(2)
                    .values
                    .astype(int)
                    .reshape(len(df1), -1)
                 )
df1 = df1[key_cols+new_cols]
Output:
                    Book Title      Metric  First Quarter  Second Quarter
0  A Court of Thorns and Roses  Book Sales         100000           50000
1      Where the Crawdads Sing  Book Sales          75000          150000
2                    Bad Blood  Book Sales          20000           10000
3                Atomic Habits  Book Sales          35000           45000
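For completeness, here is what the melt route mentioned above could look like. This is only a sketch, not the original answer's code; wide, long_df, and first_two are illustrative names, and it assumes the wide frame built from the data dict above.
# Sketch of the melt-based equivalent (illustrative names)
wide = pd.DataFrame(data)
long_df = (wide.melt(id_vars=['Book Title', 'Metric'],
                     var_name='Quarter', value_name='Sales')
               .dropna(subset=['Sales']))
# melt keeps the Q1..Q4 column order within each title, so head(2) picks the
# first two quarters that actually have sales
first_two = long_df.groupby(['Book Title', 'Metric']).head(2)
From there you could pivot first_two back into two columns per title, as in the stack version above.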


Python Pandas- I want to get only columns that meet criteria in past x days- rolling dates

So I have two separate data frames. The first has a list of dates and what international food day it is. Most days have 1 food item, but some can have many.
Date        Food
1/1/2022    Cream Puff
1/12/2022   Chicken
3/6/2022    Frozen Food
3/6/2022    Oreo
My second dataframe includes a list of people, foods, and the dates they ate the food
Date         Person   Food
12/29/2021   Jack     Cream Puff
12/30/2021   Pete     Cream Puff
1/12/2022    Jill     Jello
2/6/2022     Jill     Oreo
2/3/2022     Sara     Oreo
3/6/2022     Joel     Chicken
My goal is, for each international food day, to pull back everyone who ate that food on the day itself OR within the 5 days prior. I recognize that I could loop over the food day dataframe, find anyone in each date range, and store them in a temporary data frame. However, the two data sets are almost 1 GB together, so the looping is taking forever. Plus, looping over data frames is not ideal. Does anyone have any guidance for how to do this without a loop?
You will have to evaluate the performance on the 1GB dataset you are referring to, but here is a pandas approach that avoids looping outside of the dataframe.
import pandas as pd

food = pd.DataFrame({'Date': ['1/1/2022', '1/12/2022', '3/6/2022', '3/6/2022'],
                     'Food': ['Cream Puff', 'Chicken', 'Frozen Food', 'Oreo']})

person_food = pd.DataFrame({'Date': ['12/29/2021', '12/30/2021', '1/12/2022',
                                     '2/6/2022', '2/3/2022', '3/6/2022'],
                            'Food': ['Cream Puff', 'Cream Puff', 'Jello',
                                     'Oreo', 'Oreo', 'Chicken'],
                            'Person': ['Jack', 'Pete', 'Jill', 'Jill', 'Sara', 'Joel']})

## Create a column for five days prior in your food df
food['end_date'] = pd.to_datetime(food['Date'])
food['start_date'] = food['end_date'] - pd.Timedelta(days=5)
person_food['ate_date'] = pd.to_datetime(person_food['Date'])

## Join the dfs, then keep rows where the date the person ate the food
## falls between the desired dates
df = pd.merge(food, person_food, how='left', on='Food')
df = df.where((df['ate_date'] <= df['end_date']) & (df['ate_date'] >= df['start_date']))

## return the names of the people where Person is not null
df[df['Person'].notna()]['Person']
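A small variation on the same idea, not part of the answer above: since rows outside the window get dropped at the end anyway, an inner join plus a boolean mask does the same thing in fewer steps.
## Sketch of the same logic with an inner join and a boolean mask
merged = pd.merge(food, person_food, how='inner', on='Food')
in_window = (merged['ate_date'] >= merged['start_date']) & (merged['ate_date'] <= merged['end_date'])
merged.loc[in_window, 'Person']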
You could try pd.merge_asof, which lets you merge on a datetime column with a tolerance parameter.
import pandas as pd

df1 = pd.DataFrame({'Date': {0: pd.Timestamp('2022-01-01'),
                             1: pd.Timestamp('2022-01-12'),
                             2: pd.Timestamp('2022-03-06'),
                             3: pd.Timestamp('2022-03-06')},
                    'Food': {0: 'Cream Puff', 1: 'Chicken', 2: 'Frozen Food', 3: 'Oreo'}})
df2 = pd.DataFrame({'Date': {0: pd.Timestamp('2021-12-29'),
                             1: pd.Timestamp('2021-12-30'),
                             2: pd.Timestamp('2022-01-12'),
                             3: pd.Timestamp('2022-02-06'),
                             4: pd.Timestamp('2022-02-03'),
                             5: pd.Timestamp('2022-03-06')},
                    'Person': {0: 'Jack', 1: 'Pete', 2: 'Jill', 3: 'Jill', 4: 'Sara', 5: 'Joel'},
                    'Food': {0: 'Cream Puff', 1: 'Cream Puff', 2: 'Jello',
                             3: 'Oreo', 4: 'Oreo', 5: 'Chicken'}})
# Convert the 'Date' columns to datetime
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# Use pd.merge_asof
out = pd.merge_asof(df2.sort_values(by='Date'), df1.sort_values(by='Date'),
on='Date', by='Food',
tolerance=pd.Timedelta(days=5), direction='backward')
>>> out
Date Person Food
0 2021-12-29 Jack Cream Puff
1 2021-12-30 Pete Cream Puff
2 2022-01-12 Jill Jello
3 2022-02-03 Sara Oreo
4 2022-02-06 Jill Oreo
5 2022-03-06 Joel Chicken
The dataframe returns all the names, dates, and foods eaten within 5 days of the food day.
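One caveat, not from the answer above: with df2 (the people) on the left, direction='backward' matches a food day at or before the eating date, whereas the question wants people who ate on the food day or in the 5 days before it, i.e. a food day at or after the eating date. A sketch that flips the direction and carries a marker column (is_food_day is just an illustrative name) so the non-matching rows can be dropped:
df1['is_food_day'] = True   # marker so unmatched rows show up as NaN after the merge
out = pd.merge_asof(df2.sort_values('Date'), df1.sort_values('Date'),
                    on='Date', by='Food',
                    tolerance=pd.Timedelta(days=5), direction='forward')
matched = out[out['is_food_day'].notna()][['Date', 'Person', 'Food']]
# With the sample data this keeps Jack and Pete, who ate Cream Puff within
# 5 days before the Cream Puff day on 2022-01-01.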

How to get rows which only have a certain value after group by, in Pandas

For example, after grouping two columns the result is:

Car 1  Nissan
    Purchased
    Sold
Car 2  Nissan
    Sold
Car 3  Nissan
    Purchased
    Sold
    Rented
I would like to retrieve the Cars which have ONLY been sold, so just return Car 2.
However, everything I have tried returns Car 1 and Car 3 as well, since they have also been sold.
Considering the given dataframe:
import pandas as pd
df = pd.DataFrame({"Car": ['Car 1', 'Car 2', 'Car 3', 'Car 2', 'Car 2'],
                   "Model": ['Nissan', 'Nissan', 'Nissan', 'Nissan', 'Nissan'],
                   "Status": ['Purchased', 'Sold', 'Sold', 'Sold', 'Rented']})
>>> df
     Car   Model     Status
0  Car 1  Nissan  Purchased
1  Car 2  Nissan       Sold
2  Car 3  Nissan       Sold
3  Car 2  Nissan       Sold
4  Car 2  Nissan     Rented
You can get the cars that have been sold (and not purchased or rented) by running this code:
df['Occur'] = df[df['Status'] == 'Sold'].groupby('Car')['Model'].transform('size')
filtered_df = df.loc[df['Occur'] == 1][['Car', 'Model', 'Status']]

>>> filtered_df
     Car   Model Status
2  Car 3  Nissan   Sold
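A different way to express "only sold", not from the answer above, is to check that every status for a car is 'Sold' rather than counting the 'Sold' rows; a minimal sketch with the same df:
# Sketch: keep cars whose statuses are all 'Sold'
only_sold = df.groupby('Car')['Status'].agg(lambda s: s.eq('Sold').all())
only_sold[only_sold].index.tolist()
# With the sample df this gives ['Car 3'], the only car with no status
# other than 'Sold'.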

How can we do a find/replace on items in a dataframe, based on items in another dataframe?

I have this list which I convert into a dataframe.
labels = ['Airport',
'Amusement',
'Bridge',
'Campus',
'Casino',
'Commercial',
'Concert',
'Convention',
'Education',
'Entertainment',
'Government',
'Hospital',
'Hotel',
'Library',
'Mall',
'Manufacturing',
'Museum',
'Residential',
'Retail',
'School',
'University',
'Theater',
'Tunnel',
'Warehouse']
labels = pd.DataFrame(labels, columns=['lookup'])
labels
I have this dataframe.
df = pd.DataFrame({'Year':[2020, 2020, 2019, 2019, 2019],
'Name':['Dulles_Airport', 'Syracuse_University', 'Reagan_Library', 'AMC Theater', 'Reagan High School']})
How can I clean the items in the df, based on matches in labels? My 'labels' is totally clean and my 'df' is very messy. I would like to see the df like this.
df = pd.DataFrame({'Year':[2020, 2020, 2019, 2019, 2019],
'Name':['Airport', 'University', 'Library', 'Theater', 'School']})
df
You can use df['Name'].str.extract and NaN replacement:
labels = ['Airport', 'Amusement', 'Bridge', 'Campus', 'Casino', 'Commercial', 'Concert', 'Convention',
'Education', 'Entertainment', 'Government', 'Hospital', 'Hotel', 'Library', 'Mall', 'Manufacturing',
'Museum', 'Residential', 'Retail', 'School', 'University', 'Theater', 'Tunnel', 'Warehouse']
import pandas as pd
df = pd.DataFrame({
'Year': [2020, 2020, 2019, 2019, 2019, 1954],
'Name': ['Dulles_Airport', 'Syracuse_University', 'Reagan_Library', 'AMC Theater', 'Reagan High School', 'Shake, Rattle and Roll']
})
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
The resulting DataFrame will be
Year Name Match
0 2020 Dulles_Airport Airport
1 2020 Syracuse_University University
2 2019 Reagan_Library Library
3 2019 AMC Theater Theater
4 2019 Reagan High School School
5 1954 Shake, Rattle and Roll NaN
If you want to keep the non-matching cells, do this:
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
df.loc[df['Match'].isnull(), 'Match'] = df['Name'][df['Match'].isnull()]
The resulting DataFrame will be
Year Name Match
0 2020 Dulles_Airport Airport
1 2020 Syracuse_University University
2 2019 Reagan_Library Library
3 2019 AMC Theater Theater
4 2019 Reagan High School School
5 1954 Shake, Rattle and Roll Shake, Rattle and Roll
If you want to remove the non-matching cells, do this:
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
df = df.dropna()
The resulting DataFrame will be
Year Name Match
0 2020 Dulles_Airport Airport
1 2020 Syracuse_University University
2 2019 Reagan_Library Library
3 2019 AMC Theater Theater
4 2019 Reagan High School School
Not the purest pandas answer, but you could write a function that checks the string against your labels list and apply it to the Name column, i.e.
def clean_labels(name):
    labels = ['Airport', 'Amusement', 'Bridge', 'Campus',
              'Casino', 'Commercial', 'Concert', 'Convention',
              'Education', 'Entertainment', 'Government', 'Hospital',
              'Hotel', 'Library', 'Mall', 'Manufacturing', 'Museum',
              'Residential', 'Retail', 'School', 'University', 'Theater',
              'Tunnel', 'Warehouse']
    for item in labels:
        if item in name:
            return item
>>> df.Name.apply(clean_labels)
0 Airport
1 University
2 Library
3 Theater
4 School
I'm assuming here there aren't any typos when comparing the strings and it will return a NoneType for anything that doesn't match.
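If you keep the DataFrame form of the labels from the question (the 'lookup' column), the pattern for str.extract can be built straight from it. A sketch, with labels_df as an illustrative name for that DataFrame:
# Sketch: build the alternation from the question's 'lookup' column and
# overwrite Name directly, which is the cleaned df the question asks for
pattern = '|'.join(labels_df['lookup'])
df['Name'] = df['Name'].str.extract(f"({pattern})")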

How do I melt a dataframe in pandas AND concatenate strings for the value

Say I have a dataframe.
I want to re-shape it AND concatenate the strings.
I can reshape it using melt, but I lose the description. I've tried transform but no luck.
Any ideas?
Code:
import pandas as pd
x = [['a', 'Electronics', 'TV', '42" plasma'], ['a', 'Electronics', 'TV', '36" LCD'], ['a', 'Electronics', 'hifi', 'cd player'], ['a', 'Electronics', 'hifi', 'record player'], ['b', 'Sports', 'Soccer', 'mens trainers'], ['b', 'Sports', 'Soccer', 'womens trainers'], ['b', 'Sports', 'golf', '9 iron']]
df = pd.DataFrame(x, columns =['id', 'category','sub_category','description'])
y = pd.melt(df, id_vars=['id'],value_vars=['category','sub category'])['description'].transform(lambda x : ' '.join(x))
The first problem is the melt: you need to add the description column to id_vars, and then aggregate with a groupby join, so all together it is:
y = (pd.melt(df,
             id_vars=['id', 'description'],
             value_vars=['category', 'sub_category'],
             value_name='Category')
       .groupby(['id', 'Category'])['description']
       .agg(' '.join)
       .reset_index())
print(y)
id Category description
0 a Electronics 42" plasma 36" LCD cd player record player
1 a TV 42" plasma 36" LCD
2 a hifi cd player record player
3 b Soccer mens trainers womens trainers
4 b Sports mens trainers womens trainers 9 iron
5 b golf 9 iron
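If you prefer to avoid melt entirely, an equivalent (not from the answer) is to aggregate category and sub_category separately and stack the two results; with the sample data this produces the same six rows.
# Sketch: aggregate per id/category and per id/sub_category, then concatenate
per_cat = (df.groupby(['id', 'category'])['description'].agg(' '.join)
             .reset_index().rename(columns={'category': 'Category'}))
per_sub = (df.groupby(['id', 'sub_category'])['description'].agg(' '.join)
             .reset_index().rename(columns={'sub_category': 'Category'}))
y2 = (pd.concat([per_cat, per_sub])
        .sort_values(['id', 'Category'])
        .reset_index(drop=True))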

Reshaping a Pandas data frame with duplicate values

Using the Plotly go.Table() function and Pandas, I'm attempting to create a table to summarize some data. My data is as follows:
import pandas as pd
test_df = pd.DataFrame({'Manufacturer': ['BMW', 'Chrysler', 'Chrysler', 'Chrysler', 'Brokertec', 'DWAS', 'Ford', 'Buick'],
                        'Metric': ['Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator'],
                        'Dimension': ['Short', 'Short', 'Short', 'Long', 'Short', 'Short', 'Long', 'Long'],
                        'User': ['USA', 'USA', 'USA', 'USA', 'USA', 'New USA', 'USA', 'Los USA'],
                        'Value': [50, 3, 3, 2, 5, 7, 10, 5]
                        })
My desired output is as follows (summing the Dimension by Manufacturer):
Manufacturer Short Long
Chrysler 6 2
Buick 5 5
Mercedes 7 0
Ford 0 10
I need to shape the Pandas data frame a bit (and this is where I'm running into trouble). My code was as follows:
table_columns = ['Manufacturer', 'Longs', 'Shorts']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (df[df['Manufacturer'].isin(manufacturers)]
          .set_index(['Manufacturer', 'Dimension'])
          ['Value'].unstack()
          .reset_index()[table_columns]
          )
Then, create the table using the Plotly go.Table() function:
import plotly.graph_objects as go

direction_table = go.Figure(go.Table(
    header=dict(
        values=table_columns,
        font=dict(size=12),
        line_color='darkslategray',
        fill_color='lightskyblue',
        align='center'
    ),
    cells=dict(
        values=df_new.T,  # using Transpose here
        line_color='darkslategray',
        fill_color='lightcyan',
        align='center')
    )
)
direction_table
The error I'm seeing is:
ValueError: Index contains duplicate entries, cannot reshape
What is the best way to work around this?
Thanks in advance!
You need to use pivot_table with aggfunc='sum' instead of set_index + unstack:
table_columns = ['Manufacturer', 'Long', 'Short']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
          .pivot_table(index='Manufacturer', columns='Dimension',
                       values='Value', aggfunc='sum', fill_value=0)
          .reset_index()
          .rename_axis(columns=None)[table_columns]
          )
print(df_new)
Manufacturer Long Short
0 Buick 5 0
1 Chrysler 2 6
2 Ford 10 0
Note that this is not the same as your expected output, but I don't think your input can produce that expected output (for example, there is no 'Mercedes' row in the input data).
Or the same result with groupby.sum and unstack
(test_df[test_df['Manufacturer'].isin(manufacturers)]
 .groupby(['Manufacturer', 'Dimension'])
 ['Value'].sum()
 .unstack(fill_value=0)
 .reset_index()
 .rename_axis(columns=None)[table_columns]
)
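For what it's worth, pd.crosstab is a third equivalent here (a sketch, not from the answer); it aggregates the same way as the pivot_table version:
# Sketch: crosstab with values/aggfunc behaves like the pivot_table above
df_new = (pd.crosstab(index=test_df['Manufacturer'],
                      columns=test_df['Dimension'],
                      values=test_df['Value'],
                      aggfunc='sum')
            .fillna(0)    # crosstab has no fill_value for aggregated values
            .reset_index()
            .rename_axis(columns=None)[table_columns])
df_new = df_new[df_new['Manufacturer'].isin(manufacturers)]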
