Get rows based on minimum values of the column after group by - python

Sorry about my question, but I tried some solutions and couldn't find the right answer. I'm working with the Airbnb Boston database and I would like to group the calendar table by listing_id and then get the rows with the minimum price, ignoring prices equal to 0.0.
The table has 1308890 rows and 4 columns. There are 3585 unique listing_id values.
dfc_calendar[(dfc_calendar['available'] == True)].groupby('listing_id')['price'].min()
Using isin to compare listing_id values takes a long time and eventually stops with an error. When I try to get the indexes after the groupby, I get listing_id values, but I need the indexes of the rows. How can I do it?
Thank you!

Not sure I got you. Shout if I got it wrong, because I am not clear what "different from 0.0" means.
Data
import pandas as pd
df=pd.DataFrame({'listing_id':['12345','12349','12345','12349','12345'], 'Price':[3,5,67,7,12]})
df['date'] = pd.date_range(start='1/2/2020', periods=len(df), freq='D')
df
Can go
df.groupby('listing_id')['Price'].min()
Or
df['MinPrice']=df.groupby('listing_id')['Price'].transform('min')
df
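If you want to add availability to the grouping, pass both keys as a list (this assumes the frame has an available column, which the sample data above does not):
df['MinPrice']=df.groupby(['listing_id', 'available'])['Price'].transform('min')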
df
Or
df.loc[df.groupby('listing_id')['Price'].idxmin()]
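To get the row indexes the question asks for while skipping unavailable rows and zero prices, the filter can be applied before idxmin. This is only a sketch on a made-up miniature of the calendar table; the column names follow the question, the values are invented.

```python
import pandas as pd

# Hypothetical miniature of the calendar table (values invented for illustration).
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "available": [True, True, False, True, True, True],
    "price": [0.0, 80.0, 50.0, 120.0, 95.0, 0.0],
})

# Keep only available rows with a non-zero price before grouping.
valid = calendar[calendar["available"] & (calendar["price"] != 0.0)]

# idxmin returns the original row label of the minimum price per listing.
min_idx = valid.groupby("listing_id")["price"].idxmin()

# Look those labels up to get the complete rows.
min_rows = calendar.loc[min_idx]
print(min_rows)
```

Because idxmin works on the filtered frame but returns labels from the original index, no isin comparison is needed.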

Related

How to find the mean values of two columns for a specific year in a Pandas DataFrame?

I'm trying to find the average value of two columns 'GBP' and 'USD' based on the specific year of 2020
from the 'Date' column inside a Pandas DataFrame.
The original question: "What was the average value of the £ in 2020 against the $"
What I've done:
import pandas as pd
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/usd_vs_gbp_euro_2011_2021.csv')
print(df.groupby('Date').GBP.mean())
print(df.groupby('Date').USD.mean())
However, this code prints the mean for every date, not just for the year 2020. Can anyone point out where I'm going wrong or suggest some solutions?
Note: I'm new to Python and using DataFrames.
Assuming that the data-type of your Date column is string, this is how you do it:
df_2020 = df[df['Date'].str.contains('2020')]
USD_mean = df_2020['USD'].mean()
GBP_mean = df_2020['GBP'].mean()
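If the Date column can be parsed as dates, filtering on the year is a bit safer than a substring match, since '2020' could also appear elsewhere in a string. A minimal sketch, assuming ISO-formatted dates; the real file's format is not shown in the question and the numbers below are invented.

```python
import pandas as pd

# Hypothetical sample; the date format and values are assumptions.
rates = pd.DataFrame({
    "Date": ["2019-12-31", "2020-01-15", "2020-06-30", "2021-01-04"],
    "GBP": [0.76, 0.77, 0.81, 0.73],
    "USD": [1.0, 1.0, 1.0, 1.0],
})

# Parse the strings so the year can be read directly.
rates["Date"] = pd.to_datetime(rates["Date"])

# Keep only the 2020 rows, then average both columns at once.
in_2020 = rates[rates["Date"].dt.year == 2020]
means_2020 = in_2020[["GBP", "USD"]].mean()
print(means_2020)
```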

To find average of data of month end date only using pandas

I have a list of company names, dates, and pe ratios.
I need to find an average of the previous 10 years data of the given date such that only month-end date is considered.
For example, if I need the average as of 31 Dec 2015, I first need the data of all previous month ends from 31/12/2005 to 31/12/2015, and then their average.
sample data I have
required output:
Here is what I have done so far:
df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
columns = ['pe', 'price_bv', 'mcap_ns', 'ev_ebidta']
df_mean = df.groupby('Company Name')[columns].resample('M').mean()
But this method computes the mean over the daily values and reports it per month, unlike my sample output.
I am new to pandas, please help.
Edit:
df3 = df.groupby(['Company Name','year','month'])
df3.first()
This code works. Now I just have one problem: exporting the dataframe with to_csv. Please help.
A DataFrame has a method called groupby that splits the rows into groups by the values of a column; the groups can then be aggregated.
So if you were to run data.groupby('pe'), you would get a GroupBy object with one group per distinct pe value.
Now if you were to tack on .describe(), you would get the count/mean/standard deviation/min/etc. for each group.
Example:
data.groupby('pe').describe()
Edit: You can also use built-in aggregate functions such as .max()/.mean()/etc. with groupby().
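For the month-end average the question describes, one possible approach is to keep the last available row of each month per company inside the 10-year window and then average those rows. This is a sketch on invented data: 'Company Name', 'date' and 'pe' follow the question, while the company "ACME" and the values are hypothetical.

```python
import pandas as pd

# Hypothetical daily data for one company (values invented for illustration).
dates = pd.date_range("2015-10-01", "2015-12-31", freq="D")
daily = pd.DataFrame({
    "Company Name": "ACME",
    "date": dates,
    "pe": [float(i) for i in range(len(dates))],
})

# Restrict to the 10 years ending at the date of interest.
as_of = pd.Timestamp("2015-12-31")
start = as_of - pd.DateOffset(years=10)
window = daily[(daily["date"] >= start) & (daily["date"] <= as_of)]

# Take the last available observation of each calendar month per company.
window = window.sort_values("date")
month_key = window["date"].dt.to_period("M").rename("month")
month_end = window.groupby(["Company Name", month_key])["pe"].last()

# Average the month-end values per company.
avg_pe = month_end.groupby(level="Company Name").mean()
print(avg_pe)
```

Taking the last row per month rather than the exact calendar month end is a design choice here, so months whose final day is missing from the data still contribute. The result can be written out with avg_pe.to_csv(...) using whatever file name suits.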

pandas loop through data frame for each unique value in column

I'm working with 2 csv files.
In the performance file I have historical data on loan performance (i.e. loan 110 performance from month 1 to 7, then loan 111 performance from month 1 to 20). The columns in this first file are: A = loan id, B = performance month (1 to 7), C = default amount. For each loan id there is 1 row per month of performance.
I'm trying to create a loop that gives me the first month in which each loan has a default, and to copy that month and default amount into my second csv file, which has descriptive data on each loan id. The idea is to add 2 columns to the second file and, for each loan id, retrieve the month when it first has a default value.
I'm working in a Jupyter notebook, and so far I've imported the pandas library and read the performance csv file.
Any guidance would be appreciated.
import pandas as pd
data = pd.read_csv(r'c:\users\guest1\documents\python_example_performance.csv',delimiter=',')
data.head()
First of all, I can't comment as I don't have enough reputation. I would need more clarification on the issue. Could you show what the data look like? It's a bit confusing for me between the loan IDs 110 and 111 and the ranges 1-7 or 1-20.
Based on my current understanding, I would first remove the non-default rows from the first CSV.
Since you're using pandas, you can do this with boolean indexing.
The syntax generally looks like this.
df = df[df[cols] > 0]
If a loan then appears more than once, keep either the earliest or the latest month, depending on what you need. Pandas supports dropping duplicates and has the option of keeping the first or last record. The syntax generally looks like this.
df = df.drop_duplicates(subset ="Col1", keep = 'last')
For more documentation, please refer to : Pandas - Drop Duplicates
Lastly, you need to join both DataFrames on the loan ID. The syntax generally looks like this.
df = pd.merge(df1, df2, how='left', on=['LoanID'])
For more documentation, please refer to : Pandas - Merge
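Put together, the three steps above could look like the sketch below. All column names (loan_id, month, default_amount, grade) and values are hypothetical, since the real files are not shown. It sorts by month and keeps the first record because the question asks for the first default.

```python
import pandas as pd

# Hypothetical performance file: one row per loan per month.
performance = pd.DataFrame({
    "loan_id": [110, 110, 110, 111, 111, 111],
    "month": [1, 2, 3, 1, 2, 3],
    "default_amount": [0.0, 50.0, 75.0, 0.0, 0.0, 0.0],
})

# Hypothetical descriptive file: one row per loan.
loans = pd.DataFrame({"loan_id": [110, 111], "grade": ["A", "B"]})

# Step 1: keep only rows where a default occurred.
defaults = performance[performance["default_amount"] > 0]

# Step 2: sort by month and keep the earliest default per loan.
first_default = (
    defaults.sort_values("month")
    .drop_duplicates(subset="loan_id", keep="first")
    [["loan_id", "month", "default_amount"]]
    .rename(columns={"month": "first_default_month"})
)

# Step 3: left-join so loans without any default are kept (with NaN).
result = loans.merge(first_default, how="left", on="loan_id")
print(result)
```

No explicit loop is needed; loans that never default simply end up with missing values in the two new columns.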

Python Pandas GroupBy Max Date

I have a very simple dataframe with columns: Index, Person, Item, Date. There are only 4 people and 3 items and random dates. All person/item/date combinations are unique. I am trying to get a simple pivot-table like df to print using:
import pandas as pd
mydf = pd.read_csv("Test_Data.csv",index_col=[0])
mydf = mydf.sort_values(by=['Date','Item','Person'], ascending=False)
print(mydf.groupby(['Person','Item'])['Date'].max())
However, I noticed that while the structure is what I want, the data is not. It is not returning the max date for each Person/Item combination. I thought sorting things first would help, but it did not. Do I need to create a temp df first and then join to do what I'm trying to do?
Also to be clear, there are 28 rows of data (all test data) with some People/Items being repeated but with different dates. Index is just 0 through 27.
Figured it out! Should have made sure the Date field was actually recognized as a date:
mydf['Date'] = pd.to_datetime(mydf['Date'])
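A small illustration of why the conversion matters: while the dates are still strings, max() compares them character by character. The two rows and the month/day/year format below are assumptions for the sake of the example, not the actual test data.

```python
import pandas as pd

# Hypothetical rows; the m/d/Y text format is an assumption.
mydf = pd.DataFrame({
    "Person": ["Ann", "Ann"],
    "Item": ["Pen", "Pen"],
    "Date": ["9/1/2020", "10/1/2020"],
})

# As text, "9/..." sorts after "10/..." because "9" > "1".
as_text = mydf.groupby(["Person", "Item"])["Date"].max()

# After parsing, the comparison is chronological.
mydf["Date"] = pd.to_datetime(mydf["Date"], format="%m/%d/%Y")
as_dates = mydf.groupby(["Person", "Item"])["Date"].max()

print(as_text)
print(as_dates)
```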

Can I aggregate a dataframe monthly in Python while also taking the other variables into consideration?

Sorry if the title is not clear enough. The dataset I have is 'df': it holds daily data for two 'id' values (1 and 2) in the month of January, and each 'id' has a Value (a or b) for each day. Starting from a dataset like df (the links are below), I want to arrive at df1. So the goal is to group the data monthly, but separately for each 'id' rather than over all values together. The Value column should be the sum of all 'a' and 'b' of a certain 'id' in a certain month.
I don't know if I have explained the problem clearly. I hope the links below help. I am a complete beginner in Python and I am facing many difficulties.
Thank you very much in advance.
Dataframe df head
Dataframe df end
Dataframe df1: the output I would like to obtain
From what I see, this should do:
df = df.set_index(pd.to_datetime(df['Date']))
df1 = df.groupby([pd.Grouper(freq='M'), 'id']).agg({'value':'sum'})
But, as stated, you should post a reproducible example if you want to get better help.
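In that spirit, here is a small reproducible sketch of the monthly sum per id. It groups on a monthly period derived from the Date column instead of pd.Grouper, which is a swapped-in variant that does not need the dates in the index. The column names 'Date', 'id' and 'value' and all the numbers are assumptions, since the original frames were only posted as images.

```python
import pandas as pd

# Hypothetical daily data for two ids (values invented for illustration).
df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02", "2021-02-01"],
    "id": [1, 1, 2, 2, 1],
    "value": [10, 20, 1, 2, 5],
})
df["Date"] = pd.to_datetime(df["Date"])

# Group by calendar month and id, then sum the values in each group.
df1 = (
    df.groupby([df["Date"].dt.to_period("M"), "id"])["value"]
    .sum()
    .reset_index()
)
print(df1)
```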
