How to find the average of month-end data only using pandas (Python)

I have a list of company names, dates, and pe ratios.
I need to find the average of the previous 10 years of data as of a given date, considering only month-end dates.
For example, to find the average as of 31 Dec 2015, I first need the data for all month ends from 31/12/2005 to 31/12/2015, and then their average.
(Sample data and the required output were shown as images.)
Here is what I have done so far:
df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
columns = ['pe', 'price_bv', 'mcap_ns', 'ev_ebidta']
df_mean = df.groupby('Company Name')[columns].resample('M').mean()
However, this method averages all the daily values within each month rather than keeping only the month-end rows, so it does not match my sample output.
I am new to pandas, please help.
Edit:
df3 = df.groupby(['Company Name','year','month'])
df3.first()
This code works; now I just have one problem: exporting the resulting dataframe with to_csv. Please help.
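For reference, a minimal sketch of the export step, assuming df already has the year and month columns used above (the output filename is just a placeholder):
# keep one row per company/year/month, then flatten the
# resulting MultiIndex so the CSV gets plain columns
df3 = df.groupby(['Company Name', 'year', 'month']).first()
df3.reset_index().to_csv('monthly_rows.csv', index=False)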

A DataFrame has a method called groupby that groups rows by the values of a column; the resulting groups can then be aggregated.
So if you were to run data.groupby('pe'), you would get a GroupBy object keyed on that column.
Now if you were to tack on .describe(), you would get the standard deviation/mean/min/etc. for each group.
Example:
data.groupby('pe').describe()
Edit: You can also use built-in aggregate functions such as .max()/.mean()/etc. with groupby().
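For illustration, a small sketch on made-up data (the column names here are invented):
import pandas as pd

data = pd.DataFrame({
    'Company Name': ['A', 'A', 'B', 'B'],
    'pe': [10.0, 12.0, 20.0, 24.0],
})

# summary statistics (count/mean/std/min/quartiles/max) per group
print(data.groupby('Company Name')['pe'].describe())

# single aggregates via the built-in functions
print(data.groupby('Company Name')['pe'].mean())
print(data.groupby('Company Name')['pe'].max())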

Related

How to find the mean values of two columns for a specific year in a Pandas DataFrame?

I'm trying to find the average value of two columns, 'GBP' and 'USD', for the specific year 2020, based on the 'Date' column of a Pandas DataFrame.
The original question: "What was the average value of the £ in 2020 against the $"
What I've done:
import pandas as pd
df = pd.read_csv('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/usd_vs_gbp_euro_2011_2021.csv')
print(df.groupby('Date').GBP.mean())
print(df.groupby('Date').USD.mean())
However, this code prints the mean for every year, not just the year 2020. Can anyone point out where I'm going wrong or suggest some solutions?
Note: I'm new to Python and using DataFrames.
Assuming that the data type of your Date column is string, this is how you do it:
df_2020 = df[df['Date'].str.contains('2020')]
USD_mean = df_2020['USD'].mean()
GBP_mean = df_2020['GBP'].mean()
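Alternatively, if you'd rather work with real dates, a sketch assuming the Date column parses cleanly:
# parse the dates once, then filter on the year component
df['Date'] = pd.to_datetime(df['Date'])
df_2020 = df[df['Date'].dt.year == 2020]
USD_mean = df_2020['USD'].mean()
GBP_mean = df_2020['GBP'].mean()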

Get rows based on minimum values of a column after groupby

Sorry about my question, but I tried some solutions and couldn't get the right answer. I'm working with the Airbnb Boston database, and I would like to group by listing_id in the calendar data and then get the rows with the minimum price, where the price is different from 0.0.
The database has 1,308,890 rows and 4 columns. There are 3,585 unique listing_id values.
dfc_calendar[(dfc_calendar['available'] == True)].groupby('listing_id')['price'].min()
Using the isin command to compare listing_id takes a long time and eventually stops with an error. When I try to get the indexes after the groupby, I get listing_id values, but I need the indexes of the rows. How can I do that?
Thank you!
Not sure I got you. Shout if I got it wrong, because I am not clear what "different from 0.0" means.
Data
import pandas as pd
df=pd.DataFrame({'listing_id':['12345','12349','12345','12349','12345'], 'Price':[3,5,67,7,12]})
df['date'] = pd.date_range(start='1/2/2020', periods=len(df), freq='D')
df
Can go
df.groupby('listing_id')['Price'].min()
Or
df['MinPrice']=df.groupby('listing_id')['Price'].transform('min')
df
If you wanted to add availability to the grouping, please try
df['MinPrice']=df.groupby(['listing_id', 'available'])['Price'].transform('min')
df
Or
df.loc[df.groupby('listing_id')['Price'].idxmin()]
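And if the zero-priced rows should be excluded first, one possible sketch against the question's frame (assuming price is numeric):
# drop zero prices, then take the row with the minimum price per
# listing; idxmin returns the original row indexes, which .loc keeps
nonzero = dfc_calendar[(dfc_calendar['available'] == True) &
                       (dfc_calendar['price'] > 0)]
min_rows = nonzero.loc[nonzero.groupby('listing_id')['price'].idxmin()]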

pandas loop through data frame for each unique value in column

I'm working with 2 csv files.
In the performance file I have historical data on loan performance (i.e. loan 110's performance from month 1 to 7, then loan 111's performance from month 1 to 20). The columns are: A = loan id, B = performance month, C = default amount. For each loan id there is one row per month of performance.
I'm trying to create a loop that finds, for each loan, the first month with a default, and copies that month and the default amount into my second csv file, which has descriptive data on each loan id. The idea is to add two columns to the second file and, for each loan id, retrieve the month when it first has a default value.
I'm working in a Jupyter notebook, and so far I've imported the pandas library and read the performance csv file.
Any guidance would be appreciated.
import pandas as pd
data = pd.read_csv(r'c:\users\guest1\documents\python_example_performance.csv',delimiter=',')
data.head()
First of all, I can't comment as I don't have enough reputation. I would need more clarification on the issue. Could you show what the data looks like? It's a bit confusing for me between the loans 110, 111 and the months 1-7 or 1-20.
Based on my current understanding, I would first remove the non-default (zero) rows from the first CSV.
Since you're using pandas, you can do this with boolean indexing.
The syntax generally looks like this:
df = df[df[col] > 0]
If there are duplicates, keep the first or the latest month, depending on your choice. Pandas supports dropping duplicates, with the option of keeping the first or last record. The syntax generally looks like this:
df = df.drop_duplicates(subset="Col1", keep='last')
For more documentation, please refer to : Pandas - Drop Duplicates
Lastly, you need to perform a join of the two DataFrames on loan ID. The syntax generally looks like this:
df = pd.merge(df1, df2, how='left', on=['LoanID'])
For more documentation, please refer to : Pandas - Merge
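Putting the pieces together, a rough end-to-end sketch; the second filename and the column names (loan_id, month, default_amount) are assumptions based on the description:
import pandas as pd

perf = pd.read_csv(r'c:\users\guest1\documents\python_example_performance.csv')
desc = pd.read_csv(r'c:\users\guest1\documents\loan_descriptions.csv')  # hypothetical second file

# keep only the months with an actual default
defaults = perf[perf['default_amount'] > 0]

# first default per loan: sort by month, keep the earliest row
first_default = (defaults.sort_values('month')
                         .drop_duplicates(subset='loan_id', keep='first'))

# attach the first default month/amount to the descriptive data
merged = pd.merge(desc, first_default[['loan_id', 'month', 'default_amount']],
                  how='left', on='loan_id')
merged.to_csv('loans_with_first_default.csv', index=False)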

Python Pandas GroupBy Max Date

I have a very simple dataframe with columns: Index, Person, Item, Date. There are only 4 people and 3 items, with random dates. All person/item/date combinations are unique. I am trying to get a simple pivot-table-like df to print using:
import pandas as pd
mydf = pd.read_csv("Test_Data.csv",index_col=[0])
mydf = mydf.sort_values(by=['Date','Item','Person'], ascending=False)
print(mydf.groupby(['Person','Item'])['Date'].max())
However, I noticed that while the structure is what I want, the data is not: it is not returning the max date for each Person/Item combination. I thought sorting things first would help, but it did not. Do I need to create a temp df first and then join to do what I'm trying to do?
Also, to be clear, there are 28 rows of data (all test data), with some People/Items repeated but with different dates. The index is just 0 through 27.
Figured it out! I should have made sure the Date field was actually recognized as a date:
mydf['Date'] = pd.to_datetime(mydf['Date'])
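For completeness, the working sequence might then look like this (a sketch; parse_dates does the conversion at read time):
import pandas as pd

# parse the Date column as datetime while reading, so max() compares dates
mydf = pd.read_csv("Test_Data.csv", index_col=[0], parse_dates=['Date'])
print(mydf.groupby(['Person', 'Item'])['Date'].max())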

computing rolling averages by integer days in pandas

I have taken some data from a csv and put it into a dataframe:
from pandas import read_csv
df = read_csv('C:\...', delimiter = ',', encoding = 'utf-8')
df2 = df.groupby(['i-j','day'])['i-j'].agg(['count'])
I would like to calculate, for each 'i-j', the seven-day moving average of its count. First, I think I need to add the days with zero count to the table. Is there an easy way to do this by modifying my code above? In other words, I would like missing values to count as 0.
Then I would need to add another column to the dataframe that holds, for each 'i-j', the average count over the previous seven days. Do I need to convert the days to something that pandas recognizes as a date value in order to use some of the rolling statistical functions? Or can I just change the type of the 'day' column and proceed?
Many thanks!
There may be a better way to do this, but given your starting DataFrame df2, the following should work.
First reindex df2 to fill in the missing days with zeros:
new_index = pd.MultiIndex.from_product([df2.index.get_level_values(0).unique(), range(31)])
df2 = df2.reindex(new_index, fill_value=0)
(I'm assuming you want 31 days, but you can change this as necessary.)
Now if you unstack this reindexed DataFrame and take the transpose, you have a DataFrame where each column is an entry of i-j and contains the counts per day:
df2.unstack().T
You can calculate the rolling mean of this DataFrame:
rm = df2.unstack().T.rolling(7).mean()
To finish, you can stack this frame of rolling means to get back to the shape of the original reindexed df2:
rm.T.stack(dropna=False)
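A runnable sketch of the whole pipeline on toy data (the 'i-j' values and counts are invented):
import pandas as pd

# toy counts: two 'i-j' pairs observed on a few days of a 31-day window
df2 = pd.DataFrame(
    {'count': [2, 1, 3, 5]},
    index=pd.MultiIndex.from_tuples(
        [('a-b', 0), ('a-b', 2), ('c-d', 1), ('c-d', 2)],
        names=['i-j', 'day']),
)

# fill in the missing days with zero counts
new_index = pd.MultiIndex.from_product(
    [df2.index.get_level_values(0).unique(), range(31)])
df2 = df2.reindex(new_index, fill_value=0)

# one column per 'i-j', one row per day, then a 7-day rolling mean
rm = df2.unstack().T.rolling(7).mean()

# stack back to the original (i-j, day) shape
result = rm.T.stack(dropna=False)
print(result.head(10))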
