Python Pandas GroupBy Max Date

I have a very simple dataframe with columns: Index, Person, Item, Date. There are only 4 people, 3 items, and random dates, and all person/item/date combinations are unique. I am trying to get a simple pivot-table-like df to print using:
import pandas as pd
mydf = pd.read_csv("Test_Data.csv",index_col=[0])
mydf = mydf.sort_values(by=['Date','Item','Person'], ascending=False)
print(mydf.groupby(['Person','Item'])['Date'].max())
However, I noticed that while the structure is what I want, the data is not. It is not returning the max date for each Person/Item combination. I thought sorting things first would help, but it did not. Do I need to create a temp df first and then join to do what I'm trying to do?
Also to be clear, there are 28 rows of data (all test data) with some People/Items being repeated but with different dates. Index is just 0 through 27.

Figured it out! Should have made sure the Date field was actually recognized as a date:
mydf['Date'] = pd.to_datetime(mydf['Date'])
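For anyone else hitting this: when dates are stored as strings, max() compares them lexicographically, which only coincidentally matches chronological order. A minimal illustration with made-up dates:
import pandas as pd
# As text, '9...' sorts after '1...', so the string max is chronologically wrong.
s = pd.Series(['9/30/2019', '10/1/2019'])
print(s.max())                  # '9/30/2019' -- lexicographic, not chronological
print(pd.to_datetime(s).max())  # 2019-10-01, the true maximum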

Related

To find average of data of month end date only using pandas

I have a list of company names, dates, and PE ratios.
I need to find the average of the previous 10 years of data as of a given date, considering only month-end dates.
For example, to find the average as of 31 Dec 2015, I first need the data for all month ends from 31/12/2005 to 31/12/2015, and then their average.
(Sample data and the required output were shown as images in the original post.)
Here is what I have done so far:
df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
columns = ['pe', 'price_bv', 'mcap_ns', 'ev_ebidta']
df_mean = df.groupby('Company Name')[columns].resample('M').mean()
but this method is finding the mean on a daily basis and showing the result monthly, unlike my sample output.
I am new to pandas, please help.
Edit:
df3 = df.groupby(['Company Name','year','month'])
df3.first()
This code works; now I just have one problem: exporting the dataframe with to_csv. Please help.
A DataFrame has a method called groupby that groups rows by a column so they can be aggregated.
So if you were to run data.groupby('pe'), you would get the data grouped by that column.
Now if you were to tack on .describe(), you would get the standard deviation/mean/min/etc.
Example:
data.groupby('pe').describe()
Edit: You can also use built-in aggregate functions such as .max()/.mean()/etc. with groupby().
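Building on the edit in the question, here is a minimal sketch (assuming the column names shown in the question) that keeps one row per company per month and then answers the to_csv part; a grouped result is just a Series/DataFrame, so it can be written out directly:
import pandas as pd

df = pd.read_csv('daily_valuation_ratios_cc.csv')
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Last row of each (company, year, month) group = the month-end value.
month_end = (df.sort_values('date')
               .groupby(['Company Name', 'year', 'month'], as_index=False)
               .last())

# Ten-year window ending 31 Dec 2015, then the average per company.
window = month_end[(month_end['date'] > '2005-12-31') &
                   (month_end['date'] <= '2015-12-31')]
window.groupby('Company Name')['pe'].mean().to_csv('month_end_pe_averages.csv')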

Get rows based on minimum values of the column after group by

Sorry about my question, but I tried some solutions and couldn't find the right answer. I'm working with the Airbnb Boston database, and I would like to group by listing_id in the calendar data and then get the rows with the minimum price, where price is different from 0.0.
The database has 1,308,890 rows and 4 columns. There are 3,585 unique listing_id values.
dfc_calendar[(dfc_calendar['available'] == True)].groupby('listing_id')['price'].min()
Using the isin command to compare listing_id values takes a long time and eventually stops with an error. When I try to get the indexes after the groupby, I get listing_id values, but I need the indexes of the rows. How can I do it?
Thank you!
Not sure I got you. Shout if I got it wrong, because I am not clear what 'price different from 0.0' means.
Data
import pandas as pd
df=pd.DataFrame({'listing_id':['12345','12349','12345','12349','12345'], 'Price':[3,5,67,7,12]})
df['date'] = pd.date_range(start='1/2/2020', periods=len(df), freq='D')
df
Can go
df.groupby('listing_id')['Price'].min()
Or
df['MinPrice']=df.groupby('listing_id')['Price'].transform('min')
df
If you wanted to add availability to the grouping (assuming an available column in your data), please try
df['MinPrice']=df.groupby(['listing_id','available'])['Price'].transform('min')
df
Or
df.loc[df.groupby('listing_id')['Price'].idxmin()]
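If 'price different from 0.0' means zero prices should be excluded, a sketch using the question's own names (assuming price is numeric) would be:
# Filter to available, non-zero rows first; idxmin then returns the row
# index (not the listing_id) of each group's cheapest row, which is what
# the question asks for.
nonzero = dfc_calendar[(dfc_calendar['available'] == True) &
                       (dfc_calendar['price'] != 0.0)]
cheapest = nonzero.loc[nonzero.groupby('listing_id')['price'].idxmin()]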

How to update/apply validation to pandas columns

I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, depending on the workbook, the data will be missing the leading zeros or will have more than 11 characters (and those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values and only if those values are present (i.e. not touching it if there are null values).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY and sometimes MM/DD/YY, etc., and any of those are fine, but what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
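If you'd rather flag bad values than catch an exception, a hedged alternative is errors='coerce', which turns anything unparseable into NaT so you can inspect it; whether a given Excel serial survives parsing depends on pandas' inference, so treat everything as text first:
# Unparseable values (e.g. the serial '42139' once stringified) become NaT.
parsed = pd.to_datetime(df['Date'].astype(str), errors='coerce')
suspect = df[parsed.isna()]   # rows that still need the serial-number fix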
For your STOCK_NUM question, you could potentially apply a function to the column, but the way I approach this is with a list comprehension. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
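Worth noting, as a hedged alternative: pandas' .str accessor already propagates missing values, so the fillna round trip can be skipped; NaN simply stays NaN through the chain:
# .str methods leave nulls untouched, so missing values pass through unchanged.
df['STOCK_NUM'] = df['STOCK_NUM'].str.zfill(13).str[:13]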
Then, for your question about converting an Excel serial number to a date, I looked at an already-answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

Return a column of 'days in month' from monthly index Python

I have a time series of monthly values and I would like to calculate the number of days in that month (to then divide the number by to get a daily average for that month).
I have used calendar.monthrange() to calculate this by looping through the values, but I was looking at the pandas.DataFrame.apply method (https://medium.com/@rtjeannier/pandas-101-cont-9d061cb73bfc) and wondering how it would be possible to use that instead of a loop.
The code below gives me the output I would like, but for efficiency (and learning) purposes I'd like to understand the better way of doing this by using the apply method rather than a loop.
import pandas as pd
import calendar
df = pd.DataFrame()
df['temp'] = pd.date_range(start='01-Jan-2000', end='31-Dec-2018', freq='MS')
df['value'] = 5
df.set_index('temp', inplace=True)
days_list = []
for val in df.index:
    days_list.append(calendar.monthrange(val.year, val.month)[1])
df['days_in_month'] = days_list
I can find the number of days for one row of the index nice and easily by using this:
calendar.monthrange(df.index[0].year, df.index[0].month)[1]
But if I try to do it for a number of values (see below), it throws an error; I am missing the methodology for getting between the two.
calendar.monthrange(df.index.year, df.index.month)[1]
The end goal would to create a column (like the loop does) but more efficiently and without the needless creation of a list, looping through, then adding the list to the dataframe.
Use map with df.index:
df['days_in_month'] = df.index.map(lambda val: calendar.monthrange(val.year, val.month)[1])
Or skip the conversion entirely: a DatetimeIndex already exposes the day count through its daysinmonth attribute:
df['days_in_month'] = df.index.daysinmonth
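And since the question asked specifically about apply: the index can be turned into a Series that supports it, though the map and daysinmonth versions above are the more direct routes:
# apply-based equivalent of the loop, for completeness.
df['days_in_month'] = df.index.to_series().apply(
    lambda val: calendar.monthrange(val.year, val.month)[1])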

Python Excel rows with same value checking

I'm trying to get my Python program to verify an Excel spreadsheet that looks like this:
The first column is the order number and there may be one or more rows with the same number. Then there's the last column which indicates the row status (OK or not).
I want to check if all rows for a given order number have been marked as OK.
I have found something called pandas; could anyone give me some help with ideas on how to handle it? There's also an option called groupby - could I use this to group by order numbers and then verify that all rows for each order number have been marked as OK?
You are on the right track. Just import the data and pivot it using pandas, then check whether the number of 'Empty' counts is > 0. I used dummy data since I couldn't take it from your image:
import pandas as pd
df = pd.DataFrame()
df['no'] = [1,1,1,2,1,2,1,3]
df['ok'] = ['OK','Empty','OK','Empty','Empty','OK','OK','OK']
df['cnt'] = 1
a = df.pivot_table(index=['no'],columns=['ok'],values='cnt', aggfunc='count')
a.reset_index(inplace=True)
a.fillna(0, inplace=True)
print(a)
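To answer the groupby idea from the question directly, a minimal sketch on the same dummy data: check, per order number, whether every row's status is 'OK'.
# True where every row for that order number is marked OK.
all_ok = df.groupby('no')['ok'].apply(lambda s: (s == 'OK').all())
print(all_ok)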
