Usage of WHERE CLAUSE in Python

Trying to write SQL-style WHERE filtering in Python.
I have a DataFrame df with columns date, ID and amount.
Every month I get a new load of data. I have to calculate the average amount for a particular ID over the last 12 months (meaning we will have 12 records for that one ID).
Currently, my approach is
M1 = pd.date_range(first_day_of_month, last_day_of_12_month, freq='D').strftime("%Y%m%d").tolist()
df["new"] = df[df['date'].isin(M1)]['amount'].mean()
Now I want to add this average as a new column, so that each ID's row with the current (latest) timestamp carries the average of the last 12 months' amount. I tried using groupby but was not able to apply it properly.

mask = df.date.between(datetime.datetime(2019, 1, 1), datetime.datetime(2019, 12, 31))
df[mask].groupby(['ID'])['amount'].mean()
Something like this, maybe?
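The filter-then-groupby idea above can be sketched end to end with transform, which broadcasts each ID's mean back onto the rows. Everything below (the sample values and the avg_12m column name) is made up for illustration, not taken from the question:

```python
import pandas as pd

# Hypothetical data in the shape described: date, ID, amount
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-15", "2019-06-15", "2019-12-15",
                            "2019-03-10", "2019-09-10"]),
    "ID": ["A", "A", "A", "B", "B"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# WHERE-style filter: keep only the 12-month window
start, end = pd.Timestamp("2019-01-01"), pd.Timestamp("2019-12-31")
mask = df["date"].between(start, end)

# Broadcast each ID's windowed mean onto its rows with transform
df["avg_12m"] = df[mask].groupby("ID")["amount"].transform("mean")
```

Rows outside the window would get NaN in avg_12m, which you could then restrict to the latest timestamp per ID if needed.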

Related

In python pandas, How do you eliminate rows of data that fail to meet a condition of grouped data?

I have a data set that contains hourly data of marketing campaigns. There are several campaigns, and not all of them are active during all 24 hours of the day. My goal is to eliminate all rows for campaign/day combinations where I don't have the full 24 hourly rows.
The raw data contains a lot of information like this:
[image: original data set]
I created a dummy variable of ones to be able to count individual rows. This is the code I applied to see the results I want:
tmp = df.groupby(['id','date']).count()
tmp.query('Hour > 23')
I get the following results:
[image: results of the two lines of code]
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data per campaign per day that does not reach 24 rows? The objective is the real data, not the count, i.e. ungrouped, unlike what I present in the second picture.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe, then use loc as a replacement for query:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
             .loc[lambda x: x > 23].index]
Drop the data you don't want before you do the groupby.
You can use .loc or .drop; I am unfamiliar with .query.
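A self-contained sketch of the transform-count approach, on toy data (the ids, dates and hours below are invented, one complete 24-hour day and one incomplete day):

```python
import pandas as pd

# Toy hourly campaign data: day 'd1' has all 24 hours, 'd2' only 3
df = pd.DataFrame({
    "id": [1] * 27,
    "date": ["d1"] * 24 + ["d2"] * 3,
    "Hour": list(range(24)) + [0, 1, 2],
})

# Broadcast each (id, date) group's row count onto every row,
# then keep only rows belonging to complete (24-row) days
counts = df.groupby(["id", "date"])["Hour"].transform("count")
out = df[counts > 23]
```

Because transform returns a Series aligned with the original index, the boolean mask filters the raw rows directly, no ungrouping step is needed.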

How to best calculate average issue age for a range of historic dates (pandas)

Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (historic issue list): "df"

ref  Created    resolved
a1   20/4/2021  15/5/2021
a2   21/4/2021  15/6/2021
a3   23/4/2021  15/7/2021
Endpoint: "df2"

Date      Avg_age
1/1/2021  x
2/1/2021  y
3/1/2021  z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work in what feels like a very poor way.
create a date range: pd.date_range(start, finish, freq="D")
I loop through the dates in this range, and for each date I filter the "df" dataframe (boolean filtering) to show only issues live on the date in question. Then I calculate the age (date - created) and average it for those issues. Each result is appended to a list.
Once done, I convert the list into a series for my final result, which I can then graph or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
result_list = []
for each_date in hist_dates:
    f1 = df.Created < each_date      # filter 1: already created
    f2 = df.Resolved >= each_date    # filter 2: not yet resolved
    df['Refdate'] = each_date        # column to allow refdate - created
    df['Age'] = df.Refdate - df.Created
    result_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
I'll give you two solutions: the first is step-by-step so you can follow the idea and process; the second replicates the functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds the issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type; if not, use pd.to_datetime to convert them).
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is to define a custom function and apply it to each creation-date group (note that apply passes each group as a DataFrame, not a Series):
def issue_age(g: pd.DataFrame):
    return (g['resolved'] - g['Created']).mean()
df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.
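The original objective (average age of all issues open on each calendar date, rather than per creation date) can also be done without a Python loop, by cross-joining the issue list with the date range and filtering. This is a sketch under the assumption that merge(how="cross") is available (pandas >= 1.2); the sample rows mirror the question's table:

```python
import pandas as pd

issues = pd.DataFrame({
    "ref": ["a1", "a2", "a3"],
    "Created": pd.to_datetime(["2021-04-20", "2021-04-21", "2021-04-23"]),
    "Resolved": pd.to_datetime(["2021-05-15", "2021-06-15", "2021-07-15"]),
})

dates = pd.DataFrame({"Refdate": pd.date_range("2021-01-01", "2021-12-31", freq="D")})

# Pair every issue with every reference date, keep only issues open on that date
cross = dates.merge(issues, how="cross")
open_mask = (cross["Created"] < cross["Refdate"]) & (cross["Resolved"] >= cross["Refdate"])

# Average age of the open issues, one value per reference date
avg_age = (cross[open_mask]
           .assign(Age=lambda d: d["Refdate"] - d["Created"])
           .groupby("Refdate")["Age"].mean())
```

For large issue lists the cross join can get big (len(issues) * len(dates) rows), so the loop may still be acceptable for modest data sizes.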

how can I group column by date and get the average from the other column in python?

From the dataframe below:
I would like to group column 'datum' by date (01-01-2019 and so on) and at the same time get the average of column 'PM10_gemiddelde'.
Right now every date such as 01-01-2019 appears 24 times (once per hour), and I need it combined into one row with the average of 'PM10_gemiddelde'. See the picture for the data.
Besides that, 'PM10_gemiddelde' also contains negative values. How can I easily remove that data in Python?
Thank you!
P.S. I'm new to Python.
What you are trying to do can be achieved with:
data[['datum','PM10_gemiddelde']].loc[data['PM10_gemiddelde'] > 0 ].groupby(['datum']).mean()
You can create a new column with the average of PM10_gemiddelde using groupby along with transform. Try the following:
Assuming your dataframe is called df, start by removing the negative data (the .copy() avoids a SettingWithCopyWarning when you add a column later):
new_df = df[df['PM10_gemiddelde'] > 0].copy()
Then, you can create a new column that contains the average value for every date:
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
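Both steps can be sketched together on made-up readings (the values below are invented; 'avg_col' is just an illustrative name):

```python
import pandas as pd

# Hypothetical hourly readings: 'datum' is the date, 'PM10_gemiddelde' the measurement
df = pd.DataFrame({
    "datum": ["01-01-2019"] * 3 + ["02-01-2019"] * 2,
    "PM10_gemiddelde": [10.0, 20.0, -5.0, 30.0, 40.0],
})

clean = df[df["PM10_gemiddelde"] > 0].copy()               # drop negative readings
daily = clean.groupby("datum")["PM10_gemiddelde"].mean()   # one average per date
clean["avg_col"] = clean.groupby("datum")["PM10_gemiddelde"].transform("mean")
```

Use the `daily` form when you want one row per date, and the `transform` form when you want the average repeated on every hourly row.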

groupby agg using a date offset or similar

Sample of dataset below
Trying to create a groupby that will give me the number of months that I specify, e.g. the last 12 months, last 36 months, etc.
My groupby that rolls up my whole dataset for each client is below. rolled_ret is just a custom function that geometrically links whatever performance array it gets; we can pretend it is sum().
df_client_perf = df_perf.groupby(df_perf.CLIENT_NAME)['GROUP_PERFORMANCE'].agg(Client_Return = rolled_ret)
If I use .rolling(12) I can take the most recent entry to get the previous 12 months, but there is obviously a better way to do this.
Worth saying that the period column is a monthly period datetime type created with to_period.
Thanks in advance.
PERIOD,CLIENT_NAME,GROUP_PERFORMANCE
2020-03,client1,0.104
2020-04,client1,0.004
2020-05,client1,0.23
2020-06,client1,0.113
2020-03,client2,0.0023
2020-04,client2,0.03
2020-05,client2,0.15
2020-06,client2,0.143
Let's say, for example, that I wanted to do a groupby to sum the latest three months of data; my expected output for the above would be
client1,0.347
client2,0.323
I would also like a way to return NaN if the dataset is missing the minimum number of periods, as you can with the rolling function.
Here is my answer.
I've used a DatetimeIndex because the method last does not work with period dtypes. First I sort values based on the PERIOD column, then I set it as the index to keep only the last 3 months (or whatever offset you provide), then I do the groupby the same way as you. (Note that DataFrame.last is deprecated in recent pandas versions.)
df['PERIOD'] = pd.to_datetime(df['PERIOD'])
(df.sort_values(by='PERIOD')
   .set_index('PERIOD')
   .last('3M')
   .groupby('CLIENT_NAME')
   .GROUP_PERFORMANCE
   .sum())
# Result
CLIENT_NAME GROUP_PERFORMANCE
client1 0.347
client2 0.323
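An equivalent sketch that avoids the deprecated last method and also handles the minimum-periods request: compute a cutoff with a DateOffset, filter, and blank out clients with too few rows. The cutoff logic and the 3-period minimum are assumptions matching the example above:

```python
import pandas as pd
from io import StringIO

csv = """PERIOD,CLIENT_NAME,GROUP_PERFORMANCE
2020-03,client1,0.104
2020-04,client1,0.004
2020-05,client1,0.23
2020-06,client1,0.113
2020-03,client2,0.0023
2020-04,client2,0.03
2020-05,client2,0.15
2020-06,client2,0.143
"""
df = pd.read_csv(StringIO(csv), parse_dates=["PERIOD"])

# Keep only rows strictly after max(PERIOD) - 3 months, then aggregate per client
cutoff = df["PERIOD"].max() - pd.DateOffset(months=3)
recent = df[df["PERIOD"] > cutoff]
result = recent.groupby("CLIENT_NAME")["GROUP_PERFORMANCE"].sum()

# Return NaN for clients missing the minimum number of periods
counts = recent.groupby("CLIENT_NAME")["GROUP_PERFORMANCE"].count()
result = result.where(counts >= 3)
```

Swapping sum() for the custom rolled_ret aggregation via .agg() should work the same way.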

Getting values from csv file in python, large dataset

I have a csv file with stock values for 500 companies over 5 years (2013-2017). The columns I have are: date, open, high, low, close, volume and name. I would like to be able to compare these companies to see which 20 of them are the best. I was thinking about just using the mean, but since the stock values at the first collected date (Jan 2013) differ (some start at 30 USD, others at 130 USD), it's hard to compare which ones have done best over these 5 years. I would therefore like to use each company's value on the first date as the zero point. Basically, I want to subtract the first date's close value from all the later data.
My problem is that, firstly, I have a hard time getting to the first date's close value. I want to write something like "data.loc(data['close']).iloc(0)", but since it's a dataframe I can't find the value of a row, nor iterate through the dataframe.
Secondly, I'm not sure how I can differentiate between the companies. I want to do the procedure with the zero-point for every of these 500 companies, so somehow I need to know when to start over.
The code I have now is
def main():
    data = pd.read_csv('./all_stocks_5yr.csv', usecols=['date', 'close', 'Name'])
    comp_name = sorted(set(data.Name))
    number_of = len(comp_name)
    comp_mean = []
    for i in comp_name:
        frames = data.loc[data['Name'] == i]
        comp_mean.append([i, frames['close'].mean()])
    print(comp_mean)
But this will only give me the mean, without using the zero-point
Another idea I had was to compare the closing price of the first value (January 1, 2013) with the last value (December 31, 2017) to see how much each stock increased or decreased. What I'm not sure about here is how to reach the close values for those dates for every one of the 500 companies.
Do you have any recommendations for any of the methods?
Thank you in advance
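The zero-point rebasing described above can be done per company without iterating, by broadcasting each company's first close with groupby + transform. A minimal sketch on invented ticker data (the names, dates and prices below are hypothetical, as is the 'rebased' column):

```python
import pandas as pd

# Hypothetical data in the shape of all_stocks_5yr.csv
data = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-03", "2013-01-02", "2013-01-03"],
    "close": [30.0, 33.0, 130.0, 135.0],
    "Name": ["AAA", "AAA", "BBB", "BBB"],
})
data = data.sort_values(["Name", "date"]).reset_index(drop=True)

# Subtract each company's first close so every stock starts at zero
data["rebased"] = data["close"] - data.groupby("Name")["close"].transform("first")
```

For comparing percentage growth instead of absolute moves, dividing by transform("first") rather than subtracting may be fairer across differently priced stocks.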
