I am trying to figure out a way to do this sum in one line, or without having to create another dataframe in memory.
I have a DF with 3 columns. ['DateCreated', 'InvoiceNumber', 'InvoiceAmount']
I am trying to SUM the invoice amount during certain date ranges.
I have this working, but I want to do it without having to create a DF then sum the column. Any help is appreciated.
yesterday_sales_df = df[(df['DateCreated'] > yesterday_date) & (df['DateCreated'] < tomorrow_date)]
yesterday_sales_total = yesterday_sales_df['InvoiceAmount'].sum()
print(yesterday_sales_total)
Thanks
You can try with loc
yesterday_sales_total = df.loc[(df['DateCreated'] > yesterday_date) & (df['DateCreated'] < tomorrow_date), 'InvoiceAmount'].sum()
You can use this as well
# filter df with query
yesterday_sales_total = df.query("@yesterday_date < DateCreated < @tomorrow_date")['InvoiceAmount'].sum()
try between:
sales_total = df[df['DateCreated'].between(yesterday_date, tomorrow_date)]['InvoiceAmount'].sum()
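If necessary, set the inclusive argument (inclusive='both' by default, so both endpoints are included; inclusive='neither' matches the strict comparisons in the question).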
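The three answers above can be compared side by side on a small frame. This is only a sketch: the sample rows and dates are made up, and inclusive="neither" assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "DateCreated": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
    "InvoiceNumber": [1, 2, 3, 4],
    "InvoiceAmount": [10.0, 20.0, 30.0, 40.0],
})
yesterday_date = pd.Timestamp("2024-01-01")
tomorrow_date = pd.Timestamp("2024-01-03")

# Boolean mask passed to .loc together with the column to sum.
mask = (df["DateCreated"] > yesterday_date) & (df["DateCreated"] < tomorrow_date)
total_loc = df.loc[mask, "InvoiceAmount"].sum()

# query() refers to local variables with the @ prefix.
total_query = df.query(
    "@yesterday_date < DateCreated < @tomorrow_date"
)["InvoiceAmount"].sum()

# between() includes both endpoints unless told otherwise.
strict = df["DateCreated"].between(yesterday_date, tomorrow_date, inclusive="neither")
total_between = df.loc[strict, "InvoiceAmount"].sum()
total_inclusive = df.loc[
    df["DateCreated"].between(yesterday_date, tomorrow_date), "InvoiceAmount"
].sum()
```

With strict bounds only the two rows dated 2024-01-02 are summed; the default between() also counts the rows that fall exactly on the endpoints.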
I'm trying to create a new column in a DataFrame that comes from a CSV file. What makes this a little bit tricky is that the values of the new column depend on conditions on other columns of the DataFrame.
The output column depends on the values from the following columns from this dataframe: VaccineCode | Occurrence | VaccineN | firstVaccineDate
So if the condition is met for a specific vaccine, I have to sum the respective date from the ApplicationDate column, in order to tell the vaccine date of the second dose.
My code:
import pandas as pd
import datetime
from datetime import timedelta, date, datetime
df = pd.read_csv(path_csv, engine='python', sep=';')
criteria_Astrazeneca = (df.VaccineCode == 85) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_Pfizer = (df.VaccineCode == 86) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_CoronaVac = (df.VaccineCode == 87) & (df.Occurrence == 1) & (df.VaccineN == 1)
days_pfizer = 56
days_coronaVac = 28
days_astraZeneca = 84
What I've tried so far:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
This works until the point that I have to complete the same New_Column with the others results, like this:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df['New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df['New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
Naturally, the problem with this approach comes from the fact that the next statement overwrites those before, so I end up just with the New_Column filled with the results that came from the last statement. I need a way to put all results in the same column.
My last try was:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df[criteria_Pfizer].loc[:,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
But it gives the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_column(ilocs[0], value, pi)
Thank you very much @ddejohn, the first link helped me to solve my problem as follows:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df.loc[criteria_Pfizer,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df.loc[criteria_Astrazeneca,'New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
That way, the first statement creates the column and fills it at the CoronaVac indexes, and the following statements fill the same column only at their respective indexes.
Problem solved, thanks again.
You could also use a DataFrame transform to create a new rule.
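One possible reading of that suggestion, as a sketch rather than the poster's actual method: map each vaccine code to its interval in days and add it in a single vectorized step. The names days_by_code, eligible and offsets are hypothetical, the sample rows are made up, and firstVaccineDate is assumed to be parsed as datetimes already.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the question.
df = pd.DataFrame({
    "VaccineCode": [85, 86, 87, 86],
    "Occurrence": [1, 1, 1, 2],
    "VaccineN": [1, 1, 1, 1],
    "firstVaccineDate": pd.to_datetime(["2021-01-01"] * 4),
})

# Days until the second dose, keyed by vaccine code (values from the question).
days_by_code = {85: 84, 86: 56, 87: 28}

# Rows that satisfy the shared part of the three criteria.
eligible = (df["Occurrence"] == 1) & (df["VaccineN"] == 1)

# Per-row offset looked up from the vaccine code.
offsets = pd.to_timedelta(df["VaccineCode"].map(days_by_code), unit="D")

# Rows that are not eligible are left as NaT.
df["New_Column"] = (df["firstVaccineDate"] + offsets).where(eligible)
```

This writes the column once, so nothing is overwritten and no chained assignment is involved.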
I have a pandas dataframe of 62 rows x 10 columns. Each row contains numbers and if any of the numbers are within a certain range then return a string into the last column.
I have unsuccessfully tried the .apply method to use a function to make the assessment. I have also tried to import as a series but then the .apply method causes problems because it is a list.
df = pd.read_csv(results)
For example, in the image attached, if any value from Base 2019 to FY26 Load is between 0.95 and 1.05 then return 'Acceptable' into the last column otherwise return 'Not Acceptable'.
Any help, even a start would be much appreciated.
This should perform as expected:
import pandas as pd
results = "input.csv"
df = pd.read_csv(results)
low = 0.95
high = 1.05
# The columns to check
cols = df.columns[2:]
# True when at least one value in the row lies strictly between low and high
df['Acceptable?'] = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
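The question asks for the strings 'Acceptable' and 'Not Acceptable' rather than booleans. A minimal sketch of one way to get them with numpy.where follows; the sample values and the Name column are made up, since the real CSV is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real columns come from the CSV in the question.
df = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Base 2019": [0.90, 1.20, 1.00],
    "FY26 Load": [1.00, 1.30, 1.10],
})
low, high = 0.95, 1.05
cols = ["Base 2019", "FY26 Load"]

# Both bounds are tested on the same cell before reducing across the row.
in_range = ((df[cols] > low) & (df[cols] < high)).any(axis=1)

# Translate the boolean result into the requested labels.
df["Acceptable?"] = np.where(in_range, "Acceptable", "Not Acceptable")
```

Testing both bounds per cell matters: checking "any value above low" and "all values below high" separately would reject a row such as [1.00, 1.10] even though 1.00 is in range.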
I have a data set and I want to drop some rows with a faster method. I tried the following code, but it took a long time.
I want to drop every user who makes fewer than 3 operations.
Every operation is stored in its own row, so user_id is not a unique ID in my data.
undesirable_users=[]
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i])<=3:
        undesirable_users.append(operations_per_user.index[i])
for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)
data is a DataFrame and operations_per_user is a Series created by: operations_per_user = data['user_id'].value_counts().
Why not just filter them? You don't need to loop at all.
You can get the filtered indexes by:
operations_per_user.index[operations_per_user <= 3]
And then you can filter these indexes from the df, making the solution:
data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]
EDIT
My understanding is that you want to remove any user that occurs less than 3 times in the data. You won't need to create a value_counts list for that, you could do a groupby and find the counts and then filter on that basis.
filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()
data = data[~data['user_id'].isin(filtered_user_ids)]
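A related sketch, not taken from the answer above: groupby(...).transform('size') gives each row its user's operation count, so the filter becomes a single boolean mask with no intermediate list. The sample frame, the op column and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: user 1 has four operations, user 2 two, user 3 one.
data = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3],
    "op": list("abcdefg"),
})

# Count of rows per user, broadcast back onto every row of that user.
ops_per_row = data.groupby("user_id")["user_id"].transform("size")

# Keep only the rows whose user has more than 3 operations.
filtered = data[ops_per_row > 3]
```

The threshold mirrors the <= 3 cutoff used in the question's code; adjust it if "fewer than 3" is the intended rule.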
If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:
data = data.drop(data.loc[data['operations_per_user'] <= 3].index)
Edit
Instead of creating a separate Series, you could add operations_per_user to data by mapping each user_id to its count:
data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())
You could either perform the drop as above or perform the selection with the inverse logical condition:
data = data.loc[data['operations_per_user'] > 3]
Original
It would be preferable if you could supply some more information about the variables used in your code.
If operations_per_user is a pandas Series, your first loop could be improved with:
undesirable_users=[]
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)
The function get_value() is deprecated, use loc or iloc instead. This is a good summary of loc and iloc, and here is a great pandas cheatsheet to reference.
You can use python lists as iterators; for your second loop:
for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)
Rather than dropping, you can simply select the rows you want to keep by inverting the logical condition.
First, select the user to keep only.
Then get a boolean list, length equal to data rows.
Finally, select the rows to keep.
keepusers = operations_per_user.loc[operations_per_user > 3].index
tokeep = [uid in keepusers for uid in data['user_id']]
newdata = data.loc[tokeep]
I have a pandas dataframe with a column that indicates which hour of the day a particular action was performed. So df['hour'] is many rows each with a value from 0 to 23.
I am trying to create dummy variables for things like 'is_morning', for example:
if df['hour'] >= 5 and < 12 then return 1, else return 0
A for loop doesn't work given the size of the data set, and I've tried some other stuff like
df['is_morning'] = df['hour'] >= 5 and < 12
Any suggestions??
You can just do:
df['is_morning'] = (df['hour'] >= 5) & (df['hour'] < 12)
i.e. wrap each condition in parentheses, and use &, which is an and operation that works across the whole vector/column.
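Since the question asks for 1/0 dummy values rather than True/False, the boolean result can be cast with astype(int). A small sketch with made-up hours, plus an equivalent form using Series.between for integer hours:

```python
import pandas as pd

# Hypothetical sample of hour values.
df = pd.DataFrame({"hour": [0, 5, 11, 12, 23]})

# Element-wise AND of the two conditions, then cast booleans to 1/0.
df["is_morning"] = ((df["hour"] >= 5) & (df["hour"] < 12)).astype(int)

# between() includes both endpoints by default, so 5..11 covers the same
# integer hours as 5 <= hour < 12.
df["is_morning_between"] = df["hour"].between(5, 11).astype(int)
```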
so I have a large pandas DataFrame that contains about two months of information with a line of info per second. Way too much information to deal with at once, so I want to grab specific timeframes. The following code will grab everything before February 5th 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and a multitude of times and being able to do this easily would be nice. I know pandas is capable of this because if my dates were the index rather than a column, it's easy to do, but they can't be the index because dates are repeated and therefore you receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
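Because the question mentions repeating this over 20 DataFrames, the mask can be wrapped in a small helper. This is only a sketch: select_time_range is a hypothetical name, the sample rows are made up, and the time column is assumed to hold datetimes.

```python
import pandas as pd


def select_time_range(frame, start, end, column="time"):
    """Return the rows where start < frame[column] < end (bounds excluded)."""
    values = frame[column]
    return frame[(values > start) & (values < end)]


# Hypothetical data with repeated timestamps, as described in the question.
sunflower = pd.DataFrame({
    "time": pd.to_datetime(
        ["2012-01-31", "2012-02-02", "2012-02-02", "2012-02-06"]
    ),
    "value": [1, 2, 3, 4],
})

data = select_time_range(
    sunflower, pd.Timestamp("2012-02-01"), pd.Timestamp("2012-02-05")
)
```

The mask works on a regular column, so duplicate timestamps are not a problem and no index change is needed.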