I have yet another Python question. This one probably can be achieved with help of a loop, however I was looking for a leaner solution
Suppose that I have a data frame like this one:
I am looking for a code to generate column ID which is no more than a descending counter for when the value in column Sold changes - ie, for each Salesman I would like to have the ID column retrieving the number of days left until the sold value changes.
For example, on date 01/01/2018, salesman Joe would be having ID = 2 because the signal changes in 2 days.
Any ideas on how to solve this one?
Many thanks.
J
Setup:
df = pd.DataFrame([
pd.Series(pd.date_range('1/1/2018', '1/7/2018').append(pd.date_range('1/1/2018', '1/7/2018'))),
pd.Series(['Joe']*7 + ['Helen']*7),
pd.Series([1,1,0,0,0,0,1,0,1,1,0,1,0,0]),
]).T
df.columns = ['date', 'salesman', 'sold']
df['date'] = pd.to_datetime(df['date'])
Computation:
df['changes'] = df.groupby('salesman')['sold'].expanding().apply(lambda x: (np.diff(x) != 0).sum()).reset_index(drop = True)
df['id'] = df.groupby(['salesman', 'changes']).apply(lambda grp: pd.Series(len(grp) - grp.sort_values('date').reset_index().index)).reset_index(drop = True)
df.drop('changes', axis = 1, inplace = True)
Results:
>>> df
date salesman sold id
0 2018-01-01 Joe 1 2
1 2018-01-02 Joe 1 1
2 2018-01-03 Joe 0 4
3 2018-01-04 Joe 0 3
4 2018-01-05 Joe 0 2
5 2018-01-06 Joe 0 1
6 2018-01-07 Joe 1 1
7 2018-01-01 Helen 0 1
8 2018-01-02 Helen 1 2
9 2018-01-03 Helen 1 1
10 2018-01-04 Helen 0 1
11 2018-01-05 Helen 1 1
12 2018-01-06 Helen 0 2
13 2018-01-07 Helen 0 1
Explanation:
create a 'changes' column that increments every-time an individual salesperson's 'sold' field changes. Then for each increment group (still grouped by salesperson), get the length of this group (which is equal to how subsequent rows of this value there are) and subtract from that value the index of each row, sorted by date. The result of that subtraction will be a series that descends from the length of the group to 1. Reset the index and merge back to your original dataframe. It's a bit of a confusing solution but it should work.
Related
This question already has an answer here:
How to count unique occurrences grouping by changing time period?
(1 answer)
Closed 1 year ago.
I need to aggregate data between constant date, like first day of year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 14
02-02-2012 5 15
05-02-2012 6 16
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in Pandas data frame.The obvious, but very clumsy way is to create for loop between the starting dates and the moving one. The problem looks like a popular one. Is there some reasonable pandas builtin way for such type of computation? Regarding counting unique I also want to avoid stacking lists, as I have large number of rows and unique values.
I was checking out Pandas window functions, but it doesn't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
df["Month to date sum"] = df.groupby(df["created_at"].dt.month)["value"].transform('cumsum')
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.month)["value"].apply(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].apply(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4
In one data frame (called X) I have Patient_admitted_id, Date, Hospital_ID of tested covid positive patients (I show this data frame below). I want to generate a separated data frame (called Y) with Dates of the calendar, total number of covid Cases and cumulative cases.
I dont know how to generate the column Cases
X data frame:
data = {'Patient_admitted_id': [214321,224323,3234234,23423],
'Date': ['2021-01-22', '2021-01-22','2021-01-22','2021-01-20'], # This is just an example I have created here, the real X data frame contains proper date values generated with Datatime
'Hospital_ID': ['1', '2','3','2'],
}
X = pd.DataFrame(data, columns=['Patient_admitted_id','Date', 'Hospital_ID' ])
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desirable Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...
Use DataFrame.resample by days with counts by Resampler.size with Series.cumsum for cumulative counts:
X['Date']= pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print (df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
You can use groupby on Date column and call size to get the count of individual Dates, you can then simply call Cumsum on cases to get the desired output
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
out variable holds the desired dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
Just adding a solution with pd.Grouper
X['Date']= pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df.Cases.cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
I have a dataframe made up of daily data across a number of columns;
A B C D
01/01/2020 12 3 2 1
02/01/2020 8 14 5 1
03/01/2020 45 4 1 3
.
.
.
.
31/12/2021 5 1 5 3
The data is generated automatically but I would to be able to overwrite data by month or by date.
I understand something like this could reset a value but is there anyway to do it in bulk by month or between two certain dates?
df.set_value('C', 'x', 10)
Any help much appreciated!
Create DatetimeIndex first and the set values in DataFrame.loc, also here working partialy string indexing for set values of month:
df.index = pd.to_datetime(df.index, dayfirst=True)
df.loc['2020-01-02','C'] = 100
df.loc['2020-01','B'] = 500
df.loc['2020-01-01':'2020-01-02','A'] = 0
#select multiple columns by list
df.loc['2020-01-03':'2021-12-31', ['C','D']] = 1000
print (df)
A B C D
2020-01-01 0 500 2 1
2020-01-02 0 500 100 1
2020-01-03 45 500 1000 1000
2021-12-31 5 1 1000 1000
Let's say we have this dataset:
df = pd.DataFrame({'ID': [1,1,1,1], 'Year': [2007, 2008, 2010,2011], 'Program': ['A', 'B', 'A', 'A']})
ID Year Program
0 1 2007 A
1 1 2008 B
2 1 2010 A
3 1 2011 A
I'd like to groupby ID and Year and then for each row within that group create a new variable Any, check whether the next year exists. If that year+1 exists then it should be 1 and if it does not, it should be 0, and the final row, should be Nan:
ID Year Program Any
0 1 2007 A 1.0
1 1 2008 B 0.0
2 1 2010 A 1.0
3 1 2011 A NaN
I apologize that I do not have any 'what I've tried'. Once I've gotten past using groupby, I cannot figure out how to access the entire groups data while assigning values to each individual row.
If pair ID, Year are unique, a merge on ID, Year would work:
s = df.merge(df.assign(Year=df['Year'].sub(1),
dummy=1), on=['ID','Year'],
how='left')['dummy']
df['Any'] = s.fillna(0)
Output, note that the largest year also filled with 0:
ID Year Program Any
0 1 2007 A 1.0
1 1 2008 B 0.0
2 1 2010 A 1.0
3 1 2011 A 0.0
Here is one way with groupby + diff
s=df.groupby('ID')['Year'].diff(-1)
s[s.notnull()]=s.eq(-1).astype(int)
s
Out[209]:
0 1.0
1 0.0
2 1.0
3 NaN
Name: Year, dtype: float64
df['Any']=s
I would like to calculate the number of instances two criteria are fulfilled in a Pandas DataFrame at a different index value. A snipped of the DataFrame is:
GDP USRECQ
DATE
1947-01-01 NaN 0
1947-04-01 NaN 0
1947-07-01 NaN 0
1947-10-01 NaN 0
1948-01-01 0.095023 0
1948-04-01 0.107998 0
1948-07-01 0.117553 0
1948-10-01 0.078371 0
1949-01-01 0.034560 1
1949-04-01 -0.004397 1
I would like to count the number of observation for which USRECQ[DATE+1]==1 and GDP[DATE]>a if GDP[DATE]!='NAN'.
By referring to DATE+1 and DATE I mean that the value of USRECQ should be check at the subsequent date for which the value of GDP is examined. Unfortunately, I do not know how to address the deal with the different time indices in my selection. Can someone kindly advise me on how to count the number of instances properly?
One may of achieving this is to create a new column to show what the next value of 'USRECQ' is:
>>> df['USRECQ NEXT'] = df['USRECQ'].shift(-1)
>>> df
DATE GDP USRECQ USRECQ NEXT
0 1947-01-01 NaN 0 0
1 1947-04-01 NaN 0 0
2 1947-07-01 NaN 0 0
3 1947-10-01 NaN 0 0
4 1948-01-01 0.095023 0 0
5 1948-04-01 0.107998 0 0
6 1948-07-01 0.117553 0 0
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
9 1949-04-01 -0.004397 1 NaN
Then you could filter your DataFrame according to your requirements as follows:
>>> a = 0.01
>>> df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a) & (pd.notnull(df['GDP']))]
DATE GDP USRECQ USRECQ NEXT
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
To count the number of rows in a DataFrame, you can just use the built-in function len.
I think the DataFrame.shift method is the key to what you seek in terms of looking at the next index.
And Numpy's logical expressions can come in really handy for these sorts of things.
So if df is your dataframe then I think what you're looking for is something like
count = df[np.logical_and(df.shift(-1)['USRECQ'] == 1,df.GDP > -0.1)]
The example I used to test this is on github.