I have a dataframe:
df = pd.DataFrame({'year':[2000,2000,2000,2001,2001,2002,2002,2002],'ID':['a','b','c','a','b','a','b','c'],'values':[1,2,3,4,5,7,8,9]})
I would like to create a column that has the lag value for each ID-year. For example, ID 'a' in 2000 has a value of 1, so ID 'a' in 2001 should have a pre-value of 1. The key point is that if an ID doesn't have a value in the previous year (i.e., the years are not continuous for that ID), then the pre-value should be NaN instead of the value from two years ago. For example, ID 'c' doesn't show up in 2001, so for 2002, ID 'c' should have pre-value = NaN.
Ideally, the final output should look like the following:
I tried the df.groupby(['ID'])['values'].shift(1), but it gives the following:
The problem is that when ID 'c' doesn't have a value from one year ago, the value from two years ago is used. I also tried a multi-index shift, which gives me the same result.
df.set_index(['year','ID'], inplace = True)
df.groupby(level=1)['values'].shift(1)
The thing that works is the answer mentioned here. But since my dataframe is fairly large, the merge kills the kernel. So far, I haven't figured out a better way to do it. I hope I explained my problem clearly.
Suppose the year column is unique for each ID, i.e., there are no duplicated years for any specific ID. Then you can shift the values first, and afterwards replace any shifted value with NaN wherever the difference between the current row's year and the previous row's year is not equal to 1:
import pandas as pd
import numpy as np
# shift within each ID, then blank out lags where the year gap is more than 1
df['pre_value'] = df.groupby('ID')['values'].shift(1)
df['pre_value'] = df.pre_value.where(df.groupby('ID').year.diff() == 1, np.nan)
df
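For the sample frame above, this should produce the following (note ID 'c' in 2002 gets NaN because 2001 is missing):

   year ID  values  pre_value
0  2000  a       1        NaN
1  2000  b       2        NaN
2  2000  c       3        NaN
3  2001  a       4        1.0
4  2001  b       5        2.0
5  2002  a       7        4.0
6  2002  b       8        5.0
7  2002  c       9        NaN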
A reindex approach: expand each ID to a dense year range, shift once, then drop the filler rows.
def reindex_min_max(df):
    # reindex over the full year range so missing years become NaN rows
    mn = df.year.min()
    mx = df.year.max() + 1
    d = df.set_index('year').reindex(pd.RangeIndex(mn, mx, name='year'))
    return pd.concat([d, d['values'].shift().rename('pre_value')], axis=1)

df.groupby('ID')[['year', 'values']].apply(reindex_min_max) \
    .sort_index(level=[1, 0]).dropna(subset=['values']).reset_index()
I have a data frame with months (by year) and ID numbers. I am trying to calculate the attrition rate, but I am getting stuck on obtaining unique ID counts for rows where the month equals a certain month in pandas.
ID.  Month
1    Sept. 2022
2    Oct. 2022
etc., with possible duplicates in ID and 1.75 years' worth of data.
import pandas as pd
path = some path on my computer
data = pd.read_excel(path)
if data["Month"] == "Sept. 2022":
    ID_SEPT = data["ID."].unique()
    return ID_SEPT
I am trying to discover what I am doing incorrectly in this if-then statement. Ideally, I am trying to collect all the unique ID values for each month of each year, to then calculate the attrition rate. Is there something obvious I am doing wrong here?
Thank you.
I tried an if-then statement, and I was expecting unique value counts of ID per month.
You need to use one of the iterator functions, like items().
for (columnName, columnData) in data.items():
    if columnName == 'Month':
        [code]
The way you do this with a dataframe, conceptually, is to filter the entire dataframe to be just the rows where your comparison is true, and then do whatever (get uniques) from there.
That would look like this:
filtered_df = df[df['Month'] == 'Sept. 2022']
ids_sept = list(filtered_df['ID.'].unique())
The first line there can look a little strange, but what it is doing is:
df['Month'] == 'Sept. 2022' will return an array/column/series (it actually returns a series) of True/False whether or not the comparison is, well, true or false.
You then run that series of bools through df[series_of_bools] that filters the dataframe to return only the rows where it is True.
Thus, you have a filter.
If you are looking for the number of unique items, rather than the list of unique items, you can also use filtered_df['ID.'].nunique() and save yourself the step later of getting the length of the list.
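That looks like this, using the filtered_df from above:

n_sept = filtered_df['ID.'].nunique()  # number of distinct September IDs
print(n_sept)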
You are looking for pandas.groupby.
Use it like this to get the unique values of each group (Month):
data.groupby("Month")["ID."].unique() # You have a . after ID in your example, check if thats correct
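If you ultimately want the counts per month rather than the arrays of IDs (handy for an attrition rate), nunique fits the same pattern; a small sketch:

# one row per month, with the number of distinct IDs seen that month
unique_counts = data.groupby("Month")["ID."].nunique()
print(unique_counts)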
Try this:
data[data.Month == 'Sept. 2022']['ID.'].unique()
I'm new to the programming world, so I have a question about a dataframe and iteration problem.
(I'm using Python.)
I have the following:
This is my dataframe:
In the first column (x) I have the date, and in the second column (y) I have some values (the total shape is (119, 2)).
My question is:
If I want to select the date "2020-12-01", sum the 14 previous values, assign this result to that date, and then do the same for the next date, how can I do that?
(In the image above, I put the blue color over the date and red over the values that I want to add to the blue value.)
I tried to do the following:
final_value = 0
for i in data["col_name"]:
    final_value = data["col_name"].iloc[i:14].sum()
but the output is 0.
So, can someone give me some ideas to solve this problem?
Thanks for reading.
Convert the x column to datetime:
df['x'] = pd.to_datetime(df['x'], format='%Y-%m-%d')
Use rolling to select a 14-day window to add up:
df.rolling("14D", on="x").sum()['y']
Try this on your full dataset
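A minimal sketch on made-up daily data, using the same x/y column names (note that the window includes the current row by default; pass closed='left' if you want only the 14 previous values):

import pandas as pd

df = pd.DataFrame({
    'x': pd.date_range('2020-11-18', periods=20, freq='D'),
    'y': range(20),
})
# time-based window: sum everything within the previous 14 days, excluding the current row
df['sum_14d'] = df.rolling('14D', on='x', closed='left')['y'].sum()
print(df[df['x'] == '2020-12-01'])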
I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np
index = pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df = pd.DataFrame(np.random.randn(len(index), 2).cumsum(axis=0), columns=['A', 'B'], index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all the other columns to get the payoffs from above.
I feel like I need to use a for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance
I think you need to convert the DatetimeIndex to quarterly periods with to_period, then to strings, and finally map with the dict.
For the comparison you need gt with sub:
d = {'2017Q4':30, '2018Q1':40, '2018Q2':50, '2018Q3':60, '2018Q4':70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
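To sanity-check the mapping, you can peek at the per-row strike before subtracting; for the dict above, the first rows fall in 2017Q4 (output shown approximately):

print(strike.head(2))
# 2017-10-01 00:00:00    30
# 2017-10-01 00:30:00    30
# dtype: int64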
Mapping your dataframe index through a dictionary can be a starting point.
import random
import pandas as pd

a = dict()
a[2017] = 30
a[2018] = 40
ranint = random.choices([30, 35, 40, 45], k=21936)
# given your index used in the example
df = pd.DataFrame({'values': ranint}, index=index)
After adding the year and strike columns (code below), the frame looks like this:
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
# use bracket assignment: df.year = ... only sets an attribute, not a column,
# and df.values would collide with the built-in .values attribute
df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df['returns'] = df['values'] - df['strike']
Then you can extract the returns that are greater than 0:
df[df['returns'] > 0]
I have a CSV dataset where I want to calculate the average for all rows. The average should be calculated from the data starting at column 14. This is what I have done so far, but I am still not getting the average value. Can someone help me with this?
I am also getting confused by this axis thing.
import pandas as pd

file = 'dataset.csv'
df = pd.read_csv(file)
d_col = df[df.columns[14:]]
mean_value = d_col['mean'] = d_col.mean(axis=1, skipna=True, numeric_only=True)
print(mean_value)
d_col.to_csv('out.csv')
It's a very strange indexing syntax you're using. A clearer way should be:
d_col = df.iloc[:, 14:]
axis = 0 means taking the average by column, and axis = 1 by the row, which you seem to be doing correctly. I'm not sure what exactly you mean by not getting the average. The d_col should contain your original data and a new column named "mean" containing the result.
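If the axis argument is the confusing part, a tiny sketch makes it concrete:

import pandas as pd

small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(small.mean(axis=0))  # down the rows -> one mean per column: a 1.5, b 3.5
print(small.mean(axis=1))  # across the columns -> one mean per row: 2.0 and 3.0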
Because you didn't provide sample data, see the following sample code. The first column is a text column that should be ignored, whereas the other columns in the DataFrame df are the ones that should be used to calculate the mean value.
import numpy as np
import pandas as pd

# prepare some dataset
letters = 'abcdefghijklmnopqrstuvwxyz'
rows = 10
col1 = np.array(list(letters))[np.random.permutation(len(letters))[:rows]]
df = pd.concat([pd.DataFrame(col1), pd.DataFrame(np.random.randn(rows, 10))], axis=1)
result = df.iloc[:, 1:].mean(axis=1)
The result then looks like this:
0 0.693024
1 -0.356701
2 0.082385
3 -0.115622
4 -0.060414
5 0.104119
6 -0.435787
7 0.023327
8 -0.144272
9 0.363254
dtype: float64
/edit: Changed the answer above to use df.iloc instead of df[df.columns[...]], as the latter causes problems when two columns have the same name. Please mark peidaqi's answer as the correct one.
The issue lay here: I was saving d_col as the output CSV file instead of mean_value. It's silly, but I guess that's how you learn to pick things up. Thanks @peidaqi and others for your explanations.
Currently I have a series of strings as a column in a pandas dataframe, where each string represents a particular span of years in "yyyy-yyyy" format; for example, "2004-2005" is a single string value in this column.
I wanted to know if there is any way to convert this from a string to something similar to a datetime format.
The purpose of this is to calculate the difference in years between the values of this column and another, similar column. For example, something like the below:
col 1      col2       Answer (Total years)
2004-2005  2006-2007  3
Note: One of the ways I thought of doing was to make a dictionary mapping each year to a unique integer value and then calculate the difference between them.
Although I was wondering if there is any simpler way of doing it.
It looks like you're subtracting the first year in col 1 from the last year in col2. In which case I'd use str.extract with expand=False, so it returns a Series, and convert the result to a number:
In [11]: pd.to_numeric(df['col 1'].str.extract(r'(\d{4})', expand=False))
Out[11]:
0    2004
Name: col 1, dtype: int64

In [12]: pd.to_numeric(df['col2'].str.extract(r'-(\d{4})', expand=False)) - pd.to_numeric(df['col 1'].str.extract(r'(\d{4})', expand=False))
Out[12]:
0    3
dtype: int64
What do you mean by "something similar to a datetime object"? Datetimes aren't designed to represent date ranges.
If you want to create a pair of datetime objects you could do something like this:
import datetime

[datetime.datetime.strptime(x, '%Y') for x in '2005-2006'.split('-')]
Alternatively you could try using a Pandas date_range object if that's closer to what you want.
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.date_range.html
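A minimal sketch of that, assuming a year-start frequency ('AS') is what you want:

import pandas as pd

years = '2005-2006'.split('-')
rng = pd.date_range(start=years[0], end=years[1], freq='AS')  # one timestamp per year start
print(rng)  # DatetimeIndex(['2005-01-01', '2006-01-01'], ...)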
If you are trying to find the difference between the lowest year and the highest year, here is a go at it:
col1="2004-2005"
col2="2006-2007"
col1=col1.split("-") # make a list of the years in col1 ['2004', '2005']
col2=col2.split("-") # make a list of the years in col2 ['2006', '2007']
biglist=col1+col2 #add the two list
biglist.sort() #sort the list from lowest year to highest year
Answer=int(biglist[len(biglist)-1])-int(biglist[0]) #find the difference between lowest and highest year
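The same idea condensed, a sketch (the string sort above only works because all years have four digits; converting to int first is safer):

years = [int(y) for y in ("2004-2005" + "-" + "2006-2007").split("-")]
Answer = max(years) - min(years)  # 2007 - 2004 = 3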