getting difference in years between two columns in pandas - python

I have a table as follows. The first column is the year, the second column is the type of pavement treatment, and the third column is the pavement score. I need to create a new column called 'year diff' by subtracting the year of the last treatment from the year of the current score. For example, year 2014 needs to subtract 2013, since treatment 9 was done in 2013, and the result, which is 1, needs to be recorded in the corresponding cell of col['year diff']. Likewise, year 2022 needs to subtract 2020, since treatment 10 was done in 2020.
Thanks a lot everyone for your help.
Sincerely
Wilson

Use:
import numpy as np

#mask of rows where a treatment was done
m = df['treatment'].notnull()
#create groups starting at each treatment row
s = m.cumsum()
#rows to blank out: everything before the first treatment, plus the treatment rows themselves
mask = (s == 0) | m
#subtract the first score of each group from the score
out = df['score'] - df['score'].groupby(s).transform('first')
#set NaN on the masked rows
df['year diff'] = np.where(mask, np.nan, out)
print (df)
year treatment score year diff
0 2010 NaN 1 NaN
1 2011 NaN 2 NaN
2 2012 NaN 3 NaN
3 2013 9.0 4 NaN
4 2014 NaN 5 1.0
5 2015 NaN 6 2.0
6 2016 NaN 7 3.0
7 2017 NaN 8 4.0
8 2018 NaN 9 5.0
9 2019 NaN 10 6.0
10 2020 10.0 11 NaN
11 2021 NaN 12 1.0
12 2022 NaN 13 2.0
13 2023 NaN 14 3.0
14 2024 NaN 15 4.0
15 2025 12.0 16 NaN
16 2026 NaN 17 1.0
17 2027 NaN 18 2.0
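For reference, here is a minimal self-contained sketch of the same grouping idea applied directly to the year column (my own variant, not part of the answer above); on the sample data it reproduces the year diff column shown:

import numpy as np
import pandas as pd

# sample frame reconstructed from the question; NaN marks years without treatment
df = pd.DataFrame({
    'year': list(range(2010, 2028)),
    'treatment': [np.nan]*3 + [9.0] + [np.nan]*6 + [10.0] + [np.nan]*4 + [12.0] + [np.nan]*2,
    'score': list(range(1, 19)),
})

m = df['treatment'].notna()                                    # treatment rows
s = m.cumsum()                                                 # group id, increments at each treatment
out = df['year'] - df['year'].groupby(s).transform('first')    # years since the last treatment
df['year diff'] = np.where((s == 0) | m, np.nan, out)          # NaN before the first treatment and on treatment rows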

IIUC, you could use:
df['identifier'] = (df['year'].diff().eq(1) & df['treatment'].notnull()).cumsum()
df['year diff'] = df.groupby('identifier')['identifier'].apply\
    (lambda x: pd.Series(np.where(x != 0, pd.Series(pd.factorize(x)[0] + 1).cumsum().shift(), np.nan))).values
print(df)
Or if you need to consider the difference of scores based on the value in treatment:
df['identifier'] = (df['year'].diff().eq(1) & df['treatment'].notnull()).cumsum()
df['year diff'] = df.groupby('identifier')['score']\
    .apply(lambda x: pd.Series(np.where(x != 0, x.diff().expanding().sum(), np.nan))).reset_index(drop=True)
df.loc[df['identifier']==0,'year diff']=np.nan
print(df)
year treatment score identifier year diff
0 2010 NaN 1 0 NaN
1 2011 NaN 2 0 NaN
2 2012 NaN 3 0 NaN
3 2013 9.0 4 1 NaN
4 2014 NaN 5 1 1.0
5 2015 NaN 6 1 2.0
6 2016 NaN 7 1 3.0
7 2017 NaN 8 1 4.0
8 2018 NaN 9 1 5.0
9 2019 NaN 10 1 6.0
10 2020 10.0 11 2 NaN
11 2021 NaN 12 2 1.0
12 2022 NaN 13 2 2.0
13 2023 NaN 14 2 3.0
14 2024 NaN 15 2 4.0
15 2025 12.0 16 3 NaN
16 2026 NaN 17 3 1.0
17 2027 NaN 18 3 2.0

If you want to do it using a for loop:
df = pd.DataFrame(mydata)
mylist = df.index[df['treatment'] != ''].tolist()
And now we subtract the year values:
re_list = []
for index, row in df.iterrows():
    if index > min(mylist):
        m = [i for i in mylist if i <= index]
        re_list.append(df.iloc[index]['year'] - df.iloc[max(m)]['year'])
    else:
        re_list.append(0)
df['Result'] = re_list
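For context, a hypothetical mydata (my assumption, since it isn't shown in the answer) that makes the snippet above runnable; treatments are stored as strings with '' for missing years so the != '' test works:

import pandas as pd

# hypothetical input, not given in the answer: '' marks years without treatment
mydata = {
    'year': [2010, 2011, 2012, 2013, 2014, 2015],
    'treatment': ['', '', '', '9', '', ''],
    'score': [1, 2, 3, 4, 5, 6],
}
df = pd.DataFrame(mydata)
mylist = df.index[df['treatment'] != ''].tolist()   # indices of treatment rows -> [3]
# running the loop above then gives Result = [0, 0, 0, 0, 1, 2]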

Related

Pandas: start a new group on every non-NA value

I am looking for a method to create an array of numbers that labels groups, based on the values in the 'number' column, if that's possible.
With this abbreviated example DF:
from numpy import nan
import pandas as pd

number = [nan, nan, 1, nan, nan, nan, 2, nan, nan, 3, nan, nan, nan, nan, nan, 4, nan, nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the int in the 'number' column - so there would effectively be groups labelled 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. if you had zeros: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
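If integer labels are wanted (as in the expected 'group' column), a small follow-up to the forward-fill idea, assuming the df built in the question:

# cast is safe after fillna, giving integer group labels 0..4
df['group'] = df['number'].ffill().fillna(0).astype(int)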

Forward fill non-NA values with last observation carried forward in Python

Suppose I had a column in a dataframe like:
colname
Na
Na
Na
1
2
3
4
Na
Na
Na
Na
2
8
5
44
Na
Na
Does anyone know of a function to forward fill the non-NA values with the first value in the non-NA run? To produce:
colname
Na
Na
Na
1
1
1
1
Na
Na
Na
Na
2
2
2
2
Na
Na
Use GroupBy.transform with 'first': identify the missing values with Series.isna, build groups with a cumulative sum (Series.cumsum), and finally restore the NaNs with Series.where and Series.duplicated:
s = df['colname'].isna().cumsum()
df['colname'] = df.groupby(s)['colname'].transform('first').where(s.duplicated())
print (df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Or filter only the non-missing values by inverting the mask m and processing only those groups:
m = df['colname'].isna()
df.loc[~m, 'colname'] = df[~m].groupby(m.cumsum())['colname'].transform('first')
print (df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Solution without groupby:
m = df['colname'].isna()
m1 = m.cumsum().shift().bfill()
m2 = ~m1.duplicated() & m.duplicated(keep=False)
df['colname'] = df['colname'].where(m2).ffill().mask(m)
print (df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
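For reproducibility, a self-contained version of the first approach above on the question's data (my own construction of the column, assuming NaN for the 'Na' entries):

import numpy as np
import pandas as pd

# reconstruction of the question's column
df = pd.DataFrame({'colname': [np.nan]*3 + [1, 2, 3, 4] + [np.nan]*4 + [2, 8, 5, 44] + [np.nan]*2})

s = df['colname'].isna().cumsum()                               # a new group starts after every run of NaNs
df['colname'] = (df.groupby(s)['colname'].transform('first')    # first value of each run
                   .where(s.duplicated()))                      # the leading NaN row of each group stays NaN
print(df)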
You could try groupby and cumsum with shift and transform('first'):
>>> df.groupby(df['colname'].isna().ne(df['colname'].isna().shift()).cumsum()).transform('first')
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>
Or try something like:
>>> x = df.groupby(df['colname'].isna().cumsum()).transform('first')
>>> x.loc[~x.duplicated()] = np.nan
>>> x
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>

Getting corresponding values in a groupby

I have a dataset similar to this
Serial   A    B
1        12
1        31
1
1        12
1        31   203
1        10
1        2
2        32   100
2        32   242
2        3
3        2
3        23   100
3
3        23
I group the dataframe based on Serial and find the maximum value of the A column per group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, and I retain only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial   A    B     A_MAX   B_corresponding
1        12         31      203
1        31
1
1        12
1        31   203
1        10
1        2
2        32   100   32      100
2        32   242
2        3
3        2          23      100
3        23   100
3
3        23
Now, for the B_corresponding column, I would like to get the B value that corresponds to A_MAX. I thought of locating the A_MAX value in A, but there can be duplicate max A values within a group. As an additional condition, in Serial 2 for example, I would prefer to get the smallest of the B values for the two rows where A is 32.
The idea is to use DataFrame.sort_values so the maximal A values come first in each group, then remove missing values with DataFrame.dropna, keep the first row per Serial with DataFrame.drop_duplicates, create a mapping Series with DataFrame.set_index, and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but you get mixed values (numeric and strings), so further processing may be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
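For anyone wanting to reproduce the frames above, here is one construction of the sample data (my assumption that the blank cells are NaN); running the snippets above on it gives the printed results:

import numpy as np
import pandas as pd

# reconstruction of the question's table
df = pd.DataFrame({
    'Serial': [1]*7 + [2]*3 + [3]*4,
    'A': [12, 31, np.nan, 12, 31, 10, 2, 32, 32, 3, 2, 23, np.nan, 23],
    'B': [np.nan, np.nan, np.nan, np.nan, 203, np.nan, np.nan, 100, 242, np.nan, np.nan, 100, np.nan, np.nan],
})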
You could also use dictionaries to achieve the same result if you are not inclined to use only pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
agg_df = []
for serial, a in serial_to_a_mapping.items():
    agg_df.append((serial, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this to the original dataframe and mask duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft

How to select conditional rows with groupby?

I want to select rows with groupby conditions.
import pandas as pd
import numpy as np
dftest = pd.DataFrame({'A': ['Feb', np.nan, 'Air', 'Flow', 'Feb',
                             'Beta', 'Cat', 'Feb', 'Beta', 'Air'],
                       'B': ['s', 's', 't', 's', 't', 's', 't', 't', 't', 't'],
                       'C': [5, 4, 3, 2, 1, 7, 6, 5, 4, 3],
                       'D': [4, np.nan, 3, np.nan, 2,
                             np.nan, 2, 3, np.nan, 7]})

def filcols3(df, dd):
    if df.iloc[0]['D'] == dd:
        return df

dd = 4
grp = dftest.groupby('B').apply(filcols3, dd)
the result of grp is:
A B C D
B
s 0 Feb s 5 4.0
1 NaN s 4 NaN
3 Flow s 2 NaN
5 Beta s 7 NaN
this is what I want.
However, if I use the following code (part 2):
def filcols3(df, dd):
    if df.iloc[0]['D'] <= dd:
        return df

dd = 3
the result is:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 Air t 3.0 3.0
3 NaN NaN NaN NaN
4 Feb t 1.0 2.0
5 NaN NaN NaN NaN
6 Cat t 6.0 2.0
7 Feb t 5.0 3.0
8 Beta t 4.0 NaN
9 Air t 3.0 7.0
I'm surprised by this result; I meant to get:
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
What's wrong with the code in part 2? How can I get the final result I want?
apply's behaviour is a little non-intuitive here, but if the idea is to filter out entire groups based on a specific condition per group, you can use GroupBy.transform and get a mask to filter df:
df[df.groupby('B')['D'].transform('first') <= 3]
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
Or, fixing your code,
df[df.groupby('B')['D'].transform(lambda x: x.values[0] <= 3)]
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
You may check with filter:
dftest.groupby('B').filter(lambda x : any(x['D'].head(1)<=3))
Out[538]:
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
Or without groupby, using drop_duplicates:
s=df.drop_duplicates('B').D<=3
df[df.B.isin(df.loc[s.index,'B'][s])]
Out[550]:
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
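As a side note on the transform approach, the same mask idea also reproduces the first result (the dd == 4 case). A small sketch, assuming the dftest frame defined in the question:

# keep groups whose first D value equals 4; selects rows 0, 1, 3, 5 (the 's' group),
# matching the part-1 result but without the extra group level in the index
grp_eq = dftest[dftest.groupby('B')['D'].transform('first') == 4]
print(grp_eq)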

In python, reading multiple CSV's, with different headers, into one dataframe

I have dozens of csv files with similar (but not always exactly the same) headers. For instance, one has:
Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site
One has:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site
(Notice one lacks "U_Global" and "U_IR", the other has "Diffuse2" instead of "Diffuse")
I know how to pass multiple csv's into my script, but how do I have each csv pass values only to the columns it actually has, and fill NaN into all the other columns of that row?
Ideally I'd have something like:
'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"
The other caveat is that this dataframe needs to be appended to: it needs to persist as multiple csv files are passed into it. I think I'll probably have it write out to its own csv at the end (it's eventually going to NetCDF4).
Assuming you have the following CSV files:
test1.csv:
year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
test2.csv:
year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203
test3.csv:
year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
Solution:
import glob
import pandas as pd
files = glob.glob(r'd:/temp/test*.csv')
def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df
print(get_merged(files))
Output:
year month day Direct Direct Direct2 File3
0 1992 1 1 11.0 21.0 201.0 text1
1 2013 5 30 11.0 21.0 202.0 text2
2 2004 9 1 11.0 21.0 203.0 text3
3 2016 1 1 NaN NaN NaN unmatching_date
UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution wouldn't work here, because it joins by index:
In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
or using index_col=None explicitly:
In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
The following more idiomatic solution works, but it changes the original order of columns and rows / data:
In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
...:
...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
...:
...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
...:
Out[224]:
month day year Direct Direct Direct2 File3
0 1 1 1992 11.0 21.0 201.0 text1
1 1 1 2016 NaN NaN NaN unmatching_date
2 5 30 2013 11.0 21.0 202.0 text2
3 9 1 2004 11.0 21.0 203.0 text3
Can't pandas take care of this automagically?
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
If your indices overlap, don't forget to add 'ignore_index=True'
First, run through all the files to collect the full set of headers:
import os

csv_path = './csv_files'
csv_separator = ','
full_headers = []
for fn in os.listdir(csv_path):
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
        # add any headers we have not seen yet
        full_headers += [h for h in headers if h not in full_headers]
Then write your header line into your output file, and run again through all the files to fill it.
You can use csv.DictReader(open('myfile.csv')) to easily match each value to its designated column.
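To make that concrete, here is a minimal two-pass sketch using only the standard library (the './csv_files' location and output name are my assumptions); it writes one combined csv with the union of headers and 'NaN' for missing columns:

import csv
import glob
import os

csv_path = './csv_files'        # assumed location of the input files
out_path = 'combined.csv'       # assumed output file

files = sorted(glob.glob(os.path.join(csv_path, '*.csv')))

# first pass: build the union of all headers, preserving first-seen order
full_headers = []
for fn in files:
    with open(fn, newline='') as f:
        for h in csv.DictReader(f).fieldnames:
            if h not in full_headers:
                full_headers.append(h)

# second pass: write every row; columns a file lacks get 'NaN' via restval
with open(out_path, 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=full_headers, restval='NaN')
    writer.writeheader()
    for fn in files:
        with open(fn, newline='') as f:
            for row in csv.DictReader(f):
                writer.writerow(row)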
