Creating some new columns based on old ones - Python

Q1. Create a new column Age: the current year (2019) minus the car's Year. If Age is 0, set it to 1. My code is below:
df = pd.read_csv('car_train_data.csv',sep= ";",index_col= 0)
#method1
df["age"]= df[['age']].applymap(lambda x: 1 if pd(x) = 0)
#method2
df["age"]= (df["age"] == 0).all(1)
But it failed... Can anyone help with this?

I'm not sure what your data looks like, but try this:
import numpy as np
df['age'] = np.where(df['age']==0, 1, df['age'])

Toy example:
Input
year model
0 2019 a
1 2018 b
2 2017 c
3 2019 d
df = pd.DataFrame({
    'year': [2019, 2018, 2017, 2019],
    'model': ['a', 'b', 'c', 'd']
})
df['age'] = np.where((2019 - df.year) == 0, 1, 2019 - df.year)
df
Output
year model age
0 2019 a 1
1 2018 b 1
2 2017 c 2
3 2019 d 1
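As a side note (a sketch of my own, not part of the original answer), clip can do the "at least 1" step without np.where:
df['age'] = (2019 - df['year']).clip(lower=1)
Note it also clips negative values up to 1, unlike the np.where version, which only remaps 0.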

Pandas groupby mean of only positive values

How to get mean of only positive values after groupby in pandas?
MWE:
import numpy as np
import pandas as pd
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)
print(flights.iloc[:2,:4])
print()
not_cancelled = flights.dropna(subset=['dep_delay','arr_delay'])
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
      .mean().reset_index()
      )
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
print(df.head())
This gives all avg_delay2 values as 16.66.
(336776, 19)
year month day dep_time
0 2013 1 1 517.0
1 2013 1 1 533.0
year month day arr_delay avg_delay2
0 2013 1 1 12.651023 16.665681
1 2013 1 2 12.692888 16.665681
2 2013 1 3 5.733333 16.665681
3 2013 1 4 -1.932819 16.665681
4 2013 1 5 -1.525802 16.665681
Which is WRONG.
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
When I do the same thing in R:
library(nycflights13)
not_cancelled = flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
df = not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    # average delay
    avg_delay1 = mean(arr_delay),
    # average positive delay
    avg_delay2 = mean(arr_delay[arr_delay > 0]))
head(df)
It gives correct output for avg_delay2.
year month day avg_delay1 avg_delay2
2013 1 1 12.651023 32.48156
2013 1 2 12.692888 32.02991
2013 1 3 5.733333 27.66087
2013 1 4 -1.932819 28.30976
2013 1 5 -1.525802 22.55882
2013 1 6 4.236429 24.37270
How to do this in Pandas?
I would filter the positives before the groupby:
df = (not_cancelled[not_cancelled.arr_delay > 0]
      .groupby(['year','month','day'])['arr_delay']
      .mean().reset_index()
      )
df.head()
Its arr_delay column then matches the R avg_delay2 values shown above (32.48, 32.03, ...). Your version fails because, once the groupby has completed, df is a separate dataframe, so
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
assigns a single scalar, the positive mean over the whole dataset (16.67), to every row of df['avg_delay2'].
Edit: Similar to R, you can do both in one shot using agg:
def mean_pos(x):
    return x[x > 0].mean()

df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
      .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
      )
df.head()
Note that as of pandas 0.23, using a dictionary in a groupby agg on a Series is deprecated and will be removed in a future version, so we cannot use that method.
Warning
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
      .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
      )
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version.
So, to tackle that problem in this specific case, I came up with another idea:
create a new column making all non-positive values NaN, then do the usual groupby.
import numpy as np
import pandas as pd
# read data
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
# select flights that are not cancelled
df = flights.dropna(subset=['dep_delay','arr_delay']).copy()  # .copy() avoids SettingWithCopyWarning below
# create new column to fill non-positive with nans
df['arr_delay_pos'] = df['arr_delay']
df.loc[df.arr_delay_pos <= 0,'arr_delay_pos'] = np.nan
df.groupby(['year','month','day'])[['arr_delay','arr_delay_pos']].mean().reset_index().head()
It gives:
year month day arr_delay arr_delay_pos
0 2013 1 1 12.651023 32.481562
1 2013 1 2 12.692888 32.029907
2 2013 1 3 5.733333 27.660870
3 2013 1 4 -1.932819 28.309764
4 2013 1 5 -1.525802 22.558824
Sanity check
# sanity check
a = df.query(""" year==2013 & month==1 & day==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
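For what it's worth (my addition, assuming pandas 0.25 or later), named aggregation brings back the R-style one-shot without the deprecated dict:
df = (not_cancelled.groupby(['year', 'month', 'day'])['arr_delay']
      .agg(avg_delay1='mean',
           avg_delay2=lambda s: s[s > 0].mean())   # mean of positive delays only
      .reset_index())
df.head()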

Python Pandas: Index is overlapped? How to fix

I have created a dataframe after importing weather data, now called "weather".
The end goal is to be able to view data for specific month and year.
It started with the years as a column and the months as headers (screenshots omitted).
Then I ran weather = weather.T to transpose the dataframe, and weather.columns = weather.iloc[0] to promote the first row to the header.
But now the "year" and "month" labels are stuck in the index (I think?). How would I get them to show up as a normal header row?
Thanks for looking! Will appreciate any help :)
Please note I will remove the first row with the years in it, so don't worry about that part.
This just means the pd.Index object underlying your pd.DataFrame object, unbeknownst to you, has a name:
df = pd.DataFrame({'YEAR': [2016, 2017, 2018],
                   'JAN': [1, 2, 3],
                   'FEB': [4, 5, 6],
                   'MAR': [7, 8, 9]})
df.columns.name = 'month'
df = df.T
df.columns = df.iloc[0]
print(df)
YEAR 2016 2017 2018
month
YEAR 2016 2017 2018
JAN 1 2 3
FEB 4 5 6
MAR 7 8 9
If this really bothers you, you can use reset_index to move the index into a regular column and then drop the extra heading row. At the same time, you can remove the columns name:
df = df.reset_index().drop(0)
df.columns.name = ''
print(df)
month 2016 2017 2018
1 JAN 1 2 3
2 FEB 4 5 6
3 MAR 7 8 9
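As an aside (my variant, not the original answer's), starting from the df right after df.columns = df.iloc[0], the same cleanup reads a little more explicitly:
df = df.iloc[1:]                   # drop the duplicated YEAR heading row
df = df.rename_axis(None, axis=1)  # clear the leftover 'YEAR' columns name
df = df.reset_index()              # 'month' becomes a regular column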

Getting the years since an event

I am working with pandas on a dataset in which maintenance work is done at various sites. The maintenance is done at random intervals, sometimes yearly and sometimes never. I want to find the years since the last maintenance action at each site, provided an action has been made on that site. There can be more than one action for a site, and the occurrences of actions are random. For the years prior to the first action, the years since the last action cannot be known, because that information is not in the dataset.
I give only two sites in the following example but in the original dataset, I have thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year.
Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
Given this dataset, I want to produce the following:
Site Year Action Measurement Years_Since_Last_Action
A 2014 1 100 1
A 2015 0 150 2
A 2016 0 300 3
A 2017 0 80 4
B 2015 1 250 1
B 2016 1 60 1
B 2017 0 110 2
Please observe that the year 2014 is filtered out for Site B, because it precedes that site's first action.
Many thanks in advance!
I wrote the code myself. It is messy but does the job for me. :)
The solution assumes that df_select has an integer index.
df_select = df_select[df_select['Site'].map(df_select.groupby('Site')['Action'].max() == 1)]
years_since_action = pd.Series(dtype='int64')
gbo = df_select.groupby('Site')
for key, group in gbo:
    indices_with_ones = group[group['Action'] == 1].index
    indices = group.index
    group['Years_since_action'] = 0
    group.loc[indices_with_ones, 'Years_since_action'] = 1
    for idx_with_ones in indices_with_ones.sort_values(ascending=False):
        for idx in indices:
            if group.loc[idx, 'Years_since_action'] == 0 and idx > idx_with_ones:
                group.loc[idx, 'Years_since_action'] = idx - idx_with_ones + 1
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    years_since_action = pd.concat([years_since_action, group['Years_since_action']])
df_final = pd.merge(df_select, pd.DataFrame(years_since_action),
                    how='left', left_index=True, right_index=True)
Here is how I would approach it:
import pandas as pd
from io import StringIO
import numpy as np
s = '''Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
'''
ss = StringIO(s)
df = pd.read_csv(ss, sep=r"\s+")
df_maintain = df[df.Action==1][['Site', 'Year']]
df_maintain.reset_index(drop=True, inplace=True)
df_maintain
def find_last_maintenance(x):
    # actions recorded for this row's site
    df_temp = df_maintain[x.Site == df_maintain.Site]
    gap = [0]
    for ind, row in df_temp.iterrows():
        if x.Year >= row['Year']:
            gap.append(x.Year - row['Year'] + 1)
    return gap[-1]

df['Gap'] = df.apply(find_last_maintenance, axis=1)
df = df[df.Gap != 0]
This generates the desired output.
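For thousands of sites, a vectorized sketch of the same idea should be faster (my own alternative; last_action_year is a helper column I am introducing): forward-fill the year of the most recent action within each site, then subtract.
df['last_action_year'] = df['Year'].where(df['Action'] == 1)   # the year itself on action rows, NaN elsewhere
df['last_action_year'] = df.groupby('Site')['last_action_year'].ffill()
df = df.dropna(subset=['last_action_year'])                    # drops years before a site's first action
df['Years_Since_Last_Action'] = (df['Year'] - df['last_action_year'] + 1).astype(int)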

How to efficiently add multiple columns to pandas data frame with values that depend on other dynamic columns

Is there a better solution than the following code? On a big data set with lots of columns it takes too much time.
import pandas as pd
df = pd.DataFrame({'Jan':[10,20], 'Feb':[3,5],'Mar':[30,4],'Month':
[3,2],'Year':[2016,2016]})
# Jan Feb Mar Month Year
# 0 10 3 30 3 2016
# 1 20 5 4 2 2016
df1['Antal_1']= np.nan
df1['Antal_2']= np.nan
for i in range(len(df)):
if df['Yaer'][i]==2016:
df['Antal_1'][i]=df.iloc[i,df['Month'][i]-1]
df['Antal_2'][i]=df.iloc[i,df['Month'][i]-2]
else:
df['Antal_1'][i]=df.iloc[i,-1]
df['Antal_2'][i]=df.iloc[i,-2]
df
# Jan Feb Mar Month Year Antal_1 Antal_2
# 0 10 3 30 3 2016 30 3
# 1 20 5 4 2 2016 5 20
You should see a marginal speed-up by using df.apply instead of iterating rows:
import pandas as pd
df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
                   'Month': [3, 2], 'Year': [2016, 2016]})
df = df[['Jan', 'Feb', 'Mar', 'Month', 'Year']]

def calculator(row):
    m1 = row['Month']
    m2 = row.index.get_loc('Month')
    return (row[int(m1-1)], row[int(m1-2)]) if row['Year'] == 2016 \
        else (row[m2-1], row[m2-2])

df['Antal_1'], df['Antal_2'] = list(zip(*df.apply(calculator, axis=1)))
# Jan Feb Mar Month Year Antal_1 Antal_2
# 0 10 3 30 3 2016 30 3
# 1 20 5 4 2 2016 5 20
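One caveat (my note, not the answer's): integer lookups like row[int(m1-1)] on a label-indexed Series rely on a positional fallback that newer pandas versions deprecate; row.iloc is the explicit spelling:
def calculator(row):
    m1 = row['Month']
    m2 = row.index.get_loc('Month')
    if row['Year'] == 2016:
        return row.iloc[int(m1 - 1)], row.iloc[int(m1 - 2)]
    return row.iloc[m2 - 1], row.iloc[m2 - 2]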
It's not clear to me what you want to do in the case of the year not being 2016, so I've made the value 100. Show an example and I can finish it. If it's just NaNs, then you can remove the first two lines from below.
df['Antal_1'] = 100
df['Antal_2'] = 100
# pick, per row, the value in the column at position Month-1 (and Month-2);
# a fixed get_loc("Month") offset would grab the same column for every row
vals = df[['Jan', 'Feb', 'Mar', 'Month', 'Year']].to_numpy()
rows = np.arange(len(df))
mask = (df['Year'] == 2016).to_numpy()
df.loc[mask, 'Antal_1'] = vals[rows, (df['Month'] - 1).to_numpy()][mask]
df.loc[mask, 'Antal_2'] = vals[rows, (df['Month'] - 2).to_numpy()][mask]
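With the sample frame, this reproduces the expected values from the question:
print(df[['Antal_1', 'Antal_2']])
#    Antal_1  Antal_2
# 0       30        3
# 1        5       20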

Map dataframe column value by another column's value

My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column as strings and split at the dot.
Then apply a function that takes the trailing number, if there is one, and turns it into a year.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        # entry is the list from the split; its last piece is '1', '2', ...
        # or the month name itself when there was no dot
        value += int(entry[-1])
    finally:
        # if int() raised, the return in finally swallows the exception
        # and we fall back to the base year
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller's answer above
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
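The NaN produced by errors='coerce' makes the column float; if you want plain integers, or the string dtype the other answer produces (my note), cast after filling:
df['new'] = (pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce')
             .fillna(0).astype(int) + 2015).astype(str)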
