I need to overwrite certain values in a dataframe column, conditional on the values in another column.
The issue I have is that I can identify and replace certain rows with a string, but I do not know how to replace them with data from another column.
When I run my example code, shown below, I get the following warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
df['Year'] = np.where(df['CallDtYear'] != 0, df['CallDtYear'], df['Year'])
I have also tried .iloc, but I don't know how to replace my chosen rows with data from another column, as opposed to a string value.
My dataframe, df, is:
ID CallDtYear Year
EJ891119 2024 0
EJ522806 0 2023
ED766836 2019 0
EK089367 2023 2024
EK414703 2026 2026
EI684097 0 2021
And I want my expected output to yield
ID CallDtYear Year
EJ891119 2024 2024
EJ522806 0 2023
ED766836 2019 2019
EK089367 2023 2023
EK414703 2026 2026
EI684097 0 2021
You're close; just use df.pop. df.pop removes the column from the dataframe and returns its values:
df['Year'] = np.where(df['CallDtYear'] != 0, df['CallDtYear'], df.pop('Year'))
df
ID CallDtYear Year
0 EJ891119 2024 2024
1 EJ522806 0 2023
2 ED766836 2019 2019
3 EK089367 2023 2023
4 EK414703 2026 2026
5 EI684097 0 2021
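For completeness, a sketch of an alternative that keeps the Year column intact: Series.where keeps each CallDtYear that is nonzero and falls back to Year otherwise, and taking an explicit copy first guards against df being a slice of another dataframe (the usual cause of this warning).
df = df.copy()  # make df own its data instead of viewing a slice of another frame
df['Year'] = df['CallDtYear'].where(df['CallDtYear'] != 0, df['Year'])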
Sorry if my title sounds a bit confusing. What I'm basically trying to do is add new rows to a data frame, duplicating each unique value of one column while the values of another column change.
This is what my data frame looks like:
id    year
01    2022
02    2022
03    2022
...   ...
99    2022
And I want it to look like this:
id    year
01    2022
01    2023
01    2024
02    2022
02    2023
02    2024
03    2022
...   ...
99    2024
I.e. for every id I want to add the years 2023 and 2024 in the year column. I tried doing this with an apply function, but it didn't work out. Could you help me solve this?
import numpy as np
import pandas as pd

years = [2022 + i for i in range(3)]
# or simply:
years = [2022, 2023, 2024]

pd.DataFrame({
    # repeat each id once per year: 01, 01, 01, 02, 02, 02, ...
    'id': np.repeat((data := df.id.to_numpy()), len(years)),
    # cycle the years across all ids: 2022, 2023, 2024, 2022, ...
    'year': np.tile(np.array(years), data.shape[0]),
})
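As a usage note, on pandas >= 1.2 a cross join gives the same result more directly (a sketch reusing the years list above):
df[['id']].merge(pd.DataFrame({'year': years}), how='cross')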
You can simply use a list comprehension and concatenate copies of the dataframe with whatever year increments you desire. For example:
pd.concat([df.assign(year=df.year+increment) for increment in range(0,3)]).sort_values(by='id').reset_index(drop=True)
This extends your dataframe to three years, as follows. You can play around with the range for the desired number of extensions:
id    year
1     2022
1     2023
1     2024
2     2022
2     2023
2     2024
3     2022
3     2023
3     2024
A quick solution would be to make two copies of your current dataframe, changing the year to 2023 in one and to 2024 in the other. After you do that, concatenate all 3 dataframes together using pd.concat, as sketched below.
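A minimal sketch of that idea, assuming df has the id and year columns shown above:
import pandas as pd

# two shifted copies stacked under the original frame
out = pd.concat([df, df.assign(year=df.year + 1), df.assign(year=df.year + 2)])
out = out.sort_values(by='id').reset_index(drop=True)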
I have a dataframe called LCI. In this dataframe, the index is offset from the year by one: index 0 is year 1, and so on. Some years contain values, and I made a list of which years do. The list looks like this: [1,5,10,15,...,60]
years   year_reality   CO2     CH4     NO2     CO
1       2021           7.016   6.180   1.222
2       2022           2       0       0
3       2023           0       0       0
What I now want to do is multiply the corresponding value of a year by another column of values, called DynCFs. DynCFs looks like this:
years   year_reality   CO2   CH4   NO2   CO
1       2021           3     6     2
2       2022           4     2     7
3       2023           3     7     6
so for example: LCI.loc[0,'CO2'] * DynCFs['CO2'] = [3*7.016, 4*7.016, 3*7.016]
and call this new dataframe/column tempDLCA (a different name for each new column).
I want to make a new dataframe equal to the sum of the tempDLCA columns, where only values from the same year are added up.
so for example:
year_reality   CO2
2021           7.016*3
2022           7.016*4
2023           7.016*3
and
year_reality   CO2
2022           2*3
2023           2*4
2024           2*3
should give this (what I will call dynLCA in the code)
year_reality   CO2
2021           7.016*3
2022           7.016*4 + 2*3
2023           7.016*3 + 2*4
2024           2*3
I tried the following, but the output only reflects the last i of listedValues, i.e. 60.
for i in listedValues:
    tempDLCA = pd.DataFrame()
    tempDLCA['Year_reality'] = np.arange(2021 + (i-1), 4021 + (i-1), 1)
    tempDLCA['CO2'] = LCI.loc[(i-1), 'CO2'] * DynCFs['CO2']
    tempDLCA['CO'] = LCI.loc[(i-1), 'CO'] * DynCFs['CO']
    tempDLCA['NO2'] = LCI.loc[(i-1), 'NO2'] * DynCFs['NO2']
    tempDLCA['CH4'] = LCI.loc[(i-1), 'CH4'] * DynCFs['CH4']
    dynLCA = pd.concat([DLCA, tempDLCA], ignore_index=True).groupby(['Year_reality'], as_index=False).sum()
dynLCA
What am I doing wrong?
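No answer was posted here, but the symptom points at the last line of the loop: dynLCA is rebuilt on every iteration from DLCA (apparently a typo for dynLCA) plus only the current tempDLCA, so earlier iterations are discarded. A sketch of one likely fix, which collects every tempDLCA and aggregates once after the loop (this only addresses the accumulation bug, not the data itself):
import numpy as np
import pandas as pd

frames = []
for i in listedValues:
    tempDLCA = pd.DataFrame()
    tempDLCA['Year_reality'] = np.arange(2021 + (i-1), 4021 + (i-1), 1)
    for gas in ['CO2', 'CO', 'NO2', 'CH4']:
        # scalar emission of year i times the dynamic characterisation factors
        tempDLCA[gas] = LCI.loc[i-1, gas] * DynCFs[gas]
    frames.append(tempDLCA)

# stack all iterations, then sum rows that share a year
dynLCA = pd.concat(frames, ignore_index=True).groupby('Year_reality', as_index=False).sum()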
I have a dataframe that has records from 2011 to 2018. One of the columns has the drop_off_date, which is the date when the customer left the rewards program. I want to count, for each month from 2011 to 2018, how many people dropped off during that month. So for the 84-month period, I want the count of people who dropped off then, using the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the column drop_off_date and strip them to keep only the year and month:
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then you apply a groupby on the newly created column and then a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
Using your data, I'm assuming your date column has been cast to a datetime value, with errors='coerce' to handle outliers. You should then drop any NAs so you're only dealing with customers who actually dropped off. You can do this in a multitude of ways; I would do a simple df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregate:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby count on the record ID (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: to account for years, let's first create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month','Year' ])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
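As a side note, a more direct route (a sketch, assuming drop_off_date is already a datetime column) is to group by a monthly period, which also keeps the months in chronological order:
df1 = (df.groupby(df['drop_off_date'].dt.to_period('M'))['Record ID']
         .count()
         .reset_index(name='#of dropoffs'))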
My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column like strings and split at the dot. Then apply a function that takes the number string and turns it into a new year value if possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])  # 'Apr' alone has no numeric suffix, so int() raises
    finally:
        # returning from finally swallows the ValueError, leaving 2015
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.:
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
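Note that fillna makes the result float (hence 2015.0); if you want plain integers you could cast before adding:
df['new'] = (pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce')
               .fillna(0)
               .astype(int) + 2015)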
id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
Suppose now I group the above on id with this Python command:
grouped = file.groupby(file.id)
I would like to get a new file with only the row in each group whose year is the most recent of all the years in the group.
Please let me know the command. I have been trying with apply, but it only gives a boolean expression; I want the entire row with the latest year.
I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can groupby the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id':
In [103]:
df[df.groupby(['id'])['year'].transform(max) == df['year']]
Out[103]:
id marks year
0 1 18 2013
2 3 16 2014
4 1 19 2013
6 2 18 2014
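As a side note, transform also accepts the aggregation name as a string, which is the more common spelling:
df[df.groupby('id')['year'].transform('max') == df['year']]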