I have the following dataframe:
import pandas as pd
import numpy as np
myDF = pd.DataFrame({'quarter': ['Q1','Q2','Q3','Q4','Q1','Q2','Q3','Q4','Q1','Q2','Q3','Q4'],
                     'year': [2018,2018,2018,2018,2019,2019,2019,2019,2020,2020,2020,2020]})
which looks like:
quarter year
0 Q1 2018
1 Q2 2018
2 Q3 2018
3 Q4 2018
4 Q1 2019
5 Q2 2019
6 Q3 2019
7 Q4 2019
8 Q1 2020
9 Q2 2020
10 Q3 2020
11 Q4 2020
I can calculate the mean of the index values:
print(np.mean(myDF.index))
5.5
...but I would like to produce a list of the mean index values for each year.
I can create a new variable based on index values and find the mean of those values as follows:
myDF['idx'] = myDF.index
print(myDF.groupby('year')['idx'].apply(list))
print(myDF.groupby('year')['idx'].apply(np.mean).tolist())
to produce:
year
2018 [0, 1, 2, 3]
2019 [4, 5, 6, 7]
2020 [8, 9, 10, 11]
Name: idx, dtype: object
[1.5, 5.5, 9.5]
However, I don't seem to be able to manipulate the index values directly. I've tried applying various versions of the above to DataFrameGroupBy objects but I get the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'index'
So, whilst I have a solution, creating a new variable based on the index seems a bit redundant. Can the required list of means be created without the need to alter the original dataframe?
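One possible approach (my own sketch, not taken from the original post): convert the index itself to a Series and group that by the aligned year column, leaving myDF untouched:
# Group the index (as a Series) by the 'year' column; myDF is not modified.
print(myDF.index.to_series().groupby(myDF['year']).mean().tolist())
# [1.5, 5.5, 9.5]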
I have a data set like this:
dfdict = {
'year' : [2021, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022, 2022],
'value' : [1,2,3,4,5,6,7,8,9,10]
}
df = pd.DataFrame(dfdict)
I also have a dictionary whose keys are years and whose values are the lower and upper limits I want to apply as a condition for each year:
limitdict = {
'2021' : [2, 4],
'2022' : [7, 8]
}
How can I show the rows of df whose values for each year are either smaller than the lower limit or larger than the upper limit in limitdict? The result should look like:
year value
0 2021 1
4 2021 5
5 2022 6
8 2022 9
9 2022 10
Another possible solution:
# astype is needed because your dictionary keys are strings
year = df['year'].astype('str')
df[(
    df['value'].lt([limitdict[x][0] for x in year]) |
    df['value'].gt([limitdict[x][1] for x in year])
)]
Or:
year = df['year'].astype('str')
z1, z2 = zip(*[limitdict[x] for x in year])
df[(df['value'].lt(z1) | df['value'].gt(z2))]
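Here zip(*...) transposes the list of [lower, upper] pairs into two tuples: z1 holds the lower bounds and z2 the upper bounds, which lt and gt then compare element-wise.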
Output:
year value
0 2021 1
4 2021 5
5 2022 6
8 2022 9
9 2022 10
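A further variant (my sketch, not part of the original answers): build lower- and upper-bound Series with Series.map instead of list comprehensions:
# limitdict keys are strings, hence astype(str) before mapping.
lower = df['year'].astype(str).map(lambda y: limitdict[y][0])
upper = df['year'].astype(str).map(lambda y: limitdict[y][1])
print(df[df['value'].lt(lower) | df['value'].gt(upper)])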
I suggest splitting the dataframe by year and then using between to filter out values in the range specified in limitdict. Note the ~ operator, which inverts the between mask so that values inside the range are dropped: df_year[~df_year.value.between(limitdict[str(year)][0], limitdict[str(year)][1])].
list_of_dataframes = []
for year in df.year.unique():
    df_year = df[df.year == year]  # rows for this year only
    # keep rows whose value falls outside [lower, upper] for this year
    list_of_dataframes.append(
        df_year[~df_year.value.between(limitdict[str(year)][0], limitdict[str(year)][1])])
output_df = pd.concat(list_of_dataframes)
This returns:
year value
0 2021 1
4 2021 5
5 2022 6
8 2022 9
9 2022 10
I have already looked for this type of question, but none of the existing ones really answers mine.
Suppose I have two dataframes and the indices of these are NOT consistent. df2 is a subset of df1 and I want to remove all the rows in df1 that are present in df2.
I already tried the following but it's not giving me the result I'm looking for.
df1[~df1.index.isin(df2.index)]
Unfortunately, I can't share the original data with you; however, both dataframes have 14 columns.
Here's an example of what I'm looking for:
df1 =
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
df2 =
month year sale
0 1 2012 55
1 10 2014 31
and I'm looking for:
df =
month year sale
0 4 2014 40
1 7 2013 84
You could create a multi-index with all the columns in each dataframe. From there, you just have to drop the indices of the second from the first one:
df1.set_index(list(df1.columns)).drop(df2.set_index(list(df2.columns)).index).reset_index()
Result with your example data:
month year sale
0 4 2014 40
1 7 2013 84
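One caveat (my addition): DataFrame.drop raises a KeyError for missing labels, so this assumes every row of df2 really occurs in df1. If that is not guaranteed, errors='ignore' can be passed:
# Tolerate rows of df2 that have no exact match in df1.
df1.set_index(list(df1.columns)).drop(
    df2.set_index(list(df2.columns)).index, errors='ignore').reset_index()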
Use a left join via DataFrame.merge with the indicator parameter, then compare the new column with Series.eq (==) and filter by boolean indexing:
df = df1[df1.merge(df2, indicator=True, how='left')['_merge'].eq('left_only')]
print(df)
month year sale
1 4 2014 40
2 7 2013 84
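To make the intermediate step visible (an illustration using the example data above): indicator=True adds a _merge column flagging each row as 'both', 'left_only', or 'right_only':
merged = df1.merge(df2, indicator=True, how='left')
print(merged['_merge'])
# 0         both
# 1    left_only
# 2    left_only
# 3         both
# Name: _merge, dtype: category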
So what you want is to remove by values, not by index.
Use concat and drop_duplicates:
comp = pd.concat([df1, df2]).drop_duplicates(keep=False)
Example:
df1 = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df2 = pd.DataFrame({'month': [1, 10], 'year': [2012, 2014], 'sale': [55, 31]})
pd.concat([df1, df2]).drop_duplicates(keep=False)
Result:
month sale year
1 4 40 2014
2 7 84 2013
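Note that keep=False removes every copy of a duplicated row, so this relies on df1 containing no internal duplicates and on df2 being a true subset of df1; a row of df2 that is absent from df1 would survive the concat and leak into the result.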
Can you try the below? Note that plain df1[~df1.isin(df2)] only masks matching cells as NaN (isin aligns on index and column labels rather than dropping rows), so compare whole rows instead:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
I need to overwrite certain values in a dataframe column, conditional on the values in another column.
The issue I have is that I can identify and replace certain rows with a string, but I do not know how to replace them with data from another column.
I have attempted the code below but have encountered the following error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
when I run my example code shown below
df['Year'] = np.where(df['CallDtYear'] != 0, df['CallDtYear'], df['Year'])
I have also tried .iloc but don't know how to replace my chosen rows with data from another column, as opposed to a string value.
My dataframe, df, is:
ID CallDtYear Year
EJ891119 2024 0
EJ522806 0 2023
ED766836 2019 0
EK089367 2023 2024
EK414703 2026 2026
EI684097 0 2021
And I want my expected output to yield
ID CallDtYear Year
EJ891119 2024 2024
EJ522806 0 2023
ED766836 2019 2019
EK089367 2023 2023
EK414703 2026 2026
EI684097 0 2021
You're close; just use df.pop. df.pop removes the column from the dataframe and returns its values:
df['Year'] = np.where(df['CallDtYear'] != 0, df['CallDtYear'], df.pop('Year'))
df
ID CallDtYear Year
0 EJ891119 2024 2024
1 EJ522806 0 2023
2 ED766836 2019 2019
3 EK089367 2023 2023
4 EK414703 2026 2026
5 EI684097 0 2021
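For completeness (my addition), the warning's own suggestion can also be followed directly with an equivalent .loc assignment:
# Write CallDtYear into Year only where CallDtYear is non-zero.
mask = df['CallDtYear'] != 0
df.loc[mask, 'Year'] = df.loc[mask, 'CallDtYear']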
I have created a dataframe after importing weather data, now called "weather".
The end goal is to be able to view data for specific month and year.
It started like this:
Then I ran weather = weather.T to transpose the dataframe, making it look like:
Then I ran weather.columns = weather.iloc[0] to make the dataframe look like:
But the "year" and "month" columns are located in the index (I think?). How would I get it so it looks like:
Thanks for looking! Will appreciate any help :)
Please note that I will remove the first row with the years in it, so don't worry about that part.
This just means the pd.Index object underlying your pd.DataFrame object, unbeknownst to you, has a name:
df = pd.DataFrame({'YEAR': [2016, 2017, 2018],
'JAN': [1, 2, 3],
'FEB': [4, 5, 6],
'MAR': [7, 8, 9]})
df.columns.name = 'month'
df = df.T
df.columns = df.iloc[0]
print(df)
YEAR 2016 2017 2018
month
YEAR 2016 2017 2018
JAN 1 2 3
FEB 4 5 6
MAR 7 8 9
If this really bothers you, you can use reset_index to elevate your index to a series and then drop the extra heading row. You can, at the same time, remove the column name:
df = df.reset_index().drop(0)
df.columns.name = ''
print(df)
month 2016 2017 2018
1 JAN 1 2 3
2 FEB 4 5 6
3 MAR 7 8 9
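An equivalent way to clear the leftover column name (my addition; recent pandas versions) is rename_axis, which has the advantage of being chainable:
# columns=None removes the columns-axis name instead of setting it to ''.
df = df.reset_index().drop(0).rename_axis(columns=None)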
id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
Suppose now I group the above on id with this Python command:
grouped = file.groupby(file.id)
I would like to get a new dataframe with only the row in each group whose year is the most recent, i.e. the highest of all the years in the group.
Please let me know the command. I have tried apply, but it only gives me a boolean expression, and I want the entire row with the latest year.
I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can groupby the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id':
In [103]:
df[df.groupby(['id'])['year'].transform(max) == df['year']]
Out[103]:
id marks year
0 1 18 2013
2 3 16 2014
4 1 19 2013
6 2 18 2014
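A small stylistic note (my addition): passing the aggregation by name works the same way and avoids handing the Python builtin max to transform:
df[df.groupby(['id'])['year'].transform('max') == df['year']]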