Pandas DataFrame Comprehensions [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
Problem: Add a new column to a DataFrame and populate it with the values of a column from another DataFrame, based on a condition, in one line of code, similar to a list comprehension.
Example code:
I create a DataFrame called df with some pupil information:
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz',
                               'Maricopa', 'Yuma'])
Then I create a second DataFrame called df_extra, which holds a string representation of each year:
extra_data = {'year': [2012, 2013, 2014],
              'yr_string': ['twenty twelve', 'twenty thirteen', 'twenty fourteen']}
df_extra = pd.DataFrame(extra_data)
Now, how do I add the yr_string values as a new column to df, where the numerical years match, in one line of code?
I can easily do this with a couple of for loops, but I would really like to know whether it is possible in one line, similar to a list comprehension.
I have searched existing questions here, but found nothing discussing how to add a new column to an existing DataFrame from another DataFrame, based on a condition, in one line.

You can merge the two DataFrames on the year column:
df.merge(df_extra, how='left', on=['year'])
# name reports year yr_string
# 0 Jason 4 2012 twenty twelve
# 1 Molly 24 2012 twenty twelve
# 2 Tina 31 2013 twenty thirteen
# 3 Jake 2 2014 twenty fourteen
# 4 Amy 3 2014 twenty fourteen
Basically this says "pull the data from df_extra into df wherever the year column matches". Note that this returns a copy rather than modifying df in place, so assign the result back if you want to keep it.
List comprehensions are still Python-level loops. With pandas.merge() you take advantage of the vectorized, optimized backend that pandas uses to operate on its DataFrames, which should be faster.
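If you also want to keep df's original index (a plain merge resets it to 0..n-1), a one-line, map-based alternative along these lines should work with the example frames above:
# One-line alternative (sketch): look up yr_string by year and assign it as a new column.
# Unlike merge, this keeps df's original index ('Cochice', 'Pima', ...).
df['yr_string'] = df['year'].map(df_extra.set_index('year')['yr_string'])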

Related

How do I filter a dataframe based on complicated conditions? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 months ago.
Right now my DataFrames look like this (I simplified them because the originals have hundreds of rows):
import pandas as pd

Winner = [[1938, "Italy"], [1950, "Uruguay"], [2014, "Germany"]]
df = pd.DataFrame(Winner, columns=['Year', 'Winner'])
print(df)

MatchB = [[1938, "Germany", 1.0], [1938, "Germany", 2.0], [1938, "Brazil", 1.0],
          [1950, "Italy", 2.0], [1950, "Spain", 2.0], [1950, "Spain", 1.0],
          [1950, "Spain", 1.0], [1950, "Brazil", 1.0], [2014, "Italy", 2.0],
          [2014, "Spain", 3.0], [2014, "Germany", 1.0]]
df2B = pd.DataFrame(MatchB, columns=['Year', 'Away Team Name', 'Away Team Goals'])
df2B
I would like to filter df2B so that I keep only the rows where "Year" and "Away Team Name" match a row in df:
Filtered list (simplified): the expected output was shown as an image in the original post.
I checked Google but couldn't find anything useful.
You can use merge, joining on both columns:
df = pd.merge(left=df, right=df2B, left_on=["Year", "Winner"], right_on=["Year", "Away Team Name"])
print(df)
Output:
Year Winner Away Team Name Away Team Goals
0 2014 Germany Germany 1.0
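If you only want to filter df2B without pulling the Winner column into the result, an alternative sketch (it assumes pandas 0.24+ for MultiIndex.from_frame, and uses the original df and df2B from the question, before the reassignment above) is to test the (Year, Away Team Name) pairs for membership:
# Build (Year, team) pairs from both frames and keep only the matching rows of df2B.
pairs_to_keep = pd.MultiIndex.from_frame(df[["Year", "Winner"]])
mask = pd.MultiIndex.from_frame(df2B[["Year", "Away Team Name"]]).isin(pairs_to_keep)
print(df2B[mask])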

Efficient way of converting year_week to datetime in pandas

I have a pandas DataFrame with two columns, year and week_number.
df = pd.DataFrame({'year': [2019, 2020, 2021, 2022], 'week_number':[3,12,38,42]})
df
year week_number
0 2019 3
1 2020 12
2 2021 38
3 2022 42
I know I can apply something like the following to each row to convert it to a datetime value. However, I want to know whether there is a more efficient way to do this for big DataFrames and store the results in a third column.
import datetime
single_day = "2013-26"
converted_date = datetime.datetime.strptime(single_day + '-1', "%Y-%W-%w")
print(converted_date)
I wouldn't say your way is inefficient, but if you want a fully vectorized approach that doesn't require importing another library and adds the new column to your DataFrame directly, this might be what you're looking for:
import pandas as pd
df = pd.DataFrame({'year': [2019, 2020, 2021, 2022], 'week_number':[3,12,38,42]})
# year*100 + week_number gives e.g. 201903; appending '0' adds a weekday (Sunday),
# so '%Y%W%w' can parse every row in a single vectorized call.
df['date'] = pd.to_datetime((df['year']*100 + df['week_number']).astype(str) + '0', format='%Y%W%w')
df
If you are on Python >= 3.8, use datetime.date.fromisocalendar; it also exists on datetime.datetime.
>>> from datetime import date
>>> # 11 May 2022 is a Wednesday in the 19th ISO week
>>> date.fromisocalendar(2022, 19, 3)
datetime.date(2022, 5, 11)
As a new column:
df['date'] = df[['year', 'week_number']].apply(
    lambda row: date.fromisocalendar(row['year'], row['week_number'], 1), axis=1)
Use apply to loop over rows (axis=1) with a lambda function that concatenates the two columns into a string and then does exactly what you did above :) Perhaps this wasn't the answer you were looking for, though, since you're after the most efficient solution. However, it does the job!
from datetime import datetime
df['convert_date'] = df.apply(lambda x: datetime.strptime(f"{x.year}-{x.week_number}-1", "%Y-%W-%w"), axis=1)
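For large frames, if you want to measure the difference yourself, a rough comparison sketch in IPython/Jupyter (the frame below is made up for illustration) could look like this:
import numpy as np
import pandas as pd
from datetime import datetime

big = pd.DataFrame({'year': np.random.randint(2015, 2023, 100_000),
                    'week_number': np.random.randint(1, 53, 100_000)})

# Vectorized: one to_datetime call parses the whole column at once.
%timeit pd.to_datetime((big['year']*100 + big['week_number']).astype(str) + '0', format='%Y%W%w')

# Row-wise: strptime is called once per row via apply.
%timeit big.apply(lambda x: datetime.strptime(f"{x.year}-{x.week_number}-1", "%Y-%W-%w"), axis=1)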

Filter the duplicate rows of a pandas dataframe, keeping rows with the latest date only

I have a pandas dataframe:
import pandas as pd
d = {'title': ['GrownUps', 'Toy Story', 'Toy Story', 'Avatar', 'Avatar', 'Avatar'], 'year': [2012, 1995, 2000, 2005, 2006, 2010]}
dataset = pd.DataFrame(d)
From the DataFrame above I want to locate the duplicate movie titles (i.e. Toy Story, Avatar). To do so, I use the following code:
dataset[dataset.duplicated(subset=['title'],keep=False)]
From the rows returned I would like to keep, per duplicated movie, only the most recent one (i.e. the one with the maximum year), and store in a list the indexes of the rows that do not have the maximum year so I can filter them out of the initial dataset.
So my final dataset should look like this:
d = {'title':['GrownUps', 'Toy Story', 'Avatar'], 'year': [2012, 2000, 2010]}
dataset=pd.DataFrame(d)
I kept only the Toy Story row from 2000 (not 1995), and the Avatar row from 2010 (not 2005 or 2006).
This could also be very useful if someone wants to use a different aggregate than max(), such as mean(), sum(), etc.
We can sort in ascending order of "year", then drop duplicates on "title" keeping the last row (which has the latest year), and then restore the original row ordering:
dataset.sort_values('year').drop_duplicates('title', keep='last').sort_index()
title year
0 GrownUps 2012
2 Toy Story 2000
5 Avatar 2010
This avoids a groupby operation (which is relatively slower) and maintains the original ordering of rows.
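If you do want the groupby route instead, for example because you need a different aggregate than the maximum year (as mentioned above), a sketch along these lines should work with the dataset frame from the question:
# Keep, per title, the whole row with the maximum year (same result as the sort-based approach).
dataset.loc[dataset.groupby('title')['year'].idxmax()].sort_index()

# Or, if you only need an aggregated value per title (mean, sum, ...) rather than whole rows:
dataset.groupby('title', as_index=False)['year'].mean()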

How to delete the last two rows of a df with pandas [duplicate]

This question already has answers here:
How to delete the last row of data of a pandas dataframe
(11 answers)
Closed 4 years ago.
Here is the code I'm playing with. I want to delete the last two rows of the DataFrame. I'm actually working with a bigger file whose last two rows fluctuate. Once I get it working on this small example, I will implement it in my primary source code.
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz',
                               'Maricopa', 'Yuma'])
df
df.drop(df.index[-2])
This will only remove the second row from the bottom, but I am trying to delete two rows that will be followed by NaN.
Better is to select all rows except the last two with iloc:
df = df.iloc[:-2]
print (df)
name year reports
Cochice Jason 2012 4
Pima Molly 2012 24
Santa Cruz Tina 2013 31
You can use df.tail to achieve that too, where n is the number of rows to drop (here n = 2):
df.drop(df.tail(n).index, inplace=True)
You can also try this way to remove the last 2 rows:
df = df[:-2]
Output:
After removing last 2 rows
name year reports
Cochice Jason 2012 4
Pima Molly 2012 24
Santa Cruz Tina 2013 31
Working Demo: https://repl.it/repls/UnacceptableWrithingQuotes
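Since the last rows of the real file fluctuate, if the rows you want to remove are the ones that are entirely NaN (an assumption about the data here), it may be more robust to drop them by condition rather than by position:
# Drop every row whose values are all NaN, wherever it appears in the frame.
df = df.dropna(how='all')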

Pandas dataframe.set_index() deletes previous index and column

I just came across a strange phenomenon with pandas DataFrames: when setting a new index using DataFrame.set_index('some_index'), the column that was previously the index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df_mn = df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed along with the index. This is very irritating, because I just wanted to swap the DataFrame's index.
Is there a way to keep the column that was previously the index from being deleted? Maybe through something like: DataFrame.set_index('new_index', delete_previous_index=False)
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
The solution I found to retain the previous column is to set drop=False:
dataframe.set_index('some_column', drop=False). This is not the perfect answer, but it works!
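With the example frame above, a quick sketch of that drop=False approach looks like this:
# Keep 'month' as a regular column even while it is the index,
# so switching the index to 'year' later does not lose it.
df_mn = df.set_index('month', drop=False)
df_yr = df_mn.set_index('year')  # 'month' survives as a column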
No, in such cases you have to save your previous column, as shown below:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
df_mn['month'] = df_mn.index  # Save the old index as a regular column, then set the index to 'year'.
df_mn.set_index('year')
Besides, you are working on a separate DataFrame df_mn, so the original df remains unchanged and you can use it again.
Also, if you don't set the inplace argument of set_index to True, df_mn itself won't change even after you call set_index() on it.
And, as in the other answer, you can always use reset_index().
