I want to create a dataframe from an aggregate function. I thought that it would create by default a dataframe as this solution states, but it creates a series and I don't know why (Converting a Pandas GroupBy object to DataFrame).
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPlay'
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That is telling the groupby to only execute the the mean function on the pd.Series df['TotalPay'] for each group defined in ['JobTitle']. Instead, you want to refer to this column with [['TotalPay']]. Notice the double brackets. Those double brackets say pd.DataFrame.
Recap
df2 = df2=df.groupby(['JobTitle'])[['TotalPay']].mean()
Related
I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id Country release_year
s1 [US] 2020
s2 [South Africa] 2021
s3 NaN 2021
s4 NaN 2021
s5 [India] 2021
I want to make a new df which look like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My solution is that: with a blank df like the second one, run a for loop to count the number of movies released and then modify the value in the cell relatively.
countrylist=['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
for j in list(range(0,8807)):
if x in countrydf.country[j]:
t=int (countrydf.release_year[j] )
country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
an error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don’t know which one is of float type here, I have check the type of countrydf.country[j] and it returned int.
I was using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for a df that I want to create?
P/s: my English is not so good so hop you guys understand.
Here is a solution using groupby
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
Some alternative ways to achieve this in pandas without loops.
If the Country Column have more than 1 value in the list in each row, you can try the below:
>>df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Else if Country has just 1 value per row in the list as you have shown in the example, you can use crosstab
>>pd.crosstab(df['release_year'],df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0
This question already has answers here:
Keeping NaN values and dropping nonmissing values
(2 answers)
Closed 1 year ago.
This post was edited and submitted for review 1 year ago and failed to reopen the post:
Original close reason(s) were not resolved
I have a Pandas dataframe with one column, director_name, containing directors of movies and another column, death_year, containing either NaN or a float which describes the year they passed away (example: 1996.00). How do I drop all the rows which possess directors that have died as expressed by a float being in the death_year column?
nconst director_name birth_year death_year
0 nm0061671 Mary Ellen Bauder 1967.00 NaN
1 nm0061865 Joseph Bauer NaN 1996.00
2 nm0062070 Bruce Baum 1981.00 NaN
3 nm0062195 Axel Baumann NaN 2015.00
4 nm0062798 Pete Baxter 1954.00 NaN
So in the data frame above, rows 1 and 3 would be dropped because Joseph Bauer died in 1996 and Axel Baumann died in 2015. The result being a dataframe of only living directors:
nconst director_name birth_year death_year
0 nm0061671 Mary Ellen Bauder 1967.00 NaN
1 nm0062070 Bruce Baum 1981.00 NaN
2 nm0062798 Pete Baxter 1954.00 NaN
The DataFrame is huge, it contains too many rows to physically go through and make sure someone didn't enter the death year incorrectly such as 0000.000 by mistake.
You can use .loc and .notna():
df.loc[df['birth_year'].notna()].reset_index(drop=True)
If you want to drop rows by death_year use .isna():
df.loc[df['death_year'].isna()].reset_index(drop=True)
Output:
nconst director_name birth_year death_year
0 nm0061671 Mary Ellen Bauder 1967.00 NaN
1 nm0062070 Bruce Baum 1981.00 NaN
2 nm0062798 Pete Baxter 1954.00 NaN
In both cases we have the same output for the sample you pasted. You can choose what is better to use for the whole dataframe.
I noticed that when 'death_year' is not NaN, birth_year is.
df.dropna(subset=['birth_year'], inplace=True)
I'm currently wrangling a big data set of 2 mio rows from Lyft for a Udacity project. The DataFrame looks like this:
id name latitude longitude
0 148.0 Horton St at 40th St 37.829705 -122.287610
1 376.0 Illinois St at 20th St 37.760458 -122.387540
2 453.0 Brannan St at 4th St 37.777934 -122.396973
3 182.0 19th Street BART Station 37.809369 -122.267951
4 237.0 Fruitvale BART Station 37.775232 -122.224498
5 NaN NaN 37.775232 -122.224498
As I try to express in the last line, I have a lot of NaN values for id and name, however, latitude and longitude are mostly never empty. My assumption is that I could actually extract the name from other rows given a certain combination of latitude and longitude.
Once I have the name, I would try filling the NaN values for id using name
dict_id = dict(zip(df['name'], df['id']))
df['id'] = df['id'].fillna(df['name'].map(dict_id))
However, I struggle because with latitude and longitude I have two values to match against the name.
You can left merge the dataframe with the copy of it after dropna , then rename the columns:
m = df.merge(df.dropna(subset=['name']),on=['latitude','longitude'],
how='left',suffixes=('','_y'))
out = (m.drop(['id','name'],1).rename(columns={'id_y':'id','name_y':'name'})
.reindex(df.columns,axis=1))
id name latitude longitude
0 148.0 Horton St at 40th St 37.829705 -122.287610
1 376.0 Illinois St at 20th St 37.760458 -122.387540
2 453.0 Brannan St at 4th St 37.777934 -122.396973
3 182.0 19th Street BART Station 37.809369 -122.267951
4 237.0 Fruitvale BART Station 37.775232 -122.224498
5 237.0 Fruitvale BART Station 37.775232 -122.224498
I have a dataset that contains many timestamps associated with different ships and ports.
obj_id timestamp port
0 4 2019-10-01 Houston
1 2 2019-09-01 New York
2 4 2019-07-31 Boston
3 1 2019-07-28 San Francisco
4 2 2019-10-15 Miami
5 1 2019-09-01 Honolulu
6 1 2019-08-01 Tokyo
I want to build a dataframe that contains a single record for the latest voyage by ship (obj_id), by assigning the latest timestamp/port for each obj_id as a 'destination', and the second latest timestamp/port as the 'origin'. So the final result would look something like this:
obj_id origin_time origin_port destination_time destination_port
0 4 2019-07-31 Boston 2019-10-01 Houston
1 2 2019-09-01 New York 2019-10-15 Miami
3 1 2019-07-28 Tokyo 2019-09-01 Honolulu
I've successfully filtered the latest timestamps for each obj_id through this code but still can't figure a way to filter the second latest timestamp, let alone pull them both into a single row.
df.sort_values(by ='timestamp', ascending = False).drop_duplicates(['obj_id'])
Using groupby.agg with first, last:
dfg = df.sort_values('timestamp').groupby('obj_id').agg(['first', 'last']).reset_index()
dfg.columns = [f'{c1}_{c2}' for c1, c2 in dfg.columns]
obj_id_ timestamp_first timestamp_last port_first port_last
0 1 2019-07-28 2019-09-01 San Francisco Honolulu
1 2 2019-09-01 2019-10-15 New York Miami
2 4 2019-07-31 2019-10-01 Boston Houston
You want to sort the trips by timestamp so we can get the most recent voyages, then group the voyages by object id and grab the first and second voyage per object, then merge.
groups = df.sort_values(by = "timestamp", ascending = False).groupby("obj_id")
pd.merge(groups.nth(1), groups.nth(0),
on="obj_id",
suffixes=("_origin", "_dest"))
Make sure your timestamp column is the proper timestamp data type though, otherwise your sorting will be messed up.
I have a pivot table and I want to plot the values for the 12 months of each year for each town.
2010-01 2010-02 2010-03
City RegionName
Atlanta Downtown NaN NaN NaN
Midtown 194.263702 196.319964 197.946962
Alexandria Alexandria NaN NaN NaN
West
Landmark- NaN NaN NaN
Van Dom
How can I select only the values for each region of each town? I thought maybe it would be better to change the column names with years and months to datetime format and set them as index. How can I do this?
The result must be:
City RegionName
2010-01 Atlanta Downtown NaN
Midtown 194.263702
Alexandria Alexandria NaN
West
Landmark- NaN
Van Dom
Here's some similar dummy data to play with:
idx = pd.MultiIndex.from_arrays([['A','A', 'B','C','C'],
['A1','A2','B1','C1','C2']], names=['City','Region'])
idcol = pd.date_range('2012-01', freq='M', periods=12)
df = pd.DataFrame(np.random.rand(5,12), index=idx, columns=[t.strftime('%Y-%m') for t in idcol])
Let's see what we've got:
print(df.ix[:,:3])
2012-01 2012-02 2012-03
City Region
A A1 0.513709 0.941354 0.133290
A2 0.734199 0.005218 0.068914
B B1 0.043178 0.124049 0.603469
C C1 0.721248 0.483388 0.044008
C2 0.784137 0.864326 0.450250
Let's convert these to a datetime: df.columns = pd.to_datetime(df.columns)
Now to plot you just need to transpose:
df.T.plot()
Update after your updated your question:
Use stack, and then reorder if you want:
df = df.stack().reorder_levels([2,0,1])
df.head()
City Region
2012-01-01 A A1 0.513709
2012-02-01 A A1 0.941354
2012-03-01 A A1 0.133290
2012-04-01 A A1 0.324518
2012-05-01 A A1 0.554125