Sorry about the formatting. Can't seem to figure that out.
Unable to refer to column names in a grouped dataframe.
dfbidFeed
MARKET 30D_PCLICK
Dallas 5
Dallas 4
Houston 10
Houston 13
New York 7
New York 12
St. Louis 15
St. Louis 14
Then I run this code to compute the mean and the sum of 30D_PCLICK, grouped by MARKET:
import numpy as np
dfclicks_grouped = (
    dfbidFeed
    .groupby('MARKET')['30D_PCLICK']
    .agg([np.mean, np.sum])
)
returns:
MARKET mean sum
Dallas 4.5 9
Houston 11.5 23
New York 9.5 19
St. Louis 14.5 29
I want to be able to merge dfclicks_grouped with another df. However, I can't seem to refer to the columns in dfclicks_grouped.
It doesn't behave like a dataframe where I can refer to a column name and get a result (i.e. dfbidFeed['MARKET'] returns the list of markets).
When I do this (dfclicks_grouped['MARKET']), I get: KeyError: 'MARKET'
How do I make the column names in dfclicks_grouped callable?
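For reference, a minimal sketch of what is likely going on (based only on the code above): after the groupby, MARKET becomes the index of dfclicks_grouped rather than a column, which is why dfclicks_grouped['MARKET'] raises KeyError. Calling reset_index() should turn it back into an ordinary column that can be referred to and merged on:
# sketch: MARKET is currently the index, not a column
dfclicks_grouped = dfclicks_grouped.reset_index()
print(dfclicks_grouped['MARKET'])  # now returns the list of markets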
This is a bit strange to me...
I have a DataFrame with a 'utility' column and an 'envelope' column.
I have a list of cities that get sent special envelopes:
['Chicago', 'New York', 'Dallas', 'LA']
I need to loop through each value in the utility column, check if it's in the list of cities that get sent special envelopes, and if it is, add the utility name to the envelope column.
This is the code I wrote to do that:
utilityEnv = ['Chicago', 'New York', 'Dallas', 'LA']
for i in utilityEnv:
    print(i)
    for j in df.index:
        if i in df.at[j, 'utility']:
            print('true')
            df.at[j, 'envelope'] = df.at[j, 'utility']
        else:
            df.at[j, 'envelope'] = 'ABF'
When I run this code, it prints the utility name, then a bunch of 'true's for each utility, as it's supposed to, each time it's about to set the envelope column equal to the utility column. But the final df shows that the envelope column was set equal to the utility column ONLY for LA, and not for any of the other cities, even though many 'true's are printed for the other utilities, which means it reached that block for them as well.
For example, this is what happens:
utility envelope
0 Chicago ABF
1 New York ABF
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas ABF
7 LA LA
8 Chicago ABF
9 Austin ABF
This is what is supposed to happen:
utility envelope
0 Chicago Chicago
1 New York New York
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas Dallas
7 LA LA
8 Chicago Chicago
9 Austin ABF
Sorry about the formatting, I had to do it on my phone.
Any idea why this is happening?
Use Series.where with Series.isin
df['envelope'] = df['utility'].where(df['utility'].isin(utilityEnv), 'ABF')
Output
utility envelope
0 Chicago Chicago
1 New York New York
2 Austin ABF
3 Sacramento ABF
4 Boston ABF
5 LA LA
6 Dallas Dallas
7 LA LA
8 Chicago Chicago
9 Austin ABF
This is much faster than using loops; pandas methods are made for exactly these things.
The reason your loop only keeps LA is that the outer loop runs once per city: on each pass, every row that does not match the current city is reset to 'ABF', so only the matches for the last city in the list ('LA') survive.
Here is a correct version of the loop, although you should not use it:
for i in df.index:
    val = df.at[i, 'utility']
    if val in utilityEnv:
        df.at[i, 'envelope'] = val
    else:
        df.at[i, 'envelope'] = 'ABF'
I have two datasets:
- population: shows the population of US states, organized alphabetically.
- data: has more than 200,000 rows.
population.head()
state population
0 Alabama 4887871
1 Alaska 737438
2 Arizona 7171646
3 Arkansas 3013825
4 California 39557045
I'm trying to add a new column called "incidents" using the other data set.
I tried: population['incidents'] = data.state.value_counts().sort_index()
but i'm getting the following result:
state population incidents
0 Alabama 4887871 NaN
1 Alaska 737438 NaN
2 Arizona 7171646 NaN
3 Arkansas 3013825 NaN
4 California 39557045 NaN
What can I do to fix this?
EDIT:
data.state.value_counts().sort_index()
Alabama 5373
Alaska 1292
Arizona 2268
Arkansas 2753
California 15975
Colorado 3069
Connecticut 2984
Delaware 1643
District of Columbia 3091
Florida 14610
Georgia 8717
If you want to add a specific column from one dataset to the other dataset, you do it like this:
population['incidents'] = data[['columntoappend']]
Your RHS (right-hand side) must be one column, which in your case it is not.
https://www.google.com/amp/s/www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/amp/
The way to do this is as follows, provided that the lengths of your indices are consistent:
population['incidents'] = [x for x in data.state.value_counts().sort_index()]
I can't really explain why your approach results in NaN objects, though. In any case, it would also be incorrect, since you would be assigning an entire Series to each row in the population dataset. With the list comprehension, you're assigning one value to each row.
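For what it's worth, a minimal sketch of an alternative that does not depend on row order (using the data and population frames from the question): assigning a Series aligns on index, and value_counts() is indexed by state name while population has a default integer index, so nothing lines up, which is consistent with the all-NaN column. Mapping the counts onto the state column aligns by state name instead:
counts = data['state'].value_counts()                      # Series indexed by state name
population['incidents'] = population['state'].map(counts)  # align on state name, not position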
I have 2 dataframes I would like to analyze.
df1:
City Time Day
5866128 Los Angeles 3.5 01/09/2019
5172728 New York 14 09/09/2019
4787428 Boston 9 10/09/2019
df2:
City Time Day
5866128 Los Angeles 3.5 01/09/2019
2478987 Denver 10 07/09/2019
5172728 New York 24 09/09/2019
4787428 Boston 4 10/09/2019
1478712 Austin 7 10/09/2019
I would like to create a third dataframe containing only the rows where df2['Time'] - df1['Time'] != 0 (matched on index), plus the rows that are only present in df2.
Desired output :
City Time Day
2478987 Denver 10 07/09/2019
5172728 New York 10 09/09/2019
4787428 Boston -5 10/09/2019
1478712 Austin 7 10/09/2019
I tried to play with numpy.where(), but I can't make it work to compare only the same index.
Thanks
# subtract df1's Time from df2's, treating indices missing from df1 as 0
df2['Time'] = df2['Time'].sub(df1['Time'], fill_value=0)
# keep only the rows where the difference is not 0
df2[df2.Time.ne(0)]
or
df2.assign(Time=df2['Time'].sub(df1['Time'], fill_value=0)).loc[lambda x: x.Time.ne(0)]
Try this one:
df3 = df1.join(df2, rsuffix="_2")
df3 = df3.loc[df3["Time"] != df3["Time_2"]]
df3["Time"] = df3["Time_2"] - df3["Time"]
df3 = df3.drop(["Time_2", "Day_2", "City_2"], axis=1)
I am trying to get only the maximum value for each year from the pandas value_counts() output.
I have tried using the apply function with a lambda, but was not successful:
print(match_won_by_team.apply(lambda x : match_won_by_team[x].index[0]))
remove_duplicate_match_codes = data.drop_duplicates(subset='match_code', keep='first').reset_index(drop=True)
match_won_by_team = remove_duplicate_match_codes.groupby('year')['winner'].value_counts()
print('Match won by each team in respective seasons:- ', match_won_by_team)
I am expecting the output to display 2008: Rajasthan Royals: 13, 2009: Delhi Daredevils: 10 and so on from the series.
2008 Rajasthan Royals 13
Kings XI Punjab 10
Chennai Super Kings 9
2009 Delhi Daredevils 10
Deccan Chargers 9
Royal Challengers Bangalore 9
2010 Mumbai Indians 11
Chennai Super Kings 9
Deccan Chargers 8
I am getting this error when I am using the apply function and lambda on it.
AttributeError: 'numpy.int64' object has no attribute 'index'
IIUC:
I think you need to use the following:
remove_duplicate_match_codes.groupby('year')['winner'].apply(lambda x: x.value_counts().head(1))
This applies value_counts to each year's winners and uses head(1) to keep the first row, i.e. the team with the most wins in that year.
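If it is useful, an alternative sketch that reuses the match_won_by_team series computed above: value_counts() already sorts each year's counts in descending order, so keeping the first row per year gives the same result.
# keep the first (largest) count within each year of the MultiIndex series
top_per_year = match_won_by_team.groupby(level='year').head(1)
print(top_per_year)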
I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.
Here is the original dataset and a sample screenshot is provided below.
The task is to convert the data from monthly values to quarterly values, i.e. I need to aggregate the 2000-01, 2000-02 and 2000-03 data into 2000-Q1, and so on. The new value for 2000-Q1 should be the mean of those three values.
Likewise, 2000-04, 2000-05 and 2000-06 would become 2000-Q2, and the new value should be the mean of 2000-04, 2000-05 and 2000-06.
Here is how I solved the problem.
First I defined a function quarter_rows() which takes a row of data (as a Series), loops through every third element by column index, replaces values (in place) with the mean computed as explained above, and returns the row.
import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).ix[:, '2000-01' : ]
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result. A sample is shown below
But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks
I think that instead of apply you can use resample by quarter and aggregate with mean, but first convert the column names to monthly periods with to_period:
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
Testing:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
#for testing select only the first 10 rows and the columns from Jan 2000 to Jun 2000
housing3 = housing.set_index(["State","RegionName"]).ix[:10, '2000-01' : '2000-06']
print (housing3)
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
State RegionName
NY New York NaN NaN NaN NaN NaN NaN
CA Los Angeles 204400.0 207000.0 209800.0 212300.0 214500.0 216600.0
IL Chicago 136800.0 138300.0 140100.0 141900.0 143700.0 145300.0
PA Philadelphia 52700.0 53100.0 53200.0 53400.0 53700.0 53800.0
AZ Phoenix 111000.0 111700.0 112800.0 113700.0 114300.0 115100.0
NV Las Vegas 131700.0 132600.0 133500.0 134100.0 134400.0 134600.0
CA San Diego 219200.0 222900.0 226600.0 230200.0 234400.0 238500.0
TX Dallas 85100.0 84500.0 83800.0 83600.0 83800.0 84200.0
CA San Jose 364100.0 374000.0 384700.0 395700.0 407100.0 416900.0
FL Jacksonville 88000.0 88800.0 89000.0 88900.0 89600.0 90600.0
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
2000Q1 2000Q2
State RegionName
NY New York NaN NaN
CA Los Angeles 207066.666667 214466.666667
IL Chicago 138400.000000 143633.333333
PA Philadelphia 53000.000000 53633.333333
AZ Phoenix 111833.333333 114366.666667
NV Las Vegas 132600.000000 134366.666667
CA San Diego 222900.000000 234366.666667
TX Dallas 84466.666667 83866.666667
CA San Jose 374266.666667 406566.666667
FL Jacksonville 88600.000000 89700.000000