Percent Change in for-loop - python

I have a dataframe where I've set both the District and the Year as a multilevel index. I want to calculate the percentage change for each column ('DEM', 'REP', etc) for each district for each year.
I have consulted this previous SO question and tried using the following code:
for idx, districts_bydistrict_select in districts_bydistrict.groupby(level=[0, 1]):
    y = districts_bydistrict.pct_change()
    print(pd.DataFrame(y))
However, it does not restart the pct_change() calculation when a new District begins. I realize I am probably missing some part of the for-loop.

You can simply specify the district level in your groupby so that pct_change() restarts for each district:
districts_bydistrict.groupby(level='DISTRICTS').pct_change()
You can unstack the districts so that you just have time in the index, compute pct_change, and then restack the districts.
districts_bydistrict.unstack('DISTRICTS').pct_change().stack()
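For illustration, here is a minimal, self-contained sketch of the grouped approach; the level name 'DISTRICTS', the party columns, and the toy numbers are assumptions based on the question:

import pandas as pd

# Toy frame with a (DISTRICTS, Year) MultiIndex mirroring the question's layout.
idx = pd.MultiIndex.from_product([['D1', 'D2'], [2016, 2018, 2020]],
                                 names=['DISTRICTS', 'Year'])
districts_bydistrict = pd.DataFrame({'DEM': [100, 120, 150, 80, 88, 110],
                                     'REP': [90, 99, 108, 70, 77, 84]},
                                    index=idx)

# pct_change() restarts for each district, so each district's first year is NaN.
print(districts_bydistrict.groupby(level='DISTRICTS').pct_change())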


How to find the mean of subseries in DataFrames?

My personal side project right now is to analyze GDP growth rates per capita. More specifically, I want to find the average growth rate for each decade since 1960 and then analyze it.
I pulled data from the World Bank API ("wbgapi") as a DataFrame:
import pandas as pd
import wbgapi as wb
gdp=wb.data.DataFrame('NY.GDP.PCAP.KD.ZG')
gdp.head()
Output: [image of gdp.head()]
I then used nested for loops to calculate the mean for every decade and added it to a new dataframe.
row, col = gdp.shape
meandata = pd.DataFrame(columns=['Country', 'Decade', 'MeanGDP', 'Region'])
for r in range(0, row, 1):
    countrydata = gdp.iloc[r]
    for c in range(0, col-9, 10):
        decade = 1960 + c
        tenyeargdp = countrydata.array[c:c+10].mean()
        meandata = meandata.append({'Country': gdp.iloc[r].name, 'Decade': decade, 'MeanGDP': tenyeargdp}, ignore_index=True)
meandata.head(10)
The code works and generates the following output: [image of meandata]
However, I have a few questions about this step:
Is there a more efficient way to access the subseries of dataframes? I read that for loops should never be used for dataframes and that one should vectorize operations on dataframes instead.
Is the complexity O(n^2) since there are 2 for loops?
The second step is to group the individual countries by region for future analysis. To do so, I rely on the World Bank API, which has its own regions, each with a list of member economies/countries.
I iterated through the regions and the member list of each region. If a country is part of a region's member list, I added that region to its 'Region' entry.
Since an economy/country can be part of multiple regions (i.e. the 'USA' can be part of NA and HIC (high-income country)), I concatenated the region to the previously added regions.
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        str1 = '-' + meandata.loc[meandata['Country']==co, ['Region']].astype(str)
        meandata.loc[meandata['Country']==co, ['Region']] = rg['code'] + str1
The code mostly works; however, it sometimes gives an error message that 'meandata' is not defined. I use JupyterLab.
Additionally, is there a simpler/more efficient way of doing the second step?
Thanks for reading and helping. Also, this is my first python/pandas coding experience, and as such general feedback is appreciated.
Consider using groupby:
The aggregation will be based on the list of columns you pass to groupby. In the sample below I take the mean of 'MeanGDP' for each 'Country' and 'Region'.
meandata = meandata.groupby(['Country', 'Region']).agg({'MeanGDP': 'mean'}).reset_index()
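Beyond the aggregation, both of the question's steps can also be done without explicit loops. The following is only a sketch: it assumes the year columns come back named like 'YR1960' with the economy code as the index (which is how wbgapi typically returns them), so adjust the names if your frame differs.

import pandas as pd
import wbgapi as wb

gdp = wb.data.DataFrame('NY.GDP.PCAP.KD.ZG')

# Step 1: map each year column (e.g. 'YR1963') to its decade (1960, 1970, ...)
decades = gdp.columns.str.extract(r'(\d{4})', expand=False).astype(int) // 10 * 10

# average the columns of each decade in one pass; this replaces both nested loops
decade_means = gdp.T.groupby(decades.values).mean().T

# reshape into the long Country/Decade/MeanGDP layout used above
meandata = (decade_means.rename_axis(index='Country', columns='Decade')
                        .stack()
                        .rename('MeanGDP')
                        .reset_index())

# Step 2: build a Country -> Region mapping once, then merge it in
pairs = [(co, rg['code'])
         for rg in wb.region.list()
         for co in wb.region.members(rg['code'])]
regions = (pd.DataFrame(pairs, columns=['Country', 'Region'])
             .groupby('Country')['Region']
             .agg('-'.join)
             .reset_index())
meandata = meandata.merge(regions, on='Country', how='left')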

Pandas dataframe: summing cell data from a group of rows, storing in a new column

As part of a treatment for a health-related issue, I need to measure my liquid intake (along with some other parameters), registering the amount of liquid every time I drink. I have a dataframe of several months of such registrations.
I want to sum my daily amount in an additional column (in red, image below)
As you may see, I would like to store it in the first row of each day's slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total']= df.groupby(df.Date)['Drank'].fillna(0).sum()
But seems not to be the way to do it.
Grateful for any advice.
Thanks
Michael
Use the fact that False == 0.
The first row of each date is where the date is not equal to shift() of the date.
Use merge() to bring in the sum.
import numpy as np
import pandas as pd

## construct a data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20, 300, len(d)).astype(float)
A.ravel()[np.random.choice(A.size, A.size//2, replace=False)] = np.nan
df = pd.DataFrame({"datetime": d, "Drank": A})
df = df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time).drop(columns=["datetime"]).loc[:, ["Date", "Time", "Drank"]]
## construction done

# flag rows whose Date equals the previous row's Date, i.e. everything except the first row of each date
# merge the per-date Total back; it only matches the first row of each date because False == 0
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank", "sum")).assign(row=0),
    on=["Date", "row"], how="left").drop(columns="row")

groupby.mean function dividing by pre-group count rather than post-group count

So I have the following dataset of trade flows that tracks imports and exports by reporting country and partner country. After I remove some unwanted columns, I edit my data frame so that trade flows between country A and country B are shown. I'm left with something like this:
[image of the data frame]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(["IRN"])) &
       (df.partner_code.isin(['TCD']))]
grouped = x.groupby(['partner_code']).mean()
I end up getting the average of all exports taken over every row where there is a 'product_id' (so divided by a much higher count), rather than averaging imports or exports by their totals for all the years.
Taking the average of the following 5 export values gives an incorrect average: [images of the 5 export values and the wrong average]
In pandas, we can groupby multiple columns; based on my understanding, you want to group by partner, country and year.
The following line would work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting dataframe has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
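If the goal is instead the average of the yearly totals per partner (one possible reading of the question), a sketch using the column names from the snippets above is to sum the product-level rows per partner and year first, then average those totals:

# total the product-level rows for each partner and year...
yearly = (x.groupby(['partner_code', 'year'], as_index=False)
            [['import_value', 'export_value']].sum())
# ...then average the yearly totals per partner
avg_by_partner = yearly.groupby('partner_code')[['import_value', 'export_value']].mean()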

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==familia)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (even NumPy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options for conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or,
as the docs indicate, broadcasts to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==familia) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely to assign new data frame columns. Also, below, div is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
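One small caveat: the sample table above shows the sales column as 'Sales', while the loop and the transform() lines use 'Qty'. If the frame actually uses 'Sales', the same pattern applies with that name, e.g. (a sketch):

testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Sales'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Sales'].transform('sum')))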

Selective summation of columns in a pandas dataframe

The COVID-19 tracking project (api described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic is hitting different states differently -- New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not "NY". What I want to do, though, is exclude multiple states with something like a "not in" function (this doesn't work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the 'positive' column, with 'NY' standing in for 'tristate'):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
To get the expected output, you can group by date and an np.where that labels the states isin your list as 'tristate' and everything else as 'not_tristate', sum 'positive', then unstack and assign the total column:
import numpy as np

df_f = all_states.groupby(['date',
                           np.where(all_states['state'].isin(["NY","NJ","CT"]),
                                    'tristate', 'not_tristate')])\
                 ['positive'].sum()\
                 .unstack()\
                 .assign(total=lambda x: x.sum(axis=1))
print(df_f)
          not_tristate  tristate   total
date
20200502         53128    312977  366105
20200503         54563    316415  370978
20200504         55893    318953  374846
20200505         57179    321192  378371
Or with pivot_table, you get a similar result with:
print(all_states.assign(state=np.where(all_states['state'].isin(["NY","NJ","CT"]),
                                       'tristate', 'not_tristate'))
                .pivot_table(index='date', columns='state', values='positive',
                             aggfunc='sum', margins=True))
state     not_tristate  tristate      All
date
20200502         53128    312977   366105
20200503         54563    316415   370978
20200504         55893    318953   374846
20200505         57179    321192   378371
All             220763   1269537  1490300
You can exclude multiple states by using isin with the NOT operator (~):
all_states[~(all_states['state'].isin(["NY", "NJ", "CT"]))]
So, your code would be:
not_tristate = all_states[~(all_states['state'].isin(['NY','NJ','CT']))].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
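If you also want the two-columns-plus-total layout shown in the question from this filtering approach, a short sketch (using the 'positive' column from the sample data) is to compute both sums and merge them on date:

tristate = all_states[all_states['state'].isin(['NY', 'NJ', 'CT'])].groupby('date', as_index=False).positive.sum()
not_tristate = all_states[~all_states['state'].isin(['NY', 'NJ', 'CT'])].groupby('date', as_index=False).positive.sum()
result = not_tristate.merge(tristate, on='date', suffixes=('_not_tristate', '_tristate'))
result['total'] = result['positive_not_tristate'] + result['positive_tristate']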
