How to use multiple pandas functions in a single variable python - python

I want to drop a column level and columns to the right, from data downloaded from yahoo finance.
FAANG = yf.download(['AAPL','GOOGL','NFLX','META','AMZN','SPY'],
                    start = '2008-01-01',end = '2022-12-31')
FAANG_AC = FAANG.drop(FAANG.columns[6:36],axis=1)
FAC = FAANG_AC.droplevel(0,axis=1)
How do I combine .drop and .droplevel into a single variable, so that I do not have to use multiple variables in this situation?

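You don't need to use intermediate variables. You can chain everything. Inside the chain the downloaded frame has no name yet, so refer to it through a lambda passed to .pipe rather than through FAANG.columns:
FAANG = (yf.download(['AAPL','GOOGL','NFLX','META','AMZN','SPY'],
                     start='2008-01-01', end='2022-12-31')
           .pipe(lambda df: df.drop(df.columns[6:36], axis=1))
           .droplevel(0, axis=1)
        )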
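As a rough illustration of the chaining idea, here is a minimal sketch on made-up data instead of a real download. The frame `prices`, its column labels and the use of `.iloc` to keep the first block of columns are assumptions for the example, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: two column levels
# (price type, ticker), as in the question.
columns = pd.MultiIndex.from_product([["Adj Close", "Close"], ["AAPL", "SPY"]])
prices = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    columns=columns,
)

# One chained expression: keep the first two columns (the "Adj Close" block),
# then drop the outer column level so only the tickers remain.
adj_close = (
    prices
    .iloc[:, :2]
    .droplevel(0, axis=1)
)

print(adj_close)
```

Selecting the columns to keep by position avoids having to name the intermediate frame at all, which is why it fits naturally into a chain.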

You can add inplace=True as a parameter when calling drop. Like:
FAANG.drop(FAANG.columns[6:36],axis=1, inplace=True)
Careful: it will modify the FAANG variable. Note that droplevel has no inplace parameter, so its result still has to be assigned to a variable.
Reference: https://www.askpython.com/python-modules/pandas/inplace-true-parameter

Related

How to loop through a pandas data frame using a columns values as the order of the loop?

I have two CSV files which I'm using in a loop. In one of the files there is a column called "Availability Score". Is there a way to make the loop iterate through the records in descending order of this column? I thought I could use Ob.sort_values(by=['AvailabilityScore'],ascending=False) to change the order of the dataframe first, so that when the loop starts it will already be in the right order. I've tried this out and it doesn't seem to make a difference.
# import the data
CF = pd.read_csv (r'CustomerFloat.csv')
Ob = pd.read_csv (r'Orderbook.csv')
# Convert to dataframes
CF = pd.DataFrame(CF)
Ob = pd.DataFrame(Ob)
#Remove SubAssemblies
Ob.drop(Ob[Ob['SubAssembly'] != 0].index, inplace = True)
#Sort the data by their IDs
Ob.sort_values(by=['CustomerFloatID'])
CF.sort_values(by=['FloatID'])
#Sort the orderbook by its availability score
Ob.sort_values(by=['AvailabilityScore'],ascending=False)
# Loop for Urgent Values
for i, rowi in CF.iterrows():
    count = 0
    urgent_value = 1
    for j, rowj in Ob.iterrows():
        if(rowi['FloatID']==rowj['CustomerFloatID'] and count < rowi['Urgent Deficit']):
            Ob.at[j,'CustomerFloatPriority'] = urgent_value
            count+= rowj['Qty']
You need to add inplace=True, like this:
Ob.sort_values(by=['AvailabilityScore'],ascending=False, inplace=True)
sort_values() (like most pandas methods nowadays) is not in-place by default. You should assign the result back to the variable that holds the DataFrame:
Ob = Ob.sort_values(by=['AvailabilityScore'], ascending=False)
# ...
BTW, while you can pass inplace=True as argument to sort_values(), I do not recommend it. Generally speaking, inplace=True is often considered bad practice.
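The difference between discarding the sorted result and assigning it back can be seen on a small made-up frame. The column values and the `OrderID` column are invented for this sketch.

```python
import pandas as pd

# Hypothetical order book with three rows.
orders = pd.DataFrame(
    {
        "OrderID": ["a", "b", "c"],
        "AvailabilityScore": [0.2, 0.9, 0.5],
    }
)

# The sorted copy is thrown away here, so `orders` keeps its original order.
orders.sort_values(by="AvailabilityScore", ascending=False)
ids_without_assignment = [row["OrderID"] for _, row in orders.iterrows()]

# Assigning the result back makes the loop see the sorted order.
orders = orders.sort_values(by="AvailabilityScore", ascending=False)
ids_with_assignment = [row["OrderID"] for _, row in orders.iterrows()]

print(ids_without_assignment)
print(ids_with_assignment)
```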

Function to create a DF

I want to create a DF from another DF using a function like this:
def create_df_region(df,region):
    df = pd.DataFrame(index=df_reduced.index)
    df['Cons'] = df_reduced['ind_{region}'.format()].value
The problem is: ind_{} can take values like ind_s, ind_n, ind_no, and I want to pass these values when creating the DF, because n means north, s means south and so on.
Then, to create the df:
df_south = create_df_region(df_reduced, s)
where s means the south, because in df_reduced I have columns ind_s, ind_n...
How can I do it? The way I am trying above is not working.
You need to return the newly created dataframe at the end of the function, use .values instead of .value, and use an f-string to build the source column name, as follows:
def create_df_region(df, region):
    df = pd.DataFrame(index=df_reduced.index)
    df['Cons'] = df_reduced[f'ind_{region}'].values # use .values instead of .value
    return df
Also, when you call the function, you need to pass a string 's' instead of the variable name s as follows:
df_south = create_df_region(df_reduced, 's')
Use f'ind_{region}' instead of .format(), use .values instead of .value, and return the new dataframe:
def create_df_region(df_reduced,region):
    df = pd.DataFrame(index=df_reduced.index)
    df['Cons'] = df_reduced[f'ind_{region}'].values
    return df
*I've also changed the first parameter of the function from df to df_reduced so that the function uses its own argument.
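Putting the points from both answers together, a minimal runnable sketch could look like the following. The sample frame and the parameter name `source` are made up for illustration.

```python
import pandas as pd


def create_df_region(source, region):
    """Build a new frame whose 'Cons' column is copied from source['ind_<region>']."""
    result = pd.DataFrame(index=source.index)
    result["Cons"] = source[f"ind_{region}"].values
    return result


# Hypothetical input with one column per region.
df_reduced = pd.DataFrame(
    {"ind_s": [1.0, 2.0], "ind_n": [3.0, 4.0]},
    index=["2020", "2021"],
)

# The region is passed as a string, not as a bare name.
df_south = create_df_region(df_reduced, "s")
print(df_south)
```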

Pandas dataframe groupby value_counts

I tried this code and it works perfectly, but when I remove "RIAGENDR" from
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
it shows me an error. What is the reason?
Please help me with that.
dm=ds[ds["RIAGENDR"]=="male"]
dm.RIDAGEYR=pd.cut(dm.RIDAGEYR,[18,30,40,50,60,70,80,100])
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
dx=dx.value_counts()
dx=dx.unstack()
dx = dx.apply(lambda x: x/x.sum(), axis=0)
#dx=dx.to_string(float_format="%.3f")
dx
You should not need [] when you group by a single variable. The list form is meant for several variables, as in:
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
For a single variable, one way is to use:
dx=dm.groupby(by=["RIAGENDR"])
You can read more about this in the groupby documentation, in the part on hierarchical indexes:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
You should remove the [] brackets from within the groupby operation.
To be clear, if you want to group by 2 variables you should use the below code which uses a list of variables:
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
If you want to group by 1 variable, you should not use a list of variables, just a single one:
dx=dm.groupby("RIAGENDR")["DMDMARTL"]
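For reference, here is a small sketch of the single-variable case on invented data, following the same value_counts / unstack steps as the question. The age labels and marital-status values are made up, and `fill_value=0` is an addition so that missing combinations count as zero.

```python
import pandas as pd

# Hypothetical sample: only males, with pre-binned age groups.
dm = pd.DataFrame(
    {
        "RIAGENDR": ["male", "male", "male", "male"],
        "RIDAGEYR": ["18-30", "18-30", "30-40", "30-40"],
        "DMDMARTL": ["married", "single", "married", "married"],
    }
)

# Group by a single column name (no list), then count marital status per age group.
counts = dm.groupby("RIDAGEYR")["DMDMARTL"].value_counts().unstack(fill_value=0)

# Same normalisation as in the question: divide each column by its total.
shares = counts.apply(lambda x: x / x.sum(), axis=0)

print(counts)
print(shares)
```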

How to pass variables to 'use_cols' argument of pd.read_excel

I have a code as below
df = pd.read_excel(filepath,sheet_name=sheet_name,skiprows=skiprows, use_cols='A:O')
This works just fine. However, the columns change from sheet to sheet, so I want to give the user an input option where they enter the column names (A, B, ...) for the from_col and to_col variables, and then use those names in the use_cols argument.
However, I am not able to use the variable directly in the argument use_cols. What I am doing now is as below
from_col = 'A'
to_col='O'
a_l = string.ascii_uppercase
w_l=a_l[a_l.index(from_col):a_l.index(to_col)]
df = pd.read_excel(filepath,sheet_name=sheet_name,skiprows=skiprows, use_cols=w_l)
Now, the question is: is there a way to pass variables to the 'use_cols' argument of pd.read_excel directly, or a simpler way than what I am using now?
Update
The code above that I am using is not working properly: it reads up to column O no matter what values I pass in from_col and to_col, and I am not sure why. The list w_l updates properly, but use_cols seems to be ignoring it!
You can simply create a string and pass it as the usecols argument (note the spelling: the parameter is usecols, not use_cols), like this:
from_col = 'A'
to_col='O'
w_l = f"{from_col}:{to_col}" # 'A:O'
df = pd.read_excel(filepath, usecols=w_l)
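If the column letters come from user input, a small helper can build and sanity-check the range string before it is handed to read_excel. This is only a sketch: the helper name `excel_col_range` and its validation rules are assumptions, and the read_excel call is left commented out because it needs a real file.

```python
import string


def excel_col_range(from_col, to_col):
    """Return an Excel column range such as 'A:O', suitable for usecols."""
    from_col = from_col.strip().upper()
    to_col = to_col.strip().upper()
    for name in (from_col, to_col):
        # Accept only non-empty names made of the letters A-Z.
        if not name or any(ch not in string.ascii_uppercase for ch in name):
            raise ValueError(f"not an Excel column name: {name!r}")
    return f"{from_col}:{to_col}"


usecols = excel_col_range("a", "o")
print(usecols)

# With a real workbook, the string would then be passed on, for example:
# df = pd.read_excel(filepath, sheet_name=sheet_name, skiprows=skiprows, usecols=usecols)
```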

nested for loops, using values to create columns

I'm pretty new to Python programming. I read a csv file into a dataframe with the median house price of each month as columns. Now I want to create columns holding the mean value of each quarter, e.g. create column housing['2000q1'] as the mean of 2000-01, 2000-02 and 2000-03, column housing['2000q2'] as the mean of 2000-04, 2000-05 and 2000-06, and so on.
raw dataframe named 'Housing'
I tried to use nested for loops as below, but always come with errors.
for i in range (2000,2017):
    for j in range (1,5):
        Housing[i 'q' j] = Housing[[i'-'j*3-2, i'-'j*3-1, i'_'j*3]].mean(axis=1)
Thank you!
Usually, we work with data where the rows are time, so it's good practice to do the same and transpose your data by starting with df = Housing.set_index('CountyName').T (also, variable names should usually start with a lowercase letter, but this isn't important here).
Since your data is already in such a nice format, there is a pragmatic solution (in the sense that you need not know too much about datetime objects and methods), continuing from the transposed df:
df = df.drop('SizeRank') # Drop the SizeRank row first, to avoid including it in the calculation of means
df = df.reset_index() # This moves the dates to a column named 'index'
df = df.rename(columns = {'index':'quarter'}) # Rename this column into something more meaningful
# Rename the months into the appropriate quarters (str.replace returns a new Series, so assign it back)
df['quarter'] = df['quarter'].str.replace('-01|-02|-03', 'q1', regex = True)
df['quarter'] = df['quarter'].str.replace('-04|-05|-06', 'q2', regex = True)
df['quarter'] = df['quarter'].str.replace('-07|-08|-09', 'q3', regex = True)
df['quarter'] = df['quarter'].str.replace('-10|-11|-12', 'q4', regex = True)
values = df.drop(columns = 'quarter') # The price columns only, without the quarter labels
df['total'] = values.sum(axis = 1) # The totals on each month
df['c'] = values.notnull().sum(axis = 1) # Count the number of non-empty entries on each month
g = df.groupby('quarter')[['total','c']].sum()
g['q_mean'] = g['total']/g['c']
g
g['q_mean'] or g[['q_mean']] should give you the required answer.
Note that we needed to compute the mean manually because you had missing data; otherwise, df.groupby('quarter').mean().mean(axis = 1) would have immediately given you the answer you needed.
A remark: the technically 'correct' way would be to convert your dates into a datetime-like object (which you can do with the pd.to_datetime() function), then run a groupby with a pd.Grouper(freq = 'Q') argument (pd.TimeGrouper(), its older equivalent, has been removed from pandas); this would certainly be worth learning more about if you are going to work with time-indexed data a lot.
You can achieve this very simply by using pandas' resample function to compute quarterly averages.
pandas resampling: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
offset names summary: pandas resample documentation
In order to use this function, you need to have only time as columns, so you should temporarily set CountryName and SizeRank as indexes.
Code:
QuarterlyAverage = Housing.set_index(['CountryName', 'SizeRank'], append = True)\
    .resample('Q', axis = 1).mean()\
    .reset_index(['CountryName', 'SizeRank'], drop = False)
Thanks to @jezrael for suggesting axis = 1 in resampling
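As a hedged alternative to the answers above, the sketch below computes per-county quarterly means on a tiny invented table by mapping each month label to a quarter label and grouping on it. It uses pd.PeriodIndex and groupby rather than resample, and assumes the month columns are strings like '2000-01'; the county names and prices are made up.

```python
import pandas as pd

# Hypothetical data: one row per county, one column per month.
housing = pd.DataFrame(
    {
        "CountyName": ["A", "B"],
        "2000-01": [100.0, 10.0],
        "2000-02": [110.0, None],
        "2000-03": [120.0, 30.0],
        "2000-04": [200.0, 40.0],
        "2000-05": [210.0, 50.0],
        "2000-06": [220.0, 60.0],
    }
)

monthly = housing.set_index("CountyName")

# Turn each month label into a quarter label, e.g. '2000-02' -> '2000q1'.
periods = pd.PeriodIndex(monthly.columns, freq="M")
quarter_labels = pd.Index([f"{p.year}q{p.quarter}" for p in periods])

# Transpose so months are rows, average within each quarter, transpose back.
# mean() skips missing values, so gaps do not need special handling.
quarterly = monthly.T.groupby(quarter_labels).mean().T

print(quarterly)
```

The resulting columns can be joined back onto the original frame if the quarterly values are wanted next to the monthly ones.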
