Pandas best practice: remove column names?

I'm sorting a dataframe containing stock capitalisations from largest to smallest row-wise (I will compute the ratio of top 10 stocks vs the whole market as a proxy for concentration).
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(), index=stock_mv.columns)
stock_mv = stock_mv.apply(f, axis=1)
When I do this, however, the column names (tickers) no longer make sense. I read somewhere that you shouldn't delete column names or have them set to the same thing.
What is the best practice thing to do in this situation?
Thank you very much - I am very much a novice coder.

If I understand your problem right, you want to sort a dataframe row-wise. If that's the case, try this:
stock_mv = stock_mv.sort_values(axis=1, ascending=False)
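Note that DataFrame.sort_values requires a by argument even when axis=1, so the one-liner above sorts the columns by the values in one chosen row rather than sorting every row independently. If the goal really is to sort each row on its own (and give the columns rank-based names, since the tickers no longer line up with the values), here is a minimal NumPy-based sketch; the rank_ labels are made up for illustration:

import numpy as np
import pandas as pd

# Sort each row's values from largest to smallest; NaNs end up last in every row.
sorted_vals = -np.sort(-stock_mv.to_numpy(), axis=1)

# The original tickers no longer line up with the sorted values, so label the
# columns by rank instead (rank_1 is the largest capitalisation in each row).
stock_mv_sorted = pd.DataFrame(
    sorted_vals,
    index=stock_mv.index,
    columns=[f"rank_{i}" for i in range(1, stock_mv.shape[1] + 1)],
)

# Top-10 concentration ratio per row, as described in the question.
top10_share = stock_mv_sorted.iloc[:, :10].sum(axis=1) / stock_mv_sorted.sum(axis=1)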

Related

How to find the mean of subseries in DataFrames?

My personal side project right now is to analyze GDP growth rates per capita. More specifically, I want to find the average growth rate for each decade since 1960 and then analyze it.
I pulled data from the World Bank API ("wbgapi") as a DataFrame:
import pandas as pd
import wbgapi as wb
gdp=wb.data.DataFrame('NY.GDP.PCAP.KD.ZG')
gdp.head()
Output: (screenshot of gdp.head())
I then used nested for loops to calculate the mean for every decade and added it to a new dataframe.
row, col = gdp.shape
meandata = pd.DataFrame(columns=['Country', 'Decade', 'MeanGDP', 'Region'])
for r in range(0, row, 1):
    countrydata = gdp.iloc[r]
    for c in range(0, col - 9, 10):
        decade = 1960 + c
        tenyeargdp = countrydata.array[c:c + 10].mean()
        meandata = meandata.append({'Country': gdp.iloc[r].name, 'Decade': decade, 'MeanGDP': tenyeargdp}, ignore_index=True)
meandata.head(10)
The code works and generates the following output: (screenshot of meandata.head(10))
However, I have a few questions about this step:
Is there a more efficient way to access these subseries of a dataframe? I read that for loops should never be used on dataframes and that one should vectorize operations instead; is that right?
Is the complexity O(n^2), since there are two for loops?
The second step is to group the individual countries by region for future analysis. To do so I rely on the World Bank API, which defines its own regions, each with a list of member economies/countries.
I iterated through the regions and through the member list of each region. If a country is part of a region's member list, I add that region to its row.
Since an economy/country can be part of multiple regions (e.g. the 'USA' can be part of NA and HIC (high-income country)), I concatenate the new region code onto the previously added regions.
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        str1 = '-' + meandata.loc[meandata['Country'] == co, ['Region']].astype(str)
        meandata.loc[meandata['Country'] == co, ['Region']] = rg['code'] + str1
The code mostly works; however, sometimes it gives the error message that 'meandata' is not defined. I use JupyterLab.
Additionally, is there a simpler/more efficient way of doing the second step?
Thanks for reading and helping. Also, this is my first python/pandas coding experience, and as such general feedback is appreciated.
Consider using groupby:
The aggregation will be based on the columns you pass in the list given to groupby.
In the sample below I group by 'Country' and 'Region' and take the mean of 'MeanGDP'.
meandata = meandata.groupby(['Country', 'Region']).agg({'MeanGDP': 'mean'}).reset_index()
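For the decade means themselves, here is a loop-free sketch. It assumes the wbgapi frame keeps one row per country and names its year columns like 'YR1960', 'YR1961', ... (check gdp.columns if yours differ); the intermediate names are made up for illustration:

import pandas as pd
import wbgapi as wb

# Keep only the year columns and relabel them as integers (1960, 1961, ...).
year_cols = [c for c in gdp.columns if c.startswith('YR')]
years = gdp[year_cols].rename(columns=lambda c: int(c[2:]))

# Group the year columns into decades (1960, 1970, ...) and average within each,
# skipping missing values just like Series.mean() does.
decade_means = years.T.groupby(lambda y: y // 10 * 10).mean().T

# Reshape to the long Country / Decade / MeanGDP layout used above.
meandata = (decade_means
            .stack()
            .rename('MeanGDP')
            .rename_axis(['Country', 'Decade'])
            .reset_index())

And for the second step, a sketch that builds the country-to-region string once with a plain dict and then maps it on in a single pass, instead of filtering meandata inside the nested loops (region codes joined with '-' as in the question):

region_map = {}
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        # Append this region code to whatever the country already has, joined by '-'.
        region_map[co] = rg['code'] if co not in region_map else region_map[co] + '-' + rg['code']

meandata['Region'] = meandata['Country'].map(region_map)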

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the python package sportsreference. I need to use two dataframes, one called df which has the game date and one called box_index (shown below) which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which should make it easier to do this with regex or .contains(). I keep getting a KeyError: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas dataframes, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc"):
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        # plain strings have no .contains method, so test membership with "in"
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
The ".at" function will overwrite the value at a given cell.
Just FYI, iterrows is more efficient than .loc; however, itertuples is about 10x faster, and zip about 100x.
The KeyError: 0 error is saying you can't get the row at index 0, because there is no index value of 0 when you use box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc to that.
Like the other post said though, I wouldn't go that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that. Then use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]  # box_index, not boxscore_index (which is undefined here)
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")
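If df is available, here is an alternative sketch that sidesteps the index problem entirely: collect the links once and map each date onto the first link that starts with it. This assumes the date string in df['date'] is literally a prefix of the boxscore index, as the question describes; the function name add_boxscore_links and the matching helper are made up for illustration:

from sportsreference.ncaab.schedule import Schedule

def add_boxscore_links(name, df):
    box_links = list(Schedule(name).dataframe["boxscore_index"])

    def match_link(date):
        # Find the (assumed unique) link whose text starts with this row's date.
        hits = [link for link in box_links if str(link).startswith(str(date))]
        return hits[0] if hits else date  # leave the date unchanged if nothing matches

    df["date"] = df["date"].map(match_link)
    return df

df = add_boxscore_links("Virginia", df)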

Pandas: How does corrwith() work in this function?

The function is to find the correlation of any store with another store.
Input: the store number which is to be compared.
Output: a dataframe with correlation coefficient values.
def calcCorr(store):
    a = []
    metrix = pre_df[['TOT_SALES', 'TXN_PER_CUST']]  # add metrics as required, e.g. 'TXN_PER_CUST'
    for i in metrix.index:
        a.append(metrix.loc[store].corrwith(metrix.loc[i[0]]))
    df = pd.DataFrame(a)
    df.index = metrix.index
    df = df.drop_duplicates()
    df.index = [s[0] for s in df.index]
    df.index.name = "STORE_NBR"
    return df
I don't understand this part: corrwith(metrix.loc[i[0]]). Why is there a [0]? Thanks for your help!
The dataframe pre_df looks like this: (screenshot omitted; it has a MultiIndex whose first level is STORE_NBR)
As commented, this should not be the way to go, as it produces a lot of duplicates: the index of pre_df is a MultiIndex, so each i is a tuple and i[0] is only its first level (the STORE_NBR), which means the loop recomputes the same store's correlations once for every row of that store. The function can be written as:
def calcCorr1(store, df):
    return pd.DataFrame({k: df.loc[store].corrwith(df.loc[k])
                         for k in df.index.unique('STORE_NBR')
                         }).T
Notice that instead of looping through all the rows, we only loop through the unique values in the first level (STORE_NBR). Since each store contains many rows, this is roughly an order of magnitude less runtime.
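A usage sketch, assuming pre_df has a MultiIndex whose first level is STORE_NBR; the store number 77 is made up for illustration:

# Correlate store 77's metrics with every other store's metrics.
corr_77 = calcCorr1(77, pre_df[['TOT_SALES', 'TXN_PER_CUST']])
print(corr_77.sort_values('TOT_SALES', ascending=False).head())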

Select specific rows on pandas based on condition

I have a dataframe containing a column called bmi (Body Mass Index) containing int values
I have to separate the values in bmi column into Under weight, Normal, Over weight and Obese based on the values. Below is the loop for the same
However I am getting an error. I am a beginner. Just started coding 2 weeks back.
Generally speaking, using a for loop in pandas is usually a bad idea. Pandas allows you to manipulate data easily. For example, if you want to filter by some condition:
print(df[df["bmi"] > 30])
will print all rows where bmi > 30. It works as follows: df[condition]. The condition in this case is that "bmi" is larger than 30, so our condition is df["bmi"] > 30. Notice that the line df[df["bmi"] > 30] returns all rows that satisfy the condition. I printed them, but you can manipulate them however you like.
Even though it's a bad technique (or one used only for specific needs), you can of course iterate through a dataframe. This is not done via for l in df, as df is a dataframe object. To iterate through it you can use iterrows:
for index, row in df.iterrows():
    if row["bmi"] > 30:
        print("Obese")
Also, for next time, please provide your code inline; don't paste an image of it.
If your goal is to separate into different labels, I suggest the following:
df.loc[df["bmi"] > 30, "NewColumn"] = "Obese"
df.loc[df["bmi"] < 18.5, "NewColumn"] = "Underweight"
The .loc operator allows me to manipulate only part of the data. Its format is [rows, columns]. So the above code takes only the rows where bmi > 30, and it takes only "NewColumn" (rename it to whatever you like), which is a new column. It puts the value on the right into this column. That way, after that operation, you have a new column in your dataframe which has "Obese"/"Underweight" as appropriate.
As a side note, there are better ways to map values (e.g. pandas' map and others), but if you are a beginner, it's important to understand simple methods of manipulating data before diving into more complex ones. That's why I am avoiding explaining the more complex methods here.
First of all, as mentioned in the comments, you should post text/code instead of screenshots.
You could do binning in pandas:
bmi_labels = ['Normal', 'Overweight', 'Obese']
cut_bins = [18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)
Here, I have made a separate column (bmi_label) to store the label, but you could do it in the same column (bmi) too.
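If you also need the Under weight label from the question, here is a small sketch extending the same bins; the 0 lower edge is an assumption, and it only has to sit below the smallest bmi value so that underweight rows are not turned into NaN:

import pandas as pd

bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
cut_bins = [0, 18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)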

What is wrong with this lambda function? Pandas and Python dataframe

I wrote a lambda function that should be fast, but this is taking a very long time. Is there a better way to write this?
fn = lambda x: shape(df[df.CustomerCard_Num == x.CustomerCard_Num])[0]
df['tottrans'] = df.apply(fn, axis = 1)
Basically, I have a big database of transactions (rows). A set of rows might correspond to different customers (the customer card number is a column in df; multiple rows might have the same df.CustomerCard_Num).
I am trying to count the number of rows for each customer with this lambda function. But it does not seem to work quickly. Should I be using groupby?
There is a built-in way:
df.CustomerCard_Num.value_counts()
See the docs
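If the goal is to attach that per-customer count to every row (the tottrans column from the question), here is a short sketch using groupby/transform instead of the lambda; this is an alternative, not part of the answer above:

# Broadcast each customer's row count back onto every one of their rows.
df['tottrans'] = df.groupby('CustomerCard_Num')['CustomerCard_Num'].transform('size')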
