Finding median of entire pandas DataFrame - python

I'm trying to find the median flow of the entire dataframe. The first part of this is to select only certain items in the dataframe.
There were two problems with this: it included parts of the dataframe that aren't in 'states', and the median was not a single value, it was computed per row. How would I get the overall median of all the data in the dataframe?

Two options:
1) A pandas option:
df.stack().median()
2) A numpy option:
np.median(df.values)
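Both give the same scalar; a quick sketch with made-up numbers:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# stack() flattens the frame into one Series, so median() runs over every cell
print(df.stack().median())   # 3.5
# np.median over the underlying 2-D array gives the same result, but note it
# does not skip NaN values (stack() drops them by default; use np.nanmedian otherwise)
print(np.median(df.values))  # 3.5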

The DataFrame you pasted is slightly messy due to some spaces. But you're going to want to melt the DataFrame and then use median() on the new melted DataFrame:
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2['value'].median())
Your DataFrame may be slightly different, but the concept is the same. Check the comment that I left above to understand pd.melt(), especially the value_vars and id_vars arguments.
Here is a very detailed way of how I went about cleaning and getting the correct answer:
import pandas as pd

# reading in on clipboard
df = pd.read_clipboard()
# printing it out to see the data and the column names
print(df)
print(df.columns)
# melting the DF and then printing the result
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2)
# creating a new DF with no nulls in it for ease of code readability,
# using .copy() to avoid the pandas warning about working on top of a copy
df3 = df2.dropna().copy()
# there were some funky values in the 'value' column; just getting rid of those
df3.loc[df3.value.isin(['Columbia', 'of']), 'value'] = 99
# printing out the cleaned version and getting the median
print(df3)
print(df3['value'].median())

Related

Python Pandas drop a specific range of columns that are all NaN

I'm attempting to drop a range of columns in a pandas dataframe that are all NaN. I know that the following code:
df.dropna(axis=1, how='all', inplace=True)
will search all the columns in the dataframe and drop the ones that are all NaN.
However, when I extend this code to a specific range of columns:
df[df.columns[48:179]].dropna(axis=1, how='all', inplace=True)
the result is the original dataframe with no columns removed. I also know for a fact that the selected range has multiple columns that are all NaN.
Any idea what I might be doing wrong here?
Don't use inplace=True here. df[df.columns[48:179]] returns a copy, so the in-place drop runs on that temporary copy and never reaches df. Assigning the result back to the slice can't remove columns either, since indexing assignment only sets values. Instead, find the all-NaN columns in the range and drop them from the original frame:
cols = df.columns[48:179]
empty = cols[df[cols].isna().all()]
df = df.drop(columns=empty)
inplace=True can only be used when you apply changes to the whole dataframe; it won't work on a range of columns. Try dropna without inplace=True (for instance in a Jupyter notebook) to see the result.
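A quick check on a small made-up frame shows the columns in the range actually disappearing:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2], 'b': [np.nan, np.nan], 'c': [3, np.nan]})
cols = df.columns[1:3]               # the range to check, here 'b' and 'c'
empty = cols[df[cols].isna().all()]  # only 'b' is all NaN
df = df.drop(columns=empty)
print(df.columns.tolist())           # ['a', 'c']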

Python Pandas - filter pandas dataframe to get rows with minimum values in one column for each unique value in another column

Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode': ['A','A','A','A','B','B','B','C','C'],
                        'INVYR': [2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC': ['a','b','c','d','e','f','g','h','i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode': ['A','A','A','B','B','C'],
                         'INVYR': [2000,2000,2000,1990,1990,2001],
                         'ETC': ['a','b','c','e','g','i']})
NOTE: I want ALL rows with the minimum 'INVYR' value for each 'PlotCode', not just one; otherwise I'm assuming I could do something easier with drop_duplicates and sort.
So far, following the answer here (Appending pandas dataframes generated in a for loop), I've tried this with the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode'] == i]
    k = j[j['INVYR'] == j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow; my actual data contains some 40,000 different PlotCodes, so this isn't a feasible solution. Does anyone know some smooth filtering way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow in comparison to the vectorized operations that pandas has.
Solution 1:
Determine the minimum INVYR for every PlotCode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and this minimum you just found:
result_df = pd.merge(
    df,
    min_invyr_per_plotcode,
    how='inner',
    on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This per-group minimum gets added to every row by using .groupby().transform():
df['min_per_group'] = (df
                       .groupby('PlotCode')['INVYR']
                       .transform('min'))
Now filter your dataframe to the rows where INVYR equals the minimum of its group:
df[df['INVYR'] == df['min_per_group']]
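Running Solution 2 on the dummy df from the question reproduces the expected df1 (plus the helper column):
import pandas as pd

df = pd.DataFrame(data={'PlotCode': ['A','A','A','A','B','B','B','C','C'],
                        'INVYR': [2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC': ['a','b','c','d','e','f','g','h','i']})
df['min_per_group'] = df.groupby('PlotCode')['INVYR'].transform('min')
print(df[df['INVYR'] == df['min_per_group']])
#   PlotCode  INVYR ETC  min_per_group
# 0        A   2000   a           2000
# 1        A   2000   b           2000
# 2        A   2000   c           2000
# 4        B   1990   e           1990
# 6        B   1990   g           1990
# 8        C   2001   i           2001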

Pandas colnames not found after grouping and aggregating

Here is my data:
threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col=0)
And here is my code:
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
However, df.columns shows only Index(['threatened'], dtype='object'). That is, only the threatened column is displayed, not the columns I actually grouped by (continent and threat_type), although they are present in my data frame.
I would like to perform an operation on the continent column of my data frame, but it is not displayed as one of the columns. For example, continents = df.continent.unique() gives me a key error that continent is not found.
After a groupby, pandas puts the grouping columns into the index. Reset the index after doing a groupby in pandas, and don't pass drop=True (that would discard the grouping columns instead of restoring them).
After your code:
df = df.reset_index()
And then you will get the required columns.
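Alternatively, you can keep the grouping keys as regular columns from the start by passing as_index=False to groupby (reusing threats from the question):
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'], as_index=False)
      .agg({'threatened': 'size'}))
print(df.columns)  # Index(['continent', 'threat_type', 'threatened'], dtype='object')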

Filling a dataframe with multiple dataframe values

I have some 100 dataframes that need to be filled into another big dataframe. Presenting the question with two dataframes:
import pandas as pd
df1 = pd.DataFrame([1,1,1,1,1], columns=["A"])
df2 = pd.DataFrame([2,2,2,2,2], columns=["A"])
Please note that both dataframes have the same column names.
I have a master dataframe that has repetitive index values, as follows:
master_df = pd.DataFrame(index=df1.index)
master_df = pd.concat([master_df] * 2)
Expected output:
master_df['A'] = [1,1,1,1,1,2,2,2,2,2]
I am using a for loop to replace every n rows of master_df with df1, df2, ..., df100.
Please suggest a better way of doing it.
In fact df1, df2, ..., df100 are outputs of a function whose input is the column A value (1, 2). I was wondering if there is something like
another_df = master_df['A'].apply(lambda x: function(x))
Thanks in advance.
If you want to concatenate the dataframes, you could just use pandas concat with a list, as the code below shows.
First you can add df1 and df2 to a list:
df_list = [df1, df2]
Then you can concat the dfs:
master_df = pd.concat(df_list)
I used the default value of 0 for 'axis' in the concat function (which is what I think you are looking for), but if you want to concatenate the different dfs side by side you can just set axis=1.
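If each block really is the output of a function of the column-A value, a list comprehension can build df_list directly; make_df below is a hypothetical stand-in for that function:
import pandas as pd

def make_df(x):
    # hypothetical stand-in for the real block-producing function
    return pd.DataFrame([x] * 5, columns=["A"])

df_list = [make_df(x) for x in [1, 2]]
master_df = pd.concat(df_list)
print(master_df['A'].tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]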

Pandas how to concat two dataframes without losing the column headers

I have the following toy code:
import pandas as pd
df = pd.DataFrame()
df["foo"] = [1,2,3,4]
df2 = pd.DataFrame()
df2["bar"]=[4,5,6,7]
df = pd.concat([df, df2], ignore_index=True, axis=1)
print(list(df))
Output: [0, 1]
Expected output: ['foo', 'bar'] (order is not important)
Is there any way to concatenate two dataframes without losing the original column headers, if I can guarantee that the headers will be unique?
Iterating through the columns and then adding them to one of the DataFrames comes to mind, but is there a pandas function or concat parameter that I am unaware of?
Thanks!
As stated in the merge, join, and concatenate documentation, ignore_index removes all name references and uses a range (0...n-1) instead. So it should give you the result you want once you remove the ignore_index argument or set it to False (the default).
df = pd.concat([df, df2], axis=1)
This will join your df and df2 based on indexes (rows with the same index will be concatenated side by side; if the other dataframe has no row at that index, the values will be filled in as NaN).
If you have different indexing on your dataframes and still want to concatenate them this way, you can either create a temporary index and join on that, or set the new dataframe's columns after using concat(..., ignore_index=True).
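With the toy code from the question, dropping ignore_index gives the expected headers:
import pandas as pd

df = pd.DataFrame()
df["foo"] = [1, 2, 3, 4]
df2 = pd.DataFrame()
df2["bar"] = [4, 5, 6, 7]
df = pd.concat([df, df2], axis=1)
print(list(df))  # ['foo', 'bar']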
I don't think the accepted answer answers the question, which is about column headers, not indexes.
I am facing the same problem, and my workaround is to add the column names after the concatenation:
df.columns = ["foo", "bar"]
