Pandas colnames not found after grouping and aggregating - python

Here is my data
threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col = 0)
And here is my code -
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
However, df.columns is only Index(['threatened'], dtype='object'). That is, only the threatened column shows up, not the columns I actually grouped by (continent and threat_type), even though they are present in my data frame.
I would like to perform an operation on the continent column, but it is not showing as one of the columns. For example, continents = df.continent.unique() raises a KeyError saying continent is not found.

After a groupby, pandas puts the grouping columns into the index. Reset the index after a groupby, and don't pass drop=True (that would discard the grouping columns instead of restoring them).
After your code, run:
df = df.reset_index()
and then you will get the required columns.
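For illustration, here is a minimal, self-contained sketch with made-up data standing in for the threats CSV, showing the grouping keys move into the index and come back out with reset_index():

```python
import pandas as pd

# Made-up stand-in for the threats data
threats = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Asia'],
    'threat_type': ['Fire', 'Flood', 'Fire'],
    'threatened': [3, 0, 5],
})

df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))

# The grouping keys live in the (Multi)Index, not the columns
print(list(df.columns))        # ['threatened']

# After reset_index() they become ordinary columns again
df = df.reset_index()
print(list(df.columns))        # ['continent', 'threat_type', 'threatened']
print(df.continent.unique())   # ['Africa' 'Asia']
```

An alternative is to pass as_index=False to groupby, which keeps the keys as columns from the start.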

Related

How do I manipulate the columns in my financial data

I have a dataframe, read from CSV, of different stock options and their close/high/low/open values, etc. But the format of the data is difficult to work with: when calculating returns using the adjusted close value, I have to create a new dataframe each time just to drop the null values.
Original Data
How do I convert it to the following format instead?
Converted data
The best way I could think of is to pivot the data:
# Drop date column (as it is already in the index), and pivot on Feature
df2 = df.drop(columns="Date").pivot(columns="Feature")
# Swap the column levels, so that Feature is first, then Ticker
df2 = df2.swaplevel(0, 1, 1)
# Stack the columns, so Ticker is one column, increasing the number of rows
df2 = df2.stack()
# Reset the index, but keep it (this restores the Date column)
df2.reset_index(inplace=True)
# Sort the rows on the Ticker and Date
df2.sort_values(["level_1", "Date"], inplace=True)
# Rename the Ticker column
df2.rename(columns={"level_1": "Ticker"}, inplace=True)
# Reset the index
df2.reset_index(drop=True, inplace=True)
This could all be run once, rather than defining df2 each time:
df2 = df.drop(columns="Date").pivot(columns="Feature") \
    .swaplevel(0, 1, 1).stack().reset_index() \
    .sort_values(["level_1", "Date"]) \
    .rename(columns={"level_1": "Ticker"}).reset_index(drop=True)
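As a sanity check, here is the same chain run on a tiny made-up frame in the awkward wide format. The tickers and the exact column layout (Date in the index and as a column, a Feature column, one value column per ticker) are assumptions about the original data:

```python
import pandas as pd

# Hypothetical stand-in for the original data
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Feature": ["Open", "Close", "Open", "Close"],
    "AAPL": [300.0, 301.0, 302.0, 303.0],
    "MSFT": [160.0, 161.0, 162.0, 163.0],
}).set_index("Date", drop=False)

# Pivot on Feature, move Ticker into the rows, and tidy up the index
df2 = df.drop(columns="Date").pivot(columns="Feature") \
    .swaplevel(0, 1, 1).stack().reset_index() \
    .sort_values(["level_1", "Date"]) \
    .rename(columns={"level_1": "Ticker"}).reset_index(drop=True)

print(df2)
```

The result has one row per (Ticker, Date) pair with Open and Close as columns.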
Hopefully this works for you!

swap pandas column to values in another column

I have a pandas dataframe- got it from API so don't have much control over the structure of it- similar like this:
I want to have datetime as one column and value as another column. Any hints?
You can use T to transpose the dataframe and then reset_index to turn the current index into a regular column (you may need to rename it from 'index' afterwards):
df = df.T.reset_index()
df.columns = df.iloc[0]
df = df[1:]
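For example, with a made-up frame shaped like a typical API response (one row of values keyed by timestamp columns; the timestamps and values here are invented, and the exact steps depend on the shape of your API result):

```python
import pandas as pd

# Invented stand-in for the API result: timestamps as columns, one row of values
df = pd.DataFrame(
    [[1.5, 2.5, 3.5]],
    index=["value"],
    columns=["2020-01-01", "2020-01-02", "2020-01-03"],
)

df = df.T.reset_index()                        # timestamps become a regular column
df = df.rename(columns={"index": "datetime"})  # give the new column a clear name
print(df)
```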

Adding correction column to dataframe

I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections, df_corr = pd.read_csv("corrections.csv"):
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corr.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={"value": "value1"}),
    fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that it's a simple matter of applying the correction by adding the value column to value1 (filling missing corrections with 0 so unmatched rows don't turn into NaN), and finally dropping the column containing the corrections.
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction; unmatched rows have NaN, so treat those as 0.
df_corrected.value1 = df_corrected.value1 + df_corrected.value.fillna(0)
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
Set date and location as the index in both dataframes, add the two, and use fillna to keep the original values where there is no correction:
df.set_index(['date','location'], inplace=True)
df_corr.set_index(['date','location'], inplace=True)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])
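For completeness, here is the set_index() route from the EDIT run end to end on inline copies of the two CSVs (StringIO stands in for the files), producing the expected corrected table:

```python
import pandas as pd
from io import StringIO

# Inline stand-ins for data.csv and corrections.csv
df = pd.read_csv(StringIO(
    "date,location,value1,value2\n"
    "2020-01-01,place1,1,2\n"
    "2020-01-02,place2,5,8\n"
    "2020-01-03,place2,2,9\n"))
df_corr = pd.read_csv(StringIO(
    "date,location,value\n"
    "2020-01-02,place2,-1\n"
    "2020-01-03,place2,2\n"))

idx = ['date', 'location']
# Align on (date, location); rows/columns missing from df_corr are treated as 0
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={"value": "value1"}),
    fill_value=0
).astype(int).reset_index()

print(df_corrected)  # value1 becomes 1, 4, 4; value2 is unchanged
```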

Pandas Series from two-columned DataFrame produces a Series of NaN's

state_codes = pd.read_csv('name-abbr.csv', header=None)
state_codes.columns = ['State', 'Code']
codes = state_codes['Code']
states = pd.Series(state_codes['State'], index=state_codes['Code'])
name-abbr.csv is a two-columned CSV file of US state names in the first column and postal codes in the second: "Alabama" and "AL" in the first row, "Alaska" and "AK" in the second, and so forth.
The above code correctly sets the index, but the Series is all NaN. If I don't set the index, the state names correctly show. But I want both.
I also tried this line:
states = pd.Series(state_codes.iloc[:,0], index=state_codes.iloc[:,1])
Same result. How do I get this to work?
The reason is alignment: pandas tries to match the existing index of state_codes['State'] with the new index built from state_codes['Code'], and because the two differ you get missing values in the output. To prevent this, convert the Series to a numpy array first:
states = pd.Series(state_codes['State'].to_numpy(), index=state_codes['Code'])
Or you can use DataFrame.set_index:
states = state_codes.set_index('Code')['State']
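A small self-contained demonstration of the alignment effect, using a toy two-state version of the CSV:

```python
import pandas as pd

# Toy stand-in for name-abbr.csv
state_codes = pd.DataFrame({"State": ["Alabama", "Alaska"],
                            "Code": ["AL", "AK"]})

# Passing the Series directly aligns on its old integer index -> all NaN
bad = pd.Series(state_codes["State"], index=state_codes["Code"])
print(bad.isna().all())   # True

# Passing raw values (or using set_index) keeps the data
good = pd.Series(state_codes["State"].to_numpy(), index=state_codes["Code"])
same = state_codes.set_index("Code")["State"]
print(good["AL"], same["AK"])   # Alabama Alaska
```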

Finding median of entire pandas Data frame

I'm trying to find the median flow of the entire dataframe. The first part of this is to select only certain items in the dataframe.
There were two problems with this: it included parts of the data frame that aren't in 'states', and the median was not a single value but was computed per row. How would I get the overall median of all the data in the dataframe?
Two options:
1) A pandas option:
df.stack().median()
2) A numpy option:
np.median(df.values)
The DataFrame you pasted is slightly messy due to some spaces, but you're going to want to melt the DataFrame and then use median() on the melted result:
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2['value'].median())
Your DataFrame may be slightly different, but the concept is the same. Check the comments I left above to understand pd.melt(), especially the value_vars and id_vars arguments.
Here is a very detailed way of how I went about cleaning and getting the correct answer:
# reading in on clipboard
df = pd.read_clipboard()
# printing it out to see and also the column names
print(df)
print(df.columns)
# melting the DF and then printing the result
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2)
# Creating a new DF so that no nulls are in there for ease of code readability
# using .copy() to avoid the Pandas warning about working on top of a copy
df3 = df2.dropna().copy()
# there were some funky values in the 'value' column. Just getting rid of those
df3.loc[df3.value.isin(['Columbia', 'of']), 'value'] = 99
# printing out the cleaned version and getting the median
print(df3)
print(df3['value'].median())
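On a clean, all-numeric frame the two one-liners from the first answer agree; a toy example (the numbers are invented):

```python
import numpy as np
import pandas as pd

# Toy all-numeric frame standing in for the cleaned state-flow data
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Row/column structure is ignored: both give one scalar over every cell
print(df.stack().median())     # 3.5
print(np.median(df.values))    # 3.5
```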
