How do I manipulate the columns in my financial data - python

I have a dataframe read from a CSV of different stock options and their close/high/low/open etc. But the format of the data is a bit difficult to work with: when calculating the returns using the adjusted close value, I have to create a new df each time to drop the null values.
Original Data
How do I convert it to the following format instead?
Converted data

The best way I could think of is to pivot the data:
# Drop date column (as it is already in the index), and pivot on Feature
df2 = df.drop(columns="Date").pivot(columns="Feature")
# Swap the column levels, so that Feature is first, then Ticker
df2 = df2.swaplevel(0, 1, 1)
# Stack the columns, so Ticker is one column, increasing the number of rows
df2 = df2.stack()
# Reset the index, but keep it as a column (the Date column)
df2.reset_index(inplace=True)
# Sort the rows on the Ticker and Date
df2.sort_values(["level_1", "Date"], inplace=True)
# Rename the Ticker column
df2.rename(columns={"level_1": "Ticker"}, inplace=True)
# Reset the index
df2.reset_index(drop=True, inplace=True)
This could all be chained and run at once, rather than reassigning df2 at each step:
df2 = df.drop(columns="Date").pivot(columns="Feature") \
        .swaplevel(0, 1, 1).stack().reset_index() \
        .sort_values(["level_1", "Date"]) \
        .rename(columns={"level_1": "Ticker"}).reset_index(drop=True)
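For reference, here is a minimal self-contained sketch of the same chain on made-up data (the ticker columns AAPL/MSFT and all numbers are hypothetical, purely to show the shape before and after):
import pandas as pd

# Toy frame in the assumed original layout: a Date column (also set as the
# index), a Feature column naming the metric, and one column per ticker.
df = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-04", "2021-01-05", "2021-01-05"],
    "Feature": ["Close", "High", "Close", "High"],
    "AAPL": [129.41, 133.61, 131.01, 131.74],
    "MSFT": [217.69, 223.00, 217.90, 218.52],
}).set_index("Date", drop=False)

df2 = df.drop(columns="Date").pivot(columns="Feature") \
        .swaplevel(0, 1, 1).stack().reset_index() \
        .sort_values(["level_1", "Date"]) \
        .rename(columns={"level_1": "Ticker"}).reset_index(drop=True)
print(df2)  # one row per (Date, Ticker) pair, with Close/High as columns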
Hopefully this works for you!

Related

Adding correction column to dataframe

I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections, df_corr = pd.read_csv("corrections.csv"):
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corrections.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={"value": "value1"}),
    fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that, it's a simple matter of applying the correction by adding the value and value1 columns (and finally dropping the column containing the corrections). Note that rows without a correction get NaN from the left join, so fill those with 0 before adding.
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction by adding the columns; rows without a
# correction have NaN after the left join, so treat those as 0.
df_corrected.value1 = (df_corrected.value1 + df_corrected.value.fillna(0)).astype(int)
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
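To sanity-check, here is the whole thing run end-to-end on the sample data from the question (inlined via io.StringIO instead of the CSV files):
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    'date,location,value1,value2\n'
    '2020-01-01,place1,1,2\n'
    '2020-01-02,place2,5,8\n'
    '2020-01-03,place2,2,9'))
df_corr = pd.read_csv(io.StringIO(
    'date,location,value\n'
    '2020-01-02,place2,-1\n'
    '2020-01-03,place2,2'))

df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
df_corrected.value1 = (df_corrected.value1 + df_corrected.value.fillna(0)).astype(int)
df_corrected.drop(columns='value', inplace=True)
print(df_corrected)  # value1 becomes 1, 4, 4 -- matching the expected output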
Set date and location as the index in both dataframes, add the two, and fillna:
df.set_index(['date', 'location'], inplace=True)
df_corr.set_index(['date', 'location'], inplace=True)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])
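A short illustration of why the fillna matters, again on the question's sample rows (a sketch; the 2020-01-01 row has no correction, so the aligned sum is NaN there until fillna restores it):
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    'date,location,value1,value2\n'
    '2020-01-01,place1,1,2\n'
    '2020-01-02,place2,5,8\n'
    '2020-01-03,place2,2,9'))
df_corr = pd.read_csv(io.StringIO(
    'date,location,value\n'
    '2020-01-02,place2,-1\n'
    '2020-01-03,place2,2'))

df.set_index(['date', 'location'], inplace=True)
df_corr.set_index(['date', 'location'], inplace=True)

print(df['value1'] + df_corr['value'])  # NaN for (2020-01-01, place1)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])
print(df.reset_index())  # value1 is now 1.0, 4.0, 4.0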

Pandas colnames not found after grouping and aggregating

Here is my data
threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col = 0)
And here is my code -
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
However, df.columns only gives Index(['threatened'], dtype='object'). That is, only the threatened column shows up, not the columns I actually grouped by, i.e. continent and threat_type, although they are present in my data frame.
I would like to perform an operation on the continent column of my data frame, but it is not showing as one of the columns. For example, continents = df.continent.unique() gives me a KeyError saying continent is not found.
After a groupby, pandas puts the grouping columns into the index. Reset the index after doing a groupby in pandas, and don't pass drop=True (that would discard the grouping columns instead of turning them back into regular columns).
After your code:
df = df.reset_index()
And then you will get the required columns.
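A small sketch of the behaviour on made-up data (the continents, threat types and counts are hypothetical):
import pandas as pd

threats = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Asia', 'Asia'],
    'threat_type': ['Logging', 'Logging', 'Fire', 'Logging'],
    'threatened': [3, 0, 2, 5],
})

df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
print(df.columns)      # Index(['threatened'], dtype='object') -- as in the question
print(df.index.names)  # the grouping columns live in the index

df = df.reset_index()  # move the grouping levels back into regular columns
print(df.continent.unique())  # works now -- no more KeyError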

Match 2 data frames by date and column name to get values

I have two data frames (they are already in data frame format, but for illustration I created them as dictionaries first):
first = {
    'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
    'Name': ['Felix', 'Felix', 'Peter', 'Paul']}
df1 = pd.DataFrame(first)
And
second = {
    'Date': ['2013-02-28', '2013-03-31', '2013-05-30', '2014-10-31'],
    'Felix': ['Value1_x', 'Value2_x', 'Value3_x', 'Value4_x'],
    'Peter': ['Value1_y', 'Value2_y', 'Value3_y', 'Value4_y']}
df2 = pd.DataFrame(second)
Now, I'd like to add an additional column to df1 containing the values of df2, if df1.Date matches df2.Date by year and month (the day does not usually match, since df2 contains end-of-month dates) AND if the df2 column name matches the corresponding df1.Name value.
So the result should look like this:
df_new = {
    'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
    'Name': ['Felix', 'Felix', 'Peter', 'Paul'],
    'Values': ['Value1_x', 'Value2_x', 'Value3_y', 'NaN']}
df_new = pd.DataFrame(df_new)
Do you have any suggestions on how to solve this problem?
I considered creating additional columns for year and month (df1['year'] = df1['Date'].dt.year) and then matching with df1[(df1['year'] == df2['year']) & (df1['month'] == df2['month'])] and calling the df2 column, but I can't figure out how to put everything together.
In general, try not to post your data sets as images, because that makes it hard to help you out.
I think the easiest thing to do would be to create a column in each data frame where the Date is rounded to the first day of each month.
df1['Date'] = pd.to_datetime(df1['Date'])  # the example dates are strings
df2['Date'] = pd.to_datetime(df2['Date'])
df1['Date_round'] = df1['Date'] - pd.offsets.MonthBegin(1)
df2['Date_round'] = df2['Date'] - pd.offsets.MonthBegin(1)
Then reshape df2 using melt.
df2_reshaped = df2.melt(id_vars=['Date','Date_round'], var_name='Name', value_name='Values')
And then you can join the data frames on Date_round and Name using pd.merge.
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1), how='left', on=['Date_round', 'Name'])
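Putting it all together with the data frames from the question (the to_datetime conversion is added here because the example dates are strings):
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
    'Name': ['Felix', 'Felix', 'Peter', 'Paul']})
df2 = pd.DataFrame({
    'Date': ['2013-02-28', '2013-03-31', '2013-05-30', '2014-10-31'],
    'Felix': ['Value1_x', 'Value2_x', 'Value3_x', 'Value4_x'],
    'Peter': ['Value1_y', 'Value2_y', 'Value3_y', 'Value4_y']})

for frame in (df1, df2):
    frame['Date'] = pd.to_datetime(frame['Date'])
    frame['Date_round'] = frame['Date'] - pd.offsets.MonthBegin(1)

df2_reshaped = df2.melt(id_vars=['Date', 'Date_round'],
                        var_name='Name', value_name='Values')
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1),
              how='left', on=['Date_round', 'Name'])
print(df.drop(columns='Date_round'))
# Felix's and Peter's rows pick up Value1_x, Value2_x and Value3_y;
# Paul has no column in df2, so his Values entry is NaN -- as in df_new.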

How can I use reset_index with multi-grouped values (hierarchical format) in pandas Python

This is my data format. I want to reset the index and get everything into one flat table, so I can take the count of all the IDs (the second level) and plot a histogram of the counts by date.
Any simple ideas?
If reset_index() is not working, you can also convert the table manually.
Assume df1 is your existing data frame; we'll create df2 (the new one) in the shape you want.
df2 = pd.DataFrame()
df2['DateTime'] = df1.index.get_level_values(0).tolist()
df2['ID1'] = df1.index.get_level_values(1).tolist()
df2['ID2'] = df1['ID2'].values.tolist()
df2['Count'] = df1['Count'].values.tolist()
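A sketch with a made-up two-level frame (the level and column names mirror the snippet above):
import pandas as pd

# Build a frame with a two-level index (DateTime, ID1), as in the question.
idx = pd.MultiIndex.from_tuples(
    [('2020-01-01', 'a'), ('2020-01-01', 'b'), ('2020-01-02', 'a')],
    names=['DateTime', 'ID1'])
df1 = pd.DataFrame({'ID2': ['x', 'y', 'x'], 'Count': [3, 1, 2]}, index=idx)

df2 = pd.DataFrame()
df2['DateTime'] = df1.index.get_level_values(0).tolist()
df2['ID1'] = df1.index.get_level_values(1).tolist()
df2['ID2'] = df1['ID2'].values.tolist()
df2['Count'] = df1['Count'].values.tolist()
print(df2)  # one flat table, ready for plotting counts by date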

convert string to float and align columns in dataframe

I have a DataFrame df where each Value column has its own Date column. The contents are (still) formatted as strings:
Date Value1 Date Value2
Index
0 30.01.2001 20,32 30.05.2005 50,55
1 30.02.2001 19,5 30.06.2005 49,21
2 30.03.2001 21,45 30.07.2005 48,1
My issues (in order of priority):
1. I do not manage to convert the Value columns to float, even after I successfully converted the ',' to '.' with
df.replace(to_replace=",", value='.', inplace=True, regex=True)
What can you suggest for the conversion to float? I suspect it fails because sometimes there is only one decimal after the comma. How can I solve this?
2. How can I align the dates so that Value2's dates match those of Value1 (it needs to be shifted down until it matches, provided that the rows continue till the present day)?
3. What is the most efficient way to iterate through the columns in order to do this formatting?
EDIT:
Based on the answers so far... how can I iterate through the larger dataframe and split it into single dataframes/series as suggested? (I have issues generating numbered names for the dfs, i.e. df1, df2, df3, ...)
A number of steps:
# Split into two dataframes, one (Date, Value) pair each.
df1 = df.iloc[:, :2].copy()
# Rename 'Date.1' back to 'Date' (`read_csv` suffixes duplicate column names).
df2 = df.iloc[:, 2:].copy().rename(columns={'Date.1': 'Date'})
# As EdChum suggested: decimal comma to decimal point, then cast to float.
df1.Value1 = df1.Value1.str.replace(',', '.').astype(float)
df2.Value2 = df2.Value2.str.replace(',', '.').astype(float)
# Convert to dates (regex=False so '.' is literal, dayfirst for DD.MM.YYYY).
df1.Date = pd.to_datetime(df1.Date.str.replace('.', '/', regex=False), dayfirst=True)
df2.Date = pd.to_datetime(df2.Date.str.replace('.', '/', regex=False), dayfirst=True)
# Set dates as index in anticipation of `pd.concat`.
df1 = df1.set_index('Date')
df2 = df2.set_index('Date')
# Line up the dates.
pd.concat([df1, df2], axis=1)
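A runnable sketch of the same steps on the sample rows, with two hedges: the date parsing uses format='%d.%m.%Y' directly instead of the dot-to-slash replace, and errors='coerce' is passed because the sample date 30.02.2001 is not a real calendar date (it becomes NaT rather than raising):
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    'Date,Value1,Date,Value2\n'
    '30.01.2001,"20,32",30.05.2005,"50,55"\n'
    '30.02.2001,"19,5",30.06.2005,"49,21"\n'
    '30.03.2001,"21,45",30.07.2005,"48,1"'))

df1 = df.iloc[:, :2].copy()
df2 = df.iloc[:, 2:].copy().rename(columns={'Date.1': 'Date'})

for frame, col in ((df1, 'Value1'), (df2, 'Value2')):
    frame[col] = frame[col].str.replace(',', '.').astype(float)
    frame['Date'] = pd.to_datetime(frame['Date'], format='%d.%m.%Y',
                                   errors='coerce')  # 30.02.2001 -> NaT

aligned = pd.concat([df1.set_index('Date'), df2.set_index('Date')], axis=1)
print(aligned)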
