convert string to float and align columns in dataframe - python

I have a DataFrame df in which each Value column has its own Date column. The contents are (still) formatted as strings:
       Date        Value1  Date        Value2
Index
0      30.01.2001  20,32   30.05.2005  50,55
1      30.02.2001  19,5    30.06.2005  49,21
2      30.03.2001  21,45   30.07.2005  48,1
My issues (in order of priority):
I cannot convert the Value columns to float, even after I successfully converted the ',' to '.' with
df.replace(to_replace=",", value='.', inplace=True, regex=True)
What can you suggest to convert them to float? I suspect it fails because there is sometimes only one digit after the decimal separator. How can I solve this?
How can I align the dates so that Value2's dates match those of Value1 (the column needs to be shifted down until it lines up, provided the rows continue to the present day)?
What is the most efficient way to iterate through the columns to do this formatting?
EDIT:
Based on the answers so far: how can I iterate through the larger dataframe and split it into single dataframes/series as suggested? (I have trouble creating numbered variables like df1, df2, df3, ...)

A number of steps:
# I wanted to split into two dataframes
df1 = df.iloc[:, :2].copy()
# Rename the second date column back to 'Date' (read_csv suffixed the duplicate name as 'Date.1')
df2 = df.iloc[:, 2:].copy().rename(columns={'Date.1': 'Date'})
# As EdChum suggested.
df1.Value1 = df1.Value1.str.replace(',', '.', regex=False).astype(float)
df2.Value2 = df2.Value2.str.replace(',', '.', regex=False).astype(float)
# Convert to dates; regex=False makes str.replace treat '.' literally
# (as a regex, '.' matches every character and destroys the column)
df1.Date = pd.to_datetime(df1.Date.str.replace('.', '/', regex=False), dayfirst=True)
df2.Date = pd.to_datetime(df2.Date.str.replace('.', '/', regex=False), dayfirst=True)
# Set the dates as index in anticipation of `pd.concat`
df1 = df1.set_index('Date')
df2 = df2.set_index('Date')
# Line up the dates
pd.concat([df1, df2], axis=1)
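To answer the EDIT: rather than inventing numbered variables (df1, df2, df3, ...), loop over the Date/Value column pairs and collect the pieces in a list. A minimal sketch, assuming the wide frame strictly alternates Date/Value pairs as in the example (split_pairs is just an illustrative name):
import pandas as pd

def split_pairs(df):
    frames = []
    # Walk the columns two at a time: each (Date, ValueN) pair becomes one frame
    for i in range(0, df.shape[1], 2):
        pair = df.iloc[:, i:i + 2].copy()
        value_col = pair.columns[1]
        pair.columns = ['Date', value_col]  # normalise the date label
        pair[value_col] = pair[value_col].str.replace(',', '.', regex=False).astype(float)
        # errors='coerce' turns impossible dates such as 30.02.2001 into NaT
        pair['Date'] = pd.to_datetime(pair['Date'], format='%d.%m.%Y', errors='coerce')
        frames.append(pair.set_index('Date'))
    # Concatenating on the date index lines the value columns up in one step
    return pd.concat(frames, axis=1)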

Related

How do I manipulate the columns in my financial data

I have a dataframe, read from CSV, with close/high/low/open etc. for different stock options. The format of the data is difficult to work with: when calculating returns using the adjusted close value, I have to create a new df each time just to drop the null values.
Original data (screenshot not reproduced here)
How do I convert it to the following format instead?
Converted data (screenshot not reproduced here)
The best way I could think of is to pivot the data:
# Drop the Date column (it is already in the index) and pivot on Feature
df2 = df.drop(columns="Date").pivot(columns="Feature")
# Swap the column levels, so that Feature comes first, then Ticker
df2 = df2.swaplevel(0, 1, axis=1)
# Stack the columns, so Ticker becomes a row level, increasing the number of rows
df2 = df2.stack()
# Reset the index, but keep it as columns (Date and the stacked Ticker level)
df2.reset_index(inplace=True)
# Sort the rows on Ticker and Date
df2.sort_values(["level_1", "Date"], inplace=True)
# Rename the Ticker column
df2.rename(columns={"level_1": "Ticker"}, inplace=True)
# Reset the index
df2.reset_index(drop=True, inplace=True)
This could all be run as one chain, rather than redefining df2 at each step:
df2 = df.drop(columns="Date").pivot(columns="Feature") \
        .swaplevel(0, 1, axis=1).stack().reset_index() \
        .sort_values(["level_1", "Date"]) \
        .rename(columns={"level_1": "Ticker"}).reset_index(drop=True)
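Since the screenshots are not reproduced here, a minimal sketch of the input shape this answer assumes (the tickers and sample values are hypothetical):
import pandas as pd

# Wide input: Date in the index, a Feature column, one column per ticker
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Feature": ["Close", "Open", "Close", "Open"],
    "AAPL": [10.0, 9.5, 11.0, 10.5],
    "MSFT": [20.0, 19.5, 21.0, 20.5],
}).set_index("Date", drop=False)

# Running the chain above on this frame yields one row per (Date, Ticker),
# with Close and Open as columns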
Hopefully this works for you!

Adding correction column to dataframe

I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections, df_corr = pd.read_csv("corrections.csv"):
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corr.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={"value": "value1"}),
    fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that, it's a simple matter of applying the correction by adding the value1 and value columns (and finally dropping the column containing the corrections).
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction by adding the columns; fillna(0) keeps rows
# without a matching correction intact (the left join leaves NaN there)
df_corrected.value1 = df_corrected.value1 + df_corrected.value.fillna(0)
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
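A quick end-to-end check of this approach, with the question's CSV text inlined through io.StringIO (a sketch; in practice you would read the actual files):
import io
import pandas as pd

data = io.StringIO("""date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9""")
corr = io.StringIO("""date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2""")
df = pd.read_csv(data)
df_corr = pd.read_csv(corr)

# Left join, add the NaN-safe correction, drop the helper column
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
df_corrected['value1'] = (df_corrected['value1'] + df_corrected['value'].fillna(0)).astype(int)
df_corrected = df_corrected.drop(columns='value')
print(df_corrected)  # value1 is now 1, 4, 4, matching the expected output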
Set date and location as the index in both dataframes, add the two, and fillna:
df.set_index(['date', 'location'], inplace=True)
df_corr.set_index(['date', 'location'], inplace=True)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])

How to add rows with missing dates in correct order of a dataframe?

I have a DataFrame with a column 'Date' (format e.g. 2020-06-26). The type of this column is str, and it is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24... The other column, 'Reviews', contains text. There are duplicate dates, so the dataframe can have multiple reviews on a given date and none on another. I wrote code to find which dates are missing, and I have a list (insert_dates) with 3 missing dates in the format %Y-%m-%d.
When I try to append these 3 dates to my dataframe df, nothing changes; len(df) remains the same. Here's what I did:
row = pd.Series([insert_dates[0],None], index=['Date', 'Review'])
row1 = pd.Series([insert_dates[1],None], index=['Date', 'Review'])
row2 = pd.Series([insert_dates[2],None], index=['Date', 'Review'])
df.append(row, ignore_index=True)
df.append(row1, ignore_index=True)
df.append(row2, ignore_index=True)
df.head()
What should I do?
append is not an in-place operation: it returns a new DataFrame, which your code discards. Assign the result back.
You can sort the dates directly with sort_values if your date format is YYYY-MM-DD. For cases such as day-first dates, you should use pd.to_datetime before sorting.
df = df.append([row, row1, row2], ignore_index=True)
df = df.sort_values(by='Date', ascending=False)
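Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the forward-compatible equivalent. A sketch, assuming df and insert_dates as defined in the question:
import pandas as pd

# Build one frame for all missing dates, with empty reviews
missing = pd.DataFrame({'Date': insert_dates, 'Review': [None] * len(insert_dates)})
df = pd.concat([df, missing], ignore_index=True)
df = df.sort_values(by='Date', ascending=False)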

Divide two dataframes with multiple columns (column specific)

I have two identical sized dataframes (df1 & df2). I would like to create a new dataframe with values that are df1 column1 / df2 column1.
So essentially df3 = df1(c1)/df2(c1), df1(c2)/df2(c2), df1(c3)/df2(c3)...
I've tried the code below; however, both attempts give a dataframe filled with NaN:
#attempt 1
df3 = df2.divide(df1, axis='columns')
#attempt 2
df3= df2/df1
You can try the following code, which divides every row of df2 by the first row of df1 (only appropriate if df1 effectively holds a single row of divisors):
df3 = df2.div(df1.iloc[0], axis='columns')
To use the divide function, the indexes of the dataframes need to match. In this situation, df1 held beginning-of-month values and df2 end-of-month values, so no index labels lined up. The question can be solved by:
df3 = df2.reset_index(drop=True) / df1.reset_index(drop=True)
df3.set_index(df2.index, inplace=True)  # set the index back to the original (i.e. end of month)
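A toy illustration of the alignment pitfall (dates and values are made up):
import pandas as pd

df1 = pd.DataFrame({'c1': [2.0, 4.0]},
                   index=pd.to_datetime(['2020-01-01', '2020-02-01']))
df2 = pd.DataFrame({'c1': [3.0, 6.0]},
                   index=pd.to_datetime(['2020-01-31', '2020-02-29']))

print(df2 / df1)  # all NaN: no index labels line up

df3 = df2.reset_index(drop=True) / df1.reset_index(drop=True)
df3.set_index(df2.index, inplace=True)  # restore the end-of-month index
print(df3)  # c1 is 1.5 in both rows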

Match 2 data frames by date and column name to get values

I have two data frames (they are already in a data frame format but for illustration, I created them as a dictionary first):
first = {
    'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
    'Name': ['Felix', 'Felix', 'Peter', 'Paul']}
df1 = pd.DataFrame(first)
And
second = {
    'Date': ['2013-02-28', '2013-03-31', '2013-05-30', '2014-10-31'],
    'Felix': ['Value1_x', 'Value2_x', 'Value3_x', 'Value4_x'],
    'Peter': ['Value1_y', 'Value2_y', 'Value3_y', 'Value4_y']}
df2 = pd.DataFrame(second)
Now, I'd like to add a column to df1 containing the values of df2 where df1.Date matches df2.Date by year and month (the day usually does not match, since df2 contains end-of-month dates) AND the df2 column name matches the corresponding df1.Name value.
So the result should look like this:
df_new = {
    'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
    'Name': ['Felix', 'Felix', 'Peter', 'Paul'],
    'Values': ['Value1_x', 'Value2_x', 'Value3_y', 'NaN']}
df_new = pd.DataFrame(df_new)
Do you have any suggestions how to solve this problem?
I considered creating additional columns for year and month (df1['year'] = df1['Date'].dt.year) and then matching with df1[(df1['year'] == df2['year']) & (df1['month'] == df2['month'])] and looking up the df2 column, but I can't figure out how to put everything together.
In general, try not to post your data sets as images, because that makes it hard to help you out.
I think the easiest thing to do would be to create a column in each data frame where the Date is rounded to the first day of each month.
# The Date columns are strings in the example data, so convert them first
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df1['Date_round'] = df1['Date'] - pd.offsets.MonthBegin(1)
df2['Date_round'] = df2['Date'] - pd.offsets.MonthBegin(1)
Then reshape df2 using melt.
df2_reshaped = df2.melt(id_vars=['Date','Date_round'], var_name='Name', value_name='Values')
And then you can join the data frames on Date_round and Name using pd.merge.
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1), how='left', on=['Date_round', 'Name'])
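One caveat with the MonthBegin(1) rounding: a date that already falls on the first of a month is rolled back a whole month. An alternative sketch that keys on year-month periods instead (assuming the Date columns were converted with pd.to_datetime as above; the 'ym' column name is just for illustration):
df1['ym'] = df1['Date'].dt.to_period('M')
df2['ym'] = df2['Date'].dt.to_period('M')
df2_reshaped = df2.melt(id_vars=['Date', 'ym'], var_name='Name', value_name='Values')
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1), how='left', on=['ym', 'Name'])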
