Pivot a dataframe by splitting a string & format specific columns - python

I am faced with a problem that is above my level of pandas - but might well be simple once I know the steps.
I have a dataframe with column names as below, and I want to extract the period from each column name and pivot it into rows, as in the second example below.
I also want to format each column differently - currently everything is a plain number, but some columns should be percentages and others numbers with a certain number of decimals. What I have now and what I want are outlined below.
I have tried a few things - creating a MultiIndex by splitting the column names and then pivoting on it. I feel I am on the right track but just cannot make it work at present. Any help appreciated.
What I have now in a dataframe:
client_return_12m,client_return_36m,client_return_60m,client_sharpe_12m,client_sharpe_36m,client_sharpe_60m
0.34116,0.56439,0.701156,0.74320,0.82349,0.76889
What I want after:
period,client_return,client_sharpe
12m,34.1%,0.74
36m,56.4%,0.82
60m,70.1%,0.77

Use str.rsplit to split on the last _, then reshape with DataFrame.stack:
# split each column name on the last underscore -> MultiIndex ('client_return', '12m')
df.columns = df.columns.str.rsplit('_', n=1, expand=True)
# stack the period level into rows, drop the old row index, and name the new column
df = df.stack().reset_index(level=0, drop=True).rename_axis('period').reset_index()
print(df)
period client_return client_sharpe
0 12m 0.341160 0.74320
1 36m 0.564390 0.82349
2 60m 0.701156 0.76889
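For the formatting part of the question, a follow-up step can render the columns as in the desired output; the percentage/decimal choices below are read off the example (note this turns client_return into strings, so it is a display step, not something to do before further computation):
# returns as one-decimal percentages, Sharpe ratios rounded to two decimals
df['client_return'] = (df['client_return'] * 100).round(1).astype(str) + '%'
df['client_sharpe'] = df['client_sharpe'].round(2)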

Related

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' (the ID column) in dataframe DT is a subset of the same column in dataframe MR; the two columns merely overlap (they are not equal), and the remaining columns, as well as the number of rows, differ.
How can I get the rows from MR whose 'ID' values are equal to values in DT['ID']? Note that values in 'ID' can appear several times in the same column.
DT has 1538 rows and MR has 2060 rows.
I tried some of the lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you just want a new dataframe combining the records that share the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
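Since IDs can repeat in both frames, note that isin() keeps each matching MR row exactly once, while merge() produces one row per matching pair, so duplicated IDs multiply. A toy sketch (data invented for illustration):
import pandas as pd
MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'x': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 1, 2], 'y': [10, 20, 30]})
print(MR.loc[MR.ID.isin(DT.ID)])   # 3 rows: the MR rows with ID in {1, 2}
print(pd.merge(MR, DT, on='ID'))   # 5 rows: ID 1 matches 2 x 2 times, ID 2 once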

split strings from a column in separate columns

I am trying to split string values from a column into as many columns as there are strings in each row.
I am creating a new dataframe with three columns, and the string values I want to split are in the third column. The new columns already have headers, but the number of semicolon-separated strings differs from row to row.
If I use this code:
df['string']= df['string'].str.split(';', expand=True)
then only one value is left in the column, while the rest of the string values are not split but eliminated.
Can you advise how this line of code should be modified in order to get the right output?
Many thanks in advance.
Instead of overwriting the original column, you can take the result of split and join it with the original DataFrame:
df = pd.DataFrame({'my_string':['car;war;bus','school;college']})
df = df.join(df['my_string'].str.split(';',expand=True))
print(df)
my_string 0 1 2
0 car;war;bus car war bus
1 school;college school college None
Then, if you only want the first split piece left in the column, the explicit way is (note that with expand=True the result is a DataFrame, which has no .str accessor):
df['string'] = df['string'].str.split(';').str[0]
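Since the question says the target columns already have headers, the split result can also be assigned to them directly. A minimal sketch, assuming hypothetical headers w1, w2, w3 and at most three semicolon-separated pieces per row (rows with fewer pieces are padded with None):
df[['w1', 'w2', 'w3']] = df['my_string'].str.split(';', expand=True)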

Aggregate Python DF based on column

I have a big dataframe (approximately 35 columns), where one column, concat_strs, is a concatenation of 8 other columns in the dataframe. It is used to detect duplicates. What I want to do is aggregate the rows where concat_strs has the same value, summing the columns val, abs_val, price, abs_price.
I have done the following:
agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum'}
final_df= df.groupby('concat_strs', as_index=False).aggregate(agg_attributes)
But when I look at final_df, I notice 2 issues:
Other columns are removed, so I only have 5 columns. I have tried final_df.reindex(columns=df.columns), but then all of the other columns are NaN.
The number of rows in final_df remains the same as in df (ca. 300k rows), although it should be reduced (checked manually).
The question is: what went wrong, and are there any suggestions for improvement?
You group by concat_strs, so only concat_strs and the columns in agg_attributes are kept: in a groupby operation, pandas does not know what to do with the other columns.
You can include all the other columns in the aggregation with 'first' to keep the first value of each (or 'last', etc., depending on what you need); see the sketch below.
Also, I am not convinced this is a good way to dedup; could you simply drop the duplicates instead?
And you don't need concat_strs at all, as groupby accepts a list of columns to group on.
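A minimal sketch combining these points; the grouping keys key1, key2 stand in for the 8 columns behind concat_strs, and 'first' for the remaining columns is one possible choice, not something from the question:
keys = ['key1', 'key2']
sum_cols = ['val', 'abs_val', 'price', 'abs_price']
# keep the first value of every other column, sum the four value columns
agg = {c: 'first' for c in df.columns if c not in keys + sum_cols}
agg.update({c: 'sum' for c in sum_cols})
final_df = df.groupby(keys, as_index=False).agg(agg)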
Not sure if I understood the question correctly, but you can try this:
final_df = df.groupby(['concat_strs']).sum()

How can I subtract only the values in a dataframe in pandas?

My pandas dataframe has a column "timeStamp". I'm trying to obtain the difference between values from two dataframes. I use the following piece of code for it (see code). My question: how can I keep the date the same and only subtract the values?
merge is a nice approach as SwaggaTing suggested. Alternatively, you can set your date as the index:
a.set_index('date')['values_TProducing'] - b.set_index('date')['values_AProducing']
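The subtraction aligns the two series on the date index; to get the date back as a column next to the result, one option (the 'diff' name is just illustrative):
diff = (a.set_index('date')['values_TProducing']
        - b.set_index('date')['values_AProducing']).rename('diff').reset_index()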
Assuming the dates are unique, you can join the dataframes on the date column and then subtract:
merged = a.merge(b, on='date')
merged['diff'] = merged['values_AProducing'] - merged['values_TProducing']
This assumes that the dates line up as they do in your example:
x = a.copy().drop(columns='values_TProducing')  # drop by name; the positional axis argument was removed in pandas 2.0
x['values'] = a['values_TProducing'] - b['values_AProducing']

What's the most efficient way to drop columns (from beginning and end) in pandas from a large dataframe?

I am trying to drop a number of columns from the beginning and end of the pandas dataframe.
My dataframe has 397 rows and 291 columns. I currently have this solution to remove the first 8 columns, but I also want to remove some at the end:
SMPS_Data = SMPS_Data.drop(SMPS_Data.columns[0:8], axis=1)
I know I could just repeat this step and remove the last few columns, but I was hoping there is a more direct way to approach this problem.
I tried using
SMPS_Data = SMPS_Data.drop(SMPS_Data.columns[0:8,278:291], axis=1)
but it doesn't work.
Also, it seems that the .drop method somehow slows down the console responsiveness, so maybe there's a cleaner way to do it?
You could use .drop() if you want to remove columns by their names:
drop_these = ['column_name1', 'column_name2', 'last_columns']
df = df.drop(columns=drop_these)
If you know you want to remove them by their location, you could use .iloc:
df.iloc[:, 8:15]   # columns 8 through 14 (the end of the slice is exclusive)
df.iloc[:, :-5]    # all columns except the last five
df.iloc[:, 2:-5]   # all columns except the first two and the last five
See the pandas documentation on indexing and slicing for more information.
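Applied to the frame in the question (291 columns, dropping the first 8 and columns 278 onward), either of these does it in one step; np.r_ fixes the failed attempt above by concatenating the two position ranges:
import numpy as np
SMPS_Data = SMPS_Data.drop(SMPS_Data.columns[np.r_[0:8, 278:291]], axis=1)
# or keep the middle block directly
SMPS_Data = SMPS_Data.iloc[:, 8:278]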
