Pandas: create MultiIndex columns from a column in a data frame - python

Suppose I have a data frame like this. The index of this dataframe is already a MultiIndex, date/id.
Column N says that the price information is from N periods earlier. How can I turn column 'N' into a level of a column MultiIndex?
In this example, suppose column N has two unique values [0, 1]. The final result would have 6 columns and should look like [0/priceClose] [0/priceLocal] [0/priceUSD] [1/priceClose] [1/priceLocal] [1/priceUSD]

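I finally found that the following method works:
step 1: melt
step 2: pivot
# date/id live in the index, so move them into columns before melting
flat = df.reset_index()
long = pd.melt(flat, id_vars=['date', 'id', 'N'],
               value_vars=[c for c in flat.columns if c.startswith('price')],
               value_name='price')
# N goes first so the columns read 0/priceClose, 0/priceLocal, ...
df = pd.pivot_table(long, values='price', index=['date', 'id'],
                    columns=['N', 'variable'], aggfunc='max')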
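As an alternative to melt plus pivot, the same layout can probably be reached by moving N into the index and unstacking it. This is only a sketch: the sample rows below are made up for illustration, and it assumes each (date, id, N) combination appears at most once (unstack raises on duplicates, whereas pivot_table would aggregate them).

```python
import pandas as pd

# Hypothetical sample data: two dates for one id, each with N = 0 and N = 1.
rows = [
    ('2020-01-01', 'a', 0, 10.0, 11.0, 12.0),
    ('2020-01-01', 'a', 1, 9.0, 10.0, 11.0),
    ('2020-01-02', 'a', 0, 20.0, 21.0, 22.0),
    ('2020-01-02', 'a', 1, 19.0, 20.0, 21.0),
]
df = pd.DataFrame(
    rows,
    columns=['date', 'id', 'N', 'priceClose', 'priceLocal', 'priceUSD'],
).set_index(['date', 'id'])

wide = (
    df.set_index('N', append=True)   # index becomes date/id/N
      .unstack('N')                  # N moves into the columns (inner level)
      .swaplevel(0, 1, axis=1)       # make N the outer column level
      .sort_index(axis=1)            # group the columns by N
)
print(wide)
```

The result has one row per date/id pair and six columns, ordered 0/priceClose, 0/priceLocal, 0/priceUSD, 1/priceClose, 1/priceLocal, 1/priceUSD.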

Related

Sum with rows from two dataframes

I have two dataframes. One has months 1-5 and a value for each month, which are the same for every ID; the other has an ID and a unique multiplier, e.g.:
data = [['m', 10], ['a', 15], ['c', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Unique'])
data2=[[1,0.2],[2,0.3],[3,0.01],[4,0.5],[5,0.04]]
df2 = pd.DataFrame(data2, columns=['Month', 'Value'])
I want to do sum ( value / (1+unique)^(Month/12) ). E.g. for ID m, I want to do (value/(1+10)^(Month/12)), for every row in df2, and sum them. I wrote a for-loop to do this but since my real table has 277,000 entries this takes too long!
df['baseTotal'] = 0.0
for i in df.index:          # one pass per ID
    for j in df2.index:     # one pass per month
        base = df2.loc[j, 'Value'] / pow(1 + df.loc[i, 'Unique'], df2.loc[j, 'Month'] / 12.0)
        df.loc[i, 'baseTotal'] += base
Is there a more efficient way to do this?
df['Unique'].apply(lambda x: (df2['Value']/((1+x) ** (df2['Month']/12))).sum())
0 0.609983
1 0.563753
2 0.571392
Name: Unique, dtype: float64
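The apply call above still runs one Python-level function call per ID. A fully vectorized variant using NumPy broadcasting might look like the sketch below. It reuses the sample frames from the question; the column name baseTotal is taken from the question's loop, and the intermediate array has one row per ID and one column per month, so it assumes that array fits in memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['m', 10], ['a', 15], ['c', 14]], columns=['ID', 'Unique'])
df2 = pd.DataFrame([[1, 0.2], [2, 0.3], [3, 0.01], [4, 0.5], [5, 0.04]],
                   columns=['Month', 'Value'])

# Column vector of (1 + Unique), shape (n_ids, 1)
base = 1.0 + df['Unique'].to_numpy(dtype=float)[:, None]
# Row vectors of exponents and values, shape (n_months,)
exponent = df2['Month'].to_numpy(dtype=float) / 12.0
value = df2['Value'].to_numpy(dtype=float)

# Broadcasting builds an (n_ids, n_months) grid; sum across the months.
df['baseTotal'] = (value / base ** exponent).sum(axis=1)
print(df)
```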

How to merge two dataframe with some row values equal?

I have two dataframes which I want to merge into one. The first one stores the ids in a column named ID, while the second has the same values in a column named id_number. I tried the code below, but final_df ends up with both the ID and the id_number columns and their values. How can I keep only one column for the ids after merging?
final_df = df.merge(
df2,
left_on='ID',
right_on='id_number',
how='inner')
Also, suppose column A in df looks like this:
A
0
1
2
The same column A in the second dataframe has some empty fields, like this:
A
-
1
2
After the merge, how can the final dataframe combine the two dataframes so that A has no empty values?
Try selecting the required columns after the merge:
final_df = df.merge(
df2,
left_on='ID',
right_on='id_number',
how='inner')[['ID', 'col1', 'col2']]
Or drop the column after the merge:
final_df = df.merge(
df2,
left_on='ID',
right_on='id_number',
how='inner').drop(['id_number'], axis=1)
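Another option is combine_first, which fills missing values in df with values from df2 after aligning the two frames on their index and column names: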
df.combine_first(df2.rename(columns={'id_number': 'ID'}))
A full working example:
import pandas as pd
dfa = pd.DataFrame({'ID': [1, 2, 3], 'other': ['A', 'B', 'C']})
dfb = pd.DataFrame({'id_number': [None, 2, 3], 'other_2': ['A2', 'B2', 'C2']})
dfa.combine_first(dfb.rename(columns={'id_number': 'ID'}))
Rename the id_number column of df2 to ID on the fly:
final_df = df.merge(
df2.rename(columns={'id_number': 'ID'}),
on='ID',
how='inner')
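The two parts of the question (a single id column, and no empty values in A) can be combined in one pass. The sketch below uses made-up data: col1 and the suffix '_right' are hypothetical names, and it assumes A should take the left frame's value whenever that is present and fall back to the right frame's value otherwise.

```python
import pandas as pd

# Hypothetical frames: each has a gap in A that the other can fill.
df = pd.DataFrame({'ID': [1, 2, 3], 'A': [0.0, None, 2.0]})
df2 = pd.DataFrame({'id_number': [1, 2, 3],
                    'A': [None, 1.0, 2.0],
                    'col1': ['x', 'y', 'z']})

# Rename first so both frames share the key and only one ID column survives.
merged = df.merge(
    df2.rename(columns={'id_number': 'ID'}),
    on='ID',
    how='inner',
    suffixes=('', '_right'),   # left A keeps its name, right A becomes A_right
)

# Prefer the left value of A, fall back to the right one where it is missing.
merged['A'] = merged['A'].combine_first(merged['A_right'])
merged = merged.drop(columns='A_right')
print(merged)
```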

pandas - merge df with multiindex on column with other df on index (without multiindex)

My Google-fu didn't bring me the answer so I am posting my question here.
Let's say I have two data frames df1 and df2 and I want to merge them. df1 has a multi-index on columns and df2 consists of one multi-index column with an index. The index of df2 has a name that coincides with a name (at level 1) of one column in df1. How to merge the frames using one column in df1 and the index of df2? Simple example would go like this:
import pandas as pd
df1 = pd.DataFrame({('A', 'Col_1'): [1, 2, 3],
('A', 'Col_2'): ['A', 'B', 'C'],
('B', 'Col_1'): [1, 2, 3],
('B', 'Col_2'): ['A', 'B', 'C']})
df2 = pd.DataFrame({('C', 'Col_1'): ['X', 'Y', 'Z']},
index=pd.Index(['A', 'B', 'C'], name='Col_2'))
My aim is to merge df1 on column ('B', 'Col_2') with df2 on index, preserving all the columns in df1. How to do that?
As I understand the question, you want df1 and df2 to be joined on Col_2. Here is how you can do it. If I have missed something, please say so in the comments.
#dropping the group header of columns from df1
df1.columns = df1.columns.droplevel(0)
#Removing the duplicated columns in df1
df1 = df1.loc[:,~df1.columns.duplicated()]
#dropping the group header of columns from df2
df2.columns = df2.columns.droplevel(0)
#Reset the index of df2 as first column
df2.reset_index(level=0, inplace=True)
#Concatenating the 2 dataframes
new_df = pd.concat([df1.set_index('Col_2'),df2.set_index('Col_2')], axis=1,
join='inner').reset_index()
The final output will look like this
Col_2 Col_1 Col_1
0 A 1 X
1 B 2 Y
2 C 3 Z
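Note that the approach above drops the top column level and the duplicated columns, whereas the question asks to preserve all columns of df1. One possible way to keep them, sketched here with the frames from the question, is to treat df2's single column as a lookup Series and map the key column through it. This behaves like a left join (keys missing from df2 give NaN) and assumes the index of df2 has no duplicate labels.

```python
import pandas as pd

df1 = pd.DataFrame({('A', 'Col_1'): [1, 2, 3],
                    ('A', 'Col_2'): ['A', 'B', 'C'],
                    ('B', 'Col_1'): [1, 2, 3],
                    ('B', 'Col_2'): ['A', 'B', 'C']})
df2 = pd.DataFrame({('C', 'Col_1'): ['X', 'Y', 'Z']},
                   index=pd.Index(['A', 'B', 'C'], name='Col_2'))

# The only column of df2, as a Series indexed by Col_2.
lookup = df2[('C', 'Col_1')]

merged = df1.copy()
# Look up each key from ('B', 'Col_2') in df2's index and store the result
# as a new two-level column.
merged[('C', 'Col_1')] = merged[('B', 'Col_2')].map(lookup)
print(merged)
```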

How to join two data frames, one with a date time index and the other with normal indexing

I have a data frame with a date index in the form YYYY-MM-DD and another data frame with normal (default integer) indexing. They both have the same number of rows and I want to join the two data frames. The join and merge functions don't work, and concat changes the date format to a datetime format by adding hours-mins-sec and leaves many null values in the table. So how can I join the two data frames?
This is the code I used:
pd.concat([HK4, adjusted_data], axis=1, join='outer', ignore_index=False)
[Screenshots: 1. dataset with date time index; 2. dataset with normal indexing; 3. concatenated dataset]
Try resetting the index of the DataFrame that has the YYYY-MM-DD index, so that both frames share the default integer index:
import pandas as pd
# create first dataframe
d1 = {'dt': ['2020-01-02', '2020-05-05'], 'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d1).set_index('dt')
# create second dataframe
d2 = {'col3': ['hello', 'world'], 'col4': ['how', 'to']}
df2 = pd.DataFrame(data=d2)
# concatenate dataframes
df3 = pd.concat([df1.reset_index(), df2], axis=1)
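If the dates should stay as the index of the result, the opposite direction may also work: give the second frame the first frame's index, then concatenate. This is a sketch reusing the sample data above, and it assumes the rows of the two frames correspond by position, since the pairing is purely positional.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]},
                   index=pd.Index(['2020-01-02', '2020-05-05'], name='dt'))
df2 = pd.DataFrame({'col3': ['hello', 'world'], 'col4': ['how', 'to']})

# Positional pairing only makes sense when the row counts match.
assert len(df1) == len(df2)

# Relabel df2's rows with df1's date index, then concat aligns on it.
df2_aligned = df2.set_axis(df1.index, axis=0)
df3 = pd.concat([df1, df2_aligned], axis=1)
print(df3)
```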

Python: Divide each row of a DataFrame by another DataFrame vector

I have a DataFrame (df1) with 2000 rows x 500 columns (excluding the index), and I want to divide each row by another DataFrame (df2) with 1 row x 500 columns. Both have the same column headers. I tried:
df1.divide(df2) and
df1.divide(df2, axis='index') and several other variants, but I always get a DataFrame with NaN in every cell. What argument am I missing in df.divide?
Instead of passing df2 itself, pass its single row as a Series (e.g. df2.iloc[0]) and align it on the columns:
import pandas as pd
data1 = {"a":[1.,3.,5.,2.],
"b":[4.,8.,3.,7.],
"c":[5.,45.,67.,34]}
data2 = {"a":[4.],
"b":[2.],
"c":[11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
Or you can use df1 / df2.values[0, :], which divides by a plain NumPy array and skips label alignment.
You can divide by the series i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
A B
0 0.2 0.2
1 NaN NaN
In [14]: df.div(df2.iloc[0])
Out[14]:
A B
0 0.2 0.2
1 0.6 0.4
Small clarification, just in case: the reason you got NaN everywhere, while Andy's first example (df.div(df2)) works for the first row, is that div aligns on both the index and the columns. In Andy's example, index 0 exists in both dataframes, so the division is performed; index 1 does not, so that row is filled with NaN. This becomes even more obvious if you run the following (only the 't' row is divided):
import numpy as np
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both dataframes, so when you slice the first row, you get a series, the index of which is composed by the column headers of df2. This is what eventually allows the division to take place properly.
For a case with index and column matching:
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'], columns = range(5))
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'], columns = [1,2,3,4,5])
df_a.div(df_b)
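The alignment behaviour described above can be checked with a small deterministic example. The frames and the row label 'divisor' below are made up for illustration: dividing by the one-row DataFrame gives only NaN because no row labels match, while dividing by its first row as a Series (or as a plain array) divides every row.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 3.0], 'b': [4.0, 8.0]}, index=['r1', 'r2'])
df2 = pd.DataFrame({'a': [4.0], 'b': [2.0]}, index=['divisor'])

# Row labels 'r1'/'r2' never match 'divisor', so every cell is NaN.
misaligned = df1.div(df2)

# A Series indexed by column names is matched against df1's columns
# and applied to every row.
by_series = df1.div(df2.iloc[0])

# A 1-D array carries no labels; it is applied across the columns by position.
by_array = df1 / df2.to_numpy()[0]

print(misaligned)
print(by_series)
```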
If you want to divide every value in a column by a specific number, you could try:
df['column_name'] = df['column_name'].div(10000)
For me, this divided each value in 'column_name' by 10,000.
To divide a single row (across one or more columns), do the following:
df.loc['index_value'] = df.loc['index_value'].div(10000)
