I have a dataframe like the one shown below
data = {
    'key': ['k1', 'k2'],
    'name_M1': ['name', 'name'], 'area_M1': [1, 2], 'length_M1': [11, 21], 'breadth_M1': [12, 22],
    'name_M2': ['name', 'name'], 'area_M2': [1, 2], 'length_M2': [11, 21], 'breadth_M2': [12, 22],
    'name_M3': ['name', 'name'], 'area_M3': [1, 2], 'length_M3': [11, 21], 'breadth_M3': [12, 22],
    'name_M4': ['name', 'name'], 'area_M4': [1, 2], 'length_M4': [11, 21], 'breadth_M4': [12, 22],
    'name_M5': ['name', 'name'], 'area_M5': [1, 2], 'length_M5': [11, 21], 'breadth_M5': [12, 22],
    'name_M6': ['name', 'name'], 'area_M6': [1, 2], 'length_M6': [11, 21], 'breadth_M6': [12, 22],
}
df = pd.DataFrame(data)
The input data (shown above) is in wide format.
I would like to convert it into a time-based long format like the one below. I call it time-based because each row holds 3 months of data, and each subsequent row is shifted forward by 1 month.
For example, a sample of the data (with only one column per month) looks like this:
k1,Area_M1,Area_M2,Area_M3,Area_M4,Area_M5,Area_M6
I would like to convert it as below (each subsequent row is shifted by one month):
k1,Area_M1,Area_M2,Area_M3
k1,Area_M2,Area_M3,Area_M4
k1,Area_M3,Area_M4,Area_M5
k1,Area_M4,Area_M5,Area_M6
But in my real data, instead of one column per month, I have multiple columns for each month, and all of those columns need to be transformed. So I tried something like the below, but it doesn't work:
pd.wide_to_long(df, stubnames=["name_1st", "area_1st", "length_first", "breadth_first",
                               "name_2nd", "area_2nd", "length_2nd", "breadth_2nd",
                               "name_3rd", "area_3rd", "length_3rd", "breadth_3rd"],
                i="key", j="name",
                sep="_", suffix=r"(?:\d+|n)").reset_index()
But I expect my output to follow the sliding-window pattern shown above, applied to all of the columns for each month.
This is pretty ugly, but I'm not exactly sure of an easier way to do it. Perhaps you could melt everything and do a rolling pivot (a rough sketch of that idea follows the code below), but it's not really much different.
This approach just slices columns 0:12, 4:16, etc. until the end, renaming the slices and concatenating them all together.
import pandas as pd
import numpy as np
data = {
    'key': ['k1', 'k2'],
    'name_M1': ['name', 'name'], 'area_M1': [1, 2], 'length_M1': [11, 21], 'breadth_M1': [12, 22],
    'name_M2': ['name', 'name'], 'area_M2': [1, 2], 'length_M2': [11, 21], 'breadth_M2': [12, 22],
    'name_M3': ['name', 'name'], 'area_M3': [1, 2], 'length_M3': [11, 21], 'breadth_M3': [12, 22],
    'name_M4': ['name', 'name'], 'area_M4': [1, 2], 'length_M4': [11, 21], 'breadth_M4': [12, 22],
    'name_M5': ['name', 'name'], 'area_M5': [1, 2], 'length_M5': [11, 21], 'breadth_M5': [12, 22],
    'name_M6': ['name', 'name'], 'area_M6': [1, 2], 'length_M6': [11, 21], 'breadth_M6': [12, 22],
}
df = pd.DataFrame(data)
df = df.set_index('key')

s = 4  # columns per month (name, area, length, breadth)
n = 3  # window size in months

cols = [
    'name_1st', 'area_1st', 'length_1st', 'breadth_1st',
    'name_2nd', 'area_2nd', 'length_2nd', 'breadth_2nd',
    'name_3rd', 'area_3rd', 'length_3rd', 'breadth_3rd',
]

# number of sliding windows: (total months) - (window size) + 1
n_windows = (df.shape[1] - s * n) // s + 1

output = pd.concat(
    (df.iloc[:, i * s : i * s + s * n].set_axis(cols, axis=1) for i in range(n_windows)),
    ignore_index=True, axis=0,
).set_index(np.tile(df.index, n_windows))
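For completeness, here is a rough sketch of the melt/rolling idea mentioned above -- an illustration rather than a drop-in replacement, and the _1/_2/_3 suffixes are just illustrative names. It assumes the same df as in the question (reset_index() restores 'key' as a column, since the code above moved it to the index):

import pandas as pd

# melt the wide frame into one row per (key, month)
tidy = pd.wide_to_long(
    df.reset_index(), stubnames=["name", "area", "length", "breadth"],
    i="key", j="month", sep="_M", suffix=r"\d+"
).reset_index()

n = 3
months = sorted(tidy["month"].unique())
windows = []
for start in range(len(months) - n + 1):
    # pull n consecutive months, suffix them _1.._n, and join on key
    parts = [
        tidy[tidy["month"] == months[start + k]]
        .set_index("key")
        .drop(columns="month")
        .add_suffix(f"_{k + 1}")
        for k in range(n)
    ]
    windows.append(pd.concat(parts, axis=1))

output2 = pd.concat(windows)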
I am a new coder using Jupyter Notebook. I have a dataframe that contains 23 columns with different numbers of values (at most 23 and at least 2). I have created a function that normalizes the contents of one column, shown below.
def normalize(column):
    y = DFref[column].values  # pull the column out as a NumPy array
    y = y.astype(int)
    KGF = list()
    for element in y:
        element_norm = element / y.sum()  # divide each value by the column total
        KGF.append(element_norm)
    return KGF
I am now trying to create a function that loops through all columns in the dataframe. Right now, if I plug in the name of one column, it works as intended. What would I need to do to create a function that loops through each column, normalizes its values, and adds the results to a new dataframe?
It's not clear if all 23 columns are numeric, but I will assume they are. There are a number of ways to solve this; the method below probably isn't the best, but it might be a quick fix for you...
colnames = DFref.columns.tolist()
normalised_data = {}
for colname in colnames:
    normalised_data[colname] = normalize(colname)
df2 = pd.DataFrame(normalised_data)
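As an aside, the loop can be avoided entirely: dividing a DataFrame by a Series aligns on the column labels, so each column can be divided by its own sum in one vectorized step. A minimal sketch, assuming all columns of DFref are numeric:

import pandas as pd

# DFref.sum() is a Series of per-column sums, indexed by column name;
# the division aligns that Series with DFref's columns, so every column
# is divided by its own total in one shot
df2 = DFref / DFref.sum()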
It would be great to understand how this actually works. Perhaps there is something in Python/pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks, while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
Why are the results all 0, besides the first column being NaNs? i.e., why wasn't the calculation actually performed?
Thanks.
When we do the division, pandas first matches the index and columns of both df_price.iloc[:,1:] and df_price.iloc[:,:-1]. To avoid that alignment, we can add .values to one side to strip its index and columns; then the output behaves as expected:
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0    NaN   # index 0 only exists in s.iloc[:-1]
1    1.0   # index 1 exists in both, so 4/4 = 1.0
2    NaN   # index 2 only exists in s.iloc[1:]
dtype: float64
From the above we can say that pandas objects match on the index first, behaving much like an outer join.
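As an aside, pandas has a built-in for this exact calculation, which sidesteps the alignment issue entirely. A minimal sketch, assuming returns should run across the columns (dates), as in the question:

import numpy as np
import pandas as pd

df_price = pd.DataFrame(np.random.rand(10, 10))

# percentage change from one column (date) to the next;
# the first column has no predecessor, so it comes back as NaN
df_ret = df_price.pct_change(axis=1)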
I have this situation:
I have a probability of 0.1348 calculated in a variable called treat_conv.
Now I am trying to create a dataframe from the original dataframe, using this probability to populate a specified column. Is that possible? I am trying to use weights, but with no success. Maybe I am using it wrong?
Follow my code:
weights = np.array(treat_conv)  # creating an array with treat_conv

# creating a new dataframe with the number of rows of treat_group;
# the 'converted' column should have a 0.13 chance of bringing the value 1
new_page_converted = df2.sample(n=treat_group.shape[0], weights=df2.converted(weights))
So, the code works if I use n alone: it creates a new dataframe with the correct amount of rows. But I can't get the correct probability of bringing a certain amount of the value 1 in the converted column.
I hope my explanation is understandable.
Thank you!
You could do something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 100, 1), columns=["SomeValue"])

# randomly pick 13% of the values, without replacement
selected = pd.DataFrame(
    data=np.random.choice(df["SomeValue"], int(len(df["SomeValue"]) * 0.13), replace=False),
    columns=["SomeValue"],
)
selected["Trigger"] = 1

# merge the selection flag back onto the original frame
df = df.merge(selected, how="left", on="SomeValue")
df["Trigger"].fillna(0, inplace=True)
"df" is your original DataFrame. Then select random 13% of the values and add a column indicating they've been selected. Finally, merge all back to your original Dataframe.
I have a dataframe ("ndate") whose rows are dates and whose columns hold the $ investment in each stock on a particular day. I also have a Series ("portT") containing the sum of the total investments across all stocks on each date (series size: len(ndate) x 1). Here is the code that calculates the weight of each stock on each date by dividing each element of each row of ndate by that day's sum:
(l, w) = port1.shape
for i in range(0, l):
    port1.iloc[i] = np.divide(ndate.iloc[i], portT.iloc[i])
The code runs very slowly. Could you please let me know how I can modify it to speed it up? I tried to do this by vectorising, but did not succeed.
As this is just a simple division of two dataframes of the same shape (or you can formulate it as such), you can use the plain / operator; pandas will execute it element-wise (possibly with broadcasting if the shapes don't match, so be sure about that):
import pandas as pd
df1 = pd.DataFrame([[1,2], [3,4]])
df2 = pd.DataFrame([[2,2], [3,3]])
df_new = df1 / df2
#>>> pd.DataFrame([[0.5, 1.], [1., 1.333333]])
This is most likely doing the same operations internally that you specified in your example; however, internal assignments and checks are bypassed, which should give you some speedup.
EDIT:
I was mistaken about the outline of your problem; maybe include a minimal self-contained code example next time. Still, the / operator also works for DataFrames and Series in combination:
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]])
s = pd.Series([1,2])
new_df = df / s
#>>> pd.DataFrame([[1., 1.], [3., 2.]])  # s aligns with df's columns
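Note that df / s divides along the columns. For the row-wise case in the original question (dividing each row of ndate by that date's total in portT), DataFrame.div with axis=0 aligns the Series against the index instead. A sketch, assuming portT shares ndate's date index:

import pandas as pd

ndate = pd.DataFrame([[1, 2], [3, 4]], index=["d1", "d2"])
portT = ndate.sum(axis=1)  # per-date totals, indexed like ndate

# axis=0 aligns portT with ndate's index, dividing row by row
port1 = ndate.div(portT, axis=0)
#>>> each row of port1 now sums to 1.0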