I have a dataframe
x = pd.DataFrame(index = ['wkdy','hr'],columns=['c1','c2','c3'])
This leads to 168 rows of data in the dataframe. 7 weekdays and 24 hours in each day.
I have another dataframe
dates = pd.date_range('20090101',periods = 10000, freq = 'H')
y = pd.DataFrame(np.random.randn(10000, 3), index=dates, columns=['c1','c2','c3'])
y['hr'] = y.index.hour
y['wkdy'] = y.index.weekday
I want to fill 'y' with data from 'x', so that each weekday and hour has the same data but with a datestamp attached to it.
The only way I know is to loop through the dates and fill in values. Is there a faster, more efficient way to do this?
My solution (rather crude, to say the least) iterates over the entire dataframe y row by row and tries to fill it from dataframe x through a lookup.
for r in range(0, len(y)):
    h = int(y.iloc[r]['hr'])
    w = int(y.iloc[r]['wkdy'])
    y.iloc[r] = x.loc[(w, h)]
Your dataframe x doesn't have 168 rows but looks like
c1 c2 c3
wkdy NaN NaN NaN
hr NaN NaN NaN
and you can't index it using a tuple like in x.loc[(w,h)]. What you probably had in mind was something like
x = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [range(7), range(24)], names=['wkdy', 'hr']),
    columns=['c1', 'c2', 'c3'],
    data=np.arange(3 * 168).reshape(3, 168).T)
x
c1 c2 c3
wkdy hr
0 0 0 168 336
1 1 169 337
... ... ... ... ...
6 22 166 334 502
23 167 335 503
168 rows × 3 columns
Now your loop will work, although a more pythonic version would look like this:
for idx, row in y.iterrows():
    y.loc[idx, ['c1', 'c2', 'c3']] = x.loc[(row.wkdy, row.hr)]
However, iterating through dataframes is very expensive and you should look for a vectorized solution by simply merging the 2 frames and removing the unwanted columns:
y = (x.merge(y.reset_index(), on=['wkdy', 'hr'])
      .set_index('index')
      .sort_index()
      .iloc[:, :-3])
y
wkdy hr c1_x c2_x c3_x
index
2009-01-01 00:00:00 3 0 72 240 408
2009-01-01 01:00:00 3 1 73 241 409
... ... ... ... ... ...
2010-02-21 14:00:00 6 14 158 326 494
2010-02-21 15:00:00 6 15 159 327 495
10000 rows × 5 columns
Now y is a dataframe with columns c1_x, c2_x, c3_x holding the data from dataframe x where y.wkdy == x.wkdy and y.hr == x.hr. On this data, merging is roughly 1000 times faster than looping.
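If you prefer to avoid the merge entirely, a vectorized lookup over x's MultiIndex works too. This is just a sketch, assuming the MultiIndexed x and the original y (with its hr and wkdy columns) from above, before the merge reassigned it:
# Build a (wkdy, hr) key for every timestamp in y, then look all of them up in x at once.
keys = pd.MultiIndex.from_arrays([y['wkdy'], y['hr']])
y[['c1', 'c2', 'c3']] = x.reindex(keys).to_numpy()
This keeps y's original column names instead of producing the c1_x/c2_x/c3_x suffixes.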
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example:
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
...                   columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could manipulate the index of the second column (taking a copy so the original frame is untouched)
>>> s2 = df[2].copy()
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast, but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: Generally, shifting is possible with df[2].shift(1), as already posted; however, that would cut off the carried-over value (291) instead of keeping it in a new row as above.
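For reference, a minimal sketch of that cut-off (my own example, not from the original post): a plain shift(1) drops the last value rather than carrying it into a new row.
import pandas as pd

s = pd.Series([214, 234, 253, 272, 291])
print(s.shift(1))
# 0      NaN
# 1    214.0
# 2    234.0
# 3    253.0
# 4    272.0
# dtype: float64   <- the trailing 291 is gone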
If you don't want to lose the values you shift past the end of your dataframe, simply append the required number of empty rows first:
offset = 5
# DataFrame.append was removed in pandas 2.0, so add the empty rows with pd.concat
DF = pd.concat([DF, pd.DataFrame(np.nan, index=range(offset), columns=DF.columns)])
DF = DF.shift(periods=offset)
DF = DF.reset_index(drop=True)  # only works if the index is sequential
I assume the following imports:
import pandas as pd
import numpy as np
First, append a new row of NaN values at the end of the DataFrame (df).
s1 = df.iloc[0].copy()  # copy the 1st row to a new Series s1
s1[:] = np.nan          # set all values to NaN
df2 = pd.concat([df, s1.to_frame().T], ignore_index=True)  # add s1 to the end of df (append was removed in pandas 2.0)
This creates a new DataFrame, df2. There may be a more elegant way, but this works.
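As a possibly more elegant alternative (my own suggestion, assuming df has the default 0..n-1 RangeIndex), reindexing to one extra position adds the same all-NaN row in a single step:
df2 = df.reindex(range(len(df) + 1))  # row len(df) is created and filled with NaN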
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a problem of my own, similar to yours, I found something in the pandas docs that I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
I hope this helps future readers with this question.
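Here is a small sketch of that behaviour (my own example, not from the docs): with freq given, the index itself is shifted and no values are lost or replaced by NaN.
import pandas as pd

s = pd.Series([1, 2, 3], index=pd.date_range('2021-01-01', periods=3, freq='D'))
print(s.shift(periods=1, freq='D'))
# 2021-01-02    1
# 2021-01-03    2
# 2021-01-04    3
# Freq: D, dtype: int64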
Passing a negative period shifts a column up instead of down. Starting from df3:
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
df3
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))  # in pandas >= 1.4, use inclusive='right' instead of closed
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenate them together. But I would really like to see this as a standard feature in pandas so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I read the official docs for shift() while trying to figure this out, but they didn't make much sense to me and have no examples covering this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the Dataframe. I expected shift() to have a flag to change this behaviour, but I can't find anything.
Sorry if this is naive. I have the following data and I want to feature-engineer some columns, but I don't know how to do multiple operations on the same data frame. One thing to mention: I have multiple entries for each customer, so in the end I want aggregated values (i.e. one entry for each customer).
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 06-11-2009 2395
1 860 50.0 09-28-2012 1190
2 1200 100.0 10-25-2005 3720
3 1420 50.0 09-07-2009 2307
4 1940 70.0 01-25-2013 1071
New columns based on min, count and mean:
customer_purchases['amount'] = customer_purchases.groupby(['customer_id'])['purchase_amount'].agg('min')
customer_purchases['frequency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('count')
customer_purchases['recency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('mean')
Expected outcome:
customer_id purchase_amount date_of_purchase days_since recency frequency amount first_purchase
0 760 25.0 06-11-2009 2395 1273 5 38.000000 3293
1 860 50.0 09-28-2012 1190 118 10 54.000000 3744
2 1200 100.0 10-25-2005 3720 1192 9 102.777778 3907
3 1420 50.0 09-07-2009 2307 142 34 51.029412 3825
4 1940 70.0 01-25-2013 1071 686 10 47.500000 3984
One solution:
I can think of three separate operations, one for each needed column, and then joining them all to get a new data frame. I know it's not efficient, but it gets what I need.
df_1 = customer_purchases.groupby('customer_id', sort = False)["purchase_amount"].min().reset_index(name ='amount')
df_2 = customer_purchases.groupby('customer_id', sort = False)["days_since"].count().reset_index(name ='frequency')
df_3 = customer_purchases.groupby('customer_id', sort = False)["days_since"].mean().reset_index(name ='recency')
However, I either get an error or a data frame without the correct data.
Your help and patience will be appreciated.
SOLUTION
Finally I found a solution:
def f(x):
    recency = x['days_since'].min()
    frequency = x['days_since'].count()
    monetary_value = x['purchase_amount'].mean()
    c = ['recency', 'frequency', 'monetary_value']
    return pd.Series([recency, frequency, monetary_value], index=c)

df1 = customer_purchases.groupby('customer_id').apply(f)
print(df1)
Use this instead:
customer_purchases.groupby('customer_id')['purchase_amount'].transform(lambda x : x.min())
transform returns a value for every row of the original dataframe, instead of one row per group as you get with agg.
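For comparison, here is a sketch (assuming the customer_purchases frame from the question) that uses named aggregation to get exactly one row per customer, which is what the question ultimately asks for. The first_purchase column is my guess at max('days_since') and may need adjusting:
summary = (customer_purchases
           .groupby('customer_id', as_index=False)
           .agg(amount=('purchase_amount', 'min'),
                frequency=('days_since', 'count'),
                recency=('days_since', 'mean'),
                first_purchase=('days_since', 'max')))  # assumption: oldest purchase
print(summary.head())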
Say I have a data frame of dimension (74, 3234): 74 rows and 3234 columns. I have a function to run a correlation analysis. However, when I give it this data frame as is, it takes forever to print the results. Now I would like to split the data frame into multiple chunks and use the chunks in the function.
The data frame has 20,000 columns with names containing the string _PC and 15,000 columns with names containing the string _lncRNAs.
The condition that needs to hold is:
I need to split the data frame into multiple smaller dataframes, each containing both columns with _PC and columns with _lncRNAs in their names. For example, df1 must contain 500 columns with _PC and 500 columns with _lncRNAs.
I envision having multiple data frames, each still with 74 rows but using consecutive columns: for instance 1-500, 501-1000, 1001-1500, 1501-2000, and so on until the last column.
df1.shape
(74, 500)
df2.shape
(74, 500)
...
and so on.
One example:
df1.head()
sam END_PC END2_PC END3_lncRNAs END4_lncRNAs
SAP1 50.9 30.4 49.0 50
SAP2 68.9 12.4 39.8 345.9888
Then I need to run each split data frame through the following function.
from scipy.stats import pearsonr

def correlation_analysis(lncRNA_PC_T):
    """
    Function for correlation analysis
    """
    correlations = pd.DataFrame()
    for PC in [column for column in lncRNA_PC_T.columns if '_PC' in column]:
        for lncRNA in [column for column in lncRNA_PC_T.columns if '_lncRNAs' in column]:
            correlations = correlations.append(
                pd.Series(pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA]),
                          index=['PCC', 'p-value'],
                          name=PC + '_' + lncRNA))
    correlations.reset_index(inplace=True)
    correlations.rename(columns={0: 'name'}, inplace=True)
    correlations['PC'] = correlations['index'].apply(lambda x: x.split('PC')[0])
    correlations['lncRNAs'] = correlations['index'].apply(lambda x: x.split('PC')[1])
    correlations['lncRNAs'] = correlations['lncRNAs'].apply(lambda x: x.split('_')[1])
    correlations['PC'] = correlations.PC.str.strip('_')
    correlations.drop('index', axis=1, inplace=True)
    correlations = correlations.reindex(columns=['PC', 'lncRNAs', 'PCC', 'p-value'])
    return correlations
For each data frame, the output should look like this:
gene PCC p-value
END_PC_END3_lncRNAs -0.042027 0.722192
END2_PC_END3_lncRNAs -0.017090 0.885088
END_PC_END4_lncRNAs 0.001417 0.990441
END2_PC_END3_lncRNAs -0.041592 0.724954
I know one can split based on rows like this,
n = 200000 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
I want something like this based on columns. Any suggestions or help is much appreciated.
Thanks
If you want the correlation of each column containing _PC versus the columns containing _lncRNAs, you could try something like this:
df_pc=df.filter(like='_PC')
df_lncRNAs=df.filter(like='_lncRNAs')
pd.concat([df_pc, df_lncRNAs], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']
Example:
import pandas as pd
df = pd.DataFrame({"a_pc":[1,2,3,4,5,6],
"b_pc":[3,210,12,412,512,61]
,"c_pc": [1,2,3,4,5,6]
,"d_lncRNAs": [3,210,12,412,512,61]
,"d1_lncRNAs": [3,210,12,412,512,61]})
df_pc=df.filter(like='_pc')
df_lncRNAs=df.filter(like='_lncRNAs')
correlation=pd.concat([df_pc, df_lncRNAs], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']
correlation
Output:
df
a_pc b_pc c_pc d_lncRNAs d1_lncRNAs
0 1 3 1 3 3
1 2 210 2 210 210
2 3 12 3 12 12
3 4 412 4 412 412
4 5 512 5 512 512
5 6 61 6 61 61
df_pc
a_pc b_pc c_pc
0 1 3 1
1 2 210 2
2 3 12 3
3 4 412 4
4 5 512 5
5 6 61 6
df_lncRNAs
d_lncRNAs d1_lncRNAs
0 3 3
1 210 210
2 12 12
3 412 412
4 512 512
5 61 61
correlation
a_pc b_pc c_pc
d_lncRNAs 0.392799 1.0 0.392799
d1_lncRNAs 0.392799 1.0 0.392799
How about df.iloc?
And use df.shape[1] for the number of columns:
n = 500  # number of columns per chunk
list_df = [df.iloc[:, i:i+n] for i in range(0, df.shape[1], n)]
Ref: How to take column-slices of dataframe in pandas
It's just like Basil wrote, but using pandas.DataFrame.iloc.
I do not know what the column labels are, so to make this independent of the index or column labels, it is better to use:
list_df = [df.iloc[:,i:i+n] for i in range(0, df.shape[1], n)]
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
This is what I tried, to see how long it takes to evaluate the correlation between rows and between columns of the dataframe (df). It took under 50 milliseconds for the rows-correlation and under 2 seconds for the columns-correlation.
Rows-correlation output-shape: (74x74)
Columns-correlation output-shape: (3000x3000)
Correlation between some target columns and all columns
# Define target columns
target_cols = ['C0', 'C12', 'C100']
# Evaluate the full correlation matrix
df_corr = df.corr()
# Extract the correlation result for the target columns
corr_result = df_corr[target_cols]
print(corr_result)
Output:
C0 C12 C100
C0 1.000000 -0.031120 -0.221829
C1 0.064772 -0.130507 -0.086164
C2 0.077853 -0.116949 0.003468
C3 0.070557 -0.013551 0.007093
C4 0.165782 -0.058755 -0.175888
... ... ... ...
C2995 -0.097033 -0.014391 0.018961
C2996 0.099591 0.017187 -0.016138
C2997 -0.126288 0.145150 -0.089306
C2998 0.033484 0.054106 -0.006594
C2999 -0.154657 0.020002 -0.104889
Dummy Data
import numpy as np
import pandas as pd
## Create Dummy Data
a = np.random.rand(74, 3000)
print(f'a.shape: {a.shape}')
## Create Dataframe
index = [f'R{i}' for i in range(a.shape[0])]
columns = [f'C{i}' for i in range(a.shape[1])]
df = pd.DataFrame(a, columns=columns, index=index)
df.shape # (74, 3000)
Evaluate Correlation
I did the following in a jupyter notebook
%%time
## Correlation between rows of df
df.T.corr()
# CPU times: user 39.5 ms, sys: 1.09 ms, total: 40.6 ms
# Wall time: 41.3 ms

%%time
## Correlation between columns of df
df.corr()
# CPU times: user 1.64 s, sys: 34.6 ms, total: 1.67 s
# Wall time: 1.67 s
Output: df.corr()
Since the shape of the dataframe was (74, 3000), df.corr() yields a dataframe of shape (3000, 3000).
C0 C1 C2 ... C2997 C2998 C2999
C0 1.000000 0.064772 0.077853 ... -0.126288 0.033484 -0.154657
C1 0.064772 1.000000 0.031059 ... 0.064317 0.095075 -0.100423
C2 0.077853 0.031059 1.000000 ... -0.123791 -0.034085 0.052334
C3 0.070557 0.229482 0.047476 ... 0.043630 -0.055772 0.037123
C4 0.165782 0.189635 -0.009193 ... -0.123917 0.097660 0.074777
... ... ... ... ... ... ... ...
C2995 -0.097033 -0.126214 0.051592 ... 0.008921 -0.004141 0.221091
C2996 0.099591 0.030975 -0.081584 ... 0.186931 0.084529 0.063596
C2997 -0.126288 0.064317 -0.123791 ... 1.000000 0.061555 0.024695
C2998 0.033484 0.095075 -0.034085 ... 0.061555 1.000000 0.195013
C2999 -0.154657 -0.100423 0.052334 ... 0.024695 0.195013 1.000000
This is my dataframe
Order Time Profit
0 1 106 NaN
1 1 111 -296.0
2 2 14 NaN
3 2 16 -296.0
4 3 62 NaN
.. ... ... ...
335 106 32 -297.6
336 107 44 NaN
337 107 44 138.0
338 108 58 NaN
339 108 63 -303.4
So the way I want it to work is to plot a chart where X is the time and Y is the profit value (positive or negative), so we need to have two bars. Now, the time should not come from the same row, but from the first row with the same order number.
For example, the -296.0 would be under time 106, not 111, because 106 was the first time under order no. 1. How would we do something like that?
This is my code so far:
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Order','Time','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
df['Profit'] = df['Profit'].astype(float)
Assuming the structure we see in the sample of your data holds over the entire data set, i.e. there is only one Profit value per Order, you can do it like this: Group the DataFrame by Order, and aggregate by taking the minimum:
df_grouped = df.groupby(by='Order').min()
resulting in this DataFrame:
Time Profit
Order
1 106 -296.0
2 14 -296.0
3 62 NaN
...
106 32 -297.6
107 44 138.0
108 58 -303.4
Then you can sort by Time and do the plot:
import matplotlib.pyplot as plt
df_grouped.sort_values(by='Time', inplace=True)
plt.plot(df_grouped['Time'], df_grouped['Profit'])
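If you want bars rather than a line, as described in the question, here is a minimal sketch using the same df_grouped (rows whose Profit is NaN simply produce no bar):
plt.bar(df_grouped['Time'], df_grouped['Profit'])
plt.xlabel('Time (hour of week)')
plt.ylabel('Profit')
plt.show()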
If you would rather rely on position in the data table, you can also do this:
plot_df = pd.DataFrame()
plot_df["Order"] = df.Order.unique()
plot_df["Profit"] = list(df.groupby("Order").nth(-1)["Profit"])
plot_df["Time"] = list(df.groupby("Order").nth(0)["Time"])
However, if you want the minimum value for time, you'd be better off using the solution provided by Arne, since it is safer and more correct (provided that you only have one profit value for each order number).
I have a pandas dataframe where I'm trying to append two column values if the value of the second column is not NaN. Importantly, after appending the two values I need the value from the second column set to NaN. I have managed to concatenate the values but cannot update the second column to NaN.
This is what I start with for ldc_df[['ad_StreetNo', 'ad_StreetNo2']].head(5):
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196 198
4 227 NaN
This is what I currently have after appending:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 198
4 227 NaN
But here is what I am trying to obtain:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
Where the value for ldc_df['ad_StreetNo2'].loc[3] should be changed to NaN.
This is the code I am using currently:
def street_check(street_number_one, street_number_two):
    if pd.notnull(street_number_one) and pd.notnull(street_number_two):
        return str(street_number_one) + '-' + str(street_number_two)
    else:
        return street_number_one

ldc_df['ad_StreetNo'] = ldc_df[['ad_StreetNo', 'ad_StreetNo2']].apply(lambda x: street_check(*x), axis=1)
Does anyone have any advice as to how I can obtain my expected output?
Sam
# Convert the Street numbers to a string so that you can append the '-' character.
ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
# Create a mask of those addresses having an additional street number.
mask = ldc_df['ad_StreetNo2'].notnull()
# Use the mask to append the additional street number.
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(str)
# Set the additional street number to NaN.
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
Alternative Solution
ldc_df['ad_StreetNo'] = (
ldc_df['ad_StreetNo'].astype(str)
+ ['' if np.isnan(n) else '-{}'.format(str(int(n)))
for n in ldc_df['ad_StreetNo2']]
)
ldc_df['ad_StreetNo2'] = np.nan
pd.DataFrame.stack folds a dataframe with a single level column index into a series object. Along the way, it drops any null values by default. We can then group by the previous index levels and join with '-'.
df.stack().astype(str).groupby(level=0).apply('-'.join)
0 284
1 51
2 136
3 196-198
4 227
dtype: object
I then use assign to create a copy of df while overwriting the two columns.
df.assign(
    ad_StreetNo=df.stack().astype(str).groupby(level=0).apply('-'.join),
    ad_StreetNo2=np.nan
)
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN