I currently have the following code, which goes through each row of a dataframe and assigns the prior row's value for a certain cell to a different cell in the current row.
Basically, what I'm doing is finding out what 'yesterday's' value for a certain metric is compared to today's. As you would expect, this is quite slow (especially since I am working with dataframes that have hundreds of thousands of lines).
for index, row in symbol_df.iterrows():
    if index != 0:
        symbol_df.loc[index, 'yesterday_sma_20'] = symbol_df.loc[index-1]['sma_20']
        symbol_df.loc[index, 'yesterday_roc_20'] = symbol_df.loc[index-1]['roc_20']
        symbol_df.loc[index, 'yesterday_roc_100'] = symbol_df.loc[index-1]['roc_100']
        symbol_df.loc[index, 'yesterday_atr_10'] = symbol_df.loc[index-1]['atr_10']
        symbol_df.loc[index, 'yesterday_vsma_20'] = symbol_df.loc[index-1]['vsma_20']
Is there a way to turn this into a vectorized operation? Or really just any way to speed it up instead of having to iterate through each row individually?
I might be overlooking something, but I think using .shift() should do it.
import pandas as pd
df = pd.read_csv('test.csv')
print(df)
# Date SMA_20 ROC_20
# 0 7/22/2015 0.754889 0.807870
# 1 7/23/2015 0.376448 0.791365
# 2 7/22/2015 0.527232 0.407420
# 3 7/24/2015 0.616281 0.027188
# 4 7/22/2015 0.126556 0.274681
# 5 7/25/2015 0.570008 0.864057
# 6 7/22/2015 0.632057 0.746988
# 7 7/26/2015 0.373405 0.883944
# 8 7/22/2015 0.775591 0.453368
# 9 7/27/2015 0.678638 0.313374
df['y_SMA_20'] = df['SMA_20'].shift()
df['y_ROC_20'] = df['ROC_20'].shift()
print(df)
# Date SMA_20 ROC_20 y_SMA_20 y_ROC_20
# 0 7/22/2015 0.754889 0.807870 NaN NaN
# 1 7/23/2015 0.376448 0.791365 0.754889 0.807870
# 2 7/22/2015 0.527232 0.407420 0.376448 0.791365
# 3 7/24/2015 0.616281 0.027188 0.527232 0.407420
# 4 7/22/2015 0.126556 0.274681 0.616281 0.027188
# 5 7/25/2015 0.570008 0.864057 0.126556 0.274681
# 6 7/22/2015 0.632057 0.746988 0.570008 0.864057
# 7 7/26/2015 0.373405 0.883944 0.632057 0.746988
# 8 7/22/2015 0.775591 0.453368 0.373405 0.883944
# 9 7/27/2015 0.678638 0.313374 0.775591 0.453368
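Applied back to the question's dataframe, this becomes a short loop over the metric names (a sketch, assuming the same symbol_df and columns as in the original code):
for col in ['sma_20', 'roc_20', 'roc_100', 'atr_10', 'vsma_20']:
    symbol_df[f'yesterday_{col}'] = symbol_df[col].shift()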
I am getting myself very confused over a problem I am encountering with a short Python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.
import pandas as pd

# This function will take a raw AGS file (saved as a CSV) and convert it to a
# dataframe. It will take the AGS CSV and print the top 5 header lines.
def AGS_raw(file_loc):
    raw_df = pd.read_csv(file_loc)
    #print(raw_df.head())
    return raw_df

import_df = AGS_raw('test.csv')

def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if "**PROJ" == True:
            cut_df = cut_df.concat([cut_df, df_new_row], ignore_index=True, sort=False)
        elif "**ABBR" == True:
            break
        print(raw_df)
        return cut_df
I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells at the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.
When I call AGS_snip(import_df), nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.
EDIT: the raw text of the test CSV looks like this:
**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56
The reason that "nothing happens" is likely because of the conditions you're using in if and elif.
Neither "**PROJ" == True nor "**ABBR" == True will ever be True, because neither "**PROJ" nor "**ABBR" is equal to True. Your code is equivalent to:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if False:
            cut_df = cut_df.concat([cut_df, df_new_row], ignore_index=True, sort=False)
        elif False:
            break
        print(raw_df)
        return cut_df
Which is the same as:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
You also always return from inside the loop, and df_new_row isn't used for anything, so it's equivalent to:
def AGS_snip(raw_df):
    first_row = next(raw_df.iterrows(), None)
    if first_row:
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
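If the intent was to compare each row's value against those markers, the condition needs to reference the cell itself. A minimal sketch, assuming the markers sit in the first column and that you want every row before the **ABBR marker:
def AGS_snip(raw_df):
    rows = []
    for _, row in raw_df.iterrows():
        if row.iloc[0] == "**ABBR":  # stop once the **ABBR marker is reached
            break
        rows.append(row)
    return pd.DataFrame(rows)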
Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.
import pandas as pd

df = pd.read_csv('ags.csv', header=None)

# Drop rows which consist of all NaN (Not a Number) / missing values,
# then reset the index so it runs from 0 to the end of the dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)

# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**')].append(pd.Index([len(df)]))

# Dictionary of { dataframe titles : dataframes }.
dfs = {}
for k in range(len(idx) - 1):
    table_name = df.iloc[idx[k], 0]
    dfs[table_name] = df.iloc[idx[k]+1:idx[k+1]].reset_index(drop=True)

# Print the titles and tables.
for k, v in dfs.items():
    print(k)
    print(v)
# **PROJ
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# **ABBR
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# You can rename the columns, for example like this (set_axis's inplace
# option was removed in pandas 2.0, so assign the result back):
dfs['**PROJ'] = dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns')
print(dfs['**PROJ'])
# data1 data2
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
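A slightly more idiomatic way to write the pairing loop above is to zip the marker indices with themselves shifted by one, so each iteration gets a (start, end) pair directly (a sketch using the same idx and df as above):
for start, end in zip(idx[:-1], idx[1:]):
    table_name = df.iloc[start, 0]
    dfs[table_name] = df.iloc[start+1:end].reset_index(drop=True)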
I have many different tables that all have different column names, and each refers to an outcome, like glucose, insulin, leptin, etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example (ignore that the function makes little sense). The code below works, but instead of copy-pasting the final_report column assignments over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result", and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})

ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()

final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
final_report[f"{name}_result"] = final_report.apply(
lambda x: find_result(x['id'], x['start'], x['end'], df),
axis=1
)
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
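Since each apply call is evaluated immediately inside the loop, the same idea also fits into a single DataFrame.assign call with a dict comprehension (a sketch, equivalent to the loop above):
final_report = final_report.assign(**{
    f"{name}_result": final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df), axis=1)
    for name, df in input_dfs.items()
})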
Question
From the example below, can I calculate cum_series_c based on cum_series_a and cum_series_b?
Example
import pandas as pd
# I don't have these two pd.Series (a and b) in my pocket.
# In other words, I cannot make use of these two pd.Series.
series_a = pd.Series([1,1.03,1.02,0.98,0.99])
series_b = pd.Series([1,0.98,0.95,1.05,1.07])
# I am given these two cumprod series, cum_series_a and cum_series_b.
# I know what these variables look like.
cum_series_a = series_a.cumprod()
cum_series_b = series_b.cumprod()
cum_series_a
>> 0 1.000000
1 1.030000
2 1.050600
3 1.029588
4 1.019292
cum_series_b
>> 0 1.000000
1 0.980000
2 0.931000
3 0.977550
4 1.045979
#######################################################################################
# What I want to calculate is the cum_series_c based on cum_series_a and cum_series_b #
#######################################################################################
series_c = pd.concat([series_a, series_b[1:]])
cum_series_c = series_c.cumprod()
### Attention, please!
# I don't need the first element of the series_b, because it is 1.
# So it would repeat the same number 1.019292 two times, if I didn't delete it.
cum_series_c
>>> 0 1.000000
1 1.030000
2 1.050600
3 1.029588
4 1.019292
1 0.998906
2 0.948961
3 0.996409
4 1.066158
To put my question in other words: is it possible to calculate cum_series_c without knowing series_a and series_b, knowing only cum_series_a and cum_series_b?
What would the code to do this look like?
Yes, you can: scale cum_series_b by the last element of cum_series_a, skipping cum_series_b's first element so the joining value isn't repeated. Series.append was removed in pandas 2.0, so use pd.concat:
cum_series_c = pd.concat([cum_series_a, cum_series_b[1:] * cum_series_a.iloc[-1]])
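In this toy example, where series_a and series_b happen to be known, you can verify the result (a sketch; np is numpy):
import numpy as np
expected = pd.concat([series_a, series_b[1:]]).cumprod()
assert np.allclose(cum_series_c.to_numpy(), expected.to_numpy())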
I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now in the first dataframe I create a column:
test1['values'] = np.zeros(10)
I want to fill this column so that next to each date there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course, my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for k in range(10):
    a = nearest(test2['time'], test1['time'][k])     ### find the nearest timestamp in the second dataframe
    b = test2.index[test2['time'] == a].tolist()[0]  ### identify the index of this timestamp
    test1['values'][k] = b                           ### assign this value to the cell
This code is very slow on large datasets; how can I make it more efficient?
P.S. The timestamps in my real data are sorted and increasing, just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days

if offset < 0:  # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time']) - offset)).astype(int)
else:           # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset + len(test1['time']))

test1['values'] = idx
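Since the timestamps are sorted, pd.merge_asof with direction='nearest' is another option that avoids the per-row loop entirely (a sketch, assuming the test1/test2 frames from the question; both must be sorted on the key):
test2_indexed = test2.reset_index()  # expose test2's index as a regular column
merged = pd.merge_asof(test1, test2_indexed, on='time', direction='nearest')
test1['values'] = merged['index'].to_numpy()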
I have a pandas DataFrame like the following, and this is the data:
0 1 2 3 4 5 6
0 Label Total/Target Jaccard Dice VolumeSimilarity FalseNegative FalsePositive
1 image-9003406 0.753958942196244 0.628584809743865 0.771939914928625 -0.0476974851707525 0.246041057803756 0.209200511636753
2 image-9007827 0.783266136200411 0.652181507072358 0.789479248231042 -0.015864625683349 0.216733863799589 0.204208282912204
3 image-9040390 0.797836181211824 0.611217035556112 0.758702300270988 0.0981000407411853 0.202163818788176 0.276772045623749
4 image-9047800 0.833585767007274 0.627592483537663 0.771191179469637 0.149701662401568 0.166414232992726 0.282513296651508
5 image-9054866 0.828860635279561 0.652709649240693 0.789866083907199 0.0940919370823063 0.171139364720439 0.245624253720476
6 image-9056363 0.795614053800371 0.658368025419615 0.793995078689519 0.00406974990730408 0.204385946199629 0.207617320977731
7 image-9068453 0.763313209747495 0.565848914378489 0.722737563225356 0.106314540359027 0.236686790252505 0.313742036740474
8 image-9085290 0.633747182342442 0.498166624744976 0.665035005475144 -0.0987391313269621 0.366252817657558 0.300427399066708
9 image-9087863 0.663537911271341 0.539359224086608 0.700758102003958 -0.112187081100769 0.336462088728659 0.257597937816249
10 image-9094865 0.667530629804239 0.556419610760253 0.714999485888594 -0.142222256073179 0.332469370195761 0.230263697338428
However, I need to convert the data starting from column #1 and row #1 to numbers; when it is saved to an Excel file, it is saved as strings.
How can I do that?
Your help is appreciated.
Use:
# set the column labels from the first row
df.columns = df.iloc[0]
# set the index from the first column
df.index = df.iloc[:, 0]
# remove the first row and first column, and cast the rest to floats
df = df.iloc[1:, 1:].astype(float)
print(df)
0 Total/Target Jaccard Dice VolumeSimilarity \
Label
image-9003406 0.753959 0.628585 0.771940 -0.047697
image-9007827 0.783266 0.652182 0.789479 -0.015865
image-9040390 0.797836 0.611217 0.758702 0.098100
image-9047800 0.833586 0.627592 0.771191 0.149702
image-9054866 0.828861 0.652710 0.789866 0.094092
image-9056363 0.795614 0.658368 0.793995 0.004070
image-9068453 0.763313 0.565849 0.722738 0.106315
image-9085290 0.633747 0.498167 0.665035 -0.098739
image-9087863 0.663538 0.539359 0.700758 -0.112187
image-9094865 0.667531 0.556420 0.714999 -0.142222
0 FalseNegative FalsePositive
Label
image-9003406 0.246041 0.209201
image-9007827 0.216734 0.204208
image-9040390 0.202164 0.276772
image-9047800 0.166414 0.282513
image-9054866 0.171139 0.245624
image-9056363 0.204386 0.207617
image-9068453 0.236687 0.313742
image-9085290 0.366253 0.300427
image-9087863 0.336462 0.257598
image-9094865 0.332469 0.230264
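Once the columns are numeric, writing the frame back out stores real numbers instead of text (a sketch; the filename is just an example, and .xlsx output needs an engine such as openpyxl installed):
df.to_excel('metrics.xlsx')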