How to append dataframes inside a for loop in Python

I have been trying to append DataFrames inside a for loop. The loop itself works fine, but the data frames are not being appended. Any help would be much appreciated.
symbols = ['MSFT', 'GOOGL', 'AAPL']
apikey = 'CR*****YDA'
for s in symbols:
    print(s)
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=%s&apikey=%s" % (s, apikey)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    a = pd.DataFrame(js['Time Series (Daily)']).T
    b = pd.DataFrame()
    print(b)
    b = b.append(a, ignore_index=True)
    print(b)
    print("loop successful")
print("run successfull")
Outputs:
MSFT
Empty DataFrame
Columns: []
Index: []
1. open 2. high 3. low 4. close 5. volume
0 107.4600 107.9000 105.9100 107.7100 37427587
1 105.0000 106.6250 104.7600 106.1200 28393015
.. ... ... ... ... ...
99 109.2700 109.6400 108.5100 109.6000 19662331
[100 rows x 5 columns]
loop successful
GOOGL
Empty DataFrame
Columns: []
Index: []
1. open 2. high 3. low 4. close 5. volume
0 1108.5900 1118.0000 1099.2800 1107.3000 2244569
1 1087.9900 1100.7000 1083.2600 1099.1200 1244801
.. ... ... ... ... ...
99 1244.1400 1257.8700 1240.6800 1256.2700 1428992
[100 rows x 5 columns]
loop successful
AAPL
Empty DataFrame
Columns: []
Index: []
1. open 2. high 3. low 4. close 5. volume
0 157.5000 157.8800 155.9806 156.8200 33751023
1 154.2000 157.6600 153.2600 155.8600 29821160
.. ... ... ... ... ...
99 217.1500 218.7400 216.3300 217.9400 20525117
[100 rows x 5 columns]
loop successful
run successfull

The immediate problem is you define b as an empty dataframe within each iteration of your for loop. Instead, define it once before your for loop begins:
b = pd.DataFrame()
for s in symbols:
    # some code
    a = pd.DataFrame(js['Time Series (Daily)']).T
    b = b.append(a, ignore_index=True)
But appending dataframes in a loop is not recommended. Each call copies all the data accumulated so far, which is inefficient, and DataFrame.append was removed entirely in pandas 2.0. The docs recommend using pd.concat on an iterable of dataframes:
list_of_dfs = []
for s in symbols:
    # some code
    list_of_dfs.append(pd.DataFrame(js['Time Series (Daily)']).T)
b = pd.concat(list_of_dfs, ignore_index=True)
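The list-then-concat pattern can be tried without network access. The sketch below is only an illustration: `fetch_daily` is a hypothetical stand-in for the Alpha Vantage request, the prices are made up, and the extra `symbol` column is an addition so that rows can still be told apart after concatenation.

```python
import pandas as pd

def fetch_daily(symbol):
    """Hypothetical stand-in for the API call; returns made-up rows."""
    fake = {
        "2019-01-02": {"1. open": "100.0", "4. close": "101.0"},
        "2019-01-03": {"1. open": "101.0", "4. close": "102.5"},
    }
    df = pd.DataFrame(fake).T   # dates become the index, as in the question
    df["symbol"] = symbol       # remember which ticker each row came from
    return df

symbols = ["MSFT", "GOOGL", "AAPL"]

# Collect one frame per symbol, then concatenate once after the loop.
frames = [fetch_daily(s) for s in symbols]
b = pd.concat(frames, ignore_index=True)

print(b.shape)
```

Because the frames are only combined once, no intermediate copies of the growing result are made.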

The problem is that you kept erasing the value of b with an empty DataFrame. So you have to define b as a DataFrame before the for loop.
import json
import urllib.request
import pandas as pd

symbols = ['MSFT', 'GOOGL', 'AAPL']
apikey = 'CR*****YDA'
b = pd.DataFrame()  # created once, before the loop
for s in symbols:
    print(s)
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=%s&apikey=%s" % (s, apikey)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    a = pd.DataFrame(js['Time Series (Daily)']).T
    print(b)
    b = b.append(a, ignore_index=True)
    print(b)
    print("loop successful")
print("run successfull")

Moving the following code
b = pd.DataFrame()
to outside of the loop would fix your problem. Right now, 'b' is re-initialized as an empty dataframe on every iteration.

Related

pandas Optimizing many loops into one

I have multiple dfs with the same columns. Here is the list of all of them:
dfs = [df_14, df_15, df_16, df_17]
Every dataframe looks like this. For example, df_14:
id   Days
001  0
004  56
013  95
015  33
Next, df_15:
Id   Days
001  0
023  18
459  19
811  35
df_16:
Id   Days
111  93
114  56
232  0
df_17:
Id   Days
532  120
113  31
065  58
015  2
My code:
rows = [['532', 120],['113', 31], ['065', 58],['025', 2]]
for row in rows:
    df_14.loc[len(df_14)] = row
# and so on
The task is to build lists for each month: one list with the ids of clients that have 30-60 days, and a separate list with the ids of clients that have 60-100 days.
#The result should be like this:
14_1: ['004', '015']
14_2: ['013']
15_1: ['811']
I tried to use f-strings for this. Something like:
abrreviations = ['14', '15','16', '17']
c = ['_1', '_2']
# I wrote initializing loops like this
m_list=[]
for a in abrreviations:
    for cp in c:
        m_list.append(a+cp)
The idea is to use these abbreviations in the loops with an f-string or format, but I don't know how to do it. Or can you suggest another approach?
This may help:
import pandas as pd
data = {'df_jan' : [['001', 0],['004', 56], ['013', 95],['015', 33]],
'df_feb' : [['001', 0],['023', 18], ['459', 19],['811', 35]],
'df_mar' : [['111', 93],['114', 56], ['232', 0]],
'df_apr' : [['532', 120],['113', 31], ['065', 58],['025', 2]]}
dfs = {}
for df in data:
    dfs[df] = pd.DataFrame(data[df], columns=['id', 'days'])
months = {}
for df in dfs:
    months[df.replace('df_', '') + '_30'] = dfs[df][(dfs[df].days >= 30) & (dfs[df].days <= 60)].id.to_list()
    months[df.replace('df_', '') + '_90'] = dfs[df][(dfs[df].days >= 90) & (dfs[df].days <= 120)].id.to_list()
months
{'jan_30': ['004', '015'],
'jan_90': ['013'],
'feb_30': ['811'],
'feb_90': [],
'mar_30': ['114'],
'mar_90': ['111'],
'apr_30': ['113', '065'],
'apr_90': ['532']}
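A variation on the same idea, as a sketch only: the keys can be built with f-strings (which is what the question asked about) and the range test can be written with `Series.between`. The two frames reuse the question's df_14 and df_15 values; the bucket table and its `61-100` lower bound are assumptions, chosen so the two ranges do not overlap at 60.

```python
import pandas as pd

# Values copied from the question's df_14 and df_15.
dfs = {
    "14": pd.DataFrame({"id": ["001", "004", "013", "015"], "days": [0, 56, 95, 33]}),
    "15": pd.DataFrame({"id": ["001", "023", "459", "811"], "days": [0, 18, 19, 35]}),
}

# (suffix, low, high) for each bucket; bounds are inclusive (assumed ranges).
buckets = [("_1", 30, 60), ("_2", 61, 100)]

result = {}
for year, df in dfs.items():
    for suffix, low, high in buckets:
        mask = df["days"].between(low, high)          # inclusive on both ends
        result[f"{year}{suffix}"] = df.loc[mask, "id"].tolist()

print(result)
```

Keeping the results in one dictionary avoids creating a separate variable per month and bucket.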
In response to your comment:
I created the df inside the dictionary to simplify the creation of test data.
Your code can create the df in its own way ...
df_jan = ...
df_feb = ...
df_mar = ...
df_apr = ...
and to process them you create the dictionary ...
dfs = {
'df_jan' : df_jan,
'df_feb' : df_feb,
'df_mar' : df_mar,
'df_apr' : df_apr
}
Then run the loop, assign the results to your variables, and delete the dictionaries:
jan_30 = months['jan_30']
jan_90 = months['jan_90']
feb_30 = months['feb_30']
feb_90 = months['feb_90']
mar_30 = months['mar_30']
mar_90 = months['mar_90']
apr_30 = months['apr_30']
apr_90 = months['apr_90']
del dfs, months
# first create a list containing all the dataframes
all_df=[df_jan, df_feb, df_mar, df_apr, df_may, df_jun, df_jul, df_aug, df_sep, df_oct, df_nov, df_dec]
# create 2 lists for storing the id values of the 30-60 range and the 90-120 range
list_30,list_90=[],[]
# one nested for loop handles all the dataframes
for cur_df in all_df:
    for id,days in zip(cur_df['Id'],cur_df['Days']):
        if(30<=days<=60):
            list_30.append(id)
        elif(90<=days<=120):
            list_90.append(id)
# now list_30 and list_90 contain the corresponding id values in each range
Hope the answer helps :)
Since you didn't provide data I made a basic example and it worked for me so here is a single for-loop as you described:
import numpy as np
import pandas as pd
dfs = [df_jan, df_feb, df_mar, df_apr, df_may, df_jun, df_jul, df_aug, df_sep, df_oct, df_nov, df_dec]
df30 = []
df90 = []
dfsChained30 = []
dfsChained90 = []
for rowsForMonths, xForMonths in enumerate(dfs):
    # If January [don't consider chain];
    if rowsForMonths == 0:
        for dayN in range(dfs[rowsForMonths]):
            if dfs[rowsForMonths][dayN] in range(30, 61):
                df30.append(dfs[rowsForMonths][dayN])
            elif dfs[rowsForMonths][dayN] in range(90, 121):
                df90.append(dfs[rowsForMonths][dayN])
            else:
                pass
        dfsChained30.append(df30)
        dfsChained90.append(df90)
    # If not January [consider chain];
    else:
        for dayN in range(dfs[rowsForMonths]):
            if dfs[rowsForMonths][dayN] in range(30, 61) and dfs[rowsForMonths][dayN] not in set(dfsChained30):
                df30.append(dfs[rowsForMonths][dayN])
            elif dfs[rowsForMonths][dayN] in range(90, 121) and dfs[rowsForMonths][dayN] not in set(dfsChained90):
                df90.append(dfs[rowsForMonths][dayN])
            else:
                pass
        dfsChained30.append(df30)
        dfsChained90.append(df90)

Python: How to append data created in nested loop after every incremental iteration

Example: if each iteration returns 10 observations and there are 4 iterations, final table should have 40 observations.
This is the query I wrote:
Df_All = pd.DataFrame()
for i in Vars:
    for j in Type:
        i_j = df[df['Product'] == j].groupby([i,'Var1', 'Var2']).count()[['Var3']].reset_index()
        i_j['Var'] = i
        Df_All = Df_All.append(i_j)
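One possible way to end up with every iteration's rows in a single table is to collect each grouped result in a list and concatenate once after both loops. This is a sketch under assumptions: `Vars` is taken to hold column names and `Type` to hold product values, the sample data is made up, and the extra `Product` column is an addition so rows from different iterations stay distinguishable.

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "X": ["x1", "x2", "x1", "x1"],
    "Var1": [1, 1, 2, 2],
    "Var2": ["u", "u", "v", "v"],
    "Var3": [10, 20, 30, 40],
})
Vars = ["X"]        # assumed: names of the columns to group by
Type = ["A", "B"]   # assumed: product values to filter on

pieces = []
for i in Vars:
    for j in Type:
        i_j = (df[df["Product"] == j]
               .groupby([i, "Var1", "Var2"])["Var3"]
               .count()
               .reset_index())
        i_j["Var"] = i
        i_j["Product"] = j      # added so the source iteration is visible
        pieces.append(i_j)

# One concatenation at the end keeps the rows from every iteration.
Df_All = pd.concat(pieces, ignore_index=True)
print(Df_All)
```

With more than one entry in `Vars`, each piece has a differently named grouping column, so renaming that column to a common name before concatenating would avoid NaN-filled columns.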

Feature extraction from the training data

I have training data like the sample below, with all the information in a single column. The data set has more than 300,000 rows.
id       features                                                          label
1        name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;     1
2        name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;    1
3        name=David;age=12;1:=High School;2:=Cricketer;native=america;     2
4        name=George;age=11;1:=High School;2:=Carpenter;married=yes        2
.
.
300000   name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No              3
Now I need to convert this training data into the form below:
id   name           age   1                2                 Interest      married   Smoker
1    John Matthew   25    Post Graduate    Football Player   Nan           Nan       Nan
2    Mark clark     21    Under Graduate   Nan               Video Games   Nan       Nan
.
.
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete:
#Getting the proper features from the features column
cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for i in range(1,len(array)):
        var_split = array[i].split(";")
        try :
            d = (dict(s.split('=') for s in var_split))
            for x in d.keys():
                collection_list.append(x)
        except ValueError:
            Error = ValueError
    count = Counter(collection_list)
    for k , v in count.most_common(5):
        key = k.replace(":","").replace(" ","_").lower()
        cols[key] = v
columns_add = list(cols.keys())
train = train.reindex(columns = np.append( train.columns.values, columns_add))
print (train.columns)
print (train.shape)
#Adding the values for the newly created problem
for row in train.itertuples():
    dummy_dic = {}
    new_dict={}
    value = train.loc[row.Index, 'features']
    v_split = value.split(";")
    try :
        dummy_dict = (dict(s.split('=') for s in v_split))
        for k, v in dummy_dict.items():
            new_key = k.replace(":","").replace(" ","_").lower()
            new_dict[new_key] = v
    except ValueError:
        Error = ValueError
    for k,v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
Is there a function I can apply here to extract the features more efficiently?
Create two DataFrames meeting your criteria. In the first one, every data point has the same features; the second one is a modification of the first that introduces different features for some data points:
import pandas as pd
import numpy as np
import random
import time
import itertools
# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000
NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]
df = pd.DataFrame({"features": ["name={0};age={1};feature1={2};feature2={3}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        FEATURES1[np.random.randint(0, len(FEATURES1))],
        FEATURES2[np.random.randint(0, len(FEATURES2))]) for i in range(num)]})
df['label'] = [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]
print(df.head(20))
# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.
mod_df = df.copy()  # copy, so the edits below do not also change df
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)
INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']
mod_df.loc[random_positions1, 'features'] = ["name={0};age={1};interest={2}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        INTERESTS[np.random.randint(0, len(INTERESTS))]) for i in range(len(random_positions1))]
mod_df.loc[random_positions2, 'features'] = ["name={0};age={1};smoking={2}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        SMOKING[np.random.randint(0, len(SMOKING))]) for i in range(len(random_positions2))]
print(mod_df.head(20))
Assume that your original data is stored in a DataFrame called df.
Solution 1 (all the features are the same for every data point).
def func2(y):
    lista = y.split('=')
    value = lista[1]
    return value
def function(x):
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array
# Calculate the execution time
start = time.time()
array = pd.Series(df.features.apply(function)).tolist()
new_df = pd.DataFrame.from_records(array, columns=['name', 'age', '1', '2'])
end = time.time()
new_df
print('Total time:', end - start)
Total time: 1.80923295021
Edit: The one thing you need to do is adjust the columns list accordingly.
Solution 2 (The features might be the same or different for every data point).
import pandas as pd
import numpy as np
import time
import itertools
# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]
def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista]
    return keys
df = mod_df
columns = pd.Series(df.features.apply(def_columns)).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
flattened_columns
# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        key, val = i.split('=')
        dict_[key ] = val
    return dict_
df.features.apply(function)
arr = pd.Series(df.features.apply(function)).tolist()
pd.DataFrame.from_dict(arr)
Suppose your data is like this :
features= ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
'name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;',
"name=David;age=12;1:=High School;2:=Cricketer;",
"name=George;age=11;1:=High School;2:=Carpenter;",
'name=Kevin;age=16;1:=High School;2:=Driver; ']
df = pd.DataFrame({'features': features})
I will start by replacing all the separators (name=, age=, 1:=, 2:=) with ; using this function:
def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
After applying that function to your df, every value will be a list of features, and you can get each one by index.
Then I use four custom functions to get each attribute: name, age, grade, and job.
Note: there may be a better way to do this using only one function.
def get_name(df):
    return df['features'][1]
def get_age(df):
    return df['features'][2]
def get_grade(df):
    return df['features'][3]
def get_job(df):
    return df['features'][4]
Finally, apply those functions to your dataframe:
df = df.assign(name = df.apply(get_name, axis=1),
               age = df.apply(get_age, axis=1),
               grade = df.apply(get_grade, axis=1),
               job = df.apply(get_job, axis=1))
Hope this will be fast enough.
As far as I understand your code, the poor performance comes from the fact that you create the dataframe element by element. It's better to create the whole dataframe at once with a list of dictionaries.
Let's recreate your input dataframe :
from io import StringIO
import pandas as pd
data=StringIO("""id   features   label
1   name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;   1
2   name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;   1
3   name=David;age=12;1:=High School;2:=Cricketer;   2
4   name=George;age=11;1:=High School;2:=Carpenter;   2""")
df=pd.read_table(data,sep=r'\s{3,}',engine='python')
we can check :
print(df)
id features label
0 1 name=John Matthew;age=25;1.=Post Graduate;2.=F... 1
1 2 name=Mark clark;age=21;1.=Under Graduate;2.=Fo... 1
2 3 name=David;age=12;1:=High School;2:=Cricketer; 2
3 4 name=George;age=11;1:=High School;2:=Carpenter; 2
Now we can create the needed list of dictionaries with the following code:
feat=[]
for line in df['features']:
    line=line.replace(':','.')
    lsp=line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting dataframe:
print(pd.DataFrame(feat))
1. 2. age name
0 Post Graduate Football Player 25 John Matthew
1 Under Graduate Football Player 21 Mark clark
2 High School Cricketer 12 David
3 High School Carpenter 11 George
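The same build-it-all-at-once idea also covers rows whose keys differ. The sketch below is an illustration, not the answerer's code: `parse` is a hypothetical helper, the sample strings are the four rows shown in the question, and normalising `1.` / `1:` to `1` and lower-casing the keys are assumptions about the desired column names.

```python
import pandas as pd

features = [
    "name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;",
    "name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;",
    "name=David;age=12;1:=High School;2:=Cricketer;native=america;",
    "name=George;age=11;1:=High School;2:=Carpenter;married=yes",
]

def parse(line):
    """Turn 'k1=v1;k2=v2;...' into a dict with normalised keys."""
    record = {}
    for part in line.split(";"):
        if "=" not in part:          # skips empty pieces left by a trailing ';'
            continue
        key, value = part.split("=", 1)
        record[key.strip(":. ").lower()] = value   # '1.' and '1:' both become '1'
    return record

# One DataFrame construction from a list of dicts; missing keys become NaN.
wide = pd.DataFrame([parse(line) for line in features])
print(wide)
```

Each row is parsed once and the frame is built in a single call, so no cell-by-cell assignment is needed.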

Looping through a python pivot table

I have a pivot table that I have created (pivotTable) using:
pivotTable= dayData.pivot_table(index=['sector'], aggfunc='count')
which has produced the following pivot table:
                sector  id
broad_sector
Communications       2   2
Utilities            3   3
Media                3   3
Could someone let me know if there is a way to loop through the pivot table, assigning the index value and sector total to the respective variables sectorName and sectorCount?
I have tried:
i=0
while i <= lenPivotTable:
    sectorName = sectorPivot.index.get_level_values(0)
    sectorNumber = sectorPivot.index.get_level_values(1)
    i=i+1
to return for the first loop iteration:
sectorName = 'Communications'
sectorCount = 2
for the second loop iteration:
sectorName = 'Utilities'
sectorCount = 3
for the third loop iteration:
sectorName = 'Media'
sectorCount = 3
But can't get it to work.
This snippet will get you the values as asked.
for sector_name, sector_count, _ in pivotTable.to_records():
    print(sector_name, sector_count)
Well, I don't understand why you need this (looping through a DataFrame is very slow), but you can do it this way:
In [403]: for idx, row in pivotTable.iterrows():
   .....:     sectorName = idx
   .....:     sectorCount = row['sector']
   .....:     print(sectorName, sectorCount)
   .....:
Communications 2
Utilities 3
Media 3

Creating dataframe columns within a for loop

I'm having a hard time figuring out how to create a data frame within a for loop.
df = pd.DataFrame()
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        df['trader'] = lp
        df['bid'] = snapshot[sym][lp][":b"]["LUC"]["price"] if ":b" in snapshot[sym][lp] else "0"
        df['ask'] = snapshot[sym][lp][":a"]["LUC"]["price"] if ":a" in snapshot[sym][lp] else "0"
print(df)
print(df['trader'])
Printing df results in Columns: [trader, bid, ask] Index: []
Printing df['trader'] results in Series([], Name: bid, dtype: object)
If I change the df[column headings] to assignments, everything prints fine.
I'm trying to create a df that look like this:
  trader   bid   ask
0    MM2  1.25  1.26
1    MM5  1.23  1.27
2    MM3  1.25  1.28
....
Thanks for all the help
It's hard to tell from your question what's going on and what data you have. However, your code overwrites the columns at each step of the for loop. You could use loc with a row index to avoid that:
df = pd.DataFrame()
idx = 0  # running row label, so each (sym, lp) pair gets its own row
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        df.loc[idx, 'trader'] = lp
        df.loc[idx, 'bid'] = snapshot[sym][lp][":b"]["LUC"]["price"] if ":b" in snapshot[sym][lp] else "0"
        df.loc[idx, 'ask'] = snapshot[sym][lp][":a"]["LUC"]["price"] if ":a" in snapshot[sym][lp] else "0"
        idx += 1
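An alternative that avoids growing the frame row by row is to collect one dict per row and build the DataFrame once. This is a sketch under assumptions: the `snapshot` below is hypothetical and only mimics the nesting implied by the question's code, and the fallback is the number 0 rather than the string "0" so the price columns stay numeric.

```python
import pandas as pd

# Hypothetical snapshot with the nesting the question's code implies.
snapshot = {
    "EURUSD": {
        "MM2": {":b": {"LUC": {"price": 1.25}}, ":a": {"LUC": {"price": 1.26}}},
        "MM5": {":b": {"LUC": {"price": 1.23}}},   # no ask side
    },
}

rows = []
for sym in sorted(snapshot):
    for lp in sorted(snapshot[sym]):
        quote = snapshot[sym][lp]
        rows.append({
            "trader": lp,
            # numeric 0 as the fallback (the question used the string "0")
            "bid": quote[":b"]["LUC"]["price"] if ":b" in quote else 0,
            "ask": quote[":a"]["LUC"]["price"] if ":a" in quote else 0,
        })

# Build the frame once from the collected rows.
df = pd.DataFrame(rows, columns=["trader", "bid", "ask"])
print(df)
```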
