Creating n empty dataframes using for loop - python

I want to create n empty dataframes using for loop.
Something like:
import pandas as pd
n = 6
for i in range(0, n):
    df(i) = pd.DataFrame()
Output like:
df1, df2, df3, df4, ..., dfn

You could store them in a list of dataframes:
dfs = []
n = 6
for i in range(n):
    dfs.append(pd.DataFrame())
An alternative would be to use a dictionary with meaningful names (these could, of course, also just be the numbers 1 to 6):
names = ['df1', 'df2', 'df3']
dfs = {}
for name in names:
    dfs[name] = pd.DataFrame()

Create dataframes and append them to a list:
df_list = list()
for i in range(6):
    d_one = pd.DataFrame()
    df_list.append(d_one)
Access an individual dataframe by indexing normally:
df_list[0]

The short answer for the example you gave is, you don't.
This is what collections (lists, dicts) are for.
With a comprehension it's a fairly trivial task.
# as a list
list_of_df = [pd.DataFrame() for _ in range(n)]
print(list_of_df[0])
If you still want to refer to them by their name it might make more sense in the form of a dict.
dict_of_df = {f'df{i}': pd.DataFrame() for i in range(1, n + 1)}
print(dict_of_df['df1'])
Although it is possible to modify the dict of globals(), if you use an IDE, your linter will hate you, and you'll be fighting it at every corner.
# don't do this
for i in range(1, n + 1):
    globals()[f'df{i}'] = pd.DataFrame()
print(df1)
It's a hackier way of creating your own dict, and if you do it you'll have to hard-code the variable names anyway.


Is there a more elegant way to initialize empty dataframes

isd = pd.DataFrame()
ind = pd.DataFrame()
exd = pd.DataFrame()
psd = pd.DataFrame()
visd = pd.DataFrame()
vind = pd.DataFrame()
vexd = pd.DataFrame()
sd = pd.DataFrame()
ise = pd.DataFrame()
idb = pd.DataFrame()
mdd = pd.DataFrame()
add = pd.DataFrame()
Is there any alternate way to make it elegant and faster?
Use a dictionary of dataframes, especially if the code for some of the data frames is going to share some similarities. This allows performing operations on them with loops or functions:
dct = {n: pd.DataFrame() for n in ['isd', 'ind', 'exd']}
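For example, once the frames are in a dictionary, one loop can apply the same operation to every frame (the 'source' column here is purely for illustration):

```python
import pandas as pd

# build the dictionary of empty frames as above
dct = {n: pd.DataFrame() for n in ['isd', 'ind', 'exd']}

# apply the same operation to every frame in a single loop,
# e.g. giving each an (empty) placeholder column
for name, frame in dct.items():
    frame['source'] = pd.Series(dtype='object')

print(dct['isd'].columns.tolist())  # ['source']
```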
If you want to avoid needing to numerically index each of the DataFrames, but would rather be able to access them directly by their name:
import pandas as pd
table_names = ['df1', 'df2', 'df3']
for name in table_names:
    exec('%s = pd.DataFrame()' % name, locals(), locals())
print(df1)
This approach uses exec, which essentially runs a string as if it were Python code; each of the predetermined names is formatted into the string in a for loop. Note that this only works reliably at module scope (inside a function, assignments made through exec don't become local variables), and it carries the same maintenance problems as modifying globals().
You can do something like this (note that this produces a plain list; the names in dfs are not used):
dfs = ['isd', 'ind', 'exd']
df_list = [pd.DataFrame() for _ in dfs]
I think you can go this way:
import pandas as pd
a, b, c, d = (pd.DataFrame() for _ in range(4))
# note: [pd.DataFrame()]*4 would bind all four names to the same object
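As a quick check of why list multiplication is a trap here: `[pd.DataFrame()]*4` copies the reference four times, while a comprehension creates four independent objects.

```python
import pandas as pd

# list multiplication copies the reference, not the object
shared = [pd.DataFrame()] * 4
print(shared[0] is shared[1])      # True - the same DataFrame

# a comprehension creates independent objects
separate = [pd.DataFrame() for _ in range(4)]
print(separate[0] is separate[1])  # False
```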

Create URLs for different data frames

I have a data frame that I split into different data frames of size 100 (so that Python is able to process it).
Therefore, I get different data frames (df1 to df..). For all of those data frames, I want to create a URL as shown below.
When I use type(df), it shows me it is a data frame; however, when I use for j in dfs: print(type(j)), it shows it is a string. I need the data frame to be able to create the URL.
Can you please help me with what the loop for creating the URLs for all data frames could look like?
Thank you so much for your help!
df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
n = 100 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
dfs = {}
for idx, df in enumerate(list_df, 1):
    dfs[f'df{idx}'] = df
type(df1)
for j in dfs:
    print(type(j))
def create_url():
    url = "https://api.twitter.com/2/tweets?{}&{}".format("ids=" + (str(str((df1['id'].tolist()))[1:-1])).replace(" ", ""), tweet_fields)
    return url
dfs is a dictionary, so for j in dfs: gives you only the keys, which are strings.
You need .values()
for j in dfs.values():
or .items()
for key, j in dfs.items():
or you have to use dfs[j]
for j in dfs:
    print(type(dfs[j]))
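Put together, a minimal self-contained illustration of the difference (using a small dict of frames):

```python
import pandas as pd

dfs = {'df1': pd.DataFrame(), 'df2': pd.DataFrame()}

# iterating the dict directly yields the string keys
kinds_keys = [type(j).__name__ for j in dfs]

# .values() yields the DataFrames themselves
kinds_vals = [type(j).__name__ for j in dfs.values()]

print(kinds_keys)  # ['str', 'str']
print(kinds_vals)  # ['DataFrame', 'DataFrame']
```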
EDIT:
Frankly, you could do it all in one loop without list_df
import pandas as pd
#df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
df = pd.DataFrame({'id': range(1000)})
tweet_fields = 'something'
n = 100 #chunk row size
for i in range(0, df.shape[0], n):
    ids = df[i:i+n]['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
You can also use groupby on the index if the index uses consecutive numbers 0, 1, ...
for i, group in df.groupby(df.index // 100):
    ids = group['id'].tolist()
    ids_str = ','.join(str(x) for x in ids)
    url = "https://api.twitter.com/2/tweets?ids={}&{}".format(ids_str, tweet_fields)
    print(url)
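Another chunking option, if you prefer not to slice by hand, is numpy.array_split. Note it splits into a given *number* of roughly equal parts, so you compute that count from the chunk size first (the sizes are approximately, not exactly, chunk_size):

```python
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(250)})
chunk_size = 100

# number of chunks needed to cover the frame
n_chunks = math.ceil(len(df) / chunk_size)
chunks = np.array_split(df, n_chunks)

# every row ends up in exactly one chunk
print([len(c) for c in chunks])
```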

How can I create a Pandas frame from data stored in multiple nested dictionaries?

I have a Python program in which I sweep multiple parameters, and at each point I calculate a few results. I then want to export the results in the form of a CSV (or Excel) report that, on each row, contains the parameters and results. For example, here I sweep two parameters i and j and calculate res1 and res2 as functions of i and j. (This is a completely silly MWE, though!)
res1 = dict()
res2 = dict()
for i in range(5):
    res1[i] = dict()
    res2[i] = dict()
    for j in range(5):
        res1[i][j] = i+j
        res2[i][j] = i*j
And I would like to create a CSV with 25 rows and 4 columns, where the first two columns are the (i, j) combinations for which res1 and res2 are calculated, and the second two columns are res1 and res2, respectively. A naive way of exporting such a CSV is as follows:
#### Naive CSV writing
print(', '.join(['i', 'j', 'res1', 'res2']))
for i in range(5):
    for j in range(5):
        print(', '.join([str(i), str(j), str(res1[i][j]), str(res2[i][j])]))
I was wondering if there is a way to create a pandas frame from the dictionaries so that then I can export the reports more easily?
I know that pandas.DataFrame constructor accepts a dictionary that maps column headers to column values. So, for example the following is a possible solution:
import pandas as pd
import sys
# generate results as before
d = dict([('i', list()),
          ('j', list()),
          ('res1', list()),
          ('res2', list())])
for i in range(5):
    for j in range(5):
        d['i'].append(i)
        d['j'].append(j)
        d['res1'].append(res1[i][j])
        d['res2'].append(res2[i][j])
df = pd.DataFrame(data=d)
df.to_csv(sys.stdout, index=False)
Yet, the above does not look so elegant (and I think is not efficient either). Is there a better way to do so?
You could create a normal list of rows:
data = []
for i in range(5):
    for j in range(5):
        data.append([i, j, res1[i][j], res2[i][j]])
And then convert to DataFrame
import pandas as pd
df = pd.DataFrame(data, columns=['i', 'j', 'res1', 'res2'])
print(df)
Or write it directly using the csv module:
import csv
fh = open("output.csv", 'w', newline='')
csvwriter = csv.writer(fh)
csvwriter.writerow(['i', 'j', 'res1', 'res2'])
for i in range(5):
    for j in range(5):
        csvwriter.writerow([i, j, res1[i][j], res2[i][j]])
fh.close()
How about this:
import numpy as np
import pandas as pd
from itertools import product
p = np.array(list(product(range(5), range(5))))
df = pd.DataFrame(data={'i': p[:,0], 'j': p[:,1]})
def res(row):
    # res1 and res2 are the nested dicts from the question, so index into them
    row['res1'] = res1[row['i']][row['j']]
    row['res2'] = res2[row['i']][row['j']]
    return row
df = df.apply(res, axis=1)
Now you can write the dataframe directly to a CSV.

storing the results of a loop quickly Python

I have a function that I call on every row of a pandas DataFrame and I would like to store the result of each function call (each iteration). Below is an example of what I am trying to do.
data =[{'a':1,'b':2,'c':3},{'a':1,'b':2,'c':3}, {'a':1,'b':2,'c':3}]
InputData = pd.DataFrame(data)
ResultData = pd.DataFrame(columns=['a', 'b', 'c'])
def SomeFunction(row):
    # Function code goes here (not important to this question)
    return Temp
for index, row in InputData.iterrows():
    # Temp will equal the result of the function (a DataFrame with 3 columns and 1 row)
    Temp = SomeFunction(row)
    # If ResultData is not empty, append Temp to ResultData
    if len(ResultData) != 0:
        ResultData = ResultData.append(Temp, ignore_index=True)
    # If ResultData is empty, ResultData = Temp
    else:
        ResultData = Temp
I hope my example is easy to follow.
In my real example I have about a million rows in the input data, and this process is very slow; I think it is the appending of the DataFrame that makes it so slow. Is there maybe a different data structure I could use which could store the three values of the "Temp" DataFrame, to be combined at the end to form the "ResultData" DataFrame?
Any help would be much appreciated
Best to avoid explicit loops in pandas. Using apply is still a little slow, but probably faster than a loop.
df["newcol"] = df.apply(function, axis=1)
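Since the function in the question returns three values per row, returning a Series from the function makes apply expand it into three result columns at once. The function body below is just a hypothetical stand-in for the real computation:

```python
import pandas as pd

input_data = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}] * 3)

# stand-in for the real per-row function: return three values as a Series
def some_function(row):
    return pd.Series({'a': row['a'] * 2, 'b': row['b'] * 2, 'c': row['c'] * 2})

# apply expands the returned Series into columns of the result
result_data = input_data.apply(some_function, axis=1)
print(result_data)
```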
Maybe a list of lists will solve your problem:
Result_list = []
for ... :
    ...
    Result_list.append([data1, data2, data3])
To review the data:
for Current_data in Result_list:
    data1 = Current_data[0]
    data2 = Current_data[1]
    data3 = Current_data[2]
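Along the same lines, the collected rows can be turned into the final DataFrame in a single call at the end, which avoids the repeated append cost entirely (the per-row computation here is a placeholder):

```python
import pandas as pd

input_data = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}] * 3)

rows = []
for _, row in input_data.iterrows():
    # placeholder for the real per-row computation
    rows.append([row['a'], row['b'], row['c']])

# one DataFrame construction at the end instead of many appends
result_data = pd.DataFrame(rows, columns=['a', 'b', 'c'])
print(result_data.shape)  # (3, 3)
```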
Hope it helps!

Pandas read_csv multiple files

What's the best way to loop through a bunch of files and create separate data frames for each file? I've looked through other questions, but it seems the point in each of those is to concatenate files into one data frame.
For example, if I have mylist = ['a.csv','b.csv','c.csv'], and I want each of my data frames to take the name of the file (a, b, c), I can't do this because the left side of the assignment statement is treated as a string. How do I correct this so that it is interpreted as a dataframe assignment?
mylist = ['a.csv','b.csv','c.csv']
import pandas as pd
for file in mylist:
    file.rsplit('.csv',1)[0] = pd.read_csv(file)
Use a dictionary comprehension:
dfs = {f.rsplit('.csv', 1)[0]: pd.read_csv(f)
       for f in mylist}
It is generally considered bad practice to name a variable using a formula. A better solution would be to use a dictionary:
mylist = ['a.csv','b.csv','c.csv']
mydict = {}
import pandas as pd
for file in mylist:
    mydict[file.rsplit('.csv',1)[0]] = pd.read_csv(file)
Once you do this, you can access each dataframe by saying:
mydict['a']
mydict['b']
etc...
I think you can create a dictionary of DataFrames:
import pandas as pd
mylist = ['a.csv','b.csv','c.csv']
dfs = {}
for f in mylist:
    dfs[f.rsplit('.csv', 1)[0]] = pd.read_csv(f)
print(dfs['a'])
