I would like to yield multiple empty dataframes from a function in Python.
import pandas as pd
df_list = []
def create_multiple_df(num):
    for i in range(num):
        df = pd.DataFrame()
        df_name = "df_" + str(num)
        exec(df_name + " = df ")
        df_list.append(eval(df_name))
    for i in df_list:
        yield i
For example, when I call create_multiple_df(3), I would like to have df_1, df_2 and df_3 returned.
However, it didn't work.
I have two questions:
1. How to store multiple dataframes in a list (i.e. without evaluating the contents of the dataframes)?
2. How to yield multiple variable elements from a list?
Thanks!
It's very likely that you do not want to have df_1, df_2, df_3 ... etc. This is often a design pursued by beginners for some reason, but trust me that a dictionary or simply a list will do the trick without the need to hold different variables.
Here, it sounds like you simply want a list comprehension:
dfs = [pd.DataFrame() for _ in range(n)]
This will create n empty dataframes and store them in a list. To retrieve or modify them, you can simply access their position. This means instead of having a dataframe saved in a variable df_1, you can have that in the list dfs and use dfs[0] to get/edit it (list positions start at 0).
Another option is a dictionary comprehension:
dfs = {i: pd.DataFrame() for i in range(n)}
It works in a similar fashion: you can access it by dfs[0] or dfs[1]. You can even use real names, e.g. {f'{genre}': pd.DataFrame() for genre in ['romance', 'action', 'thriller']}. Here, you could do dfs['romance'] or dfs['thriller'] to retrieve the corresponding df.
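If a generator is still wanted, as in the question, a minimal sketch could look like the following. It keeps the name create_multiple_df from the question; the df_1, df_2, ... labels are only dictionary keys here (an illustrative choice), not real variables.

```python
import pandas as pd


def create_multiple_df(num):
    """Yield `num` empty dataframes, one at a time."""
    for _ in range(num):
        yield pd.DataFrame()


# Collect the yielded dataframes in a list ...
dfs = list(create_multiple_df(3))

# ... or label them df_1, df_2, df_3 as dictionary keys.
named = {f"df_{i}": df for i, df in enumerate(create_multiple_df(3), start=1)}
```

No exec or eval is needed: the generator yields the objects themselves, and the caller decides how to store or name them.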
Related
I have a list of names. For each name, I start with my dataframe df and use the elements in the list to define new columns for the df. After my data manipulation is complete, I eventually create a new dataframe whose name is partially derived from the list element.
list = ['foo','bar']
for x in list :
    df = prior_df
    # (long code for manipulating df)
    new_df_x = df
    new_df_x.to_parquet('new_df_x.parquet')
    del new_df_x
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo ,new_df_bar , ...)
The reason I am using this approach is that, if I don't use a loop and just add the foo and bar columns one after another to the original df, my data gets really big and highly fragmented before I go from wide to long, and I run into an insufficient-memory error. The workaround for me is to create a loop, store the dataframe for each element, and then at the very end join the long-format dataframes together. Therefore, I cannot use the approach suggested in other answers, such as creating dictionaries.
I am stuck at the line
new_df_x = df
where within the loop, I am using the list element in the name of the data frame.
I'd appreciate any help.
IIUC, you only want the filenames, i.e. the stored parquet files to have the foo and bar markers, and you can reuse the variable name itself.
names = ['foo', 'bar']  # renamed from "list" to avoid shadowing the built-in
for x in names:
    df = prior_df
    # (long code for manipulating df)
    df.to_parquet(f'new_df_{x}.parquet')  # only the file name carries foo/bar
    del df
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
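As a rough, self-contained sketch of the same write-then-merge pattern: the example below swaps parquet for CSV files in a temporary directory so that it runs without a parquet engine installed. The id column, the sample data, and the placeholder manipulation are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the asker's prior_df.
prior_df = pd.DataFrame({"id": [1, 2, 3], "A": [42, 38, 39]})
names = ["foo", "bar"]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for x in names:
        df = prior_df.copy()
        # Placeholder for the long manipulation: add one column named after x.
        df[x] = df["A"] * (2 if x == "foo" else 3)
        df = df[["id", x]]
        # The list element appears only in the file name, not in a variable name.
        df.to_csv(out_dir / f"new_df_{x}.csv", index=False)
        del df
    # Read the per-name results back once the loop is done.
    parts = [pd.read_csv(out_dir / f"new_df_{x}.csv") for x in names]

# Merge the pieces one after another on the shared key.
new_df = parts[0]
for part in parts[1:]:
    new_df = new_df.merge(part, on="id")
```

Only one intermediate dataframe is alive at a time inside the loop, which is the point of the workaround described in the question.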
Here is an example if you want to define dataframe variable names from list elements.
import pandas as pd
data = {"A": [42, 38, 39], "B": [13, 25, 45]}
prior_df = pd.DataFrame(data)
list = ['foo', 'bar']
variables = locals()  # at module level this is the same dict as globals()
for x in list:
    df = prior_df.copy()  # assign a dataframe copy to the variable df
    # (sample code for manipulating df)
    #-----------------------------------
    if x == 'foo':
        df['B'] = df['A'] + df['B']
    if x == 'bar':
        df['B'] = df['A'] - df['B']
    #-----------------------------------
    new_df_x = "new_df_{0}".format(x)  # build the variable name as a string
    variables[new_df_x] = df  # creates new_df_foo / new_df_bar
    #del variables[new_df_x]
print(new_df_foo)  # print the 1st df variable.
print(new_df_bar)  # print the 2nd df variable.
I have a list of files and a list of Dataframes, and I want to use 1 "for" loop to open the first file of the list, extract some data and write it into the first Dataframe, then open the second file, do the same thing and write it into the second dataframe, etc. So I wrote this:
import pandas as pd
filename1 = 'file1.txt'
filename2 = 'file2.txt'
filenames = [filename1, filename2]
columns = ['a', 'b', 'c']
df1 = pd.DataFrame(columns = columns)
df2 = pd.DataFrame(columns = columns)
dfs = [df1, df2]
for name, df in zip(filenames, dfs):
    info = open(name, 'r')
    # go through the file, find some values
    df = df.append({'''dictionary with found values'''})
However, when I run the code, my data is not written into the df1 and df2 I created at the beginning. Those dataframes stay empty, and a new dataframe called df appears in the list of variables, holding my data; it also seems to be overwritten on every iteration of the loop. How do I solve this in the simplest way? The main goal is to have several different dataframes, each corresponding to a different file, at the end of the loop over the list of files. So I don't really care when and how the dataframes are created; I only want a new dataframe to be filled with values when a new file is opened.
Each time you loop through dfs, df is just a name bound to the DataFrame object in the list. df.append does not modify that object; it returns a new DataFrame, so assigning the result to df only rebinds the loop variable and leaves df1, df2 and the list untouched. Re-write your code like this:
dfs = []
for name in filenames:
    with open(name, 'r') as info:
        dfs.append(pd.read_csv(info))
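To illustrate why the original loop left df1 and df2 empty, here is a small sketch contrasting rebinding the loop variable with assigning back into the list. The rows data is made up for the example.

```python
import pandas as pd

columns = ["a", "b", "c"]
df1 = pd.DataFrame(columns=columns)
df2 = pd.DataFrame(columns=columns)
dfs = [df1, df2]

# Hypothetical values "found" in each file.
rows = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]

# Rebinding: df now points at a brand-new object; the list is not changed.
for df, row in zip(dfs, rows):
    df = pd.DataFrame([row], columns=columns)

still_empty = [d.empty for d in dfs]

# Assigning by position: the list slot itself is replaced.
for i, row in enumerate(rows):
    dfs[i] = pd.DataFrame([row], columns=columns)
```

After the second loop the list holds the filled dataframes, while the names df1 and df2 still refer to the original empty objects, which is one more reason to work with the list rather than with separate variables.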
If the text files are dictionaries, or can be converted after reading into dictionaries with the keys a, b, and c (matching the dataframe columns you created), then they can be assigned this way:
import json
import pandas as pd
filename1 = 'file1.txt'
filename2 = 'file2.txt'
filenames = [filename1, filename2]
columns = ['a', 'b', 'c']
df1 = pd.DataFrame(columns = columns)
df2 = pd.DataFrame(columns = columns)
dfs = [df1, df2]
for name, df in zip(filenames, dfs):
    with open(name, 'r') as info:
        # Assumption: each file holds a JSON object mapping a, b, c to equal-length lists.
        data = json.load(info)
    for key in data.keys():
        df[key] = data[key]  # modifies the existing dataframe in place
The reason for this is that Python doesn't know you're trying to re-assign the variable names "df1" and "df2". The list you declare "dfs" is simply a list of two empty dataframes. You never alter that list after creation, so it remains a list of two empty dataframes, which happen to individually be referenced as "df1" and "df2".
I don't know how you're constructing a DF from the file, so I'm just going to assume you have a function somewhere called make_df_from_file(filename) that handles the open() and parsing of a CSV, dict, whatever.
If you want to have a list of dataframes, it's easiest to just declare a list and add them one at a time, rather than trying to give each DF a separate name:
df_list = []
for name in filenames:
    df_list.append(make_df_from_file(name))
If you want to get a bit slicker (and faster) about it, you can use a list comprehension which combines the previous script into a single line:
df_list = [make_df_from_file(name) for name in filenames]
To reference individual dataframes in that list, you can just pull them out by index as you would with any other list:
df1 = df_list[0]
df2 = df_list[1]
...
but that's often more trouble than it's worth.
If you want to then combine all the DFs into a single one, pandas.concat() is your friend:
from pandas import concat
dfs = concat(df_list)
or, if you don't care about df_list other than as an intermediate step:
from pandas import concat
dfs = concat([make_df_from_file(name) for name in filenames])
And if you absolutely need to give separate names to all the dataframes, you can get ultra-hacky with it. (Seriously, you shouldn't normally do this, but it's fun and awful. It also only works at module level, where locals() is the same dictionary as globals(); inside a function, writing to locals() does not create variables.)
for n, d in enumerate(df_list):
    locals()[f'df{n+1}'] = d
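The list-and-concat approach from this answer can be sketched end to end as follows. make_df_from_file is the hypothetical helper named above; here it is assumed to be a thin wrapper around pd.read_csv, and the two sample files are created on the fly purely for the demonstration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def make_df_from_file(filename):
    """Hypothetical helper: assumes each file is a CSV with a header row."""
    return pd.read_csv(filename)


with tempfile.TemporaryDirectory() as tmp:
    # Write two small sample files so the example is self-contained.
    samples = ["a,b,c\n1,2,3\n", "a,b,c\n4,5,6\n7,8,9\n"]
    filenames = []
    for i, text in enumerate(samples, start=1):
        path = Path(tmp) / f"file{i}.txt"
        path.write_text(text)
        filenames.append(path)

    # One dataframe per file, in the same order as filenames.
    df_list = [make_df_from_file(name) for name in filenames]

# Stack all dataframes into one, renumbering the rows.
combined = pd.concat(df_list, ignore_index=True)
```

Passing ignore_index=True is a design choice for this sketch: without it, the combined dataframe keeps each file's original row labels, which would repeat.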
Is it possible to create a new df after every iteration? With i being the iteration, it should generate df0, df1, df2, etc., up to the MAX_NUMBER range, as in this example:
for i in range(MAX_NUMBER + 1):
    df[i] = pd.read_csv(f"C:/Users/Desktop/{i}.csv")
The original code consists of functions that loop multiple times. However, for simplicity, I've used read_csv in the example.
Kindly advise. Many thanks
Try creating a list and appending each df as you progress through the for loop, like this:
df = []
for i in range(MAX_NUMBER + 1):
    df.append(pd.read_csv(f"C:/Users/Desktop/{i}.csv"))
When you need to access one, you can use an index like df[0], df[1].
Read each file and assign it to a dictionary key, with the key being the name of the dataframe, as follows:
dfs = {}
for i in range(MAX_NUMBER + 1):
    dfs[f'df{i}'] = pd.read_csv(f"C:/Users/Desktop/{i}.csv")
Then you can access each df by its name:
dfs['df0']
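The dictionary version can also be written as a comprehension. The sketch below is runnable on its own: it assumes MAX_NUMBER = 2 and writes small sample CSV files into a temporary directory instead of the Desktop path from the question.

```python
import tempfile
from pathlib import Path

import pandas as pd

MAX_NUMBER = 2  # hypothetical value for the demonstration

with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp)
    # Create 0.csv, 1.csv, 2.csv so there is something to read.
    for i in range(MAX_NUMBER + 1):
        pd.DataFrame({"value": [i, i * 10]}).to_csv(base_dir / f"{i}.csv", index=False)

    # One dictionary entry per file, keyed df0, df1, df2.
    dfs = {f"df{i}": pd.read_csv(base_dir / f"{i}.csv") for i in range(MAX_NUMBER + 1)}
```

Each dataframe is then available by key, for example dfs["df2"].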
I am having difficulty creating multiple lists from a list of multiple pandas dataframes:
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
...
dfN = pd.read_csv('dfN.csv')
dfs = [df1, df2, ..., dfN]
So far, I am able to convert each dataframe into a list by df1 = df1.values.tolist(). Since I have multiple data frames, I would like to convert each dataframe into a list with a loop.
Appreciate any suggestions!
Use list comprehensions:
dfs = [i.values.tolist() for i in dfs]
The same way you are storing the dataframes?
lists = []
for df in dfs:
    temp_list = df.values.tolist()
    lists.append(temp_list)
This will give you a list of lists. Each list within will be values from a dataframe. Or did I understand the question incorrectly?
Edit: If you wish to name each list, then you can use a dictionary instead? Would be better than trying to create thousands of variables dynamically.
dict_of_lists = {}
for index, df in enumerate(dfs):
    listname = "list" + str(index)
    dict_of_lists[listname] = df.values.tolist()
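As a small worked example of the two approaches above, assuming two made-up dataframes in place of the ones read from CSV:

```python
import pandas as pd

# Hypothetical sample data standing in for df1 ... dfN.
dfs = [
    pd.DataFrame({"x": [1, 2], "y": [3, 4]}),
    pd.DataFrame({"x": [5], "y": [6]}),
]

# One nested list (rows of values) per dataframe.
lists = [df.values.tolist() for df in dfs]

# The same lists, addressable by name instead of position.
dict_of_lists = {f"list{index}": rows for index, rows in enumerate(lists)}

print(lists)  # [[[1, 3], [2, 4]], [[5, 6]]]
```

Each inner list is one row of the corresponding dataframe, so lists[0][1] is the second row of the first dataframe.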
Use pd.concat to join all dataframes into one big dataframe:
df_all = pd.concat(dfs,axis=1)
df_all.values.tolist()
I have three different Pandas dataframes
df_1
df_2
df_3
I would like to loop over the dataframes, do some computations, and store the output using the name of the dataframe. In other words, something like this
for my_df in [df_1, df_2, df_3]:
    my_df.reset_index(inplace=True)
    my_df.to_csv('mypath/' + my_df +'.csv')
Output files expected:
'mypath/df_1.csv', 'mypath/df_2.csv' and 'mypath/df_3.csv'
I am struggling doing so because df_1 is an object, and not a string.
Any ideas how to do that?
Thanks!
Another, more general solution is to create a dict keyed by the dataframe names and then loop over items():
d = {'df_1':df_1,'df_2':df_2, 'df_3':df_3}
for k, my_df in d.items():
    my_df.reset_index(inplace=True)
    my_df.to_csv('mypath/' + k +'.csv')
Another possible solution is to use a second list with the dataframe names, then enumerate and look up each name by position:
names = ['a','b','c']
print ({names[i]: df for i, df in enumerate([df_1, df_2, df_3])})
To store a df as CSV with the to_csv method, we need a string path.
So we enumerate over the list, which gives us two variables in the for loop.
The first variable is the index of the loop iteration, like a counter; the second is the dataframe itself.
We convert the counter value to a string
and use it to build a unique file name for each dataframe.
for idx, my_df in enumerate([df_1, df_2, df_3]):
    my_df.reset_index(inplace=True)
    my_df.to_csv('mypath/df_' + str(idx + 1) +'.csv')
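Putting the name-to-dataframe idea into a runnable form, the sketch below writes three made-up dataframes to a temporary directory that stands in for mypath. The sample data and the frames dictionary are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical dataframes, keyed by the name each file should get.
frames = {
    "df_1": pd.DataFrame({"v": [1, 2]}),
    "df_2": pd.DataFrame({"v": [3, 4]}),
    "df_3": pd.DataFrame({"v": [5, 6]}),
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # stands in for 'mypath/'
    for name, frame in frames.items():
        # reset_index() returns a new dataframe, so the originals stay unchanged.
        frame.reset_index().to_csv(out_dir / f"{name}.csv", index=False)
    written = sorted(p.name for p in out_dir.iterdir())

print(written)  # ['df_1.csv', 'df_2.csv', 'df_3.csv']
```

Keeping the name as a dictionary key avoids having to recover a variable's name from the object, which Python does not support directly.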