Save multiple DataFrames in a loop using to_pickle - python

Hi, I have 4 pandas DataFrames: df1, df2, df3, df4.
What I'd like to do is save these DataFrames in a for loop using to_pickle.
What I did is this:
out = 'mypath\\myfolder\\'
r = [orders, adobe, mails, sells]
for i in r:
    i.to_pickle(out + '\\i.pkl')
The command runs, but it does not save each DataFrame under its own name; it keeps overwriting the same file i.pkl (I think because my code is not correct).
It seems the loop can't name each file after its DataFrame (e.g. orders is saved as i.pkl, and the same happens to the other DataFrames).
What I expect is 4 files named after the objects in r (so: orders.pkl, adobe.pkl, mails.pkl, sells.pkl).
How can I do this?

You can't stringify the variable name (this is not something you generally do), but you can do something simple:
import os

out = 'mypath\\myfolder\\'
df_list = [df1, df2, df3, df4]
for i, df in enumerate(df_list, 1):
    df.to_pickle(os.path.join(out, f'df{i}.pkl'))
If you want to provide custom names for your files, here is my suggestion: use a dictionary.
df_map = {'orders': df1, 'adobe': df2, 'mails': df3, 'sells': df4}
for name, df in df_map.items():
    df.to_pickle(os.path.join(out, f'{name}.pkl'))
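To load the pickles back later, here is a minimal sketch, assuming the same out folder and the names used above:
import os
import pandas as pd

out = 'mypath\\myfolder\\'
names = ['orders', 'adobe', 'mails', 'sells']
# rebuild the name -> DataFrame mapping from the pickle files
df_map = {name: pd.read_pickle(os.path.join(out, f'{name}.pkl')) for name in names}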

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow, since you create a brand-new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list that collects all the ticker DataFrames. After the loop, create the master DataFrame from all of them with a single pandas.concat:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
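For illustration, here is a minimal sketch with toy data (hypothetical dates and values) showing the datetime alignment the question asks for: concat with axis=1 aligns rows on the index and fills missing dates with NaN.
import pandas as pd

# tkr1 has three dates, tkr2 only two
a = pd.DataFrame({'tkr1Col1': [1, 2, 3]},
                 index=pd.to_datetime(['2022-06-25', '2022-06-26', '2022-06-27']))
b = pd.DataFrame({'tkr2Col1': [10, 20]},
                 index=pd.to_datetime(['2022-06-26', '2022-06-27']))
combined = pd.concat([a, b], axis=1)
print(combined)  # tkr2Col1 is NaN on 2022-06-25, since tkr2 has no data there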
As a suggestion, here is a cleaner way of defining your clean_func, using DataFrame.set_index and DataFrame.add_prefix.
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you want, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters (honestly, I'm not sure if those two parameters play well together, and I'm too lazy to test it, but you can try ;)).
import pandas as pd

def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step
df2 = clean_func(tkr, df1)
since it is embedded in the concat call. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined, df2), axis=1)
Just make sure the dataframes are wrapped in parentheses (as a tuple) within the concat call.
Same answer as GC123, but here is a full example which mimics reading from separate files and concatenating them:
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1, fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
Both versions print the same output, shown below.
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                store  quantity  unit_price            store  quantity  unit_price
fruit
apple   fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear    fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana  fancy-grocers         1      256.00  bargain-grocers       667        0.01
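The efficiency difference grows with the number of files: concatenating inside the loop copies the accumulated frame on every iteration, while collecting frames in a list and concatenating once copies each frame a single time. A minimal timing sketch with made-up toy frames (sizes are arbitrary):
import time
import pandas as pd

frames = [pd.DataFrame({'v': range(1000)}) for _ in range(200)]  # toy data

start = time.perf_counter()
combined = pd.DataFrame()
for f in frames:
    combined = pd.concat((combined, f), axis=1)  # re-copies everything each pass
print('concat inside loop:', time.perf_counter() - start)

start = time.perf_counter()
combined = pd.concat(frames, axis=1)  # single concatenation
print('single concat:', time.perf_counter() - start)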

Create multiple dataframes with a for loop in Python

I need to compile grades from 10 files named quiz2, quiz3 [...], quiz11.
I apply the following transformations:
Import the xls into a df with pandas
Get only the 4 renamed columns
Keep only the highest grade if there are multiple values for the same ID
The code for one dataframe is the following:
quiz2=pd.read_excel(r'C:\Users\llarbodiere\Desktop\Perso\grade compil\quiz\quiz2.xls')
quiz2=quiz2.rename({'Nom d’utilisateur': 'ID', 'Note totale': 'quiz2'}, axis='columns')
quiz2=quiz2[['Nom','Prénom','ID','quiz2']]
quiz2.groupby("ID").max().sort_values("Nom").fillna(0)
I want to apply the same transformations to all the quizzes, from quiz2 to quiz11. I have tried a for loop but it did not work.
Thanks in advance!
You could generate the file names dynamically by looping through the numbers 2 to 11 and concatenating each number into the file name and suffix.
import pandas as pd

# create an empty dataframe for collecting loop results
cumulative_df = pd.DataFrame()
# loop through the quiz numbers 2 to 11
for x in range(2, 12):
    # generate the file name
    file = 'quiz' + str(x) + '.xls'
    df = pd.read_excel('C:/Users/llarbodiere/Desktop/Perso/grade compil/quiz/' + file)
    df = df.rename({'Nom d’utilisateur': 'ID', 'Note totale': 'quiz'}, axis='columns')
    df = df[['Nom', 'Prénom', 'ID', 'quiz']]
    df = df.groupby("ID").max().sort_values("Nom").fillna(0)
    # add the df active in the loop to the cumulative df
    # (pd.concat returns a new dataframe, so reassign the result)
    cumulative_df = pd.concat([cumulative_df, df])
print(cumulative_df)
EDIT: the example above is for the specific file names you mentioned. It could be generalized further to work for all files in a given directory, for example, as sketched below.
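A minimal sketch of that generalization using the standard-library glob module; the folder path and column names come from the question, everything else is an assumption:
import glob
import os
import pandas as pd

folder = 'C:/Users/llarbodiere/Desktop/Perso/grade compil/quiz'
dfs = []
# pick up every .xls file in the folder, in name order
for path in sorted(glob.glob(os.path.join(folder, '*.xls'))):
    df = pd.read_excel(path)
    df = df.rename({'Nom d’utilisateur': 'ID', 'Note totale': 'quiz'}, axis='columns')
    df = df[['Nom', 'Prénom', 'ID', 'quiz']]
    df = df.groupby('ID').max().sort_values('Nom').fillna(0)
    dfs.append(df)
cumulative_df = pd.concat(dfs)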

Twitterscraper: Adding tweet country info to scraped dataframe

I am using twitterscraper from https://github.com/taspinar/twitterscraper to scrape around 20k tweets created since 2018. Tweet locations are not readily extracted with the default settings. Nevertheless, tweets written from a location can be searched for with advanced queries placed within quotes, e.g. "#hashtagofinterest near:US".
Thus I am thinking of looping through a list of alpha-2 country codes to filter the tweets by country and add the country info to my search results. Initial attempts were done on small samples of tweets from the past 10 days.
import datetime as dt
import pandas as pd
from twitterscraper import query_tweets

# set arguments
begin_date = dt.date(2020, 4, 1)
end_date = dt.date(2020, 4, 11)
lang = 'en'
# define queries (alpha_2 is my list of two-letter country codes)
queries = [(f'(#hashtagA OR #hashtagB near:{country})', country) for country in alpha_2]
# initiate queries
dfs = []
for query, country in queries[:10]:  # trying on first 10 countries
    temp = query_tweets(query, begindate=begin_date, enddate=end_date, lang=lang)
    temp = pd.DataFrame(t.__dict__ for t in temp)
    temp["country"] = [country] * len(temp)
    dfs.append((temp, country))
I managed to add the country info as a new variable in each country's df.
However, I am stuck at combining the query results into one dataframe: pd.concat() is not working, complaining that the passed data has 2 columns where 22 were expected.
My intended result is a single dataframe with a new country column added to the default 21 columns (22 columns in total).
Since dfs is a list of tuples, with each tuple being (DataFrame, str), you only want to concatenate the first element of each tuple in dfs.
You may achieve this using:
concat_df = pd.concat([df for df, _ in dfs], ignore_index=True)
which will create a new list of only the DataFrames and concatenate those. I have added ignore_index=True so that the rows will be re-indexed in the concatenated DataFrame.
Since the country is already stored in the DataFrame, you could also not add this to dfs and only append temp instead:
dfs = []
for query, country in queries[:10]:  # trying on first 10 countries
    temp = query_tweets(query, begindate=begin_date, enddate=end_date, lang=lang)
    temp = pd.DataFrame(t.__dict__ for t in temp)
    temp["country"] = [country] * len(temp)
    dfs.append(temp)
concat_df = pd.concat(dfs, ignore_index=True)

Renaming a dataframe in Python using a for loop

I am trying to address a problem similar to the one discussed in
How to name dataframes with a for loop?
review_categories = ["beauty", "pet"]
reviews = {}
for review in review_categories:
    df_name = review + '_reviews'  # the name for the dataframe
    filename = "D:\\Library\\reviews_{}.json".format(review)
    reviews[df_name] = pd.read_json(path_or_buf=filename, lines=True)
In reviews, you will have a key with the respective dataframe to store the data. If you want to retrieve the data, just call:
reviews["beauty_reviews"]
But what if I want to rename the dataframes
reviews["beauty_reviews"] and
reviews["pet_reviews"]
to something else? What is the best way to do so?
It's not clear whether you want to rename the dataframe itself or its columns, but here's an answer for both:
"Rename" the dataframe (bind it to a new name):
new_name = reviews['beauty_reviews']
Rename the columns (note that reviews is a dict, so .rename must be called on a stored dataframe; 'old_col' and 'new_col' are placeholders):
reviews['beauty_reviews'] = reviews['beauty_reviews'].rename(columns={'old_col': 'new_col'})
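If the goal is for the dataframes to live under new keys in the reviews dict, a plain dictionary operation does it; a minimal sketch (the new key names are hypothetical):
# dict.pop removes the old key and returns the stored dataframe
reviews["new_beauty"] = reviews.pop("beauty_reviews")
reviews["new_pet"] = reviews.pop("pet_reviews")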

Trying to access one cell in a pandas dataframe

I have imported two .csv files as pandas dataframes. The first, df1, looks something like this:
projName  projOwner   Data
proj0     projOwner0     5
proj1     projOwner1     7
proj2     projOwner2     8
proj3     projOwner3     3
The second dataframe, df2, looks like this:
projName  projOwner   projEmail  projFirstName  projLastName
proj0     projOwner0  email0     firstName0     lastName0
proj1     projOwner1  email1     firstName1     lastName1
proj2     projOwner2  email2     firstName2     lastName2
proj3     projOwner3  email3     firstName3     lastName3
Basically what I have done is set the index of df2 to projName. Now I am iterating through the rows of df1 and want to use data from df2 based on df1.
df2 = df2.set_index("projName")
for index, row in df1.iterrows():
    project_name = str(row['projName'])
    firstName = df2.loc[project_name, 'projFirstName']
    lastName = df2.loc[project_name, 'projLastName']
I have done this and it works on some of the rows, but for others it gives me a string of different values in that column. I have tried using .at, .iloc, and .loc and have not had success. Can someone help me see what I am doing wrong?
A much easier way to do this is to use the pandas merge function to merge the dataframes first; then you don't have to reference the data in one dataframe by the data in another - it's all in one place. For example:
import pandas as pd

df1 = pd.DataFrame({'projName': ['proj0', 'proj1'],
                    'projOwner': ['projOwner0', 'projOwner1'],
                    'Data': [5, 7]})
df2 = pd.DataFrame({'projName': ['proj0', 'proj1'],
                    'projOwner': ['projOwner0', 'projOwner1'],
                    'projEmail': ['email0', 'email1']})
df = df1.merge(df2, on=['projName', 'projOwner'])
print(df)
for index, row in df.iterrows():
    print(row['projName'])
    print(row['projOwner'])
    print(row['projEmail'])
    print(row['Data'])
df now looks like this:
   Data projName   projOwner projEmail
0     5    proj0  projOwner0    email0
1     7    proj1  projOwner1    email1
Looping through the rows and printing the project name, owner, email, and data gives this:
proj0
projOwner0
email0
5
proj1
projOwner1
email1
7
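And since the question title is about accessing one cell: once df2 is indexed by projName as in the question, DataFrame.at gives fast scalar access by row label and column name. A minimal sketch, assuming the sample df2 built above:
df2 = df2.set_index('projName')  # remember to assign the result
email = df2.at['proj0', 'projEmail']  # 'email0'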
