Renaming a dataframe in Python using for loop - python

I am trying to solve a problem similar to the one answered in
How to name dataframes with a for loop?
review_categories = ["beauty", "pet"]
reviews = {}
for review in review_categories:
    df_name = review + '_reviews'  # the name for the dataframe
    filename = "D:\\Library\\reviews_{}.json".format(review)
    reviews[df_name] = pd.read_json(path_or_buf=filename, lines=True)
In reviews, you will have a key with the respective dataframe to store the data. If you want to retrieve the data, just call:
reviews["beauty_reviews"]
But what if I want to rename the dataframes
reviews["beauty_reviews"] and
reviews["pet_reviews"]
to something else? What is the best way to do so?

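I'm not sure whether you want to rename the variable that holds the dataframe or the key it is stored under, so here's an answer for both:
"Rename" the dataframe by binding it to a new variable name:
new_name = reviews['beauty_reviews']
Rename the keys. reviews is a plain dict, so it has no rename method; move each dataframe to its new key with pop instead:
reviews["new_beauty"] = reviews.pop("beauty_reviews")
reviews["new_pet"] = reviews.pop("pet_reviews")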
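Because reviews is a dict, several dataframes can also be renamed in one pass by rebuilding the dictionary from a mapping of old keys to new keys. The sketch below is only an illustration: it uses small stand-in dataframes instead of the JSON files, and rename_map is a hypothetical name that does not appear in the question.

```python
import pandas as pd

# Stand-in dataframes; in the question these come from pd.read_json
reviews = {
    "beauty_reviews": pd.DataFrame({"rating": [5, 4]}),
    "pet_reviews": pd.DataFrame({"rating": [3]}),
}

# Hypothetical mapping from old key to new key
rename_map = {"beauty_reviews": "new_beauty", "pet_reviews": "new_pet"}

# Keys that are missing from rename_map keep their current name
reviews = {rename_map.get(key, key): df for key, df in reviews.items()}

print(list(reviews))
```

The dataframes themselves are not copied; only the keys that point at them change.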

Related

Create many CSV Files based on Pandas df column value

I have one CSV file with thousands of rows (an example was attached as a picture).
What I need to do is:
Iterate over the pandas df
Create a separate file for every department with its list of staff (groupby or .loc)
To name the file:
If a manager is found, name the file after the manager
If there is no manager, look for an officer
If there is no manager or officer, write a comment in the last column
The file should be named like this (IT-name2.csv), since name2 is the manager in the example.
So I have two variables, the department name and the staff name.
I was able to do this, but there is a lot of manual work, which should not be the case. I name every CSV file myself: I have hundreds of lines, and I type the manager name into each CSV filename by hand, which has caused some errors, and the names could change in the future.
Right now, for every department I have a groupby line and a line to save the CSV file (manually entering the manager name in the filename).
How can this be more automated?
Many thanks
It sounds like you're nearly there with the groupby. How about adding a custom function to modify the csv name depending on what you find in the groupby?
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(data={
    "Dept" : np.random.choice(["IT", "HR", "Sales"], 20),
    "Staff" : ["name" + num for num in np.random.randint(0,5,20).astype(str)],
    "Role" : np.random.choice(a=["Manager", "Officer", "Admin"], size=20, p=[0.1, 0.3, 0.6]),
    "Comment" : [None] * 20
})
def to_csv(group):
    roles = group["Role"].tolist()
    dept = group["Dept"].iloc[0]
    staff_name = "NotFound"
    if "Manager" in roles:
        staff_name = group["Staff"].iloc[roles.index("Manager")]
    elif "Officer" in roles:
        staff_name = group["Staff"].iloc[roles.index("Officer")]
    group.to_csv(f"{dept}-{staff_name}.csv", index=False)
# Iterating over the groups keeps the "Dept" column available inside to_csv
for _, group in df.groupby("Dept"):
    to_csv(group)
list.index() returns the position of the first match, which you can use to grab the name in that position from the group. It might not be the fastest thing in the world, but hopefully it will get the job you have in mind done.
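The question also asks for a comment in the last column when a department has neither a manager nor an officer, which the function above does not cover. One possible way to handle all three cases is sketched below. The helper name pick_name, the sample rows, the comment text and the temporary output folder are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small hand-written sample so the result is easy to check
df = pd.DataFrame({
    "Dept": ["IT", "IT", "HR", "HR", "Sales"],
    "Staff": ["name1", "name2", "name3", "name4", "name5"],
    "Role": ["Admin", "Manager", "Officer", "Admin", "Admin"],
    "Comment": [None] * 5,
})


def pick_name(group):
    """Return the first manager's name, else the first officer's, else None."""
    for role in ("Manager", "Officer"):
        matches = group.loc[group["Role"] == role, "Staff"]
        if not matches.empty:
            return matches.iloc[0]
    return None


out_dir = Path(tempfile.mkdtemp())  # throwaway folder for the example
written = []
for dept, group in df.groupby("Dept"):
    group = group.copy()  # work on a copy so df itself is left unchanged
    name = pick_name(group)
    if name is None:
        name = "NotFound"
        group["Comment"] = "No manager or officer found"
    path = out_dir / f"{dept}-{name}.csv"
    group.to_csv(path, index=False)
    written.append(path.name)

print(sorted(written))
```

Checking roles in priority order inside one loop keeps the naming rule in a single place, so adding another fallback role only means extending the tuple.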
I wasn't able to understand the complete requirement, but here is an answer that should help:
Get a list of unique departments:
dept_list = list(set(df['Dept.'].tolist()))
Now run through the list of unique departments and filter the dataframe for each one:
for dept in dept_list:
    sub_df = df.loc[df['Dept.'] == dept]
    # We want to send this to a file. The file name should be dept-officer/manager/other name.csv
    # Check if a manager exists in sub_df['Role']
    if 'Manager' in sub_df['Role'].tolist():
        name_employee = sub_df[sub_df['Role'] == 'Manager'].iloc[-1]['name']
        sub_df.to_csv('{}-{}.csv'.format(dept, name_employee))
    elif 'Officer' in sub_df['Role'].tolist():
        name_employee = sub_df[sub_df['Role'] == 'Officer'].iloc[-1]['name']
        sub_df.to_csv('{}-{}.csv'.format(dept, name_employee))

Save multiple dataFrames in a loop using to_pickle

Hi, I have 4 pandas dataframes: df1, df2, df3, df4.
What I would like to do is save these dataframes with to_pickle inside a for loop.
What I did is this:
out = 'mypath\\myfolder\\'
r = [ orders, adobe, mails , sells]
for i in r:
    i.to_pickle( out + '\\i.pkl')
The command runs, but it does not save each dataframe under its own name; instead it keeps overwriting the same file, i.pkl (I think because my code is not correct).
It seems it can't name each file after its dataframe (e.g. inside the for loop, orders is saved as i.pkl, and the same happens for the other dataframes).
What I expect is 4 saved dataframes named after the objects in r (so: orders.pkl, adobe.pkl, mails.pkl, sells.pkl).
How can I do this?
You can't stringify the variable name (this is not something you generally do), but you can do something simple:
import os
out = 'mypath\\myfolder\\'
df_list = [df1, df2, df3, df4]
for i, df in enumerate(df_list, 1):
    # No leading backslash: on Windows it would make os.path.join discard `out`
    df.to_pickle(os.path.join(out, f'df{i}.pkl'))
If you want to provide custom names for your files, here is my suggestion: use a dictionary.
df_map = {'orders': df1, 'adobe': df2, 'mails': df3, 'sells': df4}
for name, df in df_map.items():
    df.to_pickle(os.path.join(out, f'{name}.pkl'))
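To check that each file really holds its own dataframe, the files can be read back with pd.read_pickle. The following is a small sketch under stated assumptions: the dataframes are toy stand-ins, and a temporary folder replaces mypath\\myfolder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the four dataframes in the question
df_map = {
    "orders": pd.DataFrame({"id": [1, 2]}),
    "adobe": pd.DataFrame({"id": [3]}),
    "mails": pd.DataFrame({"id": [4, 5, 6]}),
    "sells": pd.DataFrame({"id": [7]}),
}

out = Path(tempfile.mkdtemp())  # temporary folder used only for this example

# Save each dataframe under its dictionary key
for name, df in df_map.items():
    df.to_pickle(out / f"{name}.pkl")

# Load everything back, keyed by file name without the extension
loaded = {path.stem: pd.read_pickle(path) for path in sorted(out.glob("*.pkl"))}
print(list(loaded))
```

Using pathlib here is a design choice rather than a requirement; os.path.join works equally well as long as the file name has no leading separator.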

Creating multiple lists in for loop with dynamic names in Python

I'm trying to find the averages and standard deviations of multiple columns of my dataset and then save them as new columns in a new dataframe, i.e. for every 'GROUP' in the dataset, I want one column in the new dataframe with its average and one with its SD. I came up with the following script, but I'm not able to name the lists dynamically.
Average_F1_S_list, Average_F1_M_list, SD_F1_S_list, SD_F1_M_list = ([] for i in range(4))
Groups= DF['GROUP'].unique().tolist()
for key in Groups:
    Average_F1_S = DF_DICT[key]['F1_S'].mean()
    Average_F1_S_list.append(Average_F1_S)
    SD_F1_S = DF_DICT[key]['F1_S'].std()
    SD_F1_S_list.append(SD_F1_S)
    Average_F1_M = DF_DICT[key]['F1_M'].mean()
    Average_F1_M_list.append(Average_F1_M)
    SD_F1_M = DF_DICT[key]['F1_M'].std()
    SD_F1_M_list.append(SD_F1_M)
df=pd.DataFrame({'Group':Groups,
                 'Average_F1_S':Average_F1_S_list,'Standard_Dev_F1_S':SD_F1_S_list,
                 'Average_F1_M':Average_F1_M_list,'Standard_Dev_F1_M':SD_F1_M_list},
                columns=['Group','Average_F1_S','Standard_Dev_F1_S','Average_F1_M', 'Standard_Dev_F1_M'])
This will not be a good solution as there are too many features. Is there any way I can create the lists dynamically?
This should do the trick! Hope this helps
# These are all the keys you want
key_names = ['F1_S', 'F1_M']
# Holds the data you want to pass to the dataframe.
df_info = {'Groups': Groups}
for group_name in Groups:
    # For each group in the groups, we iterate over all the keys we want.
    for key in key_names:
        # Generate a keyname that you want for your dataframe.
        avg_key_name = key + '_Average'
        std_key_name = key + '_Standard_Dev'
        if avg_key_name not in df_info:
            df_info[avg_key_name] = []
            df_info[std_key_name] = []
        df_info[avg_key_name].append(DF_DICT[group_name][key].mean())
        df_info[std_key_name].append(DF_DICT[group_name][key].std())
df = pd.DataFrame(df_info)
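If the per-group dataframes in DF_DICT were split out of DF by its GROUP column (an assumption, since the question does not show how DF_DICT is built), pandas can compute the same statistics without any explicit lists by using groupby and agg. This is a different technique from the loop above, sketched here with made-up numbers:

```python
import pandas as pd

# Made-up data with the column names used in the question
DF = pd.DataFrame({
    "GROUP": ["a", "a", "b", "b"],
    "F1_S": [1.0, 3.0, 2.0, 6.0],
    "F1_M": [10.0, 20.0, 30.0, 50.0],
})

key_names = ["F1_S", "F1_M"]

# One row per group, with a (column, statistic) pair for every feature
stats = DF.groupby("GROUP")[key_names].agg(["mean", "std"])

# Flatten the two-level column index into names like F1_S_mean
stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
stats = stats.reset_index()
print(stats)
```

Adding more features then only means extending key_names.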

Looking for a better way do accomplish dataframe to dictionary by series

Here's a portion of what the Excel file looks like. Meant to include this the first time. Thanks for the help so far.
Name                  Phone Number  Carrier
FirstName LastName1   3410142531    Alltel
FirstName LastName2   2437201754    AT&T
FirstName LastName3   9247224091    Boost Mobile
FirstName LastName4   6548310018    Cricket Wireless
FirstName LastName5   8811620411    Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering whether there is an easier way to accomplish this task and get the same output. Though my code is fairly concise, I'm interested in any module or built-in I'm not aware of. My Python skills are beginner at best. I wrote this in Thonny with Python 3.6.4. Thanks!
#Imports
import pandas as pd
import math
# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'
# Load spreadsheets
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)
# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')
# Converts PhoneNumbers.xlsx to a dictionary
x=0
temp_dict = {}
for item in phone_numbers:
    temp_list = []
    for key in phone_numbers[x]:
        tempholder = phone_numbers[x][key]
        if (isinstance(tempholder, float) or isinstance(tempholder, int)) and math.isnan(tempholder) == False: # Checks to see if there is a blank and if the phone number comes up as a float
            # Converts any floats to string for use in later code
            tempholder = str(int(tempholder))
        else:
            pass
        temp_list.append(tempholder)
    temp_dict[temp_list[0]] = temp_list[1:] # Makes the first item in the list the key and add the rest as values
    x += 1
print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': ['2437201754', 'AT&T'], 'FirstName LastName3': ['9247224091', 'Boost Mobile'], 'FirstName LastName4': ['6548310018', 'Cricket Wireless'], 'FirstName LastName5': ['8811620411', 'Project Fi']}
One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This iterates through all rows in your dataframe, creating a dictionary key for each Name, with the phone number and carrier as its values (in a list), as you indicated in your output.
To make sure that your phone number is not null (as you did in your loop), you could add an if clause to your dict comprehension, such as this:
temp_dict = {row['Name']: [row['Phone Number'], row['Carrier']]
             for _, row in df.iterrows()
             if not math.isnan(row['Phone Number'])}
df.set_index('Name').T.to_dict('list')
should do the job, where df is your dataframe.
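To see how blanks and float-typed phone numbers could be handled while building the dictionary, here is a minimal sketch on a small made-up table (not the asker's file). It walks the columns with zip instead of iterrows, which is a swapped-in choice, and the helper name clean_phone is hypothetical.

```python
import pandas as pd

# Made-up rows in the same layout as the spreadsheet. One phone number is
# missing, which makes pandas store the whole column as floats.
df = pd.DataFrame({
    "Name": ["FirstName LastName1", "FirstName LastName2", "FirstName LastName3"],
    "Phone Number": [3410142531, None, 9247224091],
    "Carrier": ["Alltel", "AT&T", "Boost Mobile"],
})


def clean_phone(value):
    """Turn a numeric phone number into a digit string; leave blanks as None."""
    if pd.isna(value):
        return None
    return str(int(value))


temp_dict = {
    name: [clean_phone(phone), carrier]
    for name, phone, carrier in zip(df["Name"], df["Phone Number"], df["Carrier"])
}
print(temp_dict)
```

Keeping the blank as None, rather than dropping the row, means every name stays available as a key; whether that is preferable depends on what the calling code expects.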

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns

I read a file into artists_tags, with column names.
This file has multiple columns, and I need to generate a new data structure that keeps 2 columns from artists_tags as they are and has the most frequent value from the 'Tag' column as the 3rd column.
Here is what I have written so far:
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df
def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print (a)
    top_tags = all_tags.ArtistID,all_tags.ArtistName,a;
    return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from the list but they seem to display the entire column and not one particular value. I know this is some trivial syntax issue but after having searched for long I still cannot figure out how to get this one.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
# (sort_values replaces the sort method that older pandas versions had)
tag_counts = tag_counts['Count'].sum().sort_values(ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index
# (.loc replaces the .ix indexer that older pandas versions had):
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
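The question asks for a structure holding (artist id, artist name, top tag), so it may help to see a variant that keeps ArtistName as well. The sketch below uses a few made-up rows (the artist IDs and counts are invented; only the column names come from the question) and relies on sort_values plus drop_duplicates as one alternative way to keep the highest-count tag per artist.

```python
import pandas as pd

# Invented sample rows using the question's column names
artists_tags = pd.DataFrame({
    "ArtistID": ["id-1", "id-1", "id-1", "id-2", "id-2"],
    "ArtistName": ["Nirvana", "Nirvana", "Nirvana", "Other Band", "Other Band"],
    "Tag": ["rock", "Grunge", "90s", "pop", "dance"],
    "Count": [50, 100, 20, 7, 9],
})


def calculate_top_tag(all_tags):
    """Return one row per artist: ArtistName and the tag with the highest Count."""
    # Total count per (artist, tag), in case a tag appears on several rows
    totals = all_tags.groupby(
        ["ArtistID", "ArtistName", "Tag"], as_index=False
    )["Count"].sum()
    # Highest counts first, then keep the first row seen for each artist
    totals = totals.sort_values("Count", ascending=False)
    top = totals.drop_duplicates("ArtistID")
    return top.set_index("ArtistID")[["ArtistName", "Tag"]]


top_tags = calculate_top_tag(artists_tags)
print("Top tag for Nirvana is {0}".format(top_tags.loc["id-1", "Tag"]))
```

With the real data file, the lookup key would be the actual artist ID from the question instead of the placeholder "id-1".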
