Pandas, assign object name using index strings - python

Using sample data:
Product = [Galaxy_8, Galaxy_Note_9, Galaxy_Note_10, Galaxy_11]
I would like to create 4 data frames, each of the data frames contains respective sales information.
The problem is I would like to use index method to create data frames, for instance,
Expected output is:
Galaxy_8 = pd.DataFrame()
Galaxy_Note_9 = Pd.DataFrame()
Galaxy_Note_10 = pd.DataFrame()
Galaxy_11 = pd.DataFrame()
Imagine if the product list counts beyond 200, what is the most efficient way to achieve the desired outcome?
Thank you

If the sample list is like,
Product = ['Galaxy_8', 'Galaxy_Note_9', 'Galaxy_Note_10','Galaxy_11']
Then you can try like:
for var in Product:
globals()[var] = pd.DataFrame()

Related

Looping through Pandas data frame for a calculation based off the row values in two different list

I have been able to get the calculation to work but now I am having trouble appending the results back into the data frame e3. You can see from the picture that the values are printing out.
brand_list = list(e3["Brand Name"])
product_segment_list = list(e3['Product Segment'])
# Create a list of tuples: data
data = list(zip(brand_list, product_segment_list))
for i in data:
step1 = e3.loc[(e3['Brand Name']==i[0]) & (e3['Product Segment']==i[1])]
Delta_Price = (step1['Price'].diff(1).div(step1['Price'].shift(1),axis=0).mul(100.0))
print(Delta_Price)
it's easier to use groupby. In each loop 'r' will be just the grouped rows from e3 dataframe from each category and i an index.
new_df = []
for i,r in e3.groupby(['Brand Name','Product Segment']):
price_num = r["Price"].diff(1).values
price_den = r["Price"].shift(1).values
r['Price Delta'] = price_num/price_den
new_df.append(r)
e3_ = pd.concat(new_df, axis = 1)

How to export data frame as CSV in a loop

I'm analyzing some data over a loop of 10 iterations, each of the iterations represents one of the data sets. I've managed to create a data frame with pandas at the end of each iteration, now i need to export each with a different name. here a take of the code.
for t in range(len(myFiles)):
DATA = np.array(importdata(t))
data = DATA[:,1:8]
Numbers = data[:,0:5]
Stars = data[:,5:7]
[numbers,repetitions]=(Frequence(Numbers))
rep_n,freq_n = (translate(repetitions,data))
[stars,Rep_s] = (Frequence(Stars))
rep_s,freq_s = (translate(Rep_s,data))
DF1 = dataframe(numbers,rep_n,freq_n)
DF2 = dataframe(stars,rep_s,freq_s)
Data frames DF1 and DF2 must be store separately with different names in each of the loop iterations.
You can create lists of DataFrames:
ListDF1, ListDF2 = [], []
for t in range(len(myFiles)):
...
rep_s,freq_s = (translate(Rep_s,data))
ListDF1.append(dataframe(numbers,rep_n,freq_n))
ListDF2.append(dataframe(stars,rep_s,freq_s))
Then for select DataFrame use indexing:
#get first DataFrame
print (ListDF1[0])
EDIT: If need export with different filenames use t variable for DF1_0.csv, DF2_0.csv, then DF1_1.csv, DF2_1.csv, ... filenames, because python counts from 0:
for t in range(len(myFiles)):
...
DF1.to_csv(f'DF1_{t}.csv')
DF2.to_csv(f'DF2_{t}.csv')
you can use microseconds from datetime, since it will be different
from datetime import datetime
for t in range(len(myFiles)):
DATA = np.array(importdata(t))
data = DATA[:,1:8]
Numbers = data[:,0:5]
Stars = data[:,5:7]
[numbers,repetitions]=(Frequence(Numbers))
rep_n,freq_n = (translate(repetitions,data))
[stars,Rep_s] = (Frequence(Stars))
rep_s,freq_s = (translate(Rep_s,data))
DF1 = dataframe(numbers,rep_n,freq_n)
DF2 = dataframe(stars,rep_s,freq_s)
DF1.to_csv(f'DF1_{datetime.now().strftime('%f')}.csv')
DF2.to_csv(f'DF2_{datetime.now().strftime('%f')}.csv')

Accessing a data frame that is generated inside for loop out side of the loop in python

I have a data frame that I generated inside for loop. I am trying to save this data frame so that I can access it outside of the loop. I have a snippet of my code below.
my_excel_sample = pd.read_excel(r"mypath\mydata.xlsx",sheet_name=None)
for tabs in my_excel_sample.keys():
actualData = pd.DataFrame(removeEmptyColumns(my_excel_sample[tabs],0))
data = replaceNanValues(actualData,0)
data = renameColumns(data,0)
data = removeFooters(data,0)
data.reset_index(drop=True, inplace=True)
data = pd.DataFrame(RowMerger(data,0))
Now I want to use data outside of the loop. Can anyone help me to solve this?
You are creating several dataframes iteratively inside for loop and storing it in variable data.
You can just add the dataframes (data) to a list and then access them anytime you want.
Try this :
my_excel_sample = pd.read_excel(r"mypath\mydata.xlsx",sheet_name=None)
final_df_list = []
for tabs in my_excel_sample.keys():
actualData = pd.DataFrame(removeEmptyColumns(my_excel_sample[tabs],0))
data = replaceNanValues(actualData,0)
data = renameColumns(data,0)
data = removeFooters(data,0)
data.reset_index(drop=True, inplace=True)
data = pd.DataFrame(RowMerger(data,0))
final_df_list.append(data)
print(final_df_list)
If you ave any type of identifier that you can use to recognize the dataframes later, I would suggest you to use a dictionary. Make the identifier as keys and variable data as value.
Here is an example where I take serial number as key :
my_excel_sample = pd.read_excel(r"mypath\mydata.xlsx",sheet_name=None)
final_df_dict = dict()
ind = 0
for tabs in my_excel_sample.keys():
actualData = pd.DataFrame(removeEmptyColumns(my_excel_sample[tabs],0))
data = replaceNanValues(actualData,0)
data = renameColumns(data,0)
data = removeFooters(data,0)
data.reset_index(drop=True, inplace=True)
data = pd.DataFrame(RowMerger(data,0))
final_df_dict[ind] = data
ind += 1
print(final_df_dict)

Creating multiple lists in for loop with dynamic names in Python

I'm trying to find out averages and standard deviation of multiple columns of my dataset and then save them as a new column in a new dataframe. i.e. for every 'GROUP' in the dataset, I want one columns in the new dataframe with its average and SD. I came up with the following script but I'm not able to name it dynamically.
Average_F1_S_list, Average_F1_M_list, SD_F1_S_list, SD_F1_M_list = ([] for i in range(4))
Groups= DF['GROUP'].unique().tolist()
for key in Groups:
Average_F1_S = DF_DICT[key]['F1_S'].mean()
Average_F1_S_list.append(Average_F1_S)
SD_F1_S = DF_DICT[key]['F1_S'].std()
SD_F1_S_list.append(SD_F1_S)
Average_F1_M = DF_DICT[key]['F1_M'].mean()
Average_F1_M_list.append(Average_F1_M)
SD_F1_M = DF_DICT[key]['F1_M'].std()
SD_F1_M_list.append(SD_F1_M)
df=pd.DataFrame({'Group':Groups,
'Average_F1_S':Average_F1_S_list,'Standard_Dev_F1_S':SD_F1_S_list,
'Average_F1_M':Average_F1_M_list,'Standard_Dev_F1_M':SD_F1_M_list},
columns=['Group','Average_F1_S','Standard_Dev_F1_S','Average_F1_M', 'Standard_Dev_F1_M'])
This will not be a good solution as there are too many features. Is there any way I can create the lists dynamically?
This should do the trick! Hope this helps
# These are all the keys you want
key_names = ['F1_S', 'F1_M']
# Holds the data you want to pass to the dataframe.
df_info = {'Groups': Groups}
for group_name in Groups:
# For each group in the groups, we iterate over all the keys we want.
for key in key_names:
# Generate a keyname that you want for your dataframe.
avg_key_name = key + '_Average'
std_key_name = key + '_Standard_Dev'
if avg_key_name not in df_info:
df_info[avg_key_name] = []
df_info[std_key_name] = []
df_info[avg_key_name].append(DF_DICT[group_name][key].mean())
df_info[std_key_name].append(DF_DICT[group_name][key].std())
df = pd.DataFrame(df_info)

work with chunked data while groupby operations are needed

I have a dataset df with three columns: 'String_key_val', 'Float_other_val1', 'Int_other_val2'. I want to groupby on key_val, then extract the sum of val1 (resp. val2) with respect to these groups. Here is my code:
df = pandas.read_csv('test.csv')
grouped = df.groupby('String_key_val')
series_calculus1 = grouped['Float_other_val1'].sum()
series_calculus2 = grouped['Int_other_val2'].sum()
res = pandas.concat([series_calculus1, series_calculus2], axis=1)
res.to_csv('output_test.csv')
My problem is: My entry dataset is 10GB and I have 4Go Ram so I need to chunk my calculus but I can't see how. I thought of using HDFStore, but since I only have to build a numerical dataset, I see no point of storing complete DataFrame, and I don't think HDFStore can store simple arrays.
What can I do?
I believe a simple approach would be something along these lines....
import pandas as pd
summary = pd.DataFrame()
chunker = pd.read_csv('test.csv',iterator=True,chunksize=50000)
for chunk in chunker:
group = chunk.groupby('String_key_val')
out = group[['Float_other_val1','Int_other_val2']].sum()
summary = summary.append(out)
summary = summary.reset_index()
group = summary.groupby('String_key_val')
summary = group[['Float_other_val1','Int_other_val2']].sum()

Categories