How to iterate through multiple data frames in a dictionary - python

I've created a dict that appears to consist of multiple data frames parsed by location. However, when I try to iterate through the dict to run correlations by location, it appears as if it's running a correlation on the entire set.
I have split the data frame by location (Store_ID), and the loop will print each Store_ID, but because the correlations are exactly the same in each iteration, I suspect it's just using the entire dataset and not iterating through the data frames in the dict.
I started with:
stores = df.Store_ID.unique()
storedict = {elem: pd.DataFrame() for elem in stores}
for key in storedict.keys():
    storedict[key] = df[df.Store_ID == key]
np.array(storedict) prints the array, grouped by each location.
But this loop (below), though it iterates through stores when it prints, seems to return the same correlation coefficients as though it's just repeating the Pearson correlation on the entire set of locations (stores).
What I'm trying to do is have it show, e.g. the Store ID and the correlation matrix for the data associated with that Store ID, then the next Store ID and its correlation matrix, and so on...
I must be missing something idiotically obvious here. What is it?
EDIT:
So when I run:
for store in stores:
    print("\r")
    print(store)
    pd.set_option('display.width', 100)
    pd.set_option('precision', 3)
    correlations = data.corr(method='pearson')
    print(correlations)
I get the same list of correlations. I wonder if it's because data is defined globally as:
data = df.drop(['datestring'], axis=1)
data.index = df.datestring
values = data.values
I think data.corr is ignoring the loop variable and looking at the original dataframe. How do I define correlations so that it runs iteratively on the "data" from each store, not all stores? Again, what I wanted to do was iteratively split the one data frame into many and run a correlation on each store as a separate data frame (or whatever else is easiest to get it to work without multiplying the volume of code to tackle something that could be looped).
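In case it helps, a minimal sketch of the loop computing a separate correlation per store, using the storedict built above (and assuming each sub-frame still carries the datestring column):

pd.set_option('display.width', 100)
for store, store_df in storedict.items():
    print(store)
    # drop the non-numeric date column and correlate this store's rows only
    correlations = store_df.drop(['datestring'], axis=1).corr(method='pearson')
    print(correlations)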

Related

How to handle looping twice in data frame using python and pandas

Here's what the data frame looks like.
This is the data, in case you want to try it:
link data from Kaggle
And then here's the code:
total_data = len(data_high)
for column in data_high:
    data_type = data_high[column].value_counts()
    for y in np.array(data_type.index):
        for x in data_type.values:
            average = round((x / total_data) * 100, 2)
            print(y, average)
And then here's the result. I wanted to loop over each value in each column, so I used nested loops, but here's the problem: the indices are looped twice. How can I handle this?
A set will only contain unique items. If you put your indices into a set, then it will only do each unique index once.
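Alternatively, a minimal sketch that sidesteps the duplication: value_counts() already pairs each unique value with its count, so iterating those (value, count) pairs with .items() avoids the nested loop altogether (assuming the same data_high frame):

total_data = len(data_high)
for column in data_high:
    counts = data_high[column].value_counts()
    # .items() yields one (value, count) pair per unique value,
    # so nothing is visited twice
    for value, count in counts.items():
        average = round((count / total_data) * 100, 2)
        print(value, average)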

Create Dataframe for What-If Visualization by Resorting Data

I'm trying to create an experimental dataframe (to be used only for a comparative viz) from other correlated dataframes, and then sort each column's values independently of the others, to later visualize what correlated data should look like (because my current data actually shows no correlation).
experimental_df = full_df[['Token_Rarity','Price_USD']]
Becomes:
experimental_df.sort_values(by=['Token_Rarity','Price_USD'],ascending=[True,True])
I'm trying to get the lowest value in the token column and the lowest value in the price column, or vice versa, regardless of any other values or arguments. Current result:
Please try this:
experimental_df = pd.DataFrame()
# reset both indices so the independently sorted columns align by position
# (otherwise the second assignment re-aligns on the shuffled index)
experimental_df['Token_Rarity'] = full_df['Token_Rarity'].sort_values().reset_index(drop=True)
experimental_df['Price_USD'] = full_df['Price_USD'].sort_values().reset_index(drop=True)
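Equivalently, a small sketch using NumPy to sort every column independently in one step (assuming the same full_df):

import numpy as np
import pandas as pd

cols = ['Token_Rarity', 'Price_USD']
# np.sort along axis=0 sorts each column on its own, deliberately
# breaking the row-wise pairing for the what-if visualization
experimental_df = pd.DataFrame(np.sort(full_df[cols].to_numpy(), axis=0), columns=cols)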

Append std,mean columns to a DataFrame with a for-loop

I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)
I use a for-loop and append(), but it returns only the last row, not the whole set.
here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly, reads an individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}, ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to a variable (and create final once, before the loop, so it isn't re-created empty on every iteration):
final = final.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to collect your desired values ({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}) in a list and assign that list to the dataframe at the end; this is said to have better performance. (Thanks to @stefan_aus_hannover.)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    lister.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
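(Note: on pandas 2.0 and later, DataFrame.append() has been removed; final = pd.concat([final, pd.DataFrame(lister)], ignore_index=True) is the drop-in replacement.)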
Conceptually you're just doing an aggregate by hour with the two functions std and mean, then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that the .agg()/.aggregate() method accepts a list of aggregating functions and returns one result per function, so there is no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for hour in hh:
    # Read in the columns of interest from the individual Excel sheet
    # for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Aggregate the column of interest; .agg() with a list of function
    # names returns one value per function...
    agg = data['Total Load (MWh)'].agg(['std', 'mean'])
    # Build the corresponding row of the result and append it...
    row = {'Month': 1, 'Hour': hour, 'standard deviation': agg['std'], 'average': agg['mean']}
    final = final.append(row, ignore_index=True)
Notes:
pd.read_excel's usecols parameter (e.g. usecols=['Flowday','Interval',...]) lets you avoid reading in columns you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass in the list of columns of interest. That said, you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just keep one results dataframe, final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
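To illustrate the usecols point, a small sketch (the file name is hypothetical, and it assumes the sheet has a header row containing that column name):

import pandas as pd

# hypothetical file; usecols skips every other column at read time
data = pd.read_excel('demand_january.xlsx', usecols=['Total Load (MWh)'])
print(data['Total Load (MWh)'].agg(['std', 'mean']))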

pandas inserting rows in a monotonically increasing dataframe using itertuples

I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe, containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the packet data, formatted as ASCII representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there is missing data, insert new rows to make the list monotonically increasing, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to compare the current row with the next one and, if the difference is more than 100, add a new row. Unfortunately I've struggled with this, since there doesn't seem to be a good way to retrieve the row after the current one.
2) Is there a better (faster) way to do this than the one I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check monotonicity directly with df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want; it is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory-intensive, so you may need to go over your data in chunks. In that case, you may need to provide the index range ahead of time.
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0

# test data
df = pd.read_csv(
    StringIO(
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r" +",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
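Equivalently, the blank-frame-plus-update step can be collapsed into a single reindex() call:

# reindex() inserts a NaN row for every index value missing from the data
df2 = df.set_index("data").reindex(rng)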
Just for examination: did you try something like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]

How can I implement functions like mean, median and variance if I have a dictionary with 2 keys in Python?

I have many files in a folder, each like this one:
[file screenshot]
I'm trying to build a dictionary for the data, with 2 keys (the first one is the HTTP address and the second is the third field, the plugin used, like adblock). The values refer to different metrics, so my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For the mean, for example, my intention is to consider all the 4th-field values in the file, etc. I tried to write this code but, first of all, I'm not sure that it is correct:
[code screenshot]
I read other posts, but none solved my problem, since they treat only one key, or they don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming that the dictionary implementation is OK, how must I access the different values for key1: www.google.it -> key2: adblock?
Any kind of help is appreciated, and I'm happy to provide any other details.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around tabular data structure called "DataFrame" that excels in column-wise and row-wise calculations such as the one that you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() method. This method takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following line to create a DataFrame from all *.txt files in your directory:
import glob

df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
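With everything in one DataFrame, the per-site, per-plugin statistics asked for in the question can be computed in one step with groupby(); a sketch, assuming the site is column 0 and the plugin is column 2 as in the example file:

# group by site (column 0) and plugin (column 2), then compute the
# mean, median and variance of the fourth column for every pair at once
stats = df.groupby([0, 2])[3].agg(['mean', 'median', 'var'])
print(stats)
# a single (site, plugin) pair can then be looked up by its key:
print(stats.loc[('www.google.it', 'adblock')])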
