How to handle looping twice in data frame using python and pandas - python

Here's what the data frame looks like,
and here's the data in case you want to try it:
link data from Kaggle
And then here's the code:
total_data = len(data_high)
for column in data_high:
    data_type = data_high[column].value_counts()
    for y in np.array(data_type.index):
        for x in data_type.values:
            average = round((x / total_data) * 100, 2)
            print(y, average)
And then here's the result.
I wanted to loop over each value in each column, so I used nested looping, but here's the problem: the indexes are looped twice. How could I handle this?

A set will only contain unique items. If you put your indices into a set, then it will only process each unique index once.
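Another way to avoid the repeated indexes is to skip the nested loop entirely and iterate over the (value, count) pairs that value_counts() already provides; a minimal sketch, where data_high is a hypothetical stand-in for the questioner's DataFrame:
import pandas as pd

# hypothetical stand-in for the DataFrame from the question
data_high = pd.DataFrame({"grade": ["A", "B", "A", "C", "B", "A"]})

total_data = len(data_high)
for column in data_high:
    counts = data_high[column].value_counts()
    # items() yields one (value, count) pair per unique value, so nothing repeats
    for value, count in counts.items():
        average = round((count / total_data) * 100, 2)
        print(value, average)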

Related

Unable to loop through Dataframe rows: Length of values does not match length of index

I'm not entirely sure why I am getting this error as I have a very simple dataframe that I am currently working with. Here is a sample of the dataframe (the date column is the index):
date        News
2021-02-01  This is a news headline. This is a news summary.
2021-02-02  This is another headline. This is another summary
So basically, all I am trying to do is loop through the dataframe one row at a time and pull the News item, use the Sentiment Intensity Analyzer on it and store the compound value into a separate list (which I am appending to an empty list). However, when I run the loop, it gives me this error:
Length of values (5085) does not match the length of index (2675)
Here is a sample of the code that I have so far:
sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(0, (df_news.shape[0] - 1)):
    n = df_news.iloc[i][0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df['News Sentiment'] = news_sentiment_list
I've tried the loop a number of different ways using the FOR loop, and I always get that error. I am honestly lost at this point =(
edit: The shape of the dataframe is: (5087, 1)
The target dataframe is df whereas you loop over df_news; the indexes are probably not the same. You might need to merge the dataframes before doing so.
Moreover, there is an easier approach to your problem that avoids having to loop at all. Assuming your dataframe df_news holds the column News (as shown in your table), you can add a column to this dataframe simply by doing:
sia = SentimentIntensityAnalyzer()
df_news['News Sentiment'] = df_news['News'].apply(lambda x: sia.polarity_scores(x)['compound'])
A general rule when using pandas is to avoid for-loops as much as possible; except for very specific edge cases, pandas' built-in methods will be sufficient.
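Put together, a runnable sketch of that approach (assuming SentimentIntensityAnalyzer comes from NLTK's VADER module, and using a two-row stand-in for df_news):
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # assumed source; nltk.download('vader_lexicon') may be needed first

# hypothetical miniature version of df_news
df_news = pd.DataFrame(
    {"News": ["This is a news headline. This is a news summary.",
              "This is another headline. This is another summary."]},
    index=pd.to_datetime(["2021-02-01", "2021-02-02"]),
)

sia = SentimentIntensityAnalyzer()
df_news["News Sentiment"] = df_news["News"].apply(lambda x: sia.polarity_scores(x)["compound"])
print(df_news)
Because the result is assigned back onto df_news itself, the lengths always match and the original error cannot occur.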

Remove rows from dataframe whose text does not contain items from a list

I am importing data from a table with inconsistent naming conventions. I have created a list of manufacturer names that I would like to use as a basis of comparison against the imported name. Ideally, I will delete all rows from the dataframe that do not align with the manufacturer list. I am trying to create an index vector using a for loop to iterate through each element of the dataframe column and compare against the list. If the text is there, update my index vector to true. If not, index vector is updated to false. Finally, I want to use the index vector to drop rows from the original data frame.
I have tried generators and sets, but to no avail. I thought a for loop would be less elegant but ultimately work, yet I'm still stuck. My code is below.
meltdat.Products is my dataframe column that contains the imported data
mfgs is my list of manufacturer names
prodex is my index vector
meltdat = pd.DataFrame(
    {"Location": ["S1", "S1", "S1", "S1", "S1"],
     "Date": ["1/1/2020", "1/1/2020", "1/1/2020", "1/1/2020", "1/1/2020"],
     "Products": ['CC304RED', 'COHoney', 'EtainXL', 'Med467', 'MarysTop'],
     "Sold": [1, 3, 0, 1, 2]})
mfgs = ['CC', 'Etain', 'Marys']
for prods in meltdat.Products:
    if any(mfg in meltdat.Products[prods] for mfg in mfgs):
        prodex[prods] = TRUE
    else:
        prodex[prods] = FALSE
I added example data in the dataframe that mirrors my imported data.
You can use pd.Series.apply:
meltdat[meltdat.Products.apply(lambda x: any(m in x for m in mfgs))]
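Applied to the example data from the question, that one-liner does the job of the index vector described there (prodex is just a hypothetical name for the boolean mask):
import pandas as pd

meltdat = pd.DataFrame(
    {"Location": ["S1", "S1", "S1", "S1", "S1"],
     "Date": ["1/1/2020", "1/1/2020", "1/1/2020", "1/1/2020", "1/1/2020"],
     "Products": ["CC304RED", "COHoney", "EtainXL", "Med467", "MarysTop"],
     "Sold": [1, 3, 0, 1, 2]})
mfgs = ["CC", "Etain", "Marys"]

# True where the product name contains any manufacturer string
prodex = meltdat.Products.apply(lambda x: any(m in x for m in mfgs))
print(meltdat[prodex])  # keeps the CC304RED, EtainXL and MarysTop rows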

How to iterate through multiple data frames in a dictionary

I've created a dict that appears to consist of multiple data frames parsed by location. However, when I try to iterate through the dict to run correlations by location, it appears as if it's running a correlation on the entire set.
I have split the data frame by location (Store_ID), and the loop will print each Store_ID, but because the correlations are exactly the same in each iteration, I suspect it's just using the entire dataset and not iterating through the data frames in the dict.
I started with:
stores = df.Store_ID.unique()
storedict = {elem: pd.DataFrame() for elem in stores}
for key in storedict.keys():
    storedict[key] = df[:][df.Store_ID == key]
np.array(storedict) prints the array, grouped by each location.
But this loop (below), though it iterates through stores when it prints, seems to return the same correlation coefficients as though it's just repeating the Pearson correlation on the entire set of locations (stores).
What I'm trying to do is have it show, e.g. the Store ID and the correlation matrix for the data associated with that Store ID, then the next Store ID and its correlation matrix, and so on...
I must be missing something idiotically obvious here. What is it?
EDIT:
So when I run:
for store in stores:
    print("\r")
    print(store)
    pd.set_option('display.width', 100)
    pd.set_option('precision', 3)
    correlations = data.corr(method='pearson')
    print(correlations)
I get the same list of correlations. I wonder if it's because data is defined globally as:
data = df.drop(['datestring'], axis=1)
data.index = df.datestring
values = data.values
I think data.corr is ignoring the loop and looking at the original dataframe. How do I define correlations so that it runs iteratively on the data for each store, not all stores? Again, what I wanted to do was iteratively split the one data frame into many and run a correlation on each store as a separate data frame (or however else is easiest to get it working without multiplying the volume of code for something that could be looped).
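One way to get a separate correlation matrix per store is to group the original frame by Store_ID and call corr on each group, instead of on the globally defined data; a sketch with made-up columns (Sales and Visitors stand in for the real numeric columns):
import pandas as pd

# hypothetical data; Sales and Visitors stand in for the real numeric columns
df = pd.DataFrame({
    "Store_ID": [1, 1, 1, 2, 2, 2],
    "Sales": [10, 12, 15, 30, 28, 31],
    "Visitors": [100, 110, 140, 300, 290, 320],
})

# correlate within each store rather than over the whole frame
for store_id, store_df in df.groupby("Store_ID"):
    print(store_id)
    print(store_df.drop(columns="Store_ID").corr(method="pearson"))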

pandas inserting rows in a monotonically increasing dataframe using itertuples

I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the data of the packets, formatted as ASCII representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there is missing data, insert new rows to make the list monotonically increasing, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to get the next row, compare it with the current row, and add a new row if the difference is more than 100, but unfortunately I've struggled with this since there doesn't seem to be a good way to retrieve the row after the current one.
2) Is there a better (faster) way to do this than the one I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check monotonicity directly with df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want: it is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive, so you may need to go over your data in chunks. In that case, you may need to provide the index range ahead of time.
import pandas as pd
from io import StringIO  # pd.compat.StringIO in the original; it was removed from newer pandas versions

# test data
df = pd.read_csv(
    StringIO(
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r" +",
    engine="python",  # regex separators are handled by the python engine
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
#                     frame_time_epoch
# 303030303030303000       1.52799e+09
# 303030303030303100       1.52799e+09
# 303030303030303200       1.52799e+09
# 303030303030303300       1.52799e+09
# 303030303030303400               NaN
# 303030303030303500       1.52799e+09
Just for examination: did you try something like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]
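Applied to the sample data above (where 303030303030303400 is missing), a diff larger than the expected step of 100 flags the row right after the gap; a small sketch:
import pandas as pd

data = pd.Series([303030303030303000, 303030303030303100,
                  303030303030303200, 303030303030303300,
                  303030303030303500])
delta = data.diff()
# a step larger than 100 means at least one value is missing just before this row
print(data[delta > 100])  # flags 303030303030303500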

Store grouped data with variable

I have a general question about pandas. I have a DataFrame named d with a lot of info on parks. All unique park names are stored in an array called parks. There's another column with a location ID, and I want to iterate through the parks array and print the unique location ID counts associated with each park name.
d[d['Park']=='AKRO']
len(d['Location'].unique())
gives me a count of 24824.
x = d[d['Park']=='AKRO']
print(len(x['Location'].unique()))
gives me a location count of 1. Why? I thought these are the same except I am storing the info in a variable.
So naturally the loop I was trying doesn't work. Does anyone have any tips?
counts = []
for p in parks:
    x = d[d['Park'] == p]
    y = len(x['Location'].unique())
    counts.append([p, y])
You can try something like,
d.groupby('Park')['Location'].nunique()
When you subset the first time, you're not assigning d[d['Park'] == 'AKRO'] to anything. So you haven't actually changed the data. You only viewed that section of the data.
When you assign x = d[d['Park']=='AKRO'], x is now only that section that you viewed with the first command. That's why you get the difference you are observing.
Your for loop is actually only looping through the columns of d. If you wish to loop through the rows, you can use the following.
for idx, row in d.iterrows():
    print(idx, row)
However, if you want to count the number of unique locations with a for loop, you have to loop through each park. Something like the following.
for park in d['Park'].unique():
    print(park, d.loc[d['Park'] == park, 'Location'].nunique())
You can accomplish your goal without iteration, however. This sort of approach is preferred.
d.groupby('Park')['Location'].nunique()
Be careful about which pandas DataFrame operations produce an in-place change and which do not. For example, d[d['Park']=='AKRO'] doesn't actually change the DataFrame d. However, x = d[d['Park']=='AKRO'] assigns the output of d[d['Park']=='AKRO'] to x, so x now only has 1 Location.
Have you manually checked how many unique Location IDs exist for 'AKRO'? The for loop looks correct apart from the extra brackets around y = len(x['Location'].unique()).
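For completeness, the groupby approach from the answers gives the per-park counts in one call; a sketch on a tiny hypothetical version of d:
import pandas as pd

# hypothetical miniature version of d (BELA is a made-up second park)
d = pd.DataFrame({
    "Park": ["AKRO", "AKRO", "BELA", "BELA", "BELA"],
    "Location": ["L1", "L1", "L2", "L3", "L4"],
})

# unique Location count per Park, no loop needed
print(d.groupby("Park")["Location"].nunique())
# AKRO    1
# BELA    3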
