Printing and counting unique values from an .xlsx file - python

I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program that doesn't use any functions. I understand how to count through an unknown column range and output the quantity. However, for this program, I'm trying to loop through a column, picking out the unique numbers and counting how often each one appears.
So I have an Excel file with random numbers down column A. I only put in 20 numbers, but let's pretend the range is unknown. How would I go about extracting the unique numbers and writing them into a separate column along with how many times each appeared in the list?
I'm not really sure how to go about this. :/
import xlwings as xw

counts = {}  # value -> how many times it appears in column A
row = 1
while xw.Range((row, 1)).value is not None:
    value = xw.Range((row, 1)).value
    counts[value] = counts.get(value, 0) + 1
    row += 1

I presume, since you can't use functions, this may be homework... so, at a high level:
You could first go through the column and put all the values in a list.
Second, take the first value from the list and go through the rest of the list: does it appear again? If so, it is not unique; remove the duplicate you found from the list, and keep going, removing any further duplicates you find too.
Then take the second value, and so on.
You would just need a list comprehension, some loops, and perhaps .pop().
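For illustration, a minimal sketch of that approach, assuming the column values have already been read into a list (the sample numbers here are made up):
values = [5, 3, 5, 7, 3, 5]  # stand-in for the values read from column A

counts = []  # will hold [value, frequency] pairs
while values:
    current = values.pop(0)   # take the first remaining value
    frequency = 1
    while current in values:  # scan the rest of the list for duplicates
        values.remove(current)
        frequency += 1
    counts.append([current, frequency])

print(counts)  # [[5, 3], [3, 2], [7, 1]]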

Using the pandas library would be the easiest way to do this. I created a sample Excel sheet having only one column, called "Random_num".
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
# e.g.: data['your_column'].value_counts()
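If you then want the unique values and their frequencies written to a separate column, as the question asks, one possible follow-up (the output filename here is made up):
counts = data['Random_num'].value_counts()  # unique values as the index, frequencies as the values
counts.to_excel("counts.xlsx")              # hypothetical output file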
Thanks

Related

Unable to loop through Dataframe rows: Length of values does not match length of index

I'm not entirely sure why I am getting this error as I have a very simple dataframe that I am currently working with. Here is a sample of the dataframe (the date column is the index):
date       | News
-----------|---------------------------------------------------
2021-02-01 | This is a news headline. This is a news summary.
2021-02-02 | This is another headline. This is another summary.
So basically, all I am trying to do is loop through the dataframe one row at a time, pull the News item, run the Sentiment Intensity Analyzer on it, and append the compound value to a separate list. However, when I run the loop, it gives me this error:
Length of values (5085) does not match the length of index (2675)
Here is a sample of the code that I have so far:
sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(0, (df_news.shape[0]-1)):
    n = df_news.iloc[i][0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df['News Sentiment'] = news_sentiment_list
I've tried the loop statement a number of different ways using the FOR loop, and I always return that error. I am honestly lost at this point =(
edit: The shape of the dataframe is: (5087, 1)
The target dataframe is df, whereas you loop over df_news; the indexes are probably not the same. You might need to merge the dataframes before doing so.
Moreover, there is an easier approach to your problem that avoids looping entirely. Assuming your dataframe df_news holds the column News (as shown in your table), you can add a column to this dataframe simply by doing:
sia = SentimentIntensityAnalyzer()
df_news['News Sentiment'] = df_news['News'].apply(lambda x: sia.polarity_scores(x)['compound'])
A general rule when using pandas is to avoid for-loops as much as possible; except for very specific edge cases, pandas' built-in methods will be sufficient.
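If you do need the scores on df rather than df_news, a minimal sketch of the index-alignment idea mentioned above (the toy frames and column names here are made up):
import pandas as pd

# toy frames standing in for df and df_news, sharing a date index
df = pd.DataFrame({'price': [10, 11]},
                  index=pd.to_datetime(['2021-02-01', '2021-02-02']))
df_news = pd.DataFrame({'News Sentiment': [0.4, -0.2]},
                       index=pd.to_datetime(['2021-02-01', '2021-02-02']))

# align on the shared index instead of assigning a bare list
df = df.join(df_news['News Sentiment'], how='left')
print(df)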

Best way to fuzzy match values in a data frame and then replace the value?

I'm working with a dataframe containing various customer datapoints. I'm looking to replace any junk phone numbers with a blank value, but right now I'm struggling to find an efficient way to find potential junk values, such as a phone number like 111-111-1111, and replace each one with a blank entry.
I currently have a fairly ugly solution where I go through three fields (home phone, cell phone, and work phone), locate the index values of the rows and columns in question, and then replace those values.
With regards to actually finding junk values in a dataframe, is there a better approach than what I am currently doing?
row_index = dataset[dataset['phone'].str.contains('11111')].index
column_index = dataset.columns.get_loc('phone')
Afterwards, I would zip these up and cycle through a for loop, using dataset.iat[row_index, column_index] = ''. The row and column index variables would also have the junk values in the 'cellphone' and 'workphone' columns appended on as well.
Pandas 'where' function tends to be quick:
dataset['phone'] = dataset['phone'].where(~dataset['phone'].str.contains('11111'), None)
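To cover all three phone fields at once, a sketch along the same lines (the column names and sample data here are assumptions, not from the original post):
import pandas as pd

dataset = pd.DataFrame({
    'home_phone': ['111-111-1111', '555-867-5309'],
    'cell_phone': ['555-123-4567', '111-111-1111'],
    'work_phone': ['111-111-1111', None],
})

# hypothetical column names; na=False treats missing values as "not junk"
for col in ['home_phone', 'cell_phone', 'work_phone']:
    dataset[col] = dataset[col].where(~dataset[col].str.contains('11111', na=False), '')

print(dataset)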

Subset one dataframe from a second

I am sure I am missing a simple solution but I have been unable to figure this out, and have yet to find the answer in the existing questions. (If it is not obvious, I am a hack and just learning Python)
Let's say I have two data frames (DataFileDF, SelectedCellsRaw) with the same two key fields (MRBTS, LNCEL), and I want a subset of the first data frame (DataFileDF) containing only the rows whose key pairs also appear in the second data frame,
e.g. rows of DataFileDF with keys that correspond to the keys of SelectedCellsRaw.
Note this needs to match by key pair MRBTS + LNCEL not each key individually.
I tried:
SelectedCellsRaw = DataFileDF.loc[DataFileDF['MRBTS'].isin(SelectedCells['MRBTS']) & DataFileDF['LNCEL'].isin(SelectedCells['LNCEL'])]
I get the MRBTS's, but also every occurrence of LNCEL (it has a possible range of 0-9 so there are many duplicates throughout the data set).
One way you could do this is to use isin with indexes:
joincols = ['MRBTS','LNCEL']
DataFileDF[DataFileDF.set_index(joincols).index.isin(SelectedCellsRaw.set_index(joincols).index)]
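An alternative sketch that also matches on the key pair, using an inner merge (the sample data here is invented):
import pandas as pd

DataFileDF = pd.DataFrame({'MRBTS': [1, 1, 2], 'LNCEL': [0, 1, 0], 'val': ['a', 'b', 'c']})
SelectedCellsRaw = pd.DataFrame({'MRBTS': [1, 2], 'LNCEL': [1, 0]})

joincols = ['MRBTS', 'LNCEL']
# an inner merge keeps only rows whose (MRBTS, LNCEL) pair appears in both frames
subset = DataFileDF.merge(SelectedCellsRaw[joincols].drop_duplicates(), on=joincols)
print(subset)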

Store grouped data with variable

I have a general question about pandas. I have a DataFrame named d with a lot of info on parks. All unique park names are stored in an array called parks. There's another column with a location ID and I want to iterate through the parks array and print unique location ID counts associated with that park name.
d[d['Park']=='AKRO']
len(d['Location'].unique())
gives me a count of 24824.
x = d[d['Park']=='AKRO']
print(len(x['Location'].unique()))
gives me a location count of 1. Why? I thought these are the same except I am storing the info in a variable.
So naturally the loop I was trying doesn't work. Does anyone have any tips?
counts = []
for p in parks:
    x = d[d['Park'] == p]
    y = len(x['Location'].unique())
    counts.append([p, y])
You can try something like:
d.groupby('Park')['Location'].nunique()
When you subset the first time, you're not assigning d[d['Park'] == 'AKRO'] to anything, so you haven't actually changed the data. You only viewed that section of the data.
When you assign x = d[d['Park']=='AKRO'], x is now only that section that you viewed with the first command. That's why you get the difference you are observing.
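A tiny demonstration of that difference (with made-up data):
import pandas as pd

d = pd.DataFrame({'Park': ['AKRO', 'YOSE', 'YOSE'],
                  'Location': ['L1', 'L2', 'L3']})

d[d['Park'] == 'AKRO']              # computed and thrown away; d is unchanged
print(len(d['Location'].unique()))  # 3 -- still counts over all of d

x = d[d['Park'] == 'AKRO']          # the filtered frame is kept in x
print(len(x['Location'].unique()))  # 1 -- counts only the AKRO rows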
Your for loop is actually only looping through the columns of d. If you wish to loop through the rows, you can use the following.
for idx, row in d.iterrows():
    print(idx, row)
However, if you want to count the number of locations with a for loop, you have to loop through each park. Something like the following.
for park in d['Park'].unique():
    print(park, d.loc[d['Park'] == park, 'Location'].nunique())
You can accomplish your goal without iteration, however. This sort of approach is preferred.
d.groupby('Park')['Location'].nunique()
Be careful about which pandas DataFrame operations change the frame in place and which do not. For example, d[d['Park']=='AKRO'] doesn't actually change the DataFrame d. However, x = d[d['Park']=='AKRO'] assigns the output of d[d['Park']=='AKRO'] to x, so x now only has 1 Location.
Have you manually checked how many unique Location IDs exist for 'AKRO'? The for loop looks correct, aside from the extra brackets around y = len(x['Location'].unique()).

Using xlrd to get list of excel values in python

I am trying to read a list of values in a column in an excel sheet. However, the length of the column varies every time, so I don't know the length of the list. How do I get python to read all the values in a column and stop when the cells are empty using xlrd?
for i in range(worksheet.nrows):
will iterate through all the rows in the worksheet.
If you were interested in column 0, for example:
c0 = [worksheet.row_values(i)[0] for i in range(worksheet.nrows) if worksheet.row_values(i)[0]]
Or, even better, make this a generator:
column_generator = (worksheet.row_values(i)[0] for i in range(worksheet.nrows))
Then you can use itertools.takewhile for lazy evaluation; it will stop at the first empty value, which gives better performance if you only need the values up to the first empty cell.
from itertools import takewhile
print(list(takewhile(str.strip, column_generator)))
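Putting it together, a minimal end-to-end sketch (the filename is made up, and the predicate is swapped for one that also tolerates numeric cells; note that xlrd 2.x dropped .xlsx support, so this assumes an older xlrd or an .xls file):
from itertools import takewhile

import xlrd

workbook = xlrd.open_workbook("sample.xls")  # hypothetical filename
worksheet = workbook.sheet_by_index(0)

# lazily yield column-A values, stopping at the first empty cell
column_generator = (worksheet.cell_value(i, 0) for i in range(worksheet.nrows))
values = list(takewhile(lambda v: v not in ('', None), column_generator))
print(values)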
