Using xlrd to get a list of Excel values in Python

I am trying to read a list of values in a column in an excel sheet. However, the length of the column varies every time, so I don't know the length of the list. How do I get python to read all the values in a column and stop when the cells are empty using xlrd?

for i in range(worksheet.nrows):

will iterate through all the rows in the worksheet. If you were interested in column 0, for example:

c0 = [worksheet.row_values(i)[0] for i in range(worksheet.nrows) if worksheet.row_values(i)[0]]

Or, even better, make this a generator:

column_generator = (worksheet.row_values(i)[0] for i in range(worksheet.nrows))

Then you can use itertools.takewhile for lazy evaluation, which stops at the first empty cell. This gives better performance if you just want to stop once you hit the first empty value:

from itertools import takewhile
print(list(takewhile(str.strip, column_generator)))
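Since the generator above needs an open workbook, here is a minimal stand-in (a plain list takes the place of the column values) showing how takewhile stops at the first empty cell:

```python
from itertools import takewhile

# Stand-in for the worksheet column: values as xlrd would return them,
# with empty cells part-way down.
column_values = ["alpha", "beta", "gamma", "", "", "delta"]

# takewhile yields values while the predicate is truthy, so iteration
# stops at the first empty string -- later values are never touched.
non_empty = list(takewhile(str.strip, column_values))
print(non_empty)  # ['alpha', 'beta', 'gamma']
```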

Related

How do I modify a pd.DataFrame using a for loop

I have a problem with iterative modification of a DataFrame (h0). I need to replace strings in my DataFrame, and I have a table of replacements (two columns: the first holds the string to be replaced, the second the new value that should go into the DataFrame; this is the zv DataFrame).
I want to replace 584 different strings in my DataFrame, so I am trying to use a for loop.
I've tried this:
h0 = pd.DataFrame(pd.read_csv('C:/blablabla/vyberhp.csv', delimiter=';'))
zv = pd.DataFrame(pd.read_csv('C:/blablabla/vycuc2.csv', delimiter=';'))
zv_dlzka = len(zv.index)
for i in range(zv_dlzka):
    h1 = h0.replace(zv.at[i, 'stary_kod'], zv.at[i, 'kod'], regex=True)
print(h1)
The result is that I see only the last iteration's effect (only the string from the last row of my replacement list is replaced).
I know where the problem is. It's here:

h1 = h0.replace(zv.at[i, 'stary_kod'], zv.at[i, 'kod'], regex=True)

because the loop always starts again from the original DataFrame (h0), but I have no idea how to fix it.
What are the ways to make the for loop work on the modified DataFrame (not h0) in each iteration?
Sorry if this is a basic question, I am quite new to coding.
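One way to carry each replacement forward is to apply replace to the running result rather than to h0 every time. A minimal sketch with made-up stand-in data (the original CSV files aren't available, so the column names here mirror the question but the values are invented):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files from the question
h0 = pd.DataFrame({"col": ["old_a x", "old_b y"]})
zv = pd.DataFrame({"stary_kod": ["old_a", "old_b"],
                   "kod": ["new_a", "new_b"]})

h1 = h0  # start from the original once
for i in range(len(zv.index)):
    # reassigning h1 keeps earlier replacements visible in later iterations
    h1 = h1.replace(zv.at[i, "stary_kod"], zv.at[i, "kod"], regex=True)
print(h1)
```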

Printing and counting unique values from an .xlsx file

I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program without using any functions. I understand how to count through an unknown column range and output the quantity. However, for this program I'm trying to loop through a column, picking out the unique numbers and counting their frequencies.
So I have an excel file with random numbers down column A. I only put in 20 numbers but let's pretend the range is unknown. How would I go about extracting the unique numbers and inputting them into a separate column along with how many times they appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique, 1)).value != None:
    frequency = 0
    if unique != unique: break
    quantity += 1
"end"
I presume, as you can't use functions, this may be homework... so, high level:
You could first go through the column and put all the values in a list.
Secondly, take the first value from the list and go through the rest of the list: is it in there? If so, it is not unique. Remove the duplicate you found from the list; if you find another, remove that too, and keep going.
Then take the second value, and so on.
You would just need list comprehensions, some loops, and perhaps .pop()
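The steps above can be sketched with two parallel lists, one for the values seen so far and one for their counts (the values here are invented; in the real program they would come from the worksheet column):

```python
# Hypothetical column values read from the sheet
values = [3, 7, 3, 1, 7, 3]

uniques = []
counts = []
for v in values:
    if v in uniques:
        # already seen: bump the count at the matching position
        counts[uniques.index(v)] += 1
    else:
        # first sighting: record the value with a count of 1
        uniques.append(v)
        counts.append(1)

# uniques and counts line up index-by-index, ready to write to a column
for u, c in zip(uniques, counts):
    print(u, c)
```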
Using the pandas library would be the easiest way to do this. I created a sample Excel sheet having only one column called "Random_num":
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
#eg: data['your_column'].value_counts()
Thanks

Unable to loop through Dataframe rows: Length of values does not match length of index

I'm not entirely sure why I am getting this error as I have a very simple dataframe that I am currently working with. Here is a sample of the dataframe (the date column is the index):
date        News
2021-02-01  This is a news headline. This is a news summary.
2021-02-02  This is another headline. This is another summary.
So basically, all I am trying to do is loop through the dataframe one row at a time, pull the News item, run the Sentiment Intensity Analyzer on it, and append the compound value to an initially empty list. However, when I run the loop, it gives me this error:
Length of values (5085) does not match the length of index (2675)
Here is a sample of the code that I have so far:
sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(0, (df_news.shape[0]-1)):
    n = df_news.iloc[i][0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df['News Sentiment'] = news_sentiment_list
I've tried the loop statement a number of different ways using the FOR loop, and I always return that error. I am honestly lost at this point =(
edit: The shape of the dataframe is: (5087, 1)
The target dataframe is df, whereas you loop over df_news; the indexes are probably not the same. You might need to merge the dataframes first.
Moreover, there is an easier approach to your problem that avoids looping entirely. Assuming your dataframe df_news holds the column News (as shown in your table), you can add a column to this dataframe simply by doing:
sia = SentimentIntensityAnalyzer()
df_news['News Sentiment'] = df_news['News'].apply(lambda x: sia.polarity_scores(x)['compound'])
A general rule when using pandas is to avoid for-loops as much as possible; except in very specific edge cases, pandas' built-in methods will be sufficient.
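The apply pattern works with any scoring function. Here is a self-contained sketch where a trivial made-up scorer stands in for SentimentIntensityAnalyzer (which needs the nltk vader lexicon installed); only the shape of the call matters:

```python
import pandas as pd

df_news = pd.DataFrame(
    {"News": ["good news today", "bad news today"]},
    index=["2021-02-01", "2021-02-02"],
)

# Stand-in for sia.polarity_scores(x)['compound']: +1 if "good" appears,
# -1 if "bad" appears, 0 otherwise. Purely illustrative.
def fake_compound(text):
    return ("good" in text) - ("bad" in text)

# Same one-liner as the answer, with the stand-in scorer
df_news["News Sentiment"] = df_news["News"].apply(fake_compound)
print(df_news)
```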

Writing a whole list in a CSV row-Python/IronPython

Right now I have several long lists. One is called variable_names.
Let's say variable_names = [Velocity, Density, Pressure, ...] (length is 50+).
I want to write a row that takes each value of the list, leaves about 5 empty cells, then writes the next value, and keeps doing so until the list is done.
As shown in the row 1 sample picture.
The thing is, I can't use xlrd due to compatibility issues with IronPython. I need to dynamically write each row in the new CSV: load data from the old CSV, then append that data to the new CSV. The old CSV keeps changing once I append the data to the new one, so I need to iterate over all the values in the lists every time I write a row, because it is much more difficult to append columns to a CSV.
What I basically want to do is:
with open('data.csv', 'a') as f:
    sWriter = csv.writer(f)
    sWriter.writerow([Value_list[i], Value_list[i+1], Value_list[i+2], ..., Value_list[end]])

But I can't seem to think of a way to do this with iteration.
Because the writerow method takes a list argument, you can first construct the list and then write it, so everything in the list will be in one row.
Like:

with open('data.csv', 'a') as f:
    sWriter = csv.writer(f)
    listOfColumns = []
    for i in range(start, stop):  # append elements from Value_list ('from' renamed: it is a Python keyword)
        listOfColumns.append(Value_list[i])
    for i in range(0, 2):  # or you may want some blank columns
        listOfColumns.append("")
    for i in range(anotherStart, anotherStop):  # append more elements from Value_list
        listOfColumns.append(Value_list[i])
    # At this point listOfColumns will be
    # [Value_list[start], ..., Value_list[stop-1], "", "",
    #  Value_list[anotherStart], ..., Value_list[anotherStop-1]]
    sWriter.writerow(listOfColumns)
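For the original question's layout (each value followed by about five empty cells), the row can also be built by looping over the values themselves. A sketch with a short made-up variable_names list, writing to an in-memory buffer instead of a real file:

```python
import csv
import io

# Hypothetical short version of the question's 50+ item list
variable_names = ["Velocity", "Density", "Pressure"]

row = []
for name in variable_names:
    row.append(name)
    row.extend([""] * 5)  # five blank cells after each value

# io.StringIO stands in for open('data.csv', 'a') here
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())
```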

pandas inserting rows in a monotonically increasing dataframe using itertuples

I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the data of the packets, formatted as ascii representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there are missing data, to insert new rows to make the list monotonically increasing, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples, comparing each row with the next and adding a new row if the difference is more than 100, but I've struggled with this since there doesn't seem to be a good way to retrieve the row after the current one.
2) Is there a better way (faster) way to do this other than the way I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check monotonicity verbatim with df.data.is_monotonic_increasing.
Inserting the new indices: it is easier to go the other way around. You already know the index you want; it is given by range(min_val, max_val + 1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory-intensive, so you may need to go over your data in chunks; in that case, you may need to provide the index range ahead of time.
import io

import pandas as pd

# test data
df = pd.read_csv(
    io.StringIO(  # pd.compat.StringIO was removed in pandas 1.0
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r" +",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
Just for examination: did you try something like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]
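An equivalent, shorter route to the blank-frame-and-update idea is reindex, which inserts NaN rows for any missing keys. A sketch on a cut-down version of the sample data (times shortened for readability):

```python
import pandas as pd

# Cut-down sample: the ...200 row is missing
df = pd.DataFrame(
    {
        "data": [303030303030303000, 303030303030303100, 303030303030303300],
        "frame_time_epoch": [1527986052.48, 1527986052.49, 1527986052.50],
    }
)

# Desired full index, step 100, then reindex: missing rows appear as NaN
full = range(df["data"].iloc[0], df["data"].iloc[-1] + 1, 100)
df2 = df.set_index("data").reindex(full)
print(df2)
```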
