Python openpyxl: append cell values to a 2D list

I am currently learning Python, using Python 3.5.
I have an Excel sheet with a fixed number of columns and rows that contain data, and I want to save those values in a 2D list using append.
So far I have only managed to save the data in a 1D list. This is my code:
import openpyxl
Values=[[]]
MaxColumn=sheet.max_column
MaxRow=sheet.max_row
for y in range (10,MaxRow):#Iterate for each row.
    for x in range (1,MaxColumn):#Iterate for each column.
        Values.append(sheet.cell(row=y,column=x).value)
#I have tried with the following:
Values[y].append(sheet.cell(row=y,column=x).value)
Traceback (most recent call last):
  File "<pyshell#83>", line 4, in <module>
    Values[y].append(sheet.cell(row=y,column=x).value)
AttributeError: 'int' object has no attribute 'append'
for x in range (1,MaxColumn):
    #print(sheet.cell(row=y,column=x).value)
    Values.append(sheet.cell(row=y,column=x).value)

You must have some code that redefines the Values object but in any case you can just do list(sheet.values).

Try the following:
# Define a list
list_2d = []
# Loop over all rows in the sheet
for row in ws.rows:
    # Append a list of column values to your list_2d
    list_2d.append( [cell.value for cell in row] )
print(list_2d)
Your Traceback Error:
Values[y].append(
AttributeError: 'int' object has no attribute 'append'
Values[y] is not a list object; it's an int value at index y in your list object.
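To make the two answers above concrete, here is a minimal sketch. It assumes an in-memory workbook with made-up values in place of the asker's file, so `wb`, `ws` and the sample rows are illustrative and not from the question. It builds one inner list per row and appends that list to the outer list, then produces the same result from `ws.values`. Note that `range(start, stop)` excludes `stop`, so covering the last row and column needs `max_row + 1` and `max_column + 1`.

```python
from openpyxl import Workbook

# Hypothetical stand-in for the asker's sheet: three rows, two columns.
wb = Workbook()
ws = wb.active
for sample_row in [["a", 1], ["b", 2], ["c", 3]]:
    ws.append(sample_row)

# Explicit loops: collect each row in its own list, then append that list.
values = []
for y in range(1, ws.max_row + 1):         # +1 because range() excludes the stop value
    row_values = []
    for x in range(1, ws.max_column + 1):
        row_values.append(ws.cell(row=y, column=x).value)
    values.append(row_values)

# Shortcut: ws.values yields one tuple of cell values per row.
values_from_property = [list(row) for row in ws.values]

print(values)
```

Starting from an empty list (`values = []`) rather than `[[]]` avoids a stray empty first row.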

Related

Extract a substring from a column in a dataframe, iteratively

I have a dataframe that contains multiple columns. The column 'group_email' contains several relevant pieces of data, and I want to extract a specific substring from it and create a new column from it for each row. However, the email follows multiple patterns, so I first have to check which substring the email starts with to know which regex pattern to use.
for ind in group_member_df.index:
    if(group_member_df['group_email'][ind].startswith("gcp") is True):
        group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('(?:prod-)(.*)-'))
    elif(group_member_df['group_email'][ind].startswith("irm") is True):
        group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('^(?:[^-]*\-){6}([^.]*)'))
    else:
        group_member_df['group_code'][ind] = '0'
With this logic I iterate through each row in the dataframe and check whether the email starts with 'gcp' or 'irm'. If it starts with one of those, I want to extract from group_email using a specific regex; if neither, I just set group_code to 0.
However, I'm getting an error:
Traceback (most recent call last):
  File "directory.py", line 225, in <module>
    main(sys.argv[1:])
  File "directory.py", line 202, in main
    group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('(?:prod-)(.*)-'))
AttributeError: 'str' object has no attribute 'str'
This happens when I call .str.extract on a specific index of the dataframe. What would be the correct way of doing this?
Here is raw data from the dataframe that I want to parse from:
,group_kind,group_id,group_etag,group_email,group_description,group_directMembersCount,group_name,kind,etag,id,email,role,type,status
0,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/XprY4N1E2ZREZ95Av98__pbQZXg""",115332437364675590394,astronomer#irm-eap-edp-core-prod.iam.gserviceaccount.com,MEMBER,USER,ACTIVE
1,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/WDJKr0BpbrpusytGd_HBA_wVzRQ""",102931703871297935722,hema.sundarreddy.contr#im.com,MEMBER,USER,ACTIVE
2,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/1z_mHHk4rwh93nZf55UPPWGjFyc""",111625551155802089398,irm-eap-edp-core-prod#appspot.gserviceaccount.com,MEMBER,USER,
3,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/Q7YEC8F_JeB1jKBsNam3u2fiF1o""",107499294203545833692,jarrett.garcia#im.com,OWNER,USER,ACTIVE
4,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/z5Cw_9BaO6gEOiiiX2k9HXfW5uc""",102874697335989237851,shalini.rajamani#im.com,MEMBER,USER,ACTIVE
5,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/G8PLD_6sZpjHCS44h6_9rRXIt0I""",103243562666022054078,suraj.angadi.contr#im.com,MEMBER,USER,ACTIVE
6,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/UU6ouU-RZwaU6rXCFtRmUm0Tjdk""",103099940548030708420,svc.appscripts#im.com,MANAGER,USER,ACTIVE
.str is an accessor on a pandas.Series, and extract() is one of its methods. group_member_df['group_email'][ind] is a plain Python string, so it has no .str attribute and no extract() method.
I would try something like:
prefix_dict = {"gcp": r'(?:prod-)(.*)-', "irm": r'^(?:[^-]*\-){6}([^.]*)'}
res = {}
for prefix in prefix_dict.keys():
    # Select only the rows whose email starts with the current prefix
    mask = group_member_df.loc[:,'group_email'].str.startswith(prefix)
    reg = prefix_dict[prefix]
    res[prefix] = (group_member_df.loc[mask, 'group_email'].str.extract(reg))
Note that extract() returns a pandas DataFrame by default (expand=True), so the result cannot be directly inserted back into group_member_df without further processing. Here I collect the results into a dict to allow such processing.
I use the mask because vectorized functions are faster than iterating over rows or columns.
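One possible way to do the "further processing" is sketched below. It assumes `expand=False`, which makes `extract()` return a Series for a single capture group, so the result can be assigned straight into a `group_code` column through the same mask. The first address comes from the question's data; the `irm-...` and `other-...` addresses are hypothetical, invented only so that every branch is exercised.

```python
import pandas as pd

group_member_df = pd.DataFrame({
    "group_email": [
        "gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com",  # from the question
        "irm-a-b-c-d-e-code.groups#im.com",                              # hypothetical
        "other-group#im.com",                                            # hypothetical
    ]
})

prefix_patterns = {
    "gcp": r"(?:prod-)(.*)-",
    "irm": r"^(?:[^-]*\-){6}([^.]*)",
}

# Default for rows that match neither prefix.
group_member_df["group_code"] = "0"

for prefix, pattern in prefix_patterns.items():
    mask = group_member_df["group_email"].str.startswith(prefix)
    # expand=False -> a Series aligned on the index of the masked rows
    extracted = group_member_df.loc[mask, "group_email"].str.extract(pattern, expand=False)
    group_member_df.loc[mask, "group_code"] = extracted

print(group_member_df["group_code"].tolist())
```

Rows whose prefix matches but whose regex does not would end up as NaN rather than '0'; whether that is acceptable depends on the data.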

'Series' object is not callable on a dataframe column with lists

I have a dataframe with many rows and columns.
One column - 'diagnostic superclass' has labels for every patient stored in rows as lists.
It looks like this:
['MI', 'HYP', 'STTC']
['MI', 'CD', 'STTC']
I need to obtain the first label from every row.
The desired output is a column that stores the first list element of every row,
so I wrote a function:
def labels(column_with_lists):
    label = column_with_lists
    for a in column_with_lists() :
        list_label = column_with_lists[0]
        label = list_label[0]
    return label
So when I run the code I face the following problem:
Traceback (most recent call last):
  File "C:/Users/nikit/PycharmProjects/ECG/ECG.py", line 77, in <module>
    print(labels(y_true['diagnostic_superclass']))
  File "C:/Users/nikit/PycharmProjects/ECG/ECG.py", line 63, in labels
    for a in column_with_lists() :
TypeError: 'Series' object is not callable
The error is caused by the () after the variable name, which means "call it as a function". It's a pandas Series, not a function.
One way to get a new series from the series of lists is with pandas.Series.apply():
def labels(column_with_lists):
    return column_with_lists.apply(lambda x: x[0])
As @Vivek Kalyanarangan said, remove the parentheses and it will run, but I think you are confused: why iterate in this part if you don't use "a" for anything?
for a in column_with_lists :
    list_label = column_with_lists[0]
    label = list_label[0]
I think you need to store the first item of each row in a list. In fact, you don't need a function:
first_element_of_each_row = [i[0] for i in y_true['diagnostic_superclass'].to_numpy()]
This should work.
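For comparison, a small sketch of two vectorized ways to take the first element of each list. The first two rows are the ones shown in the question; the `['NORM']` row and the empty list are assumptions added to show how each approach treats edge cases. The `.str[0]` accessor also works on list values and yields NaN for an empty list, whereas `x[0]` inside `apply` would raise `IndexError` on that row.

```python
import pandas as pd

y_true = pd.DataFrame({
    "diagnostic_superclass": [
        ["MI", "HYP", "STTC"],   # from the question
        ["MI", "CD", "STTC"],    # from the question
        ["NORM"],                # hypothetical
        [],                      # hypothetical empty list
    ]
})

# .str[0] indexes into each element; out-of-range positions become NaN.
first_by_accessor = y_true["diagnostic_superclass"].str[0]

# apply() with x[0] works as long as no list is empty, so skip the last row here.
non_empty = y_true["diagnostic_superclass"].iloc[:3]
first_by_apply = non_empty.apply(lambda labels: labels[0])

print(first_by_accessor)
```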

How to get the maximum number of digits after the decimal point in a Pandas series

I read a list of float values of varying precision from a csv file into a Pandas Series and need the number of digits after the decimal point. So, for 123.4567 I want to get 4.
I managed to get the number of digits for randomly generated numbers like this:
df = pd.Series(np.random.rand(100)*1000)
precision_digits = (df - df.astype(int)).astype(str).str.split(".", expand=True)[1].str.len().max()
However, if I read data from disk using pd.read_csv where some of the rows are empty (and thus filled with nan), I get the following error:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/tgamauf/workspace/mostly-sydan/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 4376, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
What is going wrong here?
Is there a better way to do what I need?
pd.read_csv() typically returns a DataFrame object. The StringMethods object returned by using .str is only defined for a Series object. Try using pd.read_csv('your_data.csv', squeeze=True) to have it return a Series object; then you will be able to use .str. (The squeeze parameter was deprecated in pandas 1.4 and removed in 2.0; on newer versions, call .squeeze("columns") on the result of read_csv, or select the column by name.)
For example, suppose you have the following data with NaN in it:
df=pd.Series([1.111,2.2,3.33333,np.nan])
idx=df.index  # record the original index
df=df.dropna()  # remove the NaN rows
(df - df.astype(int)).astype(str).str.split(".", expand=True)[1].str.len().reindex(idx)
The version with df - df.astype(int) does not work correctly for me, probably because the floating-point subtraction introduces extra digits. Simply applying the same str.split without it does work:
def get_max_decimal_length(df):
    """Get the maximum length of the fractional part of the values or None if no values present."""
    values = df.dropna()
    return None if values.empty else values.astype(str).str.split(".", expand=True)[1].str.len().max()
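Putting the answers together, here is a hedged end-to-end sketch. The CSV text, the column names `id` and `value`, and the variable names are made up for illustration. It reads the file into a DataFrame, selects one column so that `.str` is available, drops the NaN rows, and counts the digits after the decimal point.

```python
import io

import pandas as pd

# Hypothetical CSV content; the third row has an empty value, which becomes NaN.
csv_text = "id,value\n1,1.111\n2,2.2\n3,\n4,3.33333\n"

frame = pd.read_csv(io.StringIO(csv_text))  # a DataFrame, so frame.str does not exist
values = frame["value"]                     # a Series, so values.str does


def get_max_decimal_length(series):
    """Return the longest fractional part in the series, or None if it has no values."""
    present = series.dropna()
    if present.empty:
        return None
    fractional = present.astype(str).str.split(".", expand=True)[1]
    return int(fractional.str.len().max())


max_digits = get_max_decimal_length(values)
print(max_digits)
```

This relies on the string form of each float containing a decimal point. Values that pandas renders in scientific notation (for example very small numbers) would need separate handling.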

openpyxl iterate through cells of column => TypeError: 'generator' object is not subscriptable

I want to loop over all values of a column, to save them as the keys in a dict. As far as I know, everything is OK, but Python disagrees.
So my question is: "What am I doing wrong?"
>>> wb = xl.load_workbook('/home/x/repos/network/input/y.xlsx')
>>> sheet = wb.active
>>> for cell in sheet.columns[0]:
...     print(cell.value)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'generator' object is not subscriptable
I must be missing something, but after a few hours of trying, I must throw in the towel and call in the cavalry ;)
Thanks in advance for all the help!
===================================================================
@Charlie Clark: thanks for taking the time to answer. The thing is, I need the first column as keys in a dict, and afterwards I need to do
for cell in sheet.columns[1:]:
so, while it would resolve my issue, the same problem would come back a few lines later in my code. I will use your suggestion in my code to extract the keys.
The question I mostly have is:
why doesn't it work? I am sure I used this code snippet before, and this is also how people suggest doing it when I google it.
==========================================================================
>>> for column in ws.iter_cols(min_col = 2):
...     for cell in column:
...         print(cell.value)
goes over all columns of the sheet, except the first. Now, I still need to exclude the first 3 rows.
ws.columns returns a generator of columns because this is much more efficient on large workbooks, and in your case you only want one column. Version 2.4 now provides the option to get columns directly: ws['A']
Apparently, the openpyxl module got upgraded to 2.4 without my knowledge.
>>> for col in ws.iter_cols(min_col = 2, min_row=4):
...     for cell in col:
...         print(cell.value)
should do the trick.
I posted this in case other people are looking for the same answer.
iter_cols(min_col=None, max_col=None, min_row=None, max_row=None)
Returns all cells in the worksheet from the first row as columns.
If no boundaries are passed in the cells will start at A1.
If no cells are in the worksheet an empty tuple will be returned.
Parameters:
    min_col (int) – smallest column index (1-based index)
    min_row (int) – smallest row index (1-based index)
    max_col (int) – largest column index (1-based index)
    max_row (int) – largest row index (1-based index)
Return type:
    generator
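The pieces discussed in this thread can be combined as in the sketch below. It assumes an in-memory workbook with invented contents (three header-like rows followed by two data rows), so `wb`, `ws` and the sample values are illustrative only. It reads the keys from column A with `ws['A']`, reads the remaining columns with `iter_cols`, and shows that wrapping `ws.columns` in `list()` makes it subscriptable again.

```python
from openpyxl import Workbook

# Hypothetical sheet: rows 1-3 are to be skipped, rows 4-5 hold the data.
wb = Workbook()
ws = wb.active
for sample_row in [
    ["h1", "h2", "h3"],
    ["x", "x", "x"],
    ["x", "x", "x"],
    ["k1", 1, 2],
    ["k2", 3, 4],
]:
    ws.append(sample_row)

# ws['A'] returns a tuple of cells, so it can be sliced to drop the first 3 rows.
keys = [cell.value for cell in ws["A"][3:]]

# iter_cols with min_col/min_row skips the first column and the first 3 rows.
columns = [
    [cell.value for cell in col]
    for col in ws.iter_cols(min_col=2, min_row=4)
]

# ws.columns is a generator; materialize it before indexing.
first_column = [cell.value for cell in list(ws.columns)[0]]

# One possible dict: key from column A, values from the remaining columns of that row.
by_key = {key: [col[i] for col in columns] for i, key in enumerate(keys)}
print(by_key)
```

Calling `list(ws.columns)` reads every column into memory, which is the cost the generator was meant to avoid, so `ws['A']` or `iter_cols` is usually the lighter choice on large sheets.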
I'm a newbie, but I came up with the following code, based on the openpyxl documentation, to print out the cell contents for a column:
x = sheet.max_row + 1
for r in range(1,x):
    d=sheet.cell(row=r,column=2)
    print(d.value)

Checking for the first value of data frame and assigning a value in Python

r is a dataframe with five columns: i, j, k, l, m.
Below is my code:
for i in pd.unique(r.id):
    sub=r[(r.id==i)]  # subsetting the dataframe for each ID
    sub=sub.drop_duplicates(["i","j","k","l","m"])  # dropping the duplicates
    sub['k']=pd.to_datetime(sub['k'],unit='s',utc=False)
    g=int(sub.iloc[0]['m'])  # want to get the first value of the column
    if g>64:
        r=(g/64)-1
    else:
        r=0
    if(len(sub)>1):
        sub.m=r*64 + m
This works well for one ID. When there are multiple IDs, I am getting,
Traceback (most recent call last):
  File "C:/project1/Final.py", line 90, in <module>
    sub=r[(r.id==i)]
AttributeError: 'int' object has no attribute 'id'
Can anybody help me in solving this problem? I want to loop for all the IDs in r dataframe so that I can make some calculations.
Inside the for loop you overwrite the dataframe r with an int:
    if g>64:
        r=(g/64)-1
    else:
        r=0
This is fine with one ID because on the first iteration r is still a dataframe.
It fails with multiple IDs because on the second iteration it has been overwritten by the int, and throws the error.
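A minimal sketch of the usual fix: give the scalar its own name so the dataframe `r` is never rebound inside the loop. The sample data and the names `current_id`, `first_m`, `multiplier` and `multipliers` are assumptions for illustration; only the `g > 64` rule is taken from the question.

```python
import pandas as pd

# Hypothetical data with two IDs; only the columns needed for the example.
r = pd.DataFrame({"id": [1, 1, 2], "m": [130, 5, 10]})

multipliers = {}
for current_id in pd.unique(r["id"]):
    sub = r[r["id"] == current_id]       # r is still the dataframe on every iteration
    first_m = int(sub.iloc[0]["m"])      # first value of column m for this ID
    if first_m > 64:
        multiplier = (first_m / 64) - 1  # stored under a new name, not in r
    else:
        multiplier = 0
    multipliers[int(current_id)] = multiplier

print(multipliers)
```

Because `r` is only read inside the loop, the second iteration filters the same dataframe as the first.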
