I have a dataframe that contains multiple columns. The column 'group_email' holds several relevant pieces of data, and I want to extract a specific substring from it and create a new column from that value for each row. However, the email follows multiple patterns, so I first have to check which prefix the email starts with to know which regex pattern to use.
for ind in group_member_df.index:
    if(group_member_df['group_email'][ind].startswith("gcp") is True):
        group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('(?:prod-)(.*)-'))
    elif(group_member_df['group_email'][ind].startswith("irm") is True):
        group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('^(?:[^-]*\-){6}([^.]*)'))
    else:
        group_member_df['group_code'][ind] = '0'
With this logic I iterate through each row of the dataframe and check whether the email starts with 'gcp' or 'irm'; if it does, I want to extract from group_email using the corresponding regex, and if it is neither I just set group_code to '0'.
However, I'm getting an error:
Traceback (most recent call last):
File "directory.py", line 225, in <module>
main(sys.argv[1:])
File "directory.py", line 202, in main
group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('(?:prod-)(.*)-'))
AttributeError: 'str' object has no attribute 'str'
The error occurs when I try to call .str.extract() on the value at a specific index of the dataframe. What would be the correct way of doing this?
Here is raw data from the dataframe that I want to parse from:
,group_kind,group_id,group_etag,group_email,group_description,group_directMembersCount,group_name,kind,etag,id,email,role,type,status
0,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/XprY4N1E2ZREZ95Av98__pbQZXg""",115332437364675590394,astronomer#irm-eap-edp-core-prod.iam.gserviceaccount.com,MEMBER,USER,ACTIVE
1,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/WDJKr0BpbrpusytGd_HBA_wVzRQ""",102931703871297935722,hema.sundarreddy.contr#im.com,MEMBER,USER,ACTIVE
2,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/1z_mHHk4rwh93nZf55UPPWGjFyc""",111625551155802089398,irm-eap-edp-core-prod#appspot.gserviceaccount.com,MEMBER,USER,
3,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/Q7YEC8F_JeB1jKBsNam3u2fiF1o""",107499294203545833692,jarrett.garcia#im.com,OWNER,USER,ACTIVE
4,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/z5Cw_9BaO6gEOiiiX2k9HXfW5uc""",102874697335989237851,shalini.rajamani#im.com,MEMBER,USER,ACTIVE
5,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/G8PLD_6sZpjHCS44h6_9rRXIt0I""",103243562666022054078,suraj.angadi.contr#im.com,MEMBER,USER,ACTIVE
6,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/UU6ouU-RZwaU6rXCFtRmUm0Tjdk""",103099940548030708420,svc.appscripts#im.com,MANAGER,USER,ACTIVE
.str.extract() is a pandas.Series method, while group_member_df['group_email'][ind] is a plain Python string, so it has no .str accessor.
I would try something like
prefix_dict = {"gcp":'(?:prod-)(.*)-',"irm":'^(?:[^-]*\-){6}([^.]*)'}
res={}
for prefix in prefix_dict.keys():
mask = group_member_df.loc[:,'group_email'].str.startswith("gcp")
reg = prefix_dict[prefix]
res[prefix] = (group_member_df.loc[mask, 'group_email'].str.extract(reg))
Note that extract() returns a pandas DataFrame, so it cannot be inserted directly back into group_member_df without further processing; here I collect the results into a dict to allow that.
I use the mask because vectorized operations are faster than iterating over rows or columns.
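If you want the result back in a single group_code column, as in the original loop, one option is numpy.select. This is only a sketch, assuming the same two prefixes and regexes and '0' as the fallback value:
import numpy as np

emails = group_member_df['group_email']
gcp_code = emails.str.extract('(?:prod-)(.*)-')[0]            # first capture group as a Series
irm_code = emails.str.extract('^(?:[^-]*\-){6}([^.]*)')[0]

group_member_df['group_code'] = np.select(
    [emails.str.startswith('gcp'), emails.str.startswith('irm')],
    [gcp_code, irm_code],
    default='0',
)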
I have a dataframe with many rows and columns.
One column, 'diagnostic_superclass', has labels for every patient stored as lists in each row.
It looks like this:
['MI', 'HYP', 'STTC']
['MI', 'CD', 'STTC']
I need to obtain the first label from every row.
The desired output is a column that stores the first list element of every row,
so I wrote a function:
def labels(column_with_lists):
    label = column_with_lists
    for a in column_with_lists():
        list_label = column_with_lists[0]
        label = list_label[0]
    return label
So when I run the code I face the following problem:
Traceback (most recent call last):
File "C:/Users/nikit/PycharmProjects/ECG/ECG.py", line 77, in <module>
print(labels(y_true['diagnostic_superclass']))
File "C:/Users/nikit/PycharmProjects/ECG/ECG.py", line 63, in labels
for a in column_with_lists() :
TypeError: 'Series' object is not callable
The error comes from the () after the variable name, which means you are calling it as a function. It is a pandas Series, not a function.
One way to get a new series from the series of lists is with pandas.Series.apply()
def labels(column_with_lists):
    return column_with_lists.apply(lambda x: x[0])
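A brief usage sketch, assuming you want the result as a new column; first_label is just a placeholder name, and Series.str[0] is shown as an equivalent shortcut for element-wise indexing of the lists:
y_true['first_label'] = labels(y_true['diagnostic_superclass'])
# or, without the helper, indexing the lists through the .str accessor:
y_true['first_label'] = y_true['diagnostic_superclass'].str[0]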
As @Vivek Kalyanarangan said, remove the parentheses and it will work, but I think you are confused: why iterate over this part if you don't use a for anything?
for a in column_with_lists:
    list_label = column_with_lists[0]
    label = list_label[0]
I think you just need to store the first item of each row in a list. In fact, you don't need a function at all:
first_element_of_each_row = [i[0] for i in y_true['diagnostic_superclass'].to_numpy()]
This should work.
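If you then want it attached back to the dataframe (first_label is just an example name):
y_true['first_label'] = first_element_of_each_row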
I read a list of float values of varying precision from a csv file into a Pandas Series and need the number of digits after the decimal point. So, for 123.4567 I want to get 4.
I managed to get the number of digits for randomly generated numbers like this:
df = pd.Series(np.random.rand(100)*1000)
precision_digits = (df - df.astype(int)).astype(str).str.split(".", expand=True)[1].str.len().max()
However, if I read data from disk using pd.read_csv where some of the rows are empty (and thus filled with nan), I get the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/tgamauf/workspace/mostly-sydan/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 4376, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
What is going wrong here?
Is there a better way to do what I need?
pd.read_csv() typically returns a DataFrame object. The StringMethods object you get via .str is only defined for a Series object. Try using pd.read_csv('your_data.csv', squeeze=True) to have it return a Series; then you will be able to use .str.
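For completeness, a minimal sketch; 'your_data.csv' is just a placeholder, and squeeze=True only exists in older pandas versions, so an explicit column selection is shown as an alternative:
import pandas as pd

# older pandas: squeeze the single-column result into a Series
s = pd.read_csv('your_data.csv', squeeze=True)

# equivalent without squeeze: pick the column explicitly
s = pd.read_csv('your_data.csv').iloc[:, 0]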
For example, suppose you have the following data with NaN in it:
df = pd.Series([1.111, 2.2, 3.33333, np.nan])
idx = df.index  # record the original index
df = df.dropna()  # remove the NaN rows
(df - df.astype(int)).astype(str).str.split(".", expand=True)[1].str.len().reindex(idx)
The version with df - df.astype(int) does not work correctly for me; simply applying the same str.split without it does:
def get_max_decimal_length(df):
    """Get the maximum length of the fractional part of the values or None if no values present."""
    values = df.dropna()
    return None if values.empty else values.astype(str).str.split(".", expand=True)[1].str.len().max()
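A quick usage sketch with made-up values; the expected result here is 5, from 3.33333:
s = pd.Series([1.111, 2.2, 3.33333, np.nan])
print(get_max_decimal_length(s))  # 5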
I want to loop over all values of a column to save them as the keys in a dict. As far as I know, everything is OK, but Python disagrees.
So my question is: "what am i doing wrong?"
>>> wb = xl.load_workbook('/home/x/repos/network/input/y.xlsx')
>>> sheet = wb.active
>>> for cell in sheet.columns[0]:
... print(cell.value)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'generator' object is not subscriptable
I must be missing something, but after a few hours of trying I have to throw in the towel and call in the cavalry ;)
Thanks in advance for all the help!
===================================================================
@Charlie Clark: thanks for taking the time to answer. The thing is, I need the first column as the keys of a dict, and afterwards I need to do
for cell in sheet.columns[1:]:
so, while it would resolve this issue, the same problem would come back a few lines later in my code. I will use your suggestion to extract the keys.
The question I mostly have is: why doesn't it work? I am fairly sure I have used this code snippet before, and it is also what people suggest when I google it.
==========================================================================
>>> for column in ws.iter_cols(min_col = 2):
... for cell in column:
... print(cell.value)
goes over all columns of the sheet except the first. Now I still need to exclude the first 3 rows.
ws.columns returns a generator of columns because this is much more efficient on large workbooks; as in your case, you often only want one column. Version 2.4 provides the option to get a column directly: ws['A']
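A minimal sketch of that suggestion, assuming the keys are in column A:
for cell in ws['A']:
    print(cell.value)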
Apparently the openpyxl module got upgraded to 2.4 without my knowledge.
>>> for col in ws.iter_cols(min_col = 2, min_row=4):
... for cell in col:
... print(cell.value)
should do the trick.
I posted this in case other people are looking for the same answer.
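For reference, a sketch of how the dict itself could be built under the same assumptions (keys in column A, data starting at row 4; data is just an example name):
data = {}
for row in ws.iter_rows(min_row=4):
    key = row[0].value                            # column A holds the key
    data[key] = [cell.value for cell in row[1:]]  # remaining columns are the values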
iter_cols(min_col=None, max_col=None, min_row=None, max_row=None)
Returns all cells in the worksheet from the first row as columns.
If no boundaries are passed in the cells will start at A1.
If no cells are in the worksheet an empty tuple will be returned.
Parameters:
min_col (int) – smallest column index (1-based index)
min_row (int) – smallest row index (1-based index)
max_col (int) – largest column index (1-based index)
max_row (int) – largest row index (1-based index)
Return type:
generator
I'm a newbie, but I came up with the following code, based on the openpyxl documentation, to print out the cell contents of a column:
x = sheet.max_row + 1
for r in range(1, x):
    d = sheet.cell(row=r, column=2)
    print(d.value)
r is a dataframe with five columns: i, j, k, l, m.
Below is my code,
for i in pd.unique(r.id):
    sub = r[(r.id == i)]  # subsetting the dataframe for each ID
    sub = sub.drop_duplicates(["i", "j", "k", "l", "m"])  # dropping the duplicates
    sub['k'] = pd.to_datetime(sub['k'], unit='s', utc=False)
    g = int(sub.iloc[0]['m'])  # want to get the first value of the column
    if g > 64:
        r = (g / 64) - 1
    else:
        r = 0
    if len(sub) > 1:
        sub.m = r * 64 + m
This works well for one ID. When there are multiple IDs, I am getting,
Traceback (most recent call last):
File "C:/project1/Final.py", line 90, in <module>
sub=r[(r.id==i)]
AttributeError: 'int' object has no attribute 'id'
Can anybody help me solve this problem? I want to loop over all the IDs in the r dataframe so that I can do some calculations.
Inside the for loop you overwrite the dataframe r with an int:
if g > 64:
    r = (g / 64) - 1
else:
    r = 0
This is fine with one ID because on the first iteration r is still a dataframe.
It fails with multiple IDs because on the second iteration it has been overwritten by the int, and throws the error.
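The simplest fix is to use a different name for that intermediate value so the dataframe r is never shadowed. A sketch of the loop with the variable renamed; offset is just an example name, and the trailing + m from the question is assumed to mean the column sub.m:
for i in pd.unique(r.id):
    sub = r[r.id == i]
    sub = sub.drop_duplicates(["i", "j", "k", "l", "m"])
    sub['k'] = pd.to_datetime(sub['k'], unit='s', utc=False)
    g = int(sub.iloc[0]['m'])
    offset = (g / 64) - 1 if g > 64 else 0  # previously this overwrote r
    if len(sub) > 1:
        sub.m = offset * 64 + sub.m         # assumption: m meant the column sub.m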