extract sub string from column in dataframe, iteratively - python

I have a dataframe that contains multiple columns. The column 'group_email' contains several relevant pieces of data, and I want to extract a specific substring from it and create a new column from that substring for each row. However, the email follows multiple patterns, so I first have to check which substring the email starts with to know which regex pattern to use.
for ind in group_member_df.index:
    if(group_member_df['group_email'][ind].startswith("gcp") is True):
        group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('(?:prod-)(.*)-'))
    elif(group_member_df['group_email'][ind].startswith("irm") is True):
        group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('^(?:[^-]*\-){6}([^.]*)'))
    else:
        group_member_df['group_code'][ind] = '0'
I have this logic, where I iterate through each row in the dataframe and check whether the email starts with 'gcp' or 'irm'. If it starts with one of those, I want to extract from group_email using a specific regex; if neither, I just set group_code to 0.
However, I'm getting an error:
Traceback (most recent call last):
File "directory.py", line 225, in <module>
main(sys.argv[1:])
File "directory.py", line 202, in main
group_member_df['group_code'][ind] = (group_member_df['group_email'][ind].str.extract('(?:prod-)(.*)-'))
AttributeError: 'str' object has no attribute 'str'
When trying to call .str.extract... on the specific index of the dataframe. What would be the correct way of doing this?
Here is raw data from the dataframe that I want to parse from:
,group_kind,group_id,group_etag,group_email,group_description,group_directMembersCount,group_name,kind,etag,id,email,role,type,status
0,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/XprY4N1E2ZREZ95Av98__pbQZXg""",115332437364675590394,astronomer#irm-eap-edp-core-prod.iam.gserviceaccount.com,MEMBER,USER,ACTIVE
1,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/WDJKr0BpbrpusytGd_HBA_wVzRQ""",102931703871297935722,hema.sundarreddy.contr#im.com,MEMBER,USER,ACTIVE
2,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/1z_mHHk4rwh93nZf55UPPWGjFyc""",111625551155802089398,irm-eap-edp-core-prod#appspot.gserviceaccount.com,MEMBER,USER,
3,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/Q7YEC8F_JeB1jKBsNam3u2fiF1o""",107499294203545833692,jarrett.garcia#im.com,OWNER,USER,ACTIVE
4,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/z5Cw_9BaO6gEOiiiX2k9HXfW5uc""",102874697335989237851,shalini.rajamani#im.com,MEMBER,USER,ACTIVE
5,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/G8PLD_6sZpjHCS44h6_9rRXIt0I""",103243562666022054078,suraj.angadi.contr#im.com,MEMBER,USER,ACTIVE
6,admin#directory#group,037m2jsg1zte0ru,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/H_trseaMC0ciMbbaeYJ5C7J1vdU""",gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com,This is created for taxonomy,7,gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups,admin#directory#member,"""ncll-7bPS7lrDES-QUXBlfs2Pot1Y168LPxnrGE6FJU/UU6ouU-RZwaU6rXCFtRmUm0Tjdk""",103099940548030708420,svc.appscripts#im.com,MANAGER,USER,ACTIVE

.str.extract() is a pandas.Series method, while group_member_df['group_email'][ind] is a plain Python string, so it has no .str accessor and no extract() method.
I would try something like
prefix_dict = {"gcp": r'(?:prod-)(.*)-', "irm": r'^(?:[^-]*\-){6}([^.]*)'}
res = {}
for prefix in prefix_dict.keys():
    # select only the rows whose email starts with the current prefix
    mask = group_member_df.loc[:, 'group_email'].str.startswith(prefix)
    reg = prefix_dict[prefix]
    res[prefix] = (group_member_df.loc[mask, 'group_email'].str.extract(reg))
Note that extract() returns a pandas DataFrame by default (expand=True), so it cannot be directly inserted back into group_member_df without further processing. Here I collect the results into a dict to allow such processing.
I use the mask because vectorized functions are faster than iterating over rows or columns.
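As a possible follow-up, here is a minimal sketch of how the extracted values could be written straight back into a group_code column. It assumes each pattern has a single capture group, so that expand=False returns a Series. The 'irm' and 'other' emails below are hypothetical, made up only to exercise each branch.

```python
import pandas as pd

# Sample frame: the first email comes from the question, the other two are hypothetical.
group_member_df = pd.DataFrame({
    "group_email": [
        "gcp-edp-platform-dgov-prod-aadrpt-allsensitive.groups#im.com",
        "irm-a-b-c-d-e-code1.groups#im.com",
        "other.group#im.com",
    ]
})

patterns = {
    "gcp": r"(?:prod-)(.*)-",
    "irm": r"^(?:[^-]*-){6}([^.]*)",
}

# Default for rows that match neither prefix.
group_member_df["group_code"] = "0"

for prefix, pattern in patterns.items():
    mask = group_member_df["group_email"].str.startswith(prefix)
    # expand=False returns a Series when the pattern has one capture group.
    extracted = group_member_df.loc[mask, "group_email"].str.extract(pattern, expand=False)
    # Assignment through .loc aligns on the index, so only the masked rows change.
    group_member_df.loc[mask, "group_code"] = extracted

print(group_member_df["group_code"].tolist())
```

Rows whose email starts with a known prefix but does not match the regex would end up as NaN here; whether those should fall back to '0' is a design choice left open.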

Related

Trying to replicate a sql statement in pyspark, getting column not iterable

I am using PySpark to transform data in a DataFrame. The old extract used this SQL line:
case when location_type = 'SUPPLIER' then SUBSTRING(location_id,1,length(location_id)-3)
I brought in the data and loaded it into a DF, then was trying to do the transform using:
df = df.withColumn("location_id", F.when(df.location_type == "SUPPLIER",
                   F.substring(df.location_id, 1, length(df.location_id) - 3))
                   .otherwise(df.location_id))
The substring method takes an int as the third argument, but the length() method returns a Column object. I had no luck trying to cast it and haven't found a method that would accept the Column. I also tried using the expr() wrapper but again could not make it work.
The supplier IDs look like 12345-01. The transform needs to strip the -01.
As you mention, you can use expr, which lets substring take indices computed from other columns, like this:
df = df.withColumn("location_id",
    F.when(df.location_type == "SUPPLIER",
        F.expr("substring(location_id, 1, length(location_id) - 3)")
    ).otherwise(df.location_id)
)
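To check the arithmetic on the sample ID from the question, here is a plain-Python sketch (not PySpark) of what the SQL expression computes. The helper name sql_substring is hypothetical and only mimics 1-based SUBSTRING for a positive start position and a non-negative length.

```python
def sql_substring(text, pos, length):
    """Mimic SQL SUBSTRING(text, pos, length) for pos >= 1 and length >= 0."""
    start = pos - 1  # SQL positions are 1-based, Python slices are 0-based
    return text[start:start + length]


location_id = "12345-01"
stripped = sql_substring(location_id, 1, len(location_id) - 3)
print(stripped)
```

For this input the expression is equivalent to dropping the last three characters, that is, location_id[:-3].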

python pandas is giving a keyerror for a column I group by, even though a boolean expression shows that the column is part of the dataframe

I cannot seem to print the following line: summarydata["Name"].groupby(["Tag"]).size()
without getting the error:
File "C:\Users\rspatel\untitled0.py", line 76, in <module>
print(summarydata["Name"].groupby(["Tag"]).size())
File "C:\Users\rspatel\Anaconda3\lib\site-packages\pandas\core\series.py", line 1720, in groupby
return SeriesGroupBy(
File "C:\Users\rspatel\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 560, in __init__
grouper, exclusions, obj = get_grouper(
File "C:\Users\rspatel\Anaconda3\lib\site-packages\pandas\core\groupby\grouper.py", line 811, in get_grouper
raise KeyError(gpr)
KeyError: 'Tag'
I have checked that Tag is included as a column in the summarydata dataframe by the following:
if 'Tag' in summarydata.columns:
    print("true")
else:
    print("false")
which prints out as true. Therefore I am not sure why a key error is being thrown when the column is in the dataframe.
You are trying to group by a key on the Name column itself, which is a Series and has no Tag label. Instead you want:
summarydata["Name"].groupby(summarydata["Tag"])
from the docs:
by: (mapping, function, label, or list of labels)
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
In other words you can pass it anything ;) (This is why I don't like pandas...)
You can pass it:
a function, which is called on each value of the object's index
a dict(!) or Series, whose values will be used to group (what you want)
a numpy array (ditto)
a label or list of labels, in which case it groups by the column in the object in question
But in your case, you've already selected the Name column, so the Tag column no longer exists! (Think about what summarydata["Name"] returns.)
So if you want to group by label like that, you need to group first and select the column afterwards:
summarydata.groupby("Tag")["Name"]
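A small sketch comparing the two working forms side by side. The sample values in summarydata are made up for illustration; only the column names Name and Tag come from the question.

```python
import pandas as pd

# Hypothetical data with the two columns from the question.
summarydata = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Tag": ["x", "y", "x", "x"],
})

# Option 1: group the Name Series by another Series (aligned on the index).
by_series = summarydata["Name"].groupby(summarydata["Tag"]).size()

# Option 2: group the whole frame by label, then select the Name column.
by_label = summarydata.groupby("Tag")["Name"].size()

print(by_series.to_dict())
print(by_label.to_dict())
```

Both forms should produce the same group sizes; the second is usually easier to read because the grouping key is looked up in the frame that still contains it.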

'Series' object is not callable on a dataframe column with lists

I have a dataframe with many rows and columns.
One column, 'diagnostic_superclass', stores the labels for every patient, one list per row.
It looks like this:
['MI', 'HYP', 'STTC']
['MI', 'CD', 'STTC']
I need to obtain the first label from every row.
The desired output is a column that stores the first list element of every row,
so I wrote a function:
def labels(column_with_lists):
    label = column_with_lists
    for a in column_with_lists() :
        list_label = column_with_lists[0]
        label = list_label[0]
    return label
So when I run the code I face the following problem:
Traceback (most recent call last):
File "C:/Users/nikit/PycharmProjects/ECG/ECG.py", line 77, in <module>
print(labels(y_true['diagnostic_superclass']))
File "C:/Users/nikit/PycharmProjects/ECG/ECG.py", line 63, in labels
for a in column_with_lists() :
TypeError: 'Series' object is not callable
The error is caused by the () after the variable name, which tells Python to call it as a function. It's a pandas Series, not a function.
One way to get a new series from the series of lists is with pandas.Series.apply()
def labels(column_with_lists):
    return column_with_lists.apply(lambda x: x[0])
As @Vivek Kalyanarangan said, remove the parentheses and it will run, but I think there is some confusion here: why iterate in this part if you don't use "a" for anything?
for a in column_with_lists :
    list_label = column_with_lists[0]
    label = list_label[0]
I think you need to store the first item of each row in a list. In fact, you don't need a function at all:
first_element_of_each_row = [i[0] for i in y_true['diagnostic_superclass'].to_numpy()]
This should work.
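The approaches above can be compared on a tiny frame. This is only a sketch: the first two rows are the lists shown in the question, the third row is hypothetical, and the .str[0] variant is an additional option that relies on the .str accessor's element indexing also working for list values.

```python
import pandas as pd

y_true = pd.DataFrame({
    "diagnostic_superclass": [
        ["MI", "HYP", "STTC"],
        ["MI", "CD", "STTC"],
        ["CD", "HYP"],  # hypothetical extra row
    ]
})

col = y_true["diagnostic_superclass"]

first_apply = col.apply(lambda x: x[0])        # Series.apply, as in the answer
first_listcomp = [i[0] for i in col.to_numpy()]  # list comprehension, as in the answer
first_str = col.str[0]                         # .str indexing on list elements

print(first_apply.tolist())
```

The apply and .str[0] variants return a Series that keeps the original index, which makes it easy to assign the result back as a new column.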

How to get the maximum number of digits after the decimal point in a Pandas series

I read a list of float values of varying precision from a csv file into a Pandas Series and need the number of digits after the decimal point. So, for 123.4567 I want to get 4.
I managed to get the number of digits for randomly generated numbers like this:
df = pd.Series(np.random.rand(100)*1000)
precision_digits = (df - df.astype(int)).astype(str).str.split(".", expand=True)[1].str.len().max()
However, if I read data from disk using pd.read_csv where some of the rows are empty (and thus filled with nan), I get the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/tgamauf/workspace/mostly-sydan/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 4376, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
What is going wrong here?
Is there a better way to do what I need?
pd.read_csv() typically returns a DataFrame object. The StringMethods object returned by .str is only defined for a Series. Try pd.read_csv('your_data.csv', squeeze=True) to have it return a Series; then you will be able to use .str. (The squeeze argument was deprecated in pandas 1.4 and removed in 2.0; on newer versions, use pd.read_csv('your_data.csv').squeeze("columns") instead.)
For example, suppose you have the following data with NaN in it:
df = pd.Series([1.111, 2.2, 3.33333, np.nan])
idx = df.index  # record the original index
df = df.dropna()  # remove the NaN row
(df - df.astype(int)).astype(str).str.split(".", expand=True)[1].str.len().reindex(idx)
The version with df - df.astype(int) does not work correctly for me. Simply applying the same str.split without the subtraction does:
def get_max_decimal_length(df):
    """Get the maximum length of the fractional part of the values or None if no values present."""
    values = df.dropna()
    return None if values.empty else values.astype(str).str.split(".", expand=True)[1].str.len().max()
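A short usage sketch of the function above, combined with the squeeze step. The CSV text and the column name value are made up for illustration, and the sketch assumes every value is rendered in plain decimal notation by astype(str) (very large or very small floats print in scientific notation and would need different handling). One plausible reason the subtraction variant misbehaves is binary floating-point rounding, which can add spurious digits to the fractional part.

```python
import io

import pandas as pd


def get_max_decimal_length(series):
    """Maximum length of the fractional part, or None if no values are present."""
    values = series.dropna()
    if values.empty:
        return None
    return values.astype(str).str.split(".", expand=True)[1].str.len().max()


# Hypothetical single-column CSV; squeeze("columns") turns the frame into a Series.
csv_text = "value\n1.111\n2.2\n3.33333\n"
series = pd.read_csv(io.StringIO(csv_text)).squeeze("columns")

with_gap = pd.Series([1.111, None, 3.33333])   # a missing row becomes NaN
only_nan = pd.Series([float("nan")])

print(get_max_decimal_length(series))
print(get_max_decimal_length(with_gap))
print(get_max_decimal_length(only_nan))
```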

Checking for the first value of data frame and assigning a value in Python

r is a dataframe with five columns: i, j, k, l, m.
Below is my code:
for i in pd.unique(r.id):
    sub=r[(r.id==i)]  # subsetting the dataframe for each ID
    sub=sub.drop_duplicates(["i","j","k","l","m"])  # dropping the duplicates
    sub['k']=pd.to_datetime(sub['k'],unit='s',utc=False)
    g=int(sub.iloc[0]['m'])  # want to get the first value of the column
    if g>64:
        r=(g/64)-1
    else:
        r=0
    if(len(sub)>1):
        sub.m=r*64 + m
This works well for one ID. When there are multiple IDs, I am getting,
Traceback (most recent call last):
File "C:/project1/Final.py", line 90, in <module>
sub=r[(r.id==i)]
AttributeError: 'int' object has no attribute 'id'
Can anybody help me in solving this problem? I want to loop for all the IDs in r dataframe so that I can make some calculations.
Inside the for loop you overwrite the dataframe r with an int:
    if g>64:
        r=(g/64)-1
    else:
        r=0
This is fine with one ID because on the first iteration r is still a dataframe.
It fails with multiple IDs because on the second iteration it has been overwritten by the int, and throws the error.
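One way to avoid the clash is to give the scalar its own name. The sketch below assumes an id column and an m column; the sample values and the names offset and offsets are hypothetical, and only the part of the loop relevant to the error is reproduced.

```python
import pandas as pd

# Hypothetical data: two IDs with a few rows each.
r = pd.DataFrame({"id": [1, 1, 2, 2], "m": [130, 5, 10, 7]})

offsets = {}
for i in pd.unique(r["id"]):
    sub = r[r["id"] == i]        # r is still the DataFrame on every iteration
    g = int(sub.iloc[0]["m"])    # first value of column m for this ID
    # Store the scalar under a separate name instead of rebinding r.
    offset = (g / 64) - 1 if g > 64 else 0
    offsets[int(i)] = offset

print(offsets)
```

Because r is never rebound inside the loop, r.id (or r["id"]) keeps working for every ID.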
