New Pandas Columns with Regex Parsing - python

I am trying to parse text data in a Pandas DataFrame based on certain tags and values in another column's fields and store them in their own columns. For example, if I created this dataframe, df:
import re
import pandas as pd

df = pd.DataFrame([[1, 2],
                   ['A: this is a value B: this is the b val C: and here is c.',
                    'A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1', 'col2']
df['tags'] = df['col2'].apply(lambda x: re.findall(r'(?:\s|)(\w*)(?::)', x))
all_tags = []
for val in df['tags']:
    all_tags = all_tags + val
all_tags = list(set(all_tags))
for val in all_tags:
    df[val] = ''
df:
  col1                                               col2       tags A C B
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2          A: and heres another a. C: and another c      [A, C]
How would I populate each of the new "tag" columns with their values from col2 so I get this df:
  col1                                               col2       tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2          A: and heres another a. C: and another c      [A, C]

                      A               C                  B
0       this is a value  and here is c.  this is the b val
1  and heres another a.   and another c

Another option uses str.extractall with the regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$):
The regex captures the key (?P<key>\w+) before the colon and the value (?P<val>[^:]*) after it as two separate columns, key and val. val matches non-colon characters until it reaches the next key/value pair, enforced by the lookahead (?=\w+:|$). This assumes the key is always a single word, which would otherwise be ambiguous:
import re
pat = re.compile(r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")
pd.concat([
    df,
    (
        df.col2.str.extractall(pat)
          .reset_index('match', drop=True)
          .set_index('key', append=True)
          .val.unstack('key')
    )
], axis=1).fillna('')
Where str.extractall returns one row per key/value match, in a long format indexed by the original row number and the match number:
df.col2.str.extractall(pat)
You then pivot that result with unstack and concatenate it with the original data frame.
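Note that the captured val strings keep the whitespace that surrounds them in col2. A minimal follow-up sketch, assuming you want the values trimmed (the strip step is an addition, not part of the original answer):
clean = (
    df.col2.str.extractall(pat)
      .reset_index('match', drop=True)
      .assign(val=lambda d: d.val.str.strip())  # trim leading/trailing spaces from each value
      .set_index('key', append=True)
      .val.unstack('key')
)
pd.concat([df, clean], axis=1).fillna('')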

Here's one way
In [683]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))
          )
Out[683]:
                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c
You could append back the results using join
In [690]: df.join(df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
                    .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))))
Out[690]:
  col1                                               col2       tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2          A: and heres another a. C: and another c      [A, C]

                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c
In fact, you could get df['tags'] using a string method:
In [688]: df.col2.str.findall('(?:\s|)(\w*)(?::)')
Out[688]:
0 [A, B, C]
1 [A, C]
Name: col2, dtype: object
Details:
Split groups into lists
In [684]: df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]:
0 [A: this is a value, B: this is the b val, C: ...
1 [A: and heres another a., C: and another c]
Name: col2, dtype: object
Now, split each group into key/value pairs.
In [685]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: [v.split(':', 1) for v in x]))
Out[685]:
0 [[A, this is a value], [B, this is the b val...
1 [[A, and heres another a.], [C, and another c]]
Name: col2, dtype: object
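The last step wraps each list of pairs in a dict and hands it to pd.Series, so every key becomes a column label. A tiny illustration on the pairs from the first row (note the values keep the leading space left over from the split):
pairs = [['A', ' this is a value'], ['B', ' this is the b val'], ['C', ' and here is c.']]
pd.Series(dict(pairs))   # index: A, B, C; values: the text after each tag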

Related

Groupby pandas dataframe keeping unique values for some columns and list other columns

I want to group the following output by material_id, keeping the unique values of material_description and MPN, but listing plant_id (picture for reference).
def search_output(materials):
    df = pd.DataFrame(materials)
    df_ref = (df.loc[:, df.columns != '#search.score']
                .groupby('material_id')
                .agg(lambda x: list(x)))
    return df_ref
This currently groups by material_id and lists the other columns.
I use the following code to keep the unique values grouped by material_id, but now I am missing the plant_id list column:
df_t = df.loc[:, df.columns != '#search.score'].groupby('material_id')[['material_description', 'MPN']].agg(['unique'])
(picture for reference #2)
I'm looking for a way to combine the two: group by a column, keep unique values of specific columns, and list the other columns at the same time.
Hope you can help - and sorry for the pictures, but can't figure out how to add output otherwise :)
You can build the aggregation dictionary dynamically: map the columns that should keep unique values to 'unique' and every other column to list, then pass the dictionary to GroupBy.agg:
print (df)
   material_id material_description MPN  A  B
0            1               descr1   a  b  c
1            1               descr2   a  d  e
2            1               descr1   b  b  c
3            2               descr3   a  b  c
4            2               descr4   a  b  c
5            2               descr4   a  b  c
u_cols = ['material_description','MPN']
d = {c: 'unique' if c in u_cols else list for c in df.columns.drop('material_id')}
df_ref = df.loc[:, df.columns!='#search.score'].groupby('material_id').agg(d)
print (df_ref)
            material_description     MPN          A          B
material_id
1               [descr1, descr2]  [a, b]  [b, d, b]  [c, e, c]
2               [descr3, descr4]     [a]  [b, b, b]  [c, c, c]
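Plugged back into the asker's helper, this could look roughly as follows (a sketch; it assumes materials is the same list-of-records structure passed to pd.DataFrame in the question):
def search_output(materials):
    df = pd.DataFrame(materials)
    df = df.loc[:, df.columns != '#search.score']
    u_cols = ['material_description', 'MPN']
    # 'unique' for the columns to deduplicate, list for everything else
    d = {c: 'unique' if c in u_cols else list for c in df.columns.drop('material_id')}
    return df.groupby('material_id').agg(d)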

How to count the number of values for each index in Python->Pandas->DataFrame

I have the following DataFrame:
import pandas as pd

a = [{'x1':'a, b, c, d, e, f'}, {'x1':'a, b, c'}, {'x1':'a'}]
df = pd.DataFrame(a)
print(df)
I need to create a new column, len_x1, in the DataFrame and insert into it the number of values in each cell.
I need the following result: a count column containing 6, 3 and 1 for the rows above.
I will be grateful for help.
If possible, count the separators and add 1:
df['x1_len'] = df['x1'].str.count(',') + 1
print (df)
                 x1  x1_len
0  a, b, c, d, e, f       6
1           a, b, c       3
2                 a       1
Or count words:
df['x1_len'] = df['x1'].str.count(r'\w+')
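An alternative sketch (not from the original answer) splits on the separator and counts the pieces; it assumes every cell is a comma-separated string like the sample data:
df['x1_len'] = df['x1'].str.split(',').str.len()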

Insert Blank Row In Python Data frame when value in column changes?

I have a dataframe and I'd like to insert a blank row as a separator whenever the value in the first column changes.
For example:
Column 1  Col2  Col3  Col4
A         s     b     d
A         s     j     k
A         b     d     q
B         b     a     d
C         l     k     p
becomes:
Column 1  Col2  Col3  Col4
A         s     b     d
A         s     j     k
A         b     d     q

B         b     a     d

C         l     k     p
because the value in Column 1 changed
The only way that I figured out how to do this is using VBA as indicated by the correctly marked answer here:
How to automatically insert a blank row after a group of data
But I need to do this in Python.
Any help would be really appreciated!
Create a helper DataFrame whose index marks the last row of each group (the positions where the value changes) plus .5, join it with the original using concat, sort the indices with sort_index, rebuild a default index with reset_index, and finally drop the trailing blank row by position with iloc:
mask = df['Column 1'].ne(df['Column 1'].shift(-1))
df1 = pd.DataFrame('',index=mask.index[mask] + .5, columns=df.columns)
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
print (df)
  Column 1 Col2 Col3 Col4
0        A    s    b    d
1        A    s    j    k
2        A    b    d    q
3
4        B    b    a    d
5
6        C    l    k    p
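To see why the .5 offset works: the helper df1 gets fractional index labels that sort_index places right after the last row of each group, and the final blank row (after the last group) is the one removed by iloc[:-1]. A quick check, not part of the original answer:
print(list(mask.index[mask] + .5))   # [2.5, 3.5, 4.5]
print(df1)                           # three empty rows indexed 2.5, 3.5 and 4.5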

Pandas: Sort before aggregate within a group

I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the values within C, and afterwards only by A to get the groups defined solely by column A. The result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works; however, the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A  B
A  A    Test1,Test2,XYZ
   B              BA,AB
B  A                 AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First sort_values by all 3 columns, then groupby with join, then join the A and B columns into a new column, and last groupby again to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if only 3 columns DataFrame
#df1 = df.sort_values().groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If you need tuples instead, only change the first part of the code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
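If the goal is literally the string format shown in the question, e.g. A,A:(Test1,Test2,XYZ), A,B:(AB,BA), a possible extra formatting step on df1 from the comma-joined variant might be (a sketch, not part of the original answers):
out = (df1.groupby('A')
          .apply(lambda g: ', '.join('{}:({})'.format(d, c) for d, c in zip(g['D'], g['C'])))
          .reset_index(name='val'))
print(out)
#    A                                  val
# 0  A  A,A:(Test1,Test2,XYZ), A,B:(AB,BA)
# 1  B                            B,A:(AA)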

For every row in Pandas dataframe determine if a column value exists in another column

I have a pandas data frame like this:
import pandas as pd

df = pd.DataFrame({'category' : ['A', 'B', 'C', 'A'], 'category_pred' : [['A'], ['B','D'], ['A','B','C'], ['D']]})
print(df)
category category_pred
0 A [A]
1 B [B, D]
2 C [A, B, C]
3 A [D]
I would like to have an output like this:
category category_pred count
0 A [A] 1
1 B [B, D] 1
2 C [A, B, C] 1
3 A [D] 0
That is, for every row, determine if the value in 'category' appears in 'category_pred'. Note that 'category_pred' can contain multiple values.
I can do a for-loop like this one, but it is really slow.
for i in df.index:
    if df.category[i] in df.category_pred[i]:
        df['count'][i] = 1
I am looking for an efficient way to do this operation. Thanks!
You can make use of the DataFrame's apply method.
df['count'] = df.apply(lambda x: 1 if x.category in x.category_pred else 0, axis = 1)
This will add the new column as you want.
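Since the question asks for an efficient approach, a slightly faster alternative sketch (not part of the original answer) avoids apply and zips the two columns directly:
df['count'] = [1 if cat in preds else 0
               for cat, preds in zip(df['category'], df['category_pred'])]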
