Extract certain words from a column in a pandas df - Python

I have a pandas df in which one column, message, is a string containing data like the below:
df['message']
2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n
So I want to extract the raddr from the data and join it back to the df.
I am doing it with the code below, on the assumption that it is at position 7 after the split:
df[['raddr']] = df['message'].str.split(' ', n=100, expand=True)[[7]]
df['raddr']=df['raddr'].str[6:]
The issue is that in some rows it ends up at position 8 and in some at 7, so for those rows I get rport instead of raddr.
How can I extract it based on a string search instead of a split?
Note: I also want a fast approach, as I am doing this on hundreds of thousands of records every minute.

You can use Series.str.extract:
df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)') # not tested
The pattern has only one capturing group with the value after the equal sign. It will capture any combination of digits and periods until it finds something else (a blank space, letter, symbol, or end of line).
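As a quick sanity check, here is a minimal sketch on a one-row frame built from the sample message (expand=False makes extract return a Series rather than a one-column DataFrame):
import pandas as pd

df = pd.DataFrame({'message': ['lport=12270 raddr=11.00.111.212 rport=3004']})
df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)', expand=False)
print(df['raddr'][0])  # 11.00.111.212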

>>> import re
>>> s = '2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n'
>>> re.search(r'raddr=.*?\s', s).group()
'raddr=11.00.111.212 '
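To get just the address without the raddr= prefix, capture it in a group instead of slicing the match afterwards:
>>> re.search(r'raddr=(\S+)', s).group(1)
'11.00.111.212'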

Capturing columns with similar patterns with Python regex

I'm scraping a pdf using regex and Python. The patterns repeat through each column. I don't understand how to target each column of information separately.
Text string:
2000 2001 2002 2003\n
14,756 10,922 9,745 12,861\n
9,882 11,568 8,176 10,483\n
13,925 10,724 10,032 8,927\n
I need to return the data by year like:
[('2000', '14,756', '9,882', '13,925'),
('2001', '10,922', '11,568', '10,724'),
('2002', '9,745', '8,176', '10,032'),
('2003', '12,861', '10,483', '8,927')]
Once I have the regex, I understand how to pull it from the page and put it into a df. I'm just not understanding how to target the columns separately; I just capture everything all at once.
I am afraid it is impossible to capture columns directly, but you can match the row values as regex groups and then transpose with zip.
(?:^|\n)([\d,]+)\s([\d,]+)\s([\d,]+)\s([\d,]+)(?:$|\n)
See how this regex works.
import re
text = """2000 2001 2002 2003
14,756 10,922 9,745 12,861
9,882 11,568 8,176 10,483
13,925 10,724 10,032 8,927"""
pattern = r"(?:^|\n)([\d,]+)\s([\d,]+)\s([\d,]+)\s([\d,]+)(?:$|\n)"
grouped = re.findall(pattern, text, flags=re.M)
columns = list(zip(*grouped)) # the expected result
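For reference, on the sample text the transposed result is:
>>> columns
[('2000', '14,756', '9,882', '13,925'),
 ('2001', '10,922', '11,568', '10,724'),
 ('2002', '9,745', '8,176', '10,032'),
 ('2003', '12,861', '10,483', '8,927')]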

Effective way to regexp match pandas and strip inside df?

Hoping someone on here is kind enough to at least point me in the right direction. Overall, I'm trying to run a regex match on each row and produce the output shown under 'desired example output' below.
To elaborate, data is being matched from a 'Device Pool' column in a rather large CSV (all settings from a phone). I need to:
keep only the regexp match in the Device Pool column while still having it correspond to the Directory Number 1 data (I may add other columns later)
strip the D used in the regexp as well, since it is only useful for the initial lookup
Example input data (humongous file with lots of columns):
... ,Device Pool,CSS,forward 1,Directory Number 1, ...
YART01-432-D098-00-1,CSS-bobville-1,12223041234,12228675309
BART-1435-C512-00-1,CSS-willis-3,12223041234,12228215486
HOMER-1435-A134-00-1,CSS-willis-2,12223041238,12228212345
VAR05-1435-D099-00-1,CSS-willis-2,12223041897,12228215486
POA01-4-D100-00-1,CSS-hmbrgr-chz,12223043151,12228843454
...
I tried a few different approaches, to no avail. With findall, I'd have to add the other columns back, I guess (that doesn't seem very efficient): it was pulling the match but not the other columns belonging to the row, so I dropped that direction. Surely there is a cleaner way, one that might even drop the need to filter first. This is where I'm at:
df1 = pd.read_csv('some_file.csv')
dff1 = df1.filter(items=['Device Pool', 'Directory Number 1'])
df2 = dff1.loc[dff1.iloc[:, 0].str.contains('[D][0-9][0-9][0-9]', regex=True)]
dff2 = # stuck here
current example output:
Device Pool Directory Number 1
YART01-432-D098-00-1 12228675309
VAR05-1435-D099-00-1 12228215486
POA01-4-D100-00-1 12228843454
...
desired example output:
Device Pool Directory Number 1
098 12228675309
099 12228215486
100 12228843454
...
I'll be using these trimmed numbers to reference an address code csv, then pulling coordinates from geo location code, to then map. Pretty fun project really.
You can use
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
Or, with Series.str.extract:
df['Device Pool'] = df['Device Pool'].str.extract(r'-D(\d+)', expand=False)
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'Device Pool':['YART01-432-D098-00-1', 'VAR05-1435-D099-00-1', 'POA01-4-D100-00-1'], 'Directory Number 1':['12228675309', '12228215486', '12228843454']})
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
>>> df
Device Pool Directory Number 1
0 098 12228675309
1 099 12228215486
2 100 12228843454
The .*-D(\d+).* regex matches
.* - any zero or more chars other than line break chars as many as possible
-D - a -D string
(\d+) - Group 1: one or more digits
.* - the rest of the line.
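Applied to the original, unmodified column, the extract variant returns the same digits:
>>> df['Device Pool'].str.extract(r'-D(\d+)', expand=False)
0    098
1    099
2    100
Name: Device Pool, dtype: object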

Extract columns from string

I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
I made some tests with str.extract but couldn't get what I wanted.
Any ideas on how to achieve it?
Thanks
You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':
df["Type"] = df.main_col.str.split("(", expand = True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0]
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0]
One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
In the above, I assume that:
Type takes the two values Future or Stock
conId consists of digits
symbol consists of capital alphabet letters
localSymbol consists of digits and capital alphabet letters
You may want to adapt the pattern to better fit your needs.
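If you then want to attach the extracted columns to the original frame, a minimal sketch (assuming the source column is named col as above):
pattern = r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'"
df = df.join(df.col.str.extract(pattern))  # named groups become new columns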

Alignment of column names and its corresponding rows in Python

I have a CSV file which is very messy in terms of column and row alignment. In the first cell, all column names are stated, but they do not align with the rows beneath, so when I load this CSV in Python using pandas I do not get a clean dataframe.
The desired example output further below shows how it should look, with the columns separated and matching the rows.
Some details:
Few lines of raw CSV file:
Columns:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu"
Rows:
ITLT4301;1;"1-5-2018";976439;35059255;53842;6545371441;3235864;95200029;"MemActive";"4096";"0";"0"
Code:
df = pd.read_csv(file_location, sep=";")
Output when loading the dataframe in python:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu",,,
ITLT4301;1;"1-5-2018";976439,35059255 53842,6545371441 3235864,"95200029 MemActive"" 4096"" 0"" 0"""
Desired output:
VMName Cluster time AvgValue MinValue MaxValue MetricId MemoryMB CpuMHz
ITLT4301 1 1-5-201 976439 35059255 53842 6545371441 95200029 MemActive
NumCpu
4096
Hopefully this clears up the topic and the problem a bit. The desired output is a well-organized dataframe where the columns match the rows based on the separator sign ";".
Your input data file is not a standard CSV file. The correct way would be to fix the previous step in order to get a normal CSV file, instead of a mess of double quotes that prevents any decent CSV parser from correctly extracting the data.
As a workaround, it is possible to remove the initial and terminating double quote, remove any doubled double quote, and split every line on the semicolon, ignoring any remaining double quote. Optionally, you could also just remove every double quote and split the lines on ';' (a sketch of that variant follows the example below). It really depends on what values you expect.
A possible implementation:
import pandas as pd

def split_line(line):
    '''Split a line on ';' after stripping whitespace; the initial and
    terminating " as well as any doubled double quotes are also removed.'''
    return line.strip()[1:-1].replace('""', '').split(';')

with open('file.dat') as fd:
    cols = split_line(next(fd))               # extract column names from the header line
    data = [split_line(line) for line in fd]  # process the data lines

df = pd.DataFrame(data, columns=cols)  # build a dataframe from that
With that input:
"VMName;""Cluster"";""time"";""AvgValue"";""MinValue"";""MaxValue"";""MetricId"";""MemoryMB"";""CpuMHz"";""NumCpu"""
"ITLT4301;1;""1-5-2018"";976439" 35059255;53842 6545371441;3235864 "95200029;""MemActive"";""4096"";""0"";""0"""
"ITLT4301;1;""1-5-2018"";98" 9443749608104;29 3435452286154;673 "067568681366;""CpuUsageMHz"";""0"";""5600"";""2"""
It gives:
VMName Cluster time AvgValue MinValue \
0 ITLT4301 1 1-5-2018 976439" 35059255 53842 6545371441
1 ITLT4301 1 1-5-2018 98" 9443749608104 29 3435452286154
MaxValue MetricId MemoryMB CpuMHz NumCpu
0 3235864 "95200029 MemActive 4096 0 0
1 673 "067568681366 CpuUsageMHz 0 5600 2
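As mentioned above, a simpler variant is to drop every double quote before splitting; a sketch, assuming no field legitimately contains a double quote:
def split_line(line):
    '''Alternative: remove all double quotes, then split on ;.'''
    return line.strip().replace('"', '').split(';')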

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using pandas. The .csv contains a column named 'About' which holds the rows of data I want to manipulate.
I've already used .str methods to update the values, but the change is not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations, since they work on the same About column, and because the values are converted to lowercase first, the regex can be changed to replace only non-lowercase characters:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
import pandas as pd

columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags:
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$', flags=re.IGNORECASE)
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ - matches one or more characters not in the range a (index 97) to z (index 122)
+ quantifier - matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ - asserts position at the end of the line
