Hoping someone here is kind enough to at least point me in the right direction. Overall, I'm trying to match a regex for each row and produce the output below (under 'desired example output').
To elaborate, data is being matched from a 'Device Pool' column in a rather large CSV (all settings from a phone). I need to:
keep only the regexp match in the Device Pool column/row while still having it correspond to the Directory Number 1 data (I may add other columns later), and
strip the D from the regexp match as well, since it is only useful for the initial lookup.
Example input data (humongous file with lots of columns):
... ,Device Pool,CSS,forward 1,Directory Number 1, ...
YART01-432-D098-00-1,CSS-bobville-1,12223041234,12228675309
BART-1435-C512-00-1,CSS-willis-3,12223041234,12228215486
HOMER-1435-A134-00-1,CSS-willis-2,12223041238,12228212345
VAR05-1435-D099-00-1,CSS-willis-2,12223041897,12228215486
POA01-4-D100-00-1,CSS-hmbrgr-chz,12223043151,12228843454
...
Tried a few different approaches to no avail. With findall, I'd have to add the other columns back, I guess (doesn't seem very efficient); it was pulling the matched data but not the other columns pertaining to each row, so I dropped that direction. Surely there is a cleaner way, one that might even drop the need to filter first. This is where I'm at:
df1 = pd.read_csv('some_file.csv')
dff1 = df1.filter(items=['Device Pool', 'Directory Number 1'])
df2 = dff1.loc[dff1.iloc[:, 0].str.contains('D[0-9][0-9][0-9]', regex=True)]
dff2 = # stuck here
current example output:
Device Pool Directory Number 1
YART01-432-D098-00-1 12228675309
VAR05-1435-D099-00-1 12228215486
POA01-4-D100-00-1 12228843454
...
desired example output:
Device Pool Directory Number 1
098 12228675309
099 12228215486
100 12228843454
...
I'll be using these trimmed numbers to reference an address-code CSV, then pulling coordinates from a geolocation code, and then mapping. Pretty fun project, really.
You can use
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
Or, with Series.str.extract:
df['Device Pool'] = df['Device Pool'].str.extract(r'-D(\d+)', expand=False)
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'Device Pool':['YART01-432-D098-00-1', 'VAR05-1435-D099-00-1', 'POA01-4-D100-00-1'], 'Directory Number 1':['12228675309', '12228215486', '12228843454']})
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
>>> df
Device Pool Directory Number 1
0 098 12228675309
1 099 12228215486
2 100 12228843454
The .*-D(\d+).* regex matches:
.* - any zero or more chars other than line break chars, as many as possible
-D - a literal -D string
(\d+) - Group 1: one or more digits
.* - the rest of the line.
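Tying this back to your snippet, a minimal end-to-end sketch (assuming the file and column names from your question) could be:
import pandas as pd

df1 = pd.read_csv('some_file.csv')
dff1 = df1.filter(items=['Device Pool', 'Directory Number 1'])

# keep rows whose Device Pool contains the D-code, then reduce that column to the digits
df2 = dff1.loc[dff1['Device Pool'].str.contains(r'D\d{3}', regex=True)].copy()
df2['Device Pool'] = df2['Device Pool'].str.extract(r'-D(\d+)', expand=False)
The .copy() avoids a SettingWithCopyWarning when assigning back into the filtered frame.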
I'm scraping a pdf using regex and Python. The patterns repeat through each column. I don't understand how to target each column of information separately.
Text string:
2000 2001 2002 2003\n
14,756 10,922 9,745 12,861\n
9,882 11,568 8,176 10,483\n
13,925 10,724 10,032 8,927\n
I need to return the data by year like:
[('2000', '14,756', '9,882', '13,925'),
('2001', '10,922', '11,568', '10,724'),
('2002', '9,745', '8,176', '10,032'),
('2003', '12,861', '10,483', '8,927')]
Once I have the regex, I understand how to pull it from the page and put it into a df. I'm just not understanding how to target the columns separately; I just capture everything all at once.
I am afraid it is impossible to capture the columns directly, but you can match the groups row by row and then transpose with zip.
(?:^|\n)([\d,]+)\s([\d,]+)\s([\d,]+)\s([\d,]+)(?:$|\n)
See how this regex works.
import re
text = """2000 2001 2002 2003
14,756 10,922 9,745 12,861
9,882 11,568 8,176 10,483
13,925 10,724 10,032 8,927"""
pattern = r"(?:^|\n)([\d,]+)\s([\d,]+)\s([\d,]+)\s([\d,]+)(?:$|\n)"
grouped = re.findall(pattern, text, flags=re.M)
columns = list(zip(*grouped)) # the expected result
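Note that the header line of years matches the pattern too, which is what puts each year at the front of its tuple after the transpose. A quick check with the sample text:
print(columns)
# [('2000', '14,756', '9,882', '13,925'),
#  ('2001', '10,922', '11,568', '10,724'),
#  ('2002', '9,745', '8,176', '10,032'),
#  ('2003', '12,861', '10,483', '8,927')]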
Related to, but distinct from, this question.
I want to output my pandas dataframe to a tsv file. The first column of my data is a pattern that actually contains 3 bits of information which I'd like to separate into their own columns:
Range c1
chr1:2953-2965 -0.001069
chr1:35397-35409 -0.001050
chr1:37454-37466 -0.001330
chr2:37997-38009 -0.001235
chrX:44465-44477 -0.001292
So I do this:
Df = Df.reset_index()
Df["Range"] = Df["Range"].str.replace( ":", "\t" ).str.replace( "-", "\t" )
Df
Range c1
0 chr1\t2953\t2965 -0.001069
1 chr1\t35397\t35409 -0.001050
2 chr1\t37454\t37466 -0.001330
3 chr2\t37997\t38009 -0.001235
4 chrX\t44465\t44477 -0.001292
All I need to do now is output with no header or index, and add one more '\t' to separate the last column and I'll have my 4-column output file as desired. Unfortunately...
Df.to_csv( "~/testout.bed",
header=None,
index=False,
sep="\t",
quoting=csv.QUOTE_NONE,
quotechar=""
)
Error: need to escape, but no escapechar set
Here is where I want to ignore this error and say "No, python, actually you Don't need to escape anything. I put those tab characters in there specifically to create column separators."
I get why this error occurs. Python thinks I forgot about those tabs, and this is a safety catch, but actually I didn't forget about anything and I know what I'm doing. I know that the tab characters in my data will be indistinguishable from column-separators, and that's exactly what I want. I put them there specifically for this reason.
Surely there must be some way to override this, no? Is there any way to ignore the error and force the output?
You can simply use str.split to split the Range column directly:
df['Range'].str.split(r":|-", expand=True)
# 0 1 2
#0 chr1 2953 2965
#1 chr1 35397 35409
#2 chr1 37454 37466
#3 chr2 37997 38009
#4 chrX 44465 44477
To retain all the columns, you can simply join this split with the original
df = df.join(df['Range'].str.split(r":|-", expand=True))
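From there, the tab-separated, four-column file from the question falls out without any embedded-tab tricks: keep the three split columns plus c1 and write with sep='\t' (a sketch, assuming the Range/c1 frame shown above and the question's output path):
out = df['Range'].str.split(r":|-", expand=True).join(df['c1'])
out.to_csv('~/testout.bed', sep='\t', header=False, index=False)
Because the split already yields real columns, no field contains a tab character, so no quoting or escaping is needed.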
I have a pandas df in which one column, message, holds a string, with data like below:
df['message']
2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n
So I want to extract the raddr from the data and join it back to the df.
I am doing it with the code below, having thought that it's at position 7 after the split:
df[['raddr']]=df['message'].str.split(' ', 100, expand=True)[[7]]
df['raddr']=df['raddr'].str[6:]
The issue is that in some rows it lands at position 8 and in some at 7, so in some rows it gives me rport and not raddr.
How can I extract it with a string search instead of a split?
Note: I also want a faster approach, as I am doing this on hundreds of thousands of records every minute.
You can use Series.str.extract:
df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)') # not tested
The pattern has only one capturing group with the value after the equal sign. It will capture any combination of digits and periods until it finds something else (a blank space, letter, symbol, or end of line).
>>> import re
>>> s = '2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n'
>>> re.search('raddr=.*?\s',s).group()
'raddr=11.00.111.212 '
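As a quick pandas sanity check (using a trimmed stand-in for the full log line above; expand=False makes extract return a Series):
import pandas as pd

# trimmed sample containing the fields around raddr
df = pd.DataFrame({'message': ['... lport=12270 raddr=11.00.111.212 rport=3004 d=5 ...']})
df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)', expand=False)
print(df['raddr'].iloc[0])  # 11.00.111.212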
I have credit card charge data that has a column containing the description for the charge. I also created a dictionary that contains categories for different charges. For example, I have a category called grocery expenses (value) and regular expressions (Ralphs, Target). I combined my values in a string with the separator |.
I am using the Series.str.contains(pat,case=True,flags=0,na=nan,regex=True) function to see if the string in each index contains my regular expressions.
# libraries needed
# import pandas as pd
# import re
joined_string=['|'.join(value) for value in values]
the_list=joined_string
Example output: the_list=[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK"]
df['Description']='FOOD4LESS 0508 0000FULLERTON CA'
The DataFrame contains a column of different charges on your credit card.
for character_sequence in the_list:
    boolean_output=df['Description'].str.contains(character_sequence,regex=True)
For some reason, the code is not going through each character sequence in my list. It only goes through one character sequence, but I need it to go through multiple character sequences.
Since there is no data to compare with, I will just present some dummy data.
import pandas as pd
names = ['Adam','Barry','Chuck','Dennis','Elon','Fridman','George','Harry']
df = pd.DataFrame(names, columns=['Names'])
# Apply regex and save to column: Regex
df['Regex'] = df.Names.str.contains('[ae]', regex=True)
df
Output:
Names Regex
0 Adam True
1 Barry True
2 Chuck False
3 Dennis True
4 Elon False
5 Fridman True
6 George True
7 Harry True
Solution with another Example akin to the Problem
First, your the_list variable is not correct. Assuming it is a typo, I present my solution here. Please note that applying a regex to a column of data essentially means you are trying to find some pattern, and you need at least a few data points to validate the regex results. Since you only provided one line of data, I will make some dummy data here and test whether the regex produces the expected results.
Note: Please check the Data Preparation section to see the data so you can replicate and test the solution.
import pandas as pd
import re
# Make regex string from the list of target keywords
regex_expression = '|'.join(the_list)
# Make dataframe from the list of descriptions
# --> see under Data section of the solution.
df = pd.DataFrame(descriptions, columns=['Description'])
# Regex search results for a subset of
# target keywords: "Gas|Internet|Water|Electricity,VONS"
df['Regex_A'] = df.Description.str.contains("Gas|Internet|Water|Electricity,VONS", regex=True)
# Regex search result of all target keywords
df['Regex_B'] = df.Description.str.contains(regex_expression, regex=True)
df
Output:
Description Regex_A Regex_B
0 FOOD4LESS 0508 0000FULLERTON CA False True
1 Electricity,VONS 0777 0123FULLERTON NY True True
2 PAVILIONS 1248 9800Ralphs MA False True
3 SPROUTS 9823 0770MARKET#WORK WI False True
4 Internet 0333 1008Water NJ True True
5 Enternet 0444 1008Wager NJ False False
Data Preparation
In a practical scenario, I would assume that for the type of problem you presented in the question, you would have a list of words that you would like to look for in the dataframe column.
So, I took the liberty to first convert your string into a list of strings.
the_list="[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK]"
the_list = the_list.replace("[","").replace("]","").split("|")
the_list
Output:
['Gas',
'Internet',
'Water',
'Electricity,VONS',
'RALPHS',
'Ralphs',
'PAVILIONS',
'FOOD4LESS',
"TRADER JOE'S",
'GROCERY OUTLET',
'FOOD 4 LESS',
'SPROUTS',
'MARKET#WORK']
Also, we make five rows of data that contain the keywords we are looking for, and then add one more row where we expect False as the result of the regex pattern search.
descriptions = [
'FOOD4LESS 0508 0000FULLERTON CA',
'Electricity,VONS 0777 0123FULLERTON NY',
'PAVILIONS 1248 9800Ralphs MA',
'SPROUTS 9823 0770MARKET#WORK WI',
'Internet 0333 1008Water NJ',
'Enternet 0444 1008Wager NJ',
]
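If the end goal is the category dictionary mentioned in the question, one hedged extension of the same idea is to join each category's keywords into its own pattern and call str.contains once per category. The category_patterns mapping below is a hypothetical stand-in for your dictionary, applied to the df built above:
# hypothetical category -> joined-pattern mapping
category_patterns = {
    'Utilities': 'Gas|Internet|Water|Electricity',
    'Groceries': "VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|SPROUTS|MARKET#WORK",
}
for category, pattern in category_patterns.items():
    df[category] = df['Description'].str.contains(pattern, regex=True)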
I have a CSV file which is very messy in terms of column and row alignment. In the first cell, all column names are stated, but they do not align with the rows beneath, so when I load this CSV in Python using pandas, I do not get a clean dataframe.
The picture below shows an example of how it should look, with the columns separated and matching the rows.
Some details:
Few lines of raw CSV file:
Columns:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu"
Rows:
ITLT4301;1;"1-5-2018";976439;35059255;53842;6545371441;3235864;95200029;"MemActive";"4096";"0";"0"
Code:
df = pd.read_csv(file_location, sep=";")
Output when loading the dataframe in python:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu",,,
ITLT4301;1;"1-5-2018";976439,35059255 53842,6545371441 3235864,"95200029 MemActive"" 4096"" 0"" 0"""
Desired output:
VMName Cluster time AvgValue MinValue MaxValue MetricId MemoryMB CpuMHz
ITLT4301 1 1-5-201 976439 35059255 53842 6545371441 95200029 MemActive
NumCpu
4096
Hopefully this clears up the topic and the problem a bit. The desired output is a well-organized data frame where the columns match the rows, based on the separator sign ";".
Your input data file is not a standard csv file. The correct way would be to fix the previous step in order to get a normal csv file, instead of a mess of double quotes preventing any decent csv parser from correctly extracting the data.
As a workaround, it is possible to remove the initial and terminating double quote, remove any doubled double quote, and split every line on semicolons, ignoring any remaining double quote. Optionally, you could also try to just remove every double quote and split the lines on ';'. It really depends on what values you expect.
A possible code could be:
import pandas as pd

def split_line(line):
    '''Split a line on ";" after stripping whitespace and the initial and
    terminating double quote; doubled double quotes are also removed.'''
    return line.strip()[1:-1].replace('""', '').split(';')
with open('file.dat') as fd:
cols = split_line(next(fd)) # extract column names from header line
data = [split_line(line) for line in fd] # process data lines
df = pd.DataFrame(data, columns=cols) # build a dataframe from that
With that input:
"VMName;""Cluster"";""time"";""AvgValue"";""MinValue"";""MaxValue"";""MetricId"";""MemoryMB"";""CpuMHz"";""NumCpu"""
"ITLT4301;1;""1-5-2018"";976439" 35059255;53842 6545371441;3235864 "95200029;""MemActive"";""4096"";""0"";""0"""
"ITLT4301;1;""1-5-2018"";98" 9443749608104;29 3435452286154;673 "067568681366;""CpuUsageMHz"";""0"";""5600"";""2"""
It gives:
VMName Cluster time AvgValue MinValue \
0 ITLT4301 1 1-5-2018 976439" 35059255 53842 6545371441
1 ITLT4301 1 1-5-2018 98" 9443749608104 29 3435452286154
MaxValue MetricId MemoryMB CpuMHz NumCpu
0 3235864 "95200029 MemActive 4096 0 0
1 673 "067568681366 CpuUsageMHz 0 5600 2
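The optional variant mentioned earlier, just dropping every double quote before splitting, would be a one-line change to split_line (a sketch under the same assumptions about the input):
def split_line(line):
    '''Alternative: drop all double quotes, then split on ";".'''
    return line.strip().replace('"', '').split(';')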