I'm scraping a PDF using regex and Python. The patterns repeat through each column, and I don't understand how to target each column of information separately.
Text string:
2000 2001 2002 2003\n
14,756 10,922 9,745 12,861\n
9,882 11,568 8,176 10,483\n
13,925 10,724 10,032 8,927\n
I need to return the data by year like:
[('2000', '14,756', '9,882', '13,925'),
('2001', '10,922', '11,568', '10,724'),
('2002', '9,745', '8,176', '10,032'),
('2003', '12,861', '10,483', '8,927')]
Once I have the regex, I understand how to pull it from the page and put it into a df. I'm just not understanding how to target the columns separately; I just capture everything all at once.
A regex can't capture whole columns directly, but you can match each row's values as capture groups and then transpose the result with zip.
(?:^|\n)([\d,]+)\s([\d,]+)\s([\d,]+)\s([\d,]+)(?:$|\n)
Here is how this regex works on your sample:
import re
text = """2000 2001 2002 2003
14,756 10,922 9,745 12,861
9,882 11,568 8,176 10,483
13,925 10,724 10,032 8,927"""
pattern = r"(?:^|\n)([\d,]+)\s([\d,]+)\s([\d,]+)\s([\d,]+)(?:$|\n)"
grouped = re.findall(pattern, text, flags=re.M)  # the header row of years is matched first
columns = list(zip(*grouped))  # transpose rows into columns: the expected result
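Since you said you already know how to put the result into a df, here is a minimal sketch of that last step (assuming pandas; the first element of each column tuple is the year, the rest are that year's values):
import pandas as pd
# Build one DataFrame column per year from the transposed tuples
df = pd.DataFrame({col[0]: col[1:] for col in columns})
# Optional: turn '14,756'-style strings into integers
df = df.apply(lambda s: s.str.replace(',', '').astype(int))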
Hoping someone on here is kind enough to at least point me in the right direction. Overall, I'm trying to apply a regex to each row and produce the output shown under 'desired example output' below.
To elaborate, the data is being matched from a 'Device Pool' column in a rather large CSV (all settings from a phone). I need to:
keep only the regex match from the Device Pool column while it still corresponds to the Directory Number 1 data (I may add other columns later), and
strip the D from the match as well, as it is only useful for the initial lookup.
Example input data (humongous file with lots of columns):
... ,Device Pool,CSS,forward 1,Directory Number 1, ...
YART01-432-D098-00-1,CSS-bobville-1,12223041234,12228675309
BART-1435-C512-00-1,CSS-willis-3,12223041234,12228215486
HOMER-1435-A134-00-1,CSS-willis-2,12223041238,12228212345
VAR05-1435-D099-00-1,CSS-willis-2,12223041897,12228215486
POA01-4-D100-00-1,CSS-hmbrgr-chz,12223043151,12228843454
...
Tried a few different approaches to no avail. With findall, I'd have to add the other columns back, I guess (doesn't seem very efficient); it was pulling the data but not the other columns pertaining to the row, so I dropped that direction. Surely there is a cleaner way, one that might even drop the need to filter first. This is where I'm at:
df1 = pd.read_csv('some_file.csv')
dff1 = df1.filter(items=['Device Pool', 'Directory Number 1'])
df2 = dff1.loc[dff1.iloc[:, 0].str.contains('D[0-9][0-9][0-9]', regex=True)]
dff2 = # stuck here
current example output:
Device Pool Directory Number 1
YART01-432-D098-00-1 12228675309
VAR05-1435-D099-00-1 12228215486
POA01-4-D100-00-1 12228843454
...
desired example output:
Device Pool Directory Number 1
098 12228675309
099 12228215486
100 12228843454
...
I'll be using these trimmed numbers to reference an address code csv, then pulling coordinates from geo location code, to then map. Pretty fun project really.
You can use
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
Or, with Series.str.extract:
df['Device Pool'] = df['Device Pool'].str.extract(r'-D(\d+)', expand=False)
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'Device Pool':['YART01-432-D098-00-1', 'VAR05-1435-D099-00-1', 'POA01-4-D100-00-1'], 'Directory Number 1':['12228675309', '12228215486', '12228843454']})
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
>>> df
Device Pool Directory Number 1
0 098 12228675309
1 099 12228215486
2 100 12228843454
The .*-D(\d+).* regex matches
.* - any zero or more chars other than line break chars, as many as possible
-D - a literal -D string
(\d+) - Group 1: one or more digits
.* - the rest of the line.
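You mentioned that a cleaner way might drop the need to filter first; str.extract helps there too, because rows without a -Dnnn code come back as NaN and can be dropped afterwards. A minimal sketch using one matching and one non-matching row from your sample:
import pandas as pd

df = pd.DataFrame({'Device Pool': ['YART01-432-D098-00-1', 'BART-1435-C512-00-1'],
                   'Directory Number 1': ['12228675309', '12228215486']})
df['Device Pool'] = df['Device Pool'].str.extract(r'-D(\d+)', expand=False)
df = df.dropna(subset=['Device Pool'])  # the BART row has no -Dnnn code, so it is dropped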
I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
I made some tests with str.extract but couldn't get what I want.
Any ideas on how to achieve it?
Thanks
You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':
df["Type"] = df.main_col.str.split("(", expand = True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0]
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0]
One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
In the above, I assume that:
Type takes one of the two values Future or Stock
conId consists of digits
symbol consists of capital letters
localSymbol consists of digits and capital letters
You may want to adapt the pattern to better fit your needs.
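If you later need fields that can appear in any order or be missing in some rows, a generic alternative (my own sketch, not part of the answer above; it assumes the column is named col as in the demo) is to parse every key=value pair and expand to columns:
import re
import pandas as pd

def parse(row):
    # Everything before the first '(' is the type
    fields = {'Type': row.split('(', 1)[0]}
    # key=value pairs: the value is either 'quoted' or a bare token
    for key, value in re.findall(r"(\w+)=('[^']*'|[\w.]+)", row):
        fields[key] = value.strip("'")
    return fields

out = df.col.map(parse).apply(pd.Series)
out = out[['Type', 'conId', 'symbol', 'localSymbol']]  # keep only what you need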
I have a pandas df in which one column, message, contains strings with data like below:
df['message']
2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n
So I want to extract the raddr from the data and join it back to the df.
I am doing it with the code below, assuming raddr sits at position 7 after the split:
df[['raddr']] = df['message'].str.split(' ', n=100, expand=True)[[7]]
df['raddr'] = df['raddr'].str[6:]  # drop the 'raddr=' prefix
The issue is that in some rows it lands at position 8 rather than 7, so for those rows I get rport instead of raddr.
How can I extract it with a string search instead of a split?
Note: I also want a faster approach, as I am doing this on hundreds of thousands of records every minute.
You can use Series.str.extract:
df['raddr'] = df['message'].str.extract(r'raddr=([\d.]+)', expand=False)
The pattern has a single capturing group for the value after the equals sign: it captures any run of digits and periods and stops when it finds anything else (a space, letter, other symbol, or end of line).
>>> import re
>>> s = '2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n'
>>> re.search(r'raddr=.*?\s', s).group()
'raddr=11.00.111.212 '
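The same search, vectorized over a df column (here with a trimmed slice of the sample message); because it keys on the raddr= label rather than a position, the 7-vs-8 shift no longer matters:
import pandas as pd

s = 'lport=12270 raddr=11.00.111.212 rport=3004 d=5'  # trimmed from the sample message
df = pd.DataFrame({'message': [s]})
df['raddr'] = df['message'].str.extract(r'raddr=([\d.]+)', expand=False)
print(df['raddr'].iloc[0])  # 11.00.111.212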
The question has been asked a lot; however, I'm still not close to the solution. I have a column with values like Vietnam_Vietnamese_display 1, Indonesia_Tamil__1, and India_Tamil_Video_5. What I want to do is separate the country and language into different columns, like
Country Language
Vietnam Vietnamese_display 1
Indonesia Tamil__1
India Tamil_Video_5
I'm using the following code to get it done; however, there are a lot of factors that need to be taken into account and I'm not sure how to handle them.
df[['Country', 'Language']] = df['Line Item'].str.split('_\s+', n=1, expand=True)
How can I skip the first "_" to get my desired results? Thanks
You may use
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
Details
^ - start of string
_* - 0 or more underscores
([^_]+) - Capturing group 1: any one or more chars other than _
_ - a _ char
(.+) - Group 2: any one or more chars other than line break chars.
Pandas test:
df = pd.DataFrame({'Line Item': ['Vietnam_Vietnamese_display 1','Indonesia_Tamil__1','India_Tamil_Video_5']})
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
df
# Line Item Country Language
# 0 Vietnam_Vietnamese_display 1 Vietnam Vietnamese_display 1
# 1 Indonesia_Tamil__1 Indonesia Tamil__1
# 2 India_Tamil_Video_5 India Tamil_Video_5
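If you would rather stay close to your own split attempt, stripping the leading underscores first and then splitting on the first remaining one also works on this test data (a sketch; rows without an underscore after the country would get None in the second column):
df[['Country', 'Language']] = df['Line Item'].str.lstrip('_').str.split('_', n=1, expand=True)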
I have credit card charge data with a column containing the description for each charge. I also created a dictionary that maps categories to different charges; for example, a category called grocery expenses whose values are regular expressions (Ralphs, Target). I combined the values into one string with the separator |.
I am using the Series.str.contains(pat, case=True, flags=0, na=nan, regex=True) function to see if the string at each index contains my regular expressions.
# libraries needed
# import pandas as pd
# import re
joined_string=['|'.join(value) for value in values]
the_list=joined_string
Example output: the_list=[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK"]
df['Description']='FOOD4LESS 0508 0000FULLERTON CA'
The DataFrame contains a column of different charges on your credit card.
for character_sequence in the_list:
    boolean_output = df['Description'].str.contains(character_sequence, regex=True)
For some reason, the code is not going through each character sequence in my list: it only runs against one character sequence, but I need it to go through all of them.
Since there is no data to compare with, I will just present some dummy data.
import pandas as pd
names = ['Adam','Barry','Chuck','Dennis','Elon','Fridman','George','Harry']
df = pd.DataFrame(names, columns=['Names'])
# Apply regex and save to column: Regex
df['Regex'] = df.Names.str.contains('[ae]', regex=True)
df
Output:
Names Regex
0 Adam True
1 Barry True
2 Chuck False
3 Dennis True
4 Elon False
5 Fridman True
6 George True
7 Harry True
Solution with another Example akin to the Problem
First, your the_list variable is not correct. Assuming it is a typo, I present my solution here. Please note that a regex (regular expression), when applied to a column of data, essentially means you are trying to find some pattern. How would you know or check in the first place that your pattern recognition is working correctly? You would need at least a few data points to validate the regex results. Since you only provided one line of data, I will make some dummy data here and test whether the regex produces the expected results.
Note: Please check the Data Preparation section to see the data so you can replicate and test the solution.
import pandas as pd
import re
# Make regex string from the list of target keywords
regex_expression = '|'.join(the_list)
# Make dataframe from the list of descriptions
# --> see under Data section of the solution.
df = pd.DataFrame(descriptions, columns=['Description'])
# Regex search results for a subset of
# target keywords: "Gas|Internet|Water|Electricity,VONS"
df['Regex_A'] = df.Description.str.contains("Gas|Internet|Water|Electricity,VONS", regex=True)
# Regex search result of all target keywords
df['Regex_B'] = df.Description.str.contains(regex_expression, regex=True)
df
Output:
Description Regex_A Regex_B
0 FOOD4LESS 0508 0000FULLERTON CA False True
1 Electricity,VONS 0777 0123FULLERTON NY True True
2 PAVILIONS 1248 9800Ralphs MA False True
3 SPROUTS 9823 0770MARKET#WORK WI False True
4 Internet 0333 1008Water NJ True True
5 Enternet 0444 1008Wager NJ False False
Data Preparation
In a practical scenario, I would assume that for the type of problem you presented in the question, you would have a list of words that you would like to look for in the dataframe column.
So, I took the liberty to first convert your string into a list of strings.
the_list="[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK]"
the_list = the_list.replace("[","").replace("]","").split("|")
the_list
Output:
['Gas',
'Internet',
'Water',
'Electricity,VONS',
'RALPHS',
'Ralphs',
'PAVILIONS',
'FOOD4LESS',
"TRADER JOE'S",
'GROCERY OUTLET',
'FOOD 4 LESS',
'SPROUTS',
'MARKET#WORK']
Also, we make five rows of data that contain the keywords we are looking for, and then add a sixth row where we expect False as the result of the regex pattern search.
descriptions = [
'FOOD4LESS 0508 0000FULLERTON CA',
'Electricity,VONS 0777 0123FULLERTON NY',
'PAVILIONS 1248 9800Ralphs MA',
'SPROUTS 9823 0770MARKET#WORK WI',
'Internet 0333 1008Water NJ',
'Enternet 0444 1008Wager NJ',
]
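To tie this back to the original goal of labeling each charge: once every category's keywords are joined with |, you can write a category per row instead of a plain boolean. A minimal sketch using the descriptions list above (the shape of the categories dict is my assumption from the question; re.escape guards keywords against special regex characters, and on overlaps the later category wins):
import re
import pandas as pd

# Assumed shape of the original category dictionary
categories = {
    'Utilities': ['Gas', 'Internet', 'Water', 'Electricity'],
    'Groceries': ['VONS', 'RALPHS', 'Ralphs', 'PAVILIONS', 'FOOD4LESS',
                  "TRADER JOE'S", 'GROCERY OUTLET', 'FOOD 4 LESS',
                  'SPROUTS', 'MARKET#WORK'],
}

df = pd.DataFrame(descriptions, columns=['Description'])
for category, keywords in categories.items():
    pattern = '|'.join(re.escape(keyword) for keyword in keywords)
    matches = df['Description'].str.contains(pattern, regex=True)
    df.loc[matches, 'Category'] = category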