python keyword search in csv comments - python

I am trying to run a multiple-keyword search on a CSV file, restricted to the comments column. For some reason, when I run the search I get this error message: 'DataFrame' object has no attribute 'description'
for example
table1.csv
id_Acco, user_name, post_time comments
1543603, SameDavie , "2020/09/06" The car in the house
1543595, Johntim, "2020/09/11" You can filter the data
1558245, ACAtesdfgsf , "2020/09/19" if you’re looking at a ship
1558245, TDRtesdfgsf , "2020/09/19" you can filter the table to show
Output
id_Acco, user_name, post_time comments
1543603, SameDavie , "2020/09/06" The car in the house
1543595, Johntim, "2020/09/11" You can filter the data
1558245, TDRtesdfgsf , "2020/09/19" you can filter the table to show
code
df = pd.read_csv('table1.csv')
df[df.description.str.contains('house| filter | table | car')]
df.to_csv('forum_fraud_date_keyword.csv')

The error occurs because the DataFrame has no description column; the text to search is in comments. You can filter with a regex using .str.contains():
df = df.loc[df.comments.str.contains(r'\b(?:house|filter|table|car)\b')]
Here, an r-string (raw string) is used so that regex metacharacters such as \b reach the regex engine unchanged.
The four target words are wrapped in a pair of \b word boundaries so that only whole words match, not partial strings. E.g. carmen will not match car, and tablespoon will not match table. If you want to match partial strings, remove the pair of \b from the regex above.
You can look at this Regex Demo for the matching demo.
Result:
print(df)
id_Acco, user_name, post_time comments
0 1543603, SameDavie , "2020/09/06" The car in the house
1 1543595, Johntim, "2020/09/11" You can filter the data
3 1558245, TDRtesdfgsf , "2020/09/19" you can filter the table to show
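As a minimal, self-contained sketch of the whole-word filter: the frame below is a made-up stand-in for table1.csv (the `carmen`/`tablespoon` row is added only to show what the word boundaries exclude), and `na=False` is an assumption added here so that rows with missing comments count as non-matches.

```python
import pandas as pd

# Hypothetical sample standing in for table1.csv
df = pd.DataFrame({
    "user_name": ["SameDavie", "Johntim", "ACAtesdfgsf", "Carmen"],
    "comments": [
        "The car in the house",
        "You can filter the data",
        "if you're looking at a ship",
        "carmen bought a tablespoon",
    ],
})

# \b...\b restricts the match to whole words; (?:...) is a non-capturing group
pattern = r"\b(?:house|filter|table|car)\b"
mask = df["comments"].str.contains(pattern, na=False)

# Boolean indexing returns a new frame, so the result has to be assigned
filtered = df.loc[mask]
print(filtered)
```

The filtered frame can then be written out with `filtered.to_csv(...)`. In the question's code the result of the indexing expression was never assigned, so `df.to_csv` would have written the unfiltered data.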

Related

How do I match a string in a pandas column then return what follows it?

I have a pandas dataframe which contains a column containing twitter profile descriptions. In some of these description, there are strings like 'insta: profile_name'.
How can I create a line of code which would search for a string (eg, 'insta:' or 'instagram:') and then return the rest of the string of whatever is next to it?
1252: 'lad who loves to cook 🥘 • insta: xxx',
1254: 'founder and head chef | insta: xxx |',
1992: '🇬🇧 |bakery instagram - xxx',
2291: 'insta: #xxx for enquiries'
2336: 'self taught baker. ig:// xxxx 🍥🧆',
You can use Regex to match each of the keywords such as: Insta
The code should be something like this:
import re
container = []
for word in ["insta", "face"]:  # list of keywords
    # 'Regex Syntax' is a placeholder for the pattern that captures what follows the keyword
    _tag = re.findall(word + 'Regex Syntax', the_string_to_find_from)
    container.append([word, _tag])
You can then unpack the resulting container variable when you want the results. I can help you write the regex syntax, but I need more information on how the required information is wrapped in the text.
Answer provided by Nk03 in the comments:
df['name'].str.extract(pat = r'(insta:|ig:)(.*)')[1].str.strip('\',')
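To see what that one-liner does, here is a small sketch. The descriptions are modelled on the question's examples, and the column name `name` is taken from the comment above. Stripping spaces and `|` instead of quotes and commas is an assumption made here to suit this sample data.

```python
import pandas as pd

# Hypothetical descriptions modelled on the examples in the question
df = pd.DataFrame({"name": [
    "lad who loves to cook • insta: xxx",
    "founder and head chef | insta: xxx |",
    "self taught baker",
]})

# Column 0 holds the matched keyword, column 1 everything after it
parts = df["name"].str.extract(r"(insta:|ig:)(.*)")

# Trim surrounding spaces and pipe characters from the captured remainder
handles = parts[1].str.strip(" |")
print(handles.tolist())
```

Rows that contain none of the keywords come back as missing values, so they can be dropped with `handles.dropna()` if needed.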

finding an element between a tag and a list of tags using regex

I want to find elements between two different tags but the catch is the first tag is constant but the second tag can be any tag belonging to a particular list.
for example a string
'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
I have a list of tags ['TRSF','SND=','ORG=','OGB=','OBI=']
edit : added the availability of '=' in the list itself
My output should look some what like this
TRSF : BOOK TRANSFER CREDIT
SND : abcd bank , 123
ORG : qwer123
OGB : qwerasd
OBI : 123433
The order of tags, as well as the availability of the tags, may change also new tags may come into the picture
till now I was writing separate regex and string parsing code for each type but that seems impractical as the combination can be infinite
Here is what I was doing :
org = re.findall("ORG=(.*?) OGB=",string_1)
snd = re.findall("SND=(.*?) ORG=",string_1)
_, _, obi = string_1.partition('OBI=')
Is there any way to do it like
<tag>(.*?)<tag in list>
or any other method ?
If the tag list is complete, you can use a regex like
\b(TRSF|SND|ORG|OGB|OBI)\b=?\s*(.*?)(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z)
See the regex demo. Details:
\b - a word boundary
(TRSF|SND|ORG|OGB|OBI) - a tag captured into Group 1
\b - a word boundary
=? - an optional =
\s* - 0+ whitespaces
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z) - a positive lookahead that requires, immediately to the right, either zero or more whitespaces followed by a tag as a whole word, or the end of the string (\Z).
See the Python demo:
import re
s='TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
tags = ['TRSF','SND','ORG','OGB','OBI']
print( dict(re.findall(fr'\b({"|".join(tags)})\b=?\s*(.*?)(?=\s*\b(?:{"|".join(tags)})\b|\Z)', s.strip(), re.DOTALL)) )
# => {'TRSF': 'BOOK TRANSFER CREDIT', 'SND': 'abcd bank , 123', 'ORG': 'qwer123', 'OGB': 'qwerasd', 'OBI': '123433'}
Note the re.DOTALL (equal to re.S) makes the . match any chars including line break chars.
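Since the tag list may change, the same pattern can be wrapped in a small function. This is only a sketch: `split_by_tags` is a hypothetical helper name, and `re.escape` is added on the assumption that future tags might contain regex metacharacters. The sample string reorders the tags and omits some of them to show that neither matters.

```python
import re

def split_by_tags(text, tags):
    """Return {tag: value} for every tag found in text (hypothetical helper)."""
    # re.escape guards against tags that contain regex metacharacters
    alternation = "|".join(re.escape(t) for t in tags)
    pattern = re.compile(
        rf"\b({alternation})\b=?\s*(.*?)(?=\s*\b(?:{alternation})\b|\Z)",
        re.DOTALL,
    )
    # findall yields (tag, value) pairs, which dict() turns into a mapping
    return dict(pattern.findall(text.strip()))

s = "TRSF BOOK TRANSFER CREDIT OBI= 123433 SND= abcd bank , 123"
result = split_by_tags(s, ["TRSF", "SND", "ORG", "OGB", "OBI"])
print(result)
# {'TRSF': 'BOOK TRANSFER CREDIT', 'OBI': '123433', 'SND': 'abcd bank , 123'}
```

One limitation of this approach: if a value happens to contain a tag as a whole word, the value is cut at that point.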

Extracting #mentions from tweets using findall python (Giving incorrect results)

I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
    result = re.findall("(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
    print(result);
There are two problems here:
First: at str(X.iloc[:1,:]) it gives me ['CritCareMed'] which is not ok as it should give me ['CellCellPress'], and at str(X.iloc[:2,:]) it again gives me ['CritCareMed'] which is of course not fine again. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mentions in 2nd row and both two mentions in last row.
What I want should look something like this:
How can I achieve these results? This is just sample data; my original data has lots of tweets, so is this approach OK?
You can use the str.findall method to avoid the for loop, and use a negative lookbehind in place of (^|[^#\w]), which forms an extra capture group that you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
# text mention
#0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT #gvwilson: Where's the theory for software ... gvwilson
#3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
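The effect of the extra capture group is easiest to see with plain `re.findall` on a single string. The tweet below is shortened from the last row of the sample data; the variable names are only illustrative.

```python
import re

tweet = "RT #MHendr1cks: Eve Marder describes a horror via #nucAmbiguous"

# Two capture groups: findall returns (prefix, name) tuples
with_group = re.findall(r"(^|[^#\w])#(\w{1,25})", tweet)

# The lookbehind checks the preceding character without capturing it
with_lookbehind = re.findall(r"(?<![#\w])#(\w{1,25})", tweet)

print(with_group)       # [(' ', 'MHendr1cks'), (' ', 'nucAmbiguous')]
print(with_lookbehind)  # ['MHendr1cks', 'nucAmbiguous']
```

With a single capture group, `findall` returns plain strings, which is why the lookbehind version can be joined directly with `','.join`.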
Also, X.iloc[:i,:] returns a DataFrame, so str(X.iloc[:i,:]) gives you the string representation of that DataFrame, which is very different from the text in a cell. To extract the actual string from the text column, you can use X.text.iloc[0]; a better way to iterate through a column is items (called iteritems in older pandas, and removed in pandas 2.0):
import re
for index, s in df.text.items():
    result = re.findall(r"(?<![#\w])#(\w{1,25})", s)
    print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'#([^:\s]+)')
with open("test.txt") as fp:
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
    df = pd.DataFrame(dft, columns = ['text', 'mention'])
print(df)
Which yields:
text mention
0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT #gvwilson: Where's the theory for software ... gvwilson
3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) look-ahead for the end of the word or end of the sentence
(?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.

Extract hashtags from columns of a pandas dataframe

I have a dataframe df. I want to extract hashtags from tweets where Max == 45:
Max Tweets
42 via @VIE_unlike at #fashion
42 Ny trailer #katamaritribute #ps3
45 Saved a baby bluejay from dogs #fb
45 #Niley #Niley #Niley
I'm trying something like this, but it gives an empty dataframe:
df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]
Is there something in pandas that I can use to do this efficiently?
You can use pd.Series.str.findall:
In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]:
0 [#fashion]
1 [#katamaritribute, #ps3]
2 [#fb]
3 [#Niley, #Niley, #Niley]
This returns a column of lists.
If you want to filter first and then find, you can do so quite easily using boolean indexing:
In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]:
2 [#fb]
3 [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
The regex used here is:
#.*?(?=\s|$)
To understand it, break it down:
#.*? - carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) - lookahead for the end of the word or end of the sentence
If a # can appear in the middle of a word that is not a hashtag, this would yield false positives. In that case, you can modify the regex to include a lookbehind:
(?:(?<=\s)|(?<=^))#.*?(?=\s|$)
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
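A short sketch combining the row filter with the lookbehind version of the regex. The frame is hypothetical; the `C#minor` row is made up to show a `#` inside a word being skipped.

```python
import pandas as pd

# Hypothetical rows; "C#minor" contains a # that is not a hashtag
df = pd.DataFrame({
    "Max": [42, 45, 45],
    "Tweets": [
        "Ny trailer #katamaritribute #ps3",
        "Saved a baby bluejay from dogs #fb",
        "practising C#minor scales #piano",
    ],
})

# A hashtag must be preceded by whitespace or the start of the string
pattern = r"(?:(?<=\s)|(?<=^))#.*?(?=\s|$)"

# Filter rows first, then extract hashtags from the remaining tweets
tags = df.loc[df["Max"] == 45, "Tweets"].str.findall(pattern)
print(tags.tolist())
```

If one hashtag per row is preferred over a column of lists, `tags.explode()` is one way to flatten the result.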

A Python regex to find soccer team fixtures in string

I am using the Requests module to access the HTML from my target website and then using Beautiful Soup to select a specific element on the website. The element in question is a table that contains the results thus far of the English Premier League 2016/2017 season. The table contains the match date, the teams involved, the full-time score and the half-time score. I want to use Python to parse the HTML of the table element and extract the fixtures listed on there. The teams are always listed as:
Team A - Team B
A team name can be 1-3 separate strings (e.g. Burnley, Manchester United, West Ham United).
My attempt so far is:
import re
teamsRegex = re.compile(r'((\w+\s)+-(\s\w+)+)')
My logic here is that the first team can be 1-3 separate strings in length and each string is always followed by a white space. Therefore, the pattern (\w+\s)+ represents a string of any length followed by a white space and can be repeated 1 or many times. The second team name will always begin with a white space following the "-" character and again can be a string of any length, repeated 1 or many times (\s\w+)+.
I'm sort of achieving the desired results but the above is not entirely correct. I am returned a list with my desired result at index 0 followed by the first string of index 0 as index 1, and the last string in index 0 as index 2.
Example string:
'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
Regex finds:
[('Burnley - Swansea City', 'Burnley ', ' City'), ('0 - 1', '0 ', ' 1')]
I would just like it to find [('Burnley - Swansea City')]
Many thanks in anticipation of any help!
r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+'
Here you have two non-capturing groups ((?:...)) to match the teams' names, so findall returns only the full match. I chose to use letters explicitly, so the expression only matches words beginning with a capital letter and excludes digits. You should change that if team names can contain digits (like "BVB 09").
Depending on the HTML file's content, one could add a final lookahead (?= align) to increase specificity.
Edit:
To match up to three capitals and optional '&'s, try this :
r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+'
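Applied to the example string from the question, the first pattern can be checked like this. It is a minimal sketch: the variable names are illustrative, and splitting on " - " assumes that no team name itself contains that sequence.

```python
import re

html_text = "Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)"

# Non-capturing groups, so findall returns whole matches rather than tuples
teams_regex = re.compile(r"(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+")
fixtures = teams_regex.findall(html_text)
print(fixtures)  # ['Burnley - Swansea City']

# Split each fixture into home and away team names
home, away = fixtures[0].split(" - ")
print(home, "|", away)  # Burnley | Swansea City
```

The score "0 - 1" is not matched because the pattern requires each word to start with a capital letter.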
