I have a dataframe with the following columns: Date,Time,Tweet,Client,Client Simplified
The Tweet column sometimes contains a website link.
I am trying to define a function that extracts how many times a link appears in a tweet and which link it is.
I don't want the answer to the whole function. For now I am struggling with findall, before I wrap everything into a function:
import pandas as pd
import re
csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")
URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc)
The error I'm getting is:
TypeError Traceback (most recent call last)
<ipython-input-20-0085f7a99b7a> in <module>
7 # csv_doc.head()
8 tweets = csv_doc.Tweet
----> 9 URL= re.split('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',tweets)
10
11 # URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc[Tweets])
/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
229 and the remainder of the string is returned as the final element
230 of the list."""
--> 231 return _compile(pattern, flags).split(string, maxsplit)
232
233 def findall(pattern, string, flags=0):
TypeError: expected string or bytes-like object
Could you please let me know what is wrong?
Thanks.
Try adding an r in front of the string; it tells Python that this is a raw string, which keeps the backslashes in a regex pattern intact.
Also, the re module works on a single string, not on a list or Series of strings. You can use a simple list comprehension like this:
[re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',x) for x in csv_doc.Tweet]
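To see the comprehension in action, here is a minimal, self-contained sketch; the two tweets are made up to stand in for the CSV rows:

```python
import re
import pandas as pd

# Stand-in for the CSV; the real data comes from TrumpTweets.csv
csv_doc = pd.DataFrame({
    "Tweet": [
        "Check this out https://example.com/page now",
        "No link here",
    ]
})

pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# One list of matches per tweet: re.findall gets a str each time, not the DataFrame
urls = [re.findall(pattern, x) for x in csv_doc.Tweet]
```

pandas also offers a vectorized equivalent, `csv_doc["Tweet"].str.findall(pattern)`, which returns a Series of the same per-row lists.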
I'm attempting to take emails from 500 Word documents and use findall to extract them into Excel. This is the code I have so far:
import pandas as pd
from docx.api import Document
import os
import re
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path), engine='xlsxwriter')

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', wordDoc)
    data.append(match)

df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()
print(df)
and I'm getting an error showing:
TypeError Traceback (most recent call last)
Input In [6], in <cell line: 19>()
17 data = []
19 for wordDoc in worddocs_list:
---> 20 match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+',wordDoc)
21 data.append(match)
24 df = pd.DataFrame(data)
File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
233 def findall(pattern, string, flags=0):
234 """Return a list of all non-overlapping matches in the string.
235
236 If one or more capturing groups are present in the pattern, return
(...)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
What am I doing wrong here?
Many thanks.
Your wordDoc variable doesn't contain a string; it contains a Document object. You need to look at the python-docx documentation to see how to get the body of the Word document out of that object as a string.
It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?
documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
And then use that as the string to match against:
match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', documentText)
If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:
regex = re.compile(r'[\w.+-]+#[\w-]+\.[\w.-]+')
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    match = regex.findall(documentText)
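To see why joining the paragraph text fixes the TypeError, here is a self-contained sketch that uses plain strings to stand in for the p.text values (no .docx file needed); the regex is the one from the question, kept exactly as posted:

```python
import re

# The pattern from the question (note it uses '#' rather than '@', as posted)
regex = re.compile(r'[\w.+-]+#[\w-]+\.[\w.-]+')

# Stand-ins for the p.text values from wordDoc.paragraphs
paragraph_texts = [
    "Contact: alice#example.com",
    "No address here",
    "bob#test.org too",
]

# Join the paragraphs into one string, exactly as the answer suggests
documentText = '\n'.join(paragraph_texts)

# findall now receives a str, so there is no TypeError
matches = regex.findall(documentText)
```

With a real Document object, only the `paragraph_texts` line changes: build the list from `wordDoc.paragraphs` instead.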
Problem
I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16
when the code returned this error.
Goal
I need to scrape a PDF that looks like this (I wanted to attach the PDF, but I do not know how):
170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30
Approach
I am following the aforementioned tutorial, using pdfplumber.
import re
import pdfplumber
import PyPDF2
import pandas as pd
from collections import namedtuple
ap = open('test.pdf', 'rb')
I name the columns of the dataframe that I want as the final product.
Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')
Issues
I have 5 different lines to match, compared to the tutorial example, which has 2.
case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof. )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match )(\d{3})')
Issue
When I introduce the third line into the code, I get an error that I do not understand. I have modified the code as 2e0byo suggested, but I still get an error.
This is the new code:
line_items = []

with pdfplumber.open(ap) as pdf:
    page = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            line = case_li.search(line)
            if line:
                case_number = line

            line = language_li.search(line)
            if line:
                language = line.group(2)

            line = trans_li.search(line)
            if line:
                num_trans = line.group(2)

            line = fuzzy_li.search(line)
            if line:
                num_fuzzy = line.group(2)

            line = exact_li.search(line)
            if line:
                num_exact = line.group(2)

            line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))
and this is the new error:
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
10 case_number = line
11
---> 12 line = language_li.search(line)
13 if line:
14 language = line.group(2)
TypeError: expected string or bytes-like object
Goal
The goal is to append the lines to line_items and eventually:
df = pd.DataFrame(line_items)
You have reassigned line, here:
for line in text.split("\n"):
# line is a str (the line)
line = language_li.search(line)
# line is no longer a str, but the result of a re.search
so line is no longer the text line, but the result of that match. Thus trans_li.search(line) is not searching the line you thought it was.
To fix your code, adopt a consistent pattern:
for line in text.split("\n"):
    match = language_li.search(line)
    # line is still a str (the line)
    # match is the result of re.search
    if match:
        do_something(match.groups())
    ...
    # line is *still* a str
    match = trans_li.search(line)
    if match:
        ...
For completeness' sake, with the dreaded walrus operator you can now write this:
if (match := language_li.search(line)) is not None:
    do_something(match.groups())
Which I briefly thought was neater, but now think ugly. I fully expect to get downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I have even forgotten how to use it and wrote it backwards first.)
PS: you may wish to read up on variable scope in Python, although no language I know of would allow this particular scope collision (overwriting a loop variable within the loop). Incidentally, doing this kind of thing by mistake is why we conventionally avoid similarly named variables (like line and Line) and go with things like line and match instead.
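Putting the line/match pattern to work on the invoice lines from the question, a sketch might look like this; the text is pasted in as a plain string so it runs without the PDF or pdfplumber:

```python
import re

# The patterns from the question, unchanged
case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof. )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match )(\d{3})')

# Stand-in for page.extract_text(); these are the lines shown in the question
text = """170001WO01
English (US) into Arabic (DZ)
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30"""

for line in text.split('\n'):
    # 'line' stays a str for every search; matches go into 'match'
    match = case_li.search(line)
    if match:
        case_number = match.group(1)
    match = language_li.search(line)
    if match:
        language = match.group(2)
    match = trans_li.search(line)
    if match:
        num_trans = match.group(2)
    match = fuzzy_li.search(line)
    if match:
        num_fuzzy = match.group(2)
    match = exact_li.search(line)
    if match:
        num_exact = match.group(2)
```

Because line is never reassigned, every pattern searches the original text line and the TypeError disappears.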
I am trying to collect all the internal links from the Requests library documentation for Python and filter out all the external links.
I am using a regular expression to do this, but it is throwing a type error that I am unable to solve.
My code:
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://2.python-requests.org/en/master/')
content = BeautifulSoup(r.text)
[i['href'] for i in content.find_all('a') if not re.match("http", i)]
Error:
TypeError Traceback (most recent call last)
<ipython-input-10-b7d82067fe9c> in <module>
----> 1 [i['href'] for i in content.find_all('a') if not re.match("http", i)]
<ipython-input-10-b7d82067fe9c> in <listcomp>(.0)
----> 1 [i['href'] for i in content.find_all('a') if not re.match("http", i)]
~\Anaconda3\lib\re.py in match(pattern, string, flags)
171 """Try to apply the pattern at the start of the string, returning
172 a Match object, or None if no match was found."""
--> 173 return _compile(pattern, flags).match(string)
174
175 def fullmatch(pattern, string, flags=0):
TypeError: expected string or bytes-like object
You are passing it a BeautifulSoup node object, not a string. Try this:
[i['href'] for i in content.find_all('a') if not re.match("http", i['href'])]
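A minimal sketch of the same filter over plain href strings (no network request needed); the example links are made up:

```python
import re

# Stand-ins for the i['href'] values a real page would yield
hrefs = ['index.html', 'https://example.com/external', '#section', 'api/user.html']

# re.match now receives a str, so the filter works
internal = [h for h in hrefs if not re.match("http", h)]
```

For a fixed prefix like this, `not h.startswith("http")` does the same job without a regex.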
I am trying to split a document into paragraphs first, and then split each paragraph into lines. Then I check the lines and print the paragraph.
Although I am able to achieve that with the code below, an 'expected string or buffer' error shows up when I try to do the same for multiple documents.
with io.open(input_path, mode='r') as f, io.open(write_path, mode='w') as f2:
    data = f.read()
    splat = re.split(r"\n(\s)*\n", data)
    mylist = []
    for para1 in splat:
        splat2 = re.split(r"\n", para1)
        for line1 in splat2:
            # PERFORM SOME OPERATION
Error
<ipython-input-218-18e633df1d46> in custom_section(input_path, write_path)
14 mylist=[]
15 for para1 in splat:
---> 16 splat2= re.split(r"\n", para1)
17 for line1 in splat2:
18 # line1 = line1.decode("utf-8")
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.pyc in split(pattern, string, maxsplit, flags)
169 """Split the source string by the occurrences of the pattern,
170 returning a list containing the resulting substrings."""
--> 171 return _compile(pattern, flags).split(string, maxsplit)
172
173 def findall(pattern, string, flags=0):
TypeError: expected string or buffer
This error occurs because the list returned as your variable splat contains one or more None objects: re.split() includes the text captured by any capturing group in its result, and when (\s)* matches zero times its group is None. If you insist on using re.split() you could remove the None objects with the filter() function, like so: filter(None, splat).
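A minimal sketch of where those None objects come from, and two ways to avoid them:

```python
import re

data = "para one\n\npara two"

# The capturing group (\s)* is included in the result; with nothing between
# the two newlines it captured nothing, so re.split inserts None
splat = re.split(r"\n(\s)*\n", data)

# Option 1: drop the Nones before the inner split
cleaned = [p for p in splat if p is not None]

# Option 2: make the group non-capturing, so no group text is returned at all
splat2 = re.split(r"\n(?:\s)*\n", data)
```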
I'm facing this error and I'm really not able to find the reason for it.
Can somebody please point out the reason for it ?
for i in tweet_raw.comments:
    mns_proc.append(processComUni(i))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-416-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_processed.append(processtwt(i))
3
<ipython-input-414-4e1b8a8fb285> in processtwt(tweet)
4 #Convert to lower case
5 #tweet = re.sub('RT[\s]+','',tweet)
----> 6 tweet = tweet.lower()
7 #Convert www.* or https?://* to URL
8 #tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','',tweet)
AttributeError: 'float' object has no attribute 'lower'
A second, similar error that I am facing is this:
for i in tweet_raw.comments:
    tweet_proc.append(processtwt(i))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-423-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_proc.append(processtwt(i))
3
<ipython-input-421-38fab2ef704e> in processComUni(tweet)
11 tweet=re.sub(('[http]+s?://[^\s<>"]+|www\.[^\s<>"]+'),'', tweet)
12 #Convert #username to AT_USER
---> 13 tweet = re.sub('#[^\s]+',' ',tweet)
14 #Remove additional white spaces
15 tweet = re.sub('[\s]+', ' ', tweet)
C:\Users\m1027201\AppData\Local\Continuum\Anaconda\lib\re.pyc in sub(pattern, repl, string, count, flags)
149 a callable, it's passed the match object and must return
150 a replacement string to be used."""
--> 151 return _compile(pattern, flags).sub(repl, string, count)
152
153 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or buffer
Shall I check whether or not a particular tweet is a string before passing it to the processtwt() function? For this error I don't even know which line it is failing at.
Just try using this:
tweet = str(tweet).lower()
Lately, I've been facing many of these errors, and converting the value to a string before applying lower() has always worked for me.
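One caveat: the float is very likely NaN, which pandas uses for missing values, and str(tweet) turns it into the literal text 'nan'. Skipping non-string rows may be cleaner; a minimal sketch, with a made-up DataFrame standing in for tweet_raw:

```python
import pandas as pd

# Stand-in for tweet_raw; NaN (a float) marks the missing comment
tweet_raw = pd.DataFrame({"comments": ["Hello World", float("nan"), "SECOND tweet"]})

processed = []
for i in tweet_raw.comments:
    if isinstance(i, str):  # skip missing values instead of turning them into 'nan'
        processed.append(i.lower())
```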
My answer is broader than Shalini's. If you want to check whether the object is of type str, I suggest you check its type with isinstance(), as shown below; this is the more Pythonic way.
tweet = "stackoverflow"

## best way of doing it
if isinstance(tweet, str):
    print tweet

## other way of doing it
if type(tweet) is str:
    print tweet

## one more way to do it
if type(tweet) == str:
    print tweet

All of the above work to check whether an object is a string.