Python: split string with delimiters from a list

I'd like to split a string with delimiters which are in a list.
The string has this pattern: Firstname, Lastname Email
The list of delimiters has this: [', ',' '] taken out of the pattern.
I'd like to split the string to get a list like this
['Firstname', 'Lastname', 'Email']
For a better understanding of my problem, this is what I'm trying to achieve:
The user shall be able to provide a source pattern: %Fn%, %Ln% %Mail% of data to be imported
and a target pattern how the data shall be displayed:
%Ln%%Fn%; %Ln%, %Fn%; %Mail%
This is my attempt:
import re
data = "Firstname, Lastname Email"
source_pattern_delimiter = [', ', ' ']
for delimiter in source_pattern_delimiter:
    prog = re.compile(delimiter)
    data_tuple = prog.split(data)
How do I 'merge' the data_tuple list(s)?

import re
re.split(re.compile("|".join([", ", " "])), "Firstname, Lastname Email")
hope it helps

Seems you want something like this,
>>> import re
>>> s = "Firstname, Lastname Email"
>>> delim = [', ',' ']
>>> re.split(r'(?:' + '|'.join(delim) + r')', s)
['Firstname', 'Lastname', 'Email']
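The regex answers above join the delimiters into one pattern as-is. That works for ', ' and ' ', but a delimiter such as '.' or '|' would be read as a regex metacharacter. A minimal sketch that escapes each delimiter first (the helper name split_on_delimiters is made up for illustration):

```python
import re

def split_on_delimiters(text, delimiters):
    # Try longer delimiters first so ", " is preferred over " ",
    # and escape each one so it is matched literally.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)

print(split_on_delimiters("Firstname, Lastname Email", [", ", " "]))
# ['Firstname', 'Lastname', 'Email']
```

Sorting by length matters because regex alternation takes the first alternative that matches at a position, not the longest one.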

A solution without regexes and if you want to apply a particular delimiter at a particular position:
def split(s, delimiters):
    for d in delimiters:
        item, s = s.split(d, 1)
        yield item
    else:
        yield s
>>> list(split("Firstname, Lastname Email", [", ", " "]))
['Firstname', 'Lastname', 'Email']

What about splitting on spaces, then removing any trailing commas?
>>> data = "Firstname, Lastname Email"
>>> [s.rstrip(',') for s in data.split(' ')]
['Firstname', 'Lastname', 'Email']

You are asking for a template based way to reconstruct the split data. The following script could give you an idea how to progress. It first splits the data into the three parts and assigns each to a dictionary entry. This can then be used to give a target pattern:
import re
data = "Firstname, Lastname Email"
# Find a list of entries and display them
entries = re.findall(r"(\w+)", data)
print(entries)
# Convert the entries into a dictionary
dEntries = {"Fn": entries[0], "Ln": entries[1], "Mail": entries[2]}
# Use dictionary-based string formatting to provide a template system
print("%(Ln)s%(Fn)s; %(Ln)s, %(Fn)s; %(Mail)s" % dEntries)
This displays the following:
['Firstname', 'Lastname', 'Email']
LastnameFirstname; Lastname, Firstname; Email
If you really need to use the exact template system you have provided then the following could be done to first convert your target pattern into one suitable for use with Python's dictionary system:
def display_with_template(data, target_pattern):
    entries = re.findall(r"(\w+)", data)
    dEntries = {"Fn": entries[0], "Ln": entries[1], "Mail": entries[2]}
    # Rewrite each %Name% placeholder as %(Name)s for dictionary formatting
    for item in ["Fn", "Ln", "Mail"]:
        target_pattern = target_pattern.replace("%%%s%%" % item, "%%(%s)s" % item)
    return target_pattern % dEntries
print(display_with_template("Firstname, Lastname Email", r"%Ln%%Fn%; %Ln%, %Fn%; %Mail%"))
Which would display the same result, but uses a custom target pattern:
LastnameFirstname; Lastname, Firstname; Email
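The script above hard-codes the field order. If the source pattern itself should drive the split, one possible approach is to turn it into a regex with named groups. This is only a sketch: pattern_to_regex and convert are hypothetical helper names, and it assumes every placeholder has the form %Name% and appears once in the source pattern.

```python
import re

def pattern_to_regex(source_pattern):
    # re.split with a capturing group alternates literal text (even indices)
    # and placeholder names (odd indices).
    parts = re.split(r"%(\w+)%", source_pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2:
            regex += "(?P<%s>.+?)" % part   # placeholder -> named group
        else:
            regex += re.escape(part)        # delimiter -> literal text
    return re.compile("^" + regex + "$")

def convert(data, source_pattern, target_pattern):
    match = pattern_to_regex(source_pattern).match(data)
    if match is None:
        raise ValueError("data does not match the source pattern")
    fields = match.groupdict()
    # Replace each %Name% in the target pattern with the captured value
    return re.sub(r"%(\w+)%", lambda m: fields[m.group(1)], target_pattern)

print(convert("Firstname, Lastname Email",
              "%Fn%, %Ln% %Mail%",
              "%Ln%%Fn%; %Ln%, %Fn%; %Mail%"))
# LastnameFirstname; Lastname, Firstname; Email
```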

Related

Extract Text from a word document

I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to use the steps mentioned:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above.
Still facing issues with the same.
There are some inconsistencies in the document: phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once; one of the emails is just in the last line for that distributor without the Email: prefix; and the State isn't always in the last line. Still, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except Exception:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw'] # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    # a part after '/' replaces the same number of trailing digits
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
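To make the '/' handling easier to check in isolation, here is a small standalone sketch of the same suffix-replacement idea. The function name expand_slashed_numbers and the sample number are made up for illustration; it assumes each part after a '/' replaces the same number of trailing digits of the first number.

```python
def expand_slashed_numbers(raw):
    # "2345678/79" is read as 2345678 and 2345679: each later part
    # overwrites the last len(part) characters of the first number.
    parts = [p.strip() for p in raw.split('/')]
    first = parts[0]
    numbers = [first]
    for suffix in parts[1:]:
        if not suffix:
            continue  # ignore empty parts such as a trailing '/'
        numbers.append(first[:-len(suffix)] + suffix)
    return numbers

print(expand_slashed_numbers("2345678/79"))
# ['2345678', '2345679']
```

If a document writes two complete numbers separated by '/', the second one simply replaces the whole first number, so that case also comes out as written.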
After this, you can just view as DataFrame with:
import docx
import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]

Filter multiple fields using single input

My backend in Python:
def resview(request, *args, **kwargs):
    if 'uid' not in kwargs:
        kwargs = kwargs.copy()
        kwargs['filter'] = {}
        username = request.GET.get('username')
        if username:
            kwargs['filter']['user__username__contains'] = username
    return resource_view(request, UserProfile, get_userprof, put_userprof,
        deleter=del_userprof, api=request.api, order_by='user__username', **kwargs)
I can successfully search for the username but I want to be able to search on multiple fields with a single input and I don't know how to do it.
I was once in a similar situation and used regular expressions to search for keywords and then pull out the rest of the line up to the newline. Here is my small code sample.
Step 1: Save the content in a text file
import re
text = open('file.txt', 'r').read()
# to print id
m = re.search('(?<=id: )(.*)', text)
print("id= ", m.groups())
# to print username
m = re.search('(?<=username: )(.*)', text)
print("username= ", m.groups())
############
# and so on; replace the keywords with whatever you need...
#############
Since you're using Django, I suggest taking a look at SearchFilter (from Django REST framework), which supports searching multiple fields at once.
Besides that, you can filter the queryset by hand with a custom FilterSet method:
import django_filters
from django.db.models import Q
class F(django_filters.FilterSet):
    username = django_filters.CharFilter(method='my_custom_filter')
    class Meta:
        model = User
        fields = ['username', 'firstname', 'lastname', 'email']
    def my_custom_filter(self, queryset, name, value):
        # case-insensitive substring match on any of the four fields
        return queryset.filter(
            Q(username__icontains=value)
            | Q(firstname__icontains=value)
            | Q(lastname__icontains=value)
            | Q(email__icontains=value)
        )
You can store the JSON list in a list variable and filter the results.
filterData(searchString){
  searchString = searchString.trim().toLowerCase();
  return this.data.filter(user =>
    user.username.toLowerCase().indexOf(searchString) !== -1 ||
    user.firstname.toLowerCase().indexOf(searchString) !== -1 ||
    user.lastname.toLowerCase().indexOf(searchString) !== -1);
}
I suggest that you keep two separate lists: one filtered and one unfiltered. This way you won't have to call the service again to get the original data. The filter is always applied to the unfiltered data.
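For comparison with the Python backend in the question, the same "one input, several fields" idea can be sketched in plain Python over a list of dictionaries. The function name filter_users, the field names, and the sample records are assumptions for illustration, not part of the original code.

```python
def filter_users(users, search, fields=("username", "firstname", "lastname")):
    # Case-insensitive substring match against any of the given fields
    needle = search.strip().lower()
    return [
        user for user in users
        if any(needle in str(user.get(field, "")).lower() for field in fields)
    ]

# Made-up sample data
users = [
    {"username": "jdoe", "firstname": "John", "lastname": "Doe"},
    {"username": "asmith", "firstname": "Anna", "lastname": "Smith"},
]
print(filter_users(users, "ann"))
# [{'username': 'asmith', 'firstname': 'Anna', 'lastname': 'Smith'}]
```

The original list is never modified, so the unfiltered data stays available for the next search.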

Extracting information from Strings in Python

The format is
"FirstName LastName Type_Date-Time_ref_PhoneNumber"
All this is a single string
Example: "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
I want to extract Name, Type, Date, Time, ref, Phonenumber from this string.
You can do
a = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
name = " ".join(a.split(" ")[:2])
Type, Date, ref, Phonenumber = a.replace(name, "").strip().split("_")
Time = Date[Date.find("-")+ 1:]
Date = Date.replace(f"-{Time}", "")
print(name, Type, Date, Time, ref, Phonenumber)
That will output
Yasir Pirkani MCD 20201105 134700 abc123 12345678
There are multiple ways to do this; one of them is using regex.
import re
s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
split_s = re.split(r'[\s_]+', s)
# split_s -> ['Yasir', 'Pirkani', 'MCD', '20201105-134700', 'abc123', '12345678']
Then you can access each element of split_s and adjust it as needed.
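Another option is a single regex with named groups, which returns every field in one step. This is a sketch under assumptions taken from the one example in the question: the date is 8 digits, the time is 6 digits, the phone number is all digits, and the Type contains no spaces or underscores. The names PATTERN and parse_record are made up for illustration.

```python
import re

PATTERN = re.compile(
    r"^(?P<name>.+) (?P<type>[^_ ]+)_(?P<date>\d{8})-(?P<time>\d{6})"
    r"_(?P<ref>[^_]+)_(?P<phone>\d+)$"
)

def parse_record(record):
    # Returns a dict of fields, or None if the string does not fit the format
    match = PATTERN.match(record)
    return match.groupdict() if match else None

print(parse_record("Yasir Pirkani MCD_20201105-134700_abc123_12345678"))
```

Because the name group is matched up to the last space before the Type, names with more than two words are handled as well.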

Python - creating same sender email/sender in Random email/conversation generator

I'm super new to Python and just trying my hand at a random email generator.
I'm just using json files with datasets in them, so there may be a better way to do this.
I can get the script to work with no problems, but I need some advice on something. I want the sender's email to be the same as the sign-off name.
I.e. david_jones@hotmail etc. comes from Regards, David Jones. At the moment I've got it generating a separate random email and a separate sign-off name. I need to link the two. Everything else is OK at the moment.
Can anyone help me with a better way to do this?
Code:
import json
import random
f = open("C:/Users/*/Desktop/Email.txt", "a")
sentfrom = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Send.json').read())
send = sentfrom [random.randint(0,4)]
carboncopy = "CC:"
receiver = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/To.json').read())
to = receiver[random.randint(0,4)]
datesent = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Date.json').read())
date = datesent[random.randint(0,4)]
subjects = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Subject.json').read())
subject = subjects[random.randint(0,4)]
greetings = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Greeting.json').read())
greeting= greetings[random.randint(0,4)]
firstsentence = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Sent1.json').read())
sent1 = firstsentence[random.randint(0,4)]
secondsentence = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Sent2.json').read())
sent2 = secondsentence[random.randint(0,4)]
thirdsentence = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Sent3.json').read())
sent3 = thirdsentence[random.randint(0,4)]
fourthsentence = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Sent4.json').read())
sent4 = fourthsentence[random.randint(0,4)]
farewell = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Goodbye.json').read())
goodbye = farewell[random.randint(0,4)]
regards = json.loads(open('C:/Users/*/Desktop/*/Scripts/Test/Sender.json').read())
salutation = regards[random.randint(0,4)]
conversation = send +'\n'+ to +'\n'+ carboncopy +'\n'+ date +'\n'+ subject +'\n'+ '\n' + greeting +', \n'+ '\n' + sent1 +'\n'+ '\n' + sent2 +'\n'+'\n'+ sent3 +'\n'+'\n'+ sent4 +'\n'+'\n'+ goodbye +'\n'+'\n'+ salutation
f.write(conversation)
f.close()
Thanks in advance,
Buzz
Assuming that regards is what contains the sign-off name:
You want to first get rid of the sign-off name. Instead of 'Regards, John Doe', have all of them be 'Regards', 'Best', 'Thanks!' etc. Maybe just create a list instead of reading it from JSON:
regards = ['Regards,', 'Best,', 'Thanks!' ...]
Assuming everyone's email has the same format, i.e. john_doe@whatever.com, you can get the name from it:
my_name = to.split('@')[0].replace('_', ' ').title()
# my_name will be 'John Doe'
And then add my_name to the conversation after salutation.
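Put together, the idea could look like the sketch below. It assumes addresses of the form first_last@domain and uses made-up sample data in place of the JSON files; name_from_email and pick_sender are hypothetical helper names. random.choice is used instead of indexing with random.randint(0,4), so the lists can have any length.

```python
import random

def name_from_email(address):
    # "david_jones@hotmail.com" -> "David Jones"
    local_part = address.split("@")[0]
    return local_part.replace("_", " ").title()

def pick_sender(senders, closings):
    # Pick one sender, then build the sign-off from that same address
    sender = random.choice(senders)
    sign_off = random.choice(closings) + " " + name_from_email(sender)
    return sender, sign_off

# Made-up sample data standing in for the JSON files
senders = ["david_jones@hotmail.com", "john_doe@whatever.com"]
closings = ["Regards,", "Best,", "Thanks!"]

sender, sign_off = pick_sender(senders, closings)
print(sender)
print(sign_off)
```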

Split and strip output python

I am really having some issues with getting clean output from python after reading in a file.
Here is my code so far:
user_pass_list = []
count = 0
for Line in open(psw):
    fields = Line.strip("\n").strip(",").split(":")
    user_pass_list.append(fields)
    count = count + 1
print (count)
for item in range (0, count):
    print (user_pass_list[item])
Here is what I keep getting as output:
['administrator', 'admin']
['']
['administrator', 'password']
['']
['admin', '12345']
['']
['admin', '1234']
['']
['root', 'root']
['']
['root', 'password']
['']
['root', 'toor']
Here is the text file that I am trying to read in to a list.
administrator:admin
administrator:password
admin:12345
admin:1234
root:root
root:password
root:toor
Could someone please help me? What I want is for each field to have its own list.
users[0]="administrator"
passwords[0]="admin"
users[1]="administrator"
passwords[1]="password"
Any suggestions?
You could use 2 lists, users and passwords, like this:
users = []
passwords = []
with open(psw) as fin:
    for line in fin:
        fields = line.strip().split(':')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        users.append(fields[0])
        passwords.append(fields[1])
But I think it would be more useful to have a list of tuples:
credentials = []
with open(psw) as fin:
    for line in fin:
        fields = line.strip().split(':')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        credentials.append((fields[0], fields[1]))
How about, instead of fields, you unpack the line into two variables immediately and wrap it in a try/except, so that a line that doesn't unpack to exactly two fields is simply skipped?
for Line in open(psw):
    try:
        user, pswd = Line.strip("\n").strip(",").split(":")
        user_pass_list.append([user, pswd])
        count = count + 1
    except ValueError:
        pass
You might also want to strip spaces and tabs.
There are a number of different ways to do this...
The straightforward way:
up0 = []
with open('pwd.txt') as fp:
    for line in fp:
        words = line.split(':')
        username, password = [w.strip() for w in words]
        up0.append((username, password))
Using a nested list comprehension:
up1 = [(username, password.strip())
       for (username, password)
       in [line.split(':') for line in open('pwd.txt').readlines()]]
If you're using Python 2.7, that could be written with a simple map:
up2 = [map(str.strip, line.split(':')) for line in open('pwd.txt').readlines()]
Python 3 requires that you wrap the map inside list(), or you could replace the map by a list comprehension:
up3 = [[w.strip() for w in line.split(':')] for line in open('pwd.txt').readlines()]
And finally, for those times when you feel funky, regular expressions:
import re
up4 = re.findall(r'([^:]+):([^\s]+).*\n', open('pwd.txt').read())
It seems like there should be a solution using itertools too... ;-)
You can print them all out by e.g. (Python 2.7):
print "Count:", len(up0)
for item in up0:
    print item
or (Python 3):
print("Count:", len(up0))
for username, password in up0:
    print("username={}, password={}".format(username, password))
I would suggest using the up0 version if you want a recommendation.
[update]: just saw you wanted all the usernames in one array and the passwords in another...
The above code creates a list of tuples, as you can see by e.g. using pprint:
import pprint
pprint.pprint(up0)
gives
[('administrator', 'admin'),
('administrator', 'password'),
('admin', '12345'),
('admin', '1234'),
('root', 'root'),
('root', 'password'),
('root', 'toor')]
This can easily be converted to what you want with:
username, password = zip(*up0)
print('username:', username)
print('password:', password)
which gives
username: ('administrator', 'administrator', 'admin', 'admin', 'root', 'root', 'root')
password: ('admin', 'password', '12345', '1234', 'root', 'password', 'toor')
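The [''] entries in the question's output suggest the file contains blank lines between records. A minimal sketch that skips them and returns the two separate lists the question asks for (parse_credentials is a made-up name; it accepts any iterable of lines, so an open file object works too):

```python
def parse_credentials(lines):
    users, passwords = [], []
    for line in lines:
        line = line.strip()          # drops the newline plus stray spaces/tabs
        if not line:
            continue                 # skip blank lines
        user, sep, password = line.partition(':')
        if not sep:
            continue                 # skip lines without a ':' separator
        users.append(user)
        passwords.append(password)
    return users, passwords

sample = ["administrator:admin\n", "\n", "root:toor\n"]
users, passwords = parse_credentials(sample)
print(users)      # ['administrator', 'root']
print(passwords)  # ['admin', 'toor']
```

With a real file this would be called as parse_credentials(open(psw)), where psw is the path variable from the question. partition splits on the first ':' only, so a password that itself contains ':' stays intact.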
