I have a log file that is full of tweets. Each tweet is on its own line so that I can iterate through the file easily.
An example tweet would be like this:
# sample This is a sample string $ 1.00 # sample
I want to clean this up a bit by removing the whitespace between a special character and the following alphanumeric character: "# s", "$ 1", "# s"
So that it would look like this:
#sample This is a sample string $1.00 #sample
I'm trying to use regular expressions to match these instances because they can be variable, but I am unsure of how to go about doing this.
I've been using re.sub() and re.search() to find the instances, but am struggling to figure out how to only remove the white space while leaving the string intact.
Here is the code I have so far:
#!/usr/bin/python
import csv
import re
import sys
import pdb
import urllib

f = open('output.csv', 'w')
with open('retweet.csv', 'rb') as inputfile:
    read = csv.reader(inputfile, delimiter=',')
    for row in read:
        a = row[0]
        matchObj = re.search(r"\W\s\w", a)
        if matchObj:
            print matchObj.group()
f.close()
Thanks for any help!
Something like this using re.sub:
>>> import re
>>> strs = "# sample This is a sample string $ 1.00 # sample"
>>> re.sub(r'([#$])(\s+)([a-z0-9])', r'\1\3', strs, flags=re.I)
'#sample This is a sample string $1.00 #sample'
>>> re.sub("([#$#]) ", r"\1", "# sample This is a sample string $ 1.00 # sample")
'#sample This is a sample string $1.00 #sample'
This seemed to work pretty nicely:
print re.sub(r'([#$])\s+',r'\1','# blah $ 1')
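Folded back into the question's CSV loop, a minimal Python 3 sketch (the column index and the idea of writing cleaned rows to output.csv are assumptions based on the question's code):
import csv
import re

pattern = re.compile(r'([#$])\s+')

with open('retweet.csv') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile)
    for row in reader:
        # Drop the whitespace that follows a '#' or '$' (assumes the tweet
        # text is in the first column, as in the question's code).
        row[0] = pattern.sub(r'\1', row[0])
        writer.writerow(row)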
I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
    print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, the (?<!\$) will match a character not immediately following a $.
Similarly, a (?!\$) will match a character not immediately before a dollar.
The | character cam match multiple options. So to match a character where either side is not a $ you can use:
expression = r"(?<!\$),|,(?!\$)"
Full program:
import re
expression = r"(?<!\$),|,(?!\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
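Running this prints:
['$abc,def$', '$ghi$', '$jkl,mno$']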
One solution (maybe not the most elegant, but it will work) is to replace the string $,$ with something like $,,$ and then split on ,,. So something like this
output = line.replace('$,$','$,,$').split(',,')
Using a regex like mousetail suggested is the more elegant and robust solution, but it requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
    print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$
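This sidesteps regular expressions entirely: it synthesizes a one-character separator that cannot occur in the line, so a plain str.split() does the work. Note that c has to be recomputed for each new line.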
Looking for an alternative way to clean a tabular file containing information between parentheses.
It will be the first step in a pipeline, and I need to remove every value inside parentheses (parentheses included).
What I have:
> Otu00467 Bacteria(100);Gracilibacteria(99);unclassified(99);unclassified(99);unclassified(99);unclassified(99);
> Otu00469 Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);
What I desire:
Otu00467 Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
Otu00469 Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
Otu00470 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;
My first approach was to split the second column by ";", "(" and ")" and then join everything back together. Not bad, but too ugly.
Thank you.
import re
new_string = re.sub(r'\(.*?\)', '', your_string)
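For example, on the first sample line (the variable names are just placeholders):
import re

your_string = 'Otu00467 Bacteria(100);Gracilibacteria(99);unclassified(99);'
new_string = re.sub(r'\(.*?\)', '', your_string)
print(new_string)  # Otu00467 Bacteria;Gracilibacteria;unclassified;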
I would try a regexp for it, applied to each line. Something like this:
pattern = re.compile(r'(\w+)\(\d+\);')
';'.join(re.findall(pattern, string))
This regex gets rid of parenthesized groups of digits, it also gets rid of any '>' characters, since it appears that you want to eliminate them as well.
import re
data = '''\
> Otu00467 Bacteria(100);Gracilibacteria(99);unclassified(99);>unclassified(99);unclassified(99);unclassified(99);
> Otu00469 Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);
'''
data = re.sub(r'>|\(\d+\)', '', data)
print(data)
output
Otu00467 Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
Otu00469 Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
Otu00470 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;
This code works on Python 2 & 3.
Use re.sub:
import re
with open('file.txt') as file:
    text = re.sub(r'\(.*?\)', '', file.read(), flags=re.M)
This removes all occurrences of the text enclosed in parentheses. The re.M flag only changes how the ^ and $ anchors match around newlines; since this pattern uses neither anchor, it makes no difference here and could be dropped.
# Use the re module for regex support
import re

# Open the file and read its contents into the data variable
data = open('file.txt').read()

# Apply the search-and-replace to data
data = re.sub(r'\(\d+\)', '', data)

# Write data to the output.txt file
with open('output.txt', 'w') as out:
    out.write(data)
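Since this is meant to be the first step of a pipeline, here is a line-by-line variant that streams the file instead of reading it all at once (the file names are assumptions):
import re

with open('file.txt') as src, open('output.txt', 'w') as out:
    for line in src:
        # Drop parenthesized confidence values such as '(100)'.
        out.write(re.sub(r'\(\d+\)', '', line))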
I have a basic knowledge of python (completed one class) and I'm unsure of how to tackle this next script. I have two files, one is a newick tree - looks like this, but much larger:
(((1:0.01671793,2:0.01627631):0.00455274,(3:0.02781576,4:0.05606947):0.02619237):0.08529440,5:0.16755623);
The second file is a tab delimited text file that looks like this but is much larger:
1 \t Human
2 \t Chimp
3 \t Mouse
4 \t Rat
5 \t Fish
I want to replace the sequence ID numbers (only those followed by colons) in the newick file with the species names in the text file to create
(((Human:0.01671793,Chimp:0.01627631):0.00455274,(Mouse:0.02781576,Rat:0.05606947):0.02619237):0.08529440,Fish:0.16755623);
My pseudocode (after opening both files) would look something like:
for line in txtfile:
    if line[0] matches \(\d*: in newick:
        replace that \d* with line[2]
Any suggestions would be greatly appreciated!
This can be done by defining a callback function that is run on every match of the regexp \(\d*:.
Here's an (unrelated) example from https://docs.python.org/2/library/re.html#text-munging that illustrates how the callback function is used together with re.sub(), which performs the regexp substitution:
>>> def repl(m):
... inner_word = list(m.group(2))
... random.shuffle(inner_word)
... return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
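Applied to the question itself, a minimal sketch (the file names ids.txt and tree.nwk are hypothetical, and the pattern (\d+): keys off the colon that follows each sequence ID):
import re

# Build the ID -> species mapping from the tab-delimited file.
mapping = {}
with open('ids.txt') as f:
    for line in f:
        seq_id, name = line.split(None, 1)
        mapping[seq_id] = name.strip()

with open('tree.nwk') as f:
    newick = f.read()

# The callback looks up each matched ID; unknown IDs are left unchanged.
def repl(m):
    return mapping.get(m.group(1), m.group(1)) + ':'

print(re.sub(r'(\d+):', repl, newick))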
You can also do it using findall:
import re
s = "(((1:0.01671793,2:0.01627631):0.00455274,(3:0.02781576,4:0.05606947):0.02619237):0.08529440,5:0.16755623)"
rep = {'1': 'Human',
       '2': 'Chimp',
       '3': 'Mouse',
       '4': 'Rat',
       '5': 'Fish'}
for i in re.findall(r'(\d+:)', s):
    s = s.replace(i, rep[i[:-1]] + ':')
>>> print s
(((Human:0.01671793,Chimp:0.01627631):0.00455274,(Mouse:0.02781576,Rat:0.05606947):0.02619237):0.08529440,Fish:0.16755623)
I have a file with data like this:
Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'
Sentence[0].Sentence[1].Sentence[2].'\n'
What I want to print out is all of the Sentence[0]s. This is what I have done, but it prints out a blank list.
from nltk import *
import codecs
f = codecs.open('topon.txt', 'r+', 'cp1251')
text = f.readlines()
first = [sentence for sentence in text if re.findall(r'\.\n^Abc', sentence)]
print first
You don't need NLTK for this (nor are you using it). Unless I misunderstand the question, this should do the trick:
with open('topon.txt') as infile:
    for line in infile:
        print line.split('.', 1)[0]
In addition to @inspectorG4dget's answer, you can do it with regexes:
from nltk import *
import codecs
f = codecs.open('a.txt', 'r+', 'cp1251')
text = f.readlines()
print [re.findall('^[^.]+', sentence) for sentence in text]
Splitting a paragraph at periods works only if every sentence ends with a period, and periods are used for nothing else. If you have a lot of real text, neither of these is even close to true. Abbreviations, questions? exclamations! etc. will trip you up a lot. So, use the tool that the nltk provides for this purpose: the function sent_tokenize(). It's not perfect, but it's a whole lot better than looking for periods. If text is your list of paragraphs, you use it like this:
import nltk

first = []
for par in text:
    sentences = nltk.sent_tokenize(par)
    first.append(sentences[0])
You could fold the above into a list comprehension, but it's not going to be very readable...
How do I replace Unicode values using re in Python?
I'm looking for something like this:
line.replace('Ã','')
line.replace('¢','')
line.replace('â','')
Or is there any way to replace all of the non-ASCII characters in a file? I converted a PDF file to ASCII, and I'm getting some non-ASCII characters [e.g. bullets from the PDF].
Please help me.
Edit after feedback in comments.
Another solution would be to check the numeric value of each character and see if it is under 128, since ASCII goes from 0 to 127. Like so:
# coding=utf-8
def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        if ord(char) < 128:
            asciiText = asciiText + char
    return asciiText

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
Here's an altered version of jd's answer with benchmarks:
# coding=utf-8
def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    if isinstance(text, str):
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        return text.encode("ascii", "ignore")

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
Output first solution using a str string as input:
computer:~ Ancide$ python test1.py
Time taken: 5.88719677925
Output first solution using a unicode string as input:
computer:~ Ancide$ python test1.py
Time taken: 7.21077990532
Output second solution using a str string as input:
computer:~ Ancide$ python test1.py
Time taken: 2.67580914497
Output second solution using a unicode string as input:
computer:~ Ancide$ python test1.py
Time taken: 1.740680933
Conclusion
Encoding is the faster solution and takes less code, so it is the better one.
Why replace character by character when you can transcode the whole string:
title.decode('latin-1').encode('utf-8')
Or, if you want undecodable bytes substituted with a replacement character:
unicode(title, errors='replace')
You have to encode your Unicode string to ASCII, ignoring any error that occurs. Here's how:
>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'
Try passing the re.UNICODE flag in the params. Like this:
re.compile("pattern", re.UNICODE)
For more info, see the manual page.
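If the goal is specifically to strip every non-ASCII character with re, a common idiom (not from the answers above) is a negated character class covering the ASCII range:
import re

line = u'uéa&à'
# Remove every character outside the ASCII range 0x00-0x7F.
print(re.sub(u'[^\x00-\x7f]', u'', line))  # ua&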