I'm working with a .csv file and, as always, it has format problems. In this case it's a ;-separated table, but there's a field whose values sometimes contain extra semicolons, like this:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2
So there are three cases:
no semicolon -> no problem
word character (non-numeric), semicolon, whitespace, word character (non-numeric)
word character (non-numeric), semicolon, 2× whitespace, word character (non-numeric)
I turned the .csv into a .txt, read it in as a string, and compiled this regex:
re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
Which should do the job. I almost managed to replace those semicolons with commas by doing the following:
def replace_comma(match):
    text = match.group()
    return text.replace(';', ',')

regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
string2 = string.split('\n')
for n, i in enumerate(string2):
    if len(re.findall(r'([^\d\W]);(\s+)([^\d\W])', i)) >= 1:
        string2[n] = regex.sub(replace_comma, i)
This mostly works, but when there are two whitespace characters after the semicolon, it leaves an \xa0 after the comma. I have two problems with this approach:
It's not very straightforward.
Why is it leaving this \xa0 character?
Do you know any better way to approach this?
Thanks
Edit: My desired output would be:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
Edit: Added explanation about turning the file into a string for better manipulation.
For this case I wouldn't use regex; split() and rsplit() with the maxsplit= parameter are enough:
data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2'''
for line in data.splitlines():
    row = line.split(';', maxsplit=1)
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))
Prints:
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
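If you do prefer a regex approach, here is a minimal sketch (assuming the stray semicolons only ever sit between non-digit word characters and are followed by whitespace, as in your three cases) that does the whole replacement in one pass. Note that \s also matches the non-breaking space \xa0, so it is part of the match; your callback only swaps the ';' for ',' and leaves that whitespace in place, which is why the character survives. Replacing the whole match and collapsing the whitespace to a single space removes it:

import re

text = '''code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2'''

# "<word char>;<whitespace><word char>" becomes "<word char>, <word char>";
# \s+ swallows the \xa0 as well, so no stray character is left behind
fixed = re.sub(r'([^\d\W]);\s+([^\d\W])', r'\1, \2', text)
print(fixed)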
I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, it does not perform the operation I wanted and the whole string is printed.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines, as you noticed (the * inside the lookahead would also need escaping as \*). With your second one you were closer, but you forgot the * quantifier after the . that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in Python regexes matches everything except newlines by default. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
To keep the text before that symbol you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives the output:
' extra text.\n'
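Another non-regex sketch, using str.partition (which, unlike str.index, never raises if the marker is missing), assuming the marker is the literal */:

string = ''' something
other things
etc. */ extra text.
'''

# partition() returns (before, separator, after); "after" is empty when "*/" never occurs
before, sep, after = string.partition('*/')
print(after.strip())
# -> extra text.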
I currently have a list of filenames in a txt file and I am trying to sort them. The first thing I am trying to do is split them into a list, since they are all on a single line. There are 3 file types in the list. I am able to split the list, but I would like to keep the delimiters in the end result, and I have not been able to find a way to do this. The way I am splitting the files is as follows:
import re
def breakLines():
    unsorted_list = []
    file_obj = open("index.txt", "rt")
    file_str = file_obj.read()
    unsorted_list.append(re.split('.txt|.mpd|.mp4', file_str))
    print(unsorted_list)

breakLines()
I found DeepSpace's answer to be very helpful here Split a string with "(" and ")" and keep the delimiters (Python), but that only seems to work with single characters.
EDIT:
Sample input:
file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4
Expected output:
file_name1234.mp4
file_name1235.mp4
file_name1236.mp4
file_name1237.mp4
In re.split, the key is to parenthesise the split pattern so it's kept in the result of re.split. Your attempt is:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.split('.txt|.mpd|.mp4', s)
['file_name1234', 'file_name1235', 'file_name1236', 'file_name1237', '']
Okay, that doesn't work (and the dots would need escaping to really match what an extension is), so let's try:
>>> re.split(r'(\.txt|\.mpd|\.mp4)', s)
['file_name1234',
'.mp4',
'file_name1235',
'.mp4',
'file_name1236',
'.mp4',
'file_name1237',
'.mp4',
'']
That works, but it splits the extensions from the filenames and leaves a blank string at the end, which is not what you want (unless you accept some ugly post-processing; see the sketch at the end of this answer). Plus this is a duplicate question: In Python, how do I split a string and keep the separators?
But you don't want re.split, you want re.findall:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(\w*?(?:\.txt|\.mpd|\.mp4))', s)
['file_name1234.mp4',
'file_name1235.mp4',
'file_name1236.mp4',
'file_name1237.mp4']
The expression matches word characters (basically digits, letters and underscores), followed by the extension. To be able to create an OR, I created a non-capturing group inside the main group.
If you have more exotic file names, you can't use \w anymore, but it still works reasonably well (you may need some str.strip post-processing to remove leading/trailing blanks, which are likely not part of the filenames):
>>> s = " file name1234.mp4file-name1235.mp4 file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(.*?(?:\.txt|\.mpd|\.mp4))', s)
[' file name1234.mp4',
'file-name1235.mp4',
' file_name1236.mp4',
'file_name1237.mp4']
So sometimes you think re.split when you need re.findall, and the reverse is also true.
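For completeness, the "ugly post-processing" of the re.split result mentioned above would be a pairwise re-join, roughly like this sketch:

import re

s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
parts = re.split(r'(\.txt|\.mpd|\.mp4)', s)
# pair every name with the extension that follows it; zip() drops the trailing '' for us
files = [name + ext for name, ext in zip(parts[::2], parts[1::2])]
print(files)
# ['file_name1234.mp4', 'file_name1235.mp4', 'file_name1236.mp4', 'file_name1237.mp4']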
I want to replace numbers from a text file and write the result to a new text file. I tried to solve it with a dictionary, but now Python also replaces the substrings.
For example: I want to replace the number 01489 with 1489, but with this code it also touches 014896 (turning it into 14896). How can I get rid of this? Thank you!
replacements = {'01489':'1489', '01450':'1450'}
infile = open('test_fplan.txt', 'r')
outfile = open('test_fplan_neu.txt', 'w')
for line in infile:
    for src, target in replacements.iteritems():
        line = line.replace(src, target)
    outfile.write(line)
I don't know how your input file looks, but if the numbers are surrounded by spaces, this should work:
replacements = {' 01489 ':' 1489 ', ' 01450 ':' 1450 '}
It looks like your concern is that it's also modifying numbers that contain your src pattern as a substring. To avoid that, you'll need to first define the boundaries that should be respected. For instance, do you want to insist that only matched numbers surrounded by spaces get replaced? Or perhaps just that there be no adjacent digits (or periods or commas). Since you'll probably want to use regular expressions to constrain the matching, as suggested by JoshuaF in another answer, you'll likely need to avoid the simple replace function in favor of something from the re library.
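For example, a minimal sketch using \b word boundaries instead of explicit lookarounds (this assumes the numbers are bounded by non-word characters such as spaces or punctuation, and the sample line here is made up):

import re

replacements = {'01489': '1489', '01450': '1450'}
# \b only matches between a word character and a non-word character,
# so the 01489 inside 014896 is left untouched
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, replacements)) + r')\b')

line = 'stop 01489, then 014896, then 01450'
print(pattern.sub(lambda m: replacements[m.group(1)], line))
# -> stop 1489, then 014896, then 1450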
Use a regexp with negative lookarounds:
import re
replacements = {'01489':'1489', '01450':'1450'}
def find_replacement(match_obj):
    number = match_obj.group(0)
    return replacements.get(number, number)

with open('test_fplan.txt') as infile:
    with open('test_fplan_neu.txt', 'w') as outfile:
        outfile.writelines(
            re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, line)
            for line in infile
        )
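Continuing from that snippet, a quick sanity check on a made-up line shows why the lookarounds matter: every whole number is matched and looked up in the dict, so 014896 is left alone because it has no entry.

sample = 'stop 01489 connects to 014896 and 01450'
print(re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, sample))
# -> stop 1489 connects to 014896 and 1450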
Check out the regular expression syntax https://docs.python.org/2/library/re.html. It should allow you to match whatever pattern you're looking for exactly.
I am new here and have just started using regular expressions in my Python code. I have a string which has 6 commas inside. One of the commas falls between two quotation marks. I want to get rid of the quotation marks and the last comma.
The input:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
I want this output:
string = 'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The output of my code:
string = 'Fruits,Pear,**CherryApple**,Orange,Cherry'
Here is my code in Python:
if re.search('"', string):
    matches = re.findall(r'\"(.+?)\"', string)
    matches1 = re.sub(",", "", matches[0])
    string = re.sub(matches[0], matches1, string)
    string = re.sub('"', '', string)
My problem is that I want to add a condition so the code only works on the last bit ("Cherry,"), but unfortunately it also affects other words in the middle (Cherry,Apple) that contain the same text as the one between the quotation marks! That reduces the number of commas (from 6 to 4), as it merges two fields (Cherry,Apple), and I want to be left with 5 commas.
fullString = '2000-04-24 12:32:00.000,22186CBD0FDEAB049C60513341BA721B,0DDEB5,COMP,Cherry Corp.,DE,100,0.57,100,31213C678CC483768E1282A9D8CB524C,365.00000,business,acquisitions-mergers,acquisition-bid,interest,acquiree,fact,,,,,,,,,,,,,acquisition-interest-acquiree,Cherry Corp. Gets Buyout Offer From Chairman President,FULL-ARTICLE,B5569E,Dow Jones Newswires,0.04,-0.18,0,0,1,0,0,0,0,1,1,5,RPA,DJ,DN20000424000597,"Cherry Corp. Gets Buyout Offer From Chairman President,"\n'
Many thanks in advance.
For your task you don't need regular expressions, just use replace:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
new_string = string.replace('"', '').strip(',')
The best way would be to use the newer regex module where (*SKIP)(*FAIL) is supported:
import regex as re
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
# parts
rx = re.compile(r'"[^"]+"(*SKIP)(*FAIL)|,')
def cleanse(match):
    rxi = re.compile(r'[",]+')
    return rxi.sub('', match)

parts = [cleanse(match) for match in rx.split(string)]
print(parts)
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', 'Cherry']
Here you match anything between double quotes and throw it away afterwards, thus only commas outside quotes are used for the split operation. The rest is a list comprehension with a cleaning function.
See a demo on regex101.com.
Why not simply use this:
>>> ans_string = string.replace('"', '')[0:-1]
Output
>>> ans_string
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
For the sake of simplicity and low algorithmic complexity.
You might consider using the csv module to do this.
Example:
import csv
s='Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
>>> ','.join([e.replace(',','') for row in csv.reader([s]) for e in row])
Fruits,Pear,Cherry,Apple,Orange,Cherry
The csv module will strip the quotes but keep the commas on each quoted field. Then you can just remove that comma that was kept.
This will take care of any modifications desired (remove , for example) on a field by field basis. The fields with quotes and commas could be any field in the string.
If your content is in a csv file, you would do something like this (in pseudo code)
with open(file, 'rb') as csv_fo:
    # modify(string) stands for what you want to do to each field...
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row
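A concrete, runnable Python 3 version of that sketch might look like the following; the filenames and the modify() helper are just placeholders:

import csv

def modify(field):
    # hypothetical per-field clean-up: drop stray commas inside quoted fields
    return field.replace(',', '')

# 'in.csv' / 'out.csv' are stand-in filenames; newline='' is what the csv module expects in Python 3
with open('in.csv', newline='') as src, open('out.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([modify(field) for field in row])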
Basically, I'm asking the user to input a string of text into the console, but the string is very long and includes many line breaks. How would I take the user's string and delete all line breaks to make it a single line of text? My method for acquiring the string is very simple.
string = raw_input("Please enter string: ")
Is there a different way I should be grabbing the string from the user? I'm running Python 2.7.4 on a Mac.
P.S. Clearly I'm a noob, so even if a solution isn't the most efficient, the one that uses the simplest syntax would be appreciated.
How do you enter line breaks with raw_input? But, once you have a string with some characters in it you want to get rid of, just replace them.
>>> mystr = raw_input('please enter string: ')
please enter string: hello world, how do i enter line breaks?
>>> # pressing enter didn't work...
...
>>> mystr
'hello world, how do i enter line breaks?'
>>> mystr.replace(' ', '')
'helloworld,howdoienterlinebreaks?'
>>>
In the example above, I replaced all spaces. The string '\n' represents newlines, and '\r' represents carriage returns (if you're on Windows, you might be getting these, and a second replace will handle them for you!).
basically:
# you probably want to use a space ' ' to replace `\n`
mystring = mystring.replace('\n', ' ').replace('\r', '')
Note also that it is a bad idea to call your variable string, as this shadows the module string. Another name I'd avoid (but would love to use sometimes) for the same reason: file.
You can try using string replace:
string = string.replace('\r', '').replace('\n', '')
You can split the string with no separator arg, which will treat consecutive whitespace as a single separator (including newlines and tabs). Then join using a space:
In : " ".join("\n\nsome text \r\n with multiple whitespace".split())
Out: 'some text with multiple whitespace'
https://docs.python.org/2/library/stdtypes.html#str.split
The canonical answer, in Python, would be:
s = ''.join(s.splitlines())
It splits the string into lines (letting Python do it according to its own best practices). Then you merge it. Two possibilities here:
replace the newline by a whitespace (' '.join())
or without a whitespace (''.join())
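For example:

s = "hello\nworld\r\nagain"
print(''.join(s.splitlines()))   # -> helloworldagain
print(' '.join(s.splitlines()))  # -> hello world again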
Updated based on Xbello's comment:
string = my_string.rstrip('\r\n')
read more here
Another option is regex:
>>> import re
>>> re.sub("\n|\r", "", "Foo\n\rbar\n\rbaz\n\r")
'Foobarbaz'
If anybody decides to use replace, note that r'\n' (a literal backslash followed by n) is what you want only when the text contains escaped newlines rather than real ones:
mystring = mystring.replace(r'\n', ' ').replace(r'\r', '')
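To illustrate the difference (this only matters when the text holds literal backslash-n sequences rather than real newlines):

s_literal = 'one\\ntwo'   # backslash + n, two separate characters
s_real = 'one\ntwo'       # a real newline character
print(s_literal.replace(r'\n', ' '))  # -> one two
print(s_real.replace('\n', ' '))      # -> one two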
A method that takes into consideration:
additional whitespace characters at the beginning/end of the string
additional whitespace characters at the beginning/end of every line
various end-of-line characters
It takes a multi-line string, which may be messy, e.g.:
test_str = '\nhej ho \n aaa\r\n a\n '
and produces a nice one-line string:
>>> ' '.join([line.strip() for line in test_str.strip().splitlines()])
'hej ho aaa a'
UPDATE:
To fix multiple newline characters producing redundant spaces:
' '.join([line.strip() for line in test_str.strip().splitlines() if line.strip()])
This works for the following too:
test_str = '\nhej ho \n aaa\r\n\n\n\n\n a\n '
Regular expressions are the fastest way to do this:
s='''some kind of
string with a bunch\r of
extra spaces in it'''
re.sub(r'\s(?=\s)','',re.sub(r'\s',' ',s))
result:
'some kind of string with a bunch of extra spaces in it'
The problem with rstrip() is that it does not work in all cases (as I have seen myself a few times). Instead you can use:
text = text.replace("\n"," ")
This will replace every newline '\n' with a space.
You really don't need to remove ALL the line-break signs: LF, CR, CRLF (in Python: '\n', '\r', '\r\n').
Some texts must keep their breaks, but you probably need to join broken lines to keep particular sentences together.
It is natural for a line break to stay after a period, semicolon or colon, but not after a comma.
My code considers the above conditions and works well with text copied from PDFs.
Enjoy!
import re

def unbreak_pdf_text(raw_text):
    """The newline-careful sign removal tool.

    Args:
        raw_text (str): string containing unwanted newline signs (\\n, \\r or \\r\\n),
            e.g. imported from OCR or copied from a pdf document.
    Returns:
        str: the text with those soft line breaks replaced by a single space.
    """
    pat = re.compile(r"[, \w]\n|[, \w]\r|[, \w]\r\n")
    breaks = re.finditer(pat, raw_text)
    processed_text = raw_text
    raw_text = None
    for i in breaks:
        processed_text = processed_text.replace(i.group(), i.group()[0] + " ")
    return processed_text
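A quick usage sketch on a made-up snippet (breaks after a word or a comma get joined, while the break after the period stays):

sample = "This sentence was broken\nacross lines, and\nthis part too.\nNew paragraph."
print(unbreak_pdf_text(sample))
# -> This sentence was broken across lines, and this part too.
# -> New paragraph.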