Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
Here's a copy of some of the lines (solution, pos and gloss) in my txt file:
solution: (كَتَبَ kataba) [katab-u_1]
pos: katab/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
gloss: ___ + write + he/it <verb>
I would like to extract the word 'katab' that is inside the square brackets on the first line and discard everything else: the other lines, the suffix, and the numbers. I'm working with Python 2.7.
I tried to write this code:
pattern = re.compile("'(?P[^']+)':\s*(?P<root>[^,]*)\d+")
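Whenever you think "I need to match a pattern", regular expressions are a good starting point; see the documentation for the re module. This case is a little trickier because the input file is Unicode.
import re
import codecs

words = []
# codecs.open decodes the file, so each line is a unicode object (Python 2.7).
with codecs.open("test.unicode.txt", "rb", "utf-8") as f:
    for line in f:
        # re.U makes \w match Unicode word characters; the named group
        # captures the stem that follows the opening square bracket.
        matches = re.match(r"solution:.+\[(?P<word>\w+).*\]", line, flags=re.U)
        if matches:
            words.append(matches.group("word"))
print(words)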
Output:
[u'katab']
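On Python 3, where str is already Unicode, the same idea can be sketched without codecs. This is only a minimal sketch, not the answerer's code: the helper name extract_stems and the in-memory sample list are made up for illustration, and the sample lines are copied from the question.

```python
import re

# Matches the stem at the start of the square brackets on a "solution:" line.
SOLUTION_RE = re.compile(r"solution:.*\[(?P<word>\w+)")

def extract_stems(lines):
    """Return the bracketed stem from every 'solution:' line (hypothetical helper)."""
    stems = []
    for line in lines:
        match = SOLUTION_RE.match(line)
        if match:
            stems.append(match.group("word"))
    return stems

sample = [
    "solution: (كَتَبَ kataba) [katab-u_1]",
    "pos: katab/VERB_PERFECT+a/PVSUFF_SUBJ:3MS",
    "gloss: ___ + write + he/it <verb>",
]
print(extract_stems(sample))
```

For a real file, the list could be replaced by a file object opened with open(path, encoding="utf-8"), since the helper only needs an iterable of lines.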
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Hi, I was wondering how I could format a large text file by adding line breaks after certain characters or words. For instance, every time a comma appears in a paragraph, could I use Python to output an extra line break after it?
You can do this with str.replace() in Python. The code below replaces every , with ,\n.
string = ""
with open('test.txt', 'r') as myfile:
    for line in myfile:
        string += line.replace(",", ",\n")
# The with block has already closed the file, so it can be reopened for writing.
with open('test.txt', 'w') as myfile:
    myfile.write(string)
File before execution:
Hello World and again HelloWorld,sdjakljsljsfs,asdgrwcfdssaasf,sdfoieunvsfaf,asdasdafjslkj,
After Execution:
Hello World and again HelloWorld,
sdjakljsljsfs,
asdgrwcfdssaasf,
sdfoieunvsfaf,
asdasdafjslkj,
You can use the str.replace() method like so:
'roses can be blue, red, white'.replace(',', ',\n') gives
'roses can be blue,\n red,\n white', effectively inserting '\n' after every comma.
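Since the question asks about "certain characters or words" rather than commas only, one possible generalisation is re.sub with an alternation of delimiters. The following is a rough sketch under that assumption; break_after is a hypothetical helper name, not something from the answers.

```python
import re

def break_after(text, delimiters):
    """Insert a newline after every occurrence of any delimiter (hypothetical helper)."""
    # re.escape keeps characters such as "." or "?" from being read as regex syntax.
    pattern = "|".join(re.escape(d) for d in delimiters)
    # \1 re-inserts the matched delimiter, then a newline follows it.
    return re.sub("(" + pattern + ")", r"\1\n", text)

print(break_after("roses can be blue, red; white", [",", ";"]))
```

The delimiters list is expected to be non-empty; words such as "and" can be passed in the same way as single characters.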
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
How can I parse this to find out how many unique URLs there are, regardless of the number after each one? I'm using Python.
You can open the file and read its lines into a list using:
with open("/path/to/file.txt") as file:
    lines = list(file)
This will give you a list of all lines in the text file.
Since you do not want duplicates, a set is a good fit (a set cannot contain duplicates).
answer = set()
for x in lines:
    answer.add(x[x.find(" ")+1:x.rfind(":")])
This iterates through all the lines and adds the part after the first space, up to but not including the last :, to the set, which takes care of duplicates. answer should then contain all the unique URLs.
Tested for Python3.6
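As a quick check of the slicing described above, here is a small sketch on made-up data. The "label url:count" line shape and the example.com URLs are assumptions, because the question's sample input is not shown.

```python
# Sample lines in an assumed "label url:count" shape (made-up data).
lines = [
    "String_1 http://example.com/a:10\n",
    "String_2 http://example.com/b:20\n",
    "String_3 http://example.com/a:30\n",
]

unique_urls = set()
for line in lines:
    start = line.find(" ") + 1   # first character after the first space
    end = line.rfind(":")        # last colon, so the colon in "http://" is kept
    unique_urls.add(line[start:end])

print(len(unique_urls))
```

Using rfind rather than find for the colon matters here: a URL usually contains a colon of its own after the scheme.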
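You can use a regex to extract the uids from your file line by line.
import re

uids = set()
with open('...') as f:
    for line in f:
        # re.match only matches at the start of the line, so no anchor is needed;
        # a leading '$' would anchor at the end of the line and never match.
        m = re.match(r'[a-z0-9]+', line)
        if m:
            uids.add(m.group(0))
print(len(uids))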
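import re

# Sample data; each space is replaced by a comma so the regex below can anchor on it.
A = "String_1 URL_1:10\nString_2 URL_2:20\nString_3 URL_1:30".replace(" ", ",").split("\n")
List = []
for x in range(len(A)):
    # Capture everything between the first comma and the last colon.
    Result = re.search(",(.*):", A[x])
    if Result.group(1) not in List:
        List.append(Result.group(1))
print(len(List))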
This should solve your problem.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
How do I match the following pattern using re?
2016-02-13 02:00:00.0,3525,http://www.heatherllindsey.com/2016/02/my-husband-left-his-9-5-job-for-good-it.html,158,0,2584490
I used python's split() function to separate the attributes out but as the data is huge, the process is getting killed due to memory errors.
It would help if you posted a longer sample of the data.
Here is one way to do it:
import re

line = "2016-02-13 02:00:00.0,3525,http://www.heatherllindsey.com/2016/02/my-husband-left-his-9-5-job-for-good-it.html,158,0,2584490"
# Non-greedy match of everything up to each comma; re.DOTALL lets a field span line breaks.
pattern = re.compile(r"(.*?),", re.DOTALL)
result = pattern.findall(line)
# The last field (2584490) has no trailing comma, so the pattern misses it; add it separately.
result.append(line.rsplit(",", 1)[-1])
print(result)
It works...
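If the memory error comes from holding the whole file in memory at once (an assumption, since the question does not show the code that failed), processing one line at a time may be enough. Below is a rough sketch using the standard csv module; iter_rows is a hypothetical helper, and io.StringIO stands in for the real file.

```python
import csv
import io

def iter_rows(handle):
    """Yield one list of fields per line without loading the whole file (hypothetical helper)."""
    for row in csv.reader(handle):
        yield row

# io.StringIO stands in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "2016-02-13 02:00:00.0,3525,http://www.heatherllindsey.com/2016/02/"
    "my-husband-left-his-9-5-job-for-good-it.html,158,0,2584490\n"
)
for fields in iter_rows(sample):
    # Only the current row is in memory here; print the URL and the last field.
    print(fields[2], fields[-1])
```

Because the generator yields rows lazily, memory use stays roughly constant regardless of file size, as long as the rows themselves are not accumulated in a list.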
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
How do I extract all the numbers (integers) from a text file using python? I am only using them in the def part to make a function of a button. I should be able to calculate them after extracting them.
You could use the re.findall function.
re.findall(r'\d+', string)
That is,
import re

with open('path/file') as file:
    x = file.read()
print(re.findall(r'\d+', x))
Or, reading the file line by line:
import re

l = []
with open('/path/file') as file:
    for line in file:
        for i in re.findall(r'\d+', line):
            l.append(i)
print(l)
\d+ matches one or more digits.
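Note that re.findall returns strings, so the matches need converting before any calculation. A minimal sketch follows; extract_ints and the sample sentence are made up for illustration.

```python
import re

def extract_ints(text):
    """Return every run of digits in text as an int (hypothetical helper)."""
    return [int(token) for token in re.findall(r"\d+", text)]

numbers = extract_ints("order 12 shipped 3 boxes, 40 items")
print(numbers, sum(numbers))
```

This pattern ignores minus signs and splits a decimal such as 3.5 into 3 and 5, which may or may not be what is wanted.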
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have written a script in Python that pulls in the body of a multi-line email. I want to grab all the text on a line containing a hashtag, except for the hashtag itself and dump it into a list. The code is below and I've isolated a small portion that is relevant:
Email Body:
#Delete 454454
John Smith
+1.555.555.5555
Desired Output:
[454454]
You can try it like this:
>>> s = '''#Delete 544574
... #Delete 457545'''
>>> [int(item.split()[1]) for item in s.splitlines() if item.startswith('#')]
[544574, 457545]
You can use a regex:
>>> my_string
'#Delete 544574\n#Delete 457545\n \n\nThis email body contains two examples here.'
>>> import re
>>> map(int,re.findall("#.* (\d+)\n",my_string))
[544574, 457545]
It also works on the example from your edit:
>>> my_new
'#Delete 454454\n\n \n\nJohn Smith\n\n+1.555.555.5555'
>>> map(int,re.findall("#.* (\d+)\n",my_new))
[454454]
You can build the list with a comprehension like this (assuming your lines are in lines, an iterable):
[int(line[8:-1]) for line in lines if line.startswith('#Delete ') and line[8:-1].isdigit()]
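The question asks for all the text on a hashtag line, not only digits, so a more general variant might look like the sketch below. It assumes each tag is a single word followed by whitespace; hashtag_payloads is a hypothetical helper name.

```python
import re

def hashtag_payloads(body):
    """Return the text after '#word ' on every line that starts with '#' (hypothetical helper)."""
    # re.MULTILINE makes ^ and $ match at each line boundary, so a final
    # line without a trailing newline is still picked up.
    return re.findall(r"^#\S+[ \t]+(.*?)[ \t]*$", body, flags=re.MULTILINE)

body = "#Delete 454454\nJohn Smith\n+1.555.555.5555"
print(hashtag_payloads(body))
```

The results are strings; wrapping each one in int() would reproduce the desired output from the question when the payload is numeric.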