REGEX to find all matches inside a given string - python

I have a problem that drives me nuts currently. I have a list with a couple of million entries, and I need to extract product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check did indeed give me string: print(type(item)) <class 'str'>
Now I searched online for a possible (and preferably fast - because of the million entries) regex solution to extract all the categories.
I found several questions here Match single quotes from python re: I tried re.findall(r"'(\w+)'", item) but only got empty brackets [].
Then I went on and searched for alternative methods like this one: Python Regex to find a string in double quotes within a string There someone tries the following matches=re.findall(r'\"(.+?)\"',item)
print(matches), but this failed in my case as well...
After that I tried some idiotic approach to get at least a workaround and solve this problem later: list_cat_split = item.split(',') which gives me
e["[['Electronics'"," 'Computers & Accessories'"," 'Cables & Accessories'"," 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
item.strip('\"')
item.strip(']')
item.strip('[')
item.strip()
category = re.findall(r"'(\w+)'", item)
if category not in list_categories:
list_categories.append(category)
however even this approach failed: [['Electronics'], []]
I searched further but did not find a proper solution. Sorry if this question is completly stupid, I am new to regex, and probably this is a no-brainer for regular regex users?
UPDATE:
Somehow I cannot answer my own question, thererfore here an update:
thanks for the answers - sorry for incomplete information, I very rarely ask here and usually try to find solutions on my own.. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML-application that is written entirely in Python. Also this is for my MSc project, so no production environment. Therefore I am fine with a slower, but working, solution as I do it once and for all. However as far as I can see the solution of #FailSafe worked for me:screenshot of my jupyter notebook
here the result with list
But yes I totally agree with # Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run over night,.. Thanks for all the help anyways, great people here :-)

this may not be your final answer but it creates a list of categories.
x="[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
y=x[2:-2]
z=y.split(',')
for item in z:
print(item)

Related

IDA python Find issues

my goal here is to search through the entire memory range of a process for the following pattern:
pop *
pop *
retn
I've tried using FindText but it seems that it only returns results for areas that have already been parsed for their instructions in IDA. so to use FindText id need to figure out how to parse the entire memory range for instructions (which seems like it would be intensive).
So i switched to FindBinary but i ran into an issue there as well. the pattern I'm searching only needs to match the first 5 bits of the byte and the rest is wildcard. so my goal would be to search for:
01011***
01011***
11000011
I've found posts claiming IDA has a ? wildcard for bytes, but i haven't been able to get it to work and even if it did it only seems to work for a full 8 bits. so for this approach i would need to find a way to search for bit patterns then parse the bits around the result. this seems like the most doable route but so far i haven't been able to find anything in the docs that can search bits like this.
does anyone know a way to accomplish what i want?
in classic stackoverflow style, i spent hours trying to figure it out then 20 minutes after asking for help i found the exact function i needed, get_byte()
def find_test():
base = idaapi.get_imagebase()
while True:
res = FindBinary(base, SEARCH_NEXT|SEARCH_DOWN, "C3")
if res==BADADDR: break
if 0b01011 == get_byte(res-1) >> 3 and 0b01011 == get_byte(res-2) >> 3:
print "{0:X}".format(res)
base=res+1
now, if only i could figure out how to do this with a wildcard in every instruction. because for this solution i need to know at least one full byte of the pattern

Extract keywords/phrases from a given short text using python and its libraries

From a user given input of job description, i need to extract the keywords or phrases, using python and its libraries. I am open for suggestions and guidance from the community of what libraries work best and if in case, its simple, please guide through.
Example of user input:
user_input = "i want a full stack developer. Specialization in python is a must".
Expected output:
keywords = ['full stack developer', 'python']
Well, a good keywords set is a good method. But, the key is how to build it. There are many way to do it.
Firstly, the simplest one is searching open keywords set in the web. It's depend on your luck and your knowledge. Your keywords (likes "python, java, machine learing") are common tags in Stackoverflow, Recruitment websites. Don't break the law!
The second one is IR(Information Extraction), it's more complex than the last one. There are many algorithms, likes "TextRank", "Entropy", "Apriori", "HMM", "Tf-IDF", "Conditional Random Fields", and so on.
Good lucky.
For matching keywords/phases, Trie Tree is more faster.
Well, i answered my own question. Thanks anyways for those who replied.
keys = ['python', 'full stack developer','java','machine learning']
keywords = []
for i in range(len(keys)):
word = keys[i]
if word in keys:
keywords.append(word)
else:
continue
print(keywords)
Output was as expected!

match hex string with list indice

I'm building a de-identify tool. It replaces all names by other names.
We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious.
outpout :
We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.
It can be done on multiple documents, and one name is always replaced by the same counterpart, so you can still understand who the text is talking about. BUT, all documents have an ID, referring to the person this file is about (I'm working with files in a public service) and only documents with the same people ID will be de-identified the same way, with the same names. (the goal is to watch evolution and people's history) This is a security measure, such as when I hand over the tool to a third party, I don't hand over the key to my own documents with it.
So the same input, with a different ID, produces :
We got a report that <name>Henry</name> met <name>Alicia</name> yesterday. <name>Henry</name> is suspicious.
Right now, I'm hashing each name with the document ID as a salt, I convert the hash to an integer, then subtract the length of the name list until I can request a name with that integer as an indice. But I feel like there should be a quicker/more straightforward approach ?
It's really more of an algorithmic question, but if it's of any relevance I'm working with python 2.7 Please request more explanation if needed. Thank you !
I hope it's clearer this way ô_o Sorry when you are neck-deep in your code you forget others need a bigger picture to understand how you got there.
As #LutzHorn pointed out, you could just use a dict to map real names to false ones.
You could also just do something like:
existing_names = []
for nameocurrence in original_text:
if not nameoccurence.name in existing_names:
nameoccurence.id = len(existing_names)
existing_names.append(nameoccurence.name)
else:
nameoccurence.id = existing_names.index(nameoccurence.name)
for idx, _ in enumerate(existing_names):
existing_names[idx] = gimme_random_name()
Try using a dictionary of names.
import re
names = {"Peter": "Billy", "Jane": "Elsa"}
for name in re.findall("<name>([a-zA-Z]+)</name>", s):
s = re.sub("<name>" + name + "</name>", "<name>"+ names[name] + "</name>", s)
print(s)
Output:
'We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.'

How to transform hyperlink codes into normal URL strings?

I'm trying to build a blog system. So I need to do things like transforming '\n' into < br /> and transform http://example.com into < a href='http://example.com'>http://example.com< /a>
The former thing is easy - just using string replace() method
The latter thing is more difficult, but I found solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement "Edit Article" function, so I have to do the reverse action on this.
So, how can I transform < a href='http://example.com'>http://example.com< /a> into http://example.com?
Thanks! And I'm sorry for my poor English.
Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
Since you are using the answer from that other question your links will always be in the same format. So it should be pretty easy using regex. I don't know python, but going by the answer from the last question:
import re
myString = 'This is my tweet check it out http://tinyurl.com/blah'
r = re.compile(r'(http://[^ ]+)')
print r.sub(r'\1', myString)
Should work.

lists and sublists

i use this code to split a data to make a list with three sublists.
to split when there is * or -. but it also reads the the \n\n *.. dont know why?
i dont want to read those? can some one tell me what im doing wrong?
this is the data
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
i use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
to make that wrong similar to what i have to get list
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [item.replace("\\n", "\n").strip() for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
This is going to split any asterisk you have in the text as well.
Better implementation would be to do something like:
lines = []
for line in open("data.dat"):
if line.lstrip.startswith("*"):
lines.append([line.strip()]) # append a list with your line
elif line.lstrip.startswith("-"):
lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
The following solves your problem i believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
for item in open('data.txt').read().split('\n*') ]
# now let's pretty print the result
for i in result:
print '***', i[0], '***'
for j in i[1:]:
print '\t--', j
print
Note I split on new-line + * or -, in this way it won't split on dashes inside the text. Also i replace the textual character sequence \ n \ n (r'\n\n') with a new line character '\n'. And the one-liner expression is list comprehension, a way to construct lists in one gulp, without multiple .append() or +

Categories