I have data in the following format:
Input Data:
## <id_1d3s2ia_p3m_zkjp59>
<Eckhard_Christian> <hasGender> <ma<>le> .
## <id_1jmz109_1gi_t71dyx>
<Peter_Pinn<>e> <created> <In_Your_Arms_(Love_song_from_"Neighbours")> .
## <id_v9bcjt_ice_fraki6>
<Blanchester,_Ohio> <hasWebsite> <http://www.blanchester.com/> .
## <id_10tunwc_p3m_zkjp59>
<Hub_(bassi~st)> <hasGender> <ma??le> .
Output Data:
<Eckhard_Christian> <hasGender> <male> <id_1d3s2ia_p3m_zkjp59>.
<Peter_Pinne> <created> <In_Your_Arms_(Love_song_from_"Neighbours")> <id_1jmz109_1gi_t71dyx>.
<Blanchester,_Ohio> <hasWebsite> <http://www.blanchester.com/> <id_v9bcjt_ice_fraki6>.
<Hub_(bassist)> <hasGender> <male> <id_10tunwc_p3m_zkjp59>.
That is, in the output data I want to delete all characters except alphanumerics and :, /, ., _, (, ) between any starting < and ending >. I know Python allows me to split using string.split(), but if I split using < and > as delimiters, then for <ma<>le> I get (<ma, <>, le>).
Is there some other way to split in Python so that I get the data in the desired form? Also, I want the <id...> from the preceding line (following ##) to appear as the last column.
Assuming that there is always a whitespace character before/after the "proper" < and >, you could try something like this, using regular expressions:
import re

with open('data') as data:
    for line in data:
        if line.startswith('##'):
            # the id is the <...> token after the ## marker
            id_ = re.search(r'\s(<.*>)', line).group(1)
            # grab each <...> field on the following data line
            fields = re.findall(r'(<.*?>)\s', next(data))
            # keep only word characters and : / . _ ( ) " inside each field
            fields = ['<' + re.sub(r'[^\w:/._()"]', '', f) + '>' for f in fields]
            print(fields + [id_])
Output:
['<Eckhard_Christian>', '<hasGender>', '<male>', '<id_1d3s2ia_p3m_zkjp59>']
['<Peter_Pinne>', '<created>', '<In_Your_Arms_(Love_song_from_"Neighbours")>', '<id_1jmz109_1gi_t71dyx>']
['<Blanchester_Ohio>', '<hasWebsite>', '<http://www.blanchester.com/>', '<id_v9bcjt_ice_fraki6>']
['<Hub_(bassist)>', '<hasGender>', '<male>', '<id_10tunwc_p3m_zkjp59>']
I'm trying to split the current line into 3 chunks.
The title column contains a comma, which is also the delimiter:
1,"Rink, The (1916)",Comedy
My current code is not working:
id, title, genres = line.split(',')
Expected result
id = 1
title = 'Rink, The (1916)'
genres = 'Comedy'
Any thoughts on how to split it properly?
Ideally, you should use a proper CSV parser and specify that the double quote is the quoting character. If you must proceed with the current string as the starting point, here is a regex trick which should work:
import re

inp = '1,"Rink, The (1916)",Comedy'
parts = re.findall(r'".*?"|[^,]+', inp)
print(parts)  # ['1', '"Rink, The (1916)"', 'Comedy']
The regex pattern works by first trying to find a term "..." in double quotes. Failing that, it falls back to matching a CSV term, defined as a sequence of non-comma characters (running up to the next comma or the end of the line).
Let's talk about why your code does not work:
id, title, genres = line.split(',')
Here line.split(',') returns 4 values (since you have 3 commas), while you are expecting only 3, hence you get:
ValueError: too many values to unpack (expected 3)
My advice is not to delimit with commas but with some other character:
"1#\"Rink, The (1916)\"#Comedy"
and then
id, title, genres = line.split('#')
Use the csv package from the standard library:
>>> import csv, io
>>> s = """1,"Rink, The (1916)",Comedy"""
>>> # Load the string into a buffer so that csv reader will accept it.
>>> reader = csv.reader(io.StringIO(s))
>>> next(reader)
['1', 'Rink, The (1916)', 'Comedy']
Well, you can do it by making it a tuple:
line = (1, "Rink, The (1916)", "Comedy")
id, title, genres = line
I have the following code that extracts the Message-IDs and gathers them in a DataFrame. It works and gives me the following results.
This is an example of the lines in the DataFrame:
Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>
What I want is only the string after the < character and before the >, since the Message-ID ends with >. Also, I have some lines where the Message-ID value is empty; I want to delete those lines.
Here is the code that I wrote:
import pandas as pd
import numpy as np

f = open('C:\\Users\\hmk\\Desktop\\PFE 2019\\ML\\MachineLearningPhishing-master\\MachineLearningPhishing-master\\code\\resources\\emails-enron.mbox', 'r')
line_num = 0
e = []
search_phrase = "Message-ID"
for line in f.readlines():
    line_num += 1
    if line.find(search_phrase) >= 0:
        #line = line[13:]
        #line = line[:-2]
        e.append(line)
f.close()
dfObj = pd.DataFrame(e)
One way to do it is using regex and pandas DataFrame replace:
clean_df = df.replace(to_replace=r'\<|\>', value='', regex=True)
clean_df = clean_df.replace(to_replace=r'(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
The first line of code removes the < and >, assuming the messages only contain those two.
The second checks whether there is a Message-ID in the body; if not, it replaces the value with NaN.
Note that I used numpy.nan just to simplify dropping the blank messages.
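For illustration, here is a minimal, self-contained sketch of the approach; the two sample rows are assumptions mimicking the question's data, not taken from the real mbox file:
import pandas as pd
import numpy as np

# Hypothetical sample rows mimicking the question's DataFrame
df = pd.DataFrame(['Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>',
                   'Message-ID: '])

clean_df = df.replace(to_replace=r'\<|\>', value='', regex=True)
clean_df = clean_df.replace(to_replace=r'(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
print(clean_df)  # one remaining row, brackets stripped; the empty-ID row is dropped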
You can use a regex which will extract the desired Message-ID for you.
So the first part, extracting the message ID, would look like this:
import re # import regex
s = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
message_id = re.search(r'Message-ID: <(.*?)>', s).group(1)
print('message_id: ', message_id)
Your ideal Message ID:
message_id:  23272646.1075847145300.JavaMail.evans#thyme
So you can loop through your data and check for the regex like this:
for line in f.readlines():
    line_num += 1
    message_id = re.search(r'Message-ID: <(.*?)>', line)
    if message_id:
        msg_id_string = message_id.group(1)  # the ID without < and >
        e.append(msg_id_string)
        # your other works
The if message_id: check works because re.search returns None when there is no match, so lines without a Message-ID never enter the if block.
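Putting the pieces together with the question's setup, a minimal end-to-end sketch (the shortened file path and the with-block are assumptions for readability):
import re
import pandas as pd

e = []
with open('emails-enron.mbox', 'r') as f:  # substitute the full path from the question
    for line in f:
        match = re.search(r'Message-ID: <(.*?)>', line)
        if match:
            e.append(match.group(1))  # only the ID between < and >

# Lines with no <...> value never match, so empty Message-IDs are skipped
dfObj = pd.DataFrame(e)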
You want a substring of your lines
for line in f.readlines():
    if all(word in line for word in [search_phrase, "<", ">"]):
        e.append(line[line.find("<")+1:-1])
        # -1 assumes ">" is the last character of the line
Use in to check if a string is inside another string
Use find to get the index of your pattern
Use [start:end] slicing to get the substring between your two indices
s = "We want <This text inside only>. yes we do."
s2 = s[s.find("<")+1:s.find(">")]
print(s2) # Prints : This text inside only
# If you want to remove empty lines :
lines = filter(lambda x: x.strip(), lines)
filter goes through all the lines, so there is no need for an explicit for loop.
One suggestion for you:
import re
f = open('PATH/TO/FILE', 'r').read()
msgID = re.findall(r'(?<=<).*?(?=>)', f)
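As a quick sanity check, here is that lookaround pattern applied to the sample line from the question (output shown in the comment):
import re

s = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
print(re.findall(r'(?<=<).*?(?=>)', s))
# ['23272646.1075847145300.JavaMail.evans#thyme']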
I have a .txt file that looks like this:
As you can see, it contains several relationships between verbs (don't care about the numbers); the file has 5,000 lines.
The data is here, under "Download & Use VerbOcean": http://demo.patrickpantel.com/demos/verbocean/
What I want is a dict for each relationship, so that we could say for example
similar-to['anger'] = 'energize'
happens-before['X'] = 'Y'
stronger-than ['A'] = 'B'
and so on.
What I have so far works perfectly, but only for the [stronger-than] relationship. How should I extend it so that it handles all the other relationships as well?
import csv

file = open("C:\\Users\\shide\\Desktop\\Independent study\\data.txt")
counter = 1
stronger = {}
strongerverb = []
secondverb = []
term1 = "[stronger-than]"  # look for stronger-than

for line in file:
    words = line.split()  # split the line into words
    if term1 in words:  # if [stronger-than] exists in the line, record the verbs
        strongerverb.append(line.split(None, 1)[0])  # add only the first verb
        secondverb.append(line.split()[2])  # add the second verb

capacity = len(strongerverb)
index = 0
while index != capacity:
    line = strongerverb[index]
    for word in line.split():
        # print(word)
        index = index + 1
# print("First verb:", strongerverb)
# print("Second verb:", secondverb)

for i in range(len(strongerverb)):
    stronger[strongerverb[i]] = secondverb[i]

# Write a CSV file whose first column contains the verb that is
# stronger than the verb in the second column.
with open('output.csv', 'w') as output:
    writer = csv.writer(output, lineterminator='\n')
    for strongerverb, secondverb in stronger.items():
        writer.writerow([strongerverb, secondverb])
One way is to repeat the same code for every other relationship, but I guess that would not be a smart thing to do. Any ideas?
I am new to python and any help would be greatly appreciated.
This can be done using a regular expression:
import re
regexp = re.compile(r'^([^\[\]\s]+)\s*\[([^\[\]\s]+)\]\s*([^\[\]\s]+)\s*.*$', re.MULTILINE)
^ (at the beginning) anchors the match to the start of a line (per line rather than per string, thanks to re.MULTILINE).
$ (at the end) anchors the match to the end of the line.
[^\[\]\s]+ matches one or more characters that are not [, ], or whitespace; inside square brackets, ^ negates the character class.
We wrap that expression in () to mark it as a group, to be captured via m.groups(). Since we want both verbs and their relationship, we wrap each of those three parts in ().
Between those groups we match any spaces with \s*, and the rest of the line with .*; both are ultimately ignored, since they are not wrapped in ().
For example:
data = """
invade [happens-before] annex :: ....
annex [similar] invade :: ....
annex [opposite-of] cede :: ....
annex [stronger-than] occupy :: ....
"""
relationships = {}
for m in regexp.finditer(data):
v1,r,v2 = m.groups()
relationships.setdefault(r, {})[v1] = v2
print(relationships)
Outputs:
{'happens-before': {'invade': 'annex'},
'opposite-of': {'annex': 'cede'},
'similar': {'annex': 'invade'},
'stronger-than': {'annex': 'occupy'}}
Then to get the relationship 'similar' of the verb 'annex', use:
relationships['similar']['annex']
Which will return: 'invade'
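To run this on the actual file instead of the inline sample, a minimal sketch (the path is the one from the question; everything else follows the answer above):
import re

regexp = re.compile(r'^([^\[\]\s]+)\s*\[([^\[\]\s]+)\]\s*([^\[\]\s]+)\s*.*$', re.MULTILINE)

with open("C:\\Users\\shide\\Desktop\\Independent study\\data.txt") as f:
    data = f.read()

relationships = {}
for m in regexp.finditer(data):
    v1, r, v2 = m.groups()
    relationships.setdefault(r, {})[v1] = v2

# e.g. relationships.get('stronger-than', {}) is the stronger-than dict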
This program makes an array of verbs which come from a text file.
file = open("Verbs.txt", "r")
data = str(file.read())
table = eval(data)
num_table = len(table)
new_table = []
for x in range(0, num_table):
newstr = table[x].replace(")", "")
split = newstr.rsplit("(")
numx = len(split)
for y in range(0, numx):
split[y] = split[y].split(",", 1)[0]
new_table.append(split[y])
num_new_table = len(new_table)
for z in range(0, num_new_table):
print(new_table[z])
However, the text itself contains hex escape sequences, as in:
('a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]', ':', 17.6044921875)('A\\xc4\\x9fr\\xc4\\xb1[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]', ':', 11.5615234375)
I'm trying to get rid of those. How am I supposed to do that?
I've looked up pretty much everywhere and decode() returns an error (even after importing codecs).
You could use parse, a Python module that lets you search a string for regularly formatted components; from the components returned, you can extract the corresponding integers and replace them in the original string.
For example (untested alert!):
import parse

# Find all hex-like items, e.g. the "c4" in "\xc4"
list_of_findings = parse.findall("\\x{:w}", your_string)
# For each item
for hex_item in list_of_findings:
    # Replace the escape sequence in the string
    your_string = your_string.replace(
        # The original escape sequence
        "\\x" + hex_item[0],
        # Convert the parsed hex digits to an int, then back to a string
        str(int(hex_item[0], 16))
    )
Obs: instead of int, you could convert the found hex-like values to characters using chr, as in:
chr(int(hex_item[0], 16))
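Alternatively, since those look like escaped UTF-8 bytes, the standard library can decode them without any third-party module; a sketch assuming the whole string follows that pattern:
s = 'a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]'

# Interpret the literal \xNN escapes as raw bytes, then decode those bytes as UTF-8
decoded = s.encode('latin-1').decode('unicode_escape').encode('latin-1').decode('utf-8')
print(decoded)  # ağrı[Verb]+[Pos]+[Imp]+[A2sg]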
Can you use values from your script to dynamically tell a regex how to operate?
For example:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{n_rep}'
line_matches = re.findall(new_pattern, some_text)
I keep running into problems trying to get the grouping to work.
Explanation
I am attempting to find the most common number of repetitions of a regex pattern in a text file in order to find table type data within files.
I have the idea to make a regex such as this:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
line_matches = np.array([re.findall(base_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
# Find where the text has similar number of words/data in each line
where_same_pattern= np.where(np.diff([len(x) for x in line_matches])==0)
line_matches_where_same = line_matches[where_same_pattern]
# Extract out just the lines which have data
interesting_lines = np.array([x for x in line_matches_where_same if x != []])
# Find how many words in each line of interest
len_of_lines = [len(l) for l in interesting_lines]
# Use the most prevalent as the most likely number of columns of data
n_cols = Counter(len_of_lines).most_common()[0][0]
# Rerun the data through a regex to find the columns
new_pattern = base_pattern + '{n_cols}'
line_matches = np.array([re.findall(new_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
You need to use the value of the variable, not a string literal containing the name of the variable, e.g.:
new_pattern = base_pattern + '{' + str(n_cols) + '}'
Your pattern is just a string. So, all you need is to convert your number into a string. You can use format (for example, https://infohost.nmt.edu/tcc/help/pubs/python/web/new-str-format.html) to do that:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)
print(new_pattern)  # '\\s*(([\\d.\\w]+)[ \\h]+){6}'
Note that the first two and the last two curly braces produce literal curly braces in the new pattern, while {0} is replaced by the number n_rep.
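As a quick usage check, here is the composed pattern applied with re.findall. Note that \h is not a valid escape in Python's re module, so this sketch substitutes [ \t], and the sample text is an assumption:
import re

base_pattern = r'\s*(([\d.\w]+)[ \t]+)'  # [ \t] replaces the question's [ \h]
n_rep = 3
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)

some_text = 'a1 b2 c3 '  # hypothetical line with three "columns"
print(re.findall(new_pattern, some_text))
# [('c3 ', 'c3')] -- a repeated group only keeps its last repetition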