Can't get rid of hex characters - python

This program makes an array of verbs which come from a text file.
file = open("Verbs.txt", "r")
data = str(file.read())
table = eval(data)
num_table = len(table)
new_table = []
for x in range(0, num_table):
    newstr = table[x].replace(")", "")
    split = newstr.rsplit("(")
    numx = len(split)
    for y in range(0, numx):
        split[y] = split[y].split(",", 1)[0]
        new_table.append(split[y])
num_new_table = len(new_table)
for z in range(0, num_new_table):
    print(new_table[z])
However the text itself contains hex characters such as in
('a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]', ':', 17.6044921875)('A\\xc4\\x9fr\\xc4\\xb1[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]', ':', 11.5615234375)
I'm trying to get rid of those. How am I supposed to do that?
I've looked pretty much everywhere, and decode() returns an error (even after importing codecs).

You could use parse, a Python module that lets you search a string for regularly formatted components. From the components it returns, you can extract the corresponding hex values and replace them in the original string.
For example (untested alert!):
import parse
# Parse all hex-like items
list_of_findings = parse.findall("\\x{:w}", your_string)
# For each item
for hex_item in list_of_findings:
    # Replace the item in the string
    your_string = your_string.replace(
        # Rebuild the matched escape from the parsed value
        "\\x" + hex_item[0],
        # Interpret the parsed value as a base-16 number,
        # then convert the resulting int to a string
        str(int(hex_item[0], 16))
    )
Note: instead of str(int(...)), you could convert the found hex-like values to characters, using chr, as in:
chr(int(hex_item[0], 16))
Be aware that {:w} matches any run of letters, digits and underscores, so an escape directly followed by a letter (as in \x9fr) will capture more than the two hex digits.
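As an alternative that needs only the standard library, here is a minimal sketch. It assumes the text is plain ASCII apart from the literal \xNN sequences, and that those sequences are escaped UTF-8 bytes (a guess based on the sample, where \xc4\x9f and \xc4\xb1 decode to the Turkish letters ğ and ı). The function names decode_escaped_utf8 and strip_escapes are made up for this illustration.

```python
import re

def decode_escaped_utf8(text):
    """Turn literal '\\xNN' sequences into the characters they encode.

    Assumes `text` is ASCII apart from the escapes and that the escaped
    bytes are UTF-8 (an assumption based on the sample data).
    """
    # Step 1: interpret the escapes, giving one code point per byte value.
    raw = text.encode("ascii").decode("unicode_escape")
    # Step 2: map those code points back to bytes and decode them as UTF-8.
    return raw.encode("latin-1").decode("utf-8")

def strip_escapes(text):
    """Alternative: simply drop every literal '\\xNN' sequence."""
    return re.sub(r"\\x[0-9a-fA-F]{2}", "", text)

sample = "a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]"
decoded = decode_escaped_utf8(sample)
stripped = strip_escapes(sample)
print(decoded)
print(stripped)
```

Decoding keeps the original word readable, while stripping discards the non-ASCII letters entirely. Which one is appropriate depends on what "get rid of" is supposed to mean for the verb list.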

Related

Extract numeric values from a string for python

I have a string which contains numeric values inside quotes. I need to extract the numeric values from it and also remove the [ and ].
sample string: texts = ['13007807', '13007779']
texts = ['13007807', '13007779']
texts.replace("'", "")
texts.strip("'")
print texts
# this will return ['13007807', '13007779']
So what I need to extract from the string is:
13007807
13007779
If your texts variable is a string as I understood from your reply, then you can use Regular expressions:
import re
text = "['13007807', '13007779']"
regex=r"\['(\d+)', '(\d+)'\]"
values=re.search(regex, text)
if values:
    value1 = int(values.group(1))
    value2 = int(values.group(2))
output:
value1=13007807
value2=13007779
You can use the * unpacking operator:
texts = ['13007807', '13007779']
print (*texts)
output:
13007807 13007779
If you have:
data = "['13007807', '13007779']"
print (*eval(data))
output:
13007807 13007779
The easiest way is to use map and wrap the result in list:
list(map(int,texts))
Output
[13007807, 13007779]
If your input data is of format data = "['13007807', '13007779']" then
import re
data = "['13007807', '13007779']"
list(map(int, re.findall(r'(\d+)', data)))
or
list(map(int, eval(data)))
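Since eval executes whatever expression it is given, a more cautious variant of the last approach is ast.literal_eval, which only accepts Python literals. A minimal sketch, assuming the input really is a string that looks like a list of quoted numbers (extract_numbers is a name invented for this example):

```python
import ast

def extract_numbers(data):
    """Parse a string such as "['1', '2']" and return the values as ints."""
    items = ast.literal_eval(data)  # safe: only literals are evaluated
    return [int(item) for item in items]

numbers = extract_numbers("['13007807', '13007779']")
print(*numbers, sep="\n")  # one value per line, as in the question
```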

Divide string data into list of lists by finding /r/n substring

I'm not sure what I'm doing wrong here? Do I need to treat the \ as a special character? The sequence \r\n appears literally in the string in the txt file I am using.
def split_into_rows(weather_data):
    list_of_rows = []
    while not (weather_data.find("\r\n") == -1):
        firstreturnchar = weather_data.find("\r\n")
        row = weather_data[ :firstreturnchar]
        list_of_rows = list_of_rows.append(row)
    return list_of_rows
What I need is, while there are still examples of the substring \r\n left in the string, to find the first instance of the substring "\r\n", chop everything before that and put it into the variable row, then append row to list_of_rows.
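You could use split(). Because \r\n appears literally in the text (a backslash followed by a letter), escape the backslashes so they are not interpreted as a carriage return and a line feed:
def split_into_rows(weather_data):
    return weather_data.split('\\r\\n')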
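For reference, the loop in the question has two further problems besides the escaping: list.append returns None, so assigning its result back to list_of_rows discards the list, and weather_data is never shortened, so the loop would never advance. A sketch of the loop with those points addressed, assuming the separator is the literal four-character sequence backslash, r, backslash, n (the sep parameter is added here for illustration):

```python
def split_into_rows(weather_data, sep="\\r\\n"):
    """Split on a literal backslash-r-backslash-n sequence, loop style."""
    list_of_rows = []
    while sep in weather_data:
        first = weather_data.find(sep)
        list_of_rows.append(weather_data[:first])     # append mutates in place
        weather_data = weather_data[first + len(sep):]  # drop the consumed part
    if weather_data:
        list_of_rows.append(weather_data)  # keep whatever follows the last separator
    return list_of_rows

rows = split_into_rows("2020-01-01,3.5\\r\\n2020-01-02,4.1\\r\\n2020-01-03,2.9")
print(rows)
```

Unlike str.split, this version does not produce a trailing empty string when the data ends with the separator.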

How to encode characters in a tweet to integers by using python

I read a file that contains one tweet per line using Python. Now I need to create a character vocabulary from it and encode each sentence with it. However, I need to extract the emoji descriptions without splitting them into characters. To make my purpose clearer, consider the following tweet:
x='Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart:\xef\xb8\x8f:heavy_black_heart:\xef\xb8\x8f'
First of all, I should say that I don't know why there are two \xef\xb8\x8f sequences. When I look at the file, there is no such thing.
Let's say I have a dictionary that stores a unique integer for each character and emoji description (:heavy_black_heart:):
dict = {'W => 1' , 'i=>2','s=>3','h=>4',':heavy_black_heart =>5',':smiling_face=>6','z=>7', .... etc}
Now, what I want to do is convert this string x into an array Y that stores the corresponding integer for each character and emoji description in the string.
Y= [1,2,3,4,......,5,5]
I read the file and put it into an array, but I couldn't figure out how to do the last part. Here is what I've done so far:
def parse_dataset(fp):
    y = []
    corpus = []
    with open(fp, 'rt') as data_in:
        for line in data_in:
            if not line.startswith("Tweet index"): # discard first line if it contains metadata
                line = line.rstrip() # remove trailing whitespace
                label = int(line.split("\t")[1])
                tweet = line.split("\t")[2]
                y.append(label)
                corpus.append(tweet)
    return corpus, y
if __name__ == "__main__":
    DATASET_FP = "input_file.txt"
    corpus, y = parse_dataset(DATASET_FP)
Is there anybody who can help me?
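One possible way to approach the last step, sketched under assumptions: emoji descriptions are taken to be lowercase names wrapped in colons (as in :heavy_black_heart:), every other character is its own token, and the vocabulary is built from the corpus itself rather than given in advance. The names TOKEN_RE, tokenize, build_vocab and encode are invented for this illustration.

```python
import re

# Either a ':name:' emoji description or any single character.
# The character set allowed inside the colons is an assumption.
TOKEN_RE = re.compile(r":[a-z0-9_+\-]+:|.", re.DOTALL)

def tokenize(tweet):
    """Split a tweet into characters, keeping emoji descriptions whole."""
    return TOKEN_RE.findall(tweet)

def build_vocab(corpus):
    """Assign a unique integer (starting at 1) to each distinct token."""
    vocab = {}
    for tweet in corpus:
        for token in tokenize(tweet):
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode(tweet, vocab):
    """Map a tweet to the list of integers for its tokens."""
    return [vocab[token] for token in tokenize(tweet)]

corpus = ["Wish :heavy_black_heart::heavy_black_heart:"]
vocab = build_vocab(corpus)
Y = encode(corpus[0], vocab)
print(Y)
```

The \xef\xb8\x8f bytes in the sample are the UTF-8 encoding of U+FE0F (variation selector-16), an invisible character that often follows emoji. That would explain why it does not show up when looking at the file. With this sketch it would simply become one more vocabulary entry unless it is stripped first.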

How do I avoid errors when parsing a .csv file in python?

I'm trying to parse a .csv file that contains two columns: Ticker (the company ticker name) and Earnings (the corresponding company's earnings). When I read the file using the following code:
f = open('earnings.csv', 'r')
earnings = f.read()
The result when I run print earnings looks like this (it's a single string):
Ticker;Earnings
AAPL;52131400000
TSLA;-911214000
AMZN;583841600
I use the following code to split the string on the newline character (\n) and then split each resulting line on the semicolon character:
earnings_list = earnings.split('\n')
string_earnings = []
for string in earnings_list:
    colon_list = string.split(';')
    string_earnings.append(colon_list)
The result is a list of lists where each list contains the company's ticker at index[0] and its earnings at index[1], like this:
[['Ticker', 'Earnings\r\r'], ['AAPL', '52131400000\r\r'], ['TSLA', '-911214000\r\r'], ['AMZN', '583841600\r\r']]
Now, I want to convert the earnings at index[1] of each list, which are currently strings, into integers. So I first remove the first list containing the column names:
headless_earnings = string_earnings[1:]
Afterwards I try to loop over the resulting list to convert the values at index[1] of each list into integers with the following:
numerical = []
for i in headless_earnings:
    num = int(i[1])
    numerical.append(num)
I get the following error:
num = int(i[1])
IndexError: list index out of range
How is that index out of range?
You are most likely mishandling the line endings.
If I try your code with this string: "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600" it works.
But with this one: "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n" it doesn't.
Explanation: split creates a last list item containing only ['']. So at the end, python tries to access [''][1], hence the error.
So a very simple workaround would be to remove the last '\n' (if you're sure it's a '\n', otherwise you might have surprises).
You could write this:
earnings_list = earnings[:-1].split('\n')
This will fix your error.
If you want to be sure you remove a last '\n', you can write:
earnings_list = earnings[:-1].split('\n') if earnings[-1] == '\n' else earnings.split('\n')
EDIT: test code:
#!/usr/bin/env python2
earnings = "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n"
earnings_list = earnings[:-1].split('\n') if earnings[-1] == '\n' else earnings.split('\n')
string_earnings = []
for string in earnings_list:
    colon_list = string.split(';')
    string_earnings.append(colon_list)
headless_earnings = string_earnings[1:]
#print(headless_earnings)
numerical = []
for i in headless_earnings:
    num = int(i[1])
    numerical.append(num)
print numerical
Output:
nico#ometeotl:~/temp$ ./test_script2.py
[52131400000, -911214000, 583841600]
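Another option is to let the standard csv module do the splitting and skip blank rows explicitly, which avoids depending on whether the data ends with a newline. A minimal Python 3 sketch, assuming the same semicolon-separated layout with one header row (parse_earnings is a name made up for this example):

```python
import csv

def parse_earnings(text):
    """Return (ticker, earnings) pairs from semicolon-separated text."""
    # splitlines() treats \r, \n and \r\n as line breaks, so stray carriage
    # returns become empty lines, which csv.reader yields as empty rows.
    rows = csv.reader(text.splitlines(), delimiter=";")
    next(rows)  # skip the header row
    return [(row[0], int(row[1])) for row in rows if row]

earnings = "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n"
parsed = parse_earnings(earnings)
numerical = [value for _, value in parsed]
print(numerical)
```

The `if row` filter is what prevents the IndexError: an empty row has no index 1, so it is dropped before the conversion.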

regex number of repetitions from code

Can you use values from a script to tell regexes dynamically how to operate?
For example:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{n_rep}'
line_matches = re.findall(new_pattern, some_text)
I keep running into problems when trying to get the grouping to work.
Explanation
I am attempting to find the most common number of repetitions of a regex pattern in a text file in order to find table type data within files.
I have the idea to make a regex such as this:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
line_matches = np.array([re.findall(base_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
# Find where the text has similar number of words/data in each line
where_same_pattern= np.where(np.diff([len(x) for x in line_matches])==0)
line_matches_where_same = line_matches[where_same_pattern]
# Extract out just the lines which have data
interesting_lines = np.array([x for x in line_matches_where_same if x != []])
# Find how many words in each line of interest
len_of_lines = [len(l) for l in interesting_lines]
# Use the most prevalent as the most likely number of columns of data
n_cols = Counter(len_of_lines).most_common()[0][0]
# Rerun the data through a regex to find the columns
new_pattern = base_pattern + '{n_cols}'
line_matches = np.array([re.findall(new_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
You need to use the value of the variable, not a string literal containing the name of the variable. Since n_cols is an integer, convert it to a string before concatenating, e.g.:
new_pattern = base_pattern + '{' + str(n_cols) + '}'
Your pattern is just a string. So, all you need is to convert your number into a string. You can use format (for example, https://infohost.nmt.edu/tcc/help/pubs/python/web/new-str-format.html) to do that:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)
print new_pattern ## '\\s*(([\\d.\\w]+)[ \\h]+){6}'
Note that the first two and the last two curly braces produce the literal curly braces in the new pattern, while {0} is replaced by the number n_rep.
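The same idea can be written with an f-string, where doubled braces again stand for literal braces. The sketch below makes two assumptions that are not in the original: the base pattern is wrapped in a non-capturing group so that the quantifier repeats the whole pattern rather than only its last group, and \h is replaced by \t because recent Python 3 versions reject \h as a bad escape. The helper name repeated is invented for this example.

```python
import re

def repeated(base_pattern, n):
    """Build a pattern that matches base_pattern exactly n times in a row."""
    # {{ and }} are literal braces; {n} and {base_pattern} are substituted.
    return f"(?:{base_pattern}){{{n}}}"

# Simplified version of the pattern from the question (groups removed).
base_pattern = r"\s*[\d.\w]+[ \t]+"
new_pattern = repeated(base_pattern, 3)
print(new_pattern)

three_columns = re.fullmatch(new_pattern, "1.0 2.5 abc ")
two_columns = re.fullmatch(new_pattern, "1.0 2.5 ")
```

Here three_columns is a match object and two_columns is None, since the second line contains only two whitespace-terminated fields.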
