Reading in data from a file using regex in Python

I have a data file with tons of data like:
{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}
I want to read in the data and save it in a list. I am having trouble getting exactly the right code to extract the data between the { }. I don't want the quotes and the ` after the numbers. Also, the data is not separated by line, so how do I tell re.search where to begin looking for the next set of data?

At first glance, you can break this data into chunks by splitting it on the string },{:
chunks = data.split('},{')
chunks[0] = chunks[0][1:] # first chunk started with '{'
chunks[-1] = chunks[-1][:-1] # last chunk ended with '}'
Now you have chunks like
"Passenger Quarters",27.`,"Cardassian","not injured"
and you can apply a regular expression to them.
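For instance, a rough sketch of that second step on a single chunk (assuming every field is either a quoted string or an integer whose trailing .` junk can be discarded) might look like:
import re
chunk = '"Passenger Quarters",27.`,"Cardassian","not injured"'
# each field is either a quoted string or a bare integer; the trailing .` is dropped
field_re = re.compile(r'"([^"]*)"|(\d+)')
row = []
for text, number in field_re.findall(chunk):
    row.append(int(number) if number else text)
print(row)  # ['Passenger Quarters', 27, 'Cardassian', 'not injured']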

You should do this in two passes: one to get the list of items and one to get the contents of each item:
import re
from pprint import pprint
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'
# This splits up the data into items where each item is the
# contents inside a pair of braces
item_pattern = re.compile("{([^}]+)}")
# This splits up each item into its parts, either matching a string
# inside quotation marks or a number followed by some garbage
contents_pattern = re.compile('(?:"([^"]+)"|([0-9]+)[^,]+),?')
rows = []
for item in item_pattern.findall(data):
    row = []
    for content in contents_pattern.findall(item):
        if content[1]:  # Number matched, treat it as one
            row.append(int(content[1]))
        else:  # Number not matched, use the string (even if empty)
            row.append(content[0])
    rows.append(row)
pprint(rows)
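For the sample data above, this should print roughly:
[['Passenger Quarters', 27, 'Cardassian', 'not injured'],
 ['Passenger Quarters', 9, 'Cardassian', 'injured'],
 ['Passenger Quarters', 32, 'Romulan', 'not injured'],
 ['Bridge', 'Unknown', 'Romulan', 'not injured']]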

The following will produce a list of lists, where each list is an individual record.
import re
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Pssenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'
# remove characters we don't want and split into individual fields
badchars = ['{','}','`','.','"']
newdata = data.translate(None, ''.join(badchars))
fields = newdata.split(',')
# Assemble groups of 4 fields into separate lists and append
# to the parent list. Obvious weakness here is if there are
# records that contain something other than 4 fields
records = []
myrecord = []
recordcount = 1
for field in fields:
    myrecord.append(field)
    recordcount = recordcount + 1
    if (recordcount > 4):
        records.append(myrecord)
        myrecord = []
        recordcount = 1
for record in records:
    print record
Output:
['Passenger Quarters', '27', 'Cardassian', 'not injured']
['Passenger Quarters', '9', 'Cardassian', 'injured']
['Passenger Quarters', '32', 'Romulan', 'not injured']
['Bridge', 'Unknown', 'Romulan', 'not injured']
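Note that str.translate(None, ...) and the bare print statement are Python 2 only. On Python 3 the cleanup step would look roughly like this instead (and print needs parentheses):
newdata = data.translate(str.maketrans('', '', ''.join(badchars)))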

Related

How to split a comma-separated line if the chunk contains a comma in Python?

I'm trying to split the current line into 3 chunks.
The Title column contains a comma, which is also the delimiter:
1,"Rink, The (1916)",Comedy
My current code is not working:
id, title, genres = line.split(',')
Expected result
id = 1
title = 'Rink, The (1916)'
genres = 'Comedy'
Any thoughts how to split it properly?
Ideally, you should use a proper CSV parser and tell it that the double quote is the quoting character. If you must proceed with the current string as the starting point, here is a regex trick which should work:
inp = '1,"Rink, The (1916)",Comedy'
parts = re.findall(r'".*?"|[^,]+', inp)
print(parts) # ['1', '"Rink, The (1916)"', 'Comedy']
The regex pattern works by first trying to find a term "..." in double quotes. Failing that, it falls back to finding a CSV term, defined as a sequence of non-comma characters (running up to the next comma or the end of the line).
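Note that the quoted title keeps its surrounding double quotes. If you don't want them, one option (assuming there are no escaped quotes inside the field) is to strip them afterwards:
parts = [p.strip('"') for p in parts]
print(parts)  # ['1', 'Rink, The (1916)', 'Comedy']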
Let's talk about why your code does not work:
id, title, genres = line.split(',')
Here line.split(',') returns 4 values (since you have 3 commas), while you are unpacking into 3 variables, hence you get:
ValueError: too many values to unpack (expected 3)
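You can see the four pieces directly:
>>> '1,"Rink, The (1916)",Comedy'.split(',')
['1', '"Rink', ' The (1916)"', 'Comedy']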
My advice would be to not use commas but some other delimiter character, e.g.
"1#\"Rink, The (1916)\"#Comedy"
and then
id, title, genres = line.split('#')
Use the csv package from the standard library:
>>> import csv, io
>>> s = """1,"Rink, The (1916)",Comedy"""
>>> # Load the string into a buffer so that csv reader will accept it.
>>> reader = csv.reader(io.StringIO(s))
>>> next(reader)
['1', 'Rink, The (1916)', 'Comedy']
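If the data lives in a file rather than a string, you can hand the file object to csv.reader directly (movies.csv is just a placeholder name here):
import csv
with open('movies.csv', newline='') as f:  # newline='' is recommended for csv files
    for row in csv.reader(f):
        print(row)  # each row is already a list of fields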
Well, you can do it by making it a tuple (quoting the genre so it is a valid string literal):
line = (1, "Rink, The (1916)", "Comedy")
id, title, genres = line

How to extract values from a csv splitting in the correct place (no imports)?

How can I read a csv file without using any external import (e.g. csv or pandas) and turn it into a list of lists? Here's the code I worked out so far:
m = []
for line in myfile:
    m.append(line.split(','))
Using this for loop works pretty well, but if one of the fields in the csv contains a ',', the line gets broken at the wrong place.
So, for example, if one of the lines I have in the csv is:
12,"This is a single entry, even if there's a coma",0.23
The relative element of the list is the following:
['12', '"This is a single entry', 'even if there is a coma"','0.23\n']
While I would like to obtain:
['12', '"This is a single entry, even if there is a coma"','0.23']
I would avoid trying to use a regular expression; instead, you can process the text a character at a time to determine where the quote characters are. Also, normally the quote characters are not included as part of a field.
A quick example approach would be the following:
def split_row(row, quote_char='"', delim=','):
    in_quote = False
    fields = []
    field = []
    for c in row:
        if c == quote_char:
            in_quote = not in_quote
        elif c == delim:
            if in_quote:
                field.append(c)
            else:
                fields.append(''.join(field))
                field = []
        else:
            field.append(c)
    if field:
        fields.append(''.join(field))
    return fields
fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
Which would display:
3 ['12', "This is a single entry, even if there's a coma", '0.23']
The csv library does a far better job of this, though. This script does not handle any special cases beyond your test string.
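For comparison, here is roughly what that looks like with the csv module (which the question rules out, but it handles the quoting rules for you):
import csv, io
row = '''12,"This is a single entry, even if there's a coma",0.23'''
print(next(csv.reader(io.StringIO(row))))
# ['12', "This is a single entry, even if there's a coma", '0.23']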
Here is my go at it:
line ='12, "This is a single entry, more bits in here ,even if there is a coma",0.23 , 12, "This is a single entry, even if there is a coma", 0.23\n'
line_split = line.replace('\n', '').split(',')
quote_loc = [idx for idx, l in enumerate(line_split) if '"' in l]
quote_loc.reverse()
assert len(quote_loc) % 2 == 0, "value was odd, should be even"
for m, n in zip(quote_loc[::2], quote_loc[1::2]):
line_split[n] = ','.join(line_split[n:m+1])
del line_split[n+1:m+1]
print(line_split)

How to transfer plain text headings and listings to Python dictionary object?

My question:
I want to parse plain text with headings and listings into a single Python object, where the headings become dict keys and the listings become lists of values. The text is shown below:
Playing cricket is my hobby:
(a) true.
(b) false.
Furthermore, the heading does not include:
(a) Singlets.
(b) fabrics.
(c) Smocks.
My desired output is:
{"Playing cricket is my hobby:":["(a)true.","(b)false."],"Furthermore, the heading does not include:":["(a) Singlets.","(b) Garments.","(c) Smocks."]}
What I have done
I first convert the text to a list of strings:
plaintxtlist=['Playing cricket is my hobby:','(a) true.','(b) false.','Furthermore, the heading does not include:','(a) Singlets.',' (b) fabrics.','(c) Smocks.']
I tried to convert the list above into a dictionary whose keys are the index of each heading and whose values are lists of text. Here is the code:
import re
data = {}  # dictionary
lst = [] #list
regalter=r"^\s*\(([^\)]+)\).*|^\s*\-.*" #regex to identify (a)(A) or - type of lines
j=0
sub = [] #list
plaintxtlist=['Playing cricket is my hobby:','(a) true.','(b) false.','Furthermore, the heading does not include:','(a) Singlets.',' (b) fabrics.','(c) Smocks.']
for i in plaintxtlist:  # the data from the text file is converted to a list of strings and passed to the code
    if sub:
        match = re.match(regalter, i)  # pattern matching using regex
        if match:
            sub.append(i)  # if the line contains (a) or (A) it will be appended to the list called sub
        else:
            j = j + 1  # each group of lines gets an index from 0 to n (n is the last group)
            sub = [i]  # the heading line starts a new sub list
            data[str(j)] = sub  # here the sub list is added to the dictionary named data under keys 0, 1, 2, 3... (converted to strings)
    else:
        if sub:
            data[str(j)] = sub  # if sub has content, the sub list is added to the dictionary named data
        sub = [i]  # each line will be appended to the sub list
        data[str(j)] = i  # if there is no match with the regex the plain text is added to the dictionary
print(data)  # print the resulting dictionary
And the output from the code above:
{"0":["Playing cricket is my hobby:","(a)true.","(b)false."],"1":["Furthermore, the heading does not include:","(a) Singlets.","(b) Garments.","(c) Smocks."]}
You don't need to convert each line into a list item first. To make it simpler, you can organize the raw text content with a regex first, then parse it into the dictionary you want.
You can recover the grouping by matching text up to a period that is not followed by a "(" on the next line.
Suppose the text content is saved in a file called a_text_file.txt. The full code lies here:
import re
with open('a_text_file.txt') as f:
    s = f.read()
pattern = re.compile(r'[\w\s\().:,]+?\.(?!\n\()')
data = dict()
for m in re.findall(pattern, s):
    # Group the raw content matched by the regex,
    # and fit each line into a list
    group = m.strip()
    lst = group.split('\n')
    # Strip out spaces in `key` and `value`
    key = lst[0].strip()
    value = [i.strip() for i in lst[1:]]
    # Fit into the final output
    data.update({key: value})
print(data)
The final output:
{'Playing cricket is my hobby:': ['(a) true.', '(b) false.'], 'Furthermore, the heading does not include:': ['(a) Singlets.', '(b) fabrics.', '(c) Smocks.']}
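An alternative sketch that avoids the regex entirely, assuming the file starts with a heading and every listing line begins with "(", would be to group the lines as you read them:
data = {}
current = None
with open('a_text_file.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        if line.startswith('('):
            data[current].append(line)  # a listing such as "(a) true."
        else:
            current = line              # a heading line starts a new group
            data[current] = []
print(data)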

create csv that will only add one header for each line

I have a file that looks like this
a:1
a:2
a:3
b:1
b:2
b:2
and I would like it to take the a and b portions of the file and use them as the header, with the numbers listed below each one, like this.
a b
1 1
2 2
3 3
can this be done?
A CSV (Comma-Separated Values) file should have commas in it, so the output should have commas instead of space separators.
I recommend writing your code in two parts: The first part should read the input; the second should write out the output.
If your input looks like this:
a:1
a:2
a:3
b:1
b:2
b:2
c:7
you can read in the input like this:
#!/usr/bin/env python3
# Usage: python3 scripy.py < input.txt > output.csv
import sys
# Loop through all the input lines and put the values in
# a list according to their category:
categoryList = {} # key => category, value => list of values
for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    category, value = line.split(':')
    if category not in categoryList:
        categoryList[category] = []
    categoryList[category].append(value)
# print(categoryList) # debug line
# Debug line prints: {'a': ['1', '2', '3'], 'b': ['1', '2', '2']}
This will read in all your data into a categoryList dict. It's a dict that contains the categories (the letters) as keys, and contains lists (of numbers) as the values. Once you have all the data held in that dict, you can output it.
Outputting involves getting a list of categories (the letters, in your example case) so that they can be written out first as your header:
# Get the list of categories:
categories = sorted(categoryList.keys())
assert categories, 'No categories found!' # sanity check
From here, you can use Python's nice csv module to output the header and then the rest of the lines. When outputting the main data, we can use an outer loop to loop through the nth entries of each category, then we can use an inner loop to loop through every category:
import csv
csvWriter = csv.writer(sys.stdout)
# Output the categories as the CSV header:
csvWriter.writerow(categories)
# Now output the values we just gathered as
# Comma Separated Values:
i = 0 # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    csvWriter.writerow(values)
    i += 1  # increment index for the next time through the loop
If you don't want to use Python's csv module, you will still need to figure out how to group the entries in the category together. And if all you have is simple output (where none of the entries contain quotes, newlines, or commas), you can get away with manually writing out the output.
You could use something like this to output your values:
# Output the categories as the CSV header:
print(','.join(categories))
# Now output the values we just gathered as
# Comma Separated Values:
i = 0 # the index into an individual category list
while True:
    values = []
    for category in categories:
        try:
            values.append(categoryList[category][i])
        except IndexError:
            values.append('')  # no value, so use an empty string
    if len(''.join(values)) == 0:
        break  # we've run out of categories that contain input
    print(','.join(values))
    i += 1  # increment index for the next time through the loop
This will print out:
a,b,c
1,1,7
2,2,
3,2,
It does this by looping through all the list entries (the outer loop), and then looping through all the categories (the inner loop), and then printing out the values joined together by commas.
If you don't want the commas in your output, then you're technically not looking for CSV (Comma Separated Value) output. Still, in that case, it should be easy to modify the code to get what you want.
But if you have more complicated output (that is, values that have quotes, commas, and newlines in it) you should strongly consider using the csv module to output your data. Otherwise, you'll spend lots of time trying to fix obscure bugs with odd input that the csv module already handles.
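As a side note, the row-transposing loop can also be written more compactly with itertools.zip_longest, reusing the categoryList and categories built above (just a sketch; it produces the same output for this input):
import csv, sys
from itertools import zip_longest
columns = [categoryList[c] for c in categories]
csvWriter = csv.writer(sys.stdout)
csvWriter.writerow(categories)                            # header row
csvWriter.writerows(zip_longest(*columns, fillvalue=''))  # pad short columns with ''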

How do I avoid errors when parsing a .csv file in python?

I'm trying to parse a .csv file that contains two columns: Ticker (the company ticker name) and Earnings (the corresponding company's earnings). When I read the file using the following code:
f = open('earnings.csv', 'r')
earnings = f.read()
The result when I run print earnings looks like this (it's a single string):
Ticker;Earnings
AAPL;52131400000
TSLA;-911214000
AMZN;583841600
I use the following code to split the string on the line break character (\n), followed by splitting each resulting line on the semicolon character:
earnings_list = earnings.split('\n')
string_earnings = []
for string in earnings_list:
    colon_list = string.split(';')
    string_earnings.append(colon_list)
The result is a list of lists where each list contains the company's ticker at index [0] and its earnings at index [1], like so:
[['Ticker', 'Earnings\r\r'], ['AAPL', '52131400000\r\r'], ['TSLA', '-911214000\r\r'], ['AMZN', '583841600\r\r']]
Now, I want to convert the earnings at index[1] of each list -which are currently strings- intro integers. So I first remove the first list containing the column names:
headless_earnings = string_earnings[1:]
Afterwards I try to loop over the resulting list to convert the values at index[1] of each list into integers with the following:
numerical = []
for i in headless_earnings:
    num = int(i[1])
    numerical.append(num)
I get the following error:
num = int(i[1])
IndexError: list index out of range
How is that index out of range?
You are almost certainly mishandling the line endings.
If I try your code with this string: "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600" it works.
But with this one: "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n" it doesn't.
Explanation: split creates a last list item containing only an empty string, which becomes [''] after the split on ';'. So at the end, Python tries to access [''][1], hence the error.
So a very simple workaround would be to remove the last '\n' (if you're sure it's a '\n', otherwise you might have surprises).
You could write this:
earnings_list = earnings[:-1].split('\n')
this will fix your error.
If you want to be sure you remove a last '\n', you can write:
earnings_list = earnings[:-1].split('\n') if earnings[-1] == '\n' else earnings.split('\n')
EDIT: test code:
#!/usr/bin/env python2
earnings = "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n"
earnings_list = earnings[:-1].split('\n') if earnings[-1] == '\n' else earnings.split('\n')
string_earnings = []
for string in earnings_list:
    colon_list = string.split(';')
    string_earnings.append(colon_list)
headless_earnings = string_earnings[1:]
#print(headless_earnings)
numerical = []
for i in headless_earnings:
    num = int(i[1])
    numerical.append(num)
print numerical
Output:
nico#ometeotl:~/temp$ ./test_script2.py
[52131400000, -911214000, 583841600]
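As an alternative to slicing off the trailing character, a sketch that simply skips blank or whitespace-only lines would also avoid the IndexError:
earnings_list = [line for line in earnings.split('\n') if line.strip()]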
