How to read a CSV column as a string in Python - python

I wrote code with pandas in order to pass in a CSV and retrieve a column, and then I have more code that is supposed to split the data using the re library, but it throws an error stating "TypeError: expected string or bytes-like object."
I believe I just need to convert the CSV into a string before running re on it, but I can't figure out how.
The column in the CSV has data which look like: 'HB1.A1D62no.0016, HB31.N33NO.89, HB 54 .N338'
import pandas as pd
data = pd.read_csv('HB_Lib.csv', delimiter = ',')
s = [data[['Call Number']]]
import re
pattern = r"(^[a-z]+)\s*(\d+(?:\.\d+)?)"
print(list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s])))
Traceback:
Traceback (most recent call last):
File "C:/Python/test2.py", line 8, in <module>
print(list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s])))
File "C:/Python/test2.py", line 8, in <listcomp>
print(list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s])))
File "C:\Python37\lib\re.py", line 223, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

data['Call Number'] = data['Call Number'].astype(str)

I think the first thing you should do is to remove the external square brakets when declaring s.
So, obtaining something like:
a = data[['something']]

Related

Determine if a file is "more likely" json or csv

I have a few files with generalized extensions, such as "txt" or no extension at all. I'm trying to determine in a very quick manner whether the file is json or a csv. I thought of using the magic module, but it doesn't work for what I'm trying to do. For example:
>>> import magic
>>> magic.from_file('my_json_file.txt')
'ASCII text, with very long lines, with no line terminators'
Is there a better way to determine if something is json or csv? I'm unable to load the entire file, and I want to determine it in a very quick manner. What would be a good solution here?
You can check if the file starts with either { or [ to determine if it's JSON, and you can load the first two lines with csv.reader and see if the two rows have the same number of columns to determine if it's CSV.
import csv
with open('file') as f:
if f.read(1) in '{[':
print('likely JSON')
else:
f.seek(0)
reader = csv.reader(f)
try:
if len(next(reader)) == len(next(reader)) > 1:
print('likely CSV')
except StopIteration:
pass
You can use the try/catch "technique" trying to parse the data to JSON object. When loading an invalid formatted JSON from a string it raises a ValueError which you can catch and process however you want:
>>> import json
>>> s1 = '{"test": 123, "a": [{"b": 32}]}'
>>> json.loads(s1)
If valid, nothing happens, if not:
>>> import json
>>> s2 = '1;2;3;4'
>>> json.loads(s2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 2 - line 1 column 8 (char 1 - 7)
So you can build a function as follows:
import json
def check_format(filedata):
try:
json.loads(filedata)
return 'JSON'
except ValueError:
return 'CSV'
>>> check_format('{"test": 123, "a": [{"b": 32}]}')
'JSON'
>>> check_format('1;2;3;4')
'CSV'

Importing large set of numbers from a text file into python program in matrix form

I'm trying to import some pointcloud coordinates into python, the values are in a text file in this format
0.0054216 0.11349 0.040749
-0.0017447 0.11425 0.041273
-0.010661 0.11338 0.040916
0.026422 0.11499 0.032623
and so on.
Ive tried doing it by 2 methods
def getValue(filename):
try:
file = open(filename,'r')
except: IOError:
print ('problem with file'), filename
value = []
for line in file:
value.append(float(line))
return value
I called the above code in idle but there is an error that says the string cannot be converted to float.
import numpy as np
import matplotlib.pyplot as plt
data = np.genformtxt('co.txt', delimiter=',')
In this method when I call for data in idle it says data is not defined.
Below is the error message
>>> data[0:4]
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
data[0:4]
NameError: name 'data' is not defined
With the data you provided I would say you are trying to use:
np.loadtxt(<yourtxt.txt>, delimiter=" ")
In this case your delimiter should be blank space as can be seen in you data. This works perfectly for me.
Your problem is you are using the comma as delimiter.
float() converts one number string, not several (read its docs)
In [32]: line = '0.0054216 0.11349 0.040749 '
In [33]: float(line)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-0f93140abeab> in <module>()
----> 1 float(line)
ValueError: could not convert string to float: '0.0054216 0.11349 0.040749 '
Note that the error tells us which string is giving it problems.
It works if we split the line into substrings and convert those individually.
In [34]: [float(x) for x in line.split()]
Out[34]: [0.0054216, 0.11349, 0.040749]
Similarly, genfromtxt needs to split the lines into proper substrings. There aren't any , in your file, so clearly that's the wrong delimiter. The default delimiter is white space, which works just fine in this case.
In [35]: data = np.genfromtxt([line])
In [36]: data
Out[36]: array([0.0054216, 0.11349 , 0.040749 ])
With the wrong delimiter it tries to convert the whole line to a float. It can't (same reason as above), so it uses np.nan instead.
In [37]: data = np.genfromtxt([line], delimiter=',')
In [38]: data
Out[38]: array(nan)

How to apply regex sub to a csv file in python

I have a csv file I wish to apply a regex replacement to with python.
So far I have the following
reader = csv.reader(open('ffrk_inventory_relics.csv', 'r'))
writer = csv.writer(open('outfile.csv','w'))
for row in reader:
reader = re.sub(r'\+','z',reader)
Which is giving me the following error:
Script error: Traceback (most recent call last):
File "ffrk_inventory_tracker_v1.6.py", line 22, in response
getRelics(data['equipments'], 'ffrk_inventory_relics')
File "ffrk_inventory_tracker_v1.6.py", line 72, in getRelics
reader = re.sub(r'\+','z',reader)
File "c:\users\baconcatbug\appdata\local\programs\python\python36\lib\re.py",
line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
After googling to not much luck, I would like to ask the community here how to open the csv file correctly so I can use re.sub on it and then write out the altered csv file back to the same filename.
csv.reader(open('ffrk_inventory_relics.csv', 'r')) is creating a list of lists, and when you iterate over it and pass each value to re.sub, you are passing a list, not a string. Try this:
import re
import csv
final_data = [[re.sub('\+', 'z', b) for b in i] for i in csv.reader(open('ffrk_inventory_relics.csv', 'r'))]
write = csv.writer(open('ffrk_inventory_relics.csv'))
write.writerows(final_data)
If you don't need csv you can use replace with regular open:
with open('ffrk_inventory_relics.csv', 'r') as reader, open('outfile.csv','w') as writer:
for row in reader:
writer.write(row.replace('+','z'))

Sentiment analysis Python TypeError: expected string or bytes-like object

I am doing a sentiment analysis and I want to Add NOT to every word between negation and following punctuation. I am performing the following code:
import re
fin=open("aboveE1.txt",'r', encoding='UTF-8')
transformed = re.sub(r'\b(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)\b[\w\s]+[^\w\s]',
lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
fin,
flags=re.IGNORECASE)
Traceback (most recent call last):
line 14, in
flags=re.IGNORECASE)
line 182, in sub return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
I dont know how to fix the error. Can you help me?
re.sub takes in a string, not a file object. Documentation here.
import re
fin=open("aboveE1.txt",'r', encoding='UTF-8')
transformed = ''
for line in fin:
transformed += re.sub(r'\b(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)\b[\w\s]+[^\w\s]',
lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
line,
flags=re.IGNORECASE)
# No need to append '\n' to 'transformed'
# because the line returned via the iterator includes the '\n'
fin.close()
Also remember to always close the file you open.

python prettytable module raise Could not determine delimiter error for valid csv file

I'm trying to use prettytable module to print out data from csv file. But it failed with the following exception
Could not determine delimiter error for valid csv file
>>> import prettytable
>>> with file("/tmp/test.csv") as f:
... prettytable.from_csv(f)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "build/bdist.linux-x86_64/egg/prettytable.py", line 1337, in from_csv
File "/usr/lib/python2.7/csv.py", line 188, in sniff
raise Error, "Could not determine delimiter"
_csv.Error: Could not determine delimiter
The CSV file:
input_gps,1424185824460,1424185902788,1424185939525,1424186019313,1424186058952,1424186133797,1424186168766,1424186170214,1424186246354,1424186298434,1424186376789,1424186413625,1424186491453,1424186606143,1424186719394,1424186756366,1424186835829,1424186948532,1424187107293,1424187215557,1424187250693,1424187323097,1424187358989,1424187465475,1424187475824,1424187476738,1424187548602,1424187549228,1424187550690,1424187582866,1424187584248,1424187639923,1424187641623,1424187774567,1424187776418,1424187810376,1424187820238,1424187820998,1424187916896,1424187917472,1424187919241,1424188048340,dummy-0,dummy-1,Total
-73.958315%2C 40.815569,0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),13.0 (42%)
-76.932984%2C 38.992186,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),1.0(100%),0.0(nan%),1.0(100%),0.0(nan%),1.0(100%),0.0(nan%),0.0(nan%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),0.0(0%),17.0 (55%)
null_input-0,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0 (0%)
null_input-1,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),1.0(100%),1.0 (3%)
Total,0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),0.0(0%),1.0(3%),0.0(0%),0.0(0%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),31.0(100%)
If you anyone can inform me how to workaround the problem or other alternative alternatives, it will be very helpful.
According to pypi, prettytable is only alpha level. I could not find where you could give it the configuration to pass to the csv module. So in that case, you probably should read the csv file by explicitely declaring the delimiter, and build the PrettyTable line by line
pt = None # to avoid it vanished at end of block...
with open('/tmp/test.csv') as fd:
rd = csv.reader(fd, delimiter = ',')
pt = PrettyTable(next(rd))
for row in rd:
pt.add_row(row)
Got the same, working on some messier csv's. Ended up implementing fallback method with manual search
from string import punctuation, whitespace
from collections import Counter
def get_delimiter(self, contents: str):
# contents = f.read()
try:
sniffer = csv.Sniffer()
dialect = sniffer.sniff(contents)
return dialect.delimiter
except Error:
return fallback_delimiter_search(contents)
def fallback_delimiter_search(contents: str) -> str:
# eliminate space in case of a lot of text
content_chars = list(filter(lambda x: (x in punctuation or x in whitespace) and x!=' ', contents))
counts = Counter(content_chars)
tgt_delimiter = counts.most_common(1)[0][0]
return tgt_delimiter

Categories