Determine if a file is "more likely" json or csv - python

I have a few files with generic extensions, such as "txt", or no extension at all. I'm trying to determine quickly whether each file is JSON or CSV. I thought of using the magic module, but it doesn't work for what I'm trying to do. For example:
>>> import magic
>>> magic.from_file('my_json_file.txt')
'ASCII text, with very long lines, with no line terminators'
Is there a better way to determine if something is JSON or CSV? I'm unable to load the entire file, and I need to decide quickly. What would be a good solution here?

You can check if the file starts with either { or [ to determine if it's JSON, and you can load the first two lines with csv.reader and see if the two rows have the same number of columns to determine if it's CSV.
import csv

with open('file') as f:
    if f.read(1) in '{[':
        print('likely JSON')
    else:
        f.seek(0)
        reader = csv.reader(f)
        try:
            if len(next(reader)) == len(next(reader)) > 1:
                print('likely CSV')
        except StopIteration:
            pass
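Both checks can be folded into one helper; a sketch of that combination (the function name guess_format and the 'unknown' fallback are my own additions):

```python
import csv
import os
import tempfile

def guess_format(path):
    """Cheap heuristic: first character for JSON, consistent column counts for CSV."""
    with open(path) as f:
        if f.read(1) in ('{', '['):
            return 'likely JSON'
        f.seek(0)
        reader = csv.reader(f)
        try:
            # two rows with the same width (> 1 column) look like CSV
            if len(next(reader)) == len(next(reader)) > 1:
                return 'likely CSV'
        except StopIteration:
            pass
        return 'unknown'

# demo on two small temporary files
with tempfile.NamedTemporaryFile('w', delete=False) as j:
    j.write('{"test": 123}')
with tempfile.NamedTemporaryFile('w', delete=False) as c:
    c.write('a,b,c\n1,2,3\n')

jf, cf = guess_format(j.name), guess_format(c.name)
print(jf, cf)  # likely JSON likely CSV
os.remove(j.name)
os.remove(c.name)
```

Only the first byte and at most two rows are ever read, so this stays fast on large files.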

You can use the try/except technique: attempt to parse the data as JSON. Loading an invalidly formatted JSON string raises a ValueError, which you can catch and handle however you want:
>>> import json
>>> s1 = '{"test": 123, "a": [{"b": 32}]}'
>>> json.loads(s1)
If the string is valid JSON it parses without error; if not:
>>> import json
>>> s2 = '1;2;3;4'
>>> json.loads(s2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 2 - line 1 column 8 (char 1 - 7)
So you can build a function as follows:
import json

def check_format(filedata):
    try:
        json.loads(filedata)
        return 'JSON'
    except ValueError:
        return 'CSV'
>>> check_format('{"test": 123, "a": [{"b": 32}]}')
'JSON'
>>> check_format('1;2;3;4')
'CSV'

Related

How to get the last line from JSON file which contains only tuples in each line

Can anybody please tell me how to fetch the last line from a JSON file which contains tuples?
lst = (('7', '♠'), ('7', '♣')) # this line is generated each time and has new values
For instance, "deals.json" file contains such lines:
[["7", "\u2660"], ["7", "\u2663"]]
[["8", "\u2660"], ["8", "\u2663"]]
Code:
# here I add a new line to the file
with open('deals.json', 'a') as f:
    json.dump(lst, f)
    f.write('\n')

# here I want to read the last line and handle it
with open('deals.json') as f:
    json_data = json.load(f)
    print(json_data)
Running this code once works fine, but on the second run reading the file fails, for an obvious reason:
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 549)
I know how to do that using text file but have no idea how to read the last line in JSON which contains tuples.
Reading just the last line can be achieved like this. Internally the entire file will be read but you don't need to handle that explicitly. If the file was very large then you might need another approach.
import json

with open('deals.json') as f:
    j = json.loads(f.readlines()[-1])
print(j)
Output:
[['8', '♠'], ['8', '♣']]
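For very large files, the tail can be read without loading everything by seeking near the end first; a sketch of that idea (the 4096-byte window and the function name are my own choices, and lines are assumed shorter than the window):

```python
import json
import os
import tempfile

def last_json_line(path):
    """Read only the tail of the file and parse its final line as JSON."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # jump to at most 4096 bytes before the end, then split into lines
        f.seek(max(0, size - 4096))
        return json.loads(f.read().splitlines()[-1])

# demo with a small temporary file shaped like deals.json
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write('[["7", "\\u2660"], ["7", "\\u2663"]]\n')
    tmp.write('[["8", "\\u2660"], ["8", "\\u2663"]]\n')

last = last_json_line(tmp.name)
print(last)  # [['8', '♠'], ['8', '♣']]
os.remove(tmp.name)
```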

How to unpack double digit hex file with python msgPack?

I have a text file containing some data, among these data there's a JSON packed with msgPack.
I am able to unpack it on https://toolslick.com/conversion/data/messagepack-to-json but I can't manage to make it work in Python.
Up to now I am trying to do the following:
import msgpack

def parseAndSplit(path):
    with open(path) as f:
        fContent = f.read()
        for subf in fContent.split('Payload: '):
            '''for ssubf in subf.split('DataChunkMsg'):
                print(ssubf)'''
            return subf.split('DataChunkMsg')[0]

fpath = "path/to/file"
t = parseAndSplit(fpath)
l = t.split("-")
s = ""
for i in l:
    s = s + i
print(s)
a = msgpack.unpackb(bytes(s, "UTF-8"), raw=False)
print(a)
but the output is:
Traceback (most recent call last):
File "C:/Users/Marco/PycharmProjects/codeTest/msgPack.py", line 19, in <module>
a = msgpack.unpackb(bytes(s,"UTF-8"), raw=False)
File "msgpack\_unpacker.pyx", line 202, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
9392AA6E722D736230322D3032AC4F444D44617...(string goes on)
I am quite sure that it's an encoding problem of some sort, but I am having no luck, whether in the docs or by experimenting.
Thank you very much for your attention.
I found the solution in the end:
msgpack.unpackb(bytes.fromhex(hexstring))
where hexstring is the hex string read from the file (with the dashes removed).
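The crucial step is bytes.fromhex, which accepts only hex digits and whitespace, so the dashes must be stripped first, as the join loop in the question does. A stdlib-only sketch with a made-up hex string (msgpack itself omitted):

```python
# hypothetical dash-separated hex string, standing in for the real payload
hexstring = '48-65-6C-6C-6F'

# strip the separators, then decode pairs of hex digits into raw bytes
raw = bytes.fromhex(hexstring.replace('-', ''))
print(raw)  # b'Hello'

# `raw` is what would then be passed to msgpack.unpackb(raw)
```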

How to read a CSV column as a string in Python

I wrote code with pandas in order to pass in a CSV and retrieve a column, and then I have more code that is supposed to split the data using the re library, but it throws an error stating "TypeError: expected string or bytes-like object."
I believe I just need to convert the CSV into a string before running re on it, but I can't figure out how.
The column in the CSV has data which look like: 'HB1.A1D62no.0016, HB31.N33NO.89, HB 54 .N338'
import pandas as pd
data = pd.read_csv('HB_Lib.csv', delimiter = ',')
s = [data[['Call Number']]]
import re
pattern = r"(^[a-z]+)\s*(\d+(?:\.\d+)?)"
print(list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s])))
Traceback:
Traceback (most recent call last):
File "C:/Python/test2.py", line 8, in <module>
print(list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s])))
File "C:/Python/test2.py", line 8, in <listcomp>
print(list(map("".join, [re.findall(pattern, part, flags=re.I)[0] for part in s])))
File "C:\Python37\lib\re.py", line 223, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
data['Call Number'] = data['Call Number'].astype(str)
I think the first thing you should do is to remove the external square brackets when declaring s, so that you obtain something like:
a = data[['something']]
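Once the column values are plain strings, the pattern from the question can be exercised without pandas; a sketch using the sample values given in the question:

```python
import re

# pattern and sample call numbers taken from the question above
pattern = r"(^[a-z]+)\s*(\d+(?:\.\d+)?)"
call_numbers = ['HB1.A1D62no.0016', 'HB31.N33NO.89', 'HB 54 .N338']

# findall returns (letters, digits) tuples; join them back together
result = list(map("".join, [re.findall(pattern, part, flags=re.I)[0]
                            for part in call_numbers]))
print(result)  # ['HB1', 'HB31', 'HB54']
```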

How to apply regex sub to a csv file in python

I have a csv file I wish to apply a regex replacement to with python.
So far I have the following
reader = csv.reader(open('ffrk_inventory_relics.csv', 'r'))
writer = csv.writer(open('outfile.csv', 'w'))
for row in reader:
    reader = re.sub(r'\+', 'z', reader)
Which is giving me the following error:
Script error: Traceback (most recent call last):
File "ffrk_inventory_tracker_v1.6.py", line 22, in response
getRelics(data['equipments'], 'ffrk_inventory_relics')
File "ffrk_inventory_tracker_v1.6.py", line 72, in getRelics
reader = re.sub(r'\+','z',reader)
File "c:\users\baconcatbug\appdata\local\programs\python\python36\lib\re.py",
line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
After googling to not much luck, I would like to ask the community here how to open the csv file correctly so I can use re.sub on it and then write out the altered csv file back to the same filename.
csv.reader(open('ffrk_inventory_relics.csv', 'r')) is creating a list of lists, and when you iterate over it and pass each value to re.sub, you are passing a list, not a string. Try this:
import re
import csv
final_data = [[re.sub(r'\+', 'z', b) for b in i] for i in csv.reader(open('ffrk_inventory_relics.csv', 'r'))]
write = csv.writer(open('ffrk_inventory_relics.csv', 'w', newline=''))
write.writerows(final_data)
If you don't need csv you can use replace with regular open:
with open('ffrk_inventory_relics.csv', 'r') as reader, open('outfile.csv', 'w') as writer:
    for row in reader:
        writer.write(row.replace('+', 'z'))
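Both suggestions can also be combined into a csv-aware version that closes its files via context managers; a self-contained sketch with a made-up one-row input standing in for the real file:

```python
import csv
import os
import re
import tempfile

# hypothetical one-row sample standing in for ffrk_inventory_relics.csv
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'ffrk_inventory_relics.csv')
dst_path = os.path.join(tmpdir, 'outfile.csv')
with open(src_path, 'w', newline='') as f:
    f.write('Sword+1,Shield+2\n')

# csv-aware substitution: apply re.sub to each field, not to the reader object
with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([re.sub(r'\+', 'z', field) for field in row])

with open(dst_path) as f:
    out = f.read()
print(out)  # Swordz1,Shieldz2
```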

python prettytable module raise Could not determine delimiter error for valid csv file

I'm trying to use prettytable module to print out data from csv file. But it failed with the following exception
Could not determine delimiter error for valid csv file
>>> import prettytable
>>> with file("/tmp/test.csv") as f:
... prettytable.from_csv(f)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "build/bdist.linux-x86_64/egg/prettytable.py", line 1337, in from_csv
File "/usr/lib/python2.7/csv.py", line 188, in sniff
raise Error, "Could not determine delimiter"
_csv.Error: Could not determine delimiter
The CSV file:
input_gps,1424185824460,1424185902788,1424185939525,1424186019313,1424186058952,1424186133797,1424186168766,1424186170214,1424186246354,1424186298434,1424186376789,1424186413625,1424186491453,1424186606143,1424186719394,1424186756366,1424186835829,1424186948532,1424187107293,1424187215557,1424187250693,1424187323097,1424187358989,1424187465475,1424187475824,1424187476738,1424187548602,1424187549228,1424187550690,1424187582866,1424187584248,1424187639923,1424187641623,1424187774567,1424187776418,1424187810376,1424187820238,1424187820998,1424187916896,1424187917472,1424187919241,1424188048340,dummy-0,dummy-1,Total
-73.958315%2C 40.815569,0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),13.0 (42%)
-76.932984%2C 38.992186,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),1.0(100%),0.0(nan%),1.0(100%),0.0(nan%),1.0(100%),0.0(nan%),0.0(nan%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),0.0(0%),17.0 (55%)
null_input-0,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0 (0%)
null_input-1,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),1.0(100%),1.0 (3%)
Total,0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),0.0(0%),1.0(3%),0.0(0%),0.0(0%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),31.0(100%)
If anyone can inform me how to work around the problem, or suggest alternatives, it would be very helpful.
According to PyPI, prettytable is only at alpha level, and I could not find a way to pass csv configuration through it. In that case, you should read the csv file yourself, explicitly declaring the delimiter, and build the PrettyTable row by row:
import csv
from prettytable import PrettyTable

pt = None  # to avoid it vanishing at the end of the block...
with open('/tmp/test.csv') as fd:
    rd = csv.reader(fd, delimiter=',')
    pt = PrettyTable(next(rd))
    for row in rd:
        pt.add_row(row)
I got the same error while working on some messier csv files, and ended up implementing a fallback method with a manual search:
import csv
from string import punctuation, whitespace
from collections import Counter

def get_delimiter(contents: str) -> str:
    # contents = f.read()
    try:
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(contents)
        return dialect.delimiter
    except csv.Error:
        return fallback_delimiter_search(contents)

def fallback_delimiter_search(contents: str) -> str:
    # eliminate space in case of a lot of text
    content_chars = list(filter(lambda x: (x in punctuation or x in whitespace) and x != ' ', contents))
    counts = Counter(content_chars)
    tgt_delimiter = counts.most_common(1)[0][0]
    return tgt_delimiter
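The same fallback idea in a condensed, self-contained form (the function name sniff_delimiter is mine):

```python
import csv
from collections import Counter
from string import punctuation, whitespace

def sniff_delimiter(contents: str) -> str:
    """Try csv.Sniffer first; fall back to counting punctuation/whitespace chars."""
    try:
        return csv.Sniffer().sniff(contents).delimiter
    except csv.Error:
        # most frequent non-space punctuation or whitespace character wins
        chars = [c for c in contents
                 if (c in punctuation or c in whitespace) and c != ' ']
        return Counter(chars).most_common(1)[0][0]

d = sniff_delimiter('a;b;c\nd;e;f\n')
print(d)  # ;
```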
