Identifying partial character encoding/compression in text content - python

I have a CSV (extracted from BZ2) where only some values are encoded:
hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0
The |, 0 and 1 characters are definitely appearing as intended but the other values are clearly encoded. In fact, they look like text-compression replacements which could mean the CSV had its values compressed and then also compressed as a whole to BZ2.
I get the same results whether extracting the BZ2 with 7zip then opening the CSV in a text editor, or opening with Python bz2 module, or with Pandas and read_csv:
import bz2
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()
import pandas as pd
contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")
How can I identify which type of encoding to decode with?
Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main
Source file: test-balanced.csv.bz2
First 100 lines from extracted CSV: https://pastebin.com/mgW8hKdh
I asked the original authors of the CSV/dataset but they didn't respond, which is understandable.

From readme.txt:
File Guide:
raw/key.csv: column key for raw/sarc.csv
raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
*/comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
*/*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that
sequence, and sarcastic/non-sarcastic labels for those responses. The
format is post_id comment_id … comment_id|response_id … response_id|label … label where *_id is a key to */comments.json
and label 1 indicates the respective response_id maps to a
sarcastic response. Thus each row has three entries (comment
chain, responses, labels) delimited by '|', and each of these entries
has elements delimited by spaces. The first entry always contains a
post_id and 0 or more comment_ids. The second and third entries
have the same number of elements, with the first response_id
corresponding to the first label and so on.
Converting above to a Python code snippet:
import pandas as pd
import json
from pprint import pprint
file_csv = r"D:\bat\SO\71596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts','responses','labels'],
                       encoding='utf-8')
file_json = r"D:\bat\SO\71596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)
print(f'{chr(0x20)*30} First csv line decoded:')
# the first entry holds the post_id followed by 0 or more comment_ids
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])
# the second entry holds the response_ids
for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])
Note that files were (manually) downloaded from the pol directory for their acceptable size (pol: contains subset of main dataset corresponding to comments in /r/politics).
Result: D:\bat\SO\71596864.py
First csv line decoded:
post_id: hqa1x
{'author': 'joshlamb619',
'created_utc': 1307053256,
'date': '2011-06',
'downs': 359,
'score': 274,
'subreddit': 'politics',
'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
'candidates during recall elections.',
'ups': 633}
response_id: c1xiujs
{'author': 'Artisane',
'created_utc': 1307077221,
'date': '2011-06',
'downs': 0,
'score': -2,
'subreddit': 'politics',
'text': "And we're upset since the Democrats would *never* try something as "
'sneaky as this, right?',
'ups': -2}
response_id: c1xj4e2
{'author': 'stellarfury',
'created_utc': 1307080843,
'date': '2011-06',
'downs': 0,
'score': -2,
'subreddit': 'politics',
'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
"Picture this we were makin' up candidates Being huge election whores",
'ups': -2}
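The row format from the readme can also be parsed with the standard library alone. A minimal sketch, assuming each row follows the three-entry layout quoted above; parse_row is a hypothetical helper name, not part of the dataset's tooling:

```python
def parse_row(row):
    """Split one row into post id, comment chain, and labelled responses."""
    # Three entries delimited by '|': comment chain, responses, labels.
    chain, responses, labels = row.strip().split("|")
    # The first element of the chain is the post_id; the rest are comment_ids.
    post_id, *comment_ids = chain.split(" ")
    response_ids = responses.split(" ")
    label_values = [int(label) for label in labels.split(" ")]
    if len(response_ids) != len(label_values):
        raise ValueError("responses and labels must have the same length")
    # Pair each response_id with its label (1 = sarcastic).
    return post_id, comment_ids, dict(zip(response_ids, label_values))


post_id, comment_ids, labelled = parse_row("hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0")
print(post_id, comment_ids, labelled)
```

According to the readme the ids are keys into comments.json rather than encoded text, so no further decoding step should be needed.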

Related

ParserError: unable to convert txt file to df due to json format and delimiter being the same

I'm fairly new to dealing with .txt files that have a dictionary inside them. I'm trying to use pd.read_csv to create a dataframe in pandas, but I get the error Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I believe I found the root problem: the file is difficult to read because each row contains a dict whose key-value pairs are separated by commas, which are also the delimiter.
Data (store.txt)
id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}
How would I read this csv file and create new columns based on the key-values? In addition, what if there are more key-value pairs in other data, for example more than 11 keys within the dictionary?
Is there an efficient way to create a df from the example above?
My code when trying to read as csv:
df = pd.read_csv('store.txt', header=None)
I tried to import json and use a converter, but it did not work and converted all the commas to a |:
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")
In addition, I also tried to use:
import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")
I was also thinking of adding a quote at the beginning and at the end of the dictionary to make it a string, but thought it was too tedious to find the closing brackets.
I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than " (I used § in my example):
from io import StringIO
import re
import json
import pandas as pd
with open("store.txt", "r") as f:
    # wrap each row's {...} payload in § so read_csv treats it as one quoted field
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())
df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")
# expand the parsed report dicts into columns next to the plain columns
df_out = pd.concat([
    df[["id", "name", "storeid"]],
    pd.DataFrame(df["report"].apply(json.loads).values.tolist())
], axis=1)
print(df_out)
Note: the very last value in your csv isn't valid json: "code":001. It should either be "code":"001" or "code":1
Output:
id name storeid Source ... itemtrack info itemsboughtid stolenitem
0 11 JohnSmith 3221-123-555 online ... 110 {'haircolor': 'black', 'age': 53} [] [{'item': 'candy', 'code': 1}, {'item': 'candy...
1 35 BillyDan 3221-123-555 letter ... 110 {'haircolor': 'black', 'age': 21} [1, 42, 465, 5] [{'item': 'shoe', 'code': 2}]
2 64 NickWalker 3221-123-555 letter ... 110 {'haircolor': 'red', 'age': 22} [1, 2] [{'item': 'sweater', 'code': 11}, {'item': 'ma...
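An alternative that avoids the regex is to split each line on its first three commas only and hand the remainder to json.loads. A minimal sketch, assuming the first three fields never contain commas and the JSON object is always the last column; parse_store_lines is a hypothetical helper name and the sample rows are shortened versions of the data above with valid JSON:

```python
import json


def parse_store_lines(lines):
    """Parse 'id,name,storeid,report' rows where report is a raw JSON object."""
    rows = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if header is None:
            header = line.split(",")
            continue
        # Split on the first three commas only; the rest is the JSON text.
        *plain, report = line.split(",", 3)
        row = dict(zip(header[:3], plain))
        # Merge the top-level JSON keys into the row as extra columns.
        row.update(json.loads(report))
        rows.append(row)
    return rows


sample = [
    'id,name,storeid,report',
    '11,JohnSmith,3221-123-555,{"Source":"online","info":{"age":53},"itemsboughtid":[]}',
    '64,NickWalker,3221-123-555, {"Source":"letter","stolenitem":[{"item":"jack,jill","code":1}]}',
]
rows = parse_store_lines(sample)
print(rows[0]["Source"], rows[1]["stolenitem"][0]["item"])
```

The resulting list of dicts can be passed to pd.DataFrame(rows); keys missing from some rows should then show up as NaN.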

File containing dictionaries that are <class 'str'> and no commas separating the dictionaries, need to load into pandas to create csv file easily

I've created a generator object and want to write its output to a CSV file so I can upload it to an external tool. At the moment the generator returns records as separate dictionaries, but there are no commas separating the records/dictionaries, and when I write them out to a txt file and reload it into the script, I get a <class 'str'>.
The generator object matches yields records like:
{'type_of_reference': 'JOUR', 'title': 'Ranking evidence in substance use and addiction', 'secondary_title': 'International Journal of Drug Policy', 'alternate_title1': 'Int. J. Drug Policy', 'volume': '83', 'year': '2020', 'doi': '10.1016/j.drugpo.2020.102840'}
{'type_of_reference': 'JOUR', 'title': 'Methods used in the selection of instruments for outcomes included in core outcome sets have improved since the publication of the COSMIN/COMET guideline', 'secondary_title': 'Journal of Clinical Epidemiology', 'alternate_title1': 'J. Clin. Epidemiol.', 'volume': '125', 'start_page': '64', 'end_page': '75', 'year': '2020', 'doi': '10.1016/j.jclinepi.2020.05.021',}
These are the result of the following generator function, which compares each record's "doi" key against a set of DOIs from another file.
def match_record():
    with open(filename_ris) as f:
        ris_records = readris(f)
        for entry in ris_records:
            if entry['doi'] in doi_match:
                yield entry
I've written the contents of the generator matches to a txt file with the following code, to check that the correct records have been kept.
with open('output.txt', 'w') as f:
    for x in matches:
        f.write(str(x))
What I have is neither a list of dictionaries nor dictionaries separated by commas, so I'm a bit confused about how to read/load it into pandas effectively. I want to load it into pandas to drop certain series (keys) and then write it out as a csv.
I'm reading it in using pd.read_csv, which just returns the key: value pairs of all the records as column headers. That is no surprise, but I don't know what to do before this step.
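One way to sidestep the text file entirely is to pass the dictionaries straight to a CSV writer. A minimal sketch using only the standard library, assuming the records are plain dicts whose keys may differ from record to record; write_records and sample_records are hypothetical stand-ins, the latter for match_record() and its output:

```python
import csv
import io


def write_records(records, out, drop=()):
    """Write an iterable of dicts as CSV, skipping the keys listed in drop."""
    records = list(records)  # a generator can only be consumed once
    # Collect the union of keys in first-seen order, minus the dropped ones.
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in drop and key not in fieldnames:
                fieldnames.append(key)
    # restval fills missing keys; extrasaction="ignore" skips dropped keys.
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


def sample_records():
    # Made-up records for illustration only.
    yield {"title": "A", "year": "2020", "doi": "10.1/a"}
    yield {"title": "B", "year": "2020", "start_page": "64", "doi": "10.1/b"}


buffer = io.StringIO()
columns = write_records(sample_records(), buffer, drop={"doi"})
print(buffer.getvalue())
```

To write a real file, an open('output.csv', 'w', newline='') handle can replace the StringIO buffer. With pandas, pd.DataFrame(list(matches)) followed by drop(columns=...) and to_csv(...) should give a similar result.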

How to convert json data from single key to csv

I'm trying to convert some data from a JSON file to csv. The data from the JSON file that I need exists in a single key.
I have separated the data from that key using the code below. This gives me the data in the following format:
[['/s/case/50034000013ZPEoAAO$##$00192169', 'Unable to add authentication', 'Anypoint Studio', 'Other', '7.1.3', '/s/contact/00334000023cIUYAA2$##$Paul S', '05-31-2018 22:07', '09-27-2018 05:46', 'S4'], ['/s/case/50034000014dk7mAAA$##$00195409', 'Connect Virtual Private Circuit - VPC-Pre-Prod 198.18.12.0/23', 'Anypoint Platform', 'CloudHub', '', '/s/contact/00334000023ZzOSAA0$##$James G', '07-16-2018 15:59', '07-22-2018 14:42', 'S4']
I want to separate the data so that everything contained in a square bracket is returned as a single row in my CSV file (the data is a lot longer than above, many more square brackets).
import json
with open('sample_response.txt') as f:
    json_data = json.load(f)
for x in json_data['actions']:
    data = x['returnValue']
You need writerows(data) to save it:
import csv
data = [
    ['/s/case/50034000013ZPEoAAO$##$00192169', 'Unable to add authentication', 'Anypoint Studio', 'Other', '7.1.3', '/s/contact/00334000023cIUYAA2$##$Paul S', '05-31-2018 22:07', '09-27-2018 05:46', 'S4'],
    ['/s/case/50034000014dk7mAAA$##$00195409', 'Connect Virtual Private Circuit - VPC-Pre-Prod 198.18.12.0/23', 'Anypoint Platform', 'CloudHub', '', '/s/contact/00334000023ZzOSAA0$##$James G', '07-16-2018 15:59', '07-22-2018 14:42', 'S4']
    # more rows
]
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('test.csv', 'w', newline='') as fh:
    csvwriter = csv.writer(fh)
    # each inner list becomes one row
    csvwriter.writerows(data)

Import raw data from web page as a dataframe

I'm trying to import some data from a webpage into a dataframe.
Data: a block of text in the following format
[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}, ... ]
I am successfully making the request to the server and can return the data in text form, but cannot seem to convert this to a DataFrame.
E.g
r = requests.get(url)
people = r.text
print(people)
So from this point, I am a bit confused on how to structure this text as a DataFrame. Most tutorials online seem to demonstrate importing csv, excel or html etc.
If people is a list of dicts in string format, you can use json.loads to convert it to a list of dicts and then create a DataFrame easily:
>>> import json
>>> import pandas as pd
>>> people='[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}]'
>>> json.loads(people)
[{'ID': 0, 'Name': 'John', 'Location': 'Chicago', 'Created': '2017-04-23'}]
>>>
>>> data=json.loads(people)
>>> pd.DataFrame(data)
Created ID Location Name
0 2017-04-23 0 Chicago John

CSV Manipulation in python

I want to be able to change CSV data the way we can change JSON in JavaScript, with just code and object manipulation, like this:
var obj = JSON.parse(jsonStr);
obj.name = 'foo bar';
var modifiedJSON = JSON.stringify(obj)
How can I do something like this, but for CSV files and in Python?
Something like -
csvObject = parseCSV(csvStr)
csvObject.age = 10
csvObject.name = csvObject.firstName + csvObject.lastName
csvStr = toCSV(csvObject)
I have a csv file, customers.csv.
ID,Name,Item,Date are the columns. An example of the csv file:
ID,LastName,FirstName,Item,Date
11231249015,Derik,Smith,Televisionx1,1391212800000
24156246254,Doe,John,FooBar,1438732800000
I know very well that the python csv library can handle it, but can the file be treated as an object as a whole and then manipulated?
I basically want to combine the first name and last name and perform some math with the IDs, but in the way JavaScript handles JSON.
Not sure but maybe you want to use https://github.com/samarjeet27/CSV-Mapper
Install using pip install csvmapper
import csvmapper
# create parser instance
parser = csvmapper.CSVParser('customers.csv', hasHeader=True)
# create object
customers = parser.buildDict() # buildObject() if you want object
# perform manipulation
for customer in customers:
    customer['Name'] = customer['FirstName'] + ' ' + customer['LastName']
    # remove last name and firstname
    # maybe this was what you wanted ?
    customer.pop('LastName', None)
    customer.pop('FirstName', None)
print(customers)
Output
[{'Name': 'Smith Derik', 'Item': 'Televisionx1', 'Date': '1391212800000', 'ID': '11231249015'}, {'Name': 'John Doe', 'Item': 'FooBar', 'Date': '1438732800000', 'ID': '24156246254'}]
This combines FirstName and LastName by accessing each row as a dict. I assumed you would want to remove the last name and first name and replace them with a single 'Name' property. You can use parser.buildObject() if you want to access the fields as you would in JavaScript.
Edit
You can save it back to CSV too.
writer = csvmapper.CSVWriter(customers) # modified customers from the above code
writer.write('customers-final.csv')
And regarding being able to perform math, you could use a custom mapper file like
mapper = csvmapper.DictMapper(x = [
    [
        { 'name':'ID' ,'type':'long'},
        { 'name':'LastName' },
        { 'name':'FirstName' },
        { 'name':'Item' },
        { 'name':'Date', 'type':'int' }
    ]
])
parser = csvmapper.CSVParser('customers.csv', mapper)
And specify the type(s)
JSON can, by design, represent various kinds of data in various kinds of arrangements (objects, arrays...) and you can nest these if you wish. This means that it's relatively easy to serialise and deserialise complex objects.
On the other hand, CSV is just rows and columns of data. No structured objects, arrays, nesting, etc. So you basically have to know ahead of time what you're dealing with, and then manually map these to corresponding objects.
That said, Python's csv module does have dict reader functionality, which lets you read a CSV file as a sequence of Python dictionaries, one per row. It automatically maps the first (header) row to field names, but you can also pass the field names in to the constructor. You can therefore reference a value in a row by its column header / field name. It also has a corresponding dict writer class. If you don't need any fancy nesting or complex data structures, then these may be all you really need.
This example is directly from the python module documentation:
import csv
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
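Applied to the customers.csv layout from the question, the dict reader and dict writer can be combined into a read, modify, write round trip. A minimal sketch, assuming the five-column header shown in the question; the in-memory StringIO objects stand in for real files, and the ID arithmetic is only a placeholder for whatever math is needed:

```python
import csv
import io

# In-memory stand-in for customers.csv, using the rows from the question.
source = io.StringIO(
    "ID,LastName,FirstName,Item,Date\n"
    "11231249015,Derik,Smith,Televisionx1,1391212800000\n"
    "24156246254,Doe,John,FooBar,1438732800000\n"
)

rows = []
for row in csv.DictReader(source):
    # Combine the two name columns into one and remove the originals.
    row["Name"] = f"{row.pop('FirstName')} {row.pop('LastName')}"
    # Values are read as strings, so convert before doing arithmetic.
    row["ID"] = int(row["ID"]) + 1  # placeholder arithmetic
    rows.append(row)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ID", "Name", "Item", "Date"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Each row behaves like the parsed object in the JavaScript example: fields are read and assigned by name, and the writer serialises the rows back to CSV text.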
