For a current research project, I am trying to slice a JSON file into different time intervals. Based on the "Date" object, I want to analyse the content of the JSON file by quarter, i.e. 01 January - 31 March, 01 April - 30 June, etc.
The code would ideally have to pick the oldest date in the file and add quarterly time intervals on top of that. I have done research on this point but not found any helpful methods yet.
Is there any smart way to include this in the code? The JSON file has the following structure:
[
{"No":"121","Stock Symbol":"A","Date":"05/11/2017","Text Main":"Sample text"}
]
And the existing relevant code excerpt looks like this:
import pandas as pd
file = pd.read_json(r'Glassdoor_A.json')
data = json.load(file)

# Create an empty dictionary
d = dict()

# Processing:
for row in data:
    line = row['Text Main']
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into time intervals
    line.sort_values(by=['Date'])
    line.tshift(d, int=90, freq=timedelta, axis='Date')
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])

# Count the total number of words
total = sum(d.values())
print(d[key], total)
Please find below the solution to the question. The data can be sliced with pandas by allocating a start and an end date and comparing the JSON "Date" object with these dates.
Important note: the data must be normalised and the dates converted into a pandas datetime format before processing the information.
import string
import json
import csv
import pandas as pd
import datetime
import numpy as np
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # day-first dates, consistent with the 31/03/2018 end date below
# Create an empty dictionary
d = dict()
# Filtering by date
start_date = "01/01/2018"
end_date = "31/03/2018"
after_start_date = df["Date"] >= start_date
before_end_date = df["Date"] <= end_date
between_two_dates = after_start_date & before_end_date
filtered_dates = df.loc[between_two_dates]
print(filtered_dates)
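If the quarters should instead be generated automatically from the oldest date in the file rather than hard-coded, pandas' time grouper can produce them. A minimal sketch, assuming df has been normalised as above ('Text Main' is the column name from the sample JSON):
# Group rows into calendar quarters (01 Jan - 31 Mar, 01 Apr - 30 Jun, ...),
# starting from the quarter that contains the oldest date in the file.
for quarter_start, chunk in df.groupby(pd.Grouper(key='Date', freq='QS')):
    print(quarter_start.date(), len(chunk), 'rows')
    # chunk is an ordinary DataFrame; run the word counting on chunk['Text Main']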
I have a csv file containing sensor data where one row is of the following format
1616580317.0733, {'Roll': 0.563820598084682, 'Pitch': 0.29817540218781163, 'Yaw': 60.18415650363684, 'gyroX': 0.006687641609460116, 'gyroY': -0.012394784949719908, 'gyroZ': -0.0027120113372802734, 'accX': -0.12778355181217196, 'accY': 0.24647256731987, 'accZ': 9.763526916503906}
Where the first column is a timestamp and the remainder is a dictionary like object containing various measured quantities.
I want to read this into a pandas DataFrame with the columns
["Timestamp","Roll","Pitch","Yaw","gyroX","gyroY","gyroZ","accX","accY","accZ"]. What would be an efficient way of doing this? The file is 600MB so it's not a trivial number of lines which need to be parsed.
I'm not sure where you are getting the seconds column from.
The code below parses each row into a timestamp and dict. Then adds the timestamp to the dictionary that will eventually become a row in the dataframe.
import json
import pandas as pd

def read_file(filename):
    chunk_size = 20000
    entries = []
    frames = []
    with open(filename, "r") as fh:
        for line in fh:
            # Split on the first comma only: the timestamp, then the dict-like remainder
            timestamp, data_dict = line.split(",", 1)
            # The dict uses single quotes, so swap them in before json.loads
            data_dict = json.loads(data_dict.replace("'", '"'))
            data_dict["timestamp"] = float(timestamp)
            entries.append(data_dict)
            if len(entries) == chunk_size:
                # Build a DataFrame per chunk; growing one DataFrame row by row is quadratic
                frames.append(pd.DataFrame(entries))
                entries = []
    if entries:
        frames.append(pd.DataFrame(entries))
    return pd.concat(frames, ignore_index=True)

read_file("sample.txt")
I think you should convert your csv file to JSON format and then look at this site on how to transform a dictionary into a pandas dataframe: https://www.delftstack.com/fr/howto/python-pandas/how-to-convert-python-dictionary-to-pandas-dataframe/
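For reference, the dictionary-per-line format above can also be parsed without an intermediate JSON file: ast.literal_eval handles the single-quoted dict syntax directly. A minimal sketch, assuming the layout shown in the question ("sample.txt" and the "Timestamp" column name are placeholders):
import ast
import pandas as pd

rows = []
with open("sample.txt") as fh:
    for line in fh:
        # Split once on the first comma: the timestamp, then the dict literal
        ts, rest = line.split(",", 1)
        record = ast.literal_eval(rest.strip())  # parses the single-quoted dict safely
        record["Timestamp"] = float(ts)
        rows.append(record)
df = pd.DataFrame(rows)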
I am planning to tokenize a column within a JSON file with NLTK. The code below reads and slices the JSON file into different time intervals.
I am however struggling to have the 'Text Main' column (within the JSON file) read/tokenized in the final part of the code below. Is there any smart tweak to make this happen?
import json
import nltk
import pandas as pd

# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])

# Create an empty dictionary
d = dict()

# Filtering by date
start_date = pd.to_datetime('2009-01-01')
end_date = pd.to_datetime('2009-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)
while end_date <= last_end_date:
    filtered_dates = df[df.Date.between(start_date, end_date)]
    n = len(filtered_dates.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_dates)
    start_date += mnthBeg
    end_date += mnthEnd

# NLTK tokenizing
file_content = open('Main Text').read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
I have solved it with the following code, which runs smoothly. Many thanks again for everyone's input.
for index, row in filtered_dates.iterrows():
    line = row['Text Main']
    tokens = nltk.word_tokenize(line)
    print(tokens)
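To run it for every quarter, this loop belongs inside the if n > 0: branch of the date loop from the question, replacing the open('Main Text') attempt. A sketch of the combined loop, reusing df, start_date, end_date, mnthBeg and mnthEnd from above and assuming the tokens are only printed:
while end_date <= last_end_date:
    filtered_dates = df[df.Date.between(start_date, end_date)]
    if len(filtered_dates.index) > 0:
        # Tokenize each review in the current quarter
        for index, row in filtered_dates.iterrows():
            tokens = nltk.word_tokenize(row['Text Main'])
            print(tokens)
    start_date += mnthBeg
    end_date += mnthEnd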
I have two large csv files that exceed my memory capabilities requiring me to chunk the files or read by line.
In column 1 of each file is a datetime string timestamp, sorted from oldest to most recent.
Iterating through each line of the first csv file, what is the most efficient way to retrieve data from the row of the second csv file whose timestamp is the most recent time but older than the timestamp of the row in the first csv file?
import pandas as pd
import datetime
from time import time
import itertools
import re

x = 0
with open(r'C:/Users/Administrator/Desktop/TSLA_T.csv') as f_trades:
    for line_trades in itertools.islice(f_trades, 1, None):  # Skip dataframe headers
        trade = re.split(',|\n', line_trades)  # Convert line into list
        trade_time = datetime.datetime.strptime(trade[1][:-3], "%H:%M:%S.%f")  # Convert string to datetime
        print("Time to Find: " + str(trade_time))
        with open(r'C:/Users/Administrator/Desktop/TSLA_Q.csv') as f_quotes:
            for line_quotes in itertools.islice(f_quotes, 1, None):  # Skip dataframe headers
                quote = re.split(',|\n', line_quotes)  # Convert line into list
                quote_time = datetime.datetime.strptime(quote[1][:-3], "%H:%M:%S.%f")  # Convert string to datetime
                #print(quote_time)
                if x == 0:
                    x = 1
                    previous_quote_time = quote_time  # Store previous time initially
                elif quote_time > trade_time:
                    print("Timestamp Located: " + str(previous_quote_time))
                    #print(quote_time)
                    previous_quote_time = quote_time  # Update to new time
                    break
                else:
                    previous_quote_time = quote_time  # Update to new time
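Because both files are already sorted by timestamp, there is no need to rescan the quote file for every trade. A single forward pass with one persistent reader per file does the job (if both files fit in memory, pandas.merge_asof with direction='backward' is the one-liner equivalent). A minimal sketch, assuming the same column layout and timestamp format as the code above:
import csv
from datetime import datetime

def parse_ts(value):
    # Same format as the question's strptime call, with the last 3 digits dropped
    return datetime.strptime(value[:-3], "%H:%M:%S.%f")

with open('TSLA_T.csv') as f_trades, open('TSLA_Q.csv') as f_quotes:
    trades = csv.reader(f_trades)
    quotes = csv.reader(f_quotes)
    next(trades)  # skip headers
    next(quotes)
    prev_quote = None
    quote = next(quotes, None)
    for trade in trades:
        trade_time = parse_ts(trade[1])
        # Advance the quote reader only while its timestamp is still strictly
        # older than the current trade; no quote line is ever read twice.
        while quote is not None and parse_ts(quote[1]) < trade_time:
            prev_quote = quote
            quote = next(quotes, None)
        if prev_quote is not None:
            print("Trade:", trade_time.time(), "-> most recent older quote:", prev_quote[1])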
I have a program which takes in a JSON file, reads it line by line, aggregates the time into four bins depending on the time, and then outputs it to a file. However my file output contains extra characters due to concatenating a dictionary with a string.
For example this is how the output for one line looks:
dwQEZBFen2GdihLLfWeexA<bound method DataFrame.to_dict of Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Category
Afternoon 0 0 3 2 2 0 1
Evening 20 4 16 11 4 3 5
Night 16 1 19 5 2 5 3>
The memory address is being concatenated into the output file as well.
Here is the code used for creating this specific file:
import json
import ast
import pandas as pd
from datetime import datetime

def cleanStr4SQL(s):
    return s.replace("'", "`").replace("\n", " ")

def parseCheckinData():
    # Write code to parse yelp_checkin.JSON
    with open('yelp_checkin.JSON') as f:
        outfile = open('checkin.txt', 'w')
        line = f.readline()
        # print(line)
        count_line = 0
        while line:
            data = json.loads(line)
            # print(data)
            # jsontxt = cleanStr4SQL(str(data['time']))
            # Parse the json and convert to a dictionary object
            jsondict = ast.literal_eval(str(data))
            outfile.write(cleanStr4SQL(str(data['business_id'])))
            # Convert the "time" element in the dictionary to a pandas DataFrame
            df = pd.DataFrame(jsondict['time'])
            # Add a new column "Time" to the DataFrame and set the values after left padding the values in the index
            df['Time'] = df.index.str.rjust(5, '0')
            # Add a new column "Category" and set the values based on the time slot
            df['Category'] = df['Time'].apply(cat)
            # Create a pivot table based on the "Category" column
            pt = df.pivot_table(index='Category', aggfunc=sum, fill_value=0)
            # Convert the pivot table to a dictionary to get the json output you want
            jsonoutput = pt.to_dict
            # print(jsonoutput)
            outfile.write(str(jsonoutput))
            line = f.readline()
            count_line += 1
            print(count_line)
        outfile.close()
    f.close()

# Define a function to convert the time slots to the categories
def cat(time_slot):
    if '06:00' <= time_slot < '12:00':
        return 'Morning'
    elif '12:00' <= time_slot < '17:00':
        return 'Afternoon'
    elif '17:00' <= time_slot < '23:00':
        return 'Evening'
    else:
        return 'Night'
I was wondering if it was possible to remove the memory location from the output file in some way?
Any advice is appreciated and please let me know if you require any more information.
Thank you for reading
The way you're working with the JSON looks like streaming it, which is an unpleasant problem to deal with.
If you're not working with a terribly big JSON file, you're better off with
with open("input.json", "r") as input_json:
    json_data = json.load(input_json)
And then extract specific entries from json_data as you wish (just remember it is a dictionary), manipulate them and populate an output dict intended to be saved.
Also, in Python, if you're using the with open(...) syntax, you don't need to close the file afterwards.
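A minimal sketch of that pattern (the key names here are illustrative, not taken from the question's data):
import json

with open("input.json", "r") as input_json:
    json_data = json.load(input_json)  # a plain dict from here on

# Pull out and reshape whatever entries you need
output = {"business_id": json_data.get("business_id"), "time": json_data.get("time")}

with open("output.json", "w") as output_json:
    json.dump(output, output_json)  # both files close themselves when their blocks exit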
Problem 1: the missing parentheses after to_dict, which cause this "memory address" output.
Problem 2: to produce valid JSON, you will also need to wrap the output in an array.
Problem 3: converting JSON to/from a string is not safe with str or eval. Use json.loads() and json.dumps().
import json
...
line_chunks = []
while line:
    ...
    jsondict = data  # problem 3: data is already a dict from json.loads(line), no str/eval round trip
    ...
    jsonoutput = pt.to_dict()  # problem 1: call the method
    ...
    line_chunks.append({data['business_id']: jsonoutput})  # collect one entry per line (structure is illustrative)
    ...
outfile.write(json.dumps(line_chunks))  # problems 2 and 3: one valid JSON array
I am working on a tutorial which reads a csv file:
# Read the data and append SENTENCE_START and SENTENCE_END tokens
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))
but get the following error:
sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
IndexError: list index out of range
Can anyone help me with that? I am new to nltk.
That's not exactly a csv file, but you can read it as one.
With pandas:
import pandas as pd
df = pd.read_csv('reddit-comments-2015-08.csv', sep='\0')
[out]:
body
0 I joined a new league this year and they have ...
1 In your scenario, a person could just not run ...
2 They don't get paid for how much time you spen...
3 I dunno, back before the August update in an A...
4 No, but Toriyama sometimes would draw himself ...
Then to remove the starting and trailing spaces:
df['body'][:100].astype(str).apply(str.strip)
Next, you see that you have weird XML-escaped symbols (e.g. &gt;, &lt;, etc.) in the text, so before tokenization, you have to unescape them:
import pandas as pd
from nltk.tokenize.util import xml_unescape
df = pd.read_csv('reddit-comments-2015-08.csv', sep='\0')
df['body'].astype(str).apply(str.strip).apply(xml_unescape)
Now you can do the tokenization:
import pandas as pd
from nltk import word_tokenize
from nltk.tokenize.util import xml_unescape
df = pd.read_csv('reddit-comments-2015-08.csv', sep='\0')
df['body'].astype(str).apply(str.strip).apply(xml_unescape).apply(word_tokenize)
To add the START and END tokens, simply do:
df['tokens'] = df['body'].astype(str).apply(str.strip).apply(xml_unescape).apply(word_tokenize).apply(lambda toks: ['START'] + toks + ['END'])