I am planning to tokenize a column within a JSON file with NLTK. The code below reads the JSON file and slices it into different time intervals.
However, I am struggling to have the 'Main Text' column (within the JSON file) read and tokenized in the final part of the code below. Is there a smart tweak to make this happen?
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])

# Create an empty dictionary
d = dict()

# Filtering by date
start_date = pd.to_datetime('2009-01-01')
end_date = pd.to_datetime('2009-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)

while end_date <= last_end_date:
    filtered_dates = df[df.Date.between(start_date, end_date)]
    n = len(filtered_dates.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_dates)
    start_date += mnthBeg
    end_date += mnthEnd

# NLTK tokenizing
file_content = open('Main Text').read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
I have since solved the problem with the following code, which runs smoothly. Many thanks again for everyone's input.
for index, row in filtered_dates.iterrows():
    line = row['Text Main']
    tokens = nltk.word_tokenize(line)
    print(tokens)
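For larger slices, a vectorized alternative to iterrows() is to apply the tokenizer to the whole column at once. A minimal sketch, assuming the same filtered_dates frame and that the NLTK punkt tokenizer data has been downloaded:

# tokenize every row of the column in one pass
tokens_by_row = filtered_dates['Text Main'].apply(nltk.word_tokenize)
print(tokens_by_row.head())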
I have created a function to parse data from the CityBik API for 4 cities.
In the function I parse the data and add the details I require before creating a DataFrame with pandas.
When I run the function, it displays data from one city only. Does anyone know how I can fix this?
def parse_bike_data(city_name, fpath):
    fin = open(fpath, "r")
    json_data = fin.read()  # read in the JSON data
    data = json.loads(json_data)
    stations = data['network']['stations']
    rows = []
    # look at each observation
    for obs in stations:
        row = {"City": city_name}
        # parse the local datetime in ISO 8601 format
        obs_date = datetime.strptime(obs["timestamp"], "%Y-%m-%dT%H:%M:%S.%f%z")
        # strip the timezone
        obs_date = obs_date.replace(tzinfo=None)
        # round it to the nearest hour
        row["Timestamp"] = round_datetime(obs_date)
        # add in the relevant station data
        row['Station'] = obs["name"]
        row["Free Bikes"] = obs["free_bikes"]
        row["Empty Slots"] = obs["empty_slots"]
        row["Capacity"] = obs["extra"]["slots"]
        rows.append(row)
    fin.close()
    return pd.DataFrame(rows)  # return a data frame
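If the function works for a single city, the "one city only" symptom usually means each call's DataFrame is overwriting the previous one. A minimal sketch of one way to combine the four results; the city names and file paths here are placeholders for your own:

# hypothetical per-city input files -- substitute your own paths
cities = {
    "Dublin": "dublin.json",
    "London": "london.json",
    "Paris": "paris.json",
    "Berlin": "berlin.json",
}
frames = [parse_bike_data(name, path) for name, path in cities.items()]
df_all = pd.concat(frames, ignore_index=True)  # one frame, all four cities
print(df_all["City"].value_counts())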
I have a csv file containing sensor data where one row is of the following format
1616580317.0733, {'Roll': 0.563820598084682, 'Pitch': 0.29817540218781163, 'Yaw': 60.18415650363684, 'gyroX': 0.006687641609460116, 'gyroY': -0.012394784949719908, 'gyroZ': -0.0027120113372802734, 'accX': -0.12778355181217196, 'accY': 0.24647256731987, 'accZ': 9.763526916503906}
Here the first column is a timestamp and the remainder is a dictionary-like object containing various measured quantities.
I want to read this into a pandas DataFrame with the columns
["Timestamp","Roll","Pitch","Yaw","gyroX","gyroY","gyroZ","accX","accY","accZ"]. What would be an efficient way of doing this? The file is 600 MB, so it's not a trivial number of lines to parse.
I'm not sure where you are getting the seconds column from.
The code below parses each row into a timestamp and a dict, then adds the timestamp to the dictionary that will eventually become a row in the DataFrame.
import json
import pandas as pd

def read_file(filename):
    chunk_size = 20000
    entries = []
    chunks = []
    with open(filename, "r") as fh:
        for line in fh:
            # split on the first comma only: timestamp, then the dict part
            timestamp, data_dict = line.split(",", 1)
            # the dict uses single quotes, so swap them to get valid JSON
            data_dict = json.loads(data_dict.replace("'", '"'))
            data_dict["timestamp"] = float(timestamp)
            entries.append(data_dict)
            if len(entries) == chunk_size:
                chunks.append(pd.DataFrame(entries))
                entries = []
    if entries:
        chunks.append(pd.DataFrame(entries))
    # DataFrame.append was removed in pandas 2.0, so build chunk frames and concat once
    return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()

read_file("sample.txt")
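As a side note, the quote replacement will break if any value ever contains an apostrophe. ast.literal_eval from the standard library parses the single-quoted dicts directly; a minimal variant under the same assumptions:

import ast
import pandas as pd

def read_file_literal(filename):
    rows = []
    with open(filename) as fh:
        for line in fh:
            timestamp, rest = line.split(",", 1)
            row = ast.literal_eval(rest)  # parses the {'Roll': ...} dict as a Python literal
            row["Timestamp"] = float(timestamp)
            rows.append(row)
    return pd.DataFrame(rows)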
I think you should convert your csv file to JSON format and then look at this site on how to transform a dictionary into a pandas DataFrame: https://www.delftstack.com/fr/howto/python-pandas/how-to-convert-python-dictionary-to-pandas-dataframe/
For a current research project, I am trying to slice a JSON file into different time intervals. Based on the object "Date", I want to analyse the content of the JSON file by quarter, i.e. 01 January - 31 March, 01 April - 30 June, etc.
The code would ideally have to pick the oldest date in the file and add quarterly time intervals on top of that. I have done research on this point but not found any helpful methods yet.
Is there a smart way to include this in the code? The JSON file has the following structure:
[
{"No":"121","Stock Symbol":"A","Date":"05/11/2017","Text Main":"Sample text"}
]
And the existing relevant code excerpt looks like this:
import pandas as pd

file = pd.read_json(r'Glassdoor_A.json')
data = json.load(file)

# Create an empty dictionary
d = dict()

# processing:
for row in data:
    line = row['Text Main']
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into time intervals
    line.sort_values(by=['Date'])
    line.tshift(d, int = 90, freq=timedelta, axis='Date')
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])

# Count the total number of words
total = sum(d.values())
print(d[key], total)
Please find below the solution to the question. The data can be sliced with Pandas by setting a start and an end date and comparing the JSON "Date" object against them.
Important note: the data must be normalised, and the dates converted to a Pandas datetime format, before processing the information.
import string
import json
import csv
import pandas as pd
import datetime
import numpy as np
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])
# Create an empty dictionary
d = dict()
# Filtering by date
start_date = pd.to_datetime("2018-01-01")
end_date = pd.to_datetime("2018-03-31")
after_start_date = df["Date"] >= start_date
before_end_date = df["Date"] <= end_date
between_two_dates = after_start_date & before_end_date
filtered_dates = df.loc[between_two_dates]
print(filtered_dates)
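To iterate over every quarter automatically, starting from the quarter that contains the oldest date in the file, one option is pandas' Grouper. A minimal sketch, assuming df is the normalised frame from above:

# 'QS' buckets rows into calendar quarters, beginning with the oldest one present
for quarter_start, group in df.groupby(pd.Grouper(key='Date', freq='QS')):
    print(quarter_start.strftime('%Y-%m-%d'), len(group), 'rows')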
I have a text file with 6,000 records in this format:
{"id":"1001","user":"AB1001","first_name":"David ","name":"Shai","amount":"100","email":"me#no.mail","phone":"9999444"}
{"id":"1002","user":"AB1002","first_name":"jone ","name":"Miraai","amount":"500","email":"some1#no.mail","phone":"98894004"}
I want to export all the data to an Excel file, as in the example below.
I would recommend reading in the text file, converting each line to a dictionary with json, and using pandas to save a .csv file that can be opened with Excel.
In the example below, I copied your text into a file called "myfile.txt" and saved the result as "myfile2.csv".
import pandas as pd
import json

# read lines of the text file
with open('myfile.txt') as f:
    lines = f.readlines()

# remove empty lines
lines2 = [line for line in lines if not line == "\n"]

# convert each line to a dictionary
dicts = [json.loads(line) for line in lines2]

# save to .csv
pd.DataFrame(dicts).to_csv("myfile2.csv", index=False)
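If a native .xlsx file is preferred over a .csv, pandas can write one directly; this assumes an Excel writer engine such as openpyxl is installed:

# same frame, written as an Excel workbook instead of a .csv
pd.DataFrame(dicts).to_excel("myfile2.xlsx", index=False)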
You can use VBA and a JSON parser.
Your two lines are not valid JSON. However, it is easy to convert them to valid JSON, as shown in the code below. Then it is a relatively simple matter to parse the result and write it to a worksheet.
The code assumes no blank lines in your text file, but that is easy to fix if it is not the case. It also assumes your data is on two separate lines in a Windows text file (if not Windows, you may have to change the replacement of the newline token with a comma, depending on what the generating system uses for a newline).
I used the JSON Converter by Tim Hall
'Set reference to Microsoft Scripting Runtime or
' use late binding
Option Explicit

Sub parseData()
    Dim JSON As Object
    Dim strJSON As String
    Dim FSO As FileSystemObject, TS As TextStream
    Dim I As Long, J As Long
    Dim vRes As Variant, v As Variant, O As Object
    Dim wsRes As Worksheet, rRes As Range

    Set FSO = New FileSystemObject
    Set TS = FSO.OpenTextFile("D:\Users\Ron\Desktop\New Text Document.txt", ForReading, False, TristateUseDefault)

    'Convert to valid JSON
    strJSON = "[" & TS.ReadAll & "]"
    strJSON = Replace(strJSON, vbLf, ",")

    Set JSON = ParseJson(strJSON)

    ReDim vRes(0 To JSON.Count, 1 To JSON(1).Count)

    'Header row
    J = 0
    For Each v In JSON(1).Keys
        J = J + 1
        vRes(0, J) = v
    Next v

    'Populate the data
    I = 0
    For Each O In JSON
        I = I + 1
        J = 0
        For Each v In O.Keys
            J = J + 1
            vRes(I, J) = O(v)
        Next v
    Next O

    'Write to a worksheet
    Set wsRes = Worksheets("sheet6")
    Set rRes = wsRes.Cells(1, 1)
    Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))

    Application.ScreenUpdating = False
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .Style = "Output"
        .EntireColumn.AutoFit
    End With
End Sub
Results from your posted data
Try using the pandas module in conjunction with the eval() function:
import pandas as pd

with open('textfile.txt', 'r') as f:
    data = f.readlines()

df = pd.DataFrame(data=[eval(i) for i in data])
df.to_excel('filename.xlsx', index=False)
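A word of caution not in the original answer: eval() executes whatever the line contains, so it is only safe on trusted input. ast.literal_eval from the standard library is a drop-in replacement for parsing literal dicts like these:

import ast
df = pd.DataFrame(data=[ast.literal_eval(i) for i in data])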
I'm trying to write a DataFrame to a .csv using df.to_csv(). For some reason, it's only writing the last value (the data for the last ticker). It reads through a list of tickers (from turtle.csv, where all tickers are in the first column) and produces price data for each ticker. I can print all the data without a problem but can't seem to write it to the .csv. Any idea why? Thanks
input_file = pd.read_csv("turtle.csv", header=None)

for ticker in input_file.iloc[:, 0].tolist():
    data = web.DataReader(ticker, "yahoo", datetime(2011, 6, 1), datetime(2016, 5, 31))
    data['ymd'] = data.index
    year_month = data.index.to_period('M')
    data['year_month'] = year_month
    first_day_of_months = data.groupby(["year_month"])["ymd"].min()
    first_day_of_months = first_day_of_months.to_frame().reset_index(level=0)
    last_day_of_months = data.groupby(["year_month"])["ymd"].max()
    last_day_of_months = last_day_of_months.to_frame().reset_index(level=0)
    fday_open = data.merge(first_day_of_months, on=['ymd'])
    fday_open = fday_open[['year_month_x', 'Open']]
    lday_open = data.merge(last_day_of_months, on=['ymd'])
    lday_open = lday_open[['year_month_x', 'Open']]
    fday_lday = fday_open.merge(lday_open, on=['year_month_x'])
    monthly_changes = {i: MonthlyChange(i) for i in range(1, 13)}
    for index, ym, openf, openl in fday_lday.itertuples():
        month = int(ym.strftime('%m'))
        diff = (openf - openl) / openf
        monthly_changes[month].add_change(diff)
    changes_df = pd.DataFrame([monthly_changes[i].get_data() for i in monthly_changes],
                              columns=["Month", "Avg Inc.", "Inc", "Avg.Dec", "Dec"])
    CSVdir = r"C:\Users\..."
    realCSVdir = os.path.realpath(CSVdir)
    if not os.path.exists(CSVdir):
        os.makedirs(CSVdir)
    new_file_name = os.path.join(realCSVdir, 'PriceData.csv')
    new_file = open(new_file_name, 'wb')  # 'wb' re-creates the file on every ticker
    new_file.write(ticker)
    changes_df.to_csv(new_file)
Use 'a' (append) instead of 'wb', because 'wb' overwrites the data on every iteration of the loop. For the different modes of opening a file, see the documentation for open().
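A minimal sketch of that fix, assuming the same per-ticker loop as above; appending to the one PriceData.csv stacks each ticker's block after the previous one:

# inside the for-ticker loop: 'a' appends instead of truncating, and the
# context manager closes the file after each write
with open(new_file_name, 'a') as new_file:
    new_file.write(ticker + '\n')  # label the block with its ticker
    changes_df.to_csv(new_file, index=False)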