Producing separate text files from a large JSON file - python

I'm trying to use the code below to produce a set of .txt files from a large .json file that has one JSON object per line, each containing a date and a string of text. I want the date to be the filename.
When I open the .json file (in the Sublime text editor), it shows 2272 lines, so I assume the code should produce that many text files. However, it is only producing half as many. Can anybody tell me why, and what I should do to correct this?
import json

#with open('results.json') as json_file:
data = [json.loads(line) for line in open('results.json', 'r')]
for p in data:
    date = p["date"]
    filename = date.replace(" ", "_").replace(":", "_")
    print(filename)
    text = p["text"]
    with open('Articles2/' + filename + '.txt', 'w') as f:
        f.write(text + '\n')
Thanks for any help!

You have duplicate dates in your sample data, so each iteration of your for loop either creates a new file or overwrites an existing one whose date is exactly the same.
For example: 2018-11-17 17:11:48 appears in 3 entries about 13 lines down in your data. Those will only create 1 file, because your script derives the filename from the date alone.
You need to add some other unique value to the date so that open() doesn't overwrite a file that already exists, e.g. add milliseconds to the date, concatenate values from the "text" property of your JSON object, add a counter, etc.
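The counter approach can be sketched as follows (the helper name `unique_filename` is mine, not from the original script):

```python
from collections import Counter

def unique_filename(date, seen):
    """Turn a date into a safe filename; repeats get a _2, _3, ... suffix."""
    name = date.replace(" ", "_").replace(":", "_")
    seen[name] += 1
    return name if seen[name] == 1 else '%s_%d' % (name, seen[name])

seen = Counter()
names = [unique_filename(d, seen) for d in
         ["2018-11-17 17:11:48", "2018-11-17 17:11:48", "2018-11-18 09:00:00"]]
print(names)
# ['2018-11-17_17_11_48', '2018-11-17_17_11_48_2', '2018-11-18_09_00_00']
```

In the script above, replace the filename computed from the date with `unique_filename(p["date"], seen)` so every object gets its own file.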

Related

Is there a way I can extract multiple pieces of data from a multiple text file in python and save it as a row in a new .csv file?

Is there a way I can extract multiple pieces of data from a text file in python and save it as a row in a new .csv file? I need to do this for multiple input files and save the output as a single .csv file for all of the input files.
I have never used Python before, so I am quite clueless. I have used MATLAB before, and I know how I would do it in MATLAB if the data were numbers (but unfortunately it is text, which is why I am trying Python). To be clear, I need a new line in the .csv output file for each "ID" in the input files.
An example of the data is show below (2 separate files)
EXAMPLE DATA - FILE 1:
id,ARI201803290
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
info,date,2018/03/29
id,ARI201803300
data,er,corbp001,2
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
info,date,2018/03/30
data,er,delaj001,0
EXAMPLE DATA - FILE 2:
id,NYN201803290
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,usedh,false
info,date,2018/03/29
data,er,famij001,0
id,NYN201803310
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,date,2018/03/31
data,er,gselr001,0
I'm hoping to get the data into .csv format with all the details for one "id" on one line. There are multiple "id"s per text file and there are multiple files. I want to repeat this process for multiple text files so that the outputs all land in the same .csv output file. I want the output to look as follows in the .csv file, with each piece of info in a new cell:
ARI201803290 COL ARI PHO01 2018/03/29 2
ARI201803300 COL ARI PHO01 2018/03/30 0
NYN201803290 SLN NYN NYC20 2018/03/29 0
NYN201803310 SLN NYN NYC20 2018/03/31 0
If I were doing it in MATLAB I'd use a for loop and an if statement, something like
j = 1
k = 1
for i = 1:size(myMatrix, 1)
    if file1(i,1) == id
        output(k,1) = file1(i,2)
        k = k + 1
    elseif file1(i,1) == info
        output(j,2) = file1(i,3)
        j = j + 1
    etc.....
However, I obviously can't do this in MATLAB because I have comma-separated text files, not a matrix. Does anyone have any suggestions for how I can translate my idea to Python code? Or any other suggestion. I am super new to Python, so I am willing to try anything that might work.
Thank you very much in advance!
Python is very flexible and can do these jobs very easily.
There are a lot of csv tools/modules in Python to handle pretty much every type of csv and excel file; however, I prefer to handle a csv the same as a text file, because a csv is simply a text file with comma-separated text, so simple is better than complicated.
Below is the code, with comments to explain most of it; you can tweak it to match your needs exactly.
import os

input_folder = 'myfolder/'  # path of the folder containing the text files on your disk
# create a list of file names with their full paths using a list comprehension
data_files = [os.path.join(input_folder, file) for file in os.listdir(input_folder)]
# open our csv file for writing
csv = open('myoutput.csv', 'w')  # better to open files with a context manager like below, but I am trying to show you different methods

def write_to_csv(line):
    print(line)
    csv.write(line)

# loop through your text files
for file in data_files:
    with open(file, 'r') as f:  # use a context manager to open files (best practice)
        buff = []
        for line in f:
            line = line.strip()     # remove spaces and newlines
            line = line.split(',')  # split the line into a list of values
            if buff and line[0] == 'id':  # hit another 'id'
                write_to_csv(','.join(buff) + '\n')
                buff = []
            buff.append(line[-1])  # add the last word in the line
        write_to_csv(','.join(buff) + '\n')  # flush the final record of the file
csv.close()  # must close any file handle opened manually (no context manager, i.e. no "with")
output:
ARI201803290,2,COL,ARI,PHO01,2018/03/29,2
ARI201803300,2,COL,ARI,PHO01,2018/03/30,0
NYN201803290,2,SLN,NYN,NYC20,false,2018/03/29,0
NYN201803310,2,SLN,NYN,NYC20,2018/03/31,0
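The output above shows a wrinkle: the extra usedh line in File 2 adds a stray false column, because the code keeps the last field of every line. A variant keyed on field names keeps the rows aligned (a sketch; the `wanted` field list and the function name are my assumptions):

```python
def parse_records(lines, wanted=('visteam', 'hometeam', 'site', 'date')):
    """Yield one dict per 'id' record; only fields named in `wanted` (plus er) are kept."""
    row = None
    for line in lines:
        parts = line.strip().split(',')
        if parts[0] == 'id':
            if row is not None:
                yield row
            row = {'id': parts[1]}
        elif parts[0] == 'info' and parts[1] in wanted:
            row[parts[1]] = parts[2]
        elif parts[0] == 'data' and parts[1] == 'er':
            row['er'] = parts[3]
    if row is not None:
        yield row

sample = """id,NYN201803290
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,usedh,false
info,date,2018/03/29
data,er,famij001,0""".splitlines()
for r in parse_records(sample):
    print(r)
```

Each dict can then be written out with csv.DictWriter, which also handles records with missing fields gracefully.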

How do I increase the default column width of a csv file so that when I open the file all of the text fits correctly?

I am trying to code a function where I grab data from my database, which already works correctly.
This is my code for the headers prior to adding the actual records:
with open('csv_template.csv', 'a') as template_file:
    #declares the variable template_writer ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    #appends the column names of the excel table prior to adding the actual physical data
    template_writer.writerow(['Arrangement_ID','Quantity','Cost'])
    #closes the file after appending
    template_file.close()
This is my code for the records which is contained in a while loop and is the main reason that the two scripts are kept separate.
with open('csv_template.csv', 'a') as template_file:
    #declares the variable template_writer ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    #appends the data of the current fetched values of the sql statement within the while loop to the csv file
    template_writer.writerow([transactionWordData[0],transactionWordData[1],transactionWordData[2]])
    #closes the file after appending
    template_file.close()
Once I have this data ready for Excel, I open the file in Excel and would like it to be in a format where I can print immediately. However, when I do print, the column width of the Excel cells is too small, which leads to the text being cut off.
I have tried altering the default column width within Excel, hoping it would keep that format permanently, but that doesn't seem to be the case: every time I re-open the csv file in Excel, it resets completely back to the default column width.
Here is my code for opening the csv file in Excel using Python; the comment is the actual code I want to use once I can format the spreadsheet ready for printing.
#finds the os path of the csv file depending where it is in the file directories
file_path = os.path.abspath("csv_template.csv")
#opens the csv file in excel ready to print
os.startfile(file_path)
#os.startfile(file_path, 'print')
If anyone has any solutions to this or ideas please let me know.
Unfortunately I don't think this is possible for CSV file formats, since they are just plaintext comma separated values and don't support formatting.
I have tried altering the default column width within excel but every time that I re-open the csv file in excel it seems to reset back to the default column width.
If you save the file to an excel format once you have edited it that should solve this problem.
Alternatively, instead of using the csv library you could use xlsxwriter instead which does allow you to set the width of the columns in your code.
See https://xlsxwriter.readthedocs.io and https://xlsxwriter.readthedocs.io/worksheet.html#worksheet-set-column.
Hope this helps!
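For instance, a minimal xlsxwriter sketch (the output filename and the 18-character width are assumptions):

```python
import xlsxwriter

workbook = xlsxwriter.Workbook('template.xlsx')
worksheet = workbook.add_worksheet()
worksheet.set_column(0, 2, 18)  # columns A-C, 18 characters wide
worksheet.write_row(0, 0, ['Arrangement_ID', 'Quantity', 'Cost'])
workbook.close()
```

The column widths are stored in the .xlsx file itself, so they survive re-opening the file, unlike anything set manually on a csv.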
The csv format is nothing other than a text file whose lines follow a given pattern: a fixed number of fields (your data) delimited by commas. In contrast, an .xlsx file is a binary file that contains formatting specifications. You may therefore want to write to an Excel file instead, using the rich pandas library.
Since the headers are plain strings, you can pad them with spaces so the columns appear wider, like this:
template_writer.writerow(['Arrangement_ID    ','Quantity    ','Cost    '])

Extracting N JSON objects contained in a single line from a text file in Python 2.7?

I have a huge text file that contains several JSON objects that I want to parse into a csv file. Because I'm dealing with someone else's data, I cannot really change the format it's delivered in.
Since I don't know how many JSON objects there are, I can't just create a couple of dictionaries, wrap them in a list, and json.loads() the list.
Also, since all the objects are on a single text line, I can't use a regex to separate each individual JSON object and put them in a list. (It's a super complicated and sometimes triple-nested JSON at some points.)
Here's my current code
import fileinput
import json
import re
import sys

def json_to_csv(text_file_name, desired_csv_name):
    #Cleans up a bit of the text file
    file = fileinput.FileInput(text_file_name, inplace=True)
    ile = fileinput.FileInput(text_file_name, inplace=True)
    for line in file:
        sys.stdout.write(line.replace(u'\'', u'"'))
    for line in ile:
        sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
    #try to load the text file into the content var
    with open(text_file_name, "rb") as fin:
        content = json.load(fin)
    #Rest of the logic uses the json data in content
    #to build the desired csv format
This code gives ValueError: Extra data: line 1 column 159816 because there is more than one object there.
I have seen similar questions on Google and StackOverflow, but none of those solutions work, because it's just one really long line in a text file and I don't know how many objects are in the file.
If you are trying to split apart the highest level braces you could do something like
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
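A safer alternative to eval is json.JSONDecoder.raw_decode, which parses one object from a string and reports where it ended; calling it in a loop splits back-to-back objects of unknown count (a sketch; the function name is mine, and this works on both Python 2.7 and 3):

```python
import json

def split_json_objects(line):
    """Parse consecutive JSON objects out of one long string, without eval()."""
    decoder = json.JSONDecoder()
    objects, idx = [], 0
    while idx < len(line):
        if line[idx].isspace():  # skip whitespace between objects
            idx += 1
            continue
        obj, idx = decoder.raw_decode(line, idx)  # returns (object, end position)
        objects.append(obj)
    return objects

print(split_json_objects('{"a": 1}{"b": [2, 3]} {"c": {"d": 4}}'))
# [{'a': 1}, {'b': [2, 3]}, {'c': {'d': 4}}]
```

Unlike eval, this never executes the input as code, and it raises ValueError at the exact offset of any malformed object.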

Comparing 2 csv files with python misses a match on larger files

I am fairly new to programming and I am currently stumped by an issue. I wrote a little script in Python to compare two csv files that contain usernames for email groups that must be maintained. The script worked perfectly for csv files in the 200-300 item range; however, I just started testing files with a couple thousand values, and my script seems to miss the very last item on each list.
The idea is that I have 2 csv files, an old list and a new list. The way I receive the files is a bit finicky, so before processing the lists, I create a new csv from each of the csvs I am dealing with: basically an old_clean csv and a new_clean csv. I then check each item in old_clean against new_clean to see if it is in the new list; if it's not, it gets added to remove.csv to be processed into the email system later. I then run the test the other way around to find new names, which go into add.csv. The issue I have is that the last name on the list, which is in both the old and the new csv file, shows up in both add.csv and remove.csv.
As I stated, this only happens with larger files. My code is below; any help would be appreciated.
import sys
import csv
import re
import os
import time
###Works in python 2.5###
#create a new csv for cleaned values from first csv entered
o = open("first.csv","w")
data = open(sys.argv[1]).read()
o.write( re.sub(" ","",data) )
o.close()
#create a new csv for cleaned values from second csv entered
n = open("second.csv","w")
data = open(sys.argv[2]).read()
n.write( re.sub(" ","",data) )
n.close()
#create csv of names to remove from group
remove = open("Changes/remove.csv","w")
#create csv of names to add to group
add = open("Changes/add.csv","w")
time.sleep(3)
#adds any names from first list not found in second list to the remove.csv
for line in open("first.csv"):
    if line not in open("second.csv"):
        remove.write(line)
remove.close()
#adds any names from second list not found in first list to the add.csv
for line in open("second.csv"):
    if line not in open("first.csv"):
        add.write(line)
add.close()
#remove the generated "clean" csv files
os.remove("first.csv")
os.remove("second.csv")
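One frequent cause of exactly this symptom is the last line of a file lacking a trailing newline: the string "name" from one file then never equals the "name\n" read from the other, so the final entry looks missing from both lists. The `line not in open(...)` test also re-reads the whole second file for every line. Comparing stripped lines as sets sidesteps both problems (a sketch; the function name and the demo filenames are mine):

```python
import os
import tempfile

def diff_lists(old_path, new_path):
    """Return (to_remove, to_add) by comparing stripped lines as sets."""
    with open(old_path) as f:
        old = {line.strip() for line in f if line.strip()}
    with open(new_path) as f:
        new = {line.strip() for line in f if line.strip()}
    return sorted(old - new), sorted(new - old)

# demo: note the deliberately missing trailing newline on the last lines
old = tempfile.NamedTemporaryFile('w', delete=False, suffix='.csv')
old.write('alice\nbob\ncarol')
old.close()
new = tempfile.NamedTemporaryFile('w', delete=False, suffix='.csv')
new.write('alice\ndave\ncarol')
new.close()
to_remove, to_add = diff_lists(old.name, new.name)
print(to_remove, to_add)  # ['bob'] ['dave']
os.remove(old.name)
os.remove(new.name)
```

Because each file is read exactly once into a set, membership checks are O(1), so this also scales to lists with thousands of names.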

How to copy specific data out of a file using python?

I have some large data files and I want to copy out a certain piece of data from each line, basically an ID code. The ID code has a | on one side and a space on the other. I was wondering whether it would be possible to pull out just the ID. I have two data files: one has 4 ID codes per line and the other has 23 per line.
At the moment I'm thinking of something like copying each line from the data file, then subtracting the strings from each other to get the desired ID code, but surely there must be an easier way! Help?
Here is an example of a line from the data file that I'm working with
cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327
and from this line I would want to output on separate lines
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
Use the re module for such a task. The following code shows you how to extract the IDs from a string (it works for any number of IDs, as long as they are structured the same way).
import re

s = """cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"""
results = re.findall(r'\|([^ ]*)', s)  # list of ids that have been extracted from the string
print('\n'.join(results))  # pretty output
Output:
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
To write the output to a file:
with open('out.txt', mode='w') as filehandle:
    filehandle.write('\n'.join(results))
For more information, see the re module documentation.
If all your lines have the given format, a simple split is enough:
# split on '|' and take the first whitespace-separated token of each remaining part
ids = [x.split()[0] for x in line.split("|")[1:]]
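Scaled up to a whole file, the same split can be sketched as follows (the function name and the in-memory sample are mine; in practice you would pass an open file object):

```python
import io

def extract_ids(lines):
    """Collect the token immediately after each '|' on every line."""
    ids = []
    for line in lines:
        ids.extend(part.split()[0] for part in line.split('|')[1:])
    return ids

sample = io.StringIO(
    "cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327\n"
)
print(extract_ids(sample))
# ['Wood_4286', 'EIK58010', 'AEV64487.1', 'PSEBR_a4327']
```

The same function handles both data files, since it does not care whether a line carries 4 ID codes or 23.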