I have two file locations I would like to iterate through to read the .tsv files. The first location is:
"C:\Users\User\Documents\Research\STITCH\0NN-human-STITCH\stitch.tsv"
The second is:
"C:\Users\User\Documents\Research\STITCH\1AQ-human-STITCH\stitch.tsv"
Both .tsv files have the same name but are located in different folders.
Instead of using glob, I'd like to create a loop and dictionary to search through each of the files, like this:
import pandas as pd
file_name = 'C:/Users/User/Documents/Research/STITCH/{}-human-STITCH/stitch_interactions.tsv'
df_list = []
for i in range('ONN','1AQ'):
    df_list.append(pd.read_csv(file_name.format(i)))
df = pd.concat(df_list)
After searching through one file, I'd then like to add an element from that file to an excel sheet.
I receive an error:
for i in range('ONN','1AQ'):
TypeError: 'str' object cannot be interpreted as an integer
Thanks
range() returns a sequence of numbers; it will not work with strings.
When there are only two values, you can simply iterate over them as a tuple:
import pandas as pd

file_name = 'C:/Users/User/Documents/Research/STITCH/{}-human-STITCH/stitch_interactions.tsv'
df_list = []
for i in ('ONN', '1AQ'):
    df_list.append(pd.read_csv(file_name.format(i)))
df = pd.concat(df_list)
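Since you mentioned wanting a dictionary, here is a minimal sketch of that variant (the folder codes are taken from the paths in your question, and sep='\t' is an assumption because the files are .tsv):

import pandas as pd

file_name = 'C:/Users/User/Documents/Research/STITCH/{}-human-STITCH/stitch_interactions.tsv'

# keep each file's dataframe under its folder code so it can be looked up later
df_dict = {code: pd.read_csv(file_name.format(code), sep='\t')
           for code in ('0NN', '1AQ')}

df = pd.concat(df_dict.values())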
Try an f-string with a list comprehension, iterating directly over the two folder codes instead of range:
concat_df = pd.concat([pd.read_csv(
    f'C:/Users/User/Documents/Research/STITCH/{i}-human-STITCH/stitch_interactions.tsv') for i in ('ONN', '1AQ')])
Edit: for the 'str' error, remove range from the for-loop; range() only accepts integers.
If you want to write the whole dataframe to Excel you can use df.to_excel:
pandas.DataFrame.to_excel
You can also append to an existing workbook using an ExcelWriter (see the example further down that docs page).
If you want to write a specific row to Excel, use iloc if you know the row number, or loc for the row name/identifier:
pandas.DataFrame.iloc
pandas.DataFrame.loc
Scroll down in each docs page for examples of how to use the functions. A short sketch of both options follows.
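A minimal sketch of both options (df is the concatenated dataframe from the question above, results.xlsx is a placeholder file name, and appending with mode='a' assumes the openpyxl engine is installed):

import pandas as pd

# write the whole dataframe to a new workbook
df.to_excel('results.xlsx', sheet_name='stitch', index=False)

# append a single row, selected by position with iloc, as a new sheet
with pd.ExcelWriter('results.xlsx', mode='a', engine='openpyxl') as writer:
    df.iloc[[0]].to_excel(writer, sheet_name='selected_rows', index=False)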
I'm building a site that, based on a user's input, sorts through JSON data and prints a schedule for them into an HTML table. I want to give it the functionality that, once their table is created, they can export the data to a CSV/Excel file, so we don't have to store their credentials (logins and schedules) in a database. Is this possible? If so, how can I do it, preferably using Python?
This is not the exact answer but rather the steps for you to follow in order to get a solution (a minimal sketch comes after the list):
1. Read the data from JSON: some_dict = json.loads(json_string)
2. Write the appropriate code to get the data out of the dictionary (sort, conditions, etc.) and into a 2D list
3. Save that list as CSV: https://realpython.com/python-csv/
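A minimal sketch of those three steps (the field names and the input string are placeholders, not from your data):

import json
import csv

json_string = '[{"name": "Alice", "start": "09:00"}, {"name": "Bob", "start": "10:00"}]'

# 1. read data from JSON
records = json.loads(json_string)

# 2. sort/filter and build a 2D list of rows, header row first
rows = [["name", "start"]]
rows += [[r["name"], r["start"]] for r in sorted(records, key=lambda r: r["start"])]

# 3. save that list as CSV
with open("schedule.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)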
I'm pretty lazy and like to utilize pandas for things like this. It would be something along the lines of
import json
import pandas as pd

file = 'data.json'
with open(file) as j:
    json_data = json.load(j)

df = pd.DataFrame.from_dict(json_data, orient='index')
df.to_csv("data.csv")
I would like to write my spark dataframe as a set of JSON files and in particular each of which as an array of JSON.
Let me explain with some simple (reproducible) code.
We have:
import numpy as np
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'x': np.random.rand(100), 'y': np.random.rand(100)}))
Saving the dataframe as:
df.write.json('s3://path/to/json')
each file just created has one JSON object per line, something like:
{"x":0.9953802385540144,"y":0.476027611419198}
{"x":0.929599290575914,"y":0.72878523939521}
{"x":0.951701684432855,"y":0.8008064729546504}
but I would like to have an array of those JSON per file:
[
{"x":0.9953802385540144,"y":0.476027611419198},
{"x":0.929599290575914,"y":0.72878523939521},
{"x":0.951701684432855,"y":0.8008064729546504}
]
It is not currently possible to have Spark "natively" write a single file in your desired format, because Spark works in a distributed (parallel) fashion, with each executor writing its part of the data independently.
However, since you are okay with having each file be an array of JSON rather than requiring one single file, here is one workaround you can use to achieve your desired output:
from pyspark.sql.functions import to_json, spark_partition_id, collect_list, col, struct
df.select(to_json(struct(*df.columns)).alias("json"))\
    .groupBy(spark_partition_id())\
    .agg(collect_list("json").alias("json_list"))\
    .select(col("json_list").cast("string"))\
    .write.text("s3://path/to/json")
First, you create a JSON string from all of the columns of df. Then you group by the Spark partition ID and aggregate using collect_list, which puts all of the JSON strings on that partition into a list. Since you're aggregating within the partition, no shuffling of data should be required.
Now select the list column, cast it to a string, and write it out as a text file.
Here's an example of how one file looks:
[{"x":0.1420523746714616,"y":0.30876114874052263}, ... ]
Note you may get some empty files.
Presumably you can force Spark to write the data to ONE file if you specify an empty groupBy (a sketch of that variant follows), but this would force all of the data into a single partition, which could result in an out-of-memory error.
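For completeness, a sketch of that single-file variant (same imports as above; only safe if the whole dataset fits in one task's memory):

# groupBy() with no columns aggregates everything into one row, and
# coalesce(1) makes sure that row is written as a single part file
df.select(to_json(struct(*df.columns)).alias("json"))\
    .groupBy()\
    .agg(collect_list("json").alias("json_list"))\
    .select(col("json_list").cast("string"))\
    .coalesce(1)\
    .write.text("s3://path/to/json")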
If the data is not super huge and it's okay to have the list as one JSON file, the following workaround is also valid. First, convert the PySpark dataframe to pandas and then to a list of dicts. Then the list can be dumped as JSON.
import json

list_of_dicts = df.toPandas().to_dict('records')
with open('path/to/file.json', 'w') as json_file:
    json_file.write(json.dumps(list_of_dicts))
I am currently working with two CSV files: base.csv and another file, output_20170503.csv, which will be produced every day. My aim is to rebase every output so that it contains the same data as base.csv.
My base.csv:
ID,Name,Number,Shape,Sound
1,John,45,Round,Meow
2,Jimmy,78,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,50,Triangle,Meow
5,Nyancat,,Round,Quack
My output_20170503.csv
ID,Name,Number,Shape,Sound
1,John,,Round,Meow
2,Jimmy,,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,,Triangle,
5,Nyancat,,Round,Quack
6,Marc,,Square,Woof
7,Jonnn,,Hexagon,Chirp
The objective here is to rebase the data (IDs 1-5) in output_20170503.csv using the values from base.csv.
What I want to achieve:
ID,Name,Number,Shape,Sound
1,John,45,Round,Meow
2,Jimmy,78,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,50,Triangle,Meow
5,Nyancat,,Round,Quack
6,Marc,,Square,Woof
7,Jonnn,,Hexagon,Chirp
I already searched for a solution, but what I found:
Merge two CSV files (both CSV files have different columns, won't work for me)
Remove duplicates from a CSV file (appending base.csv to output_20170503.csv and then removing the duplicates won't work because they have different values in the Number column)
Any help would be appreciated, thank you.
You can try this. I use the first two items of each line as the key to build a dict, then iterate over the new dict and update the base dict with any key that is not already in base:
new = {"".join(i.split(',')[:2]): i.rstrip('\n').split(',') for i in open('output_20170503.csv')}
base = {"".join(i.split(',')[:2]): i.rstrip('\n').split(',') for i in open('base.csv')}
base.update({i: new[i] for i in new if i not in base})

with open("out.csv", "w") as f:
    for i in sorted(base.values(), key=lambda x: x[0]):
        if i[0] != "ID":
            f.write(",".join(i) + "\n")
Output:
1,John,45,Round,Meow
2,Jimmy,78,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,50,Triangle,Meow
5,Nyancat,,Round,Quack
6,Marc,,Square,Woof
7,Jonnn,,Hexagon,Chirp
Python 2.7+ supports the syntactic extension called the "dict comprehension"; if you're using Python 2.6, you need to replace the first three lines with:
new = dict(("".join(i.split(',')[:2]), i.rstrip('\n').split(',')) for i in open('output_20170503.csv'))
base = dict(("".join(i.split(',')[:2]), i.rstrip('\n').split(',')) for i in open('base.csv'))
base.update(dict((i, new[i]) for i in new if i not in base))
You should try the pandas library, which is excellent for data manipulation. You can easily read CSV files and do merge/update operations. Your solution might look like the following:
import pandas as pd

base_df = pd.read_csv('base.csv')
output_df = pd.read_csv('output_20170503.csv')
output_df.update(base_df)
output_df.to_csv('output_20170503.csv', index=False)
The missing values in output_df have now been filled in with the ones from base_df.
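If the daily file's row order might not match base.csv, a variant that aligns on the ID column instead of the default positional index might look like this (a sketch, not tested against your real files):

import pandas as pd

base_df = pd.read_csv('base.csv').set_index('ID')
output_df = pd.read_csv('output_20170503.csv').set_index('ID')

# update() aligns on the index, so rows are matched by ID, not by position
output_df.update(base_df)

output_df.reset_index().to_csv('output_20170503.csv', index=False)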
I am being forced to work on a project off of CSV files instead of a database... irritating but true. I have no control over the organization the CSV will come out in, but I can reasonably guarantee that the column names will be maintained in the CSV header.
I was just getting ready to write some code to return column id's on string matches, but was wondering if there was a module that might be able to do this for me?
e.g.
data = csv.csvRowData[5] becomes
data = csv.csvRowData[find_rowID('column_name')]
Forgive me if my code syntax is off; I came from PHP and will figure out how to make it work in Python syntax.
I use the pandas package; it has a powerful read_csv utility: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
cat test.csv
date,value
2014,Hi
2015,Hello
import pandas as pd
df = pd.read_csv('test.csv')
This returns a pandas.DataFrame that does what you want (and a lot more, e.g. conversion of the data types of the columns). Try it out in IPython:
In [5]: df['date']
Out[5]:
0 2014
1 2015
Name: date, dtype: int64
In [6]: df.columns
Out[6]: Index([u'date', u'value'], dtype='object')
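If you also need the numeric column position (the find_rowID idea from the question), the DataFrame's column Index can give it to you; a small sketch using the same test.csv:

col_idx = df.columns.get_loc('value')   # -> 1
first_value = df.iloc[0, col_idx]       # -> 'Hi'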
The Python standard library includes the csv module.
It provides the DictReader class, which allows you to access a row's data by column header labels.
DictReader takes the first row in the CSV file to be the column headers and then provides every subsequent row as a dict, with the column labels as keys and the row's data as values.
For example if people.csv looked like this:
"First Name","Last Name"
Peter,Venkman
Egon,Spengler
You can use DictReader like this:
import csv

with open('people.csv') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        print(row["Last Name"])

# will output
Venkman
Spengler
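If you need the numeric column position rather than dict access (closer to the find_rowID idea from the question), the plain csv.reader plus the header row works too; a small sketch assuming the same people.csv:

import csv

with open('people.csv') as csv_file:
    rows = csv.reader(csv_file)
    header = next(rows)                  # ['First Name', 'Last Name']
    col_idx = header.index('Last Name')  # -> 1
    for row in rows:
        print(row[col_idx])              # Venkman, then Spengler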