Reading CSV from URL and pushing it into a DB through pandas - Python

The URL returns CSV-formatted data. I am trying to fetch the data and push it into a database. However, I am unable to read the data: the script only prints the header of the file, not the complete CSV data. Is there a better option?
#!/usr/bin/python3
import pandas as pd

data = pd.read_csv("some-url")  # URL not provided due to security restrictions.
for row in data:
    print(row)

Iterating over a DataFrame directly yields its column names, which is why your loop only prints the header. Instead, you can iterate through the results of df.to_dict(orient="records"):
data = pd.read_csv("some-url")
for row in data.to_dict(orient="records"):
    # On each iteration, `row` is a key:value dict where each
    # key takes the value of the column name.
    # Use this dict to create a record for your db insert, e.g. as raw SQL or
    # to create an instance for an ORM like SQLAlchemy.
    ...
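For instance, a minimal sketch of the raw-SQL route using the stdlib sqlite3 module (the database file, table, and column names here are hypothetical stand-ins for your real schema):
import sqlite3

import pandas as pd

data = pd.read_csv("some-url")

conn = sqlite3.connect("mydata.db")  # hypothetical local database
conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, value TEXT)")  # hypothetical schema

for row in data.to_dict(orient="records"):
    # The dict keys mirror the CSV header, so named placeholders map straight onto them.
    conn.execute("INSERT INTO records (name, value) VALUES (:name, :value)", row)

conn.commit()
conn.close()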
I do a similar thing to pre-format data for SQLAlchemy inserts, although I'm using pandas to merge data from multiple sources rather than just reading the file.
Side note: there are plenty of other ways to do this without pandas, just iterating through the lines of the file. However, pandas's intuitive handling of CSVs makes it an attractive shortcut for what you need.
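For completeness, a sketch of that pandas-free route ("some-url" is still a stand-in, as in the question):
import csv
import urllib.request

with urllib.request.urlopen("some-url") as resp:
    lines = resp.read().decode("utf-8").splitlines()

for row in csv.DictReader(lines):
    print(row)  # each row is a dict keyed by the CSV header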

Related

How can I print attributes/metadata of a pandas dataframe while saving it to an Excel or CSV file?

I have a pandas dataframe abc which I created as follows:
abc = pd.DataFrame({"A":[1,2,3],"B":[2,3,4]})
I added some additional attributes of the dataframe as follows:
abc.attrs = {"Name":"John", "Country":"Nepal"}
I'd like to save the pandas dataframe into an Excel file in xlsx or CSV format. I can do that using abc.to_excel("filename.xlsx") or abc.to_csv("filename.csv") where filename is the required name of the file.
However, I am not able to include the attributes in the saved file. I'd like to save the dataframe to the Excel file such that the first row gives Name and the second row gives Country, each as a key/value pair in two columns.
How can I do that?
Unfortunately, .to_excel() and .to_csv() do not provide any explicit functionality to insert meta information ahead of the actual dataframe as documented for the Excel and CSV write functions.
Regardless, one could exploit the header argument to hardcode this preamble into the frame. This can be achieved, for example, with
abc.to_csv("filename.csv", header=[str(k) + ',' + str(v) + '\n' for k,v in abc.attrs.items()])
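A more transparent alternative (a sketch, not built-in functionality) is to write the attribute rows yourself and then append the frame to the same file handle, which to_csv accepts:
with open("filename.csv", "w", newline="") as f:
    for k, v in abc.attrs.items():
        f.write(f"{k},{v}\n")   # one metadata row per attribute
    abc.to_csv(f, index=False)  # to_csv also accepts an open file object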
Please note, however, that data tables store homogeneous data across rows and columns. Adding meta information on top makes the data harder to read and process. Consider adding it (a) in the file name, (b) in a distinct table, or (c) dropping it altogether.
Additionally, it should be noted that as of now (pandas 1.4.3), the attributes feature is experimental and could change or disappear in any future version, which makes any implementation relying on it brittle.

How do I export JSON data to CSV using Python?

I'm building a site that, based on a user's input, sorts through JSON data and prints a schedule for them into an HTML table. I want to give it the functionality that, once their table is created, they can export the data to a CSV/Excel file so we don't have to store their credentials (logins & schedules) in a database. Is this possible? If so, how can I do it, preferably using Python?
This is not the exact answer but rather steps for you to follow in order to get a solution:
1. Read the data from the JSON: some_dict = json.loads(json_string)
2. Write appropriate code to get the data out of the dictionary (sorting, conditions, etc.) and into a 2D array (a list of lists).
3. Save that list as CSV: https://realpython.com/python-csv/
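A minimal sketch of those three steps, assuming the JSON decodes to a list of flat dicts (the field names here are made up):
import csv
import json

# Hypothetical input; the real structure depends on your site's data.
json_string = '[{"name": "Alice", "slot": "9am"}, {"name": "Bob", "slot": "1pm"}]'
records = json.loads(json_string)

# Step 2: flatten into a 2D list, header row first.
table = [list(records[0].keys())] + [list(r.values()) for r in records]

# Step 3: write the 2D list out as CSV.
with open("schedule.csv", "w", newline="") as f:
    csv.writer(f).writerows(table)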
I'm pretty lazy and like to utilize pandas for things like this. It would be something along the lines of
import json

import pandas as pd

file = 'data.json'
with open(file) as j:
    json_data = json.load(j)

df = pd.DataFrame.from_dict(json_data, orient='index')
df.to_csv("data.csv")

Write Spark dataframe as an array of JSON (PySpark)

I would like to write my Spark dataframe as a set of JSON files, in particular with each file containing an array of JSON objects.
Let me explain with some simple (reproducible) code.
We have:
import numpy as np
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'x': np.random.rand(100), 'y': np.random.rand(100)}))
Saving the dataframe as:
df.write.json('s3://path/to/json')
each file just created has one JSON object per line, something like:
{"x":0.9953802385540144,"y":0.476027611419198}
{"x":0.929599290575914,"y":0.72878523939521}
{"x":0.951701684432855,"y":0.8008064729546504}
but I would like each file to contain an array of those JSON objects:
[
{"x":0.9953802385540144,"y":0.476027611419198},
{"x":0.929599290575914,"y":0.72878523939521},
{"x":0.951701684432855,"y":0.8008064729546504}
]
It is not currently possible to have Spark "natively" write a single file in your desired format, because Spark works in a distributed (parallel) fashion, with each executor writing its part of the data independently.
However, since you are okay with having each file (rather than only one file) be an array of JSON, here is one workaround that you can use to achieve your desired output:
from pyspark.sql.functions import to_json, spark_partition_id, collect_list, col, struct

df.select(to_json(struct(*df.columns)).alias("json")) \
    .groupBy(spark_partition_id()) \
    .agg(collect_list("json").alias("json_list")) \
    .select(col("json_list").cast("string")) \
    .write.text("s3://path/to/json")
First you create a JSON string from all of the columns in df. Then you group by the Spark partition ID and aggregate using collect_list, which puts all the JSON strings on that partition into a list. Since you're aggregating within the partition, there should be no shuffling of data required.
Now select the list column, cast it to a string, and write it out as a text file.
Here's an example of how one file looks:
[{"x":0.1420523746714616,"y":0.30876114874052263}, ... ]
Note you may get some empty files.
Presumably you can force Spark to write the data to ONE file if you specify an empty groupBy(), but this would force all of the data into a single partition, which could result in an out-of-memory error.
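If you want to try it anyway, a sketch (not from the original answer; the output path is a stand-in, and the same memory caveat applies):
df.select(to_json(struct(*df.columns)).alias("json")) \
    .groupBy() \
    .agg(collect_list("json").alias("json_list")) \
    .select(col("json_list").cast("string")) \
    .write.text("s3://path/to/json_single")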
If the data is not super huge and it's okay to have the whole list as one JSON file, the following workaround is also valid: first convert the PySpark dataframe to pandas and then to a list of dicts; the list can then be dumped as JSON.
import json

list_of_dicts = df.toPandas().to_dict('records')
with open('path/to/file.json', 'w') as json_file:
    json.dump(list_of_dicts, json_file)

How to store complex CSV data in Django?

I am working on a Django project where a user can upload a CSV file that is stored in the database. In most CSV files I have seen, the first row contains the header and the values sit underneath, but in my case the headers run down a column, like this (my CSV data):
I did not understand how to save this type of data on my django model.
You can transpose your data. I think it is more appropriate for your dataset in order to do real analysis. Usually things such as id values would be the row index, and names such as company_id, company_name, etc. would be the columns. This will allow you to do further analysis (mean, std, variance, pct_change, group_by) and use pandas to its fullest. That said:
import pandas as pd

df = pd.read_csv('yourcsvfile.csv')
df2 = df.T  # transpose: the header column becomes the header row
Also, as #H.E. Lee pointed out, in order to save your model to your database, you can either use the DataFrame's to_sql method to save it to MySQL (through your connection), use to_json and then import the data if you're using MongoDB, or manually write your own transformation to your database.
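For the to_sql route, a sketch (the connection URL, credentials, and table name here are hypothetical):
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")  # hypothetical connection
df2.to_sql("company_data", engine, if_exists="replace", index=True)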
You can flip it with the built-in csv module quite easily, no need for heavyweight modules like pandas (which in turn requires NumPy). Since you didn't specify the Python version you're using, and this procedure differs slightly between versions, I'll assume Python 3.x:
import csv

# use open("file.csv", "rb") in Python 2.x
with open("file.csv", "r", newline="") as f:  # open the file for reading
    data = list(map(list, zip(*csv.reader(f))))  # read the CSV and flip it
If you're using Python 2.x, you should also use itertools.izip() instead of zip(), and you don't have to turn the map() output into a list (it already is one).
Also, if the rows in your CSV are uneven, you might want to use itertools.zip_longest() (itertools.izip_longest() in Python 2.x) instead.
Either way, this will give you a 2D list data where the first element is your header and the rest are the related data rows. What you plan to do from there depends purely on your DB. If you want to deal with the data only, just skip the first element of data when iterating, and you're done.
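For example, a data-only sketch (MyModel is a hypothetical Django model whose field names match your header):
header, *rows = data  # the first element of the flipped list is the header
for row in rows:
    record = dict(zip(header, row))  # map header names onto this row's values
    MyModel.objects.create(**record)  # hypothetical Django model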
Given your data it may be best to store each row as a string entry using TextField. That way you can be sure not to lose any structure going forward.

Converting a JSON file to SQLite or CSV

I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the quite large JSON file I was working with. I tried using a Python module called sqlitebiter but, like the others, I was never really able to get it to output or convert the file.
I'm not sure where to go now; if anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize the pandas module for this data-processing task as follows:
First, read the JSON file using with, open, and json.load.
Second, change the format of your file a bit by turning the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, utilize some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, use pandas's to_csv function to write your DataFrame as a CSV file to disk.
It would look something like this:
import pandas as pd
import json

with open('./airports.json.txt', 'r') as f:
    j = json.load(f)

l = list(j.values())
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
You need to load your JSON file and parse it so that all the fields are available, or load the contents into a dictionary. Then you could use pyodbc to write those fields to the database, or write them to a CSV with the built-in csv module.
But this is just a general idea. You need to study Python and how to do every step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."  # build your statement here
    cursor1.execute(sql_order)
    cursor1.commit()
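One caveat worth adding (not from the original answer): for anything built from user input, prefer a parameterized query, e.g. cursor1.execute("UPDATE MYTABLE SET MYFIELD = ? WHERE ID = ?", (value, row_id)), over string concatenation; pyodbc uses the qmark placeholder style, and value and row_id here are hypothetical.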
