I am currently conducting a data scraping project with Python 3 and am attempting to write the scraped data to a CSV file. My current process to do it is this:
import csv
outputFile = csv.writer(open('myFilepath', 'w'))
outputFile.writerow(['header1', 'header2'...])
for each in data:
    scrapedData = scrap(each)
    outputFile.writerow([scrapedData.get('header1', 'header 1 NA'), ...])
Once this script is finished, however, the CSV file is blank. If I just run:
import csv
outputFile = csv.writer(open('myFilepath', 'w'))
outputFile.writerow(['header1', 'header2'...])
a CSV file is produced containing the headers:
header1,header2,..
If I just scrape one item in data, for example:
outputFile.writerow(['header1', 'header2'...])
scrapedData = scrap(data[0])
outputFile.writerow([scrapedData.get('header1', 'header 1 NA'), ...])
a CSV file will be created including both the headers and the data for data[0]:
header1,header2,..
header1 data for data[0], header2 data for data[0]
Why is this the case?
When you open a file with w, it erases the previous data.
From the docs:
w: open for writing, truncating the file first
So when you open the file with w after writing the scraped data, you get a blank file, and then you write the header to it, which is why you only see the header. Try replacing w with a. The new call to open the file would look like this:
outputFile = csv.writer(open('myFilepath', 'a'))
You can find more information about the file modes here.
Ref: How do you append to a file?
Edit after DYZ's comment:
You should also be closing the file after you are done appending. I would suggest using the file like this:
with open('path/to/file', 'a') as file:
    outputFile = csv.writer(file)
    # Do your work with the file
This way you don't have to worry about remembering to close it. Once the code exits the with block, the file will be closed.
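Putting both suggestions together, here's a minimal sketch of the scraping loop inside a with block (assuming the header row has already been written once, and reusing the scrap() function and data list from the question):

import csv

# append scraped rows to the existing file; the header row is assumed to be written already
with open('myFilepath', 'a', newline='') as f:
    outputFile = csv.writer(f)
    for each in data:
        scrapedData = scrap(each)
        outputFile.writerow([scrapedData.get('header1', 'header 1 NA'),
                             scrapedData.get('header2', 'header 2 NA')])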
I would use Pandas for this:
import pandas as pd
headers = ['header1', 'header2', ...]
scraped_df = pd.DataFrame(data, columns=headers)
scraped_df.to_csv('filepath.csv')
Here I'm assuming your data object is a list of lists.
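If you don't want the DataFrame's numeric index written as an extra first column, you can also pass index=False to to_csv.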
Related
I created an Azure DevOps query and chose 'download results as csv', which gave me a csv file. If I open this csv in VS Code, I can see in the bottom right corner it says UTF-8 with BOM.
I am trying to write a Python function that will read in each value of this csv file. I cannot rely on parsing the text myself and splitting values on the , comma character, because I will have values that include commas inside them.
If I open my csv in Excel, everything is organized perfectly. But if I try to parse the file in Python, it reads in every row as a single string separated by commas (bad):
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                print('row=',row)
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
console output:
filename: csv_migration\ads-test-direct-download.csv, id_format: ID, encoding: utf-8-sig
row= ['Title,State,Work Item Type,ID,12NC']
row= ['TITLE,WITH COMMAS,To Do,NAME,6034,"value,with,commas"']
done
How can I read this file in Python so it separates each value into a list, instead of this single string?
I get the same result with encodingVar='utf-8'. Should I open my csv in some app like Notepad++ and convert it to utf-16? My code works great for .csv files with utf-16 encoding; it can parse each individual value into a list no problem. Why won't this work with a utf-8 BOM csv, even when Excel can parse the individual values perfectly fine?
csv file: https://file.io/TXh6uyXKZaug
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                # row is a one-element list (the whole line), since the file contains no tabs
                row_as_list = row[0].split(",")  # <-- Gets line as list!
                print('row=',row_as_list)
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
This snippet splits the line into a list that you can index to get the information out
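That said, since the file is actually comma-delimited, a simpler option may be to let csv.reader use its default comma delimiter, which also keeps quoted values containing commas intact (a minimal sketch, reusing the question's file path and utf-8-sig encoding):

from csv import reader

with open("csv_migration\\ads-test-direct-download.csv", 'r', encoding='utf-8-sig') as read_obj:
    for row in reader(read_obj):  # default delimiter ',' splits fields and respects quoted commas
        print('row=', row)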
I'm new to Python, and the task I am performing is to extract specific key values from a list of .iris files (each containing data in a nested dictionary format) in a specific directory.
I want to extract the specific values, save them to a new .csv file, and repeat this for all the other files.
Below is a sample .iris file from which I should extract only these keys ('uid','enabled','login','name').
{"streamType":"user",
"uid":17182,
"enabled":true,
"login":"xyz",
"name":"abcdef",
"comment":"",
"authSms":"",
"email":"",
"phone":"",
"location":"",
"extraLdapOu":"",
"mand":997,
"global":{
"userAccount":"View",
"uid":"",
"retention":"No",
"enabled":"",
"messages":"Change"},
"grants":[{"mand":997,"role":1051,"passOnToSubMand":true}],
I am trying to convert the .iris files to .json and read the files one by one, but unfortunately I am not getting the exact output desired.
Please, could anyone help me?
My code (added from comments):
import os
import csv

path = ''
os.chdir(path)

# Read iris File
def read_iris_file(file_path):
    with open(file_path, 'r') as f:
        print(f.read())

# iterate through all files
for file in os.listdir():
    # Check whether file is in iris format or not
    if file.endswith(".iris"):
        file_path = f"{path}\{file}"
        # call read iris file function
        print(read_iris_file(file_path))
Your files contain data in JSON format, so we can use the built-in json module to parse it. To iterate over the files with a certain extension you can use Path.glob() with the pattern "*.iris". Then we can use csv.DictWriter() and pass "ignore" as the extrasaction argument, which makes DictWriter ignore keys we don't need and write only those passed to the fieldnames argument.
Code:
import csv
import json
from pathlib import Path

path = Path(r"path/to/folder")
keys = "uid", "enabled", "login", "name"

with open(path / "result.csv", "w", newline="") as out_f:
    writer = csv.DictWriter(out_f, fieldnames=keys, extrasaction='ignore')
    writer.writeheader()
    for file in path.glob("*.iris"):
        with open(file) as inp_f:
            data = json.load(inp_f)
        writer.writerow(data)
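Passing newline="" when opening the output file is the csv module's recommended way to avoid extra blank lines between rows on Windows.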
Try the below (the key point here is loading the iris file using ast)
import ast

fields = ('uid','enabled','login','name')

with open('my.iris') as f1:
    data = ast.literal_eval(f1.read())

with open('my.csv','w') as f2:
    f2.write(','.join(fields) + '\n')
    f2.write(','.join(str(data[f]) for f in fields) + '\n')
my.csv
uid,enabled,login,name
17182,true,xyz,abcdef
I'm a beginner with Python and I'm trying to automate some tasks. What I cannot do is iterate through each url inside a large csv file after I read them with pandas and chunksize:
import pandas as pd
import urllib.request, json
import csv

pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 9999999)

finalUrlList = []

# Basically I want to append each URL from the csv to apiBase, then read the url and retrieve the JSON for each url and save it to a new csv file
apiBase = "https://script.google.com/macros/s/AKfycbykfWnqp7urCXZLmOOGnuWz6OcAufTFWNoOMHIew2nh3CWKriZS/exec?page="
csv_url='/Users/Andrea/Desktop/test.csv'

# use chunk size
c_size = 50000
df_chunk = pd.read_csv(csv_url, chunksize=c_size, iterator=True)

# iterate through each url in the chunks and append it to the apiBase, then add it to a list
for chunk in df_chunk:
    urlToParse = apiBase + chunk
    finalUrlList.append(urlToParse)

# iterate through each element of the list and process the url to retrieve json data
index = 0
while index < len(finalUrlList):
    try:
        with urllib.request.urlopen(finalUrlList[index]) as urlToProcess:
            data = json.loads(urlToProcess.read().decode())
        index = index + 1
    except Exception:
        print("An error occurred. I will try again!")
        pass

# Write data into a new csv file
csvfile = "IndexedUrls.csv"
try:
    with open(csvfile, "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        for val in data:
            writer.writerow([val])
    print("Csv file saved successfully!")
except:
    print("An error occured, couldn't save csv file!")
The first part, reading the big file in chunks, is successful: Python can read the csv very fast. But then I cannot iterate through each url of the csv and perform the json reading task on each of them (maybe with multiprocessing to go faster in this last step, because it also takes time to open the url, get the result, store it, etc.).
Is there a fast way to achieve all this? Thanks a lot for your help, and sorry if the code is rough, but I'm a beginner and willing to learn a lot.
THANKS!
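For what it's worth, a minimal sketch of the per-chunk iteration (assuming the csv holds the URL fragments in its first column; adjust the column selection to your actual file):

import pandas as pd

apiBase = "https://script.google.com/macros/s/AKfycbykfWnqp7urCXZLmOOGnuWz6OcAufTFWNoOMHIew2nh3CWKriZS/exec?page="
finalUrlList = []

for chunk in pd.read_csv('/Users/Andrea/Desktop/test.csv', chunksize=50000):
    # iterate over the values in the first column rather than concatenating the whole DataFrame
    for value in chunk.iloc[:, 0]:
        finalUrlList.append(apiBase + str(value))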
I'm new to Python and I've got a large JSON file that I need to convert to CSV - below is a sample:
{ "status": "success","Name": "Theresa May","Location": "87654321","AccountCategory": "Business","AccountType": "Current","TicketNo": "12345-12","AvailableBal": "12775.0400","BookBa": "123475.0400","TotalCredit": "1234567","TotalDebit": "0","Usage": "5","Period": "May 11 2014 to Jul 11 2014","Currency": "GBP","Applicants": "Angel","Signatories": [{"Name": "Not Available","BVB":"Not Available"}],"Details": [{"PTransactionDate":"24-Jul-14","PValueDate":"24-Jul-13","PNarration":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"24-Jul-14","PValueDate":"23-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"25-Jul-14","PValueDate":"22-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"25-Jul-14","PValueDate":"21-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"25-Jul-14","PValueDate":"20-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"}]}
I need this to show up as
name, status, location, accountcategory, accounttype, availablebal, totalcredit, totaldebit, etc. as columns,
with pcredit, pdebit, pbalance, ptransactiondate, pvaluedate and ptest getting new values in each row, as the JSON file shows.
I've managed to put the script below together from looking online, but it's showing me an empty csv file at the end. What have I done wrong? I have used online json-to-csv converters and they work; however, as these are sensitive files I'm hoping to write and manage my own script so I can see exactly how it works. Please see below for my Python script - can I have some advice on what to change? Thanks.
import csv
import json
infile = open("BankStatementJSON1.json","r")
outfile = open("testing.csv","w")
writer = csv.writer(outfile)
for row in json.loads(infile.read()):
    writer.writerow(row)
import csv, json, sys
# if you are not using utf-8 files, remove the next line
sys.setdefaultencoding("UTF-8") # set the encode to utf8
# check if you pass the input file and output file
if sys.argv[1] is not None and sys.argv[2] is not None:
    fileInput = sys.argv[1]
    fileOutput = sys.argv[2]
inputFile = open("BankStatementJSON1.json","r") # open json file
outputFile = open("testing2.csv","w") # load csv file
data = json.load("BankStatementJSON1.json") # load json content
inputFile.close() # close the input file
output = csv.writer("testing.csv") # create a csv.write
output.writerow(data[0].keys()) # header row
for row in data:
    output.writerow(row.values()) # values row
This works for the JSON example you posted. The issue is that you have nested dicts, and you can't create sub-headers and sub-rows for pcredit, pdebit, pbalance, ptransactiondate, pvaluedate and ptest as you want.
You can use csv.DictWriter:
import csv
import json
with open("BankStatementJSON1.json", "r") as inputFile: # open json file
data = json.loads(inputFile.read()) # load json content
with open("testing.csv", "w") as outputFile: # open csv file
output = csv.DictWriter(outputFile, data.keys()) # create a writer
output.writeheader()
output.writerow(data)
Make sure you're closing the output file at the end as well.
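If you do need one row per entry in Details, one possible extension is to repeat the top-level fields on every transaction row (a rough sketch under that assumption, not part of the original answer; the output filename testing_flat.csv is made up):

import csv
import json

with open("BankStatementJSON1.json", "r") as inputFile:
    data = json.loads(inputFile.read())

details = data.pop("Details", [])   # list of per-transaction dicts
data.pop("Signatories", None)       # ignore the other nested list in this sketch

fieldnames = list(data.keys()) + sorted({k for d in details for k in d})
with open("testing_flat.csv", "w", newline="") as outputFile:
    output = csv.DictWriter(outputFile, fieldnames=fieldnames)
    output.writeheader()
    for d in details:
        output.writerow({**data, **d})  # top-level fields repeated on every row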
The following script for writing to a CSV file is going to run on a server which will automate the run.
import pandas

d = {'col1': a, 'col2': b, 'col3': c}
df = pandas.DataFrame(d, index=[0])
with open('foo.csv', 'a') as f:
    df.to_csv(f, index=False)
The problem is, every time I run it, the header gets copied to the CSV file. How can I modify this code so the header is written to the CSV file only the first time it's run, and never after that?
Any help will be appreciated :)
try this:
import os

filename = '/path/to/file.csv'
df.to_csv(filename, index=False, mode='a', header=(not os.path.exists(filename)))
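This writes the header only when the file doesn't already exist, and appends (mode='a') on later runs, so the header isn't duplicated.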