I have a list of a million pins and one URL that has a pin within the URL
Example:
https://www.example.com/api/index.php?pin=101010&key=113494
I have to replace the number "101010" (the pin) with each of the million values, like 093939, 493943, 344454, that I have in a CSV file, and then save all of those new URLs to a CSV file.
Here's what I have tried doing so far that has not worked:
def change(var_data):
    var = str(var_data)
    url = 'https://www.example.com/api/index.php?pin=101010&key=113494'
    url1 = url.split('=')
    url2 = ''.join(url1[:-2] + [var] + [url1[-1]])
    print(url2)

change('xxxxxxxxxx')
Also, each of these URLs is an API request that returns a JSON page. Would using Python and then iterating through the URLs I save in a CSV file be the best way to do this? I want to collect some information for all of the pins that I have and save it to a BigQuery database, or somewhere else that I can connect to Google Data Studio, so that I can build a dashboard from all of this data.
Any ideas? What do you think the best way of getting this done would be?
Answering the first part of the question: the modified change function builds each URL with an f-string.
It can then be applied to every pin via a list comprehension.
The variable url_variables stands in for the list of integers you are reading in from the other file.
The resulting URL list is then written to rows in a CSV file.
import csv

url_variables = [93939, 493943, 344454]

def change(var_data):
    var_data = str(var_data)
    url = 'https://www.example.com/api/index.php?pin='
    key = 'key=113494'
    new_url = f'{url}{var_data}&{key}'
    return new_url

url_list = [change(x) for x in url_variables]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for val in url_list:
        writer.writerow([val])
Output in output.csv: one URL per row.
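If the pins live in a CSV file rather than a hard-coded list, a minimal sketch for loading them might be the following (the filename pins.csv and the assumption that the pins sit in the first column are mine, not from the question):

import csv

with open('pins.csv', newline='') as f:
    url_variables = [row[0] for row in csv.reader(f) if row]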
1. First part of the question (replace the number between "pin=" and "&"): I will use an answer from the "Change a text between two strings in Python with Regex" post:
import re

def change(var_data):
    var = str(var_data)
    url = 'https://www.example.com/api/index.php?pin=101010&key=113494'
    url2 = re.sub("(?<=pin=).*?(?=&)", var, url)
    print(url2)

change('xxxxxxxxxx')
Here I use the sub method from the built-in package "re" and the regex lookaround syntax, where:
(?<=pin=)  # asserts that what immediately precedes the current position in the string is "pin="
.*?        # matches any characters, as few as possible (non-greedy)
(?=&)      # asserts that what immediately follows the current position in the string is "&"
Here is a formal explanation of the lookaround syntax.
2. Second part of the question: As another answer explains, you can write the URLs to the CSV file as rows, but I recommend reading this post about handling CSV files with Python to get an idea of how you want to save them.
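For the actual API calls, a rough sketch might look like the one below. It assumes you have already built the list of URLs (for example the url_list from the other answer); the output filename pin_results.csv and the idea of storing the raw JSON per row are my own choices, since the real response schema isn't shown, and for a million pins you would also want batching, retries, and rate limiting.

import csv
import json
import requests

def fetch_pin_data(url):
    """Request one URL and return the parsed JSON, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.json()
    except (requests.RequestException, ValueError):
        return None

with open('pin_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'response_json'])  # adapt the columns to the real schema
    for url in url_list:
        data = fetch_pin_data(url)
        if data is not None:
            writer.writerow([url, json.dumps(data)])

A CSV (or newline-delimited JSON) file produced this way can then be loaded into BigQuery and connected to Google Data Studio.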
I am not very good at English, but I hope I have explained myself well.
I'm fairly new to Python and I am attempting to group various postcodes together under predefined labels. For example, "SA31" would be labelled a "HywelDDAPostcode".
I have some code where I read lots of postcodes from a single-column file into a list and compare them with postcodes that are in predefined lists. However, when I output my postcode labels, only the label "UKPostcodes" is output for every postcode in my original file. It would appear that the first two conditions in my code always evaluate to false, no matter what. Am I doing the right thing using "in"? Or perhaps it's a file reading issue? I'm not sure.
The input file is simply a file which contains a list of postcodes (in reality it has thousands of rows).
[screenshot of the CSV file]
Here is my code:
import csv
with open('postcodes.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)

my_list = []
HywelDDAPostcodes=["SA46","SY23","SY24","SA18","SA16","SA43","SA31","SA65","SA61","SA62","SA17","SA48","SA40","SA19","SA20","SA44","SA15","SA14","SA73","SA32","SA67","SA45",
"SA38","SA42","SA41","SA72","SA71","SA69","SA68","SA33","SA70","SY25","SA34","LL40","LL42","LL36","SY18","SY17","SY20","SY16","LD6"]
NationalPostcodes=["LL58","LL59","LL60","LL61","LL62","LL63","LL64","LL65","LL66","LL67","LL68","LL69","LL70","LL71","LL72","LL73","LL74","LL75","LL76","LL77","LL78",
"NP1","NP2","NP23","NP3","CF31","CF32","CF33","CF34","CF35","CF36","CF3","CF46","CF81","CF82","CF83","SA35","SA39","SA4","SA47","LL16","LL18","LL21","LL22","LL24","LL25","LL26","LL27","LL28","LL29","LL30","LL31","LL32","LL33","LL34","LL57","CH7","LL11","LL15","LL16","LL17","LL18","LL19","LL20","LL21","LL22","CH1","CH4","CH5","CH6","CH7","LL12","CF1","CF32","CF35","CF5","CF61","CF62","CF63","CF64","CF71","LL23","LL37","LL38","LL39","LL41","LL43","LL44","LL45","LL46","LL47","LL48","LL49","LL51","LL52","LL53","LL54","LL55","LL56","LL57","CF46","CF47","CF48","NP4","NP5","NP6","NP7","SA10","SA11","SA12","SA13","SA8","CF3","NP10","NP19","NP20","NP9","SA36","SA37","SA63","SA64","SA66","CF44","CF48","HR3","HR5","LD1","LD2","LD3","LD4","LD5","LD7","LD8","NP8","SY10","SY15","SY19","SY21","SY22","SY5","CF37","CF38","CF39","CF4","CF40","CF41","CF42","CF43","CF45","CF72","SA1","SA2","SA3","SA4","SA5","SA6","SA7","SA1","NP4","NP44","NP6","LL13","LL14","SY13","SY14"]
NationalPostcodes2= list(dict.fromkeys(NationalPostcodes))
labels=["HywelDDA","NationalPostcodes","UKPostcodes"]
for postcode in your_list:
    #print(postcode)
    if postcode in HywelDDAPostcodes:
        my_list.append(labels[0])
    if postcode in NationalPostcodes2:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])

with open('DiscretisedPostcodes.csv', 'w') as result_file:
    wr = csv.writer(result_file, dialect='excel')
    for item in my_list:
        wr.writerow([item,])
If anyone has any advice as to what could be causing the issue or just any advice surrounding Python, in general, I would very much appreciate it. Thank you!
The reason why your comparison block isn't working is that when you use csv.reader to read your file, each line is added to your_list as a list. So you are making a list of lists, and comparing a list against a string never matches:
['LL58'] == 'LL58' # fails
So inspect your_list and see what I mean. You should create an empty your_list before you read the file and append each row's postcode to it, then inspect that to make sure it looks right. It would also behoove you to use strip() to remove whitespace from each item; I can't recall if csv.reader does that automatically.
Also... a better structure for testing for membership is to use sets instead of lists. in will work for lists, but it is MUCH faster for sets, so I would put your comparison items into sets.
Lastly, it isn't clear what you are trying to do with NationalPostcodes2. Just use your NationalPostcodes, but put them in a set with {}.
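A minimal sketch of those suggestions, reusing the predefined lists from your question: read the postcodes as a flat list of stripped strings, then test membership against sets.

import csv

with open('postcodes.csv', newline='') as f:
    your_list = [row[0].strip() for row in csv.reader(f) if row]

# build the sets once from the predefined lists in the question
hywel_dda = set(HywelDDAPostcodes)
national = set(NationalPostcodes)

my_list = []
for postcode in your_list:
    if postcode in hywel_dda:
        my_list.append("HywelDDA")
    elif postcode in national:
        my_list.append("NationalPostcodes")
    else:
        my_list.append("UKPostcodes")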
@Jeff H's answer is correct, but for what it's worth, here's how I might write this code (untested):
# Note: Since, as you wrote, these are only single-column files I did not use the csv
# module, as it will just add additional unnecessary overhead.
# Read the known data from files--this will always be more flexible and maintainable than
# hard-coding them in your code. This is just one possible scheme for doing this; e.g.
# you could also put all of them into a single JSON file
standard_postcode_files = {
    'HywelDDA': 'hyweldda.csv',
    'NationalPostcodes': 'nationalpostcodes.csv',
    'UKPostcodes': 'ukpostcodes.csv'
}

def read_postcode_file(filename):
    with open(filename) as f:
        # exclude blank lines and strip additional whitespace
        return [line.strip() for line in f if line.strip()]

standard_postcodes = {}
for key, filename in standard_postcode_files.items():
    standard_postcodes[key] = set(read_postcode_file(filename))
# Assuming all post codes are unique to a set, map postcodes to the set they belong to
postcodes_reversed = {v: k for k, s in standard_postcodes.items() for v in s}
your_postcodes = read_postcode_file('postcodes.csv')
labels = [postcodes_reversed[code] for code in your_postcodes]
with open('DiscretisedPostCodes.csv', 'w') as f:
    for label in labels:
        f.write(label + '\n')
I would probably do other things too, like not hard-coding the input filename. If you need to work with multiple columns, using the csv module would also be fine with minimal additional changes, but since you're just writing one item per line, I figured it was unnecessary.
I'm learning web scraping and I managed to pull data out of a webpage into an Excel file. But the item names contain "," and this caused the item names to spill across multiple columns in the Excel file.
I have tried using strip and replace elements in the list but it returns an error saying: AttributeError: 'WebElement' object has no attribute 'replace'.
item = driver.find_elements_by_xpath('//h2[@class="list_title"]')
item = [i.replace(",", "") for i in item]
price = driver.find_elements_by_xpath('//div[@class="ads_price"]')
price = [p.replace("rm", "") for p in price]
Expected result in the Excel file: [screenshot]
Actual result in the Excel file: [screenshot]
The find_elements_by_xpath function returns a list of WebElement objects; you need to get each element's text as a string before you can use replace.
Depending on your use case you may want to reconsider using excel as your storage medium, unless this is the final step of your process.
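For example, assuming driver is the webdriver instance already created in your script, you can read each element's text via its .text attribute and then clean it:

items = driver.find_elements_by_xpath('//h2[@class="list_title"]')
item_names = [i.text.replace(",", "") for i in items]

prices = driver.find_elements_by_xpath('//div[@class="ads_price"]')
price_values = [p.text.replace("rm", "") for p in prices]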
The portion of your code that you've included in your question isn't the portion that's relevant to the issue you're experiencing.
As CMMCD mentioned, I would also recommend skipping the binary Excel format for the sake of simplicity and using the built-in csv library instead. The csv writer quotes fields that contain the delimiter, which prevents unintended separators from splitting your cells:
from csv import writer

# your data should be a list of lists
data = [['product1', 8.0], ['product2', 12.25]]  # etc, as an example

with open('your_output_file.csv', 'w', newline='') as file:
    mywriter = writer(file)
    for line in data:
        mywriter.writerow(line)
The docs: https://docs.python.org/3/library/csv.html
I have some JSON files of 500 MB.
If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)

If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
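A sketch of the one-process-per-file idea (the helper script name parse_one.py and the file names are hypothetical):

import subprocess
import sys

list_of_files = ['a.json', 'b.json', 'c.json']

for json_file in list_of_files:
    # each file is parsed in a fresh interpreter, so all of its memory is
    # released when the child process exits
    subprocess.check_call([sys.executable, 'parse_one.py', json_file])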
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrary-sized chunks. You can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The workings of ijson are very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        for j in jsonobj:
            print(j)
Note: 'item' is the prefix ijson gives to each element of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'item.drug')  # 'drug' matches the key in the sample data
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those drug objects whose type is tablet.
On your mention of running out of memory, I must question whether you're actually managing memory. Are you using the del keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory once you remove the last reference to it.
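For instance, in CPython dropping the last reference frees the parsed object right away (a tiny sketch; the file names are placeholders):

import json

with open('big1.json') as f:
    data = json.load(f)
# ... work with data ...

del data                 # drop the only reference; CPython frees the object immediately
with open('big2.json') as f:
    data = json.load(f)  # rebinding 'data' would also have released the old object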
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
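As an illustration of that idea, here is a rough sketch of such a generator for the common case of a file that holds one top-level JSON array; it tracks string state and nesting depth by hand and yields one element at a time (in practice a streaming library such as ijson, mentioned in another answer, does this more robustly):

import json

def iter_array_items(path, chunk_size=65536):
    # Yield the top-level items of a JSON array one at a time.
    # Assumes the file contains a single JSON array.
    depth = 0            # nesting depth relative to the outer array
    in_string = False
    escaped = False
    started = False      # have we passed the opening '[' yet?
    buf = []

    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            for ch in chunk:
                if not started:
                    if ch == '[':
                        started = True
                    continue
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                    continue
                if ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch in '{[':
                    depth += 1
                    buf.append(ch)
                elif ch in '}]':
                    if depth == 0 and ch == ']':
                        # end of the outer array: emit whatever is buffered
                        text = ''.join(buf).strip()
                        if text:
                            yield json.loads(text)
                        return
                    depth -= 1
                    buf.append(ch)
                elif ch == ',' and depth == 0:
                    text = ''.join(buf).strip()
                    if text:
                        yield json.loads(text)
                    buf = []
                else:
                    buf.append(ch)

for obj in iter_array_items('huge.json'):
    pass  # process each element without holding the whole array in memory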
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON, you can avoid it by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
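A minimal sketch of that route, streaming a top-level JSON array with ijson and inserting it in batches via pymongo (the database and collection names, the file name, and the batch size are arbitrary choices here):

import ijson
from pymongo import MongoClient

client = MongoClient()                # assumes a MongoDB instance on localhost
collection = client['mydb']['docs']

batch = []
with open('huge.json', 'rb') as f:
    for obj in ijson.items(f, 'item'):   # 'item' matches each element of a top-level array
        # note: ijson yields Decimal for non-integer numbers, which may need
        # converting to float before insertion
        batch.append(obj)
        if len(batch) >= 1000:
            collection.insert_many(batch)
            batch = []
if batch:
    collection.insert_many(batch)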
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
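A small sketch of that advice: the parsing happens inside a function, so each parsed blob becomes garbage as soon as the function returns, and only a small summary survives (summarize here is a placeholder for whatever per-file work you actually do):

import json

def summarize(data):
    # placeholder reduction: keep only something small per file
    return len(data)

def process_file(path):
    with open(path) as f:
        data = json.load(f)   # local variable, collectable after return
    return summarize(data)

def main(paths):
    return [process_file(p) for p in paths]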
In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
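Rather than a fully custom parser, a quick way to get that key-name overview is ijson's event stream; this sketch prints each distinct key path once, which is usually enough to decide where to chunk:

import ijson

def print_key_paths(filename):
    seen = set()
    with open(filename, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            if event == 'map_key':
                path = f'{prefix}.{value}' if prefix else value
                if path not in seen:
                    seen.add(path)
                    print(path)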
You can convert the JSON file to a CSV file, parsing it incrementally with ijson:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'rb'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time and memory. Instead, if the file holds one JSON object per line, you can load the data line by line into a dictionary of key/value pairs, add each dictionary to a final dictionary, and convert that to a pandas DataFrame, which will help you with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}

for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        #print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data holds the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T
First question here so forgive any lapses in the etiquette.
I'm new to Python. I have a small project I'm trying to accomplish, both for practical reasons and as a learning experience, and maybe some people here can help me out. There's a proprietary system I regularly retrieve data from. Unfortunately they don't use standard CSV format; they use a strange character to separate data, a ‡. I need it in CSV format in order to import it into another system. So what I need to do is take the data, replace the special character with a comma, and clean up the data by removing whitespace and other minor things like unrecognized characters, so that it's in the CSV form I need for the import.
I want to learn some python so I figured I'd write it in python. I'll be reading it from a webservice URL, but for now I just have some test data in the same format I'd receive.
In reality it will be tons of data per request but I can scale it when I understand how to retrieve and manipulate the data properly.
My code so far just trying to read and write two columns from the data:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0')
data = r.text

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for elem in data:
        f.writerow([elem["PlayerID"], elem["Partner"]])
I'm getting this error.
File "csvTest.py", line 14, in
f.writerow([elem["PlayerID"], elem["Partner"]])
TypeError: string indices must be integers
It's probably evident from that that I don't know how to read or manipulate the data properly. I was able to pull back some JSON data and output it, so I know the core structure works with standardized data.
Thanks in advance for any tips.
I'll continue to poke at it.
Sample data is at the dropbox link mentioned in the script.
https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0
There are multiple problems. First, the link is incorrect, since it returns the html. To get the raw file, use:
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
Then, data is a string, so elem in data will iterate over all the characters of the string, which is not what you want.
Then, your data is Unicode, not a byte string, so you need to encode it first.
Here is your program, with some changes:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')

data = r.text.encode('utf-8').replace("\xc2\x87", ",").splitlines()

headers = data[0].split(",")
pidx = headers.index('PlayerID')
partidx = headers.index('Partner')

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for line in data[1:]:
        words = line.split(',')
        f.writerow([words[pidx], words[partidx]])
Output:
PlayerID,Partner
1038005,EXT
254034,EXT
Use split:
lines = data.split('\n')  # split your data to lines
headers = lines[0].split('‡')

player_index = headers.index('PlayerID')
partner_index = headers.index('Partner')

for line in lines[1:]:  # skip the headers line
    words = line.split('‡')  # split each line by the delimiter '‡'
    print words[player_index], words[partner_index]
For this to work, define the encoding of your python source code as UTF-8 by adding this line to the top of your file:
# -*- coding: utf-8 -*-
Read more about it in PEP 0263.