How to delete elements of a list of dictionaries - python

I have a JSON file of businesses, open and closed. I need to count the number of open businesses, which is why I wrote this, but it returns None. Note that I have to use functions. Also, I'm not using a simple counter because I have to actually delete the closed businesses, since I have to do more with them afterwards. This is not a duplicate: I tried what the other post says and it gives me 0.
Here is what an entry of the json file looks like:
{
    "business_id": "1SWheh84yJXfytovILXOAQ",
    "name": "Arizona Biltmore Golf Club",
    "address": "2818 E Camino Acequia Drive",
    "city": "Phoenix",
    "state": "AZ",
    "postal_code": "85016",
    "latitude": 33.5221425,
    "longitude": -112.0184807,
    "stars": 3.0,
    "review_count": 5,
    "is_open": 0,
    "attributes": {
        "GoodForKids": "False"
    },
    "categories": "Golf, Active Life",
    "hours": null
}
import json

liste_businesses = []
liste_open = []

def number_entries(liste_businesses):
    with open('yelp.txt') as file:
        for line in file:
            liste_businesses.append(json.loads(line))
    return len(liste_businesses)

def number_open(liste_businesses):
    for e in range(len(liste_businesses)):
        if 'is_open' not in liste[e]:
            liste_open = liste_businesses.remove(liste[e])
        if int(liste[e]['is_open']) == int(0):
            liste_open = liste_businesses.remove(liste[e])

print(number_open(liste_businesses))

Your number_open function prints None because it never returns anything; as written it would also raise a NameError, since liste is never defined, and removing items from a list while iterating over its indices skips elements. Unless you're dealing with memory constraints, it's probably simpler to just iterate over your list of businesses and select the open ones:
def load_businesses():
    businesses = []
    with open('yelp.txt') as file:
        for line in file:
            businesses.append(json.loads(line))
    # More idiomatic to return a list than to modify global state
    return businesses

def get_open_businesses(businesses):
    # Make a new list rather than modifying the old one
    open_businesses = []
    for business in businesses:
        # is_open is an integer (0 or 1) in the sample data, so compare to 0
        if business.get('is_open', 0) != 0:
            open_businesses.append(business)
    return open_businesses

businesses = load_businesses()
open_businesses = get_open_businesses(businesses)
print(len(open_businesses))
If you wanted to use a list comprehension for the open businesses:
[b for b in businesses if b.get('is_open', 0) != 0]
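Since the question mentions needing the closed businesses for further work, here is a minimal sketch (assuming the same is_open field and the load_businesses function above) that partitions the list in one pass instead of deleting from it while iterating:

def partition_businesses(businesses):
    # Split into (open, closed) in a single pass; nothing is removed in place
    open_businesses, closed_businesses = [], []
    for business in businesses:
        if business.get('is_open', 0) != 0:
            open_businesses.append(business)
        else:
            closed_businesses.append(business)
    return open_businesses, closed_businesses

open_businesses, closed_businesses = partition_businesses(load_businesses())
print(len(open_businesses), len(closed_businesses))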

Related

Form new lists from elements in an existing list and assign a worksheet name

The title may sound confusing...but this is what I need to do:
I have a list (which will be variable in length, with different values depending on various scenarios), e.g: list1 = ['backup', 'downloadMedia', 'createAlbum']. From this list, I need to create one of the following for each of these items: (and obviously the name will update depending on the item in the list)
I need to create a new list called: testcases_backup = []
I need to create a new list called: results_backup = []
I need to create a new list called: screenshot_paths_backup = []
And lastly, I need to open a new worksheet, which requires: worksheet1 = workbook.add_worksheet('Results'). Of note, I will need to iterate 1, 2, 3, ... for the worksheet name for each of the items in the list: worksheet1 for the first iteration ('backup'), worksheet2 for 'downloadMedia', etc.
I have tried using dictionaries, but at this point I am not making any real progress.
My attempt (I have very limited experience with dictionaries):
master_test_list = ['backup', 'downloadMedia', 'createAlbum']
master_test_dict = {}

def addTest(test, worksheet, testcases_list, results_list, screenshots_path_list):
    master_test_dict[test] = worksheet
    master_test_dict[test] = testcases_list
    master_test_dict[test] = results_list
    master_test_dict[test] = screenshots_path_list

for test in master_test_list:
    addTest(test, "worksheet" + str(master_test_list.index(test) + 1),
            "testcases_list_" + test, "results_list_" + test,
            "screenshots_path_list_" + test)

print(results_list_backup)
print(results_list_backup)
I thought this might work...but I just get strings inside the lists, and so I cannot define them as lists:
worksheets = []
for i in range(len(master_test_list)):
    worksheets.append(str(i + 1))

worksheets = ["worksheet%s" % x for x in worksheets]
testcases = ["testcases_list_%s" % x for x in master_test_list]
results = ["results_%s" % x for x in master_test_list]
screenshot_paths = ["screenshot_paths_%s" % x for x in master_test_list]

for w in worksheets:
    w = workbook.add_worksheet('Results')
for t in testcases:
    t = []
for r in results:
    r = []
for s in screenshot_paths:
    s = []
Adding a second answer since the code is significantly different, addressing the specified request for how to create n copies of lists:
def GenerateElements():
    # Insert your code which generates your list here
    myGeneratedList = ['backup', 'downloadMedia', 'createAlbum']
    return myGeneratedList

def InitDict(ListOfElements):
    # Don't make a new free-floating list for each element of list1.
    # Generate and store the lists you want in a dictionary.
    return dict([[x, []] for x in ListOfElements])

def RunTest():
    for myContent in list1:
        # Do whatever you like to generate the data you need
        myTestCaseList = ['a', 'b']
        myResultsList = [1, 2]
        myScreenshot_Paths_List = ['sc1', 'sc2']
        # 1 Store your created list for the test case of item 'myContent' from list1 in a dictionary
        testcases[myContent].append(myTestCaseList)
        # 2 Same, but for your results list
        results[myContent].append(myResultsList)
        # 3 Same, but for your screenshot_paths list
        screenshot_paths[myContent].append(myScreenshot_Paths_List)
        # 4 Make an excel sheet named after the item from list1
        # run_vba_macro("C:\\Users\\xx-_-\\Documents\\Coding Products\\Python (Local)\\Programs\\Python X Excel\\AddSheets.xlsm", "SheetAdder", "AddASheet", myContent)

list1 = GenerateElements()
testcases, results, screenshot_paths = InitDict(list1), InitDict(list1), InitDict(list1)
NumTests = 5  # Number of tests you want
for x in range(NumTests):
    RunTest()
What's going on here is just defining some initialization functions and then exercising them in a couple of lines.
My understanding is that you are running a series of tests, where you want a list of the inputs and outputs to be a running tally kind of thing. As such, this code uses a dictionary to store a list of lists. The dictionary key is how you identify which log you're looking at: test cases log vs results log vs screenshot_paths log.
As per my understanding of your requirements, each dictionary element is a list of lists where the 1st list is just the output of the first test. The second list is the first with the outcome of the second test/result appended to it. This goes on, so the structure looks like:
testcases= [ [testcase1] , [testcase1,testcase2] , [testcase1,testcase2,testcase3] ]
etc.
If this isn't exactly what you want you can probably modify it to suit your needs.
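As an aside, a dict comprehension is the more idiomatic way to write InitDict; a minimal equivalent sketch:

def InitDict(ListOfElements):
    # One fresh, independent empty list per element
    return {x: [] for x in ListOfElements}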
Your explanation leaves some things to be imagined, but I think I've got what you need. There are two files: the .py Python file and an Excel file, which is the spreadsheet serving as the foundation for adding sheets. You can find the ones I made on my GitHub:
https://github.com/DavidD003/LearningPython
Here is the Excel code, shared first because it's shorter. If you don't want to download mine, then make a workbook called 'AddSheets.xlsm' with a module called 'SheetAdder', and within that module put the following code:
Public Sub AddASheet(nm)
    'DisplayAlerts is reset on the workbook open event, since we need it
    'to be False here right up to the point of saving and closing
    Application.DisplayAlerts = False
    Dim NewSheet As Worksheet
    Set NewSheet = ThisWorkbook.Sheets.Add
    NewSheet.Name = nm
End Sub
Make sure to add this to the 'ThisWorkbook' code in the 'Microsoft Excel Objects' folder of the VBA project:
Private Sub Workbook_Open()
    Application.DisplayAlerts = True
End Sub
The python script is as follows:
See [this question][1] for an example of how to format the file path as a string for the function argument. I removed mine here.
import win32com.client as wincl
import os

# Following modified from https://stackoverflow.com/questions/58188684/calling-vba-macro-from-python-with-unknown-number-of-arguments
def run_vba_macro(str_path, str_modulename, str_macroname, shtNm):
    if os.path.exists(str_path):
        xl = wincl.DispatchEx("Excel.Application")
        wb = xl.Workbooks.Open(str_path, ReadOnly=0)
        xl.Visible = True
        xl.Application.Run(os.path.basename(str_path) + "!" +
                           str_modulename + '.' + str_macroname, shtNm)
        wb.Save()
        wb.Close()
        xl.Application.Quit()
        del xl
# Insert your code which generates your list here
list1 = ['backup', 'downloadMedia', 'createAlbum']

# Don't make a new free-floating list for each element of list1.
# Generate and store the lists you want in a dictionary.
testcases = dict([[x, []] for x in list1])
results = dict([[x, []] for x in list1])
screenshot_paths = dict([[x, []] for x in list1])

for myContent in list1:
    myTestCaseList = []  # Do whatever you like to generate the data you need
    myResultsList = []
    myScreenshot_Paths_List = []
    # 1 Store your created list for the test case of item 'myContent' from list1 in a dictionary
    testcases[myContent].append(myTestCaseList)
    # 2 Same, but for your results list
    results[myContent].append(myResultsList)
    # 3 Same, but for your screenshot_paths list
    screenshot_paths[myContent].append(myScreenshot_Paths_List)
    # 4 Make an excel sheet named after the item from list1
    run_vba_macro("C:\\Users\\xx-_-\\Documents\\Coding Products\\Python (Local)\\Programs\\Python X Excel\\AddSheets.xlsm",
                  "SheetAdder", "AddASheet", myContent)
I started working on this before you updated your question with a code sample, so bear in mind I haven't looked at your code at all lol. Just ran with this.
Here is a summary of what all of the above does:
- Creates an Excel workbook with a sheet for every element in 'list1', with each sheet named after that element
- Generates 3 dictionaries, one for test cases, one for results, and one for screenshot paths, where each dictionary has a list for each element from 'list1', with that list as the value for the key being the element in 'list1'
[1]: https://stackoverflow.com/questions/58188684/calling-vba-macro-from-python-with-unknown-number-of-arguments

For Loop 60 items 10 per 10

I'm working with an API that gives me 61 items, which I include in a Discord embed in a for loop.
As all of this is planned to be included in a Discord bot using pagination from DiscordUtils, I need to make one embed per 10 entries to avoid a too-long message (over the 2000-character limit).
Currently, what I use to do my loop is here: https://api.nepmia.fr/spc/ (I recommend using a JSON-parsing extension for your browser, or it will be a bit hard to read.)
But what I want to create is something that looks like this: https://api.nepmia.fr/spc/formated/
That way I can iterate over each range in a different embed and then use pagination.
I use TinyDB to generate the JSON files shown above with this script:
import urllib.request, json
from shutil import copyfile
from termcolor import colored
from tinydb import TinyDB, Query

db = TinyDB("/home/nepmia/Myazu/db/db.json")

def api_get():
    print(colored("[Myazu]", "cyan"), colored("Fetching WynncraftAPI...", "white"))
    try:
        with urllib.request.urlopen("https://api.wynncraft.com/public_api.php?action=guildStats&command=Spectral%20Cabbage") as u1:
            api_1 = json.loads(u1.read().decode())
            count = 0
            if members := api_1.get("members"):
                print(colored("[Myazu]", "cyan"),
                      colored("Got expected answer, starting saving process.", "white"))
                for member in members:
                    nick = member.get("name")
                    ur2 = f"https://api.wynncraft.com/v2/player/{nick}/stats"
                    u2 = urllib.request.urlopen(ur2)
                    api_2 = json.loads(u2.read().decode())
                    data = api_2.get("data")
                    for item in data:
                        meta = item.get("meta")
                        playtime = meta.get("playtime")
                        print(colored("[Myazu]", "cyan"),
                              colored("Saving playtime for player", "white"),
                              colored(f"{nick}...", "green"))
                        db.insert({"username": nick, "playtime": playtime})
                        count += 1
            else:
                print(colored("[Myazu]", "cyan"),
                      colored("Unexpected answer from WynncraftAPI [ERROR 1]", "white"))
    except:
        print(colored("[Myazu]", "cyan"),
              colored("Unhandled error in saving process [ERROR 2]", "white"))
    finally:
        print(colored("[Myazu]", "cyan"),
              colored("Finished saving data for", "white"),
              colored(f"{count}", "green"),
              colored("players.", "white"))
But this will only create a flat structure like this: https://api.nepmia.fr/spc/
What I would like is something like this: https://api.nepmia.fr/spc/formated/
Thanks for your help!
PS: Sorry for your eyes, I'm still new to Python, so I know I don't do things properly yet :s
To follow up from the comments, you shouldn't store items in your database in a format that is specific to how you want to return results from the database to a different API, as it will make it more difficult to query in other contexts, among other reasons.
If you want to paginate items from a database it's better to do that when you query it.
According to the docs, you can iterate over all documents in a TinyDB database just by iterating directly over the DB like:
for doc in db:
    ...
For any iterable you can use the enumerate function to associate an index to each item like:
for idx, doc in enumerate(db):
    ...
If you want the indices to start with 1 as in your examples you would just use idx + 1 (or pass a start value: enumerate(db, start=1)).
Finally, to paginate the results, you need some function that can return items from an iterable in fixed-sized batches, such as one of the many solutions on this question or elsewhere. E.g. given a function chunked(iter, size) you could do:
pages = enumerate(chunked(enumerate(db), 10))
Then list(pages) gives a list of tuples like [(page_num, [(player_num, player), ...]), ...].
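For completeness, here is one minimal way such a chunked helper could look (an illustrative sketch, not the only option; the linked question has several):

from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from any iterable
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch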
The only difference between a list of lists and what you want is you seem to want a dictionary structure like
{'range1': {'1': {...}, '2': {...}, ...}, 'range2': {'11': {...}, ...}}
This is no different from a list of lists; the only difference is that you're using dictionary keys to give numerical indices to each item in a collection, rather than the indices being implicit in the list structure. There are many ways to go from a list of lists to this. The easiest, I think, is a (nested) dict comprehension:
{f'range{page_num + 1}': {str(player_num + 1): player for player_num, player in page}
for page_num, page in pages}
This will give output in exactly the format you want.
Thanks @Iguananaut for your precious help.
In the end I made something similar based on your solution, using a generator.
import discord
from math import ceil

def chunker(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def embed_creator(embeds):
    pages = []
    current_page = None
    for i, chunk in enumerate(chunker(embeds, 10)):
        current_page = discord.Embed(
            title=f'**SPC** Last week online time',
            color=3903947)
        for elt in chunk:
            current_page.add_field(
                name=elt.get("username"),
                value=elt.get("play_output"),
                inline=False)
        current_page.set_footer(
            icon_url="https://cdn.discordapp.com/icons/513160124219523086/a_3dc65aae06b2cf7bddcb3c33d7a5ecef.gif?size=128",
            text=f"{i + 1} / {ceil(len(embeds) / 10)}")
        pages.append(current_page)
        current_page = None
    return pages
Using embed_creator I generate a list named pages that I can simply use with DiscordUtils paginator.
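For reference, handing those pages to the paginator might look like the sketch below; the AutoEmbedPaginator class name is assumed from DiscordUtils' documentation, so double-check it against the version you use:

from DiscordUtils import Pagination

async def send_playtimes(ctx, entries):
    # entries is the list of player dicts; embed_creator is defined above
    pages = embed_creator(entries)
    paginator = Pagination.AutoEmbedPaginator(ctx)  # assumed API
    await paginator.run(pages)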

Converting text file to list

We had our customer details spread over 4 legacy systems and have subsequently migrated all the data into 1 new system.
Our customers previously had different account numbers in each system and I need to check which account number has been used in the new system, which supports API calls.
I have a text file containing all the possible account numbers, structured like this:
30000001, 30000002, 30000003, 30000004
30010000, 30110000, 30120000, 30130000
34000000, 33000000, 32000000, 31000000
Where each row represents all the old account numbers for each customer.
I'm not sure if this is the best approach but I want to open the text file and create a nested list:
[['30000001', '30000002', '30000003', '30000004'], ['30010000', '30110000', '30120000', '30130000'], ['34000000', '33000000', '32000000', '31000000']]
Then I want to iterate over each list but to save on API calls, as soon as I have verified the new account number in a particular list, I want to break out and move onto the next list.
import json
from urllib2 import urlopen

def read_file():
    lines = [line.rstrip('\n') for line in open('customers.txt', 'r')]
    return lines

def verify_account(*customers):
    verified_accounts = []
    for accounts in customers:
        for account in accounts:
            url = api_url + account
            response = json.load(urlopen(url))
            if response['account_status'] == 1:
                verified_accounts.append(account)
                break
    return verified_accounts
The main issue is that when I read from the file, it returns the data like below, so I can't iterate over the individual accounts.
['30000001, 30000002, 30000003, 30000004', '30010000, 30110000, 30120000, 30130000', '34000000, 33000000, 32000000, 31000000']
Also, is there a more Pythonic way, using list comprehensions or similar, to iterate over and check the account numbers? There seems to be too much nesting for idiomatic Python.
The final thing to mention is that there are over 255 customers to check, almost 1000 in fact. Will I be able to pass more than 255 arguments into a function?
What about this? Just use str.split():
l = []
with open('customers.txt', 'r') as f:
    for i in f:
        l.append([s.strip() for s in i.split(',')])
Output:
[['30000001', '30000002', '30000003', '30000004'],
['30010000', '30110000', '30120000', '30130000'],
['34000000', '33000000', '32000000', '31000000']]
How about this?
with open('customers.txt', 'r') as f:
    final_list = [i.split(",") for i in f.read().replace(" ", "").splitlines()]
print final_list
Output:
[['30000001', '30000002', '30000003', '30000004'],
['30010000', '30110000', '30120000', '30130000'],
['34000000', '33000000', '32000000', '31000000']]
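Putting that together with the verification loop from the question, a sketch: api_url is a hypothetical endpoint and the account_status field is assumed from the question, and the nested list is passed directly instead of being unpacked into separate arguments, which sidesteps any argument-count limit:

import json
from urllib2 import urlopen

api_url = 'https://example.com/accounts/'  # hypothetical endpoint

def verify_accounts(customers):
    verified_accounts = []
    for accounts in customers:  # one list of old account numbers per customer
        for account in accounts:
            response = json.load(urlopen(api_url + account))
            if response['account_status'] == 1:
                verified_accounts.append(account)
                break  # verified this customer; skip their remaining numbers
    return verified_accounts

print verify_accounts(final_list)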

PYTHON readlines()-cannot access lists within a bigger list

I am currently doing a project for school that involves making a graphing editor. I am at the part where I have to be able to save and reopen a file. I can open the file, but then I have to iterate through it and regraph everything I saved. However, I am unsure how to actually iterate through the file, because when I print the file that I opened, I get one huge list with all of my lists inside it as a single string, like this:
["['Rectangle', 5.168961201501877, 8.210262828535669, 7.6720901126408005, 6.795994993742178, 'red']['Line', 5.782227784730914, 5.269086357947434, 8.69837296620776, 4.993742177722153, 'red']['Circle', 2.6491232154288933, -0.8552572601656006, 6.687547623119292, 3.1831671475247982, 'red']"]
I am new at using this website so please bear with me.
def open_file(self, cmd):
    filename = input("What is the name of the file? ")
    File = open(filename, 'r')
    file = File.readlines()
    print(file)
I had previously saved the file by using file.write(str(l)), where l is the name of a list of values I made.
I have tried using split().
I tried using a for loop to save the data within the string into a list.
And I have searched the web for hours to find some sort of explanation, but I couldn't find any.
What you've provided is actually a list with one item consisting of a long string. Can you provide the code you're using to generate this?
If it actually is a list within a list, you can use a for loop inside another for loop to access each item in each list.
let's say your list is object l.
l[0] = ['Rectangle', 5.168961201501877, 8.210262828535669, 7.6720901126408005, 6.795994993742178, 'red']
and l[0][0] = 'Rectangle'
for i in l:
    for x in i:
        ...

This would allow you to loop through all of them.
For the info you've provided, readlines() won't necessarily work, as there's nothing to delineate a new line in the text. Instead of saving the list as a converted string, you could use a for loop to save each item in the list as a line
for lne in l:
    f.write(lne)
This writes each item in the list on its own line in the file (note that write() does not add a newline itself, so you will need f.write(lne + '\n') to actually start a new line, regardless of Python version). Then when you open the file and use readlines(), it will give back each line as an item in a list.
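A minimal sketch of that round trip, assuming l is the list of shape lists from the question:

# Saving: one shape per line
with open('shapes.txt', 'w') as f:
    for shape in l:
        f.write(str(shape) + '\n')

# Loading: readlines() now gives one string per shape
with open('shapes.txt') as f:
    lines = f.readlines()

Note that each line read back is still a string; turning it back into a real list is exactly what the serialization approaches in the next answer handle cleanly.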
You are apparently having problems reading data you have created before. Your task seems to require:
1) creating some geometry in an editor
2) serializing all the geometry to a file
and later on (after the program is restarted and all old memory content is gone):
3) loading the geometries from the file
4) recreating the content (geometries) in your program
In step 2 you did something whose result seems to have surprised you. My proposal would be to use some other serialization option. Python offers many of them, e.g.:
pickle - quick and easy, but not interoperable with anything other than Python programs
JSON - easy, but might require some coding for serializing and loading your custom objects
Sample solution using JSON serialization could go like this:
import json

class Geometry():
    def __init__(self, geotype="Geometry", color="blank", numbers=None):
        self.geotype = geotype
        self.color = color
        # None instead of a mutable default list, so instances don't share state
        self.numbers = numbers if numbers is not None else []

    def to_list(self):
        return [self.geotype, self.color, self.numbers]

    def from_list(self, lst):
        print("lst", lst)
        self.geotype, self.color, self.numbers = lst
        return self

    def __repr__(self):
        return "<{self.geotype}: {self.color}, {self.numbers}>".format(self=self)

def test_create_save_load_recreate():
    geoms = []
    rect = Geometry("Rectangle", "red", [12.34, 45])
    geoms.append(rect)
    line = Geometry("Line", "blue", [12.33, 11.33, 55.22, 22.41])
    geoms.append(line)
    # now serialize
    fname = "geom.data"
    with open(fname, "w") as f:
        geoms_lst = [geo.to_list() for geo in geoms]
        json.dump(geoms_lst, f)
    # "geom.data" is closed now
    del f
    del geoms
    del rect
    del line
    # after a while
    with open(fname, "r") as f:
        data = json.load(f)
    geoms = [Geometry().from_list(itm) for itm in data]
    print(geoms)

Speed up simple Python function that uses list comprehension

I'm extracting 4 columns from an imported CSV file (~500MB) to be used for fitting a scikit-learn regression model.
It seems that the function used to do the extraction is extremely slow. I just learnt Python today; any suggestions on how the function can be sped up?
Can multithreading or multiple cores be used? My system has 4 cores.
def splitData(jobs):
    salaries = [jobs[i]['salaryNormalized'] for i, v in enumerate(jobs)]
    descriptions = [jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] for i, v in enumerate(jobs)]
    titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
    return salaries, descriptions, titles
print type(jobs)
<type 'list'>
print jobs[:1]
[{'category': 'Engineering Jobs', 'salaryRaw': '20000 - 30000/annum 20-30K', 'rawLocation': 'Dorking, Surrey, Surrey', 'description': 'Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K', 'title': 'Engineering Systems Analyst', 'sourceName': 'cv-library.co.uk', 'company': 'Gregory Martin International', 'contractTime': 'permanent', 'normalizedLocation': 'Dorking', 'contractType': '', 'id': '12612628', 'salaryNormalized': '25000'}]
def loadData(filePath):
    reader = csv.reader(open(filePath))
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        # Skip header row
        if i != 0:
            rows.append(dict(zip(categories, row)))
    return rows

def splitData(jobs):
    salaries = []
    descriptions = []
    titles = []
    for i in xrange(len(jobs)):
        salaries.append(jobs[i]['salaryNormalized'])
        descriptions.append(jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'])
        titles.append(jobs[i]['title'])
    return salaries, descriptions, titles

def fit(salaries, descriptions, titles):
    # Vectorize
    vect = TfidfVectorizer()
    vect2 = TfidfVectorizer()
    descriptions = vect.fit_transform(descriptions)
    titles = vect2.fit_transform(titles)
    # Fit
    X = hstack((descriptions, titles))
    y = [np.log(float(salaries[i])) for i, v in enumerate(salaries)]
    rr = Ridge(alpha=0.035)
    rr.fit(X, y)
    return vect, vect2, rr, X, y

jobs = loadData(paths['train_data_path'])
salaries, descriptions, titles = splitData(jobs)
vect, vect2, rr, X_train, y_train = fit(salaries, descriptions, titles)
I see multiple problems with your code, directly impacting its performance.
You enumerate the jobs list multiple times. You could enumerate it only once and instead use the enumerated list (stored in a variable).
You don't use the value from the enumerated items at all. All you need is the index, and you could easily achieve this using the built-in range function.
Each of the lists is generated eagerly. What happens is the following: building the first list blocks execution of the program and takes some time to finish; the same thing then happens with the second and third lists, whose calculations are exactly the same.
What I suggest is to use a generator, so that you process the data lazily. It's more efficient and allows you to extract the data on the go.
def splitData(jobs):
    for job in jobs:
        yield (job['salaryNormalized'],
               job['description'] + job['normalizedLocation'] + job['category'],
               job['title'])
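Consuming that generator then looks like this (a usage sketch; process() stands in for whatever you do with each record):

for salary, description, title in splitData(jobs):
    # Each job is handled as it is produced; no intermediate lists are built
    process(salary, description, title)  # process() is a hypothetical placeholder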
One simple speedup is to cut down on your list traversals. You can build a generator or generator expression that returns tuples for a single dictionary, then zip the resulting iterable:
(salaries, descriptions, titles) = zip(*((j['salaryNormalized'], j['description'] + j['normalizedLocation'] + j['category'], j['title']) for j in jobs))
Unfortunately, that still creates three sizable in-memory lists - using a generator expression rather than a list comprehension should at least prevent it from creating a full list of three-element tuples prior to zipping.
Correct me if I'm wrong, but it seems that TfidfVectorizer accepts an iterator (e.g. a generator expression) as well. This helps prevent having multiple copies of this rather large data in memory, which is probably what makes it slow. Alternatively, it can certainly work with files directly. One could transform the csv into separate files and then feed those files to TfidfVectorizer directly, without keeping them in memory in any way at all.
Edit 1
Now that you provided some more code, I can be a bit more specific.
First of all, please note that loadData is doing more than it needs to; it duplicates functionality present in csv.DictReader. If we use that, we can skip listing the category names. A different syntax for opening files is used, because that way they're closed automatically. Also, some names are changed to be both more accurate and Pythonic (underscore style).
def data_from_file(filename):
    rows = []
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(row)
    return rows
We can now change this so that we don't build the list of all rows in memory, but instead give back a row one at a time right after we read it from the file. If this looks like magic, just read a little about generators in Python.
def data_from_file(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row
Now let's have a look at splitData. We could write it more cleanly like this:
def split_data(jobs):
    salaries = []
    descriptions = []
    titles = []
    for job in jobs:
        salaries.append(job['salaryNormalized'])
        descriptions.append(job['description'] + job['normalizedLocation'] +
                            job['category'])
        titles.append(job['title'])
    return salaries, descriptions, titles
But again we don't want to build three huge lists in memory. And generally, it's not going to be practical for this function to give us three different things. So let's split it up:
def extract_salaries(jobs):
    for job in jobs:
        yield job['salaryNormalized']
And so on. This helps us set up a kind of processing pipeline: every time we request a value from extract_salaries(data_from_file(filename)), a single line of the csv is read and the salary extracted; the next time, the second line, giving back the second salary. There's no need to write functions for this simple case; instead, you can use generator expressions:
salaries = (job['salaryNormalized'] for job in data_from_file(filename))
descriptions = (job['description'] + job['normalizedLocation'] +
                job['category'] for job in data_from_file(filename))
titles = (job['title'] for job in data_from_file(filename))
You can now pass these generators to fit, where the most important modification is this:
y = [np.log(float(salary)) for salary in salaries]
You can't index into an iterator (something that gives you one value at a time), so you simply keep taking a salary from salaries as long as there are more, and do something with each one.
In the end, you will read the whole csv file multiple times, but I don't expect that to be the bottleneck. Otherwise, some more restructuring is required.
Edit 2
Using DictReader seems a bit slow. Not sure why, but you may stick with your own implementation of that (modified to be a generator) or even better, go with namedtuples:
from collections import namedtuple

def data_from_file(filename):
    with open(filename) as f:
        reader = csv.reader(f)
        header = reader.next()
        Job = namedtuple('Job', header)
        for row in reader:
            yield Job(*row)
Then access the attributes with a dot (job.salaryNormalized). But anyway note that you can get the list of column names from the file; don't duplicate it in code.
You may of course decide to keep a single copy of the file in memory after all. In that case, do something like this:
data = list(data_from_file(filename))
salaries = (job['salaryNormalized'] for job in data)
The functions remain untouched. The call to list consumes the whole generator and stores all values in a list.
You don't need the indexes at all; just iterate over the jobs directly. This saves the creation of an extra list of tuples, and it removes a level of indirection:
salaries = [j['salaryNormalized'] for j in jobs]
descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
titles = [j['title'] for j in jobs]
This still iterates over the data three times.
Alternatively, you could get everything in one list comprehension, grouping the relevant data from one job together in a tuple:
data = [(j['salaryNormalized'],
j['description'] + j['normalizedLocation'] + j['category'],
j['title']) for j in jobs]
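If you later need the three separate sequences back from that grouped form, zip can transpose it; a small sketch:

# zip(*data) turns the list of 3-tuples into three parallel tuples
salaries, descriptions, titles = zip(*data)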
Saving the best for last; why not fill the lists straight from the CSV file instead of making a dict first?
import csv

with open('data.csv', 'r') as df:
    reader = csv.reader(df)
    # I made up the row indices...
    data = [(row[1], row[3] + row[7] + row[6], row[2]) for row in reader]
