Converting a text file to a list - Python

We had our customer details spread over 4 legacy systems and have subsequently migrated all the data into 1 new system.
Our customers previously had different account numbers in each system and I need to check which account number has been used in the new system, which supports API calls.
I have a text file containing all the possible account numbers, structured like this:
30000001, 30000002, 30000003, 30000004
30010000, 30110000, 30120000, 30130000
34000000, 33000000, 32000000, 31000000
Where each row represents all the old account numbers for each customer.
I'm not sure if this is the best approach but I want to open the text file and create a nested list:
[['30000001', '30000002', '30000003', '30000004'], ['30010000', '30110000', '30120000', '30130000'], ['34000000', '33000000', '32000000', '31000000']]
Then I want to iterate over each list but to save on API calls, as soon as I have verified the new account number in a particular list, I want to break out and move onto the next list.
import json
from urllib2 import urlopen

def read_file():
    lines = [line.rstrip('\n') for line in open('customers.txt', 'r')]
    return lines

def verify_account(*customers):
    verified_accounts = []
    for accounts in customers:
        for account in accounts:
            url = api_url + account
            response = json.load(urlopen(url))
            if response['account_status'] == 1:
                verified_accounts.append(account)
                break
    return verified_accounts
The main issue is when I read from the file it returns the data like below so I can't iterate over the individual accounts.
['30000001, 30000002, 30000003, 30000004', '30010000, 30110000, 30120000, 30130000', '34000000, 33000000, 32000000, 31000000']
Also, is there a more Pythonic way, using list comprehensions or similar, to iterate over and check the account numbers? There seems to be too much nesting for idiomatic Python.
The final thing to mention is that there are over 255 customers to check, in fact almost 1000. Will I be able to pass more than 255 parameters into a function?

What about this? Just use str.split():
l = []
with open('customers.txt', 'r') as f:
    for i in f:
        l.append([s.strip() for s in i.split(',')])
Output:
[['30000001', '30000002', '30000003', '30000004'],
['30010000', '30110000', '30120000', '30130000'],
['34000000', '33000000', '32000000', '31000000']]

How about this?
with open('customers.txt', 'r') as f:
    final_list = [i.split(",") for i in f.read().replace(" ", "").splitlines()]
print final_list
Output:
[['30000001', '30000002', '30000003', '30000004'],
['30010000', '30110000', '30120000', '30130000'],
['34000000', '33000000', '32000000', '31000000']]
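
Putting the pieces together, a minimal sketch of the whole flow could look like this. It assumes api_url is already defined (it is not shown in the question) and uses Python 3's urllib.request; the question's urllib2 is the Python 2 equivalent. Passing the nested list as a single argument also sidesteps the 255-parameter worry, because only one parameter is ever passed no matter how many customers it contains:
import json
from urllib.request import urlopen  # urllib2.urlopen in Python 2

api_url = 'https://example.com/accounts/'  # hypothetical endpoint

def read_accounts(path='customers.txt'):
    # one inner list of old account numbers per customer
    with open(path) as f:
        return [[field.strip() for field in line.split(',')] for line in f]

def verify_accounts(customers):
    verified = []
    for accounts in customers:        # one customer at a time
        for account in accounts:      # try each old account number
            response = json.load(urlopen(api_url + account))
            if response['account_status'] == 1:
                verified.append(account)
                break                 # stop calling the API for this customer
    return verified

verified = verify_accounts(read_accounts())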

Related

How to delete elements of a list of dictionaries

I have a JSON file of businesses, some open and some closed, and I need to report the number of open businesses, which is why I wrote this. But it returns None. Note that I have to use functions. Also, I'm not using a simple counter because I actually have to delete the closed businesses, since I have to do more with them later. This is not a duplicate, because I tried what the other post says and it gives me 0.
Here is what an entry of the json file looks like:
{
    "business_id": "1SWheh84yJXfytovILXOAQ",
    "name": "Arizona Biltmore Golf Club",
    "address": "2818 E Camino Acequia Drive",
    "city": "Phoenix",
    "state": "AZ",
    "postal_code": "85016",
    "latitude": 33.5221425,
    "longitude": -112.0184807,
    "stars": 3.0,
    "review_count": 5,
    "is_open": 0,
    "attributes": {
        "GoodForKids": "False"
    },
    "categories": "Golf, Active Life",
    "hours": null
}
import json

liste_businesses = []
liste_open = []

def number_entries(liste_businesses):
    with open('yelp.txt') as file:
        for line in file:
            liste_businesses.append(json.loads(line))
    return len(liste_businesses)

def number_open(liste_businesses):
    for e in range(len(liste_businesses)):
        if 'is_open' not in liste[e]:
            liste_open = liste_businesses.remove(liste[e])
        if int(liste[e]['is_open']) == int(0):
            liste_open = liste_businesses.remove(liste[e])

print(number_open(liste_businesses))
Unless you're dealing with memory constraints, it's probably simpler to just iterate over your list of businesses and select the open ones:
import json

def load_businesses():
    businesses = []
    with open('yelp.txt') as file:
        for line in file:
            businesses.append(json.loads(line))
    # More idiomatic to return a list than to modify global state
    return businesses

def get_open_businesses(businesses):
    # Make a new list rather than modifying the old one
    open_businesses = []
    for business in businesses:
        # is_open is stored as an integer (0 or 1) in the sample data
        if business.get('is_open', 0) != 0:
            open_businesses.append(business)
    return open_businesses

businesses = load_businesses()
open_businesses = get_open_businesses(businesses)
print(len(open_businesses))
If you wanted to use a list comprehension for the open businesses:
[b for b in businesses if b.get('is_open', 0) != 0]
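If only the count is needed, a generator expression with sum() avoids building the intermediate list; a small sketch along the same lines:
open_count = sum(1 for b in businesses if b.get('is_open', 0) != 0)
print(open_count)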

Most Pythonic way to collate a group of CSVs based on a common key?

The Problem: I have a group of CSV files, imagine they're named a.csv, b.csv, etc. They all share a common structure of first name, last name, phone, email, and the common key is source - where I got the data from. I collated all of those CSV files into one massive CSV.
Now, it can be the case that 1 individual can be in all of the CSV files - i.e., imagine if the source was a website - the same individual's profile is listed on each website, so each source is different, but the rest of the data is the same. This would mean I end up with data like:
John,Doe,867-5309,johndoe#fake.com,Website A
John,Doe,867-5309,johndoe#fake.com,Website B
I have the CSVs collated into one massive CSV, but I'm having trouble figuring out how best to complete the latter portion - sorting by the common key. Ideally, I'd want the source to be a list instead of a string - a list of all the sources. So, instead of the code sample above, I want to make the data look like this:
John,Doe,867-5309,johndoe#fake.com,Website A,Website B
Attempted solutions: Not my first rodeo here, so I know I have to show my work. My initial idea was to iterate through all of the agents in the collated CSV file, save their emails to a list, then iterate again through all of the list of agents and the list of emails - if the agent's email and the iterated email are the same, then I add the source from the iterated email to the agent's source column, and continue on as necessary. Here's the code I used to achieve that:
import csv
from tqdm import tqdm

class Agent:
    def __init__(self, source, first_name, last_name, phone, email):
        self.sources = []
        self.source = source
        self.first_name = first_name
        self.last_name = last_name
        self.phone = phone
        self.email = email

    def writer(self):
        with open('final_agent_list.csv', 'a') as csvfile:
            csv_writer = csv.writer(csvfile)
            row = [self.first_name, self.last_name, self.phone,
                   self.email]
            row.extend(i for i in self.sources)
            csv_writer.writerow(row)

agents = []
with open('collated_files.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in tqdm(list(csv_reader)):
        a = Agent(*row)
        for agent in agents:
            if agent.email == a.email:
                a.sources.append(agent.source)
            else:
                a.sources.append(a.source)
        agents.append(a)

for agent in agents:
    agent.writer()
For a minimum, complete verifiable example, use the following as collated_files.csv :
John,Doe,867-5309,johndoe#fake.com,Website A
John,Doe,867-5309,johndoe#fake.com,Website B
However, when I run that, I do get a list of sources as hoped, but they aren't collated. A good example of the output would be:
John,Doe,867-5309,johndoe#fake.com,[Website A]
John,Doe,867-5309,johndoe#fake.com,[Website B]
Clearly, it's not combining the two as I want, but I can't figure out what's making the code go wonky. Do any of you wonderful folks have any ideas? Thanks for reading!
The problem is that you break out of the loop as soon as you find one match, and so your agent will always end up with one value in sources. [You fixed this in a later edit].
Secondly, the inner else will also kick in even when a previous iteration found an email match, so you still end up appending duplicate agents. And as the list of agents grows, the loop goes through more iterations and adds even more duplicates.
I would suggest using a dict, as it will allow faster lookup of a matching email:
agents = {}  # create a dict keyed by email
with open('collated_files.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in tqdm(list(csv_reader)):
        a = Agent(*row)
        # if this is a new email, add it to the agents
        if a.email not in agents:
            agents[a.email] = a
        # in all cases add the source
        agents[a.email].sources.append(a.source)

# iterate over the Agent objects, not the email keys
for agent in agents.values():
    agent.writer()
To be more Pythonic (and to make that work), what I would do:
use collections.defaultdict to associate a list with the key, which is a tuple made of all values but the last one (the website)
read the input csv files in a loop and create the dictionary
write a new csv file and recreate the rows by concatenating the key and the values (the gathered websites)
like this:
import collections
import csv

d = collections.defaultdict(list)
for input_file in ["in1.csv", "in2.csv"]:
    with open(input_file) as f:
        for row in csv.reader(f):
            d[tuple(row[:-1])].append(row[-1])

with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows((k + tuple(v)) for k, v in d.items())
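For example, pointing the loop at the two-line collated_files.csv from the question (e.g. for input_file in ["collated_files.csv"]:), out.csv would contain the single collated row the question asks for:
John,Doe,867-5309,johndoe#fake.com,Website A,Website B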

How to iterate over two files effectively (25000+ Lines)

So, I am trying to make a combined list inside of Python for matching data of about 25,000 lines.
The first list's data comes from the file mac.uid and looks like this:
Mac|ID
The second list's data comes from serial.uid and looks like this:
Serial|Mac
The Mac from list 1 must equal the Mac from list 2 before the records are joined.
This is what I am currently doing; I believe there is too much repetition going on.
combined = []

def combineData():
    lines = open('mac.uid', 'r+')
    for line in lines:
        with open('serial.uid', 'r+') as serial:
            for each in serial:
                a, b = line.strip().split('|')
                a = a.lower()
                x, y = each.strip().split('|')
                y = y.lower()
                if a == y:
                    combined.append(a + "" + b + "" + x)
The final list is supposed to look like this:
Mac(List1), ID(List1), Serial(List2)
So that I can import it into an excel sheet.
Thanks for any and all help!
Instead of your nested loops (which give you quadratic complexity) you should use a dictionary, which brings the work down to roughly linear time since lookups are constant time on average. To do so, first read serial.uid once and store the mapping of MAC addresses to serial numbers in a dict.
serial = dict()
with open('serial.uid') as istr:
    for line in istr:
        (ser, mac) = split_fields(line)
        serial[mac] = ser
Then you can close the file again and process mac.uid looking up the serial number for each MAC address in the dictionary you've just created.
combined = list()
with open('mac.uid') as istr:
    for line in istr:
        (mac, uid) = split_fields(line)
        combined.append((mac, uid, serial[mac]))
Note that I've changed combined from a list of strings to a list of tuples. I've also factored the splitting logic out into a separate function. (You'll have to put its definition before its use, of course.)
def split_fields(line):
    return line.strip().lower().split('|')
Finally, I recommend that you start using more descriptive names for your variables.
For files of 25k lines, you should have no issues storing everything in memory. If your data sets become too large for that, you might want to start looking into using a database. Note that the Python standard library ships with an SQLite module.
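Since the end goal is to import the result into an Excel sheet, the list of tuples can be written out with the csv module; a minimal sketch (the output filename combined.csv is just an example):
import csv

# one row per (mac, uid, serial) tuple; Excel can open this directly
with open('combined.csv', 'w', newline='') as ostr:
    writer = csv.writer(ostr)
    writer.writerow(['Mac', 'ID', 'Serial'])  # header row
    writer.writerows(combined)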

Efficiently Find Partial String Match --> Values Starting From List of Values in 5 GB file with Python

I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with the SNACODE corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004 and I want all businesses whose codes start with my list of grocery SNACODEs codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked whether my SNACODE column started with any of my codes (which probably was a bad idea, but it was the only way I could get to work).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is most efficient way to find partial string matches in large file of strings (python), but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5gb file I read, but I have the last saved line of data--is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried -- in 9.5 hours it outputted a 72MB file (200k+ rows) of grocery stores
codes = [4451,4452,447,772,45299,45291,45212]  # codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns=headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header=False)
            print code
            break  # break code for loop if match
grocery.to_csv("grocery.csv", sep='\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1024*1024, dtype=str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
    print "Processed chunk and found {} matches".format(len(x))

output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index=False)
Thanks!
To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no csv parsing) and check them with the regexp...
import re

codes = [4451,4452,447,772,45299,45291,45212]
col_number = 4  # column number of SNACODE
# group the alternatives so the column prefix applies to every code
expr = re.compile("[^,]*," * col_number +
                  "(?:" + "|".join(map(str, codes)) + ")" +
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print L
Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
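Alternatively, the quoting headache disappears if the csv module does the splitting and the SNACODE field is checked with str.startswith, which accepts a tuple of prefixes. A minimal sketch (Python 3 syntax, assuming the same file and that SNACODE is the fifth column, index 4):
import csv

codes = ('4451', '4452', '447', '772', '45299', '45291', '45212')

with open('infogroup_bus_2010.csv') as fin, open('grocery.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        if row[4].startswith(codes):  # str.startswith accepts a tuple of prefixes
            writer.writerow(row)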
You can probably make your pandas solution much faster:
codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})

with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.
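If the whole rows are needed rather than just the codes (as in the original 72 MB output), a variant sketch that reads all columns and filters each chunk with a boolean mask could look like this, assuming the same file and column name:
import pandas as pd

codes = ('4451', '4452', '447', '772', '45299', '45291', '45212')

chunks = pd.read_csv('infogroup_bus_2010.csv', chunksize=int(1e6),
                     dtype={'SNACODE': str}, low_memory=False)

first = True
for chunk in chunks:
    # keep rows whose SNACODE starts with any of the grocery prefixes
    mask = chunk['SNACODE'].apply(lambda c: str(c).startswith(codes))
    chunk[mask].to_csv('grocery.csv', mode='w' if first else 'a',
                       header=first, index=False)
    first = False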

Python readlines() - cannot access lists within a bigger list

I am currently doing a project for school that involves making a graphing editor. I am at the part where I have to be able to save and reopen a file. I can open the file, but I have to iterate through it and regraph everything I saved. I am unsure, however, how to actually iterate through the file, because when I print the file that I opened, I get one huge list that has all of my lists within it, like this:
["['Rectangle', 5.168961201501877, 8.210262828535669, 7.6720901126408005, 6.795994993742178, 'red']['Line', 5.782227784730914, 5.269086357947434, 8.69837296620776, 4.993742177722153, 'red']['Circle', 2.6491232154288933, -0.8552572601656006, 6.687547623119292, 3.1831671475247982, 'red']"]
I am new at using this website so please bear with me.
def open_file(self, cmd):
    filename = input("What is the name of the file? ")
    File = open(filename, 'r')
    file = File.readlines()
    print(file)
I had previously saved the file by using:
file.write(str(l)) where l is the name of a list of values I made
I have tried using split()
I tried using a for loop to save the data within the string into a list
and I have searched the web for hours to find some sort of explanation but I couldn't find any.
What you've provided is actually a list with one item consisting of a long string. Can you provide the code you're using to generate this?
If it actually is a list within a list, you can use a for loop inside another for loop to access each item in each list.
Let's say your list is the object l.
l[0] = ['Rectangle', 5.168961201501877, 8.210262828535669, 7.6720901126408005, 6.795994993742178, 'red']
and l[0][0] = 'Rectangle'
for i in l:
    for x in i:
        ...  # x is an individual element such as 'Rectangle'
Would allow you to loop through all of them.
For the info you've provided, readlines() won't necessarily work, as there's nothing to delineate a new line in the text. Instead of saving the list as a converted string, you could use a for loop to save each item in the list as a line
for lne in l:
    f.write(lne)
This writes each item in the list on its own line in the file (you will need f.write(lne + '\n') to add the newline yourself, since write() does not append one). Then when you open the file and use readlines(), each line comes back as an item in a list.
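For lists of mixed values like the shapes in the question, each item can be converted to a string first, for example with one json.dumps() call per shape, so that readlines() later gives back one parsable line per shape. A minimal sketch (the file name shapes.txt and the shortened example values are hypothetical):
import json

shapes = [
    ['Rectangle', 5.17, 8.21, 7.67, 6.80, 'red'],
    ['Line', 5.78, 5.27, 8.70, 4.99, 'red'],
]

# save: one JSON-encoded shape per line
with open('shapes.txt', 'w') as f:
    for shape in shapes:
        f.write(json.dumps(shape) + '\n')

# load: readlines() now returns one parsable string per shape
with open('shapes.txt') as f:
    loaded = [json.loads(line) for line in f.readlines()]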
You are apparently having problems reading back data you created earlier.
Your task seems to require:
1) creating some geometry in an editor
2) serializing all the geometry to a file
and later on (after the program is restarted and all the old memory content is gone):
3) loading the geometries from the file
4) recreating the content (geometries) in your program
In step 2 you wrote the list out with str(l), which is why it reads back as one long string. My proposal would be to use some other serialization option. Python offers many of them, e.g.
pickle - quick and easy, but not interoperable with non-Python programs
JSON - easy, but might require some coding to serialize and load your custom objects
A sample solution using JSON serialization could go like this:
import json

class Geometry():
    def __init__(self, geotype="Geometry", color="blank", numbers=[]):
        self.geotype = geotype
        self.color = color
        self.numbers = numbers

    def to_list(self):
        return [self.geotype, self.color, self.numbers]

    def from_list(self, lst):
        print "lst", lst
        self.geotype, self.color, self.numbers = lst
        return self

    def __repr__(self):
        return "<{self.geotype}: {self.color}, {self.numbers}>".format(self=self)

def test_create_save_load_recreate():
    geoms = []
    rect = Geometry("Rectangle", "red", [12.34, 45])
    geoms.append(rect)
    line = Geometry("Line", "blue", [12.33, 11.33, 55.22, 22.41])
    geoms.append(line)

    # now serialize
    fname = "geom.data"
    with open(fname, "w") as f:
        geoms_lst = [geo.to_list() for geo in geoms]
        json.dump(geoms_lst, f)

    # "geom.data" is closed now
    del f
    del geoms
    del rect
    del line

    # after a while
    with open(fname, "r") as f:
        data = json.load(f)
    geoms = [Geometry().from_list(itm) for itm in data]
    print geoms
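
For completeness, the pickle route mentioned above needs even less code, since it can store the Geometry objects directly without the to_list()/from_list() helpers. A minimal sketch in the same Python 2 style, reusing the Geometry class and file name from the JSON example:
import pickle

geoms = [Geometry("Rectangle", "red", [12.34, 45]),
         Geometry("Line", "blue", [12.33, 11.33, 55.22, 22.41])]

# serialize the objects themselves
with open("geom.data", "wb") as f:
    pickle.dump(geoms, f)

# later: load them back as ready-to-use Geometry instances
with open("geom.data", "rb") as f:
    geoms = pickle.load(f)
print geoms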
