I have a small scraping script. I have a file with 2,000 names, and I use these names to search for video IDs on YouTube. Because of the amount, it takes a pretty long time to get all the IDs, so I can't do it in one run. What I want is to find where my last scrape ended and then start from that position. What is the best way to do this? I was thinking about adding each used name to a list and then just checking whether a name is in the list, and if not, scraping it, but maybe there's a better way to do this? (I hope so.)
Here is the part that takes names from the file and scrapes the IDs. What I want is that when I quit scraping, the next time I start the script it runs not from the beginning but from the point where it ended last time:
import itertools
import json

import requests

# f, api_key and id_file are defined earlier in the script
index = 0
for name in itertools.islice(f, index, None):
    parameters = {'key': api_key, 'q': name}
    response = requests.get(
        'https://www.googleapis.com/youtube/v3/search?part=snippet&maxResults=1&type=video&fields=items%2Fid',
        params=parameters)
    videoid = json.loads(response.text)
    if 'error' in videoid:
        pass
    else:
        index += 1
        id_file.write(videoid['items'][0]['id']['videoId'] + '\n')
        print(videoid['items'][0]['id']['videoId'])
You could just remember the index number of the last scraped entry: every time you finish scraping one entry, increment a counter. Then, assuming the entries in your text file don't change order, you can just pick up again at that number.
The simplest answer here is probably mitim's answer. Just keep a file that you rewrite with the last-processed index after each line. For example:
import itertools
import os

savepath = os.path.expanduser('~/.myprogram.lines')
skiplines = 0
try:
    with open(savepath) as f:
        skiplines = int(f.read())
except (IOError, ValueError):
    pass

with open('names.txt') as f:
    for linenumber, line in itertools.islice(enumerate(f), skiplines, None):
        do_stuff(line)
        # save the count of lines completed, not the index of the last one,
        # so the next run resumes at the following line
        with open(savepath, 'w') as savefile:
            savefile.write(str(linenumber + 1))
However, there are other ways you could do this that might make more sense for your use case.
For example, you could rewrite the "names" file after each name is processed to remove the first line. Or, maybe better, preprocess the list into an anydbm (or even sqlite3) database, so you can more easily remove (or mark) names once they're done; there's a sketch of that below.
Or, if you might run against different files and need to keep progress for each one, you could store a separate .lines file for each (probably in a ~/.myprogram directory, rather than flooding the top-level home directory), or use an anydbm mapping pathnames to lines done.
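For instance, here is a minimal sketch of the sqlite3 variant; the database filename, table name, and the do_stuff placeholder are all made up for illustration:

import sqlite3

conn = sqlite3.connect('progress.db')  # hypothetical filename
conn.execute('CREATE TABLE IF NOT EXISTS names'
             ' (name TEXT PRIMARY KEY, done INTEGER DEFAULT 0)')

# one-time preprocessing: load the names file into the table
with open('names.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO names (name) VALUES (?)',
                     ((line.strip(),) for line in f))
conn.commit()

# on each run, fetch only the names that aren't marked done yet
todo = [row[0] for row in conn.execute('SELECT name FROM names WHERE done = 0')]
for name in todo:
    do_stuff(name)  # your scraping code goes here
    conn.execute('UPDATE names SET done = 1 WHERE name = ?', (name,))
    conn.commit()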
Related
I have a file that looks like this:
1234:AnneShirly:anneshirley#seneca.ca:4:5[SRT111,OPS105,OPS110,SPR100,ENG100]
3217:Illyas:illay#seneca.ca:2:4[SRT211,OPS225,SPR200,ENG200]
1127:john Marcus:johnmarcus#seneca.ca:1:4[SRT111,OPS105,SPR100,ENG100]
0001:Amin Malik:amin_malik#seneca.ca:1:3[OPS105,SPR100,ENG100]
I want to be able to ask the user for an input (the student number at the beginning of each line) and then ask which course they want to delete (the course codes are in the list). The program should delete the course from the list for that student number without deleting other instances of the course, because other students may have the same courses.
studentid = input("enter studentid")
course = input("enter the course to delete")

with open("studentDatabase.dat") as file:
    lines = file.readlines()

with open("studentDatabase.dat", "w") as file:
    for line in lines:
        if line.find(course) == -1:
            file.write(line)
This just deletes the whole line, but I only want to delete the course.
Welcome to the site. You have a little way to go to make this work. It would be good if you put some additional effort into this before asking somebody to code it up. Let me suggest a structure that you can work on/augment, and then you can re-post if you get stuck by editing your question above and/or commenting back on this answer. Here is the framework that I suggest:
Make a section of code to read your whole .dat file into memory. I would suggest putting the data into a dictionary that looks like this:
data = {1001: (name, email, <whatever the digits stand for>, [SRT111, OPS333, ...]),
        1044: ( ... )}
basically a dictionary with the ID as the key and the rest in a tuple or list. Test that, make sure it works OK by inspecting a few values.
Make a little "control loop" that uses your input statements, and see if you can locate the "record" from your dictionary. Add some "if" logic to do "something" if the ID is not found or if the user enters something like "quit" to exit/break the loop. Test it to make sure it can find the ID's and then test it again to see that it can find the course in the list inside the tuple/list with the data. You probably need another "if" statement in there to "do something" if the course is not in the data element. Test it.
Make a little "helper function" that can re-write a data element with the course removed. A suggested signature would be:
def remove_course(data_element, course):
    # make the new data element (name, ..., [reduced course list])
    return new_data_element
Test it, make sure it works.
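For instance, here is a minimal sketch of such a helper, assuming the 5-tuple layout that the EDIT below builds, i.e. (name, email, some_number, course_count, courses):

def remove_course(data_element, course):
    # unpack the assumed 5-tuple layout
    name, email, some_number, course_count, courses = data_element
    if course in courses:
        # rebuild the course list without the course and adjust the count
        courses = [c for c in courses if c != course]
        course_count -= 1
    return (name, email, some_number, course_count, courses)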
Put those pieces together and you should have the ingredients to change the dictionary by using the loop and function to put the new data element into the dictionary, overwriting the old one.
Write a widget to write the new .dat file from the dictionary in its entirety.
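A minimal sketch of that writer (the name write_datafile is made up), assuming the same 5-tuple layout and the colon-delimited format from the question:

def write_datafile(filename, data):
    # rebuild each line in the original colon-delimited format,
    # zero-padding the ID to 4 digits as in the sample data
    with open(filename, 'w') as out:
        for stu_id, (name, email, some_number, course_count, courses) in data.items():
            out.write('{:04d}:{}:{}:{}:{}[{}]\n'.format(
                stu_id, name, email, some_number, course_count, ','.join(courses)))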
EDIT:
You can make the dictionary from a data file with something like this:
filename = 'student_data.dat'
data = {}  # an empty dictionary to stuff the results in

# use a context manager to handle opening/closing the file...
with open(filename, 'r') as src:
    # loop through the lines
    for line in src:
        # strip any whitespace from the end and tokenize the line by ":"
        tokens = line.strip().split(':')
        # check it... (remove later)
        print(tokens)
        # gather the pieces, make conversions as necessary...
        stu_id = int(tokens[0])
        name = tokens[1]
        email = tokens[2]
        some_number = int(tokens[3])
        # splitting the number from the list of courses is a little complicated
        # you *could* do this more elegantly with regex, but for your level,
        # here is a simple way to find the "chop points" and split this up...
        last_blobs = tokens[4].split('[')
        course_count = int(last_blobs[0])
        course_list = last_blobs[1][:-1]  # everything except the last bracket
        # split up the courses by comma
        courses = course_list.split(',')
        # now stuff that into the dictionary...
        # a little sanity check:
        if data.get(stu_id):
            print(f'duplicate ID found: {stu_id}. OVERWRITING')
        data[stu_id] = (name,
                        email,
                        some_number,
                        course_count,
                        courses)

for key, value in data.items():
    print(key, value)
I've got something for you. What you want to do is find the student first and then delete the course, like this:
studentid = input("enter studentid")
course = input("enter the course to delete")

with open("studentDatabase.dat") as file:
    lines = file.readlines()

with open("studentDatabase.dat", "w") as file:
    for line in lines:
        if studentid in line:  # check if it's the right student
            line = line.replace(course, "")  # replace the course with nothing
        file.write(line)
You check whether you are looking at the correct student, and if so, write the line back without the course code. Hope you find it useful.
As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).
I think regex is what I need, but my class did not cover it, so I am confused about how to parse the file in order to extract and output the correct data into an organized table...
I am supposed to turn my text file from this
https://pastebin.com/ZM8EPu0p
and export it into a more readable format like this- example output is below
Here is what I have so far.
def readFile(court):
    csv_rows = []
    # read and split txt file into pages & chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
    for chunk in data_chunks:
        chunk = chunk.strip()  # .strip() removes useless spaces
        if chunk[:4].isnumeric():  # if first 4 characters are digits
            entry = None  # initialize an empty dictionary
        elif chunk.isspace() and entry:  # if we're on an empty chunk and the entry dict is not empty
            csv_rows.DictWriter(dialect="excel")  # turn csv_rows into needed output
            entry = {}
        else:
            # parse here?
            print(chunk)
    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")
It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a text file so you could parse it line by line. -- using https://www.w3schools.com/python/ref_file_readlines.asp
Then, I noticed that your data can be split in pages. You would need to prepare a lot of regular expressions, but you can start with one for identifying where each page starts. -- you may want to read this as your expression might get quite complicated: https://www.w3schools.com/python/python_regex.asp
The goal of this step is to collect all lines from a page in some container (might be a list, dict, whatever you find it suitable).
And afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no, file number and defendant".
And when you can get the data out in a reliable manner, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html. A rough sketch of these steps is below.
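A purely illustrative sketch of that pipeline; the page-header and row patterns here are invented and will need to be adapted to the actual file, and the columns follow the "no, file number, defendant" suggestion above:

import re
import pandas as pd

# hypothetical patterns -- adjust them to the real layout of court.txt
page_start = re.compile(r'^\s*PAGE\s+\d+')              # where a new page begins
row_pattern = re.compile(r'^\s*(\d+)\s+(\S+)\s+(.+)$')  # no, file number, defendant

pages, current = [], []
with open('court.txt') as f:
    for line in f:
        if page_start.match(line) and current:
            pages.append(current)  # close the previous page
            current = []
        current.append(line.rstrip('\n'))
if current:
    pages.append(current)

rows = []
for page in pages:
    for line in page:
        m = row_pattern.match(line)
        if m:
            rows.append({'no': m.group(1),
                         'file number': m.group(2),
                         'defendant': m.group(3)})

pd.DataFrame(rows).to_excel('court.xlsx', index=False)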
For every iteration of my for loop, I need to use the number of the iteration as the name for the file. For example, the goal is to save:
my first iteration in the first file,
my second iteration in the second file,
...
I use the numpy library for that, but my code doesn't give me the solution I need. In fact, my current code makes me enter the name of the file after each iteration. That is easy with 6 or 7 iterations, but I have 100 iterations, so it doesn't make sense:
for line, a in enumerate(Plaintxt_file):
    # instruction
    # result
    fileName = raw_input()
    if fileName != 'end':
        fileName = r'C:\\Users\\My_resul\\Win_My_Scripts\\' + fileName
        np.save(fileName + '.npy', Result)
ser.close()
I would be very grateful if you could help me.
Create your file name from the line number:
for line, a in enumerate(Plaintxt_file):
    fileName = r'C:\Users\My_resul\Win_My_Scripts\file_{}.npy'.format(line)
    np.save(fileName, Result)
This starts with the file name file_0.npy.
If you'd like to start with 1, specify the starting index in enumerate:
for line, a in enumerate(Plaintxt_file, 1):
Of course, this assumes you don't need line to start at 0 anywhere else.
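Side note: if the files should also sort correctly in a directory listing, zero-padding the number helps:

fileName = r'C:\Users\My_resul\Win_My_Scripts\file_{:03d}.npy'.format(line)
# produces file_000.npy, file_001.npy, ..., file_099.npy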
I'm not 100% sure what your issue is, but as far as I can tell, you just need some string formatting for the filename.
So, you want, say 100 files, each one created after an iteration. The easiest way to do this would probably be to use something like the following:
for line, a in enumerate(Plaintxt_file):
    # do work
    filename = "C:\\SaveDir\\OutputFile{0}.txt".format(line)
    np.save(filename, Result)
That won't be 100% accurate to your needs, but hopefully that will give you the idea.
If you're just after, say, 100 blank files with the naming scheme "0.npy", "1.npy", all the way up to "n-1.npy", a simple for loop would do the job (no need for numpy!):
n = 100
for i in range(n):
    open(str(i) + ".npy", 'a').close()
This loop runs for n iterations and spits out empty files with filenames corresponding to the current iteration.
If you do not care about the sequence of the files and you do not want the files from multiple runs of the loop to overwrite each other, you can use random unique IDs.
from uuid import uuid4
# ...
for a in Plaintxt_file:
    fileName = 'C:\\Users\\My_resul\\Win_My_Scripts\\file_{}.npy'.format(uuid4())
    np.save(fileName, Result)
Sidenote:
Do not use raw strings and escaped backslashes together.
It's either r"C:\path" or "C:\\path" - unless you want double backslashes in the path. I do not know if Windows likes them.
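A quick way to see the difference is to print all three variants:

print(r'C:\\Users\\x')  # raw string with escaped backslashes: C:\\Users\\x
print(r'C:\Users\x')    # raw string: C:\Users\x
print('C:\\Users\\x')   # escaped backslashes: C:\Users\x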
I am trying to scroll through a result file that one of our processes prints.
The objective is to look through various blocks and find a specific parameter. I tried to tackle this but can't find an efficient way that avoids parsing the file multiple times.
This is an example of the output file that I read:
ID:13123
Compound:xyz
... various parameters
RhPhase:abc
ID:543
Compound:lbm
... various parameters
ID:232355
Compound:dfs
... various parameters
RhPhase:cvb
I am looking for a specific ID that has an RhPhase in it, but since the file contains many more entries, I just want that specific ID. It may or may not have an RhPhase in it; if it has one, I get the value.
The only way I have figured out is to go through the whole file (which may be hundreds of blocks, to give an idea of the size), build a dictionary of every ID that has an RhPhase, and then, in a second pass, look up the value for the specific ID in that dictionary.
This feels so inefficient. I tried to do something different, but got stuck at how you mark lines while you scroll through them, so that I can tell Python: read each line; when you find the ID I want, continue reading; if you find RhPhase, get the value; otherwise stop at the next ID.
I am stuck here:
datafile = open("datafile.txt", "r")
for items in datafile.readlines():
    if "ID:543" in items:
        [read more lines]
        [if "RhPhase" in lines:]
        [    rhphase = lines ]
        [elif "ID:" in lines ]
        [    rhphase = None ]
        [    break ]
Once I find the ID, I don't know how to continue looking for the RhPhase string, or how to find the next ID: string and stop everything (because that would mean the ID does not have an associated RhPhase).
This would pass through the file once and just check for the specific ID, instead of parsing the whole thing and then doing a second pass.
Is it possible to do this, or am I stuck with the double parsing?
Usually, you solve this kind of thing with a simple state machine: you read the lines until you find your ID; then you put your reader into a special state that checks for the parameter you want to extract. In your case, you only have two states, ID not found and ID found, so a simple boolean is enough:
foundId = False
with open('datafile.txt', 'r') as datafile:
    for line in datafile:
        if foundId:
            if line.startswith('RhPhase'):
                print('Found RhPhase for ID 543:')
                print(line)
                # end reading the file
                break
            elif line.startswith('ID:'):
                print('Error: Found another ID without finding RhPhase first')
                break
        # if we haven’t found the ID yet, keep looking for it
        elif line.startswith('ID:543'):
            foundId = True
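If you need to look up other IDs as well, the same two-state logic wraps naturally into a small function; here is a sketch along those lines (the file name and field prefixes follow the example above):

def find_rhphase(path, wanted_id):
    # return the RhPhase line for the given ID, or None if the next
    # ID block starts before an RhPhase is found
    found_id = False
    with open(path, 'r') as datafile:
        for line in datafile:
            if found_id:
                if line.startswith('RhPhase'):
                    return line.strip()
                elif line.startswith('ID:'):
                    return None
            # exact match, so ID:543 does not also match ID:5432
            elif line.strip() == 'ID:' + wanted_id:
                found_id = True
    return None

print(find_rhphase('datafile.txt', '543'))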
I need to write a program like this:
Write a program that reads a .picasa.ini file and copies the pictures to new files, whose names are the identification numbers of the persons in those pictures (e.g. 8ff985a43603dbf8.jpg). If there are several persons in a picture, it makes several copies. If a person appears in several pictures, later copies override earlier ones; so if a person 8ff985a43603dbf8 appears in several pictures, only one file with this name will exist. You may presume that we have a simple .picasa.ini file.
I have an .ini that contains:
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),d5a2d2f6f0d7ccbc
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),2623af3d8cb8e040;rect64(58bf441388df9592),d85d127e5c45cdc2
backuphash=8108
...
Is this a good way to start this program?
for line in open(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'):
    if line.startswith('faces'):
        line.split()  # what must I do here to split out the face IDs?
Is there a better way to do this? Remember, the .jpg file must be created with a new name, so I think I should link the current .jpg file with the face ID.
Consider using ConfigParser. Then you will have to split each value by hand, as you describe.
import ConfigParser

config = ConfigParser.ConfigParser()
config.read(r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini')

imgs = []
for item in config.sections():
    imgs.append(config.get(item, 'faces'))
This is still work in progress. Just want to ask if it's correct.
edit:
Still don't know how to split the face IDs out of there. This split function really is a pain for me.
Suggestions:
Your lines don't start with 'faces', so your second line won't work the way you want it to. Depending on how the rest of the file looks, you might only need to check whether the line is empty or not at that point.
To get the information you need, first split at ',' and work from there.
Attempt at a solution: the elements you need seem to always have a ',' before them, so you can start by splitting at the ',' sign and taking everything from the 1-index element onwards ([1::]). Then, if what I am thinking is correct, you split those elements twice again: at the ";", taking the 0-index element, and at " ", again taking the 0-index element.
for line in open('thingy.ini'):
    if line != "\n":
        personelements = line.split(",")[1::]
        for person in personelements:
            personstring = person.split(";")[0].split(" ")[0]
            print personstring
works for me to get:
d5a2d2f6f0d7ccbc
2623af3d8cb8e040
d85d127e5c45cdc2
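For what it's worth, here is a regex-based sketch of the same extraction; it assumes, as in the sample above, that each face ID is a run of hex digits following a rect64(...) group and a comma:

import re

for line in open('thingy.ini'):
    if line.startswith('faces'):
        # each face ID follows a "rect64(...)," group
        for face_id in re.findall(r'rect64\([0-9a-f]+\),([0-9a-f]+)', line):
            print face_id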