Attempting to save webscraping to csv file with no luck - python

The result is printed correctly, but the csv file stops at the first iteration and repeats itself.
Here is the code:
**with open('stocknews.csv','w') as new_file:
    csv_writer = csv.writer(new_file, delimiter=' ')
    csv_reader = csv.reader('stocknews.csv')
    i = 0
    lenght = len(soup.find_all('div',{'class':'eachStory'}))**
**    for i in range(lenght):
        print(i+1, ")")
        headlines += [soup.find_all('div',{'class':'eachStory'})[i].find_all('a')[-1].text]
        descriptions += [soup.find_all('div',{'class':'eachStory'})[i].find_all('p')[0].text]
        print(headlines[i])
        print(descriptions[i])**
**        i += 1
        print(i)
    for i in csv_reader:
        csv_writer.writerow(['headlines','descriptions'])
        csv_writer.writerow([headlines, descriptions])**
I'm pretty sure the problem lies within the last few lines, i.e. the csv_writer.writerow calls. I've tried many things but never managed to save to CSV correctly.

This isn't really an answer (there are a lot of things I don't quite understand about your code, and I can't test it without any data to work with).
However,
for i in csv_reader:
    csv_writer.writerow(['headlines','descriptions'])
    csv_writer.writerow([headlines, descriptions])
This is a loop over csv_reader, but what are you actually looping over? Normally you loop over some collection, such as the lines of a text file which you have read (I have never seen this exact construction before). As far as I can tell, you are looping over the csv_reader itself, and that reader was built from the filename string 'stocknews.csv' rather than an open file object, so it never reads the file's contents.
This would be more typical:
with open('stocknews.csv') as f:
    lines = f.readlines()
for line in lines:
    pass  # do something
I have no idea why you have double asterisks sprinkled around your code. I would suggest you break this down and step through each line carefully: is it doing what you think it is doing? Is each intermediate step correct? There isn't a lot of code here, but 1) is the reading of the file working? 2) are the headlines and descriptions being read as you expect? 3) once you have the headlines and descriptions, are they being written out correctly?
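For reference, here is a minimal sketch of the usual write pattern, assuming headlines and descriptions are already-populated lists of equal length (the sample values below are placeholders): write the header row once before the loop, then one row per story.

import csv

# Placeholder data standing in for the scraped results.
headlines = ["Headline one", "Headline two"]
descriptions = ["First description", "Second description"]

with open('stocknews.csv', 'w', newline='') as new_file:
    csv_writer = csv.writer(new_file)
    csv_writer.writerow(['headline', 'description'])       # header, written once
    for headline, description in zip(headlines, descriptions):
        csv_writer.writerow([headline, description])       # one row per story

Note there is no csv.reader involved at all: writing the file only needs the writer.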

Related

Read csv and output to multiple csv files depending on criteria, nested conditions with more than 20 elements

I have a very large csv file which looks like this:
Column1;Column2
01;BE
02;ED
12;FD
14;DS
03;ED
04;DF
Now I want to read this csv and depending on certain criteria I would like to export it to different multiple csv files.
My code is as follows:
import csv
import os

output_path = r'C:\myfolder\large_file.csv'
with open(os.path.join(os.path.dirname(output_path), "first_subset_total.csv"), "w", encoding="utf-8", newline='') as \
     out_01, open(os.path.join(os.path.dirname(output_path), "excluded_first.csv"), "w", encoding="utf-8", newline='') as \
     out_02, open(os.path.join(os.path.dirname(output_path), "pure_subset.csv"), "w", encoding="utf-8", newline='') as \
     out_03_a, open(os.path.join(os.path.dirname(output_path), "final_subset.csv"), "w", encoding="utf-8", newline='') as \
     out_04_b:
    cw01 = csv.writer(out_01, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw02 = csv.writer(out_02, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw03_a = csv.writer(out_03_a, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw04_b = csv.writer(out_04_b, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    with open(output_path, encoding="utf-8") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        header = next(cr)
        cw01.writerow(header)
        cw02.writerow(header)
        cw03_a.writerow(header)
        cw04_b.writerow(header)
        for line in cr:
            if (line[0][:2] == "01" and ...): cw01.writerow(line)
            if (line[0][:2] == "02"): cw02.writerow(line)
            if (line[0][:2] == "03" and ...): cw03_a.writerow(line)
            if (line[0][:2] == "04" and ...): cw04_b.writerow(line)
Now my problem is, first, that I have many if statements and more than four output files; some have subset notations like 04_a and 04_b. I show it here for four files, but there are actually more than 20, with the same number of if statements. So many, in fact, that I get a SyntaxError: too many statically nested blocks, because there are more than 20 nested conditions. My current workaround is to put the remaining conditions into another loop, which is not a good solution and is inefficient. I also doubt the readability of my code and the way I do things in general. So how can I do all of this more efficiently?
The problem?
So I am not sure I understand your problem. I assume that you originally went with some kind of if-else nesting that yielded the syntax error, and that the solution you present is your fix, but it is not as efficient as it could be since the conditions in each if are actually mutually exclusive. That means that if the first one is true, all the rest are false, yet you still check all of them.
Simple solution
If I understood the problem correctly, then the solution is simple: replace your ifs with elif. elif is the contraction of else and if (duh...) and lets you avoid big nested structures, as follows:
# ...
for line in cr:
    if (line[0][:2] == "01" and ...): cw01.writerow(line)
    elif (line[0][:2] == "02"): cw02.writerow(line)
    elif (line[0][:2] == "03" and ...): cw03_a.writerow(line)
    elif (line[0][:2] == "04" and ...): cw04_b.writerow(line)
It is true that this is still hard to read, but align your code nicely and it is already pretty acceptable, although I will admit it still leads to a lot of spaghetti code.
More complex solution (rework your code structure)
The way I see it, you actually have only two parameters to hardcode: your output file names and the related conditions. There is no way to avoid that. If we take a minimalist approach, these should be the only "pieces of spaghetti" in your code; all the other redundant lines can be avoided.
So I would start by defining those as some kind of iterable object at the start of the file, and then iterate over that list throughout the code instead of repeating the same line 20 times.
I don't think it's relevant for me to rewrite your code, but here are a few pointers that will give you good tools to do it well:
Your iterable could be one of the following: a nested list, a dict, a numpy array, a data class. I recommend a numpy array, which is a good compromise between ease of use and flexibility.
You can use Python lambdas as a way to store your conditions in the list.
You can use a context manager (with) with a variable number of files in only two lines using contextlib.ExitStack, as shown in this answer.
You can use break to exit the loop once you have found the condition that is true and move on faster to the next line.
So here is the idea (see the sketch after this list):
Write a 2-D array holding your conditions (as lambdas) and your file names.
Use a context manager to open all of your output files in two lines.
Iterate over your open files to get a list of CSV writers.
Use a context manager to open your input file (as you already do).
Iterate over your CSV writers to write the header.
For each line, iterate over your conditions and write the line with the relevant CSV writer (it's easy: it should be the one with the same index as your true condition).
(Optional) For a little extra speed, break out of the condition iteration after having found the one that is true.
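To make that concrete, here is a minimal sketch under the assumptions above. The file names and conditions are hypothetical stand-ins for the real ones, and the conditions are assumed to be mutually exclusive:

import csv
import os
from contextlib import ExitStack

input_path = r'C:\myfolder\large_file.csv'
out_dir = os.path.dirname(input_path)

# Hypothetical (file name, condition) pairs; each condition is a lambda
# taking the parsed row. Replace these with your real names and criteria.
targets = [
    ("first_subset_total.csv", lambda row: row[0][:2] == "01"),
    ("excluded_first.csv",     lambda row: row[0][:2] == "02"),
    ("pure_subset.csv",        lambda row: row[0][:2] == "03"),
    ("final_subset.csv",       lambda row: row[0][:2] == "04"),
]

with ExitStack() as stack:
    # Open every output file in one go; ExitStack closes them all on exit.
    writers = []
    for name, _ in targets:
        f = stack.enter_context(
            open(os.path.join(out_dir, name), "w", encoding="utf-8", newline=""))
        writers.append(csv.writer(f, delimiter=";", quoting=csv.QUOTE_MINIMAL))

    with open(input_path, encoding="utf-8") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        header = next(cr)
        for w in writers:          # one header per output file
            w.writerow(header)
        for line in cr:
            for w, (_, cond) in zip(writers, targets):
                if cond(line):
                    w.writerow(line)
                    break          # conditions are mutually exclusive

Adding a twenty-first output file is then one more entry in targets rather than another nested block.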

Reading large files in a loop

I'm having some trouble dealing with large text files (about 1GB), when I want to read them and use them in while loops.
More specifically: First I start by doing some parsing on the lines of the file, in order to find e.g. all lines that start with "x". In doing so, I add the indices of the found lines to a list (say l). This is the pre-processing part.
Now, in a while loop, I choose random indices from l and want to read the corresponding line (or, say, the 5 lines around it). So I need to keep the file available throughout the while loop, since a priori I do not know which lines I will end up reading (the line is randomly picked from l).
The problem is that when I open the file before my main loop, the reading succeeds during the first run of the loop, but from the second run on the file seems to have vanished from memory. What I have tried:
The pre-processing part:
for i, line in enumerate(filename):
    prep = ''.join(c for c in line if c.isalnum() or c.isspace())
    if 'x' in prep: l.append(i)
Now I have my l list. Loading the file in memory before the main loop:
with open(filename, 'r') as f:
    while (some condition):
        random_index = random.sample(range(0, len(l)), 1)
        out = open("out", "w")  # I will write the read line(s) here
        for i, line in enumerate(f):
            # (the lines to be read, starting from the given random index)
            if (i >= l[random_index]) and (i < l[random_index + 1]):
                out.write(line)
        out.close()
Only during the first run of the loop, things work properly.
Alternatively I also tried:
f = open(filename)
while (some condition):
    random_index = ...  # rest is same as above
Same issue; only the first run works. One thing that did work was putting f = open(filename) inside the loop, so the file is reopened on every run. But since it is a large file, that is really not a practical solution.
What am I doing wrong here?
How should such readings be done properly?
What am I doing wrong here?
This answer addresses the same problem: you can't read a file twice.
You open the file f outside of the while loop and read it completely, via for i, line in enumerate(f):, during the first iteration of the while loop. During the second iteration you can't read it again, since it has already been read.
How should such readings be done properly?
As noted in the linked answer:
To answer your question directly: once a file has been read with read(), you can use seek(0) to return the read cursor to the start of the file (docs are here).
That means that, to solve your problem, you can add f.seek(0) at the end of the while loop to move the pointer back to the start of the file after each iteration. Doing this, you can re-read the file from the start again.
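A minimal sketch of that fix, with hypothetical data (l stands in for the list of line indices found during pre-processing, and range(3) stands in for the while condition):

import random

l = [10, 250, 4000]  # example indices from the pre-processing step

with open("large_file.txt") as f:           # hypothetical input file
    for _ in range(3):
        start = random.choice(l)            # pick a random pre-computed index
        with open("out", "w") as out:
            for i, line in enumerate(f):
                if start <= i < start + 5:  # the 5 lines around the index
                    out.write(line)
        f.seek(0)  # rewind, so the next pass can read f again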

How to append list index Python

Please help, I can't find it.
Why can I append row but not row[2]? It crashes. I'm using Python 3.4.3.
import csv
with open("file.csv", encoding="UTF8") as csvfile:
    read = csv.reader(csvfile, delimiter=";")
    original = []
    for row in read:
        original.append(row[2])
csvfile.close()
print(original)
Thanks
This looks to be a frustrating debugging experience.
One possibility is that there's a last line in the file that has only one item which is causing the issue.
A quick way to look at the situation (depending on how long your file is) might be to throw in a print and see what's going on, line by line:
for row in read:
    try:
        original.append(row[2])
    except:
        print(row)
If you run with this, you may be able to see what happens just before the crash.
You may also want to be more descriptive about what the crash is; it's famously difficult to help with such a vague description. A little more effort will help people help you more effectively.
I would suggest you do not try to print the whole of your CSV list at the end; this can cause some IDEs to lock up for a long time.
Instead you could just print the last few entries to prove it has worked:
print("Rows read:", len(original))
print(original[-10:])
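If the culprit does turn out to be a short row (for example a trailing blank line), a sketch of a defensive version, assuming the same file, would skip any row without a third field:

import csv

with open("file.csv", encoding="UTF8") as csvfile:
    read = csv.reader(csvfile, delimiter=";")
    # Keep only rows that actually have a third column.
    original = [row[2] for row in read if len(row) > 2]
print("Rows read:", len(original))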

Python list of lists no loops

So, full disclosure, this is homework, but I am having a lot of difficulty with it. My professor has a rather particular challenge in one portion of the assignment that I can't quite seem to crack. Basically I'm trying to read a very large file and put it into a list of lists that represents a movie recommendation matrix. She says this can be done without a for loop and suggests using the readlines() method.
I've been running this code:
movMat = []
with open(u"movie-matrix.txt", 'r', encoding="ISO-8859-1") as f:
    movMat.append(f.readlines())
But when I run diff on the output, it is not equivalent to the original file. Any suggestions for how I should go about this?
Update: upon further analysis, I think my code is correct. I added this to make it a list of (index, line) tuples:
with open(u"movie-matrix.txt", 'r', encoding="ISO-8859-1") as f:
    movMat = list(enumerate(f.readlines()))
Update 2: since people seem to want the file I'm reading from, allow me to explain. This is a ranking system from 1 to 5. If a person has not ranked a movie, it is denoted with a ';'. This is the second line of the file:
"3;;;;3;;;;;;;;3;;;;;;;;;2;;;;;;;;3;;;;;;;;;;;;5;;;;;;;1;;;;;;;;;;;;;;;3;;;;;;;;3;;;;;;;;;;;4;;;;4;;;;;3;;;2;;;;;;;2;;;;;;;;3;;;;;;;;;;;;;;;;;;;;4;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;4;;;;;;;;;;;;;;;3;;;;3;;;4;2;;;;;;3;;;;;;4;;;;3;;;;;3;;;;;;;;;;;;2;;;;;;;;;;;;;;;3;4;;;;;;5;;;;;;;;;;;3;2;;;1;;;;;4;;;4;3;;;;;;;;;;;;4;3;;;;;;;;2;;3;;2;;;;;;;;;;;;;;;4;;;;;1;;2;;;;;;;;;;;;;;;;;;;5;;;;;;;;;;;;;;;;;4;;;;;;;;;;4;4;;;;2;3;;;;;;3;;4;;;;;;4;;;;;3;3;;;;;;1;;4;;;;;;;;;4;;;;;;;;;2;;;;3;;;;;;4;;;;;;;3;;;;;;;;4;;;;;4;;;;;;;;;;;1;;;;;;5;;;;;;;;;;;;4;;;3;;;;;;;;2;;1;;;;;;;;;4;;;;;;;;;;;;;;;3;;;;;;;;;;;5;;;;4;;;;;;;3;;;;;;;;2;;;;;;;;;;3;;;;;5;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;3;;;;;;;;;;;;;;;;;;2;;;3;4;;;;;3;;;;;4;;;;;;;;4;;4;3;;;;;4;;3;;;1;;3;;;;;2;;;;;;;;;;;4;;;;;;;;;;;3;;;;3;;;;;;;;;;;;;;;;;;;3;;;;4;;;;;;3;;;;;;;;;;;;4;;;;;;;;;;;3;;;;;;;;3;;;4;;4;;;;;;3;;;;;;;3;;;;;;;;;3;1;;;;;;;;;;;;;;;;3;;;;;3;5;;4;;;;;;4;;3;4;;;;;;;;3;;;;;;;;;;;3;;;;3;;;;;;;;;;;;;;4;;5;;;;;;;;;;;;;;;;;;4;;;;2;;2;;;;;;;;;;3;;;;;;4;;;3;;;4;;;;3;;;3;;;;;;;;;;;;;;;;;3;;;;;;;;3;;;;;;;;;;4;;;;;;;;;5"
I can't think of any case where f.readlines() would be better than just using f as an iterable. That is, for example,
with open('movie-matrix.txt', 'r', encoding="ISO-8859-1") as f:
    movMat = list(f)
(no reason to use the u'...' notation in Python 3 -- which you must be using, since built-in open takes encoding=...!-)
Yes, f.readlines() would be equivalent to list(f) -- but it's more verbose and less obvious, so, what's the point?!
Assuming you have to output this to another file, since you mention "running diff on the output", that would be
with open('other.txt', 'w', encoding="ISO-8859-1") as f:
    f.writelines(movMat)
no non-for-loop alternatives there:-).
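If the point of the exercise is also to split each line into its individual ratings, a sketch under the format shown in the question (one user per line, ';'-separated, an empty field meaning "not rated") can still avoid an explicit for loop by using map():

with open('movie-matrix.txt', 'r', encoding="ISO-8859-1") as f:
    # splitlines() drops the trailing newlines; map() applies the split
    # to every line without writing a for loop.
    movMat = list(map(lambda line: line.split(";"), f.read().splitlines()))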

Why am I getting "_csv.Error: newline inside string"?

There is one answer to this question:
Getting "newline inside string" while reading the csv file in Python?
But the accepted answer there didn't work for me.
If the answer in the link above doesn't work, and you have opened multiple files during the execution of your code, go back and make sure you closed all of your previous files when you were done with them.
I had a script that opened and processed multiple files. Then at the very end, it kept throwing a _csv.Error in the same manner that Amit Pal saw.
My code runs about 500 lines and has three stages where it processes multiple files in succession. Here's the section of code that gave the error. As you can see, the code is plain vanilla:
f = open('file.csv')
fread = csv.reader(f)
for row in fread:
    # do something
And the error was:
for row in fread:
_csv.Error: newline inside string
So I told the script to print each row as it was read. OK, that's not clear; here's what I did:
f = open('file.csv')
fread = csv.reader(f)
for row in fread:
    print(row)
    # do something
Interestingly, what printed was the LAST LINE from one of the previous files I had opened and processed.
What made this really weird was that I used different variable names, but apparently the data was stuck in a buffer or memory somewhere.
So I went back and made sure I closed all previously opened files and that solved my problem.
Hope this helps someone.
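A sketch of the takeaway, with hypothetical file names: letting with close each file as soon as its stage is done means no open handle (or buffered state) carries over into later processing.

import csv

for path in ("stage1.csv", "stage2.csv", "file.csv"):  # hypothetical names
    with open(path, newline="") as f:
        for row in csv.reader(f):
            pass  # process row
    # the file is guaranteed to be closed here, before the next one opens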
