Downloading data from a .txt file containing URLs with Python

I am currently trying to extract the raw data from a .txt file of 10 URLs, saving the raw data fetched from each line (URL) of the .txt file to its own file, and then to repeat the process to produce processed data (the same pages stripped of their HTML), using Python.
import commands

# RAW DATA
input = open('uri.txt', 'r')
counter_1 = 0
for line in input:
    counter_1 += 1
    if counter_1 < 11:
        filename = str(counter_1)
        print str(line)
        command = 'curl "' + str(line).rstrip('\n') + '" > ./rawData/' + filename
        output_1 = commands.getoutput(command)
input.close()
# PROCESSED DATA
counter_2 = 0
input = open('uri.txt', 'r')
for line in input:
    counter_2 += 1
    if counter_2 < 11:
        filename = str(counter_2) + '-processed'
        command = 'lynx -dump -force_html "' + str(line).rstrip('\n') + '" > ./processedData/' + filename
        print command
        output_2 = commands.getoutput(command)
input.close()
I am attempting to do all of this with one script. Can anyone help me refine my code so I can run it? It should go through the whole process once for each line in the .txt file; that is, I should end up with one raw and one processed file for every URL line in my .txt file.

Break your code up into functions; as it stands, the code is hard to read and debug. Make a function called get_raw() and a function called get_processed(). Then for your main loop you can do
for line in file:
    get_raw(line)
    get_processed(line)
or something similar. Also, avoid 'magic numbers' like counter < 11. Why is it 11? Is it the number of lines in the file? If it is, you can read the lines into a list and get the count with len() instead of hard-coding it.
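A minimal sketch of that structure, reusing the curl and lynx commands from the question (it assumes the rawData and processedData directories already exist):
import commands

def get_raw(url, name):
    # fetch the page with curl and store the raw HTML
    commands.getoutput('curl "%s" > ./rawData/%s' % (url, name))

def get_processed(url, name):
    # render the page to plain text with lynx
    commands.getoutput('lynx -dump -force_html "%s" > ./processedData/%s-processed' % (url, name))

with open('uri.txt') as urls:
    for number, line in enumerate(urls, 1):
        url = line.rstrip('\n')
        get_raw(url, str(number))
        get_processed(url, str(number))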

Related

Split a file in python - Faster way

I need to be able to split a huge file (10 GB) into multiple files. The only requirement is that the header from the original file has to be copied into every smaller file. I wrote a Python program to achieve this, but it is painstakingly slow. Is there a way to speed it up?
import os
import glob

directoryToLoadFrom = "c:\\directory\\"
directoryToWriteTo = "C:\\outputDirectory\\"
filesToRead = directoryToLoadFrom + "output*.csv"
listNoOfOutputFiles = sorted(glob.glob(filesToRead), key=os.path.getmtime)
splitLen = 100000
# For each file name
for filename in listNoOfOutputFiles:
    print('Currently working on')
    print(filename)
    entirePath, filenameWithExtension = os.path.split(filename)
    filenameOnly = filenameWithExtension.split(".")[0]    # just the filename
    extensionOnly = filenameWithExtension.split(".")[-1]  # just the extension
    with open(filename, 'r') as curFileContents:
        header_line = curFileContents.readline()
        filecnt = 1
        while True:
            curlineCnt = 0
            targetFileName = directoryToWriteTo + filenameOnly + "-" + str(filecnt) + "." + extensionOnly
            print('Writing to')
            print(targetFileName)
            outputFile = open(targetFileName, "w")
            outputFile.write(header_line)
            for line in curFileContents:
                outputFile.write(line)
                curlineCnt += 1
                if curlineCnt > splitLen:
                    break
            outputFile.close()  # close every chunk, not just the last one
            filecnt += 1
            if curlineCnt < splitLen:
                break
You can use multiprocessing to complete this task more quickly: divide the logic into smaller chunks, then execute those chunks in separate processes. For example, you can make a process for each new file you want to create. Read more in the multiprocessing documentation.
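A rough sketch of that idea, assuming one worker process per input file rather than per output chunk (split_file here is a hypothetical helper that re-implements the splitting loop from the question):
from multiprocessing import Pool
import glob
import os

def split_file(filename, split_len=100000, out_dir="C:\\outputDirectory\\"):
    # Copy the header into every chunk, then write up to split_len lines per chunk.
    base, ext = os.path.splitext(os.path.basename(filename))
    with open(filename) as src:
        header = src.readline()
        chunk = 1
        while True:
            lines = [line for line in (src.readline() for _ in range(split_len)) if line]
            if not lines:
                break
            target = os.path.join(out_dir, "%s-%d%s" % (base, chunk, ext))
            with open(target, "w") as out:
                out.write(header)
                out.writelines(lines)
            chunk += 1

if __name__ == "__main__":
    files = glob.glob("c:\\directory\\output*.csv")
    pool = Pool()                # one worker per CPU core by default
    pool.map(split_file, files)  # each input file is split in its own process
    pool.close()
    pool.join()
Whether this actually helps depends on whether the job is CPU-bound or disk-bound; on a single spinning disk, parallel writes can even be slower.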

Printing to a file in python

I have the program below, in which I am trying to convert text files into character unigrams (feature vectors) and write the output to a text file.
I am printing the output to the console and writing it to a text file at the same time; however, the console shows all the records, while the file ends up with only the last iteration over the filenames in allarticles.
Should I be using an array for rawcu?
My Code:
for fileName in allarticles:
    rawcu = [0.0] * 95
    out = open("CASIS-25fvs_rawcu.txt", "w")
    fileOpen = open(fileName)
    charFrequency = {}
    for line in fileOpen:
        for letter in line:
            if (ord(letter) > 31) and (ord(letter) < 127):
                rawcu[ord(letter) - 32] += 1.0
    print rawcu
    print >> out, rawcu
You opened the file in write mode, which truncates it on every pass through the loop, not in append mode. It must be:
open("CASIS-25fvs_rawcu.txt", "a")

How to parse logs and extract lines containing specific text strings?

I've got several hundred log files that I need to parse, searching for text strings. What I would like to do is run a Python script that opens every file in the current folder, parses it, and records the results in a new file named original_name_parsed_log_file.txt. I had the script working on a single file, but now I'm having some issues doing this for all files in the directory.
Below is what I have so far, but it's not working at the moment. Disregard the first def; I was playing around with changing font colors.
import os
import string
from ctypes import *

title = ' Log Parser '
windll.Kernel32.GetStdHandle.restype = c_ulong
h = windll.Kernel32.GetStdHandle(c_ulong(0xfffffff5))

def display_title_bar():
    windll.Kernel32.SetConsoleTextAttribute(h, 14)
    print '\n'
    print '*' * 75 + '\n'
    windll.Kernel32.SetConsoleTextAttribute(h, 13)
    print title.center(75, ' ')
    windll.Kernel32.SetConsoleTextAttribute(h, 14)
    print '\n' + '*' * 75 + '\n'
    windll.Kernel32.SetConsoleTextAttribute(h, 11)

def parse_files(search):
    for filename in os.listdir(os.getcwd()):
        newname = join(filename, '0_Parsed_Log_File.txt')
        with open(filename) as read:
            read.seek(0)
            # Search each line for the values; if found, append the line with spaces replaced by tabs to the new file.
            with open(newname, 'ab') as write:
                for line in read:
                    for val in search:
                        if val in line:
                            write.write(line.replace(' ', '\t'))
                            line = line[5:]
        read.close()
        write.close()
    print '\n\n' + 'Parsing Complete.'
    windll.Kernel32.SetConsoleTextAttribute(h, 15)

display_title_bar()
search = raw_input('Please enter search terms separated by commas: ').split(',')
parse_files(search)
This line is wrong:
newname = join(filename, '0_Parsed_Log_File.txt')
Use:
newname = "".join([filename, '0_Parsed_Log_File.txt'])
join is a string method that takes an iterable of the strings to be joined; a bare join is not defined anywhere in your script.
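For example, with a hypothetical log file named server1.log:
>>> "".join(["server1.log", "0_Parsed_Log_File.txt"])
'server1.log0_Parsed_Log_File.txt'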

Trying to make columns in text file from python

So far I have this: I opened the data file, built a list from the data, and printed the data I needed from the list in two columns correctly. It shows up fine in Python, but when I try to write it to a .txt file it all ends up on one line. I am not sure what to do so that it comes out in two columns in the new text file.
# open file
data = open("BigCoCompanyData.dat", "r")
data.readline()
# skip header and print number of employees
n = eval(data.readline())
print(n)
# read in employee information
longest = 0
# save phone list in text file
phoneFile = open("PhoneList.txt", "w")
for i in range(n):
    lineI = data.readline().split(",")
    nameLength = len(lineI[1]) + len(lineI[2])
    if nameLength > longest:
        longest = nameLength
    longest = longest + 5
    print((lineI[2].title() + ", " + lineI[1].title()).ljust(longest) +
          ("(" + lineI[-2][0:3] + ")" + lineI[-2][3:6] + "-" + lineI[-2][6:10]).rjust(14))
    phoneFile.write((lineI[2].title() + ", " + lineI[1].title()).ljust(longest) +
                    ("(" + lineI[-2][0:3] + ")" + lineI[-2][3:6] + "-" + lineI[-2][6:10]).rjust(14))
data.close()
# close the file
phoneFile.close()
phoneFile.write(...) simply writes the exact string you give it; each call appends its string right after the previous one, so unless you end each line with \n everything stays on one line.
phoneFile.write((lineI[2].title() + ", " + lineI[1].title()).ljust(longest) +
                ("(" + lineI[-2][0:3] + ")" + lineI[-2][3:6] + "-" + lineI[-2][6:10]).rjust(14) + '\n')

Why is my loop overwriting my file instead of writing after the existing text?

i = 1  # keep track of file number
directory = '/some/directory/'
for i in range(1, 5170):  # number of files in directory
    filename = directory + 'D' + str(i) + '.txt'
    input = open(filename)
    output = open('output.txt', 'w')
    input.readline()  # ignore first line
    for g in range(0, 7):  # write next seven lines to output.txt
        output.write(input.readline())
    output.write('\n')  # add newline to avoid mess
    output.close()
    input.close()
    i = i + 1
I have this code, and I am trying to take one file at a time and write it to output.txt, but when I go to attach the next file, my code overwrites what was already written. As a result, when the code completes I end up with only something like this:
dataA[5169]=26
dataB[5169]=0
dataC[5169]=y
dataD[5169]='something'
dataE[5169]=x
data_date[5169]=2012.06.02
Instead of the data from every file, 1 through 5169. Any tips on how to fix it?
You probably want to open output.txt before your for loop (and close it after). As the code is written, you overwrite output.txt every time you open it. (An alternative would be to open for appending, output = open('output.txt', 'a'), but that's definitely not the best way to do it here.)
Of course, these days it's better to use a context manager (with statement):
i = 1  # <-- this line is useless in the code you posted
directory = '/some/directory/'  # <-- os.path.join is better for this stuff
with open('output.txt', 'w') as output:
    for i in range(1, 5170):  # number of files in directory
        filename = directory + 'D' + str(i) + '.txt'
        with open(filename) as input:
            input.readline()  # ignore first line
            for g in range(0, 7):  # write next seven lines to output.txt
                output.write(input.readline())
            output.write('\n')  # add newline to avoid mess
        i = i + 1  # <-- also a useless line in the code you posted
Your issue is that you open the file in write mode; to append to the file, open it in append mode ('a') instead.
