After parsing, I've got a lot of urls that have unfortunately joined together in one line. It will take a long time to re-parse, so I ask if there is a method as one long line with Url to turn into a multiple lines - 1 Url per line?
What i have:
'https:// url1.com/bla1','https:// url1.com/bla2',..thousands of urls..,'https:// url999.com/blaN'
What i need:
'https:// url1.com/bla-1',
'https:// url1.com/bla-2',
etc
'https:// url999.com/bla-N'
I've already tried to uncheck Line breaks in Python - Wrapping and Braces and check Ensure right margin is not exceeded - doesn't work
So how can i fix it?
Yes.
First set Code->Style->Wrapping and Braces->Method parameters/Method call arguments to wrap if long or chop down if long.
After that simply call reformat code on the line (Command+Alt+L).
Let's try a simple method, if I understand your query correctly. Read the first file, replace commas with newline character, and write the result to the same file.
urlsfile = open('test1.txt', 'r+') # in case you are getting the data from file itself
urls = urlsfile.readline()
urlsfile.close()
newlines = urls.replace(",", "\n") # otherwise replace newlines with the variable name that you are trying to write to the file
newfile = open('test1.txt','w+')
newfile.write(newlines)
newfile.close()
Related
I'm aware this is a much discussed topic and even though there are similar questions I haven't found one that covers my particular case.
I have a csv file that is as follows:
alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"
It has more lines but this is the trouble line. If you notice, the fifth field has within it several quotes and commas, which is also the separator. The quotes are also single instead of double quotes which are normally used to signal a quote character that should be kept in the field. What this is doing is splitting this last field into several when reading with pandas.read_csv() method, which throws an error of extra fields. I've tried several configurations and parameters regarding quoting in pandas.read_csv() but none works...
The csv is badly formatted, I just wanted to know if there is a way to still read it, even if using a roundabout way or it really is just hopeless.
Edit: This can happen to more than one column and I never know in which column(s) this may happen
Thank you for your help.
I think I've got what you're looking for, at least I hope.
You can read the file as regular, creating a list of the lines in the csv file.
Then iterate through the lines variable and split it into 4 parts, since you have 4 columns in the csv.
with open("test.csv", "r") as f:
lines = f.readlines()
for item in lines:
new_ls = item.strip().split(",", 4)
for new_item in new_ls:
print(new_item)
Now you can iterate through each lines' column item and do whatever you have/want to do.
If all your lines fields are consistently enclosed in quotes, you can try to split the line on ",", and to remove the initial and terminating quote. The current line is correctly separated with:
row = line.strip('"').split('","', 4)
But because of the incorrect formatting of your initial file, you will have to manually control it matches all the lines...
Can't post a comment so just making a post:
One option is to escape the internal quotes / commas, or use a regex.
Also, pandas.read_csv has a quoting parameter where you can adjust how it reacts to quotes, which might be useful.
I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer, this is just a piece of document. The aim with my code is to remove all lines where "loc. " is written so that just the extracts remain. My target can be also seen as removing the line which is just before the blank line.
My code so far look like this:
f = open('clippings_export.txt','r', encoding='utf-8')
message = f.read()
line=message[0:400]
f.close()
key=["l","o","c","."," "]
for i in range(0,len(line)-5):
if line[i]==key[0]:
if line[i+1]==key[1]:
if line[i + 2]==key[2]:
if line[i + 3]==key[3]:
if line[i + 4]==key[4]:
The last if finds exactly the position (indices) where each "loc. " is located in file. Nevertheless, after this stage I do not know how to go back in the line so that the code catches where the line starts, and it can be completely remove. What could I do next? Do you recommend me another way to remove this line?
Thanks in advance!
I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of getting whole file from read() function, read the file line by line (using the readlines() function for example). You can then check if it contains your key and omit it if it does.
Since the result is now list of strings, you might want to merge it: str.join().
Here I used another list to store desired lines, you can also use "more pythonic" filter() or list comprehension (example in similar question I mentioned below).
f = open('clippings_export.txt','r', encoding='utf-8')
lines = f.readlines()
f.close()
filtered_lines = []
for line in lines:
if "loc." in line:
continue
else:
filtered_lines.append(line)
result = ""
result = result.join(filtered_lines)
By the way, I thought it might be a duplicate - Here's question about the opposite (that is wanting lines which contain the key).
I'm very new to programming and am working on some code to extract data from a bunch of text files. I've been able to do this however the data is not useful to me in Excel. Therefore, I would like to print it all on a single line and separate it by a special character, which I can then delimit in Excel.
Here is my code:
import os
data=['Find me','find you', 'find us']
with open('C:\\Users\\Documents\\File.txt', 'r') as inF:
for line in inF:
for a in data:
string=a
if string in line:
print (line,end='*') #print on same line
inF.close()
So basically what I'm doing is finding if a keyword is on that line and then printing that line if it is.
Even though I have print(,end='*'), I don't get the print on a single line. It outputs:
Find me
*find you
*find us
Where is the problem? (I'm using Python 3.5.1)
Your immediate problem is that you're not removing the newline characters from your lines before printing them. The usual way to do this is with strip(), eg:
print(line.strip(), end='*')
You'll also print multiple copies of the line if more than one of your special phrases appear in the line. To avoid that, add a break statement after your print, or (better, but a more advanced construct that might not make sense until you're used to generator expressions) use if any(keyword in line for keyword in data):
You also don't need to explicitly close the input file - the point of the with open(...) as ...: context manager is that it closes the file when exiting it.
And I would avoid using string as a variable name - it doesn't tell anyone anything about what the variable is used for, and it can cause confusion if you end up using the built-in string module for anything. It's not as bad as shadowing a built-in constructor like list, but it's worth avoiding. Especially since it does nothing for you here, you can just use if a in line: here if you don't want to use the any() version above.
In addition to all that, if your data is not extremely large (and I hope it's not if you're trying to fit it all on one line) you'll get tidier code and avoid the trailing delimiter by using the .join() method on strings, eg something like:
import os
data=['Find me','find you', 'find us']
with open('C:\\Users\\Documents\\File.txt', 'r') as inF:
print "*".join(line.strip() for line in inF if any(keyword in line for keyword in data))
I am trying to use a very basic text file as a settings file. Three lines repeat in this order/format that govern some settings/input for my program. Text file is as follows:
Facebook
1#3#5#2
Header1#Header2#Header3#Header4
...
This is read in using the following Python code:
f = open('settings.txt', 'r')
for row in f:
platform = f.readline()
rows_to_keep = int(f.readline().split('#'))
row_headers = f.readline().split('#')
clean_output(rows_to_keep, row_headers, platform)
I would expect single string to be read in platform, an array of ints in the second and an array of strings in the third. These are then passed to the function and this is repeated numerous times.
However, the following three things are happening:
Int doesn't convert and I get a TypeError
First line in text file is ignored and I get rows to keep in platform
\n at the end of each line
I suspect these are related and so am only posting one question.
You cannot call int on a list, you need do do some kind of list comprehension like
rows_to_keep = [int(a) for a in f.readline().split('#')]
You're reading a line, then reading another line from the file. You should either do some kind of slicing (see Python how to read N number of lines at a time) or call a function with the three lines after every third iteration.
use .strip() to remove end of lines and other whitespace.
Try this:
with open('settings.txt', 'r') as f:
platform, rows_to_keep, row_headers = f.read().splitlines()
rows_to_keep = [int(x) for x in rows_to_keep.split('#')]
row_headers = row_headers.split('#')
clean_output(rows_to_keep, row_headers, platform)
There are several things going on here. First, when you do the split on the second line, you're trying to cast a list to type int. That won't work. You can, instead, use map.
rows_to_keep = map(int,f.readline().strip().split("#"))
Additionally, you see the strip() method above. That removes trailing whitespace chars from your line, ie: \n.
Try that change and also using strip() on each readline() call.
With as few changes as possible, I've attempted to solve your issues and show you where you went wrong. #Daniel's answer is how I would personally solve the issues.
f = open('settings.txt', 'r')
#See 1. We remove the unnecessary for loop
platform = f.readline()
#See 4. We make sure there are no unwanted leading or trailing characters by stripping them out
rows_to_keep = f.readline().strip().split('#')
#See 3. The enumerate function creates a list of pairs [index, value]
for row in enumerate(rows_to_keep):
rows_to_keep[row[0]] = int(row[1])
row_headers = f.readline().strip().split('#')
#See 2. We close the file when we're done reading
f.close()
clean_output(rows_to_keep, row_headers, platform)
You don't need (and don't want) a for loop on f, as well as calls to readline. You should pick one or the other.
You need to close f with f.close().
You cannot convert a list to an int, you want to convert the elements in the list to int. This can be accomplished with a for loop.
You probably want to call .strip to get rid of trailing newlines.
I have an html file that i am retrieving just the body of text
i would like to print one single line
right now i am print
for line in newName.body(text=True):
print line
this gives me everything in the body
what i would like is to print like
for line in newName.body(text=True):
print line[257:_____] # this is where i need help
instead of __ or choosing another number to end i want it to go to the newline character
so it looks like
for line in newName.body(text=True):
print line[257:'\n']
however that dosent work
how can i make that work?
the text which i am working in is located in
body
pre
the text i want
/pre
/body
You could use .partition() method to get the first line:
first_line = newName.body.getText().partition("\n")[0]
assuming newName is a BeautifulSoup object. It is usually named soup.
To get text from the first <pre> tag in the html:
text = soup.pre.string
To get a list of lines in the text:
list_of_lines = text.splitlines()
If you want to keep end of line markers in the text:
list_of_lines = text.splitlines(True)
To get i-th line from the list:
ith_line = list_of_lines[i]
note: zero-based indexing e.g., i = 2 corresponds to the 3rd line.
There is no guarantee that your HTML file has more than one line. The web page may be laid out in lines, but the structure of the page doesn't have to match the structure of the markup and vice versa.
Just to be sure, try this:
print len(newName.body(text=True).split('\n'))
If the value is >1, then you should be able to get the line you need like:
newName.body(text=True).split('\n')[257]
Maybe not the most graceful way, but it works, if there are in fact multiple lines.
Is it that you want line[127:line.find('\n')] as you are sure it's from 127 then equally you must be sure there's a \n.