Write to CSV file after re.sub in Python

I have done a regex substitution on a CSV file that produces the following output:
H1,H2,H3
A1,GG,98
B3,KLK,Oe
But when I write it to a CSV file, the complete line ends up in one cell (commas are not used as delimiters even though the delimiter is specified). I used writer.writerow(row.split("\n")) to write, where row is the data obtained from re.sub (i.e. the output posted above).

From the docs:
A row must be a sequence of strings or numbers
You are passing a whole block of rows, not a single row. Split the data into lines, then split each line by commas:
for line in row.split('\n'):
    writer.writerow(line.split(','))
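Put together, a minimal runnable sketch of the fix (the data string below stands in for the actual re.sub output, and out.csv is a placeholder file name):

```python
import csv

# Stand-in for the string produced by re.sub
data = "H1,H2,H3\nA1,GG,98\nB3,KLK,Oe"

# newline="" prevents csv from inserting blank lines on Windows
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for line in data.split("\n"):
        writer.writerow(line.split(","))
```

Note that since the data is already comma-separated text, f.write(data + '\n') would produce the same file here; csv.writer only pays off once fields can themselves contain commas or quotes.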

'\n' not sending file.write() to next line using seek() while looping

My problem is a simple one (too simple...). I am opening a new text file via with and attempting to write each row from a pandas.DataFrame to the file. Specifically, I'm trying to place column entries at very specific character positions on each line, as that is the required format for the people receiving my file.
df represents my pandas.DataFrame in the code below.
with open(os.path.join(a_directory_var, 'folder/myfile.txt'), 'x') as file:
    for index, row in df.iterrows():
        file.seek(1)
        file.write(row['col1'])
        file.seek(56)
        file.write('|')
        file.seek(61)
        file.write(row['col2'])
        file.seek(76)
        file.write('|')
        file.seek(81)
        file.write('col3')
        file.seek(96)
        file.write('|\n')
Expected Output:
I expected my last line to place a pipe, and send file to the next line with '\n', so that the next call to file.write() would begin writing entries to the next line.
Actual Output: Characters from each row are written over one another on the first line, over and over again. It may be worth noting that the resulting text file does have an empty second line.
In summation, I'm simply trying to write to a line, go to the next, write to that line, go to the next, etc, etc.
It looks like you're trying to write a fixed-width column format, with additional | characters as separators. The underlying problem is that file.seek(n) moves to an absolute byte offset from the start of the file, not to a column on the current line, so every iteration seeks back to the same positions and overwrites the previous row. Since fixed-width output is not a simple built-in option in Pandas (df.to_csv(fp, sep='|') changes the separator but does not pad columns), you have to iterate over the rows, as you do, and write them one by one. But don't write each part separately: format each line using Python string formatting, then write it in a single call.
For example, something like this should get close to what you want (give or take a slight offset due to me not counting properly):
sep = "|"
with open(os.path.join(a_directory_var, 'folder/myfile.txt'), 'x') as fp:
    for index, row in df.iterrows():
        fp.write("{:<55s}{:<5s}{:<15s}{:<5s}{:<15s}{:s}\n".format(
            row['col1'], sep, row['col2'], sep, row['col3'], sep))

Tuple and CSV Reader in Python

Attempting something relatively simple.
First, I have dictionary with tuples as keys as follows:
(0,1,1,0): "Index 1"
I'm reading in a CSV file which has a corresponding set of fields with various combinations of those zeroes and ones. So for example, the row in the CSV may read 0,1,1,0 without any quoting. I'm trying to match the combination of zeroes and ones in the file to the keys of the dictionary. Using the standard CSV module for this
However, the issue is that the zeroes and ones are being read in as strings rather than integers (the single quotes are just how Python displays strings). In other words, the tuple created from each row is structured as ('0', '1', '1', '0'), which does not match (0, 1, 1, 0).
Can anyone shed some light on how to bring the CSV in and convert the values to integers? Tuple matching and CSV reading seem to work -- I just need to straighten out the data format.
Thanks!
Convert each value to an integer when building the tuple:
tuple(int(x) for x in ('0', '1', '1', '0'))  # returns (0, 1, 1, 0)
So, if your CSV reader object is called csv_reader, you just need a loop like this:
for row in csv_reader:
    tup = tuple(int(x) for x in row)
    # ...
When you read in the CSV file, depending on what library you're using, you can specify the delimiter.
Typically, the comma is interpreted as the delimiter. Perhaps you can specify the delimiter to be something else, e.g. '-', so that the set of digits is read together as a single string, which you can then convert to a tuple using a variety of methods, such as ast.literal_eval as mentioned in converting string to tuple.
Hope that helps!
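Putting the first answer together into a runnable sketch (io.StringIO and the two-entry dictionary below are made-up stand-ins for the real file and mapping):

```python
import csv
import io

# Hypothetical stand-ins for the real dictionary and CSV file
lookup = {(0, 1, 1, 0): "Index 1", (1, 0, 0, 1): "Index 2"}
data = io.StringIO("0,1,1,0\n1,0,0,1\n")

results = []
for row in csv.reader(data):
    tup = tuple(int(x) for x in row)  # ('0', '1', '1', '0') -> (0, 1, 1, 0)
    results.append(lookup.get(tup, "no match"))

print(results)  # ['Index 1', 'Index 2']
```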

Python: Read in Data from File

I have to read data from a text file from the command line. It is not too difficult to read in each line, but I need a way to separate each part of the line.
The file contains the following in order for several hundred lines:
String (Sometimes more than 1 word)
Integer
String (Sometimes more than 1 word)
Integer
So for example the input could have:
Hello 5 Sample String 10
The current implementation I have for reading in each line is as follows... how can I modify it to separate it into what I want? I have tried splitting the line, but I always end up getting only one character of the first string this way with no integers or any part of the second string.
with open(sys.argv[1], "r") as f:
    for line in f:
        print(line)
The desired output would be:
Hello
5
Sample String
10
and so on for each line in the file. There could be thousands of lines in the file. I just need to separate each part so I can work with them separately.
The program can't magically split the lines the way you want; you will need to read in one line at a time and parse it yourself based on the format.
Since there are two integers and an indeterminate number of (what I assume are) space-delimited words, you can use a regular expression to find the integers and then use them as delimiters to split up the line.
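As a sketch of that idea, re.split with a capturing group keeps the matched integers in the result, so one pattern both finds the numbers and splits on them (the sample line is taken from the question):

```python
import re

line = "Hello 5 Sample String 10"

# Split on runs of digits; the capturing group keeps the digits
# themselves in the result, and \s* swallows surrounding spaces.
parts = [p for p in re.split(r"\s*(\d+)\s*", line) if p]
print(parts)  # ['Hello', '5', 'Sample String', '10']
```

Each line then alternates word-group, integer, word-group, integer, and int() can convert the numeric entries where needed.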

Retain only specified content in a string

I have data in the following form in a file:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established</text\u003e\n______<sha1\u003eqwjfowt5my8t6yuszdb88k2ehskjuh0</sha1\u003e\n____</revision\u003e\n__</page\u003e\n__<page\u003e\n____<title\u003ePortal:Tropical_cyclones/Anniversaries/August_22</title\u003e\n____<ns\u003e100</ns\u003e\n____<id\u003e7957689</id\u003e\n____<revision\u003e\n______<id\u003e446349886</id\u003e\n______<timestamp\u003e2011-08-23T17:38:19Z</timestamp\u003e\n______<contributor\u003e\n________<username\u003eLightbot</username\u003e\n________<id\u003e7178666</id\u003e\n______</contributor\u003e\n______<comment\u003eDelink_non-obscure_units._Conversions._Report_bugs_to_[[User_talk:Lightmouse>.
The delimiter in the above file is a tab (\t), i.e. string1 is separated from abc:string2 by \t, and similarly for the rest of the strings.
Now I want to retain just letters, numbers, /, :, . and _ within the strings which are enclosed within <>. I want to remove all the characters apart from the specified ones from the strings which are enclosed in <>.
Is there some way by which I may achieve this using Linux commands or Python? I want to replace all the unwanted characters by an underscore.
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established_text_u003e_n_______sha1_u003eqwjfowt5my8t6yuszdb88k2ehskjuh0_sha1_u003e_n_____revision_u003e_n___/page_u003e_n___page_u003e_n_____title_u003ePortal:Tropical_cyclones/Anniversaries/August_22_/title_u003e_n_____ns_u003e100_/ns_u003e_n_____id_u003e7957689_/id_u003e_n_____revision_u003e_n_______id_u003e446349886_/id_u003e_n_______timestamp_u003e2011-08-23T17:38:19Z_/timestamp_u003e_n_______contributor_u003e_n_________username_u003eLightbot_/username_u003e_n_________id_u003e7178666_/id_u003e_n_______/contributor_u003e_n_______comment_u003eDelink_non-obscure_units._Conversions._Report_bugs_to___User_talk:Lightmouse>.
Is there some way by which I may achieve this?
You can probably achieve this just with UNIX tools and some crazy regular expression, but I would write a small Python script for this:
Open two files (input and output) with open()
Iterate over the input file line by line: for line in input_file:
Split the line at tabs: for part in line.split('\t'):
Check if a part is enclosed in <>: if part.startswith('<') and part.endswith('>'):
Replace unwanted characters with a regular expression: filtered_part = re.sub(r'[^a-zA-Z0-9/:._<>]', '_', part)
Join the filtered parts back together: filtered_line = '\t'.join(filtered_parts)
Write the filtered line to the output file: output_file.write(filtered_line + '\n')
Following this outline, it should be easy for you to write a working script.
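A minimal sketch following that outline, assuming the character whitelist and tab-delimited layout from the question (the sample input and the filter_line name are made up for illustration):

```python
import re

def filter_line(line):
    """Replace unwanted characters inside <...> parts with underscores."""
    filtered_parts = []
    for part in line.rstrip("\n").split("\t"):
        if part.startswith("<") and part.endswith(">"):
            # Keep letters, digits, /, :, ., _ and the enclosing <>;
            # everything else becomes an underscore
            part = re.sub(r"[^a-zA-Z0-9/:._<>]", "_", part)
        filtered_parts.append(part)
    return "\t".join(filtered_parts)

print(filter_line("<a b#c>\tabc:string2\t<x=y>"))
```

Reading and writing the files then reduces to looping over the input file and writing filter_line(line) + '\n' to the output file for each line.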

Compile and split a string from file using python

How can I compile a string from selected rows of a file, run some operations on the string and then split that string back to the original rows into that same file?
I only need certain rows of the file; I cannot apply the operations to the other parts of the file. I have made a class that separates these rows from the file and runs the operations on them, but I'm thinking it would be even faster to run these operations on a single string containing the parts of the file that the operations apply to...
Or, if I could run these operations on a whole dictionary, that would help too. The operations are string replacements and regex replacements.
I am using python 3.3
Edit:
I'm going to explain this in greater detail here since my original post was so vague (thank you Paolo for pointing that out).
For instance, if I would like to fix a SubRipper (.srt-file), which is a common subtitle file, I would take something like this as an input (this is from an actual srt-file):
Here you can find correct example, submitting the file contents here messes newlines:
http://pastebin.com/ZdWUpNZ2
...And then I would only fix those rows which have the actual subtitle lines, not those ordering number rows or those hide/show rows of the subtitle file. So my compiled string might be:
"They're up on that ridge.|They got us pinned down."
Then I would run operations on that string. Then I would have to save those rows back to the file. How can I get those subtitle rows back into my original file after they are fixed? I could split my compiled and fixed string using "|" as a row delimiter and put them back to the original file, but how can I be certain what row goes where?
You can use pysrt to edit SubRip files:
from pysrt import SubRipFile

subs = SubRipFile.open('some/file.srt')
for sub in subs:
    # do something with sub.text
    pass

# save changes to a new file
subs.save('other/path.srt', encoding='utf-8')
