I am trying to replace only exact string matches in a csv file, using another csv as a dictionary.
This is my code:
import re
text = open("input.csv", "r", encoding="ISO-8859-1")
replacelist = open("replace.csv", "r", encoding="ISO-8859-1").readlines()
for r in replacelist:
    r = r.split(",")
    text = ''.join([i for i in text]) \
        .replace(r[0], r[1])
    print({r[0]})
    print({r[1]})
x = open("new.csv", "w")
x.writelines(text)
x.close()
Is it possible to use the replace method to replace only exact-match strings? Should I import and use re.sub() instead of replace?
input.csv example
ciao123;xxxxx;0
ciao12345;xxzzx;2
replace.csv example
ciao123,ok
aaaa,no
bbb,cc
Only the first line in input.csv should be replaced.
Well, as per your comments, your task is much simpler and you don't need to play with regex at all!
Basically, you are trying to replace something in a csv column if it is an exact word match. If that is the case, you should not be treating it as raw text; treat it as column data.
If you do so, you could use an example like the one below:
text = open("input.csv", "r", encoding="ISO-8859-1").readlines()
replacelist = open("replace.csv", "r", encoding="ISO-8859-1").readlines()
# make a replace word dictionary with O(n) time complexity
# (strip the trailing newline so it does not end up in the output)
replace_data = {i.split(',')[0]: i.split(',')[1].strip() for i in replacelist}
# Now treat data in input.csv as tabular data to replace the words
# Start another loop of O(n) time complexity
for idx, line in enumerate(text):
    line_lis = line.split(';')
    if line_lis[0] in replace_data:
        # only replace the word if it is meant to be replaced
        line_lis[0] = replace_data.get(line_lis[0])
        text[idx] = ';'.join(line_lis)
# write results
with open("new.csv", "w") as f:
    f.writelines(text)
The result would be:
ok;xxxxx;0
ciao12345;xxzzx;2
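As for the re.sub() part of the question: if you do want to stay with the raw-text approach, a word-boundary pattern is one way to get exact matches. A minimal sketch (my own, not part of the answer above; it reuses the replace_data dictionary built there):
import re

def replace_exact(text, replace_data):
    # \b matches a word boundary, so 'ciao123' will not also
    # match inside 'ciao12345'
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, replace_data)) + r')\b')
    return pattern.sub(lambda m: replace_data[m.group(0)], text)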
I got a csv file 'svclist.csv' which contains a single column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip each line of everything except the PL5 directory name and the two digits in the last directory,
so it should look like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2}).
For an explanation, just copy the regex and paste it at https://regexr.com. That will explain how the regex works, and you can make the required changes.
import re

test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(f'{str(row)[1:-1]}\n')
In the above code, I've created one empty list to hold the tuples. Then, I'm writing to the file. I need to remove the () at the start and end; this can be done via str(row)[1:-1], which slices the string.
Then, I'm using a formatted string to write the content into 'outfile.txt'.
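Note that str(row)[1:-1] keeps the quote characters around each element, so the lines come out as 'PL5', '00'. If you want plain PL5,00 lines instead, joining the captured groups is a small variation (mine, not part of the original answer):
# alternative: join the captured groups directly, avoiding the
# quotes that str(row)[1:-1] leaves around each element
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(','.join(row) + '\n')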
You can use regex for this (in general, when trying to extract a pattern, this might be a good option):
import re

# compile the pattern so findall can be called on it below
pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks if "profile" is in the line (this is the same as __contains__), then extracts the last two digits according to the pattern.
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # splits the string at the underscores
output = test_list[1].strip(
    "abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz"))  # strips letters from both ends
try:
    int(output)  # test whether any non-digit characters are left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')
I am currently trying to extract information from a text file using Python. I want to extract a subset from the file, everywhere it occurs, and store it in a separate file. To give you an idea of what my file looks like, here is a sample:
C","datatype":"double","value":25.71,"measurement":"Temperature","timestamp":1573039331258250},
{"unit":"%RH","datatype":"double","value":66.09,"measurement":"Humidity","timestamp":1573039331258250}]
Here, I want to extract "value" and the corresponding number beside it. I have tried various techniques but have been unsuccessful. I tried to iterate through the file and stop where I have "value", but that did not work.
Here is a sample of the code:
with open("DOTemp.txt") as openfile:
for line in openfile:
for part in line.split():
if "value" in part:
print(part)
A simple solution to return the value marked by the "value" key:
with open("DOTemp.txt") as openfile:
for line in openfile:
line = line.replace('"', '')
for part in line.split(','):
if "value" in part:
print(part.split(':')[1])
Note that by default str.split() splits on whitespace. In the last line, if we printed element zero of the list it would just be "value". If you wish to use this as an int or float, simply cast it as such and return it.
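For instance, a hypothetical helper along those lines (the name and wrapper are mine, not from the answer above):
def extract_value(part):
    # part looks like 'value:25.71' once the quotes have been removed
    return float(part.split(':')[1])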
First split using , (comma) as the delimiter, then split the corresponding strings using : as the delimiter.
If required, trim the leading and trailing quotes, then compare with "value".
The code below will work for you:
file1 = open("untitled.txt", "r")
data = file1.readlines()
# Convert to a single string
val = ""
for d in data:
    val = val + d
# split the string at commas
comma_splitted = val.split(',')
# find the required float
for element in comma_splitted:
    if 'value' in element:
        out = element.split('"value":')[1]
        print(float(out))
I assume your input file is a JSON string (a list of dictionaries), looking at the file sample. If that's the case, perhaps you can try this:
import json

# Assuming the whole file is one JSON array of dictionaries
with open("DOTemp.txt") as openfile:
    records = json.loads(openfile.read())
out_lines = list(map(lambda d: str(d.get('value')), records))
with open('DOTemp_out.txt', 'w') as outfile:
    outfile.write("\n".join(out_lines))
I'm attempting to anonymize a file so that all the content except certain keywords are replaced with gibberish, but the format is kept the same (including punctuation, length of string and capitalization). For example:
I am testing this, check it out! This is a keyword: long
Wow, another line.
should turn in to:
T ad ehistmg ptrs, erovj qo giw! Tgds ar o qpyeogf: long
Yeg, rmbjthe yadn.
I am attempting to do this in Python, but I'm having no luck finding a solution. I have tried replacing via tokenization and writing to another file, but without much success.
Initially let's disregard the fact that we have to preserve some keywords. We will fix that later.
The easiest way to perform this kind of 1-to-1 mapping is to use the method str.translate. The string module also contains constants that contain all ASCII lowercase and uppercase characters, and random.shuffle can be used to obtain a random permutation.
import string
import random

random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
random.shuffle(random_caps)
random.shuffle(random_lows)

all_random_chars = ''.join(random_lows + random_caps)
translation_table = str.maketrans(string.ascii_letters, all_random_chars)

with open('the-file-i-want.txt', 'r') as f:
    contents = f.read()

translated_contents = contents.translate(translation_table)

with open('the-file-i-want.txt', 'w') as f:
    f.write(translated_contents)
In Python 2, str.maketrans is a function in the string module instead of a static method of str.
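For reference, a minimal sketch of the Python 2 spelling (assuming the same all_random_chars as above):
import string
# Python 2 only: maketrans lives in the string module
translation_table = string.maketrans(string.ascii_letters, all_random_chars)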
The translation_table is a mapping from characters to characters, so it will map every single ASCII character to another one. The translate method simply applies this table to each character in the string.
Important note: the above method is actually reversible, because each letter is mapped to a unique other letter. This means that, using a simple analysis of symbol frequencies, it is possible to reverse it.
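Indeed, anyone holding the table can invert it mechanically. A minimal sketch (reusing translation_table, contents and translated_contents from the block above):
# str.maketrans returns a dict of ordinals, so swapping keys and
# values yields the inverse table
inverse_table = {v: k for k, v in translation_table.items()}
assert translated_contents.translate(inverse_table) == contents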
If you want to make this harder or impossible, you could re-create the translation_table for every line:
import string
import random

random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)

with open('the-file-i-want.txt', 'r') as f:
    translated_lines = []
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        translated_lines.append(line.translate(translation_table))

with open('the-file-i-want.txt', 'w') as f:
    f.writelines(translated_lines)
Also note that you could translate and save the file line by line:
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        o.write(line.translate(translation_table))
This means you can translate huge files with this code, as long as the lines themselves are not insanely long.
The code above scrambles all the characters without taking the keywords into account.
The simplest way to handle that requirement is to check, for each line, whether one of the keywords occurs and "reinsert" it there:
import re
import string
import random

random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)

keywords = ['long']  # add all the possible keywords in this list
keyword_regex = re.compile('|'.join(re.escape(word) for word in keywords))

with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        matches = keyword_regex.finditer(line)
        translated_line = list(line.translate(translation_table))
        for match in matches:
            translated_line[match.start():match.end()] = match.group()
        o.write(''.join(translated_line))
Sample usage (using the version that preserves keywords):
$ echo 'I am testing this, check it out! This is a keyword: long
Wow, another line.' > the-file-i-want.txt
$ python3 trans.py
$ cat output.txt
M vy hoahitc hfia, ufoum ih pzh! Hfia ia v modjpel: long
Ltj, fstkwzb hdsz.
Note how long is preserved.
I'm parsing a very big csv (big = tens of gigabytes) file in python and I need only the value of the first column of every line. I wrote this code, wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv', 'r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more efficient way to get the part of the string before the first delimiter?
Edit: I do know about the csv module (and I have used it occasionally), but I do not need to load every line of this file into memory - I need only the first column. So let's focus on string parsing.
>>> a = '123456'
>>> print a.split('2', 1)[0]
1
>>> print a.split('4', 1)[0]
123
>>>
But, if you're dealing with a CSV file, then:
import csv

with open('some.csv') as fin:
    for row in csv.reader(fin):
        print int(row[0])
And the csv module will handle quoted columns containing quotes etc...
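For instance, a quoted first column containing the delimiter survives intact (a quick illustrative session; the sample row is mine):
>>> import csv
>>> next(csv.reader(['"1,234",foo']))
['1,234', 'foo']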
If the first field can't have an escaped delimiter in it, such as in your case where the first field is an integer, and there are no embedded newlines in any field (i.e., each row corresponds to exactly one physical line in the file), then the csv module is overkill and you could use your code from the question, or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them, you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep:  # the line has ',' in it
            process_id(int(first))  # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
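A quick illustration of the difference (session mine):
>>> 'no delimiter here'.split(',', 1)[0]
'no delimiter here'
>>> 'no delimiter here'.partition(',')
('no delimiter here', '', '')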
The 'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e., the newline is either '\n' or '\r\n'.
Personally, I would do it with generators:
from itertools import imap  # Python 2; on Python 3, use the built-in map
import csv

def int_of_0(x):
    return int(x[0])

def obtain(filepath, treat):
    with open(filepath, 'rb') as f:
        for i in imap(treat, csv.reader(f)):
            yield i

for x in obtain('essai.txt', int_of_0):
    pass  # instructions
I have a dictionary that searches for an ID name and reads tokens after it. But I want to know if there is a way to read and print out the whole line that contains that ID name as well.
Here is what I have so far:
import csv
import re
from collections import defaultdict

lookup = defaultdict(list)
wholelookup = defaultdict(list)
mydata = open('summaryfile.txt')
for line in csv.reader(mydata, delimiter='\t'):
    code = re.match(r'[a-z](\d+)[a-z]', line[-1], re.I)
    if code:
        lookup[line[-2]].append(code.group(1))
        wholelookup[line[-2]].append(code.group(0))
Your code calls csv.reader(), which returns a parsed version of the whole line. In my test, this returns a list of values. If this list of values will do for the "whole line", then you can save that.
You have a line where you append to something called wholelookup. I think you want to just save line there instead of code.group(0). code.group(0) returns everything matched by the regular expression, which here is just the matched prefix of line[-1].
So maybe put this line in your code:
wholelookup[line[-2]].append(line)
Or maybe you need to join together the values from line to make a single string:
s = ' '.join(line)
wholelookup[line[-2]].append(s)
If you want the whole line, not the parsed version, then do something like this:
lookup = defaultdict(list)
wholelookup = defaultdict(list)
pat = re.compile(r'[a-z](\d+)[a-z]', re.I)

with open('summaryfile.txt') as mydata:
    for s_line in mydata:
        values = s_line.split('\t')
        code = re.match(pat, values[-1])
        if code:
            lookup[values[-2]].append(code.group(1))
            wholelookup[values[-2]].append(s_line)
This example pre-compiles the pattern for the slight speed advantage.
If you have enough memory, the easiest way is to simply save the lines in another defaultdict:
wholeline = defaultdict(list)
...
idname = line[-2]
wholeline[idname].append(line)
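Put together with the loop from the question, a minimal sketch (assuming the same tab-delimited summaryfile.txt format):
import csv
import re
from collections import defaultdict

wholeline = defaultdict(list)
with open('summaryfile.txt') as mydata:
    for line in csv.reader(mydata, delimiter='\t'):
        code = re.match(r'[a-z](\d+)[a-z]', line[-1], re.I)
        if code:
            idname = line[-2]
            wholeline[idname].append(line)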