Python: Replace string in a txt file but not on every occurrence

I am really new to Python and I need to change new article IDs to the old ones. The IDs are mapped inside a dict. The file I need to edit is a normal txt file where every column is separated by tabs. The problem is not replacing the values, but rather replacing only the occurrences in the desired column, which is set by pos.
I really would appreciate some help.
def replaceArtCol(filename, pos):
    with open(filename) as input_file, open('test.txt', 'w') as output_file:
        for each_line in input_file:
            val = each_line.split("\t")[pos]
            if val in artikel_ID:
                line = each_line.replace(val, artikel_ID[val])
                output_file.write(line)
This code replaces every occurrence of the string anywhere in the line, not just in the desired column.

Supposing your ID mapping dict looks like ID_mapping = {'old_id': 'new_id'}, I think your code is not far from working correctly. A modified version could look like:
with open(filename) as input_file, open('test.txt', 'w') as output_file:
    for each_line in input_file:
        line = each_line.split("\t")
        if line[pos] in ID_mapping:
            line[pos] = ID_mapping[line[pos]]
        line = '\t'.join(line)
        output_file.write(line)
If you're not working in pandas anyway, this can save a lot of overhead.
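For completeness, here is a minimal runnable sketch of how this could be wired up, assuming a mapping dict like the one above (the function name, file names, and column index are placeholders):
ID_mapping = {'12345': '67890'}  # hypothetical old-id -> new-id pairs

def replace_id_column(filename, pos):
    # rewrite the file, swapping IDs only in column `pos`
    with open(filename) as input_file, open('test.txt', 'w') as output_file:
        for each_line in input_file:
            line = each_line.split("\t")
            if line[pos] in ID_mapping:
                line[pos] = ID_mapping[line[pos]]
            output_file.write('\t'.join(line))

replace_id_column('artikel.txt', 2)  # hypothetical file and column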

If your data is tab-separated, you can load it into a pandas DataFrame; that way you get a proper column/row structure. What you are doing right now will not get you there without some complex and buggy logic. You may try these steps:
import pandas as pd
df = pd.read_csv("dummy.txt", sep="\t", encoding="latin-1")
df['desired_column_name'] = df['desired_column_name'].replace({"value_to_be_changed": "newvalue"})
print(df.head())
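To persist the change, you would presumably write the DataFrame back out as a tab-separated file; the output file name here is just a placeholder:
df.to_csv("dummy_out.txt", sep="\t", index=False, encoding="latin-1")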

Related

Reading from and Writing to CSV files

I am struggling with Python 2.7.10. I'm trying to create a program that will eventually open a CSV file, read numbers from the file, perform calculations with the numbers and write back to the CSV file.
The code (i.e. the calculations) is not finished; I just wanted to try a few small bits so I could start to identify problems. The data in the CSV file looks like this:
['110000,75000\n', '115000,72500\n', '105000,85250\n', '100000,70000']
One thing that I am having issues with is properly converting the CSV strings to numbers and then telling Python what row and column I want to use in the calculation; something like Row(0), Column(0) minus Row(1), Column(1).
I have tried a few different things but it seems to crash on the converting-to-numbers bit. The error message is "TypeError: int() argument must be a string or a number, not list" or "IOError: File not open for string", depending on what I have tried. Can someone point me in the right direction?
import csv

def main():
    my_file = open('InputData.csv', 'rU')
    #test = csv.writer(my_file, delimiter=',')
    file_contents = my_file.readlines()
    print file_contents
    for row in file_contents:
        print row
    #convert to numbers
    #val0 = int(file_contents.readlines(0))
    #val1 = int(file_contents.readlines(1))
    #val0 = int(my_file.readlines(0))
    #val1 = int(my_file.readlines(1))
    #perform calculation
    #valDiff = val1 - val0
    #append to third column, may need to be in write file mode, num to strings
    #file_contents.append
    my_file.close()

main()
The list file_contents now contains all of your CSV data, and readlines won't work on the list type. I would try
row0 = file_contents[0].split(",")
Which should give you the first row in list format. You should (and most likely will need to) put this in a loop to cover a file of any size. Then
val0 = int(row0[0])
should give you the value you want. But again I would make this iterative to save yourself some time and effort.
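As a hedged sketch, the iterative version might look something like this (the name data is made up):
data = []
for line in file_contents:
    # split each CSV line and convert every field to an int
    data.append([int(value) for value in line.split(',')])

difference = data[0][0] - data[1][1]  # e.g. Row(0) Column(0) minus Row(1) Column(1)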
Assuming that your file is in plain text format and that you do not want to use a third-party library like pandas, this would be the basic way to do it:
data = []
with open('InputData.csv', 'r') as my_file:
    for row in my_file:
        columns = row.split(',')  # split on commas; int() handles the trailing newline
        data.append([int(value) for value in columns])

print(data[0][0])  # row=0 col=0
print(data[0][1])  # row=0 col=1
I think this will do what you want:
import csv

def main(filename):
    # read entire csv file into memory
    with open(filename, 'rb') as my_file:
        reader = csv.reader(my_file, delimiter=',')
        file_contents = list(reader)
    # rewrite file adding a difference column
    with open(filename, 'wb') as my_file:
        writer = csv.writer(my_file, delimiter=',')
        for row in file_contents:
            val0, val1 = map(int, row)
            difference = val1 - val0
            #print(val0, val1, difference)
            writer.writerow([val0, val1, difference])

if __name__ == '__main__':
    main('InputData.csv')
Be careful when using this because it will rewrite the file. For testing and debugging, you might want to have it write the results to a second file with a different name.
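For testing, a hedged sketch of that two-file variant, reusing the csv import above (main_safe and the output file name are made-up names):
def main_safe(in_filename, out_filename):
    # read everything, then write the difference column to a *different* file
    with open(in_filename, 'rb') as in_file:
        file_contents = list(csv.reader(in_file, delimiter=','))
    with open(out_filename, 'wb') as out_file:
        writer = csv.writer(out_file, delimiter=',')
        for row in file_contents:
            val0, val1 = map(int, row)
            writer.writerow([val0, val1, val1 - val0])

main_safe('InputData.csv', 'OutputData.csv')  # input stays untouched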

Python convert TXT to CSV

I've been trying to convert a txt file to CSV but have been running into trouble.
My text document is in the following format:
POP Issue: key=u'VPER-242', id=u'167782'
POP Issue: key=u'TE-8', id=u'215771'
POP Issue: key=u'OUTDIAL-233', id=u'223166'
POP Issue: key=u'OUTDIAL-232', id=u'223047'
The goal is to throw this into a CSV file that looks like the following with 2 columns:
Name of issue
POP Issue: key=u'VPER-242'
POP Issue: key=u'TE-8'
POP Issue: key=u'OUTDIAL-233'
POP Issue: key=u'OUTDIAL-232'
Issue ID
id=u'167782'
id=u'215771'
id=u'223166'
id=u'223047'
Basically, I want to use the "," in the txt file as a delimiter to separate the two parts into columns. The following code has worked to get the column names at the top of my CSV as well as to split the lines, but the result is not in the right format and doesn't separate on ",".
import csv
import itertools

with open('newfile1.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line for line in stripped if line)
    grouped = itertools.izip(*[lines] * 2)
    with open('newfile1.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Name of Issue', 'Issue ID'))
        writer.writerows(grouped)
This is what the code outputs, which is close but not quite right. I don't want spaces, I need the Issue ID column to only have the id=u'number' data, and the Name of Issue column to only have the POP Issue data. Anyone have any suggestions? Thank you!
Name of Issue
POP Issue: key=u'VPER-242', id=u'167782'
POP Issue: key=u'TE-8', id=u'215771'
POP Issue: key=u'OUTDIAL-233', id=u'223166'
Issue ID
POP Issue: key=u'TE-8', id=u'215771'
POP Issue: key=u'OUTDIAL-232', id=u'223047'
POP Issue: key=u'OUTDIAL-229', id=u'222309'
Your code is using itertools.izip to zip the same generator with itself, so it pairs consecutive whole lines instead of splitting each line into its two fields. You need to split on the comma and then move forward.
import csv
txt_file = r"YourTextDocument.txt"
csv_file = r"NewProcessedDoc.csv"
in_txt = csv.reader(open(txt_file, "rb"), delimiter = ',')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)
print 'done! go check your NewProcessedDoc.csv file'
# You can insert new rows manually in your csv for the titles (Name of issue & Issue ID)
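If you'd rather write those title rows programmatically than manually (an assumption about your preference), write the header before the data rows, i.e. replace the out_csv.writerows(in_txt) line with:
out_csv.writerow(['Name of Issue', 'Issue ID'])  # header row first
out_csv.writerows(in_txt)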
EDITED: more details
Short answer:
Replace this:
grouped = itertools.izip(*[lines] * 2)
with this:
grouped = [line.split(',') for line in lines]
Longer answer:
Your "grouped" variable is containing pairs of duplicate lines (not what you wanted)
If your Input line doesn't contain any other comma (",") then str.split is your friend for this Mission.
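And if the leading space after each comma bothers you, a small refinement (a sketch, using the same lines generator as above) is to strip each field while splitting:
grouped = [[field.strip() for field in line.split(',')] for line in lines]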
Cheers

How to copy multiple rows and one column from one CSV file to another CSV Excel?

I am extremely new to Python (coding, for that matter).
Could I please get some help as to how I can achieve this? I have gone through numerous threads but nothing helped.
My input file looks like this:
I want my output file to look like this:
Just a replication of the first column, written twice in the second sheet, with a blank line after every 5 rows.
A .csv file can be opened with a normal text editor; do this and you'll see that the entries for each column are comma-separated (csv = comma-separated values). Most likely yours uses semicolons (;), though.
Since you're new to coding, I recommend trying it manually with a text editor first until you have the desired output, and then try to replicate it with python.
Also, you should post code examples here and ask specific questions about why it doesn't work like you expected it to work.
Below is a solution. Don't forget to configure the input/output files and the delimiter:
input_file = r'c:\Temp\input.csv'   # raw strings keep the backslashes literal
output_file = r'c:\Temp\output.csv'
delimiter = ';'

i = 0
output_data = ''
with open(input_file) as f:
    for line in f:
        i += 1
        # duplicate the line's content as an extra first column
        output_data += line.strip() + delimiter + line
        if i == 5:
            output_data += '\n'  # blank line after every 5 rows
            i = 0

with open(output_file, 'w') as file_:
    file_.write(output_data)
Python has a csv module for doing this. It is able to automatically read each row into a list of columns. It is then possible to simply take the first element and replicate it into the second column in an output file.
import csv

with open('input.csv', 'rb') as f_input:
    csv_input = csv.reader(f_input)
    input_rows = list(csv_input)

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    for line, row in enumerate(input_rows, start=1):
        csv_output.writerow([row[0], row[0]])
        if line % 5 == 0:
            csv_output.writerow([])
Note, it is not advisable to write the updated data directly over the input file: if there were a problem, you would lose your original file.
If your input file has multiple columns, this script will remove them and simply duplicate the first column.
By default, the csv format separates each column using a comma; this can be modified by specifying a desired delimiter as follows:
csv_output = csv.writer(f_output, delimiter=';')

Read and concatenate 3,000 files into a pandas data frame starting at a specific value

I have 3,000 .dat files that I am reading and concatenating into one pandas dataframe. They have the same format (4 columns, no header), except that some of them have a description at the beginning of the file while others don't. In order to concatenate those files, I need to get rid of those first rows before I concatenate them. The skiprows option of pandas.read_csv() doesn't apply here, because the number of rows to skip is very inconsistent from one file to another (by the way, I use pandas.read_csv() and not pandas.read_table() because the files are comma-separated).
However, the first value after the rows I am trying to omit is identical for all 3,000 files. This value is "2004", which is the first data point of my dataset.
Is there an equivalent to skiprows where I could say something such as "start reading the file at '2004' and skip everything before that" (for each of the 3,000 files)?
I am really out of luck at this point and would appreciate some help.
Thank you!
You could just loop through the lines and skip everything until you hit the line that starts with 2004.
Something like ...
keep = False
with open(filename) as f:
    for line in f:
        keep = keep or line.startswith('2004')
        if not keep:
            continue
        # whatever else you need here
Probably not worth trying to be clever here; if you have a handy criterion, you might as well use it to figure out what skiprows is, i.e. something like
import pandas as pd
import csv

def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            if row and row[0] == "2004":  # guard against empty rows
                return i

for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
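Spelling out the collect-and-concatenate step from that last comment, a minimal sketch (error handling omitted; filenames is assumed to be defined as above):
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename, skiprows=find_skip(filename), header=None))
combined = pd.concat(dfs, ignore_index=True)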
Use a skip_to() helper that advances the file handle to the first line starting with the target text, then hand the already-positioned handle to pandas:
import pandas as pd

def skip_to(f, text):
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False
        if line.startswith(text):
            f.seek(last_pos)
            return True

with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)

How to read a text file into a list or an array with Python

I am trying to read the lines of a text file into a list or array in python. I just need to be able to individually access any item in the list or array after it is created.
The text file is formatted as follows:
0,0,200,0,53,1,0,255,...,0.
Where the ... is above, the actual text file has hundreds or thousands more items.
I'm using the following code to try to read the file into a list:
text_file = open("filename.dat", "r")
lines = text_file.readlines()
print lines
print len(lines)
text_file.close()
The output I get is:
['0,0,200,0,53,1,0,255,...,0.']
1
Apparently it is reading the entire file into a list of just one item, rather than a list of individual items. What am I doing wrong?
You will have to split your string into a list of values using split(). So,
lines = text_file.read().split(',')
EDIT:
I didn't realise there would be so much traction to this. Here's a more idiomatic approach.
import csv

with open('filename.csv', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        print(row)  # do something with each row
You can also use numpy loadtxt like
from numpy import loadtxt
lines = loadtxt("filename.dat", comments="#", delimiter=",", unpack=False)
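For a single-line file like the one shown, loadtxt returns a 1-D numpy array, so (a small sketch) you can index it directly:
print(lines[0])     # first value in the file
print(lines.shape)  # e.g. (N,) where N is the number of values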
So you want to create a list of lists... We need to start with an empty list
list_of_lists = []
next, we read the file content, line by line
with open('data') as f:
    for line in f:
        inner_list = [elt.strip() for elt in line.split(',')]
        # alternatively, if you need to use the file content as numbers:
        # inner_list = [int(elt.strip()) for elt in line.split(',')]
        list_of_lists.append(inner_list)
A common use case is that of columnar data, but our units of storage are the rows of the file, which we have read one by one, so you may want to transpose your list of lists. This can be done with the following idiom:
by_cols = zip(*list_of_lists)
Another common use is to give a name to each column
col_names = ('apples sold', 'pears sold', 'apples revenue', 'pears revenue')
by_names = {}
for i, col_name in enumerate(col_names):
    by_names[col_name] = by_cols[i]
so that you can operate on homogeneous data items
mean_apple_prices = [money/fruits for money, fruits in
                     zip(by_names['apples revenue'], by_names['apples sold'])]
Most of what I've written can be sped up using the csv module from the standard library. Another third-party module is pandas, which lets you automate most aspects of a typical data analysis (but it has a number of dependencies).
Update While in Python 2 zip(*list_of_lists) returns a different (transposed) list of lists, in Python 3 the situation has changed and zip(*list_of_lists) returns a zip object that is not subscriptable.
If you need indexed access you can use
by_cols = list(zip(*list_of_lists))
that gives you a list of lists in both versions of Python.
On the other hand, if you don't need indexed access and what you want is just to build a dictionary indexed by column names, a zip object is just fine...
file = open('some_data.csv')
names = get_names(next(file))  # get_names: your own helper that parses the header line
columns = zip(*((x.strip() for x in line.split(',')) for line in file))
d = {}
for name, column in zip(names, columns):
    d[name] = column
This question is asking how to read the comma-separated value contents from a file into an iterable list:
0,0,200,0,53,1,0,255,...,0.
The easiest way to do this is with the csv module as follows:
import csv

with open('filename.dat', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
Now, still inside the with block (the file must stay open while you read from it), you can easily iterate over spamreader like this:
    for row in spamreader:
        print(', '.join(row))
See documentation for more examples.
I'm a bit late, but you can also read the text file into a DataFrame and then convert the corresponding column to a list.
import pandas as pd
lista = pd.read_csv('path_to_textfile.txt', sep=",", header=None)[0].tolist()
Example:
lista = pd.read_csv('data/holdout.txt', sep=',', header=None)[0].tolist()
Note: the column names of the resulting dataframe will be integers, and I chose 0 because I was extracting only the first column.
Better this way:
def txt_to_lst(file_path):
    try:
        with open(file_path, "r") as f:  # the with-block closes the file for us
            lines = f.read().split('\n')
        return lines
    except Exception as e:
        print(e)
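A quick usage sketch (the file name is just a placeholder):
lines = txt_to_lst('filename.dat')
print(lines[0])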
