Text file manipulation with Python - python

First off, I am very new to Python. When I started this it seemed very simple, but I am at a complete loss.
I want to take a text file with as many as 90k entries and put each data group on a single line, separated by ';'. My examples are below. Keep in mind that the groups of data vary in size: they could have two entries or 100 entries.
Raw Data
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
Formatted Data
group1;data;
group2;data;data;data;
group3;data;data;data;data;data;data;data;data;data;data;data;data;
group4;data;data;

Try something like the following. (Untested... you can learn a bit of Python by debugging!)
Create a Python file "parser.py":
import sys
with open('filename.txt', 'r') as f:
    for line in f:
        txt = line.strip()
        if txt == '':
            # a blank line ends the current group
            sys.stdout.write('\n')
        else:
            sys.stdout.write(txt + ';')
sys.stdout.write('\n')
Then, in a shell, type:
python parser.py > output.txt
and see if output.txt is what you want.

Assuming the groups are separated with an empty line, you can use the following one-liner:
>>> print("\n".join([item.replace('\n', ';') for item in open('file.txt').read().split('\n\n')]))
group1;data
group2;data;data;data
group3;data;data;data;data;data;data;data;data;data;data;data;data
group4;data;data;
where file.txt contains
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
First the file content (open().read()) is split on empty lines with split('\n\n') to produce a list of blocks. Then, in each block ([item ... for item in list]), newlines are replaced with semicolons. Finally, all blocks are joined with a newline ("\n".join(list)) and printed.
Note that this is not production-quality code: it is the kind of thing you would write for an interactive, one-off data transformation, not in a production-level script.

What have you tried? What is the text file for, and where does it come from? File manipulation is one of the last "basic" things I plan on learning. I'm saving it for when I understand the nuances of for loops, while loops, dictionaries, lists, appending, and a million other handy functions out there. That's after 2-3 months of research, coding and creating GUIs, by the way.
Anyway, here are some basic suggestions.
';'.join(group) puts a ";" between the items of group, effectively creating one long semicolon-delimited string.
group.replace("SPACE CHARACTER", ";") replaces every occurrence of the specified character (a space, a newline, etc.) within group with a semicolon.
There are a lot of other methods, such as loading the text file in a Python script, using .append(), or putting the groups into lists, dictionaries, or matrices.

These are my bits to throw on the problem:
from collections import defaultdict
import codecs
import csv
res = defaultdict(list)
cgroup = ''
with codecs.open('tmp.txt', encoding='UTF-8') as f:
    for line in f:
        if line.startswith('group'):
            cgroup = line.strip()
            continue
        res[cgroup].append(line.strip())
with codecs.open('out.txt', 'w', encoding='UTF-8') as f:
    w = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for k in res:
        w.writerow([k,] + res[k])
Let me explain why I did things the way I did. First, I used the codecs module to open the data file with an explicit codec, since data should always be decoded properly rather than by guessing what it might be. Then I used a defaultdict, which is well documented online, because it is more Pythonic, at least according to Mr. Hettinger. It is one of the patterns that can be unlearned if you use Python.
Finally, I used a csv writer to generate the output, because writing CSV files is not as easy as one might think. To get the data into a correct CSV format, it is better to use what many eyes have already reviewed instead of reinventing the wheel.
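For comparison, here is a minimal Python 3 sketch of the same grouping idea without the csv module. It assumes, as the code above does, that every group header starts with the literal text 'group'. The names join_groups and is_header are made up for illustration and do not come from any answer in this thread.

```python
def join_groups(lines, is_header=lambda s: s.startswith("group")):
    """Yield one ';'-terminated string per group of lines.

    Assumption: a header is any line for which is_header() returns True.
    The default matches the sample data, where headers start with 'group'.
    """
    current = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        if is_header(text) and current:
            # a new header closes the group collected so far
            yield ";".join(current) + ";"
            current = []
        current.append(text)
    if current:
        yield ";".join(current) + ";"


sample = ["group1\n", "data\n", "group2\n", "data\n", "data\n", "data\n"]
for row in join_groups(sample):
    print(row)
# prints:
# group1;data;
# group2;data;data;data;
```

Because join_groups is a generator, it can be fed an open file object directly and only ever holds one group in memory, which should be comfortable for a 90k-line file.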

Related

How can I read and search multiple textfiles so that I can store a list of files that match my search?

I hope you can help out a new learner of Python. I could not find my problem in other questions, but if so: apologies. What I basically want to do is this:
Read a large number of text files and search each for a number of string terms.
If the search terms are matched, store the corresponding file name to a new file called "filelist", so that I can tell the good files from the bad files.
Export "filelist" to Excel or CSV.
Here is the code that I have so far:
# textfiles all contain only simple text e.g. "6 Apples"
import os
import re
import pandas as pd
filelist = []
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, encoding="Latin1") as f:
        fine = f.read()
    if re.search('APPLES', fine) or re.search('ORANGE', fine) or re.search('BANANA', fine):
        filelist.append(file)
listoffiles = pd.DataFrame(filelist)
writer = pd.ExcelWriter('ListofFiles.xlsx', engine='xlsxwriter')
listoffiles.to_excel(writer, sheet_name='welcome', index=False)
writer.save()
print(filelist)
Questions:
Surely there is a more elegant or time-efficient way? I need to do this for a large number of files :D
Related to the former, is there a way to read in the files using pandas? Or is that less time-efficient? For me as a Stata user, having a dataframe feels a bit more like home...
I added the "Latin1" option because some characters in the raw data cause encoding conflicts. Is there a way to find out which characters are causing the problem? Can I get rid of them easily, e.g. by cutting off the first line beforehand (skiprow maybe)?
Just a couple of things to speed up the script:
1.) Compile your regex beforehand, not every time in the loop (also use | to combine multiple strings into one regex!).
2.) Read files line by line, not all at once!
3.) Use any() to stop searching at the first match.
For example:
import re
import os
filelist = []
r = re.compile(r'APPLES|ORANGE|BANANA')  # you can add flags=re.I for case insensitive search
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, 'r', encoding='latin1') as f:
        if any(r.search(line) for line in f):  # read files line by line, not all content at once
            filelist.append(file)  # add to list
# convert list to pandas, etc...
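The three suggestions can be put together in a small self-contained sketch that also covers the export step using only the standard library. The function names find_matching_files and write_file_list, the 'filename' column header and the default latin1 encoding are assumptions for illustration, not part of the original answer.

```python
import csv
import os
import re


def find_matching_files(directory, pattern, encoding="latin1"):
    """Return the sorted names of files that contain a line matching pattern."""
    regex = re.compile(pattern)  # compiled once, outside the loop
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, encoding=encoding) as f:
            # any() stops reading the file at the first matching line
            if any(regex.search(line) for line in f):
                matches.append(name)
    return matches


def write_file_list(names, out_path):
    """Write the file names to a one-column CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        writer.writerows([name] for name in names)
```

A call such as write_file_list(find_matching_files('C:/mydirectory/', r'APPLES|ORANGE|BANANA'), 'filelist.csv') would then produce a CSV file that Excel can open directly.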

Labelling and Grouping Postcodes using Python

I'm fairly new to Python and I am attempting to group various postcodes together under predefined labels. For example "SA31" would be labelled a "HywelDDAPostcode"
I have some code where I read lots of postcodes from a single-column file into a list and compare them with postcodes in predefined lists. However, when I output my postcode labels, only the label "UKPostcodes" is output for every postcode in my original file. It would appear that the first two conditions in my code always evaluate to false no matter what. Am I doing the right thing using "in"? Or perhaps it's a file reading issue? I'm not sure.
The input file is simply a file which contains a list of postcodes (in reality it has thousands of rows).
The CSV file
Here is my code:
import csv
with open('postcodes.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)
my_list =[]
HywelDDAPostcodes=["SA46","SY23","SY24","SA18","SA16","SA43","SA31","SA65","SA61","SA62","SA17","SA48","SA40","SA19","SA20","SA44","SA15","SA14","SA73","SA32","SA67","SA45",
"SA38","SA42","SA41","SA72","SA71","SA69","SA68","SA33","SA70","SY25","SA34","LL40","LL42","LL36","SY18","SY17","SY20","SY16","LD6"]
NationalPostcodes=["LL58","LL59","LL60","LL61","LL62","LL63","LL64","LL65","LL66","LL67","LL68","LL69","LL70","LL71","LL72","LL73","LL74","LL75","LL76","LL77","LL78",
"NP1","NP2","NP23","NP3","CF31","CF32","CF33","CF34","CF35","CF36","CF3","CF46","CF81","CF82","CF83","SA35","SA39","SA4","SA47","LL16","LL18","LL21","LL22","LL24","LL25","LL26","LL27","LL28","LL29","LL30","LL31","LL32","LL33","LL34","LL57","CH7","LL11","LL15","LL16","LL17","LL18","LL19","LL20","LL21","LL22","CH1","CH4","CH5","CH6","CH7","LL12","CF1","CF32","CF35","CF5","CF61","CF62","CF63","CF64","CF71","LL23","LL37","LL38","LL39","LL41","LL43","LL44","LL45","LL46","LL47","LL48","LL49","LL51","LL52","LL53","LL54","LL55","LL56","LL57","CF46","CF47","CF48","NP4","NP5","NP6","NP7","SA10","SA11","SA12","SA13","SA8","CF3","NP10","NP19","NP20","NP9","SA36","SA37","SA63","SA64","SA66","CF44","CF48","HR3","HR5","LD1","LD2","LD3","LD4","LD5","LD7","LD8","NP8","SY10","SY15","SY19","SY21","SY22","SY5","CF37","CF38","CF39","CF4","CF40","CF41","CF42","CF43","CF45","CF72","SA1","SA2","SA3","SA4","SA5","SA6","SA7","SA1","NP4","NP44","NP6","LL13","LL14","SY13","SY14"]
NationalPostcodes2= list(dict.fromkeys(NationalPostcodes))
labels=["HywelDDA","NationalPostcodes","UKPostcodes"]
for postcode in your_list:
    #print(postcode)
    if postcode in HywelDDAPostcodes:
        my_list.append(labels[0])
    if postcode in NationalPostcodes2:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])
with open('DiscretisedPostcodes.csv','w') as result_file:
    wr = csv.writer(result_file, dialect='excel')
    for item in my_list:
        wr.writerow([item,])
If anyone has any advice as to what could be causing the issue or just any advice surrounding Python, in general, I would very much appreciate it. Thank you!
The reason your comparison block isn't working is that when you use csv.reader to read your file, each line is added to your_list as a list. So you are building a list of lists, and when you compare those lists with strings they don't match.
['LL58'] == 'LL58' # False
So, inspect your_list and see what I mean. You should create an empty your_list before you read the file and append each value you read to it. Then inspect that to make sure it looks good. It would also behoove you to use strip() to remove whitespace from each item; the csv reader does not strip it for you by default.
Also, a better structure for membership testing is a set instead of a list. in works for lists, but it is MUCH faster for sets, so I would put your comparison items into sets.
Lastly, it isn't clear what you are trying to do with NationalPostcodes2. Just use your NationalPostcodes, but put them in a set with {}.
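A minimal sketch of these fixes, using shortened sample sets rather than the full lists from the question. The function name label_postcodes is hypothetical. It also uses elif for the second test: in the question's code the second test is a separate if, so once the comparison works a HywelDDA postcode would receive a "UKPostcodes" label as well.

```python
import csv
import io

# Shortened sample sets for illustration; the real lists are in the question.
HYWEL_DDA = {"SA31", "SA46", "SY23"}
NATIONAL = {"LL58", "CF31", "NP23"}


def label_postcodes(rows):
    """Map each csv row (a list with one postcode) to exactly one label."""
    labels = []
    for row in rows:
        if not row:
            continue  # csv.reader yields [] for empty lines
        postcode = row[0].strip()  # take the string out of the row, drop whitespace
        if postcode in HYWEL_DDA:
            labels.append("HywelDDA")
        elif postcode in NATIONAL:
            labels.append("NationalPostcodes")
        else:
            labels.append("UKPostcodes")
    return labels


rows = csv.reader(io.StringIO("SA31\nLL58\n EH1 \n"))
print(label_postcodes(rows))
# prints: ['HywelDDA', 'NationalPostcodes', 'UKPostcodes']
```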
@Jeff H's answer is correct, but for what it's worth here's how I might write this code (untested):
# Note: Since, as you wrote, these are only single-column files I did not use the csv
# module, as it will just add additional unnecessary overhead.
# Read the known data from files--this will always be more flexible and maintainable than
# hard-coding them in your code. This is just one possible scheme for doing this; e.g.
# you could also put all of them into a single JSON file
standard_postcode_files = {
    'HywelDDA': 'hyweldda.csv',
    'NationalPostcodes': 'nationalpostcodes.csv',
    'UKPostcodes': 'ukpostcodes.csv'
}
def read_postcode_file(filename):
    with open(filename) as f:
        # exclude blank lines and strip additional whitespace
        return [line.strip() for line in f if line.strip()]
standard_postcodes = {}
for key, filename in standard_postcode_files.items():
    standard_postcodes[key] = set(read_postcode_file(filename))
# Assuming all post codes are unique to a set, map postcodes to the set they belong to
postcodes_reversed = {v: k for k, s in standard_postcodes.items() for v in s}
your_postcodes = read_postcode_file('postcodes.csv')
labels = [postcodes_reversed[code] for code in your_postcodes]
with open('DiscretisedPostCodes.csv', 'w') as f:
    for label in labels:
        f.write(label + '\n')
I would probably do other things too, like not hard-coding the input filename. If you need to work with multiple columns, using the csv module would also be fine with minimal additional changes, but since you're just writing one item per line I figured it was unnecessary.

Python fast way to read several rows of csv text?

I wish to do the following as fast as possible with Python:
read rows i to j of a csv file
create the concatenation of all the strings in csv[row=(loop i to j)][column=3]
My first code was a loop (i to j) of the following:
with open('Train.csv', 'rt') as f:
    row = next(itertools.islice(csv.reader(f), row_number, row_number+1))
    tags = (row[3].decode('utf8'))
    return tags
but my code above reads the csv one column at a time and is slow.
How can I read all rows in one call and concatenate fast?
Edit for additional information:
the csv file size is 7GB; I have only 4GB of RAM, on windows XP; but I don't need to read all columns (only 1% of the 7GB would be good I think).
Since I know which data you are interested in, I can speak from experience:
import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags
You can of course per row select anything you want, and store it as you like.
By using an iterator variable, you can decide which rows to collect:
import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    linenum = 0
    tags = []  # you can preallocate memory to this list if you want though.
    for row in reader:
        if linenum > 1000 and linenum < 2000:
            tags.append(row[3])  # tags
        if linenum == 2000:
            break  # so it won't read the next 3 million rows
        linenum += 1
Another good thing is that this uses very little memory, since you read the file line by line.
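The same row-window idea can be written more compactly with itertools.islice, which the question already uses for a single row. This is only a sketch: the function name tags_for_rows, the space separator and the utf-8 encoding are assumptions, and any csv.reader options (delimiter, quotechar) have to be passed in to match the real file.

```python
import csv
from itertools import islice


def tags_for_rows(path, start, stop, column=3, **reader_kwargs):
    """Join column `column` of rows start..stop-1 (0-based, stop exclusive)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, **reader_kwargs)
        # islice pulls rows lazily and stops reading after row stop-1
        return " ".join(row[column] for row in islice(reader, start, stop))
```

The file is opened once and read in a single pass, so only the rows before stop are ever parsed and only the selected column values are kept in memory.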
As mentioned, if you want later rows, the reader still has to parse all the data before them (this is inevitable since there are newlines in the text, so you can't skip to a certain row). Personally, I just roughly used Linux's split to split the file into chunks, and then edited them, making sure they start at an ID (and end with a tag).
Then I used:
train = pandas.io.parsers.read_csv(file, quotechar="\"")
To quickly read in the split files.
If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows, then probably just
tags = " ".join(x.split("\t")[3]
                for x in open("Train.csv").readlines()[from_row:to_row+1])
is going to be the fastest way.
If the file is instead very big, the only thing you can do is iterate over all lines, because CSV unfortunately uses (in general) variable-sized records.
If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.
If the file uses variable-sized records and the search must be done several times with different ranges then creating a simple external index just once (e.g. line->file offset for all line numbers that are a multiple of 1000) can be good idea.
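A rough sketch of such an external index, assuming one record per physical line (no quoted newlines inside fields; if the file has them, this line-based index does not apply). The names build_line_index and read_lines and the default step of 1000 are made up for illustration.

```python
def build_line_index(path, step=1000):
    """Return the byte offsets of lines 0, step, 2*step, ... in the file."""
    offsets = []
    with open(path, "rb") as f:
        line_no = 0
        while True:
            pos = f.tell()          # offset of the line about to be read
            line = f.readline()
            if not line:
                break               # end of file
            if line_no % step == 0:
                offsets.append(pos)
            line_no += 1
    return offsets


def read_lines(path, index, first, last, step=1000):
    """Return lines first..last (0-based, inclusive) as bytes, using the index."""
    block = first // step
    if block >= len(index):
        return []                   # requested range starts past the end
    result = []
    with open(path, "rb") as f:
        f.seek(index[block])        # jump to the nearest indexed line
        line_no = block * step
        for line in f:
            if line_no > last:
                break
            if line_no >= first:
                result.append(line)
            line_no += 1
    return result
```

The index has to be built with one full pass over the file, but after that each range query reads at most step - 1 unwanted lines before reaching the first requested one.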
Your question does not contain enough information, probably because you don't see some existing complexity. Most CSV files contain one record per line; in that case it's simple to skip the rows you're not interested in. But CSV records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records in order to skip them. It's up to you to decide which optimization is OK in your use case.
The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time needed to read the file from disk. Have you checked that? Or have you guessed which part is too slow?
If you want to do fast transformations of CSV data that fits into memory, I would propose to use/learn Pandas. So it would probably be a good idea to split your code into two steps:
Reduce file to the required data.
Transform the remaining data.
sed is designed for the task 'read rows i to j of a csv file'.
If the solution does not have to be pure Python, I think preprocessing the csv file with sed -n 'i,jp' and then parsing the output with Python would be simple and quick.

How to copy specific data out of a file using python?

I have some large data files and I want to copy out certain pieces of data on each line, basically an ID code. The ID code has a | on one side and a space on the other. I was wondering would it be possible to pull out just the ID. Also I have two data files, one has 4 ID codes per line and the other has 23 per line.
At the moment I'm thinking something like copying each line from the data file, then subtract the strings from each other to get the desired ID code, but surely there must be an easier way! Help?
Here is an example of a line from the data file that I'm working with
cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327
and from this line I would want to output on separate lines
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
Use the re (regular expression) module for such a task. The following code shows how to extract the IDs from a string (it works for any number of IDs as long as they are structured the same way).
import re
s = """cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"""
results = re.findall(r'\|([^ ]*)', s)  # list of IDs extracted from the string
print('\n'.join(results))  # one ID per line
Output:
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
To write the output to a file:
with open('out.txt', mode = 'w') as filehandle:
filehandle.write('\n'.join(results))
For more information, see the re module documentation.
If all your lines have the given format, a simple split is enough:
#split by '|' and the result by space
ids = [x.split()[0] for x in line.split("|")[1:]]
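Since the question is about whole data files, here is a small sketch that applies the extraction line by line. It uses the pattern \S+ (one or more non-whitespace characters) rather than the [^ ]* from the answer above, so an ID also ends at a tab or newline; that swap and the names ID_PATTERN, extract_ids and extract_ids_from_lines are my own assumptions.

```python
import re

# A '|' followed by a run of non-whitespace characters; the run is the ID.
ID_PATTERN = re.compile(r"\|(\S+)")


def extract_ids(line):
    """Return all IDs on one line, in order of appearance."""
    return ID_PATTERN.findall(line)


def extract_ids_from_lines(lines):
    """Yield the IDs of every line; works on an open file object too."""
    for line in lines:
        yield from extract_ids(line)


line = "cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"
print("\n".join(extract_ids(line)))
# prints:
# Wood_4286
# EIK58010
# AEV64487.1
# PSEBR_a4327
```

Because extract_ids_from_lines is a generator, it works the same for the file with 4 IDs per line and the one with 23, and never holds more than one line in memory.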

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below is my code, which works, but I am wondering whether there is a better (simpler) solution?
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print(con)
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print(con1)
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: I got an idea from jtmoulia, and this time it is easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp = lines[line_num]
    temp = temp.replace(temp[col_s:col_e], text)
    lines[line_num] = temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
The problem with textual data, even when tabulated, is that byte offsets are not predictable. For example, when numbers are represented as strings you need one byte per digit, whereas in binary (e.g. two's complement) you always need four or eight bytes, for small and large integers alike.
Nevertheless, if your text format is strict enough that you can get by with replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you can treat a file as a mutable byte string and modify parts of it in place, letting the kernel save the file for you.
Otherwise, any of the other answers is much better suited to the problem.
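A minimal sketch of the mmap approach, assuming the byte offset of the field is already known and the replacement has exactly the same length as the bytes it overwrites. The function name overwrite_bytes is hypothetical.

```python
import mmap


def overwrite_bytes(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at offset in place; the file size is unchanged."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file (the file must not be empty).
        with mmap.mmap(f.fileno(), 0) as mm:
            end = offset + len(new_bytes)
            if end > len(mm):
                raise ValueError("replacement would run past the end of the file")
            mm[offset:end] = new_bytes  # slice assignment must keep the same length
            mm.flush()
```

Note that this works on bytes, not characters, so for a text file it is only straightforward when the affected region is plain ASCII or another fixed-width encoding.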
Well, to begin with, you don't need to keep reopening and rereading the file. The r+ mode allows you to both read from and write to the file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    #... Perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
    f.truncate()  # drop leftover bytes if the new content is shorter
See also: Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in Perl because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) that there are equivalent things in Python. First read the entire file into memory all at once (the Perl term for this is "slurp"). (Holding the entire file in memory rather than just one line used to make little sense, or even be impossible, but these days RAM is so large that it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember that array indices usually start at 0). Finally, use regular expression processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the Perl code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurrences in the line
# use different modifiers at the end of the s/// construct as needed
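A rough Python counterpart of the Perl fragment, as a sketch only. The function name substitute_in_line is made up, and unlike the Perl s/// above it treats the old string as literal text (via re.escape) rather than as a regular expression.

```python
import re


def substitute_in_line(text, line_no, old, new):
    """Replace every case-insensitive occurrence of old on 1-based line line_no."""
    lines = text.split("\n")            # one list element per line
    lines[line_no - 1] = re.sub(
        re.escape(old),                 # match old literally, not as a regex
        lambda match: new,              # insert new verbatim, without backslash processing
        lines[line_no - 1],
        flags=re.IGNORECASE,
    )
    return "\n".join(lines)             # put the file content back together


print(substitute_in_line("first\nFooBar and foobar\nlast", 2, "foobar", "fee fie fo fum"))
# prints:
# first
# fee fie fo fum and fee fie fo fum
# last
```

To apply it to a file, read the whole content with f.read(), pass it through the function, and write the returned string back out.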
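FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
with open(FILENAME) as f:
    lines = f.readlines()
# strings are immutable, so rebuild the line around the replaced columns
lines[95] = lines[95][:2] + '99999' + lines[95][8:]
with open(FILENAME, 'w') as f:
    f.writelines(lines)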
