I have some large data files and I want to copy out certain pieces of data on each line, basically an ID code. The ID code has a | on one side and a space on the other. I was wondering would it be possible to pull out just the ID. Also I have two data files, one has 4 ID codes per line and the other has 23 per line.
At the moment I'm thinking something like copying each line from the data file, then subtract the strings from each other to get the desired ID code, but surely there must be an easier way! Help?
Here is an example of a line from the data file that I'm working with
cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327
and from this line I would want to output on separate lines
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
Use the re module for such a task. The following code shows how to extract the IDs from a string (it works for any number of IDs, as long as they are structured the same way).
import re
s = """cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"""
results = re.findall(r'\|([^ ]*)', s)  # list of IDs extracted from the string
print('\n'.join(results)) #pretty output
Output:
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
To write the output to a file:
with open('out.txt', mode='w') as filehandle:
    filehandle.write('\n'.join(results))
For more information, see the re module documentation.
If all your lines have the given format, a simple split is enough:
# split on '|', then take the first whitespace-separated token of each piece
ids = [x.split()[0] for x in line.split("|")[1:]]
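For example, applied to the sample line above (a quick check, assuming the line is stored in a variable called line), this gives the same four IDs:
line = "cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"
ids = [x.split()[0] for x in line.split("|")[1:]]
print('\n'.join(ids))  # Wood_4286, EIK58010, AEV64487.1, PSEBR_a4327, each on its own line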
Related
I hope you can help out a new learner of Python. I could not find my problem in other questions, but if so: apologies. What I basically want to do is this:
Read a large number of text files and search each for a number of string terms.
If the search terms are matched, store the corresponding file name to a new file called "filelist", so that I can tell the good files from the bad files.
Export "filelist" to Excel or CSV.
Here is the code that I have so far:
# text files all contain only simple text, e.g. "6 Apples"
import os
import re
import pandas as pd

filelist = []
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, encoding="Latin1") as f:
        fine = f.read()
        if re.search('APPLES', fine) or re.search('ORANGE', fine) or re.search('BANANA', fine):
            filelist.append(file)

listoffiles = pd.DataFrame(filelist)
writer = pd.ExcelWriter('ListofFiles.xlsx', engine='xlsxwriter')
listoffiles.to_excel(writer, sheet_name='welcome', index=False)
writer.save()
print(filelist)
Questions:
Surely, there is a more elegant or time-efficient way? I need to do this for a large number of files :D
Related to the former, is there a way to solve the reading-in of files using pandas? Or is it less time efficient? For me as a STATA user, having a dataframe feels a bit more like home....
I added the "Latin1" option, as some characters in the raw data create conflict in encoding. Is there a way to understand which characters are causing the problem? Can I get rid of this easily, e.g. by cutting of the first line beforehand (skiprow maybe)?
Just a couple of things to speed up the script:
1.) Compile your regex beforehand, not every time in the loop (also use | to combine the multiple search strings into one regex).
2.) Read the files line by line, not all at once.
3.) Use any() to stop searching as soon as you get the first match.
For example:
import re
import os

filelist = []
r = re.compile(r'APPLES|ORANGE|BANANA')  # you can add flags=re.I for case-insensitive search
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, 'r', encoding='latin1') as f:
        if any(r.search(line) for line in f):  # read the file line by line, not all content at once
            filelist.append(file)  # add to the list

# convert the list to pandas, etc...
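From there, the "convert to pandas" step can reuse the calls from your own snippet; for example (just a sketch, writing straight to the same ListofFiles.xlsx):
import pandas as pd
listoffiles = pd.DataFrame(filelist, columns=['filename'])
listoffiles.to_excel('ListofFiles.xlsx', sheet_name='welcome', index=False)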
I'm having trouble reading a CSV file into a pandas DataFrame where the line endings are not standard.
Here is my code:
df_feb = pd.read_csv(data_location, sep=",", nrows=500, header=None, skipinitialspace=True, encoding='utf-8')
Here is the output (personal info scratched out):
[screenshot: output]
This is what the input data looks like:
The above output splits what should be a single line into 4 lines. A new line should start for every phone number (phone number = scratched out bit).
I am aiming to have each line look like this:
[screenshot: goal output]
Thank you in advance for your help!
If the file format follows some rule (i.e. it is not a unique format for each record), then I suggest you write your own conversion tool.
Here is what I suggest the tool should do (a rough sketch follows the list):
Read file as plain text.
Put 4 lines into 1 record/class object (as far as I can see in the picture, each record seems to span 4 lines)
Parse the line (split by comma, tab, or whatever you have) to get the attributes
Write the attributes to another file, separated by tab (or comma) => your CSV
Now, you can load your csv to Pandas.
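A minimal, untested sketch of that tool might look like this (it assumes the raw file is called raw.txt, that every record really spans exactly 4 physical lines, and that commas are a safe separator):
import pandas as pd

with open('raw.txt', encoding='utf-8') as src, open('fixed.csv', 'w', encoding='utf-8') as dst:
    record = []
    for line in src:
        record.append(line.strip())
        if len(record) == 4:               # 4 physical lines -> 1 logical record
            dst.write(','.join(record) + '\n')
            record = []

df_feb = pd.read_csv('fixed.csv', sep=',', header=None, skipinitialspace=True)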
I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to then load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv
with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. the data starts 21 lines in, has the same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that the data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.
First off, I am very new to Python. When I started to do this it seemed very simple. However I am at a complete loss.
I want to take a text file with as many as 90k entries and put each data group on a single line, separated by ';'. My examples are below. Keep in mind that the groups of data vary in size: they could be two entries, or 100 entries.
Raw Data
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
Formatted Data
group1;data;
group2;data;data;data;
group3;data;data;data;data;data;data;data;data;data;data;data;data;
group4;data;data;
Try something like the following (untested... you can learn a bit of Python by debugging!).
Create a Python file "parser.py":
import sys

f = open('filename.txt', 'r')
for line in f:
    txt = line.strip()
    if txt == '':
        # blank line marks the end of a group: finish the current output line
        sys.stdout.write('\n')
        sys.stdout.flush()
        continue
    sys.stdout.write(txt + ';')
    sys.stdout.flush()
f.close()
and in a shell, type:
python parser.py > output.txt
and see if output.txt is what you want.
Assuming the groups are separated with an empty line, you can use the following one-liner:
>>> print "\n".join([item.replace('\n', ';') for item in open('file.txt').read().split('\n\n')])
group1;data
group2;data;data;data
group3;data;data;data;data;data;data;data;data;data;data;data;data
group4;data;data;
where file.txt contains
group1
data

group2
data
data
data

group3
data
data
data
data
data
data
data
data
data
data
data
data

group4
data
data
First the file content (open().read()) is split on empty lines with split('\n\n') to produce a list of blocks; then, in each block ([item ... for item in list]), newlines are replaced with semicolons; and finally all blocks are printed, separated by newlines, with "\n".join(list).
Note that the above is not safe for production; it is code you would write for interactive data transformation, not for production-level scripts.
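Written out step by step, the same logic as the one-liner looks like this (a sketch, assuming the same file.txt with blank lines between groups):
with open('file.txt') as f:
    blocks = f.read().split('\n\n')          # one block of lines per group
for block in blocks:
    print(block.strip().replace('\n', ';'))  # join the lines of a block with ';'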
What have you tried? What is the text file for, and where does it come from? File manipulation is one of the last "basic" things I plan on learning. I'm saving it for when I understand the nuances of for loops, while loops, dictionaries, lists, appending, and a million other handy functions out there. That's after 2-3 months of research, coding and creating GUIs, by the way.
Anyways here's some basic suggestions.
';'.join(group) will put a ';' between each entry of a group, effectively creating one long (semicolon-delimited) string.
group.replace("SPACE CHARACTER", ";"): this will replace any space or other specified character (like a newline) within a group with a semicolon.
There are a lot of other methods, including loading the txt file into a Python script, the .append() function, putting the groups into lists, dictionaries, or matrices, etc.
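For instance, the two suggestions above could be combined roughly like this (an untested sketch that assumes each group has already been collected as a list of its entries):
groups = [['group1', 'data'], ['group2', 'data', 'data', 'data']]  # hypothetical, already-parsed input
lines = [';'.join(group) + ';' for group in groups]                # 'group1;data;', 'group2;data;data;data;'
with open('formatted.txt', 'w') as out:
    out.write('\n'.join(lines))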
These are my bits to throw on the problem:
from collections import defaultdict
import codecs
import csv

res = defaultdict(list)
cgroup = ''
with codecs.open('tmp.txt', encoding='UTF-8') as f:
    for line in f:
        if line.startswith('group'):
            cgroup = line.strip()
            continue
        res[cgroup].append(line.strip())

with codecs.open('out.txt', 'w', encoding='UTF-8') as f:
    w = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for k in res:
        w.writerow([k] + res[k])
Let me explain a bit why I did things the way I did. First, I used the codecs module to open the data file with an explicit codec, since data should always be treated properly and not by just guessing what it might be. Then I used a defaultdict, which has nice documentation online, because it is more Pythonic, at least according to Mr. Hettinger; the manual check-then-insert pattern is one that can be unlearned once you use it.
Lastly, I used a csv writer to generate the output, because writing CSV files is not as easy as one might think. To meet the right criteria and get the data into a correct CSV format, it is better to use what many eyes have already seen, instead of reinventing the wheel.
I am new to python and this may be a foolish question but it troubles me for several days.
I have about 30 log files, and each of them contains strings and data. They are almost the same except for a few data values, and their names are arranged regularly, like 'log10.lammps', 'log20.lammps', etc. (the '10' and '20' represent the temperature of the simulation). I want to write a Python script which loops over all these files and reads their data on a specific line (say line 3900). Then I want to write these data to another data file, which is arranged like this:
10 XXX
20 XXX
30 XXX
.
.
.
I can read and write from a single file, but I cannot achieve the loop. Could anyone please tell me how to do that? Thanks a lot!
PS. Still another difficulty is that the data in line 3900 is presented like this: "The C11 is 180.1265465616"; the part I want to extract is 180.1265465616. How can I extract the number without the surrounding text?
This answer covers how to get all the files in a folder in Python. To summarize the top answer, to get all the files in a single folder, you would do this:
import os
import os.path
def get_files(folder_path):
    return [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
The next step would be to extract the number from the line The C11 is 180.1265465616.
I'm assuming that you have a function called get_line that given a filename, will return that exact line.
You can do one of three things. If the length of the number at the end is constant, then you can just grab the last n characters in the string and convert it to a number. Alternatively, you could split the string by the spaces and grab the last item -- the number. Finally, you could use Regex.
I'm just going to go with the second option since it looks the most straightforward for now.
def get_numbers():
    numbers = []
    for file in get_files('folder'):
        line = get_line(file)
        components = line.split(' ')
        number = float(components[-1])
        numbers.append(number)
    return numbers
I wasn't sure how you wanted to write the numbers to the file, but hopefully these should help you get started.
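If you also keep track of which temperature each number came from (say, from the filename), writing them out in the "10 XXX" format you described could look roughly like this (a sketch with hypothetical data):
pairs = [(10, 180.1265465616), (20, 181.002)]   # hypothetical (temperature, value) pairs
with open('results.txt', 'w') as out:
    for temp, value in pairs:
        out.write('{} {}\n'.format(temp, value))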
# assuming files is a list of filenames
for filename in files:
    with open(filename) as f:
        <do stuff with file f>
ps. float(line.split(' ')[-1])
Well, I can give you a hint which path I would have taken (but there might be a better one):
Get all files in a directory to a list with os.listdir
Loop over every one and perform the following:
Use the re module to extract the temperature from the filename (if it does not match the pattern, break; else add it to a list, to_write_out)
Read the right line with linecache
Get the value (line.split()[-1])
Append the value to the list to_write_out.
Join the list to_write_out to a string with join
Write the string to a file.
Help with the regex
The regular expressions can be a bit tricky if you haven't used them before. To extract the temperature from your filenames (first bullet beneath point number 2) you would use something like:
for fname in filenames:
    pattern = r'log(\d+)\.lammps'
    match = re.search(pattern, fname)
    if match:
        temp = match.group(1)
        # Append the temperature to the list.
    else:
        break
    # Continue reading the right line etc.
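Putting the steps above together, a rough, untested sketch of the whole loop might look like this (the directory name, the exact line number, and the output filename are all assumptions, and I skip non-matching files rather than break):
import os
import re
import linecache

to_write_out = []
for fname in sorted(os.listdir('logs')):                 # assumed directory holding the log files
    match = re.search(r'log(\d+)\.lammps', fname)
    if not match:
        continue                                         # skip files that do not match the pattern
    temp = match.group(1)
    line = linecache.getline(os.path.join('logs', fname), 3900)   # e.g. "The C11 is 180.1265465616"
    value = line.split()[-1]                             # the number at the end of the line
    to_write_out.append(temp + ' ' + value)
with open('c11_vs_temp.txt', 'w') as out:
    out.write('\n'.join(to_write_out))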