Python read .txt File -> list - python

I have a .txt File and I want to get the values in a list.
The format of the txt file should be:
value0,timestamp0
value1,timestamp1
...
...
...
In the end I want to get a list with
[[value0,timestamp0],[value1,timestamp1],.....]
I know it's easy to get these values by
direction = []
for line in open(filename):
direction,t = line.strip().split(',')
direction = float(direction)
t = long(t)
direction.append([direction,t])
return direction
But I have a big problem: When creating the data I forgot to insert a "\n" in each row.
Thats why I have this format:
value0, timestamp0value1,timestamp1value2,timestamp2value3.....
Every timestamp has exactly 13 characters.
Is there a way to get these data in a list as I want it? Would be very much work get the data again.
Thanks
Max

import re
input = "value0,0123456789012value1,0123456789012value2,0123456789012value3"
for (line, value, timestamp) in re.findall("(([^,]+),(.{13}))", input):
print value, timestamp

You will have to strip the last , but you can insert a comma after every 13 chars following a comma:
import re
s = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531-0.4646‌​7665,1466615025561-0.5533287,1466615025591-0.63311553,1466615025621-0.7049236,146‌​6615025652-0.7695509,1466615025681-1.7158673,1466615025711-1.6896278,146661502574‌​1-1.65375,1466615025772-1.6092329,1466615025801"
print(re.sub("(?<=,)(.{13})",r"\1"+",", s))
Which will give you:
-0.1351197,1466615025472,-0.25672746,1466615025501,-0.3661744,1466615025531,-0.4646‌​7665,1466615025561,-0.5533287,1466615025591,-0.63311553,1466615025621,-0.7049236,146‌​6615025652-0.7695509,1466615025681,-1.7158673,1466615025711,-1.6896278,146661502574‌​1-1.65375,1466615025772,-1.6092329,1466615025801,

I coded a quickie using your example, and not using 13 but len("timestamp") so you can adapt
instr = "value,timestampvalue2,timestampvalue3,timestampvalue4,timestamp"
previous_i = 0
for i,c in enumerate(instr):
if c==",":
next_i = i+len("timestamp")+1
print(instr[previous_i:next_i])
previous_i = next_i
output is descrambled:
value,timestamp
value2,timestamp
value3,timestamp
value4,timestamp

I think you could do something like this:
direction = []
for line in open(filename):
list = line.split(',')
v = list[0]
for s in list[1:]:
t = s[:13]
direction.append([float(v), long(t)])
v = s[13:]
If you're using python 3.X, then the long function no longer exists -- use int.

Related

Python program which could auto sort and replace hashes with plain password in csv file

I have an excel file "ex.csv" with columns - Hash, Salt, Name And I have a txt file "found.txt" where are decrypted hashes. Their format is Hash: Salt: Plain_Password I would like to change Hash from "ex.csv" with Plain_Password from "found.txt". Would like to know, how I could do that :) I have written a test program that would output into separate txt file the Hash: Salt but it is not working.
Python code -
# File Reads
a = open("ex.csv")
b = open("found.txt")
# Reading contents
ex = a.read()
found = b.read()
# Splitting files by newline
ex_s = ex.split("\n")
found_s = found.split("\n")
# Splitting them into subarrays by splitting them by ','
temp_exsp2 = []
temp_foundsp2 = []
i=0
for item in ex_s:
temp_exsp2[i] = item[0] # Presumeably here's an error
i+=1
i=0
for item in found_s:
temp_foundsp2[i] = item[0] # Same thing here
i+=1
i=0
z=0 #Used for incrementing found array
FoundArray0 = [] #For line from ex
FoundArray1 = [] #For line from found
while i!=len(ex_s): # Main comparison loop
for item in temp_foundsp2: # Inner loop for looping through all found file
j=0
if item in temp_exsp2[i]:
FoundArray0[z] = i
FoundArray1[z] = j
z+=1
j+=1
i+=1 # Go to the next line in the ex.csv
output = open("output.txt","w")
for out in FoundArray0:
for out2 in FoundArray1:
output.write(str(ex_s[FoundArray0]) + ":" + str(temp_foundsp2[FoundArray1]))
FoundArray here is the line numbers from ex.csv and found.txt (Would like to know if there's a way to do it better ;) Because I feel that it is not right) It is giving me an error - temp_exsp2[i] = item[0] # Presumably here's an error
IndexError: list assignment index out of range
Samples from ex.csv:
210ac64b3c5a570e177b26bb8d1e3e93f72081fd,gx0FMxymN,user1
039e8c304c9ada05fd9cc549ac62e178edbfaed6,eVRCBE2OG,user2
Samples from found.txt
f8fa3b3da3fc71e1eaf6c18e4afef626e1fc7fc1:t7e2jlLvs:pass1
bce61cb17c381e11afbcf89ab30ae5cc8276722f:rjCAX5D6K:pass2
Maybe there's an excel function that does that :D I don't know.
I am new at python and would like to know the best way to realize this :)
Thanks ;)
To split string that has entries separated by a specific delimeter you can use string.split(delimeter) method.
Example:
a = '123:456:abc'
a.split(':')
>>> ['123', '456', 'abc']
You could also take a look at pandas DataFrame which is able to load csv file and then lets you easily manipulate the columns and much more.

Find first line of text according to value in Python

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In linux I can get the closest line and do the treatment I need using grep, sed, cut and others, but I'd like in Python.
Any help will be greatly appreciated!
Thank you.
How can I do a search of a value of the first "latitude, longitude"
coordinate in a "file.txt" list in Python and get 3 rows above and 3
rows below?*
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
You can use pandas to import the data in a dataframe and then easily manipulate it. As per your question the value to check is not the exact match and therefore I have converted it to string.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude","longitude"]) #imports text file as dataframe
value_to_check = 37.0459 # user defined
for i in range(len(data)):
if str(value_to_check) == str(data.iloc[i,0])[:len(str(value_to_check))]:
break
print(data.iloc[i-3:i+4,:])
output
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
A solution with iterators, that only keeps in memory the necessary lines and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
queue = deque(maxlen=before)
with open(file) as f:
for line in f:
if target in map(float, line.split(',')):
out = list(queue) + [line] + list(islice(f, 3))
return out
queue.append(line)
else:
raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04565,-95.58073\n', '37.04565,-95.58073\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there is less than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
This solution will print the before and after elements even if they are less than 3.
Also I am using string as it is implied from the question that you want partial matches also. ie. 37.0459 will match 37.04597
search_term='37.04462'
with open('file.txt') as f:
lines = f.readlines()
lines = [line.strip().split(',') for line in lines] #remove '\n'
for lat,lon in lines:
if search_term in lat:
index=lines.index([lat,lon])
break
left=0
right=0
for k in range (1,4): #bcoz last one is not included
if index-k >=0:
left+=1
if index+k<=(len(lines)-1):
right+=1
for i in range(index-left,index+right+1): #bcoz last one is not included
print(lines[i][0],lines[i][1])

python newbie - where is my if/else wrong?

Complete beginner so I'm sorry if this is obvious!
I have a file which is name | +/- or IG_name | 0 in a long list like so -
S1 +
IG_1 0
S2 -
IG_S3 0
S3 +
S4 -
dnaA +
IG_dnaA 0
Everything which starts with IG_ has a corresponding name. I want to add the + or - to the IG_name. e.g. IG_S3 is + like S3 is.
The information is gene names and strand information, IG = intergenic region. Basically I want to know which strand the intergenic region is on.
What I think I want:
open file
for every line, if the line starts with IG_*
find the line with *
print("IG_" and the line it found)
else
print line
What I have:
with open(sys.argv[2]) as geneInfo:
with open(sys.argv[1]) as origin:
for line in origin:
if line.startswith("IG_"):
name = line.split("_")[1]
nname = name[:-3]
for newline in geneInfo:
if re.match(nname, newline):
print("IG_"+newline)
else:
print(line)
where origin is the mixed list and geneInfo has only the names not IG_names.
With this code I end up with a list containing only the else statements.
S1 +
S2 -
S3 +
S4 -
dnaA +
My problem is that I don't know what is wrong to search so I can (attempt) to fix it!
Below is some step-by-step annotated code that hopefully does what you want (though instead of using print I have aggregated the results into a list so you can actually make use of it). I'm not quite sure what happened with your existing code (especially how you're processing two files?)
s_dict = {}
ig_list = []
with open('genes.txt', 'r') as infile: # Simulating reading the file you pass in sys.argv
for line in infile:
if line.startswith('IG_'):
ig_list.append(line.split()[0]) # Collect all our IG values for later
else:
s_name, value = line.split() # Separate out the S value and its operator
s_dict[s_name] = value.strip() # Add to dictionary to map S to operator
# Now you can go back through your list of IG values and append the appropriate operator
pulled_together = []
for item in ig_list:
s_value = item.split('_')[1]
# The following will look for the operator mapped to the S value. If it is
# not found, it will instead give you 'not found'
corresponding_operator = s_dict.get(s_value, 'Not found')
pulled_together.append([item, corresponding_operator])
print ('List structure')
print (pulled_together)
print ('\n')
print('Printout of each item in list')
for item in pulled_together:
print(item[0] + '\t' + item[1])
nname = name[:-3]
Python's slicing through list is very powerful, but can be tricky to understand correctly.
When you write [:-3], you take everything except the last three items. The thing is, if you have less than three element in your list, it does not return you an error, but an empty list.
I think this is where things does not work, as there are not much elements per line, it returns you an empty list. If you could tell what do you exactly want it to return there, with an example or something, it would help a lot, as i don't really know what you're trying to get with your slicing.
Does this do what you want?
from __future__ import print_function
import sys
# Read and store all the gene info lines, keyed by name
gene_info = dict()
with open(sys.argv[2]) as gene_info_file:
for line in gene_info_file:
tokens = line.split()
name = tokens[0].strip()
gene_info[name] = line
# Read the other file and lookup the names
with open(sys.argv[1]) as origin_file:
for line in origin_file:
if line.startswith("IG_"):
name = line.split("_")[1]
nname = name[:-3].strip()
if nname in gene_info:
lookup_line = gene_info[nname]
print("IG_" + lookup_line)
else:
pass # what do you want to do in this case?
else:
print(line)

Modifying data in a column Python

Hi guys i have a column like this,
Start
Start = 11122001
Start = 12012014
Start = 23122001
And i want to remove the "Start =" and the date format into
Start
11/12/2001
12/01/2014
23/12/2001
How do I do this properly?
It depends on what you are trying to do.
If you want to remove Start = from each line:
lines = [ format_date(re.sub("^Start =", '', line)) for line in lines ]
(presuming you have your text line by line in a list).
To format date you need to implement the function format_date
which will convert dates from 11122001 to 11/12/2001.
There are several ways how to do this depending on the input format.
One of the solutions:
def format_date(x):
if re.match(x, '[0-9]{8}'):
return "/".join([x[:2], x[2:4], x[4:]])
else:
return x
You check first if the line match the date expression (looks like a date),
and if it does, rewrite it. Otherwise just return it as is.
Of course, you can combine the solution in one line
and don't use function at all, but in this case
it will be not so clear.
Another, map-based solution:
def format_line(x):
x = re.sub("^Start =", '', line)
if re.match(x, '[0-9]{8}'):
return "/".join([x[:2], x[2:4], x[4:]])
else:
return x
map(format_line, lines)

How do I avoid errors when parsing a .csv file in python?

I'm trying to parse a .csv file that contains two columns: Ticker (the company ticker name) and Earnings (the corresponding company's earnings). When I read the file using the following code:
f = open('earnings.csv', 'r')
earnings = f.read()
The result when I run print earnings looks like this (it's a single string):
Ticker;Earnings
AAPL;52131400000
TSLA;-911214000
AMZN;583841600
I use the following code to split the string by the break line character (\n), followed by splitting each resulting line by the semi-colon character:
earnings_list = earnings.split('\n')
string_earnings = []
for string in earnings_list:
colon_list = string.split(';')
string_earnings.append(colon_list)
The result is a list of lists where each list contains the company's ticker at index[0] and its earnigns at index[1], like such:
[['Ticker', 'Earnings\r\r'], ['AAPL', '52131400000\r\r'], ['TSLA', '-911214000\r\r'], ['AMZN', '583841600\r\r']]
Now, I want to convert the earnings at index[1] of each list -which are currently strings- intro integers. So I first remove the first list containing the column names:
headless_earnings = string_earnings[1:]
Afterwards I try to loop over the resulting list to convert the values at index[1] of each list into integers with the following:
numerical = []
for i in headless_earnings:
num = int(i[1])
numerical.append(num)
I get the following error:
num = int(i[1])
IndexError: list index out of range
How is that index out of range?
You certainly mishandle the end of lines.
If I try your code with this string: "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600" it works.
But with this one: "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n" it doesn't.
Explanation: split creates a last list item containing only ['']. So at the end, python tries to access [''][1], hence the error.
So a very simple workaround would be to remove the last '\n' (if you're sure it's a '\n', otherwise you might have surprises).
You could write this:
earnings_list = earnings[:-1].split('\n')
this will fix your error.
If you want to be sure you remove a last '\n', you can write:
earnings_list = earnings[:-1].split('\n') if earnings[-1] == '\n' else earnings.split('\n')
EDIT: test code:
#!/usr/bin/env python2
earnings = "Ticker;Earnings\r\r\nAAPL;52131400000\r\r\nTSLA;-911214000\r\r\nAMZN;583841600\r\r\n"
earnings_list = earnings[:-1].split('\n') if earnings[-1] == '\n' else earnings.split('\n')
string_earnings = []
for string in earnings_list:
colon_list = string.split(';')
string_earnings.append(colon_list)
headless_earnings = string_earnings[1:]
#print(headless_earnings)
numerical = []
for i in headless_earnings:
num = int(i[1])
numerical.append(num)
print numerical
Output:
nico#ometeotl:~/temp$ ./test_script2.py
[52131400000, -911214000, 583841600]

Categories