I wrote a script to transform a large 4 MB text file with 40k+ lines of unordered data into a specifically formatted, easier-to-work-with CSV file.
Problem:
Comparing file sizes, it appears I've lost over 1 MB of data (20k lines; edit: the original file was 7 MB, so I've lost ~4 MB), and when I search sorted_CSV.csv for specific data points that are present in CommaOnly.txt, I cannot find them.
I found this really strange.
What I tried:
I searched for and replaced all Unicode characters present in CommaOnly.txt that might be causing a problem. No luck!
Example: \u0b99 replaced with " "
Here's an example of the data loss.
A line from CommaOnly.txt:
name,SJ Photography,category,Professional Services,
state,none,city,none,country,none,about,
Capturing intimate & milestone moment from pregnancy and family portraits to weddings
In sorted_CSV.csv:
Not present.
What could be causing this?
Code:
import re
import csv
import time
# Final Sorted Order for all data:
#['name', 'data',
# 'category','data',
# 'about', 'data',
# 'country', 'data',
# 'state', 'data',
# 'city', 'data']
## Receives a string, splits it on the ',' delimiter, and returns the resulting list
def split_values(string):
    string = string.strip('\n')
    split_string = re.split(',', string)
    return split_string
## Iterates through the list, reorganizes terms in the desired order at the desired indices
## Adds the field if it is not initially present
def reformo_sort(list_to_sort):
    processed_values=[""]*12
    for i in range(11):
        try:
            ## Terrible code I know, but trying to be explicit for the question
            if(i==0):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="name"):
                        processed_values[0]=(list_to_sort[j])
                        processed_values[1]=(list_to_sort[j+1])
                        ## append its neighbour
                ## if after iterating, name does not appear, add it.
                if(processed_values[0] != "name"):
                    processed_values[0]="name"
                    processed_values[1]="None"
            elif(i==2):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="category"):
                        processed_values[2]=(list_to_sort[j])
                        processed_values[3]=(list_to_sort[j+1])
                if(processed_values[2] != "category"):
                    processed_values[2]="category"
                    processed_values[3]="None"
            elif(i==4):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="about"):
                        processed_values[4]=(list_to_sort[j])
                        processed_values[5]=(list_to_sort[j+1])
                if(processed_values[4] != "about"):
                    processed_values[4]="about"
                    processed_values[5]="None"
            elif(i==6):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="country"):
                        processed_values[6]=(list_to_sort[j])
                        processed_values[7]=(list_to_sort[j+1])
                if(processed_values[6]!= "country"):
                    processed_values[6]="country"
                    processed_values[7]="None"
            elif(i==8):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="state"):
                        processed_values[8]=(list_to_sort[j])
                        processed_values[9]=(list_to_sort[j+1])
                if(processed_values[8] != "state"):
                    processed_values[8]="state"
                    processed_values[9]="None"
            elif(i==10):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="city"):
                        processed_values[10]=(list_to_sort[j])
                        processed_values[11]=(list_to_sort[j+1])
                if(processed_values[10] != "city"):
                    processed_values[10]="city"
                    processed_values[11]="None"
        except:
            print("failed to append!")
    return processed_values
# Converts the desired data fields to a string, delimiting values by ','
def to_CSV(values_to_convert):
    CSV_ENTRY=str(values_to_convert[1])+','+str(values_to_convert[3])+','+str(values_to_convert[5])+','+str(values_to_convert[7])+','+str(values_to_convert[9])+','+str(values_to_convert[11])
    return CSV_ENTRY
with open("CommaOnly.txt", 'r') as c:
print("Starting.. :)")
for line in c:
entry = c.readline()
to_sort = split_values(entry)
now_sorted = reformo_sort(to_sort)
CSV_ROW=to_CSV(now_sorted)
with open("sorted_CSV.csv", "a+") as file:
file.write(str(CSV_ROW)+"\n")
print("Finished! :)")
time.sleep(60)
I've rewritten the main loop, which seems fishy to me, using the csv package. In particular, the original loop calls c.readline() inside for line in c:, so the file is advanced twice per iteration and every other input line is silently skipped, which alone accounts for a large share of the missing data. Your reformo_sort routine as posted is incomplete and syntactically incorrect, with empty elif blocks and missing processing, so I got incomplete lines, but this should still work much better than your code. Note the use of csv, the "binary" flag, the single open in write mode instead of opening/closing the output file for each line (much faster), and the 1-out-of-2 filtering of the now_sorted array.
with open("CommaOnly.txt", 'rb') as c:
print("Starting.. :)")
cr = csv.reader(c,delimiter=",",quotechar='"')
with open("sorted_CSV.csv", "wb") as fw:
cw = csv.writer(fw,delimiter=",",quotechar='"')
for to_sort in cr:
now_sorted = reformo_sort(to_sort)
cw.writerow(now_sorted[1::2])
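Note that the 'rb'/'wb' open flags are Python 2 specific. If you are on Python 3, a minimal sketch of the same loop (assuming the same reformo_sort as above) would open both files in text mode with newline='' instead:
import csv

with open("CommaOnly.txt", newline='') as c, open("sorted_CSV.csv", "w", newline='') as fw:
    cr = csv.reader(c, delimiter=",", quotechar='"')
    cw = csv.writer(fw, delimiter=",", quotechar='"')
    for to_sort in cr:
        now_sorted = reformo_sort(to_sort)   # same reformo_sort as in the question
        cw.writerow(now_sorted[1::2])        # keep only the values, drop the field names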
Related
How can I read a CSV file without any imports (e.g. csv or pandas) and turn it into a list of lists? Here's the code I have worked out so far:
m = []
for line in myfile:
    m.append(line.split(','))
Using this for loop works fine, but if a ',' appears inside one of the fields, it wrongly breaks the line there.
So, for example, if one of the lines I have in the csv is:
12,"This is a single entry, even if there's a coma",0.23
The relative element of the list is the following:
['12', '"This is a single entry', 'even if there is a coma"','0.23\n']
While I would like to obtain:
['12', '"This is a single entry, even if there is a coma"','0.23']
I would avoid trying to use a regular expression; instead, process the text one character at a time to determine where the quote characters are. Also, the quote characters are normally not included as part of a field.
A quick example approach would be the following:
def split_row(row, quote_char='"', delim=','):
    in_quote = False
    fields = []
    field = []
    for c in row:
        if c == quote_char:
            in_quote = not in_quote
        elif c == delim:
            if in_quote:
                field.append(c)
            else:
                fields.append(''.join(field))
                field = []
        else:
            field.append(c)
    if field:
        fields.append(''.join(field))
    return fields
fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
Which would display:
3 ['12', "This is a single entry, even if there's a coma", '0.23']
The csv library, though, does a far better job of this. This script does not handle any special cases beyond your test string.
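For reference, here is a minimal sketch of the csv-module equivalent (assuming Python 3); this just feeds the single example row to csv.reader, whereas with a real file you would pass the open file object instead:
import csv

row = '''12,"This is a single entry, even if there's a coma",0.23'''
fields = next(csv.reader([row]))   # csv.reader accepts any iterable of lines
print(len(fields), fields)
# 3 ['12', "This is a single entry, even if there's a coma", '0.23']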
Here is my go at it:
line ='12, "This is a single entry, more bits in here ,even if there is a coma",0.23 , 12, "This is a single entry, even if there is a coma", 0.23\n'
line_split = line.replace('\n', '').split(',')
quote_loc = [idx for idx, l in enumerate(line_split) if '"' in l]
quote_loc.reverse()
assert len(quote_loc) % 2 == 0, "value was odd, should be even"
for m, n in zip(quote_loc[::2], quote_loc[1::2]):
    line_split[n] = ','.join(line_split[n:m+1])
    del line_split[n+1:m+1]
print(line_split)
I have a CSV file "ex.csv" with the columns Hash, Salt, Name, and a text file "found.txt" containing decrypted hashes in the format Hash:Salt:Plain_Password. I would like to replace the Hash in "ex.csv" with the Plain_Password from "found.txt", and would like to know how I could do that :) I have written a test program that should output the Hash:Salt into a separate text file, but it is not working.
Python code -
# File Reads
a = open("ex.csv")
b = open("found.txt")
# Reading contents
ex = a.read()
found = b.read()
# Splitting files by newline
ex_s = ex.split("\n")
found_s = found.split("\n")
# Splitting them into subarrays by splitting them by ','
temp_exsp2 = []
temp_foundsp2 = []
i=0
for item in ex_s:
    temp_exsp2[i] = item[0] # Presumably here's an error
    i+=1
i=0
for item in found_s:
    temp_foundsp2[i] = item[0] # Same thing here
    i+=1
i=0
z=0 #Used for incrementing found array
FoundArray0 = [] #For line from ex
FoundArray1 = [] #For line from found
while i!=len(ex_s): # Main comparison loop
    for item in temp_foundsp2: # Inner loop for looping through all of the found file
        j=0
        if item in temp_exsp2[i]:
            FoundArray0[z] = i
            FoundArray1[z] = j
            z+=1
        j+=1
    i+=1 # Go to the next line in ex.csv
output = open("output.txt","w")
for out in FoundArray0:
    for out2 in FoundArray1:
        output.write(str(ex_s[FoundArray0]) + ":" + str(temp_foundsp2[FoundArray1]))
FoundArray here holds the line numbers from ex.csv and found.txt (I would like to know if there's a better way to do this ;) because I feel it is not right). It is giving me an error on the line temp_exsp2[i] = item[0]:
IndexError: list assignment index out of range
Samples from ex.csv:
210ac64b3c5a570e177b26bb8d1e3e93f72081fd,gx0FMxymN,user1
039e8c304c9ada05fd9cc549ac62e178edbfaed6,eVRCBE2OG,user2
Samples from found.txt:
f8fa3b3da3fc71e1eaf6c18e4afef626e1fc7fc1:t7e2jlLvs:pass1
bce61cb17c381e11afbcf89ab30ae5cc8276722f:rjCAX5D6K:pass2
Maybe there's an Excel function that does this :D I don't know.
I am new to Python and would like to know the best way to accomplish this :)
Thanks ;)
To split a string whose entries are separated by a specific delimiter, you can use the string.split(delimiter) method.
Example:
>>> a = '123:456:abc'
>>> a.split(':')
['123', '456', 'abc']
You could also take a look at a pandas DataFrame, which can load a CSV file and then lets you easily manipulate the columns, and much more.
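For the actual replacement, a rough sketch along those lines (Python 3 here) could build a hash-to-password lookup from found.txt and then rewrite ex.csv row by row; the output file name ex_cracked.csv is just an example, and the column layouts shown in your samples are assumed:
import csv

# Build a hash -> plain password lookup from found.txt (Hash:Salt:Plain_Password)
cracked = {}
with open("found.txt") as found:
    for line in found:
        line = line.strip()
        if not line:
            continue
        hash_value, salt, password = line.split(":", 2)
        cracked[hash_value] = password

# Rewrite ex.csv, replacing the hash with the plain password when it is known
with open("ex.csv", newline="") as src, open("ex_cracked.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] in cracked:
            row[0] = cracked[row[0]]
        writer.writerow(row)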
I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added the "[" or "]" to the .txt file when writing the data to it, line by line. However, the mistake was made, and as a result the file does not load with the columns separated properly.
Is there a way to load the data properly into pandas?
On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe by:
Purging the square brackets (with sed on a Linux command line, but this could also be done e.g. with a text editor, or in Python if need be):
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Then loading the dataframe (in a Python console):
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
Consider the code below, which reads the text from myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text and then splits each string in the list of strings on ,, excluding the first string, which holds the headers. Some Message values contain a , which would otherwise spill into an extra column (NaN elsewhere), so the code joins those pieces back into a single string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()

text = text.replace("[", "")
text = text.replace("]", "")

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You can parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast
with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)
print(df)
Note that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. If the data file is not too big, though, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
    print(df)
A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are [, ], " and spaces:
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret is as an end of string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]
print(df)
# Output (note that the space before "There's" is also gone)
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
For now the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in different columns of the pandas dataframe, which were then dropped; this avoids having to strip the lines one by one. A rough sketch of the idea is shown below.
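Assuming the same snippet saved as test.txt, the approach might look like this: with the python engine, a multi-character sep is treated as a regular expression, so the character class '[|"|]' splits each line on every " or |, leaving the brackets and the literal ", " in columns of their own (the exact column labels to drop depend on how your header line splits):
import pandas as pd

# a line like ["Author", "Message"] splits into five columns: [  Author  ,   Message  ]
df = pd.read_csv('test.txt', sep='[|"|]', engine='python')
# the bracket and comma columns carry no data, so drop them by name
df = df.drop(columns=['[', ', ', ']'])
print(df)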
I have genomic data from 16 nuclei. The first column represents the nucleus, the next two columns represent the scaffold (section of genome) and the position on the scaffold respectively, and the last two columns represent the nucleotide and coverage respectively. There can be equal scaffolds and positions in different nuclei.
Using input for start and end positions (scaffold and position of each), I'm supposed to output a csv file which shows the data (nucleotide and coverage) of each nucleus within the range from start to end. I was thinking of doing this by having 16 columns (one for each nucleus), and then showing the data from top to bottom. The leftmost region would be a reference genome in that range, which I accessed by creating a dictionary for each of its scaffolds.
In my code, I have a defaultdict of lists, so the key is a string which combines the scaffold and the location, while the data is an array of lists, so that for each nucleus, the data can be appended to the same location, and in the end each location has data from every nucleus.
Of course, this is very slow. How should I be doing it instead?
Code:
#let's plan this
#input is start and finish - when you hit first, add it and keep going until you hit next or larger
#dictionary of arrays
#loop through everything, output data for each nucleus
import csv
from collections import defaultdict
inrange = 0
start = 'scaffold_41,51335'
end = 'scaffold_41|51457'
locations = defaultdict(list)
count = 0
genome = defaultdict(lambda : defaultdict(dict))
scaffold = ''
for line in open('Allpaths_SL1_corrected.fasta','r'):
    if line[0]=='>':
        scaffold = line[1:].rstrip()
    else:
        genome[scaffold] = line.rstrip()
print('Genome dictionary done.')
with open('automated.csv','rt') as read:
    for line in csv.reader(read,delimiter=','):
        if line[1] + ',' + line[2] == start:
            inrange = 1
        if inrange == 1:
            locations[line[1] + ',' + line[2]].append([line[3],line[4]])
        if line[1] + ',' + line[2] == end:
            inrange = 0
        count += 1
        if count%1000000 == 0:
            print('Checkpoint '+str(count)+'!')
with open('region.csv','w') as fp:
    wr = csv.writer(fp,delimiter=',',lineterminator='\n')
    for key in locations:
        nuclei = []
        for i in range(0,16):
            try:
                nuclei.append(locations[key][i])
            except IndexError:
                nuclei.append(['',''])
        wr.writerow([genome[key[0:key.index(',')]][int(key[key.index(',')+1:])-1],key,nuclei])
print('Done!')
Files:
https://drive.google.com/file/d/0Bz7WGValdVR-bTdOcmdfRXpUYUE/view?usp=sharing
https://drive.google.com/file/d/0Bz7WGValdVR-aFdVVUtTbnI2WHM/view?usp=sharing
(Only focusing on the CSV section in the middle of your code)
The example csv file you supplied is over 2GB and 77,822,354 lines. Of those lines, you seem to be focused on only 26,804,253 lines, or about 1/3.
As a general suggestion, you can speed things up by:
Avoid processing the data you are not interested in (2/3 of the file);
Speed up identifying the data you are interested in;
Avoid things that are repeated millions of times and tend to be slow (processing each line as csv, reassembling a string, etc);
Avoid reading all data when you can break it up into blocks or lines (memory will get tight)
Use faster tools like numpy, pandas and pypy
Your data is block oriented, so you can use a FlipFlop type object to sense whether you are in a block or not.
The first column of your csv is numeric, so rather than splitting the line apart and reassembling two columns, you can use the faster Python in operator to find the start and end of the blocks:
start = ',scaffold_41,51335,'
end = ',scaffold_41,51457,'

class FlipFlop:
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        rtr = True if self.state else False
        if self.patterns[self.state] in st:
            self.state = not self.state
        return self.state or rtr

lines_in_block = 0
with open('automated.csv') as f:
    ff = FlipFlop(start, end)
    for lc, line in enumerate(f):
        if ff(line):
            lines_in_block += 1

print lines_in_block, lc
Prints:
26804256 77822354
That runs in about 9 seconds in PyPy and 46 seconds in Python 2.7.
You can then take the portion that reads the source csv file and turn that into a generator so you only need to deal with one block of data at a time.
(Certainly not correct as written, since I spent no time trying to understand your files overall...):
def csv_bloc(fn, start_pat, end_pat):
    from itertools import ifilter
    with open(fn) as csv_f:
        ff = FlipFlop(start_pat, end_pat)
        for block in ifilter(ff, csv_f):
            yield block
Or, if you need to combine all the blocks into one dict:
def csv_line(fn, start, end):
    with open(fn) as csv_in:
        ff = FlipFlop(start, end)
        for line in csv_in:
            if ff(line):
                yield line.rstrip().split(",")

di = {}
for row in csv_line('/tmp/automated.csv', start, end):
    di.setdefault((row[2],row[3]), []).append([row[3],row[4]])
That executes in about 1 minute on my (oldish) Mac in PyPy and about 3 minutes in cPython 2.7.
Best
I am trying to put data from a text file into an array. Below is the array I am trying to create:
[("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)]
But instead, when I load the data from the text file, I get the output below. It should look like the above; I realise I have to split it, but I don't really know how for this sort of array. Could anyone help me with this?
['("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)']
Below is the text file I am trying to load:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
And this is how I'm loading it:
file = open("slide.txt", "r")
scale = [file.readline()]
If you mean a list instead of an array:
with open(filename) as f:
    list_name = f.readlines()
Some questions come to mind about what the rest of your implementation looks like and how you expect it all to work, but below is an example of how this could be done in a pretty straightforward way:
class W(object):
    pass

class S(object):
    pass

class WS(W, S):
    pass

class R(object):
    pass

def main():
    # separate parts that should become tuples eventually
    text = str()
    with open("data", "r") as fh:
        text = fh.read()
    parts = text.split("),")

    # remove unwanted characters and whitespace
    cleaned = list()
    for part in parts:
        part = part.replace('(', '')
        part = part.replace(')', '')
        cleaned.append(part.strip())

    # convert text parts into tuples with actual data types
    list_of_tuples = list()
    for part in cleaned:
        t = construct_tuple(part)
        list_of_tuples.append(t)

    # now use the data for something
    print(list_of_tuples)

def construct_tuple(data):
    t = tuple()
    content = data.split(',')
    for item in content:
        t = t + (get_type(item),)
    return t

# there needs to be some way to decide what type/object should be used:
def get_type(id):
    type_mapping = {
        '"harmonic minor"': 'harmonic minor',
        '"major"': 'major',
        '"relative minor"': 'relative minor',
        's': S(),
        'w': W(),
        'w+s': WS(),
        'r': R()
    }
    return type_mapping.get(id)

if __name__ == "__main__":
    main()
This code makes some assumptions:
there is a file data with the content:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
you want a list of tuples which contains the values.
It's acceptable to have w+s represented by some data type, as it would be difficult to have something like w+s appear inside a tuple without it being evaluated when the tuple is created. Another way to do it would be to have w and s represented by data types that support +; a minimal sketch of that idea follows below.
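A minimal sketch of that alternative (the class name Step and the sizes used here are illustrative assumptions, not part of your file format):
class Step(object):
    def __init__(self, size):
        self.size = size
    def __add__(self, other):
        # adding two steps yields a bigger step, so w + s stays a single value
        return Step(self.size + other.size)
    def __repr__(self):
        return "Step(%s)" % self.size

w = Step(2)   # whole step
s = Step(1)   # half step
print(w + s)  # Step(3)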
So even if this works, it might be a good idea to think about the format of the text file (if you have control over it) and see whether it can be changed into something that lets you use a parsing library in a simple way, e.g. see how it could be more easily represented as CSV, or even turned into JSON.
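For example, if the file were reformatted as plain CSV, one scale per row and without the parentheses and quotes (the file name scales.csv and the exact layout are only a suggestion), loading it becomes trivial:
import csv

# Hypothetical reformatted file "scales.csv":
#   major,r,w,w,s,w,w,w,s
#   relative minor,r,w,s,w,w,s,w,w
#   harmonic minor,r,w,s,w,w,s,w+s,s
with open("scales.csv") as f:
    scales = [tuple(row) for row in csv.reader(f)]

print(scales)
# [('major', 'r', 'w', 'w', 's', 'w', 'w', 'w', 's'), ...]
# The step values are still plain strings here; mapping them to objects
# (as with get_type above) would remain a separate, explicit step.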