I have a CSV file with the following format:
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode"
Basically, what I'm looking to do in Python 2.x is read the file and, if any file in the FileName column has one of the extensions from a specified list, parse the data from the MD5 hash column out into a text document.
So my pseudo code is looking like this:
extensions = ['.doc', '.xls', '.ppt']
with open('new.csv', 'w') as new_f:
    with open('x.csv') as old_f:
        for line in old_f.readlines():
            if any(ext in line for ext in extensions):
                # *copy out the value from the MD5 value column to new.csv*
I just don't know how to extract the MD5 hash.
Any suggestions?
Create one list for the MD5 hashes and one for the filenames; if a listed extension appears in an item of the filename list, save that index and use it for your MD5 list (since you've got a table, the index has to be the same).
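A minimal sketch of that two-list approach (Python 2.x, assuming the column order from the header above and the x.csv name from your pseudocode):

import csv

extensions = ['.doc', '.xls', '.ppt']
md5s, names = [], []

with open('x.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        md5s.append(row[1])   # "MD5" column
        names.append(row[3])  # "FileName" column

# an index that matches in the filename list indexes into the MD5 list too
for i, name in enumerate(names):
    if any(name.endswith(ext) for ext in extensions):
        print md5s[i]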
Solution identified:
import csv

filetypes = ['jpg', 'bmp', 'jpeg', 'mov', 'mp4', 'avi', 'wmv', 'wav', 'tif', 'gif', 'png']

with open(r'c:\users\me\Desktop\x.csv', 'rb') as csv_in:
    reader = csv.reader(csv_in)
    for line in reader:  # iterate over the lines in the csv just once
        # (looping over extensions outside would exhaust the reader after the first pass)
        if any(extension in line[3] for extension in filetypes):
            print line[1] + "\t" + line[3]
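Note that "extension in line[3]" is a substring test, so a filename like report.jpg.backup would also match 'jpg'. If you only want to match real extensions, a stricter check is possible; a sketch (the helper name is mine):

import os

def has_extension(filename, filetypes):
    # compare only the actual extension, case-insensitively
    ext = os.path.splitext(filename)[1].lstrip('.').lower()
    return ext in filetypes

print has_extension('holiday.JPG', ['jpg', 'png'])        # True
print has_extension('report.jpg.backup', ['jpg', 'png'])  # False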
So my problem is: I have proteomes in FASTA format, which look like this:
Name of the example file:
GCA_003547095.1_protein.faa
Contents:
>CAG77607.1
ABCDEF
>CAG72141.1
CSSDAS
And I also have files that contain just the names of the proteins, i.e.:
Filename:
PF00001
Contents:
CAG77607.1
CAG72141.1
My task is to iterate through the proteomes using the lists of proteins to find out how many proteins are in each proteome. PE told me that it should be a dictionary made from the filenames of the proteomes as keys and the sequence names after ">" as values.
My approach was as follows:
import pandas as pd
file_names = open("proteomes_list").readlines()
d = {x: pd.read_csv("/proteomes/" + "GCA_003547095.1_protein.faa").columns.tolist() for x in file_names}
print (d)
As you can see, I've made the proteome filenames into a list (using a simple bash ls; these are ONLY the names of the proteomes) and then created a dictionary with sequence names as values. Unfortunately each proteome (including the tested one) ends up with only one value.
I would be grateful if you could shed some light on my case.
My goal was to make a dictionary where the key would be e.g. GCA_003547095.1_protein.faa and the value e.g. CAG77607.1, CAG72141.1.
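In other words, the structure I am after would look like this (values taken from the example files above):

d = {"GCA_003547095.1_protein.faa": ["CAG77607.1", "CAG72141.1"]}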
Is this the output you expect? This function iterates over your file and grabs the FASTA headers, i.e. the names of the proteins expected in the file. Here is a quick function that creates a list of the FASTA headers.
You can create the dictionary you mentioned by iterating over the file names and updating the parent dictionary:
import os

def extract_proteomes(folder: str, filename: str) -> list[str]:
    with open(os.path.join(folder, filename), mode='r') as file:
        content = file.read().split('\n')
    # FASTA headers: lines starting with '>', kept without the '>'
    protein_names = [i[1:] for i in content if i.startswith('>')]
    if not protein_names:
        # plain protein-list files (e.g. PF00001): keep every non-empty line
        protein_names = [i for i in content if i]
    return protein_names
folder = "/Users/user/Downloads/"
files = ["GCA_003547095.1_protein.faa", "PF00001"]
d = {}
for i in files:
    d[i] = extract_proteomes(folder=folder, filename=i)
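Since the original task was to count how many proteins are in each proteome, a short follow-up on the dictionary built above (a sketch reusing d) could be:

counts = {name: len(proteins) for name, proteins in d.items()}
print(counts)  # e.g. {'GCA_003547095.1_protein.faa': 2, 'PF00001': 2}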
I'm trying to change several hex values in a text file. I made a CSV that has the original values in one column and the new values in another.
My goal is to write a simple Python script to find old values in the text file based on the first column and replace them with new values in the second.
I'm attempting to use a dictionary, which I built by looping through the CSV, to drive the replace(). Building it was pretty easy, but using it to execute a replace() hasn't been working out: when I print out the values after my script runs, I'm still seeing the original ones.
I've tried reading in the whole text file using read() and applying the change to the entire contents at once:
import csv

filename = "origin.txt"
csv_file = 'replacements.csv'
conversion_dict = {}

# Create conversion dictionary
with open(csv_file, "r") as replace:
    reader = csv.reader(replace, delimiter=',')
    for rows in reader:
        conversion_dict.update({rows[0]: rows[1]})

# Replace values on text files based on conversion dict
with open(filename, "r") as fileobject:
    txt = str(fileobject.read())

for keys, values, in conversion_dict.items():
    new_text = txt.replace(keys, values)
I've also tried adding the updated text to a list:
# Replace values on text files based on conversion dict
with open(filename, "r") as fileobject:
    txt = str(fileobject.read())

for keys, values, in conversion_dict.items():
    new_text.append(txt.replace(keys, values))
Then, I tried using readlines() to replace the old values with new ones one line at a time:
# Replace values on text files based on conversion dict
with open(filename, "r") as reader:
    reader.readlines()
    type(reader)
    for line in reader:
        print(line)
        for keys, values, in conversion_dict.items():
            new_text.append(txt.replace(keys, values))
While troubleshooting, I ran a test to see if I was getting any matches between the keys in my dict and the text in the file:
for keys, values, in conversion_dict.items():
    if keys in txt:
        print("match")
    else:
        print("no match")
My output returned match on every row except the first one. I imagine with some trimming or something I could fix that. However, this proves that there are matches, so there must be some other issue with my code.
Any help is appreciated.
origin.txt:
oldVal9000,oldVal1,oldVal2,oldVal3,oldVal69
test.csv:
oldVal1,newVal1
oldVal2,newVal2
oldVal3,newVal3
oldVal4,newVal4
import csv

filename = "origin.txt"
csv_file = 'test.csv'
conversion_dict = {}

with open(csv_file, "r") as replace:
    reader = csv.reader(replace, delimiter=',')
    for rows in reader:
        conversion_dict.update({rows[0]: rows[1]})

with open(filename, 'r') as f:
    txt = str(f.read())

txt = txt.split(',')  # not sure what your origin.txt actually looks like, assuming comma separated values
for i in range(len(txt)):
    if txt[i] in conversion_dict:
        txt[i] = conversion_dict[txt[i]]

with open(filename, "w") as outfile:
    outfile.write(",".join(txt))
modified origin.txt:
oldVal9000,newVal4,newVal1,newVal3,oldVal69
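For what it's worth, the original attempt kept printing old values because new_text = txt.replace(keys, values) starts from the unmodified txt on every pass through the loop, so only the last replacement survives. Reassigning cumulatively fixes that; a minimal sketch, reusing the conversion_dict and filename from above:

with open(filename, "r") as fileobject:
    txt = fileobject.read()

for old, new in conversion_dict.items():
    txt = txt.replace(old, new)  # carry each replacement forward

with open(filename, "w") as outfile:
    outfile.write(txt)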
I have a folder with several csv files and also compressed files in gz format. Each of these gz files contains one csv file. I want to extract all of them and create a dataframe for each one, with the same name as the csv file (without the extension).
For example, if I have the following files:
train.csv
test.csv
validation.csv.gz
I want to have 3 dataframe objects whose names are exactly: train, test and validation.
I've tried this code:
import os
import pandas as pd
import gzip

extension = ".gz"
for item in os.listdir():
    if item.endswith(extension):
        with gzip.open(item) as f:
            item.split('.', 1)[0] = pd.read_csv(f)  # split on the first occurrence of '.' and give this name to my dataframe
    else:
        item.split('.', 1)[0] = pd.read_csv(item)
This code doesn't work: when I then try to access the dataframe variables, Python can't find them.
Any help, please!
You can't create a new variable by assigning to the result of an expression. If you want to dynamically bind an object to a name given as a string, you can make use of exec.
This statement supports dynamic execution of Python code. The first
expression should evaluate to either a string, an open file object, or
a code object.
import os
import pandas as pd
import gzip

extension = ".gz"
for item in os.listdir():
    name = item.split('.', 1)[0]  # split on the first occurrence of '.' to get the dataframe name
    if item.endswith(extension):
        with gzip.open(item) as f:
            exec(name + " = pd.read_csv(f)")
    else:
        exec(name + " = pd.read_csv('" + item + "')")
Use a dictionary for a variable number of variables.
While it's possible to name variables via strings, it is strongly discouraged. A dictionary is performant and allows you to maintain a collection of objects in a structured way.
import os
import gzip
import pandas as pd

d = {}
for item in os.listdir():
    fn = item.split('.', 1)[0]  # name without any extensions
    if item.endswith('.gz'):    # handles 'validation.csv.gz', which has two dots
        with gzip.open(item) as f:
            d[fn] = pd.read_csv(f)
    else:
        d[fn] = pd.read_csv(item)
Then access via d['train'], d['test'], etc.
Your code does not work because item.split('.', 1)[0] = ... only assigns into the temporary list returned by split(), which is immediately discarded; it never creates a variable with that name.
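A quick demonstration of that discarded temporary (throwaway values, just to illustrate):

item = "train.csv"
item.split('.', 1)[0] = 42  # runs without error, but the list it modifies is thrown away
# print(train)              # would raise NameError: name 'train' is not defined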
I am very new to Python. I have a .txt file and want to convert it to a .csv file in the format I was told, but I could not manage to accomplish it. A hand would be useful. I am going to explain it with screenshots.
I have a txt file with the name bip.txt, and the data inside of it is like this
I want to convert it to csv like this csv file
So far, what I could do is only write all the data from the text file with this code:
import glob

read_files = glob.glob("C:/Users/Emrehana1/Desktop/bip.txt")
with open("C:/Users/Emrehana1/Desktop/Test_Result_Report.csv", "w") as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
So is there a solution to convert it to a csv file in the format I desire? I hope I have explained it clearly.
There's no need to use the glob module if you only have one file and you already know its name. You can just open it. It would have been helpful to quote your data as text, since as an image someone wanting to help you can't just copy and paste your input data.
For each entry in the input file you will have to read multiple lines to collect together the information you need to create an entry in the output file.
One way is to loop over the lines of input until you find one that begins with "test:", then get the next line in the file using next() to create the entry:
The following code will produce the split you need; creating the csv file can be done with the standard library csv module, and is left as an exercise (see the sketch after the code). I used a different file name, as you can see.
with open("/tmp/blip.txt") as f:
for line in f:
if line.startswith("test:"):
test_name = line.strip().split(None, 1)[1]
result = next(f)
if not result.startswith("outcome:"):
raise ValueError("Test name not followed by outcome for test "+test_name)
outcome = result.strip().split(None, 1)[1]
print test_name, outcome
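For reference, the csv-writing step left as an exercise could look like this minimal sketch, which collects the pairs into a list and writes them out with the standard csv module (the header names are my assumption):

import csv

rows = []
with open("/tmp/blip.txt") as f:
    for line in f:
        if line.startswith("test:"):
            test_name = line.strip().split(None, 1)[1]
            outcome = next(f).strip().split(None, 1)[1]
            rows.append((test_name, outcome))

with open("Test_Result_Report.csv", "wb") as out:  # 'wb' for the csv module on Python 2
    writer = csv.writer(out)
    writer.writerow(["test", "outcome"])  # header row
    writer.writerows(rows)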
You do not use the glob function to open a file; it searches for file names matching a pattern. You could open the file bip.txt, read each line and collect the values into rows, then join the rows with commas and newlines and write the result out as a csv file, like this:
# set the csv column headers
values = [["test", "outcome"]]
current_row = []
with open("bip.txt", "r") as f:
    for line in f:
        # when a blank line is found, append the completed row
        if line == "\n" and current_row != []:
            values.append(current_row)
            current_row = []
        if ":" in line:
            # get the value after the colon
            value = line[line.index(":")+1:].strip()
            current_row.append(value)
# append the final row to the list, if there is one
if current_row:
    values.append(current_row)
# join the columns with a comma and the rows with a new line
csv_result = ""
for row in values:
    csv_result += ",".join(row) + "\n"
# output the csv data to a file
with open("Test_Result_Report.csv", "w") as f:
    f.write(csv_result)
Not sure where to start with this... I know how to read in a csv file, but if I have a heap of files in the same directory, how can I read them in according to whether they appear in a list? For example, a list such as...
l= [['file1.csv','title1','1'], ['file2.csv','title2','1'],['file3.csv','title3','1']]
How can I get just those 3 files, even though I have up to 'file20.csv' in the directory?
Can I somehow loop through the list and use an if-statement to check the filenames, opening each file if it is found?
import csv

for filedesc in l:  # go over each sublist in l
    fname, ftitle, _ = filedesc  # unpack the information contained in it
    with open(fname) as f:  # open the file with the appropriate name
        reader = csv.reader(f)  # create a reader for that file
        # go about business
An updated post because I've gotten so close with this....
lfiles = []
csvfiles = []
for row in l:
    lfiles = row[0]  # this reads in just the filename from list 'l'
    with open(lfiles) as x:
        inread = csv.reader(x)
        for i in inread:
            print i
That prints everything in the files that were read in, but now I want to append to 'csvfiles' (an empty list) each row where a particular column equals something.
Probably something like this...?
for i in x:
    for line in i:
        if line == 'ThisThingAppears5Times':
            csvfiles.append(line)  # and now the 5 lines are in a 2d list
Of course that doesn't work, but is it close?
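A working version of that idea, as a sketch (assuming the rows come from csv.reader as in the earlier answer, and that the column you want to test is the first one; adjust the index for your data):

import csv

csvfiles = []
for fname, ftitle, _ in l:  # l is the list of [filename, title, flag] sublists from above
    with open(fname) as f:
        for row in csv.reader(f):
            if row[0] == 'ThisThingAppears5Times':  # column index is an assumption
                csvfiles.append(row)  # keep the whole matching row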