From Excel to Python: from columns to strings

I have an Excel file with 2 columns (it is a CSV file). The first contains dates, while the second contains numbers.
I want to make a list out of this with Python, like this: ['date1','number1','date2','number2']. At the moment I always get: ['date1;number1','date2;number2']. I basically want every element to be treated as a string on its own.
At the moment I'm using following code:
import csv

abc = []
with open('C:...doc.csv', 'r', encoding="utf8") as f:
    reader = csv.reader(f)
    for row in reader:
        abc.extend([row])
I've tried everything I could come up with, e.g. nested for loops, but nothing seems to work.
Can anyone help me out please? Thank you!
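The ['date1;number1'] output suggests the file is semicolon-delimited, and abc.extend([row]) appends each whole row as a single list item rather than its individual fields. A minimal sketch of one possible fix, assuming a semicolon delimiter (the truncated path is kept from the question):
import csv

abc = []
# Assumption: the file uses ';' as its delimiter, as the output suggests
with open('C:...doc.csv', 'r', encoding="utf8", newline='') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        abc.extend(row)  # extend with the row's fields, not the row as one item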

Related

Python import csv file and replace blank values

I have just started a data quality class in which I got zero instruction on Python but am expected to create a script. There are three instructions for my Python script:
Create a script that loads an entire CSV file and replaces all the blank values with NaN
Use the genfromtxt function
Write the result set into a different file
I have been working on this for a few hours, but with no previous experience with Python, I am completely stuck! This is what I have so far:
import csv

file = open('quality.csv', 'r')  # the filename needs quotes
csvreader = csv.reader(file)
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
My first problem is that when I tried using genfromtxt, it would not print out the headers or the entire csv file; it would only print out a few lines. If it matters, all of the values of the csv file are ints/floats, but the headers are strings.
The next problem is that I have tried several different ways to replace the blank values, but I was not successful. All of the blank fields in this file are in the last column.
Finally, I have no idea what instruction #3 means. I am completely new at this with zero Python knowledge! I think I am unsure of the Python syntax and rules, which I will look into more and learn; however, I only have two days to complete this assignment and I do not know anything yet! Thank you in advance.
What you did with genfromtxt seems correct already. With big data like this, the terminal only shows some rows from the beginning and the end, and the three dots in the middle indicate the other records you're not seeing there!
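For the rest of the assignment, here is a minimal sketch, assuming a comma-delimited quality.csv with a one-line header and numeric data (the output name quality_clean.csv is made up; genfromtxt already fills blank fields with nan for float data, and filling_values just makes that explicit):
import numpy as np

# Load everything after the header as floats; blank fields become nan
data = np.genfromtxt('quality.csv', delimiter=',', skip_header=1,
                     filling_values=np.nan)

# Re-read the header line so it can be carried over to the output file
with open('quality.csv', 'r') as f:
    header = f.readline().strip()

# Write the result set into a different file (instruction #3)
np.savetxt('quality_clean.csv', data, delimiter=',',
           header=header, comments='', fmt='%g')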

Python's CSVReader seems to be separating on periods

Interesting problem: I'm using Python's CSVReader to read comma-delimited data from a UTF-8 encoded CSV file. It appears the reader is truncating column names when it encounters a period.
For example, here is a sample of my column names.
time,b12.76org2101.xz,b12.75org2001.xz,b11.72ogg8090.xy
Here's how I'm reading this data
import csv

def parseCSV(inputData):
    file_to_open = inputData
    with open(file_to_open) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        headerLine = True
        line = []
        for row in csv_reader:
            # column manipulation code here
            pass
And here's how CSVReader interprets those column names
time,76org2101,75org2001,72ogg8090
Here's the important bit: the code I shared is the first thing in the program that touches that CSV file, and after the code has finished executing I can verify that the CSV file itself is unchanged. The problem must lie with how CSVReader interprets periods, but I'm not sure what the fix is.
Here's another interesting find. Later in the program I use Pandas to read a list of identical names from a column in another file.
The data is formatted as follows
COLUMN_NAMES
b12.76org2101.xz,
b12.75org2001.xz,
b11.72ogg8090.xy,
Where COLUMN_NAMES is the CSV's header and the items below are rows.
The code I use to read these values is below.
data = pandas.read_csv(file_to_open)
Headers = data['COLUMN_NAMES'].tolist()
And this is how Pandas interprets those rows
76org2101
75org2001
72ogg8090
The data is exactly the same, and we see exactly the same behavior! The column names with periods are truncated in exactly the same way.
So what's up? Because both Pandas and CSVReader have identical issues, I'm tempted to think this is a Python problem, but I'm not sure how to resolve it. Any ideas are appreciated!
EDIT: The issue was with my code. I was reading the wrong files, which incidentally happened to have the same column names as my expected files, just without anything before or after the periods. What're the odds!
Using pandas 0.23.0 (pd.__version__) and Python 3.6.5, I get the expected results:
print(pd.read_csv('test.csv'))
       COLUMN_NAMES
0  b12.76org2101.xz
1  b12.75org2001.xz
2  b11.72ogg8090.xy
headers = pd.read_csv('test.csv')['COLUMN_NAMES'].tolist()
print(headers)
['b12.76org2101.xz', 'b12.75org2001.xz', 'b11.72ogg8090.xy']
It also works if those values are columns:
pd.DataFrame(columns=headers).to_csv('test1.csv', index=None)
print(pd.read_csv('test1.csv'))
Empty DataFrame
Columns: [b12.76org2101.xz, b12.75org2001.xz, b11.72ogg8090.xy]
Index: []
Maybe try updating your version of Python?

Cross referencing two csv files in python

So, as I'm out of ideas, I've turned to the geniuses on this site.
What I want to be able to do is have two separate csv files: one has a bunch of store names on it, and the other has blacklisted stores.
I'd like to be able to run a Python script that reads the 'blacklisted' sheet, then checks if those specific names are within the other sheet, and if they are, deletes them off the main sheet.
I've tried for about two days straight and cannot for the life of me get it to work, so I'm coming to you guys to help me out.
Thanks so much in advance.
P.S. If you can comment the hell out of the script so I know what's going on, it would be greatly appreciated.
EDIT: I deleted the code I originally had but hopefully this will give you an idea of what I was trying to do. (I also realise it's completely incorrect)
import csv

with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    with open('Destinations.csv', 'r') as dest:
        readern = csv.reader(dest)
        for line in reader:
            if line in readern:
                with open('Destinations.csv', 'w'):
                    del(line)
The first thing you need to be aware of is that you can't update the file you are reading. Text files (which include .csv files) don't work like that. So you have to read the whole of Destinations.csv into memory, and then write it out again, under a new name, but skipping the rows you don't want. (You can overwrite your input file, but you will very quickly discover that is a bad idea.)
import csv

blacklist_rows = []
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    for line in reader:
        blacklist_rows.append(line)

destination_rows = []
with open('Destinations.csv', 'r') as dest:
    readern = csv.reader(dest)
    for line in readern:
        destination_rows.append(line)
Now at this point you need to loop through destination_rows, drop any that match something in blacklist_rows, and write out the rest. I can't suggest what the matching test should look like, because you haven't shown us your input data, so I don't actually know what blacklist_rows and destination_rows contain.
with open('FilteredDestinations.csv', 'w') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r:  # trap for blank rows in the input
            continue
        if r *matches something in blacklist_rows*:  # you have to code this
            continue
        writer.writerow(r)
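For example, if the store name is the first field of each row in both files (an assumption, since the input data isn't shown), the matching test could be a set membership check:
# Hypothetical matching test: assumes the store name is column 0 in both files
blacklisted = {row[0].strip().lower() for row in blacklist_rows if row}

with open('FilteredDestinations.csv', 'w', newline='') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r:  # trap for blank rows in the input
            continue
        if r[0].strip().lower() in blacklisted:  # skip blacklisted stores
            continue
        writer.writerow(r)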
You could try Pandas:
import pandas as pd

df1 = pd.read_csv("Destinations.csv")
df2 = pd.read_csv("Black List.csv")
blacklist = df2["column_name_in_blacklist_file"].tolist()
# Keep only the destination rows whose name is not in the blacklist
# (note: filter df1, the destinations, not df2, the blacklist)
df3 = df1[~df1['destination_column_name'].isin(blacklist)]
df3.to_csv("results.csv", index=False)
print(df3)

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file

I've tried for several hours to research this, but none of the possible solutions suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to load the file and do some things to it (rather than having to write out the full file path and name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by replacing this:
min_bz = min(bz_values)
max_bz = max(bz_values)
with this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
The following will work as long as all the files are formatted in the same way, i.e. data starting 21 lines in, same number of columns and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix a problem with negative numbers, also noted by Derek in a separate answer.

Iteratively copy specific rows from CSV file to new file

I have a large tab-delimited csv file with the following format:
#mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id mirna_alignment gene_alignment mirna_start mirna_end gene_start gene_end genome_coordinates conservation align_score seed_cat energy mirsvr_score
What I would like to be able to do is iterate through rows and select items based on data (strings) in the "gene_id" field, then copy those rows to a new file.
I am a Python noob, and thought this would be a good way to get my feet wet, but it is harder than it looks! I have been trying to use the csv package to manipulate the files, reading and writing basic stuff using DictReader and DictWriter. If anyone can help me come up with a template for the iterative searching aspect, I would be greatly indebted. So far I have:
import csv

f = open(r"C:\Documents and Settings\Administrator\Desktop\miRNA Scripting\mirna_predictions_short.txt", "r")
reader = csv.DictReader(f, delimiter='\t')
outfile = open("output.txt", 'wb')  # avoid reusing the name 'writer' for both the file and the csv writer
writer = csv.writer(outfile, delimiter='\t')
Then the iterative bit, bleurgh:
for row in reader:
    if reader.gene_id == str(CG11710):
        writer.writerow
This obviously doesn't work. Any ideas on better ways to structure this?
You're almost there! The code is nearly correct :)
Accessing dicts goes like this:
some_dict['some_key']
Instead of:
some_object.some_attribute
Creating a string isn't done with str(...) but with quotes, like 'CG11710'.
In your case:
for row in reader:
    if row['gene_id'] == 'CG11710':
        writer.writerow(row.values())  # csv.writer needs the field values; passing the dict itself would write its keys
Dictionaries in Python are addressed like dictionary['key'], so for you it'd be row['gene_id']. Also, strings are declared with quotes, "text", not like str(text); str(text) will try to cast whatever is stored in the variable text to a string, which is not what I think you want. And writer.writerow is a function, and functions take arguments, so you need to call it, e.g. writer.writerow(row.values()).
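Putting it together, here is a minimal end-to-end sketch using DictReader and DictWriter (the file names and the gene ID come from the question; the with blocks and newline='' are just Python 3 idioms):
import csv

infile = r"C:\Documents and Settings\Administrator\Desktop\miRNA Scripting\mirna_predictions_short.txt"

with open(infile, 'r', newline='') as f, open('output.txt', 'w', newline='') as out:
    reader = csv.DictReader(f, delimiter='\t')
    # DictWriter writes dict rows directly, keyed by the input's column names
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        if row['gene_id'] == 'CG11710':  # keep only rows for this gene
            writer.writerow(row)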
