I have just started a data quality class in which I got zero instruction on Python but am expected to create a script. There are three instructions for my Python script:
1. Create a script that loads an entire CSV file and replaces all blank values with NaN
2. Use the genfromtxt function
3. Write the result set into a different file
I have been working on this for a few hours, but with no previous experience with Python, I am completely stuck! This is what I have so far:
import csv

file = open('quality.csv', 'r')  # the filename must be a string
csvreader = csv.reader(file)
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
My first problem is that when I tried using genfromtxt, it would not print the headers or the entire CSV file; it would only print a few lines. If it matters, all of the values in the CSV file are ints/floats, but the headers are strings.
See here
The next problem is that I have tried several different ways to replace the blank values, but I was not successful. All of the blank fields in this file are in the last column. When I print out the CSV in full, this is what the line looks like (I've highlighted the empty value):
See here
Finally, I have no idea what instruction #3 means. I am completely new at this with zero Python knowledge! I think I am unsure of the Python syntax and rules, which I will look into and learn, but I only had two days to complete this assignment and I do not know anything yet! Thank you in advance.
What you did with genfromtxt seems correct already. With data this big, the terminal only shows a few rows from the beginning and the end, and the three dots in the middle indicate the records you're not seeing.
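All three instructions can be covered with genfromtxt plus savetxt. A minimal sketch, assuming the file is named quality.csv, every data column is numeric, and quality_nan.csv is a placeholder output name:

import sys
import numpy as np

# blank fields are parsed as nan automatically when genfromtxt reads floats
data = np.genfromtxt('quality.csv', delimiter=',', names=True)

# optional: print the full array instead of numpy's truncated preview
np.set_printoptions(threshold=sys.maxsize)
print(data.dtype.names)  # the headers
print(data)

# instruction #3: write the result set into a different file
np.savetxt('quality_nan.csv', data, delimiter=',', fmt='%g',
           header=','.join(data.dtype.names), comments='')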
My code is only returning the first 7 elements of each row, not the final two elements. I have tried copy-pasting the data into another file, but that hasn't worked. Nothing appears to be wrong with the file itself either.
with open("tr_cities.csv") as file:
for row in csv.reader(file):
print(row)
Here is a photo of the output
It's interesting to note that it picks up the final two cells as null cells rather than ignoring them entirely. I don't know what to do.
Here is the file in question. I'm running Ubuntu, if it helps.
Here are a few lines as text of the file:
city,lat,lng,country,iso2,admin_name,type,population,population_proper
Istanbul,41.01,28.9603,Turkey,TR,Istanbul,admin,15154000,15029231
Ankara,39.93,32.85,Turkey,TR,Ankara,capital,5503985,5503985
Izmir,38.4127,27.1384,Turkey,TR,Izmir,admin,4320519,4320519
Bursa,40.1833,29.0667,Turkey,TR,Bursa,admin,2901396,2901396
Antalya,36.9081,30.6956,Turkey,TR,Antalya,admin,2426356,2426356
Konya,37.8714,32.4847,Turkey,TR,Konya,admin,2232374,2232374
Adana,37,35.325,Turkey,TR,Adana,admin,2220125,2220125
...
This is because your file does not have data for the last two columns after the 196th record.
Your application is working properly.
See this:
And this is the output of the program:
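If you'd rather have every row come back with the full set of fields anyway, here is a small sketch that pads short rows with empty strings (the padding is my own choice; the csv module itself does no padding):

import csv

with open("tr_cities.csv") as file:
    reader = csv.reader(file)
    header = next(reader)
    for row in reader:
        # pad rows that are missing trailing fields to the header's width
        row += [""] * (len(header) - len(row))
        print(row)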
A question regarding pandas: say I created a DataFrame and generated output under separate variables rather than printing them. How would I go about combining them back into another DataFrame correctly, either to save as a CSV and then upload to a DB, or to upload to a DB directly?
Everything works fine code-wise; I just haven't really seen, and don't know, the best practice for this. I know we can store things in lists, dicts, etc.
What I did was:
# imported all modules
object = df.iloc[0, 0]
# for-loop magic goes here
#   nested for loop
#     if conditions are met, do this
result = df.iloc[i, k + 1]
print(object, result)
I've also stored them into a separate DataFrame trying:
df2 = pd.DataFrame({'object': object, 'result': result}, index=[0])
df2.to_csv('output.csv', index=False, mode='a')
The only problem with that is that it appends everything to each row, most likely due to the append and perhaps not including it in the for loop. Which is odd, because the raw output is EXACTLY how I'm trying to get it into a CSV or into a DB.
As I was saying, though, I'm looking to combine both values back into a DataFrame for speed. I tried concat etc., but no luck, so I was wondering what the correct format would be? Thanks
So it turns out that after more research and revising, I solved my issue.
I referenced the question below, plus my own revisions; this is the basis of what I did:
Empty space in between rows after using writer in python
import csv

# Had to wrap this in a for loop (not listed) and append to the file,
# clearing it first, to remove the blank line after each row
with open('csvexample.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Additional supporting material:
Confused by python file mode "w+"
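For the original goal of combining the values back into a DataFrame, the usual pattern is to accumulate the pairs in a list inside the loops and build the frame once at the end. A minimal, self-contained sketch; the toy df and the value > 2 check are hypothetical stand-ins for the real data and the "if conditions are met" logic:

import pandas as pd

# toy frame standing in for the real df
df = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 5], 'y': [3, 2]})

pairs = []  # one (object, result) tuple per match, collected inside the loops
for i in range(len(df)):
    for k in range(1, len(df.columns)):
        value = df.iloc[i, k]
        if value > 2:  # hypothetical stand-in for the real condition
            pairs.append((df.iloc[i, 0], value))

# build the DataFrame once, after the loops, and write it in one shot
df2 = pd.DataFrame(pairs, columns=['object', 'result'])
df2.to_csv('output.csv', index=False)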
Interesting problem: I'm using Python's CSVReader to read comma-delimited data from a UTF-8 formatted CSV file. It appears the reader is truncating column names when it encounters a period.
For example, here is a sample of my column names.
time,b12.76org2101.xz,b12.75org2001.xz,b11.72ogg8090.xy
Here's how I'm reading this data
import csv

def parseCSV(inputData):
    file_to_open = inputData
    with open(file_to_open) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        headerLine = True
        line = []
        for row in csv_reader:
            pass  # column manipulation code here
And here's how CSVReader interprets those column names
time,76org2101,75org2001,72ogg8090
Here's the important bit: the code I shared is the first thing in the program that touches that CSV file. After the code has finished executing, I can also verify that the CSV file itself is unchanged. The problem must lie with how CSVReader interprets periods, but I'm not sure what the fix is.
Here's another interesting find. Later in the program I use Pandas to read a list of identical names from a column in another file.
The data is formatted as follows
COLUMN_NAMES
b12.76org2101.xz,
b12.75org2001.xz,
b11.72ogg8090.xy,
Where COLUMN_NAMES is the CSV's header and the items below are rows.
You can see the code I use to read these values here.
data = pandas.read_csv(file_to_open)
Headers = data['COLUMN_NAMES'].tolist()
And this is how Pandas interprets those rows
76org2101
75org2001
72ogg8090
The data is exactly the same, and we see exactly the same behavior! The column names with periods are truncated in exactly the same way.
So what's up? Because both Pandas and CSVReader have identical issues, I'm tempted to think this is a Python problem, but I'm not sure how to resolve it. Any ideas are appreciated!
EDIT: The issue was with my code, I was reading the wrong files which incidentally happened to have the same column names as my expected files, just without anything before or after the periods. What're the odds!
Using pd.__version__ '0.23.0' and python version 3.6.5, I get the expected results:
print(pd.read_csv('test.csv'))
COLUMN_NAMES
0 b12.76org2101.xz
1 b12.75org2001.xz
2 b11.72ogg8090.xy
headers = pd.read_csv('test.csv')['COLUMN_NAMES'].tolist()
print(headers)
['b12.76org2101.xz', 'b12.75org2001.xz', 'b11.72ogg8090.xy']
It also works if those values are columns:
pd.DataFrame(columns=headers).to_csv('test1.csv', index=None)
print(pd.read_csv('test1.csv'))
Empty DataFrame
Columns: [b12.76org2101.xz, b12.75org2001.xz, b11.72ogg8090.xy]
Index: []
Maybe try updating your version of python?
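For completeness, the stdlib csv reader does not treat periods specially either; a quick check:

import csv
import io

# field values containing periods round-trip untouched
sample = io.StringIO("time,b12.76org2101.xz,b12.75org2001.xz,b11.72ogg8090.xy\n")
print(next(csv.reader(sample)))
# ['time', 'b12.76org2101.xz', 'b12.75org2001.xz', 'b11.72ogg8090.xy']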
So I've got about 5008 rows in a CSV file, 5009 in total with the header. I'm creating and writing this file all within the same script. But when I read it at the end, with either pandas' pd.read_csv or Python 3's csv module, and print the len, it outputs 4967. I checked the file for any weird characters that may be confusing Python, but I don't see any. All the data is delimited by commas.
I also opened it in Sublime and it shows 5009 rows, not 4967.
I could try other methods from pandas like merge or concat, but if Python won't read the CSV correctly, that's no use.
This is one method I tried.
df1 = pd.read_csv('out.csv', quoting=csv.QUOTE_NONE, error_bad_lines=False)
df2 = pd.read_excel(xlsfile)
print(len(df1))  # 4967
print(len(df2))  # 5008
df2['Location'] = df1['Location']
df2['Sublocation'] = df1['Sublocation']
df2['Zone'] = df1['Zone']
df2['Subnet Type'] = df1['Subnet Type']
df2['Description'] = df1['Description']
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df2.to_csv(newfile, index=False)
print('Done.')
target.close()
Another way I tried is
dfcsv = pd.read_csv('out.csv')
wb = xlrd.open_workbook(xlsfile)
ws = wb.sheet_by_index(0)
xlsdata = []
for rx in range(ws.nrows):
    xlsdata.append(ws.row_values(rx))
print(len(dfcsv))  # 4967
print(len(xlsdata))  # 5009
df1 = pd.DataFrame(data=dfcsv)
df2 = pd.DataFrame(data=xlsdata)
df3 = pd.concat([df2, df1], axis=1)
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df3.to_csv(newfile, index=False)
print('Done.')
target.close()
But no matter what way I try, the CSV file is the actual issue: Python is writing it correctly but not reading it correctly.
Edit: The weirdest part is that I'm getting absolutely no encoding errors, or any errors at all, when running the code...
Edit 2: Tried testing it with the nrows param in the first code example; it works up to 4000 rows. As soon as I specify 5000 rows, it reads only 4967.
Edit 3: Manually saved a CSV file with my data instead of using the one written by the program, and it read 5008 rows. Why is Python not writing the CSV file correctly?
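One way to narrow this down is to compare the raw line count against the parsed row count; if they differ, some physical lines are being merged into single records (an unbalanced quote will do exactly that). A small diagnostic sketch, assuming the file written by the program is out.csv:

import pandas as pd

# count the raw physical lines in the file
with open('out.csv') as f:
    raw_lines = sum(1 for _ in f)

df = pd.read_csv('out.csv')
# raw_lines - 1 excludes the header; any gap means lines were merged into one record
print(raw_lines - 1, len(df))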
I ran into this issue too. I realized that some of my lines had open-ended quotes, which for some reason were interfering with the reader.
So for example, some rows were written as:
GO:0000026 molecular_function "alpha-1
GO:0000027 biological_process ribosomal large subunit assembly
GO:0000033 molecular_function "alpha-1
and this led to rows being read incorrectly. (Unfortunately I don't know enough about how the csv reader works to tell you why; hopefully someone can clarify the quote behavior!)
I just removed the quotes and it worked out.
Edited: This option works too, if you want to maintain the quotes:
quotechar=None
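To make the quote behavior concrete, here is a minimal reproduction using a cut-down version of the rows above: the unclosed quote opens a quoted field, and that field then absorbs the newline, so two physical lines come back as one record.

import csv
import io

bad = ('GO:0000026\tmolecular_function\t"alpha-1\n'
       'GO:0000027\tbiological_process\tribosomal large subunit assembly\n')

# default quoting: one merged row, because the quoted field runs on past the newline
print(list(csv.reader(io.StringIO(bad), delimiter='\t')))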
My best guess without seeing the file is that some lines have too many or not enough commas, maybe due to values like foo,bar.
Please try setting error_bad_lines=True (see the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to see if it catches the lines with errors in them; my guess is that there will be 41 such lines.
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)
On write, the csv.QUOTE_NONE option seems to leave fields unquoted and replace the delimiter with escape_char + delimiter, but you didn't paste your writing code; on read, it's less clear from the docs what this option does. https://docs.python.org/3/library/csv.html#csv.Dialect
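On the read side, what QUOTE_NONE does is simple: the reader stops treating the quote character specially and leaves it in the data, so nothing gets merged. A quick check:

import csv
import io

row = io.StringIO('GO:0000026\tmolecular_function\t"alpha-1\n')

# with QUOTE_NONE the quote character is ordinary data
print(next(csv.reader(row, delimiter='\t', quoting=csv.QUOTE_NONE)))
# ['GO:0000026', 'molecular_function', '"alpha-1']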
I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time

timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename = "/data examples/" + "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt',
                           filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the filename variable to load the file and do some things to it, rather than having to write out the full file path and name each time. My ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run.
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. data starting 21 lines in, the same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that the data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)

with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.