Python Pandas converting strings to NaN

So I am using pandas to read in Excel files and CSV files. These files contain both strings and numbers, not just numbers. The problem is that all my strings get converted into NaN, which I do not want at all. I do not know what the types of the columns will be ahead of time (it is actually my job to handle the system that figures this out), so I can't tell pandas what they will be (that must come later). I just want to read in each cell as a string for now.
Here is my code:
if csv:  # check whether to read an Excel file or a CSV
    frame = pandas.read_csv(io.StringIO(data))
else:
    frame = pandas.read_excel(io.StringIO(data))
tbl = []
print frame.dtypes
for (i, col) in enumerate(frame):
    tmp = [col]
    for (j, value) in enumerate(frame[col]):
        tmp.append(unicode(value))
    tbl.append(tmp)
I just need to be able to produce a column-wise 2D list, and I can do everything from there. I also need to be able to handle Unicode (the data is already in Unicode).
How do I construct 'tbl' so that cells that should be strings do not come out as 'NaN'?

In general cases where you can't know the dtypes or column names of a CSV ahead of time, using a CSV sniffer can be helpful.
import csv
[...]
dialect = csv.Sniffer().sniff(f.read(1024))
f.seek(0)
frame = pandas.read_csv(io.StringIO(data), dialect=dialect)
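As a further note (an addition, not part of the original answer): pandas can also be told up front to treat every cell as text, which keeps string cells from becoming NaN. A minimal sketch, assuming data holds the file contents as a Unicode string as in the question (the sample value here is a stand-in):
import io
import pandas

data = u'name,qty\nwidget,1\n,2\n'  # stand-in for the question's data string

# dtype=str reads every cell as a string; keep_default_na=False stops
# pandas from turning empty or NA-looking cells into float NaN.
frame = pandas.read_csv(io.StringIO(data), dtype=str, keep_default_na=False)

# Column-wise 2D list: each inner list is [column_name, value1, value2, ...]
tbl = [[col] + list(frame[col]) for col in frame]
Recent pandas versions accept the same dtype=str argument in read_excel as well.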

Related

How do I pull specific data from one file and add it to another file in a specific spot?

I am learning how to use Python.
For the project I am working on, I have hundreds of datasheets containing a City, Species, and Time (speciesname.csv).
I also have a single datasheet that has all cities in the world with their latitude and longitude (cities.csv).
My goal is to add two more columns for latitude and longitude (from cities.csv) to every speciesname.csv datasheet, corresponding to the location of each species.
I am guessing my workflow will look something like this:
Go into the speciesname.csv file and find the location on each line.
Go into cities.csv and search for the location from speciesname.csv.
Copy the corresponding latitude and longitude into new columns in speciesname.csv.
I have been unsuccessful in my search for a blog post or someone else with a similar question. I don't know where to start, so anyone with a starting point would be very helpful.
Thank you.
You can achieve this in many ways.
The simplest approach I can think of is:
collect all the cities.csv data into a dictionary {"cityname": (lat, lon), ...}
read your speciesname.csv line by line, and for each line look up the city name as a key in the dictionary
when you find a match, append the line's data plus the lat and lon, separated by commas, to a buffer string ending with a "\n" character
when the loop over lines is finished, the buffer string will contain all the data and can be written out to a file; a rough sketch of these steps follows
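Here is a minimal sketch of those steps using the csv module (the filenames and column orders are assumptions taken from the question; adjust them to your real layout):
import csv

# Build a lookup of city -> (lat, lon) from cities.csv
# (assumes the columns are ordered city, lat, lon; adjust if not)
cities = {}
with open('cities.csv', 'rb') as f:
    for city, lat, lon in csv.reader(f):
        cities[city] = (lat, lon)

# Append lat/lon columns to each species row whose city is in the lookup
with open('speciesname.csv', 'rb') as fin, \
        open('speciesname_with_coords.csv', 'wb') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        city = row[0]  # assumes the city is the first column
        lat, lon = cities.get(city, ('', ''))
        writer.writerow(row + [lat, lon])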
Here is a little program that should work if you put it in the same folder as your two CSVs. I'm assuming you just have two sheets, one with the cities and another with the species. (Your description is a little confusing: it says the species info is spread over hundreds of datasheets, but then treats speciesname.csv as a single file.)
This program turns the two separate CSV files into pandas DataFrames, which can then be joined on the common city column. It then creates a new CSV from the joined DataFrame.
In order for this program to work, you need to install pandas, which is a library specifically for dealing with data in tabular (spreadsheet) form. I don't know what system you are on, so you'll have to find your own instructions here:
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
This is the version for when your CSVs do not have a header, i.e. when the first row is already data.
# necessary for functions like pd.read_csv
import pandas as pd

species_column_names = ['city', 'species', 'time']
speciesname = pd.read_csv('speciesname.csv', names=species_column_names, header=None)

cities_column_names = ['city', 'lat', 'long']
cities = pd.read_csv('cities.csv', names=cities_column_names, header=None)

# this joining function relies on both tables having a 'city' column
combined = speciesname.join(cities.set_index('city'), on='city')
combined.to_csv('combined.csv', index=False)  # write the joined table out
If your files already have headers, change the two read_csv calls to skip the first row instead; since I don't know how your headers are spelled/capitalized, we still join on the lower-case custom column names:
import pandas as pd

species_column_names = ['city', 'species', 'time']
speciesname = pd.read_csv('speciesname.csv', names=species_column_names, skiprows=1, header=None)

cities_column_names = ['city', 'lat', 'long']
cities = pd.read_csv('cities.csv', names=cities_column_names, skiprows=1, header=None)

# this joining function relies on both tables having a 'city' column
combined = speciesname.join(cities.set_index('city'), on='city')
combined.to_csv('combined.csv', index=False)  # write the joined table out

When reading excel files with pandas, what determines the datatype of the cells being read?

I am reading an Excel sheet and plucking data from rows containing the given PO.
import pandas as pd

xlsx = pd.ExcelFile('Book2.xlsx')
df = pd.read_excel(xlsx)
PO_arr = ['121121', '212121']
for i in PO_arr:
    PO = i
    PO_DATA = df.loc[df['PONUM'] == PO]
    for i in range(1, max(PO_DATA['POLINENUM'].values) + 1):
When I take this Excel sheet straight from its source, my code works fine. But when I cut out only the rows I want and paste them into a new spreadsheet with the exact same formatting, then read this new spreadsheet, I have to change PO_DATA to look for an integer instead of a string, like so:
PO_DATA = df.loc[df['PONUM'] == int(PO)]
If not, I get an error, and calling PO_DATA returns an empty dataframe.
C:\...\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
I checked the cell formatting in Excel and in both cases, they are formatted as 'General' cells.
What is going on that makes it so when I chop up my spreadsheet, I have to look for an integer and not a string? What do I have to do to make it work for sheets I've created and pasted relevant data into instead of only sheets from the source?
Excel can do some funky formatting when copy and paste (Ctrl-C / Ctrl-V) is used.
I am sure you tried these, but...
A) Try copying with Ctrl-C, then pasting values only with Ctrl-Alt-V, "V", Enter on the new sheet/file.
B) Try using the Format Painter in Excel: it looks like a paintbrush on the Home tab. Select the properly formatted cells first, double-click Format Painter, move to your new file/sheet, and select the cells you want the format applied to.
C) Select the table you pasted into, then use the eraser icon ("Clear") at the top of Excel and choose "Clear Formats".
Update: I found an old related thread that didn't necessarily answer the question but solved the problem.
You can force pandas to import values as a certain datatype when reading from Excel using the converters argument of read_excel.
df = pd.read_excel(xlsx, converters={'POLINENUM':int,'PONUM':int})
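An alternative worth noting (an addition, not from the original thread): read_excel also accepts a dtype mapping, so you could instead keep PONUM as text and leave the string comparison unchanged:
# Keep PONUM as text so df['PONUM'] == '121121' keeps matching;
# POLINENUM stays numeric for the max()/range() arithmetic.
df = pd.read_excel(xlsx, dtype={'PONUM': str})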

Creating arrays from CSV files in Python

So I have a data file from which I must extract specific data. Using:
import numpy

x = 15         # need a way for the code to assess how many lines to skip in the given data
maxcol = 2000  # need a way to find the final row in the data
data = numpy.genfromtxt('data.dat.csv', skip_header=x, delimiter=',')
column_one = data[0:maxcol, 0]
column_two = data[0:maxcol, 1]
this gives me an array for the specific case where there are (x=)15 lines of metadata above the required data and the number of rows of data is (maxcol=)2000. How do I go about changing the code so it works for any values of x and maxcol?
Use pandas. Its read_csv function does all that you want (I don't include its equivalent of delimiter, sep=',', because comma-delimited is the default):
import pandas as pd
data = pd.read_csv('data.dat.csv', skiprows=x, nrows=maxcol)
If you really want that as a numpy array, you can do this:
data = data.values
But you can probably just leave it as a pandas DataFrame.
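The question also asks how to determine x and maxcol automatically. One hedged approach, purely a sketch and not part of the original answer: treat the first line whose comma-separated fields all parse as numbers as the start of the data, and let pandas read to the end of the file.
import pandas as pd

def count_metadata_lines(path):
    # Return the index of the first line whose comma-separated fields
    # all parse as floats; the lines before it are treated as metadata.
    with open(path) as f:
        for i, line in enumerate(f):
            try:
                [float(v) for v in line.strip().split(',')]
                return i
            except ValueError:
                continue
    return 0

x = count_metadata_lines('data.dat.csv')
data = pd.read_csv('data.dat.csv', skiprows=x, header=None)
Omitting nrows makes pandas read every remaining row, so maxcol never needs to be known in advance.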

Extract designated data from one csv file then assign to another csv file using python

I have a CSV file containing data in this form (shown as a screenshot in the original post).
I want to extract the data from column C and write it into a new CSV file, again like the screenshot in the original post.
So I need to do two things:
write 'node' and the numbers from 1 to 22 into the first row and column (since in this case there are 22 rows in one repeated cycle in column A of the input CSV)
I have already got the data in column C extracted and written into the output CSV (screenshot in the original post).
I need to transpose those data, 22 rows at a time, and fill them into rows starting from position B2 in Excel, then B3, B4, ...etc.
It's clear that I must loop through every row to do this efficiently, but I don't know how to apply Python's csv module here.
Should I download the xlrd package, or can I handle this using only the built-in csv module?
I am working with Python 2.7.6 and PyScripter under Windows 8.1 x64. Feel free to give me any suggestions, thanks a lot!
Read the csv module documentation for Python.
The simple way to iterate through rows with the csv reader:
import csv

X = []
with open('path_to_file/filename.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        X.append(row)
This creates a variable with all the CSV data. The structure of your file will make it difficult to read: the cell separator is ',', but there are also multiple commas within each cell, and because of the parentheses there will be a mixture of string and numerical data that will require some cleaning. If you can reformat the CSV, it might be easier if each cell looked like 1,2,0.01 instead of (1,2,0.01); also consider using a different delimiter between cells, such as ';'.
If not expect some tedious data cleaning, and definitely read through the documentation linked above.
Edit: Try the following:
import csv

X = []
with open('path_to_file/filename.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        rowTemp = []
        for i in range(len(row)):
            if (i + 1) % 3 == 0:  # gets every third cell
                rowTemp.append(row[i])
        X.append(rowTemp)
This is a matrix of all the distance values. Then try:
with open('path_to_output_file/output_file.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for sublist in X:
        spamwriter.writerow(sublist)
Not sure if this is exactly what you're looking for, but it should be close. It outputs a CSV file that is stripped of all the node pairs.
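As a hedged follow-up for the transposition step the question mentions (the chunk size of 22 and the header row are assumptions taken from the question), continuing from the X built above: flatten the extracted values and write each run of 22 as one row, with a 'node' header row on top.
# Flatten the extracted values, then emit one row per cycle of 22 values.
flat = [value for row in X for value in row]
chunks = [flat[i:i + 22] for i in range(0, len(flat), 22)]

with open('path_to_output_file/output_transposed.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(['node'] + range(1, 23))  # header row: node, 1..22
    for chunk in chunks:
        spamwriter.writerow(chunk)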

Exporting a list to a CSV/space separated and each sublist in its own column

I'm sure there is an easy way to do this, so here goes. I'm trying to export my lists into a CSV in columns. (Basically, it's how another program will be able to use the data I've generated.) I have a group called [frames] which contains [frame001], [frame002], [frame003], etc. I would like the generated CSV file to have all the values of [frame001] in the first column, [frame002] in the second column, and so on. I thought if I could save the file as CSV I could manipulate it in Excel; however, I figure there is a solution that I can program to skip that step.
This is the code that I have tried using so far:
import csv
data = [frames]
out = csv.writer(open(filename,"w"), delimiter=',',quoting=csv.QUOTE_ALL)
out.writerow(data)
I have also tried:
import csv
myfile = open(..., 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(mylist)
If there's a way to do this so that all the values are space-separated, that would be ideal, but at this point I've been trying for hours and can't get my head around the right solution.
What you're describing is transposing a 2-dimensional array of data. In Python you can achieve this easily with the zip function, as long as the inner lists are all the same length.
out.writerows(zip(*data))
If they are not all the same length, you can use itertools.izip_longest (called zip_longest in Python 3) to fill the remaining fields with some default value (even ''). A short sketch combining the two is below.
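A minimal sketch of the whole export (the sample frame values, filename, and space delimiter are placeholders; substitute your real lists):
import csv
from itertools import izip_longest  # zip_longest in Python 3

frame001 = [1, 2, 3]       # stand-in data; use your real frames here
frame002 = [4, 5]
frame003 = [6, 7, 8, 9]
frames = [frame001, frame002, frame003]

with open('output.csv', 'wb') as myfile:
    out = csv.writer(myfile, delimiter=' ', quoting=csv.QUOTE_ALL)
    # zip(*frames) would put each frame in its own column but truncate to
    # the shortest list; izip_longest pads the shorter frames with ''.
    out.writerows(izip_longest(*frames, fillvalue=''))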
