Pandas.read_csv not reading full header - python

I have a csv file that has positions and velocities of particles saved like this:
x, y, z, vx, vy, vz
-0.960, 0.870, -0.490, 962.17, -566.10, 713.40
1.450, 0.777, 2.270, -786.27, 63.31, -441.00
-3.350, -1.640, 1.313, 879.20, 637.76, -556.24
-0.504, 2.970, -0.278, 613.22, -717.32, 557.02
0.338, 0.220, 0.090, -927.18, -778.77, -443.05
...
I'm trying to read this file and save it as a pandas DataFrame in a script with read_csv, but I get an error when calling any column except the first one:
AttributeError: 'DataFrame' object has no attribute 'y'
I never get the error for the 'x' column, so I wrote a snippet to see where the reading error was stemming from.
import pandas as pd
data = pd.read_csv('snap.csv')
print data
print data.x
print data.y
The console correctly prints out
x y z vx vy vz
0 -0.960 0.870 -0.490 962.17 -566.10 713.40
1 1.450 0.777 2.270 -786.27 63.31 -441.00
2 -3.350 -1.640 1.313 879.20 637.76 -556.24
3 -0.504 2.970 -0.278 613.22 -717.32 557.02
4 0.338 0.220 0.090 -927.18 -778.77 -443.05
...
meaning it is assigning the columns the correct names. Then print data.x gives
0 -0.960
1 1.450
2 -3.350
3 -0.504
4 0.338
...
showing it can take one of the columns out correctly. But then it throws the error again when trying to print the second column
AttributeError: 'DataFrame' object has no attribute 'y'
I then looped through data.itertuples() to print the first row individually in order to see what that looked like, and it confirmed that the names were only being assigned to the first column and none of the others.
Pandas(Index=0, x=-0.96, _2=0.87, _3=-0.49, _4=962.17, _5=-566.1, _6=713.4)
There aren't any other problems with the data; the values all correspond to the right index. It's just that the names are not being assigned correctly, and only the first column can be called by name. I tried putting single quotes around each column name in the file, and that produces the exact same error.
I know there are ways I might be able to work around this such as assigning the names in the read_csv function, but I'm curious as to what the issue could actually be so as to avoid having this happen again.

Try declaring column names when you create the data frame.
df = pd.DataFrame(pd.read_csv("file.csv"), columns=["x", "y", "z", "vx", "vy", "vz"])

df = pd.read_csv("snap.csv",names =["x", "y", "z", "vx", "vy", "vz"])

Related

Avoiding type <class 'method'> in a Pandas Dataframe

I have code that I am trying to run and it should be fairly simple; it is just math. But whenever I try to do the math with the pandas DataFrames I'm using, I get an error. I know that it is the column labeled 'first' which is giving me issues; I have gone through and checked all the others. I have also tried a few things to convert it to a column with floating point values so I can do the math with it, but I get errors every time. I am attaching the code along with comments on what I have tried and what errors I have been getting.
Any help would be greatly appreciated! I am very stuck on this. Thank you!
import pandas as pd

# Set constants
pi = 3.14159265359
e = 2.71828
h = 6.62607004*(10**-34) # J*s
c = 299792458 # m / s
kb = 1.380649*(10**-23) # J/K
temp3 = 3000 # K
temp10 = 10000 # K
constant = (2*pi*h*(c**2))
bb_df = pd.DataFrame({ 'wl_nm' : range(200, 1101 ,1)}) # Gets wavelength ranges I will want plotted
#bb_df.wl
bb_df['wl_m'] = (bb_df.wl_nm * (10**-9)) # Gets wavelength in meters (this one does work doing math with)
bb_df['first'] = constant/((bb_df.wl_m)**5) # This one does not work doing math with; says it's a method, not number, and cannot figure out how to change it
#bb_df['first'] = bb_df['first'].astype(float) # Tried this, but get error: TypeError: Cannot broadcast np.ndarray with operand of type <class 'method'>
#float(bb_df['first']) # Tried this, but get error: TypeError: cannot convert the series to <class 'float'>
bb_df['exponent'] = (h*c)/((bb_df.wl_m)*kb*temp3)
bb_df['denominator'] = e ** (bb_df.exponent) - 1
bb_df['second'] = 1 / bb_df.denominator
bb_df['second'] = bb_df.second + 1
bb_df['final'] = (bb_df.first) * (bb_df.second) # ERROR (because of bb_df.first)
#bb_df['test'] = float(bb_df.first) - float(bb_df.second)
#bb_df['intensity'] = (((2*pi*h*(c**2))/((bb_df.wl_m**5))(1/(e**((h*c)/((bb_df.wl_m)*kb*temp3))-1)))) # Also just tried typing out entire equation here, but this also gives an error
print(bb_df)
When I comment out all the lines that are not working, this is the dataframe I get. It is the 'first' column that says it is a method, and I have had trouble converting it to a floating point value to do math with. I thought perhaps it was because the numbers are so small, but then I should not have had the issue when I tried to type out the entire equation all at once (in the 'intensity' column attempt, which also did not work):
wl_nm wl_m first exponent denominator second
0 200 2.000000e-07 1.169304e+18 23.979614 2.595417e+10 1.000000
1 201 2.010000e-07 1.140505e+18 23.860313 2.303537e+10 1.000000
2 202 2.020000e-07 1.112552e+18 23.742192 2.046898e+10 1.000000
3 203 2.030000e-07 1.085418e+18 23.625236 1.820969e+10 1.000000
4 204 2.040000e-07 1.059074e+18 23.509426 1.621836e+10 1.000000
.. ... ... ... ... ... ...
896 1096 1.096000e-06 2.366053e+14 4.375842 7.850652e+01 1.012738
897 1097 1.097000e-06 2.355289e+14 4.371853 7.819001e+01 1.012789
898 1098 1.098000e-06 2.344583e+14 4.367871 7.787533e+01 1.012841
899 1099 1.099000e-06 2.333935e+14 4.363897 7.756247e+01 1.012893
900 1100 1.100000e-06 2.323346e+14 4.359930 7.725142e+01 1.012945
[901 rows x 6 columns]
From the documentation:
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.
You cannot use bb_df.first to mean the same thing as bb_df['first'], because there is already a .first method on the DataFrame class. That's why the error message tells you that bb_df.first is a method: it is one, pre-defined by Pandas. The 'first' column of your DataFrame contains floating-point values the entire time, and no attempt to convert already-floating-point values to float is relevant, because the problem is not with the column. The problem is that the expression bb_df.first does not access the column.
Just use indexing consistently (bb_df['first']). Attribute access is a not-quite-convenience that runs into this problem whenever a column name collides with a DataFrame method.
Instead of typing bb_df.columnname, change it to bb_df['columnname']. It fixes the issue!
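For instance, a minimal sketch of the failing line rewritten with indexing (same frame and column names as above):
bb_df['final'] = bb_df['first'] * bb_df['second']  # indexing is never shadowed by DataFrame methods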

How to use one dataframe's index to reindex another one in pandas

I am so sorry that I truly don't know what title I should use. But here is my question:
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database. What I want to do is to obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of different stocks). In other words, I am trying to calculate the correlation of the corresponding rows of the two DataFrames. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
My attempt is to create a dataframe and run a loop, assigning the results to that dataframe. But there is a problem: the index of the created dataframe is not what I want. When I tried to append the correlation column, the bug occurred. (Please ignore the values of the correlations, which I concocted here just to give an example.)
r = pd.DataFrame(index = range(6), columns = ['c'])
for i in range(6):
    r.iloc[i-1,:] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open,Stocks_Volume], axis = 1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1,8] = r.iloc[i-1,:]
r c
1 0.654
2 -0.454
3 0.3321
4 0.2166
5 -0.8772
6 0.3256
The bug occurred.
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe's index is an integer and not the stock code, but I don't know how to fix it. Can anyone help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
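Putting it together, a minimal sketch assuming Stocks_Open and Stocks_Volume share the same row order and the same column labels (d-1 ... d-4):
import pandas as pd

r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i, 0] = Stocks_Open.iloc[i].corr(Stocks_Volume.iloc[i])

r.index = Stocks_Open.index  # give r the stock codes as its index
Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']  # the indexes now line up

# The row-wise correlations can also be computed in one call:
# Correlation_in_4days['corr'] = Stocks_Open.corrwith(Stocks_Volume, axis=1)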

Parsing Data, Excel to Python

So I have an Excel (.csv) data file that looks like:
Frequency Frequency error
0.00575678 17
0.315 2
0.003536329 13
0.00481 1
0.004040379 4
where the second column is the error in the first data column, e.g. the value of the first entry is 0.00575678 +/- 0.0000000017 and the second is 0.315 +/- 0.002. So, using Python, is there a way to parse the data so that I get two data arrays, the 1st being frequency and the 2nd the frequency error, where the first entry of the 2nd array is in the format 0.0000000017? If this were a small data file I'd do it manually, but it has a few thousand entries, so that's not really an option. Thanks
Maybe not the fastest, but looks close.
sample = """\
0.00575678,17
0.315,2
0.003536329,13
0.00481,1
0.004040379,4"""
for line in sample.splitlines():
    value, errordigits = line.split(',')
    error = ''.join(c if c in '0.' else '0' for c in value)[:-1]
    error += errordigits
    print "%s,%s" % (value, error)
prints:
0.00575678,0.000000017
0.315,0.002
0.003536329,0.0000000013
0.00481,0.00001
0.004040379,0.000000004
I found pandas useful to get data from a csv:
import pandas

df = pandas.read_csv("YOURFILE.csv")
dfw = pandas.DataFrame(data=df, columns=['COLUMNNAME1', 'COLUMNNAME2'])  # keep only the two columns of interest
y = df.COLUMNNAME1.values
x = df.COLUMNNAME2.values
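A rough sketch combining the two answers: read the file with pandas, then scale each error following the same convention as the loop in the first answer (0.315 with error digits 2 becomes 0.002). The file name, column names, delimiter, and the single header row are assumptions here:
import pandas as pd

df = pd.read_csv("YOURFILE.csv", names=["freq", "freq_err"], skiprows=1)

def scale_error(value, digits):
    decimals = len(str(value).split(".")[1])  # decimal places in the value
    return int(digits) * 10.0 ** -(decimals - 1 + len(str(int(digits))))

frequency = df["freq"].values
frequency_error = [scale_error(v, e) for v, e in zip(df["freq"], df["freq_err"])]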

Compute values from sequential pandas rows

I'm a python novice trying to preprocess timeseries data so that I can compute some changes as an object moves over a series of nodes and edges so that I can count stops, aggregate them into routes, and understand behavior over the route. Data originally comes in the form of two CSV files (entrance, Typedoc = 0 and clearance, Typedoc = 1, each about 85k rows / 19MB) that I merged into 1 file and performed some dimensionality reduction. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
Typedoc Port NRT GRT Draft
Vessname ECDate
400 L 2012-01-19 0 2394 2328 7762 4.166667
2012-07-22 1 2394 2328 7762 17.000000
2012-10-29 0 2395 2328 7762 6.000000
A 397 2012-05-27 1 3315 2928 2928 18.833333
2012-06-01 0 3315 2928 2928 5.250000
I'm interested in understanding the changes for each level as it traverses through its timeseries. I'm going to represent this as a graph eventually. I think I'd really like this data in dictionary form where each entry for a unique Vessname is essentially a tokenized string of stops along the route:
stops_dict = {'400 L': [
    ['2012-01-19', 0, 2394, 4.166667],
    ['2012-07-22', 1, 2394, 17.000000],
    ['2012-10-29', 0, 2395, 6.000000]
    ]
}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
If i = 0, then the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not mastered getting pandas to want to parse my dates). I'm then going to extract the links as some form of:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
The t_transit and d_draft values serve as edge weights. The nodes are a list of unique Port values that get assigned the '400 L': [t_dwell, NRT, GRT] k,v pairs (somehow). I haven't figured that out exactly, but I don't think I need help with that process.
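(For reference, a sketch of those consecutive differences computed directly on the frame, assuming ECDate has been parsed to datetimes and the index levels are Vessname and ECDate as shown; the Typedoc value then distinguishes dwell from transit for each difference:)
flat = movements.reset_index().sort_values(['Vessname', 'ECDate'])
flat['t_between'] = flat.groupby('Vessname')['ECDate'].diff()  # dwell or transit, depending on Typedoc
flat['d_draft'] = flat.groupby('Vessname')['Draft'].diff()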
I couldn't figure out a simpler way, so I've tried defining a function that required starting over by writing my sorted dataframe out and reading it back in using:
with open(filename, 'rb') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)
    <FLOW CONTROL> #based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}

def createStopsDict(row):
    #this reads each row in a csv file,
    #creates a dict entry from row[0]: Vessname if not in dict
    #or appends things after row[0] to the dict entry if Vessname in dict
    ves = row[0]
    if ves in stops_dict:
        stops_dict[ves].append(row[1:])
    else:
        stops_dict[ves] = [row[1:]]
    return
This is an inefficient way of doing things...
I could possibly be using iterrows instead of a csv reader...
I've looked into melt and unstack and I don't think those are correct...
This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
Is there a simpler, dare I say 'elegant', way to map the dataframe rows, based on the multi-index value, directly into a reusable data structure (right now the dictionary stops_dict)?
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2:
I think I have this mostly figured out...
Beginning with my original data frame movements:
movements.reset_index().apply(
    lambda x: makeRoute(x.Vessname,
                        [x.ECDate,
                         x.Typedoc,
                         x.Port,
                         x.NRT,
                         x.GRT,
                         x.Draft]),
    axis=1
)
where:
routemap = {}

def makeRoute(Vessname, info):
    if Vessname in routemap:
        route = routemap[Vessname]
        route.append(info)
    else:
        routemap[Vessname] = [info]
    return
returns a dictionary keyed to Vessname in the structure I need to compute things by calling list elements.
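A minimal groupby sketch of the same idea, under the same assumptions (index levels Vessname and ECDate), that avoids relying on apply's side effects:
routemap = (
    movements.reset_index()
             .groupby('Vessname')[['ECDate', 'Typedoc', 'Port', 'NRT', 'GRT', 'Draft']]
             .apply(lambda g: g.values.tolist())
             .to_dict()
)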

How to force pandas.io.parsers to set column-specific types

today I am struggling with an interesting warning:
parsers.py:1139: DtypeWarning: Columns (1,4) have mixed types. Specify dtype option on import or set low_memory=False.
Let's start from the beginning, I have several files with thousands of lines each, the content of each file looks like this:
##ID ChrA StartA EndA ChrB StartB EndB CnvType Orientation GeneA StrandA LastExonA TotalExonsA PhaseA GeneB StrandB LastExonB TotalExonsB PhaseB InFrame InPhase
nsv871164 1 8373207 8373207 1 8436802 8436802 DELETION HT ? ? ? ? ? RERE - 14 24 0 Not in Frame
dgv1n68 1 16765770 16765770 1 16936692 16936692 DELETION HT ? ? ? ? ? NBPF1 - 2 29 -1 Not in Frame
nsv9213 1 16777016 16777016 1 16779533 16779533 DELETION HT NECAP2 + 6 8 0 NECAP2 + 6 8 1 In Frame Not in Phase
.....
nsv510572 Y 16898737 16898737 Y 16904738 16904738 DELETION HT NLGN4Y + 4 6 1 NLGN4Y + 3 6 1 In Frame In Phase
nsv10042 Y 59192042 59192042 Y 59196197 59196197 DELETION HT ? ? ? ? ? ? ? ? ? ? ?
column[1] and column[4] refer to human chromosomes and are supposed to be 1 to 22, then X and Y.
Some files are short (2k lines) some are very long (200k lines).
If I make a pandas.DataFrame out of a short file, there is no problem: the parser correctly recognizes the items in column[1] and column[4] as 'string'.
But if the file is long enough, the parser assigns 'int' up to a certain point and then 'string' as soon as it encounters 'X' or 'Y'.
At that point I get the warning.
I think that is happening because the parser loads in memory a limited number of rows, then checks the best type to assign considering all the values of a column and then it goes on parsing the rest of the file.
Now, if all the rows can be parsed at once, then there are no mistakes, the parser recognizes all the values at once [1,2,3,4...,'X','Y'] and assign the best type (in this case 'str').
If the number of rows is too big, then the file is parsed in pieces and in my case the first piece contains only [1,2,3,4] and the parser assigns 'int'.
This, of course, is messing up my pipeline...
How can I force the parser to assign the type 'str' ONLY to column[1] and column[4]?
This is the code I use to make Dataframes out of my files:
dataset = pandas.io.parsers.read_table(my_file, sep='\t', index_col=0)
You can set the dtypes of the columns as a param to read_csv, so if you know the columns, just pass a dict with column names as the keys and dtypes as the values, for example:
dataset = pandas.io.parsers.read_table(my_file, sep='\t', index_col=0, dtype={'ChrA':'str'})
Just keep adding additional column names to the dict.
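A minimal sketch with both chromosome columns from the header in the question forced to str:
import pandas

dataset = pandas.io.parsers.read_table(my_file, sep='\t', index_col=0,
                                       dtype={'ChrA': str, 'ChrB': str})
print(dataset[['ChrA', 'ChrB']].dtypes)  # both should now report object (str)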
