Reading a text file into a pandas dataframe failing - python

I have the following input file:
1988 1 1 7.88-15.57-25.00 0.00 0.81 4.02
1988 1 2 6.50-10.37-24.87 0.00 0.49 4.30
1988 1 3 6.48 -8.79-21.28 0.00 0.62 3.91
and I read it as follows:
df = pandas.read_csv(inp_file, header=None, sep=' ')
However, because some columns are not separated by spaces, they are not read correctly. Is there a way to specify individual column widths?

OK, read_fwf works. I thought your 3rd line was malformed but it looks pukka:
In [9]:
t="""1988 1 1 7.88-15.57-25.00 0.00 0.81 4.02
1988 1 2 6.50-10.37-24.87 0.00 0.49 4.30
1988 1 3 6.48 -8.79-21.28 0.00 0.62 3.91"""
pd.read_fwf(io.StringIO(t),header=None)
Out[9]:
0 1 2 3 4 5 6
0 1988 1 1 7.88-15.57-25.00 0 0.81 4.02
1 1988 1 2 6.50-10.37-24.87 0 0.49 4.30
2 1988 1 3 6.48 -8.79-21.28 0 0.62 3.91
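If read_fwf's automatic column inference doesn't split the fused negative numbers, explicit widths can be passed instead. A minimal sketch, with widths measured from the three sample lines above (they are an assumption about the real file and may need adjusting):

```python
import io
import pandas as pd

t = """1988 1 1 7.88-15.57-25.00 0.00 0.81 4.02
1988 1 2 6.50-10.37-24.87 0.00 0.49 4.30
1988 1 3 6.48 -8.79-21.28 0.00 0.62 3.91"""

# Field widths counted from the sample: year, month, day, then six values.
widths = [4, 2, 2, 5, 6, 6, 5, 5, 5]
df = pd.read_fwf(io.StringIO(t), widths=widths, header=None)
print(df)
```

With explicit widths the run-together negatives split cleanly into their own columns.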

Related

Delete rows from a pandas DataFrame based on a conditional expression, iteration (n) and (n+1)

Given a dataframe as follows:
col1
1 0.6
2 0.88
3 1.2
4 1.2
5 1.2
6 0.55
7 0.55
8 0.65
I want to delete rows from it where the value in row (n+1) is the same as the value in row (n), such that this would yield:
col1
1 0.6
2 0.88
3 1.2
4 row deleted
5 row deleted
6 0.55
7 row deleted
8 0.65
In [191]: df[df["col1"] != df["col1"].shift()]
Out[191]:
col1
1 0.60
2 0.88
3 1.20
6 0.55
8 0.65
OK, let's do:
df = df[df['col1'].diff().ne(0)]
Try this:
df = df[~df['col1'].eq(df['col1'].shift(1))]
print(df)
col1
0 0.60
1 0.88
2 1.20
5 0.55
7 0.65
Or:
df = df[df['col1'].ne(df['col1'].shift(1))]
print(df)
col1
0 0.60
1 0.88
2 1.20
5 0.55
7 0.65
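Putting the shift-based approach into a self-contained sketch: shift() moves each value down one row, so each row is compared against its predecessor; the first comparison involves NaN and is never equal, so the first row always survives:

```python
import pandas as pd

# The question's data, with a 1-based index to match the example.
df = pd.DataFrame({'col1': [0.6, 0.88, 1.2, 1.2, 1.2, 0.55, 0.55, 0.65]},
                  index=range(1, 9))

# Keep a row only when its value differs from the previous row's value.
result = df[df['col1'] != df['col1'].shift()]
print(result)
```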

Merging two pandas data frames with common columns

I have a lower-triangular matrix and its transpose, and I am trying to merge them into one symmetric matrix.
lower triangular:
Data :
0 1 2 3 4
0 1 0 0 0 0
1 0.21 0 0 0 0
2 0.31 0.32 0 0 0
3 0.41 0.42 0.43 0 0
4 0.51 0.52 0.53 0.54 0
transpose triangular:
Data :
0 1 2 3 4
0 1 0.21 0.31 0.41 0.51
1 0 0 0.32 0.42 0.52
2 0 0 0 0.43 0.53
3 0 0 0 0 0.54
4 0 0 0 0 0
Merged matrix:
Data :
0 1 2 3 4
0 1 0.21 0.31 0.41 0.51
1 0.21 0 0.32 0.42 0.52
2 0.31 0.32 0 0.43 0.53
3 0.41 0.42 0.43 0 0.54
4 0.51 0.52 0.53 0.54 0
I tried using pd.merge but I couldn't get it to work
Let's use combine_first after mask:
df.mask(df==0).T.combine_first(df).fillna(0)
Out[1202]:
0 1 2 3 4
0 1.00 0.21 0.31 0.41 0.51
1 0.21 0.00 0.32 0.42 0.52
2 0.31 0.32 0.00 0.43 0.53
3 0.41 0.42 0.43 0.00 0.54
4 0.51 0.52 0.53 0.54 0.00
How about just adding the two dataframes?
df3 = df1.add(df2, fill_value=0)
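The addition approach works because the nonzero entries of the two triangles never overlap, so element-wise addition reconstructs the symmetric matrix. A minimal runnable sketch with a small made-up 3x3 triangle (not the exact data from the question):

```python
import pandas as pd

# Lower-triangular frame with zeros on and above the diagonal.
df1 = pd.DataFrame([[0.00, 0.00, 0.00],
                    [0.21, 0.00, 0.00],
                    [0.31, 0.32, 0.00]])

df2 = df1.T  # upper-triangular transpose

# fill_value=0 also guards against non-overlapping labels;
# here the shapes match, so it is plain element-wise addition.
merged = df1.add(df2, fill_value=0)
print(merged)
```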

Rearranging columns after groupby in pandas

I created a DataFrame like this:
df_example= pd.DataFrame({ 'A': [1,1,6,6,6,3,4,4],
'val_A': [3,4,1,1,2,1,1,1],
'val_B': [4,5,2,2,3,2,2,2],
'val_A_frac':[0.25,0.25,0.3,0.7,0.2,0.1,0.4,0.5],
'val_B_frac':[0.75,0.65,0,0.3,np.NaN,np.NaN,np.NaN,np.NaN]
}, columns= ['A','val_A','val_B','val_A_frac','val_B_frac'])
Then I ran a groupby on A, val_A and val_B to sum the frac columns:
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'})
I got this df:
sum_df_ex
Out[67]:
val_A_frac val_B_frac
A val_A val_B
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
Groupby operations resulted in two columns:
sum_df_ex.columns
Out[68]: Index(['val_A_frac', 'val_B_frac'], dtype='object')
I want the result of the groupby to be a df containing all the columns displayed above, i.e. like this:
Out[67]:
A val_A val_B val_A_frac val_B_frac
1 3 4 0.25 0.75
4 5 0.25 0.65
3 1 2 0.10 0.00
4 1 2 0.90 0.00
6 1 2 1.00 0.30
2 3 0.20 0.00
How to do this?
Use reset_index():
sum_df_ex = df_example.groupby(['A','val_A','val_B']).agg({'val_A_frac':'sum', 'val_B_frac':'sum'}).reset_index()
Output:
A val_A val_B val_B_frac val_A_frac
0 1 3 4 0.75 0.25
1 1 4 5 0.65 0.25
2 3 1 2 NaN 0.10
3 4 1 2 NaN 0.90
4 6 1 2 0.30 1.00
5 6 2 3 NaN 0.20
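Alternatively, groupby accepts as_index=False, which keeps the grouping keys as ordinary columns and makes the separate reset_index() step unnecessary. A sketch using the question's data:

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'A': [1, 1, 6, 6, 6, 3, 4, 4],
                           'val_A': [3, 4, 1, 1, 2, 1, 1, 1],
                           'val_B': [4, 5, 2, 2, 3, 2, 2, 2],
                           'val_A_frac': [0.25, 0.25, 0.3, 0.7, 0.2, 0.1, 0.4, 0.5],
                           'val_B_frac': [0.75, 0.65, 0, 0.3,
                                          np.nan, np.nan, np.nan, np.nan]})

# as_index=False leaves A, val_A and val_B as regular columns
# instead of turning them into a MultiIndex.
sum_df = df_example.groupby(['A', 'val_A', 'val_B'], as_index=False).agg(
    {'val_A_frac': 'sum', 'val_B_frac': 'sum'})
print(sum_df)
```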

Data split over 2 rows for each row entry - read in with pandas

I'm dealing with a dataset where each 'entry' is split over a varying number of rows,
i.e.
yyyymmdd hhmmss lat lon name nprt depth ubas udir cabs cdir
hs tp lp theta sp wf
20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9 0.00 0.0
0 0.10 1.48 3.40 183.19 30.16 0.89
1 0.10 1.48 3.40 183.21 29.66 0.90
20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7 0.00 0.0
0 0.63 4.24 28.02 105.05 32.71 0.85
1 0.60 4.21 27.68 110.42 27.04 0.95
2 0.20 5.78 52.18 43.73 17.98 0.01
3 0.06 6.55 66.86 176.86 11.04 0.10
20140701 000000 -74.500 258.000 'grid_point' 0 1.0 7.7 137.0 0.00 0.0
0 0.00 0.00 0.00 0.00 0.00 0.00
I'm only interested in the rows that begin with a date so the rest can be discarded. However, the number of additional rows varies throughout the data set (see code snippet for an example).
Ideally, I'd like to use pandas read_csv but I'm open to suggestions if that's not possible/there are easier ways.
So my question is how do you read data into a dataframe where the row begins with a date?
Thanks
You can use read_csv first, then cast the concatenation of the first and second columns with to_datetime using errors='coerce', which inserts NaT wherever the values are not dates. Finally, filter the rows with boolean indexing, using a mask created by notnull:
import pandas as pd
from io import StringIO
temp=u"""yyyymmdd hhmmss lat lon name nprt depth ubas udir cabs cdir
hs tp lp theta sp wf
20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9 0.00 0.0
0 0.10 1.48 3.40 183.19 30.16 0.89
1 0.10 1.48 3.40 183.21 29.66 0.90
20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7 0.00 0.0
0 0.63 4.24 28.02 105.05 32.71 0.85
1 0.60 4.21 27.68 110.42 27.04 0.95
2 0.20 5.78 52.18 43.73 17.98 0.01
3 0.06 6.55 66.86 176.86 11.04 0.10
20140701 000000 -74.500 258.000 'grid_point' 0 1.0 7.7 137.0 0.00 0.0
0 0.00 0.00 0.00 0.00 0.00 0.00"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), delim_whitespace=True)
print (pd.to_datetime(df.iloc[:,0] + df.iloc[:,1], errors='coerce', format='%Y%m%d%H%M%S'))
0 NaT
1 2014-07-01
2 NaT
3 NaT
4 2014-07-01
5 NaT
6 NaT
7 NaT
8 NaT
9 2014-07-01
10 NaT
dtype: datetime64[ns]
mask = pd.to_datetime(df.iloc[:,0] + df.iloc[:,1],
                      errors='coerce', format='%Y%m%d%H%M%S').notnull()
print (mask)
0 False
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
dtype: bool
print (df[mask])
yyyymmdd hhmmss lat lon name nprt depth ubas udir \
1 20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9
4 20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7
9 20140701 000000 -74.500 258.000 'grid_point' 0 1.0 7.7 137.0
cabs cdir
1 0.0 0.0
4 0.0 0.0
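An alternative sketch: pre-filter the raw lines with a regular expression so that only rows beginning with an 8-digit date reach the parser. This assumes the date always starts the line; the column names below are taken from the header shown above:

```python
import io
import re
import pandas as pd

raw = """yyyymmdd hhmmss lat lon name nprt depth ubas udir cabs cdir
hs tp lp theta sp wf
20140701 000000 -76.500 208.000 'grid_point' 1 332.2 2.8 201.9 0.00 0.0
0 0.10 1.48 3.40 183.19 30.16 0.89
20140701 000000 -74.500 251.000 'grid_point' 1 1.0 8.4 159.7 0.00 0.0
0 0.63 4.24 28.02 105.05 32.71 0.85
1 0.60 4.21 27.68 110.42 27.04 0.95"""

# Keep only lines that start with an 8-digit date followed by whitespace.
date_line = re.compile(r'^\d{8}\s')
lines = [ln for ln in raw.splitlines() if date_line.match(ln)]

cols = ['yyyymmdd', 'hhmmss', 'lat', 'lon', 'name', 'nprt',
        'depth', 'ubas', 'udir', 'cabs', 'cdir']
df = pd.read_csv(io.StringIO('\n'.join(lines)), sep=r'\s+', names=cols)
print(df)
```

For a real file, the same filter can be applied while reading the file line by line before handing the surviving lines to pandas.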

pandas: why does int64 - float64 column subtraction yield NaNs?

I am confused by the results of pandas subtraction of two columns. When I subtract two float64 and int64 columns it yields several NaN entries. Why is this happening? What could be the cause of this strange behavior?
Final Update: As N.Wouda pointed out, my problem was that the index columns did not match.
Y_predd.reset_index(drop=True,inplace=True)
Y_train_2.reset_index(drop=True,inplace=True)
solved my problem
Update 2: It seems like my index columns don't match, which makes sense because they are both sampled from the same data frame. How can I "start fresh" with new index columns?
Update: Y_predd - Y_train_2.astype('float64') also yields NaN values. I am confused why this did not raise an error. They are the same size. Why could this be yielding NaN?
In [48]: Y_predd.size
Out[48]: 182527
In [49]: Y_train_2.astype('float64').size
Out[49]: 182527
Original documentation of error:
In [38]: Y_train_2
Out[38]:
66419 0
2319 0
114195 0
217532 0
131687 0
144024 0
94055 0
143479 0
143124 0
49910 0
109278 0
215905 1
127311 0
150365 0
117866 0
28702 0
168111 0
64625 0
207180 0
14555 0
179268 0
22021 1
120169 0
218769 0
259754 0
188296 1
63503 1
175104 0
218261 0
35453 0
..
112048 0
97294 0
68569 0
60333 0
184119 1
57632 0
153729 1
155353 0
114979 1
180634 0
42842 0
99979 0
243728 0
203679 0
244381 0
55646 0
35557 0
148977 0
164008 0
53227 1
219863 0
4625 0
155759 0
232463 0
167807 0
123638 0
230463 1
198219 0
128459 1
53911 0
Name: objective_for_classifier, dtype: int64
In [39]: Y_predd
Out[39]:
0 0.00
1 0.48
2 0.04
3 0.00
4 0.48
5 0.58
6 0.00
7 0.00
8 0.02
9 0.06
10 0.22
11 0.32
12 0.12
13 0.26
14 0.18
15 0.18
16 0.28
17 0.30
18 0.52
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 0.64
26 0.30
27 0.76
28 0.10
29 0.42
...
182497 0.60
182498 0.00
182499 0.06
182500 0.12
182501 0.00
182502 0.40
182503 0.70
182504 0.42
182505 0.54
182506 0.24
182507 0.56
182508 0.34
182509 0.10
182510 0.18
182511 0.06
182512 0.12
182513 0.00
182514 0.22
182515 0.08
182516 0.22
182517 0.00
182518 0.42
182519 0.02
182520 0.50
182521 0.00
182522 0.08
182523 0.16
182524 0.00
182525 0.32
182526 0.06
Name: prediction_method_used, dtype: float64
In [40]: Y_predd - Y_train_2
Out[41]:
0 NaN
1 NaN
2 0.04
3 NaN
4 0.48
5 NaN
6 0.00
7 0.00
8 NaN
9 NaN
10 NaN
11 0.32
12 -0.88
13 -0.74
14 0.18
15 NaN
16 NaN
17 NaN
18 NaN
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 NaN
26 0.30
27 NaN
28 0.10
29 0.42
...
260705 NaN
260706 NaN
260709 NaN
260710 NaN
260711 NaN
260713 NaN
260715 NaN
260716 NaN
260718 NaN
260721 NaN
260722 NaN
260723 NaN
260724 NaN
260725 NaN
260726 NaN
260727 NaN
260731 NaN
260735 NaN
260737 NaN
260738 NaN
260739 NaN
260740 NaN
260742 NaN
260743 NaN
260745 NaN
260748 NaN
260749 NaN
260750 NaN
260751 NaN
260752 NaN
dtype: float64
Posting here so we can close the question, from the comments:
Are you sure each dataframe has the same index range?
You can reset the indices on both frames by df.reset_index(drop=True) and then subtract the frames as you were already doing. This process should result in the desired output.
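A minimal sketch of why the NaNs appear: pandas arithmetic aligns on index labels, not positions, so labels present in only one operand produce NaN even when both operands have the same length:

```python
import pandas as pd

# Same values, shifted index labels, as after sampling from one frame.
a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([1.0, 2.0, 3.0], index=[1, 2, 3])

diff = a - b        # aligned on the union of labels 0..3
print(diff)         # NaN at labels 0 and 3, values at 1 and 2

# Resetting both indices makes the subtraction positional.
fixed = a.reset_index(drop=True) - b.reset_index(drop=True)
print(fixed)
```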
