Missing values in .csv after writing a pandas DataFrame - python

I have a .csv file:
20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
I am normalizing it with a pandas DataFrame, but I get missing values in the output .csv file:
.703280701968,0.867283950617,,,,0.0971635485818,-0.132770066385,,0.318518516666,-inf,-0.742913580247,-0.74703196347,-0.779350940252,-0.659592176966,-0.483438485804,0.565758716954,,,-inf,-0.274046377081,0.705774765311,-0.281481481478,-0.596841230258,,,1
0.104027493068,-0.0493827160494,,,,0.0199155099578,-0.0175015087508,,0.318518516666,-inf,-0.401580246914,-0.392694063927,-0.331530968381,-0.401165210674,-0.337539432177,0.426956186355,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.2494467914,0.254878294116,,0.318518516666,-inf,-0.0620246913541,-0.0547945205479,0.00470906912955,0.0370370365169,-0.183753943218,0.0159880797389,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.281231140616,0.286662643331,,0.318518516666,-inf,-0.0229135802474,-0.0164383561644,0.0392144605923,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.566083283042,0.571514785757,,0.318518516666,-inf,0.201086419753,0.199086757991,0.184362139917,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
My code:
import pandas as pd
df = pd.read_csv('pooja.csv',index_col=False)
df_norm = (df.ix[:, 1:-1] - df.ix[:, 1:-1].mean()) / (df.ix[:, 1:-1].max() - df.ix[:, 1:-1].min())
rslt = pd.concat([df_norm, df.ix[:,-1]], axis=1)
rslt.to_csv('example.csv',index=False,header=False)
What's wrong with the code? Why are values missing in the .csv file?

You get many NaN values because you divide 0 by 0. See the broadcasting behaviour; a better explanation is here.
I use the code from your previous question, because I think the slicing with df.ix[:, 1:-1] is not necessary. After normalizing with that slicing I get an empty DataFrame.
import pandas as pd
import numpy as np
import io
temp=u"""20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),index_col=None, header=None)
#print df
#filter only first 5 columns for testing
df = df.iloc[:, :5]
print df
0 1 2 3 4
0 20376.650 22398.290 4.80 0 1
1 7048.842 8421.754 1.44 0 1
2 3716.890 4927.620 0.12 0 1
3 3716.890 4927.620 0.00 0 1
4 3716.890 4927.620 0.00 0 1
#get max values by columns
print df.max()
0 20376.65
1 22398.29
2 4.80
3 0.00
4 1.00
dtype: float64
#get min values by columns
print df.min()
0 3716.89
1 4927.62
2 0.00
3 0.00
4 1.00
dtype: float64
#difference - columns 3 and 4 are 0
print (df.max() - df.min())
0 16659.76
1 17470.67
2 4.80
3 0.00
4 0.00
dtype: float64
print df - df.mean()
0 1 2 3 4
0 12661.4176 13277.7092 3.528 0 0
1 -666.3904 -698.8268 0.168 0 0
2 -3998.3424 -4192.9608 -1.152 0 0
3 -3998.3424 -4192.9608 -1.272 0 0
4 -3998.3424 -4192.9608 -1.272 0 0
#you get NaN, because columns 3 and 4 (all 0 after subtracting the mean) are divided by a difference of 0
df_norm = (df - df.mean()) / (df.max() - df.min())
print df_norm
0 1 2 3 4
0 0.76 0.76 0.735 NaN NaN
1 -0.04 -0.04 0.035 NaN NaN
2 -0.24 -0.24 -0.240 NaN NaN
3 -0.24 -0.24 -0.265 NaN NaN
4 -0.24 -0.24 -0.265 NaN NaN
Finally, when you write with to_csv, each NaN becomes "", because the parameter na_rep has the default value "":
print df_norm.to_csv(index=False, header=False, na_rep="")
0.76,0.76,0.735,,
-0.04,-0.04,0.035,,
-0.24,-0.24,-0.24,,
-0.24,-0.24,-0.265,,
-0.24,-0.24,-0.265,,
If you change the value of na_rep:
#change na_rep to * for testing
print df_norm.to_csv(index=False, header=False, na_rep="*")
0.76,0.76,0.735,*,*
-0.04,-0.04,0.035,*,*
-0.24,-0.24,-0.24,*,*
-0.24,-0.24,-0.265,*,*
-0.24,-0.24,-0.265,*,*
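If you do not want empty fields in the output at all, one option is to clean up the result before writing it. A minimal sketch, assuming you simply want the undefined values (the NaN from 0/0 and the +/-inf from dividing a non-zero numerator by 0) to become 0:
df_norm = (df - df.mean()) / (df.max() - df.min())
# NaN comes from 0/0, +/-inf from nonzero/0; turn both into 0 before writing (assumption: 0 is an acceptable placeholder)
df_norm = df_norm.replace([np.inf, -np.inf], np.nan).fillna(0)
df_norm.to_csv('example.csv', index=False, header=False)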

Related

How can I join conditions of one line with samples of another in pandas/python?

There is a piece of software that exports a table like the following example:
import pandas as pd
s0 = ',,,,Cond1,,,,Cond2,,'.split(',')
s1 = 'Gene name,Description,Anova,FoldChange,Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7'.split(',')
s2 = 'HK1,Hexokinase,0.05,1.00,1.5,1.0,0.5,1.0,0,0,0'.split(',')
df= pd.DataFrame( data= (s0, s1, s2))
0 1 2 3 4 5 6 \
0 Cond1
1 Gene name Description Anova FoldChange Sample1 Sample2 Sample3
2 HK1 Hexokinase 0.05 1.00 1.5 1.0 0.5
7 8 9 10
0 Cond2
1 Sample4 Sample5 Sample6 Sample7
2 1.0 0 0 0
However, the organization of this table is not straightforward and, therefore, it is hard to analyze the conditions.
I would like to produce a data frame in which each condition is matched with its respective sample.
It should be something like the output of the following code:
import pandas as pd
s1 = 'Gene name,Description,Anova,FoldChange,Sample1.Cond1,Sample2.Cond1,Sample3.Cond1,Sample4.Cond1,Sample5.Cond2,Sample6.Cond2,Sample7.Cond2'.split(',')
s2 = 'HK1,Hexokinase,0.05,1.00,1.5,1.0,0.5,1.0,0,0,0'.split(',')
df= pd.DataFrame(data= (s1, s2))
0 1 2 3 4 5 \
0 Gene name Description Anova FoldChange Sample1.Cond1 Sample2.Cond1
1 HK1 Hexokinase 0.05 1.00 1.5 1.0
6 7 8 9 10
0 Sample3.Cond1 Sample4.Cond1 Sample5.Cond2 Sample6.Cond2 Sample7.Cond2
1 0.5 1.0 0 0 0
Insert NaN values in row 0 with Series.where, then fill them with Series.ffill.
Finally, you can use Series.str.cat to join both rows:
df.iloc[0] = df.iloc[1].str.cat(df.iloc[0]
                                .where(df.iloc[0].notnull() & df.iloc[0].ne(''))
                                .ffill(), '.').fillna(df.iloc[1])
df=df.drop(1).reset_index(drop=True)
print(df)
Output:
0 1 2 3 4 5 \
0 Gene name Description Anova FoldChange Sample1.Cond1 Sample2.Cond1
1 HK1 Hexokinase 0.05 1.00 1.5 1.0
6 7 8 9 10
0 Sample3.Cond1 Sample4.Cond1 Sample5.Cond2 Sample6.Cond2 Sample7.Cond2
1 0.5 1.0 0 0 0
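For readability, the same steps can also be written out one at a time instead of as a single chained expression; a sketch, assuming a fresh df built as in the question:
import numpy as np

cond = df.iloc[0].replace('', np.nan).ffill()   # propagate Cond1/Cond2 to the samples on their right
names = df.iloc[1]                              # Gene name, Description, ..., Sample7
df.iloc[0] = names.str.cat(cond, sep='.').fillna(names)  # Sample1.Cond1, ...; keep the plain name where cond is NaN
df = df.drop(1).reset_index(drop=True)
print(df)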

Pandas - split columns and include counts

I have the following dataframe:
doc_id is_fulltext
1243 dok:1 1
3310 dok:1 1
4370 dok:1 1
14403 dok:1020 1
17252 dok:1020 1
15977 dok:1020 0
16480 dok:1020 1
16252 dok:1020 1
468 dok:103 1
128 dok:1030 0
1673 dok:1038 1
I would like to split the is_fulltext column into two columns and count the occurrences of the docs at the same time.
Desired Output:
doc_id fulltext non-fulltext
0 dok:1 3 0
1 dok:1020 4 1
2 dok:103 1 0
3 dok:1030 0 1
4 dok:1038 1 0
I followed the procedure of Pandas - Create columns from column value, and fill with count
That post shows several alternatives, suggesting Categorical or reindex. I tried the following:
cats = ['fulltext', 'non_fulltext']
df_sorted['is_fulltext'] = pd.Categorical(df_sorted['is_fulltext'], categories=cats)
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0)
Here I get a ValueError:
ValueError: Length of passed values is 17446, index implies 0
Then I tried this method
cats = ['fulltext', 'non_fulltext']
new_df = df_sorted.groupby(['doc_id','is_fulltext']).size().unstack(fill_value=0).reindex(columns=cats).reset_index()
While this seems to have worked fine in the original post, my counts are filled with NaNs (see below). I have read that this happens when combining reindex with Categorical, but I wonder why it seems to have worked in the original post. How can I solve this? Can anyone help? Thank you!
doc_id fulltext non-fulltext
0 dok:1 NaN NaN
1 dok:1020 NaN NaN
2 dok:103 NaN NaN
3 dok:1030 NaN NaN
4 dok:1038 NaN NaN
You could GroupBy the doc_id, apply pd.value_counts to each group and unstack:
(df.groupby('doc_id').is_fulltext.apply(pd.value_counts)
.unstack()
.fillna(0)
.rename(columns={0:'non-fulltext', 1:'fulltext'})
.reset_index())
doc_id non-fulltext fulltext
0 dok:1 0.0 3.0
1 dok:1020 1.0 4.0
2 dok:103 0.0 1.0
3 dok:1030 1.0 0.0
4 dok:1038 0.0 1.0
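Note that fillna(0) leaves the counts as floats (0.0, 3.0, ...). If you want integers as in the desired output, you can append an astype(int); a small variation on the snippet above:
(df.groupby('doc_id').is_fulltext.apply(pd.value_counts)
   .unstack()
   .fillna(0)
   .astype(int)                      # float counts -> int
   .rename(columns={0:'non-fulltext', 1:'fulltext'})
   .reset_index())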
Or, similarly to your own method, if performance is an issue, do this instead:
(df.groupby(['doc_id','is_fulltext']).size()
   .unstack(fill_value=0)
   .rename(columns={0:'non_fulltext', 1:'fulltext'})
   .reset_index())
is_fulltext doc_id non_fulltext fulltext
0 dok:1 0 3
1 dok:1020 1 4
2 dok:103 0 1
3 dok:1030 1 0
4 dok:1038 0 1
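Another option worth a look is pd.crosstab, which does the grouping and counting in one call; a sketch, assuming the same df and that is_fulltext only holds 0 and 1:
out = (pd.crosstab(df['doc_id'], df['is_fulltext'])
         .rename(columns={0: 'non_fulltext', 1: 'fulltext'})
         .reset_index())
print(out)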
I don't know if it's the best approach, but this should work for you:
import pandas as pd
df = pd.DataFrame({"doc_id": ["id1", "id2", "id1", "id2"],
                   "is_fulltext": [1, 0, 1, 1]})
df_grouped = df.groupby("doc_id").sum().reset_index()
df_grouped["non_fulltext"] = df.groupby("doc_id").count().reset_index()["is_fulltext"] - df_grouped["is_fulltext"]
df_grouped
And the output is:
doc_id is_fulltext non_fulltext
0 id1 2 0
1 id2 1 1

Pandas: How to maintain the type of columns with nan?

For example, I have a df with NaN values and use the following method to fill them:
import pandas as pd
a = [[2.0, 10, 4.2], ['b', 70, 0.03], ['x', ]]
df = pd.DataFrame(a)
print(df)
df.fillna(int(0),inplace=True)
print('fillna df\n',df)
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ['col', 'type']
print(dtype_df)
OUTPUT:
0 1 2
0 2 10.0 4.20
1 b 70.0 0.03
2 x NaN NaN
fillna df
0 1 2
0 2 10.0 4.20
1 b 70.0 0.03
2 x 0.0 0.00
col type
0 0 object
1 1 float64
2 2 float64
Actually, I want column 1 to keep the type int instead of float.
My desired output:
fillna df
0 1 2
0 2 10 4.20
1 b 70 0.03
2 x 0 0.00
col type
0 0 object
1 1 int64
2 2 float64
So how can I do that?
Try adding downcast='infer' to downcast any eligible columns:
df.fillna(0, downcast='infer')
0 1 2
0 2 10 4.20
1 b 70 0.03
2 x 0 0.00
And the corresponding dtypes are
0 object
1 int64
2 float64
dtype: object
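If downcast='infer' is not available or you want to be explicit, you can also cast the column yourself after filling; a sketch, assuming column 1 should end up as int64:
df[1] = df[1].fillna(0).astype('int64')
print(df.dtypes)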

pandas count over multiple columns

I have a dataframe that looks like this:
Measure1 Measure2 Measure3 ...
0 1 3
1 3 2
3 0
I'd like to count the occurrences of the values over the columns to produce:
Measure Count Percentage
0 2 0.25
1 2 0.25
2 1 0.125
3 3 0.375
With
outcome_measure_count = cdss_data.groupby(key_columns=['Measure1'],operations={'count': agg.COUNT()}).sort('count', ascending=True)
I only get the first column (I'm actually using the graphlab package, but I'd prefer pandas).
Could someone help me?
You can generate the counts by flattening the df using ravel and value_counts; from this you can construct the final df:
In [230]:
import io
import pandas as pd
t="""Measure1 Measure2 Measure3
0 1 3
1 3 2
3 0 0"""
df = pd.read_csv(io.StringIO(t), sep='\s+')
df
Out[230]:
Measure1 Measure2 Measure3
0 0 1 3
1 1 3 2
2 3 0 0
In [240]:
count = pd.Series(df.squeeze().values.ravel()).value_counts()
pd.DataFrame({'Measure': count.index, 'Count':count.values, 'Percentage':(count/count.sum()).values})
Out[240]:
Count Measure Percentage
0 3 3 0.333333
1 3 0 0.333333
2 2 1 0.222222
3 1 2 0.111111
I inserted a 0 just to make the df shape correct, but you should get the point.
In [68]: df=DataFrame({'m1':[0,1,3], 'm2':[1,3,0], 'm3':[3,2, np.nan]})
In [69]: df
Out[69]:
m1 m2 m3
0 0 1 3.0
1 1 3 2.0
2 3 0 NaN
In [70]: df=df.apply(Series.value_counts).sum(1).to_frame(name='Count')
In [71]: df
Out[71]:
Count
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [72]: df.index.name='Measure'
In [73]: df
Out[73]:
Count
Measure
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [74]: df['Percentage']=df.Count.div(df.Count.sum())
In [75]: df
Out[75]:
Count Percentage
Measure
0.0 2.0 0.250
1.0 2.0 0.250
2.0 1.0 0.125
3.0 3.0 0.375
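A third way to build the same table is to stack the measure columns into one long Series and count that (stack drops the NaN automatically); a sketch, assuming the original frame from In [68]:
counts = df.stack().value_counts().sort_index().to_frame('Count')
counts.index.name = 'Measure'
counts['Percentage'] = counts['Count'] / counts['Count'].sum()
print(counts)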

Slicing based on dates Pandas Dataframe

I have a large dataframe with dates, store number, units sold, and rain precipitation totals. It looks like this...
date store_nbr units preciptotal
2014-10-11 1 0 0.00
2014-10-12 1 0 0.01
2014-10-13 1 2 0.00
2014-10-14 1 1 2.13
2014-10-15 1 0 0.00
2014-10-16 1 0 0.87
2014-10-17 1 3 0.01
2014-10-18 1 0 0.40
I want to select a three-day window around any date that has a precipitation total greater than 1. For this small example, I would want to get back the first 7 rows: the three days before 2014-10-14, the three days after 2014-10-14, and 2014-10-14 itself, because it has a preciptotal greater than 1.
Here are two ways you could build the selection mask without looping over the index values:
You could find the rows where preciptotal is greater than 1:
mask = (df['preciptotal'] > 1)
and then use scipy.ndimage.binary_dilation to expand the mask to a 7-day window:
import scipy.ndimage as ndimage
import numpy as np
import pandas as pd

df = pd.read_table('data', sep='\s+')
mask = (df['preciptotal'] > 1)
mask = ndimage.binary_dilation(mask, iterations=3)
df.loc[mask]
yields
date store_nbr units preciptotal
0 2014-10-11 1 0 0.00
1 2014-10-12 1 0 0.01
2 2014-10-13 1 2 0.00
3 2014-10-14 1 1 2.13
4 2014-10-15 1 0 0.00
5 2014-10-16 1 0 0.87
6 2014-10-17 1 3 0.01
Or, using NumPy (but without the scipy dependency), you could use mask.shift with np.logical_and.reduce:
mask = (df['preciptotal'] > 1)
mask = ~np.logical_and.reduce([(~mask).shift(i) for i in range(-3, 4)]).astype(bool)
# array([ True, True, True, True, True, True, True, False], dtype=bool)
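A pandas-only variant of the same dilation idea (no scipy, no logical_and.reduce) is a centered rolling maximum over the mask; a sketch, assuming the 7-row window used above:
mask = (df['preciptotal'] > 1).astype(int)          # bool -> int so rolling can aggregate it
mask = mask.rolling(window=7, center=True, min_periods=1).max().astype(bool)
df.loc[mask]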
For a specific value you can do this:
In [84]:
idx = df[df['preciptotal'] > 1].index[0]
df.iloc[idx-3: idx+4]
Out[84]:
date store_nbr units preciptotal
0 2014-10-11 1 0 0.00
1 2014-10-12 1 0 0.01
2 2014-10-13 1 2 0.00
3 2014-10-14 1 1 2.13
4 2014-10-15 1 0 0.00
5 2014-10-16 1 0 0.87
6 2014-10-17 1 3 0.01
For the more general case, you can get an array of indices where the condition is met:
idx_vals = df[df['preciptotal'] > 1].index
then you can generate slices or iterate over the array values:
for idx in idx_vals:
    df.iloc[idx-3: idx+4]
This assumes your index is a 0-based int64 index, which your sample is.
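If you would rather get one frame containing all the windows instead of iterating, you can collect the row positions first; a minimal sketch, again assuming a 0-based integer index:
import numpy as np

idx_vals = df[df['preciptotal'] > 1].index
rows = np.unique(np.concatenate([np.arange(i - 3, i + 4) for i in idx_vals]))
rows = rows[(rows >= 0) & (rows < len(df))]   # drop positions that fall outside the frame
df.iloc[rows]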
