Missing values in .csv after writing a pandas DataFrame - python
I have a .csv file:
20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
I am normalizing it using a pandas DataFrame, but I get missing values in the output .csv file:
.703280701968,0.867283950617,,,,0.0971635485818,-0.132770066385,,0.318518516666,-inf,-0.742913580247,-0.74703196347,-0.779350940252,-0.659592176966,-0.483438485804,0.565758716954,,,-inf,-0.274046377081,0.705774765311,-0.281481481478,-0.596841230258,,,1
0.104027493068,-0.0493827160494,,,,0.0199155099578,-0.0175015087508,,0.318518516666,-inf,-0.401580246914,-0.392694063927,-0.331530968381,-0.401165210674,-0.337539432177,0.426956186355,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.2494467914,0.254878294116,,0.318518516666,-inf,-0.0620246913541,-0.0547945205479,0.00470906912955,0.0370370365169,-0.183753943218,0.0159880797389,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.281231140616,0.286662643331,,0.318518516666,-inf,-0.0229135802474,-0.0164383561644,0.0392144605923,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
0.104027493068,-0.132716049383,,,,-0.566083283042,0.571514785757,,0.318518516666,-inf,0.201086419753,0.199086757991,0.184362139917,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1
My code:
import pandas as pd
df = pd.read_csv('pooja.csv',index_col=False)
df_norm = (df.ix[:, 1:-1] - df.ix[:, 1:-1].mean()) / (df.ix[:, 1:-1].max() - df.ix[:, 1:-1].min())
rslt = pd.concat([df_norm, df.ix[:,-1]], axis=1)
rslt.to_csv('example.csv',index=False,header=False)
What's wrong with the code? Why are values missing in the .csv file?
You get many NaN values because you divide 0 by 0 (and -inf where a non-zero value is divided by 0). See broadcasting behaviour; a better explanation is here.
I use the code from your previous question, because I think the slicing with df.ix[:, 1:-1] is not necessary. After normalizing with that slicing I get an empty DataFrame.
import pandas as pd
import numpy as np
import io
temp=u"""20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),index_col=None, header=None)
#print df
#filter only first 5 columns for testing
df = df.iloc[:, :5]
print df
0 1 2 3 4
0 20376.650 22398.290 4.80 0 1
1 7048.842 8421.754 1.44 0 1
2 3716.890 4927.620 0.12 0 1
3 3716.890 4927.620 0.00 0 1
4 3716.890 4927.620 0.00 0 1
#get max values by columns
print df.max()
0 20376.65
1 22398.29
2 4.80
3 0.00
4 1.00
dtype: float64
#get min values by columns
print df.min()
0 3716.89
1 4927.62
2 0.00
3 0.00
4 1.00
dtype: float64
#difference - for columns 3 and 4 you get 0
print (df.max() - df.min())
0 16659.76
1 17470.67
2 4.80
3 0.00
4 0.00
dtype: float64
print df - df.mean()
0 1 2 3 4
0 12661.4176 13277.7092 3.528 0 0
1 -666.3904 -698.8268 0.168 0 0
2 -3998.3424 -4192.9608 -1.152 0 0
3 -3998.3424 -4192.9608 -1.272 0 0
4 -3998.3424 -4192.9608 -1.272 0 0
#you get NaN, because columns 3 and 4 (all 0 after subtracting the mean) are divided by a max-min difference of 0
df_norm = (df - df.mean()) / (df.max() - df.min())
print df_norm
0 1 2 3 4
0 0.76 0.76 0.735 NaN NaN
1 -0.04 -0.04 0.035 NaN NaN
2 -0.24 -0.24 -0.240 NaN NaN
3 -0.24 -0.24 -0.265 NaN NaN
4 -0.24 -0.24 -0.265 NaN NaN
Finally, when you write with to_csv, each NaN becomes "", because the parameter na_rep has the default value "":
print df_norm.to_csv(index=False, header=False, na_rep="")
0.76,0.76,0.735,,
-0.04,-0.04,0.035,,
-0.24,-0.24,-0.24,,
-0.24,-0.24,-0.265,,
-0.24,-0.24,-0.265,,
If you change the value of na_rep:
#change na_rep to * for testing
print df_norm.to_csv(index=False, header=False, na_rep="*")
0.76,0.76,0.735,*,*
-0.04,-0.04,0.035,*,*
-0.24,-0.24,-0.24,*,*
-0.24,-0.24,-0.265,*,*
-0.24,-0.24,-0.265,*,*
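A possible way to avoid the missing values altogether (my own sketch, not part of the answer above) is to make sure you never divide by a zero range: read the file with header=None so the first data row is not consumed as column names, and replace a zero max-min difference with 1, so constant columns normalize to 0 instead of NaN. The file name pooja.csv is taken from the question.

import pandas as pd

# header=None keeps the first data row as data instead of turning it into column names
df = pd.read_csv('pooja.csv', index_col=False, header=None)

features = df.iloc[:, :-1]   # every column except the last (the label)
label = df.iloc[:, -1]

rng = features.max() - features.min()
rng = rng.replace(0, 1)      # constant columns would give 0/0 = NaN; divide by 1 so they become 0

df_norm = (features - features.mean()) / rng
rslt = pd.concat([df_norm, label], axis=1)
rslt.to_csv('example.csv', index=False, header=False)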
Related
How can I join conditions of one line with samples of another in pandas/python?
There is a piece of software that exports a table like the following example:

import pandas as pd
s0 = ',,,,Cond1,,,,Cond2,,'.split(',')
s1 = 'Gene name,Description,Anova,FoldChange,Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7'.split(',')
s2 = 'HK1,Hexokinase,0.05,1.00,1.5,1.0,0.5,1.0,0,0,0'.split(',')
df = pd.DataFrame(data=(s0, s1, s2))

           0            1      2           3        4        5        6  \
0                                              Cond1
1  Gene name  Description  Anova  FoldChange  Sample1  Sample2  Sample3
2        HK1   Hexokinase   0.05        1.00      1.5      1.0      0.5

         7        8        9       10
0    Cond2
1  Sample4  Sample5  Sample6  Sample7
2      1.0        0        0        0

However, the organization of this table is not straightforward and therefore it is hard to analyze the conditions. I would like to produce data frames in which each condition is matched with its respective samples. It should be something like the output of the following code:

import pandas as pd
s1 = 'Gene name,Description,Anova,FoldChange,Sample1.Cond1,Sample2.Cond1,Sample3.Cond1,Sample4.Cond1,Sample5.Cond2,Sample6.Cond2,Sample7.Cond2'.split(',')
s2 = 'HK1,Hexokinase,0.05,1.00,1.5,1.0,0.5,1.0,0,0,0'.split(',')
df = pd.DataFrame(data=(s1, s2))

           0            1      2           3              4              5  \
0  Gene name  Description  Anova  FoldChange  Sample1.Cond1  Sample2.Cond1
1        HK1   Hexokinase   0.05        1.00            1.5            1.0

               6              7              8              9             10
0  Sample3.Cond1  Sample4.Cond1  Sample5.Cond2  Sample6.Cond2  Sample7.Cond2
1            0.5            1.0              0              0              0
Put NaN values into row 0 with Series.where, then fill them with Series.ffill. Finally, you can use Series.str.cat to join both rows:

df.iloc[0] = df.iloc[1].str.cat(
    df.iloc[0]
      .where(df.iloc[0].notnull() & df.iloc[0].ne(''))
      .ffill(),
    '.'
).fillna(df.iloc[1])
df = df.drop(1).reset_index(drop=True)
print(df)

Output:

           0            1      2           3              4              5  \
0  Gene name  Description  Anova  FoldChange  Sample1.Cond1  Sample2.Cond1
1        HK1   Hexokinase   0.05        1.00            1.5            1.0

               6              7              8              9             10
0  Sample3.Cond1  Sample4.Cond1  Sample5.Cond2  Sample6.Cond2  Sample7.Cond2
1            0.5            1.0              0              0              0
Pandas - split columns and include counts
I have the following dataframe:

       doc_id    is_fulltext
1243   dok:1     1
3310   dok:1     1
4370   dok:1     1
14403  dok:1020  1
17252  dok:1020  1
15977  dok:1020  0
16480  dok:1020  1
16252  dok:1020  1
468    dok:103   1
128    dok:1030  0
1673   dok:1038  1

I would like to split the is_fulltext column into two columns and count the occurrences of the docs at the same time. Desired output:

     doc_id  fulltext  non-fulltext
0     dok:1         3             0
1  dok:1020         4             1
2   dok:103         1             0
3  dok:1030         0             1
4  dok:1038         1             0

I followed the procedure of Pandas - Create columns from column value, and fill with count. That post shows several alternatives, suggesting Categorical or reindex. I tried the following:

cats = ['fulltext', 'non_fulltext']
df_sorted['is_fulltext'] = pd.Categorical(df_sorted['is_fulltext'], categories=cats)
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0)

Here I get a ValueError:

ValueError: Length of passed values is 17446, index implies 0

Then I tried this method:

cats = ['fulltext', 'non_fulltext']
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0).reindex(columns=cats).reset_index()

While this seems to have worked fine in the original post, my counts are filled with NaNs (see below). I have read by now that this happens when using reindex with categoricals, but I wonder why it seems to have worked in the original post. How can I solve this? Can anyone help? Thank you!

     doc_id  fulltext  non-fulltext
0     dok:1       NaN           NaN
1  dok:1020       NaN           NaN
2   dok:103       NaN           NaN
3  dok:1030       NaN           NaN
4  dok:1038       NaN           NaN
You could group by doc_id, apply pd.value_counts to each group and unstack:

(df.groupby('doc_id').is_fulltext.apply(pd.value_counts)
   .unstack()
   .fillna(0)
   .rename(columns={0: 'non-fulltext', 1: 'fulltext'})
   .reset_index())

     doc_id  non-fulltext  fulltext
0     dok:1           0.0       3.0
1  dok:1020           1.0       4.0
2   dok:103           0.0       1.0
3  dok:1030           1.0       0.0
4  dok:1038           0.0       1.0

Or, similarly to your own method, if performance is an issue, do instead (note the rename maps 0 to non_fulltext and 1 to fulltext):

(df.groupby(['doc_id', 'is_fulltext']).size()
   .unstack(fill_value=0)
   .rename(columns={0: 'non_fulltext', 1: 'fulltext'})
   .reset_index())

is_fulltext    doc_id  non_fulltext  fulltext
0               dok:1             0         3
1            dok:1020             1         4
2             dok:103             0         1
3            dok:1030             1         0
4            dok:1038             0         1
I don't know if it's the best approach, but this should work for you:

import pandas as pd

df = pd.DataFrame({"doc_id": ["id1", "id2", "id1", "id2"],
                   "is_fulltext": [1, 0, 1, 1]})
df_grouped = df.groupby("doc_id").sum().reset_index()
df_grouped["non_fulltext"] = df.groupby("doc_id").count().reset_index()["is_fulltext"] - df_grouped["is_fulltext"]
df_grouped

And the output is:

  doc_id  is_fulltext  non_fulltext
0    id1            2             0
1    id2            1             1
Pandas: How to maintain the type of columns with nan?
For example, I have a df with NaN and use the following method to fill it:

import pandas as pd

a = [[2.0, 10, 4.2], ['b', 70, 0.03], ['x', ]]
df = pd.DataFrame(a)
print(df)
df.fillna(int(0), inplace=True)
print('fillna df\n', df)
dtype_df = df.dtypes.reset_index()

OUTPUT:

   0     1     2
0  2  10.0  4.20
1  b  70.0  0.03
2  x   NaN   NaN

fillna df
   0     1     2
0  2  10.0  4.20
1  b  70.0  0.03
2  x   0.0  0.00

  col     type
0   0   object
1   1  float64
2   2  float64

Actually, I want column 1 to keep the type int instead of float. My desired output:

fillna df
   0   1     2
0  2  10  4.20
1  b  70  0.03
2  x   0  0.00

  col     type
0   0   object
1   1    int64
2   2  float64

So how to do it?
Try adding downcast='infer' to downcast any eligible columns:

df.fillna(0, downcast='infer')

   0   1     2
0  2  10  4.20
1  b  70  0.03
2  x   0  0.00

And the corresponding dtypes are:

0     object
1      int64
2    float64
dtype: object
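As an aside (my own note, not part of the answer): in recent pandas releases the downcast keyword of fillna is deprecated, so it is worth checking your version. A minimal alternative sketch is to fill first and then cast the numeric column back explicitly, assuming its values are whole numbers:

import pandas as pd

a = [[2.0, 10, 4.2], ['b', 70, 0.03], ['x', ]]
df = pd.DataFrame(a)

filled = df.fillna(0)
# cast column 1 back to int explicitly (assumes every value in it is a whole number)
filled[1] = filled[1].astype('int64')
print(filled.dtypes)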
pandas count over multiple columns
I have a dataframe looking like this:

Measure1  Measure2  Measure3  ...
0         1         3
1         3         2
3         0

I'd like to count the occurrences of the values over the columns to produce:

Measure  Count  Percentage
0        2      0.25
1        2      0.25
2        1      0.125
3        3      0.375

With

outcome_measure_count = cdss_data.groupby(key_columns=['Measure1'], operations={'count': agg.COUNT()}).sort('count', ascending=True)

I only get the first column (I am actually using the graphlab package, but I'd prefer pandas). Could someone help me?
You can generate the counts by flattening the df using ravel and value_counts; from this you can construct the final df:

In [230]:
import io
import pandas as pd

t = """Measure1 Measure2 Measure3
0 1 3
1 3 2
3 0 0"""

df = pd.read_csv(io.StringIO(t), sep='\s+')
df

Out[230]:
   Measure1  Measure2  Measure3
0         0         1         3
1         1         3         2
2         3         0         0

In [240]:
count = pd.Series(df.squeeze().values.ravel()).value_counts()
pd.DataFrame({'Measure': count.index, 'Count': count.values, 'Percentage': (count/count.sum()).values})

Out[240]:
   Count  Measure  Percentage
0      3        3    0.333333
1      3        0    0.333333
2      2        1    0.222222
3      1        2    0.111111

I inserted a 0 just to make the df shape correct, but you should get the point.
In [68]: df = DataFrame({'m1': [0, 1, 3], 'm2': [1, 3, 0], 'm3': [3, 2, np.nan]})

In [69]: df
Out[69]:
   m1  m2   m3
0   0   1  3.0
1   1   3  2.0
2   3   0  NaN

In [70]: df = df.apply(Series.value_counts).sum(1).to_frame(name='Count')

In [71]: df
Out[71]:
     Count
0.0    2.0
1.0    2.0
2.0    1.0
3.0    3.0

In [72]: df.index.name = 'Measure'

In [73]: df
Out[73]:
         Count
Measure
0.0        2.0
1.0        2.0
2.0        1.0
3.0        3.0

In [74]: df['Percentage'] = df.Count.div(df.Count.sum())

In [75]: df
Out[75]:
         Count  Percentage
Measure
0.0        2.0       0.250
1.0        2.0       0.250
2.0        1.0       0.125
3.0        3.0       0.375
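A small aside (my own sketch, not from either answer): stack drops the NaN and value_counts(normalize=True) gives the percentages directly, which reproduces the Count/Percentage table in one short pass:

import numpy as np
import pandas as pd

df = pd.DataFrame({'m1': [0, 1, 3], 'm2': [1, 3, 0], 'm3': [3, 2, np.nan]})

flat = df.stack()   # one long Series, NaN dropped
summary = pd.DataFrame({
    'Count': flat.value_counts().sort_index(),
    'Percentage': flat.value_counts(normalize=True).sort_index(),
})
summary.index.name = 'Measure'
print(summary)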
Slicing based on dates Pandas Dataframe
I have a large dataframe with dates, store number, units sold, and rain precipitation totals. It looks like this:

date        store_nbr  units  preciptotal
2014-10-11  1          0      0.00
2014-10-12  1          0      0.01
2014-10-13  1          2      0.00
2014-10-14  1          1      2.13
2014-10-15  1          0      0.00
2014-10-16  1          0      0.87
2014-10-17  1          3      0.01
2014-10-18  1          0      0.40

I want to select a three-day window around any date that has a precipitation total greater than 1. For this small example, I would want to get back the first 7 rows: the 3 days before 2014-10-14, the 3 days after it, and 2014-10-14 itself, because it has a preciptotal greater than 1.
Here are two ways you could build the selection mask without looping over the index values.

You could find the rows where preciptotal is greater than 1:

mask = (df['preciptotal'] > 1)

and then use scipy.ndimage.binary_dilation to expand the mask to a 7-day window:

import scipy.ndimage as ndimage
import pandas as pd

df = pd.read_table('data', sep='\s+')
mask = (df['preciptotal'] > 1)
mask = ndimage.binary_dilation(mask, iterations=3)
df.loc[mask]

yields

        date  store_nbr  units  preciptotal
0 2014-10-11          1      0         0.00
1 2014-10-12          1      0         0.01
2 2014-10-13          1      2         0.00
3 2014-10-14          1      1         2.13
4 2014-10-15          1      0         0.00
5 2014-10-16          1      0         0.87
6 2014-10-17          1      3         0.01

Or, using NumPy (without the scipy dependency), you could use mask.shift with np.logical_and.reduce:

mask = (df['preciptotal'] > 1)
mask = ~np.logical_and.reduce([(~mask).shift(i) for i in range(-3, 4)]).astype(bool)
# array([ True,  True,  True,  True,  True,  True,  True, False], dtype=bool)
For a specific value you can do this:

In [84]:
idx = df[df['preciptotal'] > 1].index[0]
df.iloc[idx-3: idx+4]

Out[84]:
         date  store_nbr  units  preciptotal
0  2014-10-11          1      0         0.00
1  2014-10-12          1      0         0.01
2  2014-10-13          1      2         0.00
3  2014-10-14          1      1         2.13
4  2014-10-15          1      0         0.00
5  2014-10-16          1      0         0.87
6  2014-10-17          1      3         0.01

For the more general case you can get an array of indices where the condition is met:

idx_vals = df[df['preciptotal'] > 1].index

then you can generate slices or iterate over the array values:

for idx in idx_vals:
    df.iloc[idx-3: idx+4]

This assumes your index is a 0-based int64 index, which your sample is.
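If you need the union of all windows as a single frame (the answers above stop at iterating), here is a hedged sketch of one way to combine them, using hypothetical data shaped like the question's table and the same 0-based integer index assumption:

import pandas as pd

# hypothetical data shaped like the question's table
df = pd.DataFrame({
    'date': pd.date_range('2014-10-11', periods=8).astype(str),
    'store_nbr': 1,
    'units': [0, 0, 2, 1, 0, 0, 3, 0],
    'preciptotal': [0.00, 0.01, 0.00, 2.13, 0.00, 0.87, 0.01, 0.40],
})

idx_vals = df.index[df['preciptotal'] > 1]

# union of the 3-row windows around every qualifying row, clipped to the frame bounds
keep = sorted(set().union(*(range(max(i - 3, 0), min(i + 4, len(df)))
                            for i in idx_vals)))
print(df.iloc[keep])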