Pandas Pivot causes all values to be NaN - python

I have a pandas dataframe consisting of 12 columns and 900 entries which looks like this:
In [1]: df
Out[1]:
Id BestInGen Ceiling Fitness Floor Generation Name Precision Runid SolutionId Timestamp Value
0 1 True 2.5 2.416582e+11 0.500 1 H1001Thickness1 0.010 20180214142319 4 2018-02-14 14:28:41.391908 0.500
1 2 False 0.1 2.830500e+11 0.015 1 H6512Diameter8 0.005 20180214142319 3 2018-02-14 14:28:41.423109 0.015
2 3 False 2.5 2.830500e+11 0.500 1 H2201Thickness1 0.010 20180214142319 3 2018-02-14 14:28:41.423109 0.500
3 4 False 0.1 2.830500e+11 0.015 1 H2201Diameter1 0.005 20180214142319 3 2018-02-14 14:28:41.423109 0.015
4 5 False 2.5 2.830500e+11 0.500 1 H2201Thickness2 0.010 20180214142319 3 2018-02-14 14:28:41.423109 0.500
I want to pivot this dataframe such that 'Name' is turned into columns, and the rows populated by 'Value'.
Currently I have tried the following:
dfPivot = df.pivot(index='Id', columns='Name', values='Value')
I thought this would create the results I need, and that has been the case in the other threads I've seen. But in my case the following happens:
In [3]: dfPivot
Out[3]:
Name H1001Diameter1 H1001Diameter10 H1001Diameter12
Id
1 NaN NaN NaN
And the same continues to the end of the dataframe, all values being NaN. The original datatype is float64, and there are no NaNs in the original data.
Any pointers on how to solve this? Sorry if this is a noob question, or please let me know if you need me to edit my question/example.

Try:
pd.pivot_table(df[['Id', 'Name', 'Value']],
               index='Id',
               columns=['Name'],
               values=['Value'],
               aggfunc=lambda x: x)
This assumes that you don't have duplicate values. Otherwise, you need to change the aggfunc to do a proper aggregation.
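For instance, a minimal sketch of an explicit aggregation (the choice of 'first' here is an assumption; pick whatever suits the duplicates in your data):
pd.pivot_table(df[['Id', 'Name', 'Value']],
               index='Id',
               columns='Name',
               values='Value',
               aggfunc='first')  # or 'mean', 'sum', ... depending on your data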

Related

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like below
import pandas as pd

data = {'ReportingDate': ['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
                          '2013/6/28','2013/6/28','2013/6/28','2013/6/28','2013/6/28'],
        'MarketCap': [' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
        'AUM': [3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
        'weight': [' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)  # note: the column name has no space
df
This is just a sample of an 8,000-row dataset. ReportingDate runs from 2013/5/31 to 2015/10/30 and covers every month in that period, but only the last day of each month. The first line of each month has two missing values. I know that:
the sum of weight for each month is equal to 1
weight * AUM is equal to MarketCap
I can use the lines below to get the answer I want, but only for one month:
a = 1 - df["2013-5"].iloc[1:]['weight'].sum()
b = a * AUM  # AUM here stands for that month's AUM value, e.g. 3.5
df.iloc[1, 0] = b
df.iloc[1, 2] = a
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
import numpy as np
import pandas as pd

# If the blanks really are whitespace strings, not NaN
df = df.replace(r"\s+", np.nan, regex=True)
# If the index is not already a datetime series
df.index = pd.to_datetime(df.index)
# Missing weight = 1 minus the sum of the known weights in that month
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
Note: this assumes that the dates are always month-end dates, so that grouping by date is equivalent to grouping by year-month. If not, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25
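As a quick sanity check (a sketch against the frame above), each month's filled weights should now sum to 1:
print(df.groupby(df.index.date)["weight"].sum())
# 2013-05-31    1.0
# 2013-06-28    1.0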

Calculation is done only on part of the table

I am trying to calculate the kurtosis and skewness over my data, and I managed to create a table, but for some reason the result covers only a few of the columns, not all of the fields.
For example, as you can see, I have many fields (columns):
I calculate the skewness and kurtosis using the following code:
sk = pd.DataFrame(data.skew())
kr = pd.DataFrame(data.kurtosis())
sk['kr'] = kr
sk.rename(columns={0: 'sk'}, inplace=True)
but then I get a result that contains only about half of the data I have:
I have tried head(10), but it doesn't change the fact that some columns disappeared.
How can I calculate this for all the columns?
It is really hard to reproduce the error since you did not give the original data. Probably your dataframe contains non-numerical values in the missing columns, which would cause this behavior:
dat = {"1": {'lg1':0.12, 'lg2':0.23, 'lg3':0.34, 'lg4':0.45},
"2":{'lg1':0.12, 'lg2':0.23, 'lg3':0.34, 'lg4':0.45},
"3":{'lg1':0.12, 'lg2':0.23, 'lg3':0.34, 'lg4':0.45},
"4":{'lg1':0.12, 'lg2':0.23, 'lg3':0.34, 'lg4':0.45},
"5":{'lg1':0.12, 'lg2':0.23, 'lg3': 'po', 'lg4':0.45}}
df = pd.DataFrame.from_dict(dat).T
print(df)
lg1 lg2 lg3 lg4
1 0.12 0.23 0.34 0.45
2 0.12 0.23 0.34 0.45
3 0.12 0.23 0.34 0.45
4 0.12 0.23 0.34 0.45
5 0.12 0.23 po 0.45
print(df.kurtosis())
lg1 0
lg2 0
lg4 0
The solution is to preprocess the data.
One word of advice: check whether the error is consistent, i.e. are the same columns always missing?
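For example, a minimal preprocessing sketch, assuming the stray values should simply be treated as missing:
import pandas as pd

# coerce anything non-numeric to NaN so every column stays numeric
df = df.apply(pd.to_numeric, errors="coerce")
print(df.kurtosis())  # now includes lg3 as well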

How do I apply a lambda function on pandas slices, and return the same format as the input data frame?

I want to apply a function to column slices of a dataframe in pandas, row by row, and return a dataframe of the same shape holding the values computed for each slice.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply lambda function f from column 0 to 5 and from column 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5, :]))
but this only covers the first slice. How can I include the second slice, so that the resulting frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice?
I hope that makes sense. What would be the right way to go about this?
Thank you.
You can simply reassign the result to the original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
    return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# make a copy of df here
df1 = df.copy()
# just reassign the slices back to the copy; this works because .T is a
# view when the frame has a single dtype (pre copy-on-write pandas)
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5, :]), f(df.T.iloc[5:, :])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
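If you would rather not rely on in-place assignment through .T, here is a sketch of the same computation with groupby/transform (the block size of 5 is assumed from the question):
import numpy as np

# label the 10 original columns in blocks of five, demean within each block
groups = np.arange(df.shape[1]) // 5
df2 = df.T.groupby(groups).transform(lambda x: x - x.mean()).T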

Returning a subset of a dataframe using a conditional statement

I'm fairly new to python so I apologize in advance if this is a rookie mistake. I'm using python 3.4. Here's the problem:
I have a pandas dataframe with a datetimeindex and multiple named columns like so:
>>>df
'a' 'b' 'c'
1949-01-08 42.915 0 1.448
1949-01-09 19.395 0 0.062
1949-01-10 1.077 0.05 0.000
1949-01-11 0.000 0.038 0.000
1949-01-12 0.012 0.194 0.000
1949-01-13 0.000 0 0.125
1949-01-14 0.000 0.157 0.007
1949-01-15 0.000 0.003 0.000
I am trying to extract a subset using both the year from the datetimeindex and a conditional statement on the values:
>>>df['1949':'1980'][df > 0]
'a' 'b' 'c'
1949-01-08 42.915 NaN 1.448
1949-01-09 19.395 NaN 0.062
1949-01-10 1.077 0.05 NaN
1949-01-11 NaN 0.038 NaN
1949-01-12 0.012 0.194 NaN
1949-01-13 NaN NaN 0.125
1949-01-14 NaN 0.157 0.007
1949-01-15 NaN 0.003 NaN
My final goal is to find percentiles of this subset, however np.percentile cannot handle NaNs. I have tried using the dataframe quantile method but there are a couple of missing data points which cause it to drop the whole column. It seems like it would be simple to use a conditional statement to select values without returning NaNs, but I can't seem to find anything that will return a smaller subset without the NaNs. Any help or suggestions would be much appreciated. Thanks!
I don't know exactly what result you expect.
You can use df >= 0 to keep 0 in the columns:
df['1949':'1980'][df >= 0]
You can use .fillna(0) to change NaN into 0:
df['1949':'1980'][df > 0].fillna(0)
You can use .dropna() to remove rows with any NaN, but this way you will probably get an empty result:
df['1949':'1980'][df > 0].dropna()
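For the percentile goal itself, a sketch that avoids NaNs entirely, assuming you want percentiles over all positive values in the subset:
import numpy as np

subset = df['1949':'1980']
vals = subset[subset > 0].stack()  # stack() drops the NaNs and flattens to one Series
print(np.percentile(vals, 90))     # np.nanpercentile on the raw values also works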

Pandas: Reindex Unsorts Dataframe

I'm having some trouble sorting and then resetting my Index in Pandas:
dfm = dfm.sort(['delt'],ascending=False)
dfm = dfm.reindex(index=range(1,len(dfm)))
The dataframe comes back unsorted after I reindex. My ultimate goal is to have a sorted dataframe with index numbers running from 1 to len(dfm), so if there's a better way to do that, I wouldn't mind.
Thanks!
Instead of reindexing, just change the actual index:
dfm.index = range(1,len(dfm) + 1)
That won't change the order, just the index.
I think you're misunderstanding what reindex does. It uses the passed index to select values along the axis passed, then fills with NaN wherever your passed index doesn't match up with the current index. What you're interested in is just setting the index to something else:
In [12]: df = DataFrame(randn(10, 2), columns=['a', 'delt'])
In [13]: df
Out[13]:
a delt
0 0.222 -0.964
1 0.038 -0.367
2 0.293 1.349
3 0.604 -0.855
4 -0.455 -0.594
5 0.795 0.013
6 -0.080 -0.235
7 0.671 1.405
8 0.436 0.415
9 0.840 1.174
In [14]: df.reindex(index=arange(1, len(df) + 1))
Out[14]:
a delt
1 0.038 -0.367
2 0.293 1.349
3 0.604 -0.855
4 -0.455 -0.594
5 0.795 0.013
6 -0.080 -0.235
7 0.671 1.405
8 0.436 0.415
9 0.840 1.174
10 NaN NaN
In [16]: df.index = arange(1, len(df) + 1)
In [17]: df
Out[17]:
a delt
1 0.222 -0.964
2 0.038 -0.367
3 0.293 1.349
4 0.604 -0.855
5 -0.455 -0.594
6 0.795 0.013
7 -0.080 -0.235
8 0.671 1.405
9 0.436 0.415
10 0.840 1.174
Remember, if you want len(df) to be in the index, you have to add 1 to the endpoint, since Python doesn't include endpoints when constructing ranges.
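For reference, in modern pandas (where DataFrame.sort was replaced by sort_values), a sketch of the whole sort-and-renumber goal is a two-liner:
dfm = dfm.sort_values('delt', ascending=False).reset_index(drop=True)
dfm.index = dfm.index + 1  # 1-based index from 1 to len(dfm)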
