I am iterating and as a result of a single iteration I acquire a pandas series object which looks like this:
DE_AT 118.55
DE_CZ 62.73
PL_DE 263.36
PL_SK 315.07
dtype: float64
Sometimes I might get different names and lengths of this series for example I might get:
DE_AT 118.55
DE_CZ 62.73
PL_DE 263.36
PL_NL 315.07
PL_UK 420
dtype: float64
Now I want to create a dataframe from these series objects when iterating such that I will have all names as the index, from these two series objects I would like to get:
index 1 2
DE_AT 118.55 118.55
DE_CZ 62.73 62.73
PL_DE 263.36 263.36
PL_SK 315.07 NaN
PL_NL NaN 315.07
PL_UK NaN 420
Or maybe I can store them in a list and later create a dataframe?
Basic outer join of two series:
s1=pd.Series(index=["DE_AT","DE_CZ","PL_DE", "PL_SK"], data=[1,2,3,4]).to_frame()
s2=pd.Series(index=["DE_AT","DE_CZ","PL_DE", "PL_NL", "PL_UK"], data=[1,2,3,4,5]).to_frame()
s1.join(s2, how="outer",lsuffix="1",rsuffix="2")
Output:
index
00
01
DE_AT
1.0
1.0
DE_CZ
2.0
2.0
PL_DE
3.0
3.0
PL_NL
NaN
4.0
PL_SK
4.0
NaN
PL_UK
NaN
5.0
In a dataframe, with some empty(NaN) values in some rows - Example below
s = pd.DataFrame([[39877380,158232151,20], [39877380,332086469,], [39877380,39877381,14], [39877380,39877383,8], [73516838,6439138,1], [73516838,6500551,], [735571896,203559638,], [735571896,282186552,], [736453090,6126187,], [673117474,12196071,], [673117474,12209800,], [673117474,618058747,6]], columns=['start','end','total'])
When I groupby start and end columns
s.groupby(['start', 'end']).total.sum()
the output I get is
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
735571896 203559638 nan
282186552 nan
736453090 6126187 nan
I want to exclude all the groups of start where all values with end is 'nan' - Expected output -
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
I tried with dropna(), but it is removing all the nan values and not nan groups.
I am newbie in python and pandas. Can someone help me in this? thank you
In newer pandas versions is necessary use min_count=1 for missing values if use sum:
s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
#oldier pandas version solution
#s1 = s.groupby(['start', 'end']).total.sum()
Then is possible filter if at least one non missing value per first level by Series.notna with GroupBy.transform and GroupBy.any, filtering is by boolean indexing:
s2 = s1[s1.notna().groupby(level=0).transform('any')]
#oldier pandas version solution
#s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
Or is possible get unique values of first level index values by MultiIndex.get_level_values and filtering by DataFrame.loc:
idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
#oldier pandas version solution
#s2 = s1.loc[idx[s1.notnull()].unique()]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
I'm trying to get data from txt file with pandas.read_csv but it doesn't show the repeated(same) values in the file such as I have 2043 in the row but It shows it once not in every row.
My file sample
Result set
All the circles I've drawn should be 2043 also but they are empty.
My code is :
import pandas as pd
df= pd.read_csv('samplefile.txt', sep='\t', header=None,
names = ["234", "235", "236"]
You get MultiIndex, so first level value are not shown only.
You can convert MultiIndex to columns by reset_index:
df = df.reset_index()
Or specify each column in parameter names for avoid MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t', names = ["one","two","next", "234", "235", "236"]
A word of warning with MultiIndex as I was bitten by this yesterday and wasted time trying to trouble shoot a non-existant problem.
If one of your index levels is of type float64 then you may find that the indexes are not shown in full. I had a dataframe I was df.groupby().describe() and the variable I was performing the groupby() on was originally a long int, at some point it was converted to a float and when printing out this index was rounded. There were a number of values very close to each other and so it appeared on printing that the groupby() had found multiple levels of the second index.
Thats not very clear so here is an illustrative example...
import numpy as np
import pandas as pd
index = np.random.uniform(low=89908893132829,
high=89908893132929,
size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
groupby(['index1', 'index2']).describe() and it looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby([index1,index2) had produced only one level of index1!
I want to resample a pandas dataframe and apply different functions to different columns. The problem is that I cannot properly process a column with strings. I would like to apply a function that merges the string with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np
idx = pd.date_range('2017-01-31', '2017-02-03')
data=list([[1,10,"ok"],[2,20,"merge"],[3,30,"us"]])
dates=pd.DatetimeIndex(['2017-01-31','2017-02-03','2017-02-03'])
d=pd.DataFrame(data, index=,columns=list('ABC'))
A B C
2017-01-31 1 10 ok
2017-02-03 2 20 merge
2017-02-03 3 30 us
Resampling the numeric columns A and B with a sum and mean aggregator works. Column C however kind of works with sum (but it gets placed on the second place, which might mean that something fails).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
A C B
2017-01-31 1.0 a 10.0
2017-02-01 NaN 0 NaN
2017-02-02 NaN 0 NaN
2017-02-03 5.0 merge us 25.0
I would like to get this:
...
2017-02-03 5.0 merge - us 25.0
I tried using lambda in different ways but without success (not shown).
If I may ask a second related question: I can do some post processing for this, but how to fill missing cells in different columns with zeros or ""?
Your agg function for column 'C' should be a join
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
A B C
2017-01-31 1.0 10.0 ok
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 5.0 25.0 merge - us