Adding items to empty pandas DataFrame - python

I want to dynamically extend an empty pandas DataFrame in the following way:
df = pd.DataFrame()
indices = ['A', 'B', 'C']
columns = ['C1', 'C2', 'C3']
for column in columns:
    for index in indices:
        # df[index, column] = anyValue
Where both indices and columns can have arbitrary sizes which are not known in advance, i.e. I cannot create a DataFrame with the correct size up front.
Which pandas function can I use for
#df[index,column] = anyValue
?

I think you can use loc:
df = pd.DataFrame()
df.loc[0,1] = 10
df.loc[2,8] = 100
print(df)
      1      8
0  10.0    NaN
2   NaN  100.0
A faster solution is DataFrame.set_value (note: set_value was deprecated in pandas 0.21 and removed in 1.0, so on current versions use .at as shown in the next answer):
df = pd.DataFrame()
indices = ['A', 'B', 'C']
columns = ['C1', 'C2', 'C3']
for column in columns:
    for index in indices:
        df.set_value(index, column, 1)
print(df)
    C1   C2   C3
A  1.0  1.0  1.0
B  1.0  1.0  1.0
C  1.0  1.0  1.0

loc works very well, but...
For single assignments, use at:
df = pd.DataFrame()
indices = ['A', 'B', 'C']
columns = ['C1', 'C2', 'C3']
for column in columns:
    for index in indices:
        df.at[index, column] = 1
df
.at vs .loc vs .set_value timing
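As a rough way to compare them yourself, here is a minimal timing sketch using Python's timeit module (my own addition; numbers will vary by machine and pandas version, and set_value is left out because it no longer exists in current pandas):
import timeit

setup = "import pandas as pd; df = pd.DataFrame(index=range(100), columns=range(100))"
n = 10_000
t_loc = timeit.timeit("df.loc[50, 50] = 1", setup=setup, number=n)
t_at = timeit.timeit("df.at[50, 50] = 1", setup=setup, number=n)
print(f".loc: {t_loc:.3f}s   .at: {t_at:.3f}s   ({n} single assignments each)")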

Related

Concatenate values in a dataframe with value in preceding column on same row - Python

I am trying to concatenate the value in each cell with the value in the cell immediately before it on the same row (i.e. one column to its left) throughout my dataframe. Naturally, the first column's values won't have anything to concatenate with. Also, my df has NaN values, which I have changed to None.
Any help would be appreciated.
Thanks in advance.
Try add then a row-wise cumsum:
out = df.add('_').apply(lambda x: x[x.notna()].cumsum().str[:-1], axis=1)
Out[871]:
   1    2      3        4          5
0  a  a_b  a_b_c  a_b_c_d  a_b_c_d_e
1  a  a_e  a_e_f      NaN        NaN
# Constructing the dataframe:
df = pd.DataFrame({'l0': list('aaab'),
                   'l1': list('begj'),
                   'l2': list('cfhk'),
                   'l3': ['d', np.nan, 'i', 'l'],
                   'l4': ['e', np.nan, np.nan, 'm']})
I am iterating through the columns one by one, using pandas.Series.str.cat, and replacing them in the original dataframe:
prev = df.iloc[:, 0]
for col in df.columns[1:]:
    prev = prev.str.cat(df[col], sep='_')
    df[col] = prev
Using a simple loop over the columns while keeping the column-wise operations vectorized:
df2 = df.copy()
for i in range(1, df.shape[1]):
    df2.iloc[:, i] = df2.iloc[:, i - 1] + '_' + df2.iloc[:, i]
output:
  l0   l1     l2       l3         l4
0  a  a_b  a_b_c  a_b_c_d  a_b_c_d_e
1  a  a_e  a_e_f      NaN        NaN
2  a  a_g  a_g_h  a_g_h_i        NaN
3  b  b_j  b_j_k  b_j_k_l  b_j_k_l_m
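For reference, the add/cumsum one-liner from the first answer produces this same table when run on the frame constructed above; a minimal self-contained sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'l0': list('aaab'),
                   'l1': list('begj'),
                   'l2': list('cfhk'),
                   'l3': ['d', np.nan, 'i', 'l'],
                   'l4': ['e', np.nan, np.nan, 'm']})

out = df.add('_').apply(lambda x: x[x.notna()].cumsum().str[:-1], axis=1)
print(out)   # same l0..l4 table as shown above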

Concatenate a list of data frames not having same columns to one single data frame

Df1 has columns A, B, C, D; Df2 has columns A, B, D. Df1 and Df2 are in a list. How do I concatenate them into one df?
Or can I directly append these dfs into one single df without using a list?
Short answer: yes, you can combine them into a single pandas DataFrame without much work. Sample code:
import pandas as pd
df1 = [(1,2,3,4)]
df2 = [(9,9,9)]
df1 = pd.DataFrame(df1, columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(df2, columns=['A', 'B', 'D'])
df = pd.concat([df1, df2], sort=False)
Which results in:
>>> pd.concat([df1, df2], sort=False)
   A  B    C  D
0  1  2  3.0  4
0  9  9  NaN  9
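Since the question says the frames already sit in a list, here is a minimal sketch of that route (the list name dfs is just for illustration): pd.concat accepts the list directly, fills missing columns with NaN, and ignore_index=True rebuilds the row index:
import pandas as pd

df1 = pd.DataFrame([(1, 2, 3, 4)], columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame([(9, 9, 9)], columns=['A', 'B', 'D'])
dfs = [df1, df2]                       # the list holding the frames

df = pd.concat(dfs, ignore_index=True, sort=False)
print(df)
#    A  B    C  D
# 0  1  2  3.0  4
# 1  9  9  NaN  9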

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where Series's index matches DataFrame's columns using pd.concat, but it gives me surprises:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
     a    b    0
a  NaN  NaN  1.0
b  NaN  NaN  2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a one-row DataFrame from the Series, via to_frame and transposing:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will be like:
a b
1 1 2
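A follow-up note (my own, not part of the original answers): DataFrame.append was removed in pandas 2.0, so the concat route is the one that still works there, and unlike loc-enlargement it keeps labels that are not already columns. A minimal self-contained sketch with the same df and sr as in this answer:
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3})

out = pd.concat([df, sr.to_frame(1).T], sort=False)
print(out)
#    a  b  c
# 1  1  2  3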

How to groupby time series data

I have the dataframe below; column B's dtype is datetime64.
   A          B
0  a 2016-09-13
1  b 2016-09-14
2  b 2016-09-15
3  a 2016-10-13
4  a 2016-10-14
I would like to group by month (or, more generally, by year, day, ...),
so I would like to get the count result below, keyed on column B.
         a  b
2016-09  1  2
2016-10  2  0
I tried groupby, but I couldn't figure out how to handle dtypes like datetime64...
How can I handle and group a datetime64 column?
If you set the index to the datetime column you can use pd.TimeGrouper to group by various time ranges (TimeGrouper is deprecated in newer pandas; pd.Grouper, used in the alternative below, replaces it). Example code:
# recreate dataframe
df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': ['2016-09-13', '2016-09-14', '2016-09-15',
                         '2016-10-13', '2016-10-14']})
df['B'] = pd.to_datetime(df['B'])
# set column B as index for use of TimeGrouper
df.set_index('B', inplace=True)
# Now do the magic of Ami Tavory's answer combined with TimeGrouper:
df = df.groupby([pd.TimeGrouper('M'), 'A']).size().unstack().fillna(0)
This returns:
A             a    b
B
2016-09-30  1.0  2.0
2016-10-31  2.0  0.0
or alternatively (credit to ayhan) skip the set-index step and use the following one-liner straight after creating the dataframe:
# recreate dataframe
df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': ['2016-09-13', '2016-09-14', '2016-09-15',
                         '2016-10-13', '2016-10-14']})
df['B'] = pd.to_datetime(df['B'])
df = df.groupby([pd.Grouper(key='B', freq='M'), 'A']).size().unstack().fillna(0)
which returns the same answer.
Say you start with
In [247]: df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'], 'B': ['2016-09-13', '2016-09-14', '2016-09-15', '2016-10-13', '2016-10-14']})
In [248]: df.B = pd.to_datetime(df.B)
Then you can groupby-size, then unstack:
In [249]: df = df.groupby([df.B.dt.year.astype(str) + '-' + df.B.dt.month.astype(str), df.A]).size().unstack().fillna(0).astype(int)
Finally, you just need to make B a date again:
In [250]: df.index = pd.to_datetime(df.index)
In [251]: df
Out[251]:
A           a  b
B
2016-10-01  2  0
2016-09-01  1  2
Note that the final conversion to a datetime sets a uniform day (you can't have a "dayless" object of this type).
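As a further variant (my own suggestion, not from either answer), you can group on a monthly Period instead of building year-month strings; periods already print as 2016-09-style labels:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': pd.to_datetime(['2016-09-13', '2016-09-14', '2016-09-15',
                                        '2016-10-13', '2016-10-14'])})

out = df.groupby([df.B.dt.to_period('M'), 'A']).size().unstack(fill_value=0)
print(out)
# A        a  b
# B
# 2016-09  1  2
# 2016-10  2  0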

Manipulate specific columns (sample features) conditional on another column's entries (feature value) using pandas/numpy dataframe

my input dataframe (shortened) looks like this:
>>> import numpy as np
>>> import pandas as pd
>>> df_in = pd.DataFrame([[1, 2, 'a', 3, 4], [6, 7, 'b', 8, 9]],
... columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_in
c1 c2 col c3 c4
0 1 2 a 3 4
1 6 7 b 8 9
It is supposed to be manipulated as follows: if a row (sample) has a specific value (e.g. 'b' here) in column 'col' (the feature), then the entries in columns 'c1' and 'c2' of the same row should be converted to np.nan.
Result wanted:
>>> df_out = pd.DataFrame([[1, 2, 'a', 3, 4], [np.nan, np.nan, 'b', 8, 9]],
...                       columns=['c1', 'c2', 'col', 'c3', 'c4'])
>>> df_out
c1 c2 col c3 c4
0 1 2 a 3 4
1 NaN NaN b 8 9
So far, I managed to obtain the desired result via the code
>>> dic = {'col' : ['c1', 'c2']} # auxiliary
>>> b_w = df_in[df_in['col'] == 'b'] # Subset with 'b' in 'col'
>>> b_w = b_w.drop(dic['col'], axis=1)   # drop 'c1', 'c2'; the concat below restores them as NaN
>>> b_wo = df_in[df_in['col'] != 'b'] # Subset without 'b' in 'col'
>>> df_out = pd.concat([b_w, b_wo]) # Both Subsets together again
>>> df_out
c1 c2 c3 c4 col
1 NaN NaN 8 9 b
0 1.0 2.0 3 4 a
Although I get what I want (the original data consists entirely of floats, so don't worry about the int-to-float mutation here), it is a rather inelegant snippet of code. My educated guess is that this could be done faster using the built-in functions from pandas and numpy, but I am unable to manage this.
Any suggestions how to code this in a fast and efficient way for daily use? Any help is highly appreciated. :)
You can condition on both the row and column positions to assign values using loc, which supports boolean indexing as well as label-based indexing:
df_in.loc[df_in.col == 'b', ['c1', 'c2']] = np.nan
df_in
# c1 c2 col c3 c4
# 0 1.0 2.0 a 3 4
# 1 NaN NaN b 8 9
When using pandas I would go for the solution provided by @Psidom.
However, for larger datasets the full pandas -> numpy -> pandas round trip (DataFrame -> numpy array -> DataFrame) is faster (about 10% less processing time on my setup). Without converting back to a DataFrame, numpy is almost twice as fast for my dataset.
Solution for the question asked:
cols, df_out = df_in.columns, df_in.values
for i in [0, 1]:
    df_out[df_out[:, 2] == 'b', i] = np.nan
df_out = pd.DataFrame(df_out, columns=cols)
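The per-column loop can also be collapsed into a single masked assignment; this is my own small variation on the idea above, assuming the same df_in and that 'c1' and 'c2' are the first two columns:
import numpy as np
import pandas as pd

df_in = pd.DataFrame([[1, 2, 'a', 3, 4], [6, 7, 'b', 8, 9]],
                     columns=['c1', 'c2', 'col', 'c3', 'c4'])

arr = df_in.values                     # object array, since 'col' holds strings
arr[arr[:, 2] == 'b', :2] = np.nan     # rows where col == 'b', first two columns
df_out = pd.DataFrame(arr, columns=df_in.columns)
print(df_out)
#     c1   c2 col c3 c4
# 0    1    2   a  3  4
# 1  NaN  NaN   b  8  9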
