Adding a new column to a DataFrame with values dependent on the index - python

I want to add a new column to this DataFrame in Pandas, where I assign a StoreID rolling through the indexes.
It currently looks like this:
Unnamed: 12 Store
0 NaN 1
1 NaN 1
2 NaN 1
0 NaN 1
1 NaN 1
2 NaN 1
0 NaN 1
1 NaN 1
2 NaN 1
0 NaN 1
1 NaN 1
2 NaN 1
I want it to look like this:
Unnamed: 12 Store StoreID
0 NaN 1 1
1 NaN 1 1
2 NaN 1 1
0 NaN 1 2
1 NaN 1 2
2 NaN 1 2
0 NaN 1 5
1 NaN 1 5
2 NaN 1 5
0 NaN 1 11
1 NaN 1 11
2 NaN 1 11
The variable changes whenever the index hits 0. The report will have a variable number of items, most stores having hundreds of thousands of records.
I can create a new column easily but I can't seem to work out how to do this!
Any help much appreciated - I'm just starting out with Python.
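
For anyone wanting to reproduce this shape, here is a minimal sketch (assuming four stores with three rows each, matching the example above):

import numpy as np
import pandas as pd

# the index restarts at 0 for every store block because each
# per-store frame keeps its own 0-based index through the concat
df = pd.concat([pd.DataFrame({'Unnamed: 12': [np.nan] * 3, 'Store': 1})] * 4)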

You can also take the cumsum of the diff of the index:
df['g'] = (df.index.to_series().diff() < 0).cumsum()
0 0
1 0
2 0
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
2 3

Using np.ndarray.cumsum: df.index == 0 gives a boolean array that is True at each reset, so its cumulative sum numbers the blocks (subtract 1 to start from zero):
df['g'] = (df.index == 0).cumsum() - 1
print(df)
col Store g
0 NaN 1 0
1 NaN 1 0
2 NaN 1 0
0 NaN 1 1
1 NaN 1 1
2 NaN 1 1
0 NaN 1 2
1 NaN 1 2
2 NaN 1 2
0 NaN 1 3
1 NaN 1 3
2 NaN 1 3

IIUC, try cumcount. Each index value repeats once per store block, so numbering the repeats within each index value gives the 0-based block number:
df.groupby(df.index).cumcount()
Out[11]:
0 0
1 0
2 0
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
2 3
dtype: int64

Thanks for everyone's reply. I have ended up solving the problem with:
table['STORE_ID'] = (table.index == 0).cumsum() - 1
then adding some logic to look up the store ID based on the sequence:
table.loc[table['STORE_ID'] == 3, 'STORE_ID'] = 11
table.loc[table['STORE_ID'] == 2, 'STORE_ID'] = 3
table.loc[table['STORE_ID'] == 1, 'STORE_ID'] = 2
table.loc[table['STORE_ID'] == 0, 'STORE_ID'] = 1
I imagine there's a simpler solution to get to the Store_ID sequence quicker, but this gets the job done for now.
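
For reference, the chain of .loc assignments can be collapsed into a single pass with Series.map using the same lookup (a sketch; the descending order above was needed to avoid re-mapping already-replaced values, which map sidesteps entirely):

table['STORE_ID'] = table['STORE_ID'].map({0: 1, 1: 2, 2: 3, 3: 11})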

Related

Turn columns' values into headers of columns with values 1 and 0 (accordingly) [python]

I got a column of the form:
   q4
1   4
2   3
3   1
4   2
5   1
6   5
7   1
8   3
The column represents users' answers to a question with 5 choices (1-5).
I want to turn this into a matrix of 5 columns, one per possible answer, where the values are 1 or 0 according to the user's given answer.
Visually, I want a matrix of the form:
   q4_1 q4_2 q4_3 q4_4 q4_5
1   NaN  NaN  NaN    1  NaN
2   NaN  NaN    1  NaN  NaN
3     1  NaN  NaN  NaN  NaN
4   NaN    1  NaN  NaN  NaN
5     1  NaN  NaN  NaN  NaN
import numpy as np

for i in range(1, 6):
    df['q4_' + str(i)] = np.where(df.q4 == i, 1, 0)
del df['q4']
Output:
>>> print(df)
q4_1 q4_2 q4_3 q4_4 q4_5
0 0 0 0 1 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
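For completeness (this is not in the original answer), pandas has a built-in one-hot encoder that produces the same 1/0 matrix without the loop; the astype(int) guards against newer pandas versions returning booleans:

dummies = pd.get_dummies(df['q4'], prefix='q4').astype(int)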
I think pivot is the way to go. You'd have to prepopulate the df with the values you want in the new table.
Also, I don't understand why you want only 5 rows, but I added that as well via iloc. If you remove it, you will have this data for your entire index (up to 8).
import pandas as pd
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
df = df.reset_index().pivot(index='q4', columns='index', values='values').T.iloc[:5]
which prints:
q4 1 2 3 4 5
index
1 NaN NaN NaN 1.0 NaN
2 NaN NaN 1.0 NaN NaN
3 1.0 NaN NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 1.0 NaN NaN NaN NaN

pandas calculate difference based on indicators grouped by a column

Here is my question. I don't know how to describe it, so I will just give an example.
a b k
0 0 0
0 1 1
0 2 0
0 3 0
0 4 1
0 5 0
1 0 0
1 1 1
1 2 0
1 3 1
1 4 0
Here, "a" is user id, "b" is time, and "k" is a binary indicator flag. "b" is consecutive for sure.
What I want to get is this:
a b k diff_b
0 0 0 nan
0 1 1 nan
0 2 0 1
0 3 0 2
0 4 1 3
0 5 0 1
1 0 0 nan
1 1 1 nan
1 2 0 1
1 3 1 2
1 4 0 1
So, diff_b is a time-difference variable: it shows the duration between the current time point and the most recent time point with an action. If there has never been an action before, it returns NaN. This diff_b is grouped by a; for each user it is calculated independently.
Can anyone revise my title? I don't know how to describe this in English. So complex...
Thank you!
IIUC:
df['New'] = df.b.loc[df.k == 1]  # get the value of b wherever k equals 1
df.New = df.groupby('a').New.apply(lambda x: x.ffill().shift())  # forward-fill, then shift
df.b - df['New']  # yields
Out[260]:
0 NaN
1 NaN
2 1.0
3 2.0
4 3.0
5 1.0
6 NaN
7 NaN
8 1.0
9 2.0
10 1.0
dtype: float64
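The same logic can be condensed into a single pass with transform (a sketch of the idea above, not from the original answer):

last_action = df.b.where(df.k == 1).groupby(df.a).transform(lambda s: s.ffill().shift())
df['diff_b'] = df.b - last_action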
Create partitions of the rows after each k == 1 up to the next k == 1, using shift and cumsum, within each group of a:
parts = df.groupby('a').k.apply(lambda x: x.shift().cumsum())
Group by df.a and parts, and calculate the difference between b and b.min() within each group:
vals = df.groupby([df.a, parts]).b.apply(lambda x: x - x.min() + 1)
Set values to null where parts == 0 and assign back to the dataframe:
import numpy as np
df['diff_b'] = np.select([parts != 0], [vals], np.nan)
outputs:
a b k diff_b
0 0 0 0 NaN
1 0 1 1 NaN
2 0 2 0 1.0
3 0 3 0 2.0
4 0 4 1 3.0
5 0 5 0 1.0
6 1 0 0 NaN
7 1 1 1 NaN
8 1 2 0 1.0
9 1 3 1 2.0
10 1 4 0 1.0

fill missing days in pandas dataframe

Given the dataframe
df = pd.DataFrame(data=[[1,1,3],[1,2,6],[1,4,3],[2,2,6]],columns=['ID','Day','Value'])
df
Out[58]:
ID Day Value
0 1 1 3
1 1 2 6
2 1 4 3
3 2 2 6
As you can see, for ID = 1 the value for Day 3 is missing, and for ID = 2 the value for Day 1 is missing... I would like to fill these gaps by adding the missing day with np.nan as the value...
Out[59]:
ID Day Value
0 1 1 3.0
1 1 2 6.0
2 1 3 NaN
3 1 4 3.0
4 2 1 NaN
5 2 2 6.0
You'll need to define a custom function that performs some reindexing logic:
import numpy as np

def f(x):
    return x.set_index('Day').reindex(
        np.arange(1, x.Day.max() + 1)
    ).Value
Now, perform a groupby + apply:
df.groupby('ID').apply(f).reset_index()
ID Day Value
0 1 1 3.0
1 1 2 6.0
2 1 3 NaN
3 1 4 3.0
4 2 1 NaN
5 2 2 6.0
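
If instead every ID should span the same global day range (an assumption; the function above stops at each ID's own maximum day), a cross-product reindex is an alternative sketch:

import numpy as np
import pandas as pd

full = pd.MultiIndex.from_product(
    [df.ID.unique(), np.arange(1, df.Day.max() + 1)],
    names=['ID', 'Day'])
df.set_index(['ID', 'Day']).reindex(full).reset_index()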

How to extract a value from a list in Pandas

Hi, I have a DataFrame that looks like this:
0 1
0 0 [03/25/93]
1 1 [6/18/85]
2 2 [7/8/71]
3 3 [9/27/75]
4 4 []
5 5 []
How can I extract the value inside the list into another column of the DataFrame?
0 1
0 0 03/25/93
1 1 6/18/85
2 2 7/8/71
3 3 9/27/75
4 4 NaN
5 5 NaN
Thank you very much.
Use str[0]:
df[1] = df[1].str[0]
print (df)
0 1
0 0 03/25/93
1 1 6/18/85
2 2 7/8/71
3 3 9/27/75
4 4 NaN
5 5 NaN
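
As a side note, the .str accessor indexes into list elements as well as strings and yields NaN for empty lists, which is why this works on non-string data. An explicit equivalent (a sketch):

import numpy as np

df[1] = df[1].apply(lambda x: x[0] if len(x) > 0 else np.nan)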

reading multiple csv files into a big pandas data frame by appending columns of different sizes

So I am creating some data frames in a loop and saving them as CSV files. The data frames have the same columns but different lengths. I would like to concatenate these data frames into a single data frame that has all the columns, something like:
df1
A B C
0 0 1 2
1 0 1 0
2 1.2 1 1
3 2 1 2
df2
A B C
0 0 1 2
1 0 1 0
2 0.2 1 2
df3
A B C
0 0 1 2
1 0 1 0
2 1.2 1 1
3 2 1 4
4 1 2 2
5 2.3 3 0
I would like to get something like:
df_big
A B C A B C A B C
0 0 1 2 0 1 2 0 1 2
1 0 1 0 0 1 0 0 1 0
2 1.2 1 1 0.2 1 2 1.2 1 1
3 2 1 2 2 1 4
4 1 2 2
5 2.3 3 0
Is this something that can be done in pandas?
You could use pd.concat:
df_big = pd.concat([df1, df2, df3], axis=1)
yields
A B C A B C A B C
0 0.0 1 2 0.0 1 2 0.0 1 2
1 0.0 1 0 0.0 1 0 0.0 1 0
2 1.2 1 1 0.2 1 2 1.2 1 1
3 2.0 1 2 NaN NaN NaN 2.0 1 4
4 NaN NaN NaN NaN NaN NaN 1.0 2 2
5 NaN NaN NaN NaN NaN NaN 2.3 3 0
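
Since the frames come from CSV files written in a loop, reading them back and concatenating could look like this (a sketch; the file-name pattern is an assumption):

import glob
import pandas as pd

# hypothetical file pattern - adjust to whatever names the loop wrote
frames = [pd.read_csv(path) for path in sorted(glob.glob('df_*.csv'))]
df_big = pd.concat(frames, axis=1)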
