I have the following dataframe:
doc_id is_fulltext
1243 dok:1 1
3310 dok:1 1
4370 dok:1 1
14403 dok:1020 1
17252 dok:1020 1
15977 dok:1020 0
16480 dok:1020 1
16252 dok:1020 1
468 dok:103 1
128 dok:1030 0
1673 dok:1038 1
I would like to split the is_fulltext column into two columns and count the occurrences of the docs at the same time.
Desired Output:
doc_id fulltext non-fulltext
0 dok:1 3 0
1 dok:1020 4 1
2 dok:103 1 0
3 dok:1030 0 1
4 dok:1038 1 0
I followed the procedure from the post Pandas - Create columns from column value, and fill with count
That post shows several alternatives, suggesting Categorical or reindex. I tried the following:
cats = ['fulltext', 'non_fulltext']
df_sorted['is_fulltext'] = pd.Categorical(df_sorted['is_fulltext'], categories=cats)
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0)
Here I get a ValueError:
ValueError: Length of passed values is 17446, index implies 0
Then I tried this method:
cats = ['fulltext', 'non_fulltext']
new_df = df_sorted.groupby(['doc_id','is_fulltext']).size().unstack(fill_value=0).reindex(columns=cats).reset_index()
While this seemed to work fine in the original post, my counts are all NaN (see below). I have since read that this can happen when combining reindex with categorical data, but I wonder why it worked in the original post. How can I solve this? Can anyone help? Thank you!
doc_id fulltext non-fulltext
0 dok:1 NaN NaN
1 dok:1020 NaN NaN
2 dok:103 NaN NaN
3 dok:1030 NaN NaN
4 dok:1038 NaN NaN
You could GroupBy the doc_id, apply pd.value_counts to each group and unstack:
(df.groupby('doc_id').is_fulltext.apply(pd.value_counts)
   .unstack()
   .fillna(0)
   .rename(columns={0: 'non-fulltext', 1: 'fulltext'})
   .reset_index())
doc_id non-fulltext fulltext
0 dok:1 0.0 3.0
1 dok:1020 1.0 4.0
2 dok:103 0.0 1.0
3 dok:1030 1.0 0.0
4 dok:1038 0.0 1.0
Or, similarly to your own method, if performance is an issue, do this instead:
(df.groupby(['doc_id', 'is_fulltext']).size()
   .unstack(fill_value=0)
   .rename(columns={0: 'non_fulltext', 1: 'fulltext'})
   .reset_index())
is_fulltext doc_id non_fulltext fulltext
0 dok:1 0 3
1 dok:1020 1 4
2 dok:103 0 1
3 dok:1030 1 0
4 dok:1038 0 1
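As for the NaNs in your own attempt: after unstack the columns are the integer codes 0 and 1, so reindex(columns=['fulltext', 'non_fulltext']) finds no matching columns and fills everything with NaN. A minimal sketch (assuming df_sorted holds your original data) that maps the codes to labels first, so the reindex approach from the linked post works:
cats = ['fulltext', 'non_fulltext']
labels = df_sorted['is_fulltext'].map({1: 'fulltext', 0: 'non_fulltext'})
new_df = (df_sorted.assign(is_fulltext=labels)
                   .groupby(['doc_id', 'is_fulltext'])
                   .size()
                   .unstack(fill_value=0)
                   .reindex(columns=cats, fill_value=0)
                   .reset_index())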
I don't know if it's the best approach, but this should work for you:
import pandas as pd

df = pd.DataFrame({"doc_id": ["id1", "id2", "id1", "id2"],
                   "is_fulltext": [1, 0, 1, 1]})

# Summing the 0/1 flags gives the fulltext count per doc_id
df_grouped = df.groupby("doc_id").sum().reset_index()
# Total rows per doc_id minus the fulltext count gives the non-fulltext count
df_grouped["non_fulltext"] = df.groupby("doc_id").count().reset_index()["is_fulltext"] - df_grouped["is_fulltext"]
df_grouped
And the output is:
doc_id is_fulltext non_fulltext
0 id1 2 0
1 id2 1 1
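For reference, pd.crosstab computes the same counts in one call; the renaming below follows the mapping from the original question (1 = fulltext, 0 = non-fulltext):
counts = (pd.crosstab(df['doc_id'], df['is_fulltext'])
            .rename(columns={1: 'fulltext', 0: 'non_fulltext'})
            .reset_index())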
I have a dataframe that looks like the following:
ID1 ID2 Date
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018
3 4 09/07/2018
etc.
What I need to do is to flag the first time that an ID in ID1 appears in ID2. In the above example this would look like
ID1 ID2 Date Flag
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018 Y
3 4 09/07/2018
I've used the following code to tell me if ID1 ever occurs in ID2:
ID2List = df['ID2'].tolist()
ID2List = list(set(ID2List))  # dedupe list
df['ID1 is in ID2List'] = np.where(df['ID1'].isin(ID2List), 'Yes', 'No')
But this only tells me whether ID1 appears in ID2 at some point, not the row at which this first occurs.
Any help?
One idea is to use next with a generator expression to find, for each row, the index of the first match in ID1, then compare with the row index and use argmax to get the index of the first True value:
# For each row, the position of the first value in ID1 equal to that row's ID2 (0 if none)
idx = df.apply(lambda row: next((i for i, val in enumerate(df['ID1'])
                                 if row['ID2'] == val), 0), axis=1)
# Flag the first row whose position comes after its matching occurrence in ID1
df.loc[(df.index > idx).argmax(), 'Flag'] = 'Y'
print(df)
ID1 ID2 Date Flag
0 1 2 01/01/2018 NaN
1 1 2 03/01/2018 NaN
2 1 2 04/05/2018 NaN
3 2 1 06/06/2018 Y
4 1 2 08/06/2018 NaN
5 3 4 09/07/2018 NaN
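For reference, a self-contained sketch of the above on the sample data (dates kept as plain strings); the intermediate idx comes out as [3, 3, 3, 0, 3, 0], so (df.index > idx).argmax() returns 3:
import pandas as pd

df = pd.DataFrame({'ID1': [1, 1, 1, 2, 1, 3],
                   'ID2': [2, 2, 2, 1, 2, 4],
                   'Date': ['01/01/2018', '03/01/2018', '04/05/2018',
                            '06/06/2018', '08/06/2018', '09/07/2018']})

# Position of the first row in ID1 equal to each row's ID2 (0 if no match)
idx = df.apply(lambda row: next((i for i, val in enumerate(df['ID1'])
                                 if row['ID2'] == val), 0), axis=1)
df.loc[(df.index > idx).argmax(), 'Flag'] = 'Y'
print(df)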
Given the following sample df:
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson NaN
2 1 1 Smith R
3 1 1 Smith NaN
4 0 1 Jackson X
5 1 1 Jackson NaN
6 1 1 Jackson NaN
I want to be able to fill the NaN values with the df['Value'] value associated with the given name in that row. My desired outcome is the following, which I know can be achieved like so:
df['Value'] = df['Value'].fillna(method='ffill')
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson C
2 1 1 Smith R
3 1 1 Smith R
4 0 1 Jackson X
5 1 1 Jackson X
6 1 1 Jackson X
However, this solution will not achieve the desired result if rows with the same name do not follow one another. I also cannot sort by df['Name'], as the order is important. Is there an efficient way of simply filling each NaN with the value associated with its name?
It's also important to note that a given Name will always only have a single value associated with it. Thank you in advance.
You should use groupby and transform:
df['Value'] = df.groupby('Name')['Value'].transform('first')
df
Other1 Other2 Name Value
0 0 1 Johnson C
1 0 0 Johnson C
2 1 1 Smith R
3 1 1 Smith R
4 0 1 Jackson X
5 1 1 Jackson X
6 1 1 Jackson X
Peter's answer is not correct because the first valid value may not always be the first in the group, in which case ffill will pollute the next group with the previous group's value.
ALollz's answer is fine, but dropna incurs some degree of overhead.
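A small sketch of that failure mode, with hypothetical data where the valid value is not the first row of its group:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Smith', 'Smith', 'Jones', 'Jones'],
                   'Value': [np.nan, 'R', np.nan, 'X']})

# Plain forward-fill leaks 'R' from the Smith group into the first Jones row
print(df['Value'].ffill())                             # NaN, R, R, X
# groupby + transform('first') keeps values within their own group
print(df.groupby('Name')['Value'].transform('first'))  # R, R, X, X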
I am working with some advertising data, such as email data. I have two data sets:
one at the mailing level that states, for each person, on which days they were emailed and on which day they converted.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
df_emailed
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that, for each person and each day from 1 to their max day, says whether they were emailed (0/1) and whether they converted (0/1). We assume a person converts on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is incredibly inefficient and sort of dumb. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high-level approach involves modifying df_summary (alias df2) to get our output. We'll need to
set_index on the days_max column of df2, and rename the index to day (which will help later on)
groupby to group on person
apply a reindex on the index (day), so we get a row for each day leading up to the last day
fillna to fill the NaNs in the convert column generated by the reindex
assign to create a dummy emailed column that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

def f(x):
    # Expand each person's Series to cover days 1..days_max
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)

# Set emailed to 1 for every (person, day) pair present in df1
df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and
df2 = df_summary
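For comparison, a sketch of the same result built by repeating df_summary into a person-by-day grid and merging the mailing days in (just one alternative to the nested loop; names as in the question):
grid = (df_summary.loc[df_summary.index.repeat(df_summary['days_max'])]
                  .assign(day=lambda d: d.groupby('person').cumcount() + 1))

df_final = (grid.merge(df_emailed.assign(emailed=1), on=['person', 'day'], how='left')
                .fillna({'emailed': 0})
                .assign(emailed=lambda d: d['emailed'].astype(int),
                        convert=lambda d: ((d['day'] == d['days_max']) & (d['convert'] == 1)).astype(int))
                [['person', 'day', 'emailed', 'convert']])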
I am looking for a way to eliminate KeyErrors caused by varying column names in the data that gets loaded. For example, I might have columns like
dummy_df = pd.DataFrame(np.random.randint(0,5,size=(5, 2)), columns=['Test','Test_v2'])
Test Test_v2
0 0 3
1 0 0
2 1 2
3 4 0
4 4 4
How can I do something like
dummy_df[ if_avail('Test') otherwise 'Test_v2']
It would be nice to be able to pass a list, where it checks for existence in item order.
I think you can check the column names and select the first matching column:
L = ['Test_v1','Test','Test_v2']
m = dummy_df.columns.isin(L)
first = dummy_df.columns[m].values[0]
s = dummy_df[first]
print (s)
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
Another solution is:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all').iloc[:, 0])
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
Explanation:
First reindex by list of columns names:
print (dummy_df.reindex(columns=L))
Test_v1 Test Test_v2
0 NaN 3 2
1 NaN 2 3
2 NaN 3 1
3 NaN 0 0
4 NaN 0 2
And remove all columns with all NaNs:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all'))
Test Test_v2
0 3 2
1 2 3
2 3 1
3 0 0
4 0 2
And last select first column by iloc:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all').iloc[:, 0])
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
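If you want the "check in item order" behaviour from the question as a reusable piece, here is a small sketch (the helper name first_available is hypothetical, not a pandas function):
def first_available(df, candidates):
    # Return the first column from `candidates` that exists in df, else raise
    for col in candidates:
        if col in df.columns:
            return df[col]
    raise KeyError(f'none of {candidates} found in columns')

s = first_available(dummy_df, ['Test_v1', 'Test', 'Test_v2'])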