Grouping by value in column and getting another column's value - Python

This is the seed DataSet:
In [1]: my_data = [{'client':'A','product_s_n':'1','status':'in_store','month':'Jan'},
{'client':'A','product_s_n':'1','status':'sending', 'month':'Feb'},
{'client':'A','product_s_n':'2','status':'in_store','month':'Jan'},
{'client':'A','product_s_n':'2','status':'in_store','month':'Feb'},
{'client':'B','product_s_n':'3','status':'in_store','month':'Jan'},
{'client':'B','product_s_n':'3','status':'sending', 'month':'Feb'},
{'client':'B','product_s_n':'4','status':'in_store','month':'Jan'},
{'client':'B','product_s_n':'4','status':'in_store','month':'Feb'},
{'client':'C','product_s_n':'5','status':'in_store','month':'Jan'},
{'client':'C','product_s_n':'5','status':'sending', 'month':'Feb'}]
df = pd.DataFrame(my_data)
df
Out[1]:
client month product_s_n status
0 A Jan 1 in_store
1 A Feb 1 sending
2 A Jan 2 in_store
3 A Feb 2 in_store
4 B Jan 3 in_store
5 B Feb 3 sending
6 B Jan 4 in_store
7 B Feb 4 in_store
8 C Jan 5 in_store
9 C Feb 5 sending
The question I want to ask of this data is: what is the client for each product_s_n? Given the data in this example, this is what the resulting DataFrame should look like (I need a new DataFrame as a result):
product_s_n client
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C
As you may have noticed, the 'status' and 'month' fields are only there to give the sample data some context and structure. I tried using groupby, with no success. Any ideas?
Thanks!

After calling df.groupby(['product_s_n']) you can restrict attention to a particular column by indexing with ['client']. You can then select the first value of client from each group by calling first().
>>> df.groupby(['product_s_n'])['client'].first()
product_s_n
1 A
2 A
3 B
4 B
5 C
Name: client, dtype: object
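Since the question asks for a new DataFrame rather than a Series, a small finishing step is needed. Either of the lines below should produce the expected two-column frame (a minimal sketch using only standard pandas):
>>> # same groupby, with product_s_n kept as a regular column
>>> df.groupby('product_s_n', as_index=False)['client'].first()
>>> # or: keep the two columns of interest and drop duplicate pairs
>>> df[['product_s_n', 'client']].drop_duplicates().reset_index(drop=True)
product_s_n client
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C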

Related

df.dropna() modifies row indices

I was trying to remove the rows with NaN values from a pandas DataFrame, and when I do so, I want the row identifiers to shift so that the identifiers in the new DataFrame start from 0 and increase by one each row. By identifiers I mean the numbers at the left of the following example. Note that this is not an actual column of my df; it is the index pandas places by default in every DataFrame.
If my Df is like:
name toy born
0 a 1 2020
1 na 2 2020
2 c 5 2020
3 na 1 2020
4 asf 1 2020
What I want after dropna():
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
This is what I get instead, which I don't want:
name toy born
0 a 1 2020
2 c 5 2020
4 asf 1 2020
You can simply add df.reset_index(drop=True)
By default, df.dropna and df.reset_index do not operate in place. Therefore, the complete answer would be as follows.
df = df.dropna().reset_index(drop=True)
Results and Explanations
The above code yields the following result.
>>> df = df.dropna().reset_index(drop=True)
>>> df
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
We use the argument drop=True so the old index is discarded rather than inserted as a new column. Otherwise, the result would look like this.
>>> df = df.dropna().reset_index()
>>> df
index name toy born
0 0 a 1 2020
1 2 c 5 2020
2 4 asf 1 2020
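As a side note, pandas 2.0 added an ignore_index parameter to dropna, so on sufficiently recent versions (older releases will reject the argument) the drop and the renumbering can be done in one call:
>>> df = df.dropna(ignore_index=True)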

How to convert a one-row, two-column dataframe into a multi-row, two-column dataframe

I am new to Python.
I have a dataframe with two columns. One is an ID column and the other holds the
year and count information related to that ID.
I want to convert this format into multiple rows with the same ID.
The current dataframe looks like:
ID information
1 2014:Total:0, 2015:Total:1, 2016:Total:2
2 2017:Total:3, 2018:Total:1, 2019:Total:2
I expect the converted dataframe should like this:
ID Year Value
1 2014 0
1 2015 1
1 2016 2
2 2017 3
2 2018 1
2 2019 2
I tried to use the pandas str.split method, but had no luck.
Any suggestions would be appreciated.
Let us use explode :-) (new in pandas 0.25.0):
# split the comma-separated string into one 'Year:Total:Value' entry per list element
df.information = df.information.str.split(', ')
# explode each list element into its own row, split on ':', and drop the middle 'Total' column
Yourdf = df[['ID']].join(df.information.explode().str.split(':', expand=True).drop(1, axis=1))
Yourdf
ID 0 2
0 1 2014 0
0 1 2015 1
0 1 2016 2
1 2 2017 3
1 2 2018 1
1 2 2019 2
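The integer column labels 0 and 2 are left over from str.split(expand=True). To match the expected output exactly, a small finishing step (not part of the original answer) renames them and renumbers the rows:
Yourdf = Yourdf.rename(columns={0: 'Year', 2: 'Value'}).reset_index(drop=True)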
Try using the below code; unlike @WenYoBen's answer, this works on much lower pandas versions as well:
df2 = pd.DataFrame(df['information'].str.split(', ', expand=True).apply(lambda x: x.str.split(':')).values.flatten().tolist(), columns=['Year', '', 'Value']).iloc[:, [0, 2]]
print(pd.DataFrame(sorted(df['ID'].tolist() * (len(df2) // len(df))), columns=['ID']).join(df2))
Output:
ID Year Value
0 1 2014 0
1 1 2015 1
2 1 2016 2
3 2 2017 3
4 2 2018 1
5 2 2019 2

Is there a way to have column values in a DataFrame update automatically when other columns are updated

Is there a way to have a column in a Dataframe update automatically when an original entry in the Dataframe is modified? Suppose I have the following:
dataset = {'A': [1, 2, 3]}
df = pd.DataFrame(dataset)
df['B'] = df['A'].cumprod()
In: df
Out[257]:
A B
0 1 1
1 2 2
2 3 6
If I change a value in A:
df.iloc[1,0]=4
Column B does not change.
In: df
Out[260]:
A B
0 1 1
1 4 2
2 3 6
I am wondering whether there is some way to define B so that I would get:
Out[260]:
A B
0 1 1
1 4 4
2 3 12
I saw an answer on a thread from 2013 that said this functionality would be added, but I can't seem to find any other documentation on it. I have tried using pd.eval, but that doesn't seem to have this functionality either.
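As far as I know, pandas has no built-in notion of reactive or computed columns, so a derived column has to be recomputed after its source changes. One common workaround is to route updates through a small helper that recomputes the dependent column; the helper below is a hypothetical sketch, not a pandas API:
import pandas as pd

def set_value(df, row, col, value):
    # hypothetical helper: update one cell, then recompute the derived column
    df.iloc[row, df.columns.get_loc(col)] = value
    df['B'] = df['A'].cumprod()

df = pd.DataFrame({'A': [1, 2, 3]})
df['B'] = df['A'].cumprod()
set_value(df, 1, 'A', 4)
# df['B'] is now [1, 4, 12], recomputed from the updated A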

In pandas, how to display the most frequent diagnoses in a dataframe, but only count 1 occurrence of the same diagnosis per patient

In pandas and Python:
I have a large dataset of health records in which patients have records of diagnoses.
How can I display the most frequent diagnoses, counting at most one occurrence of the same diagnosis per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to keep using the .isin / .index pattern if possible.
For example, this removes all rows whose 'code' appears fewer than 3 times:
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows whose 'code' is held by fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
Since you mentioned value_counts:
df.groupby('code').pid.value_counts().count(level=0)
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
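Note that Series.count(level=...) was removed in pandas 2.0; on recent versions the same idea can be expressed with a second groupby on the index level (worth verifying against your pandas release):
df.groupby('code').pid.value_counts().groupby(level=0).count()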
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
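One more option, not taken from the original answers but plain pandas: deduplicate the (pid, code) pairs first, after which an ordinary value_counts counts each diagnosis at most once per patient and slots straight into the .isin / .index idiom from the question:
# count each (patient, diagnosis) pair only once
s = df.drop_duplicates(['pid', 'code'])['code'].value_counts()
# the original filtering idiom still works on this Series
df = df[df['code'].isin(s[s.ge(3)].index)]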

Python pandas join on with overwrite

I realize this question is similar to join or merge with overwrite in pandas, but the accepted answer does not work for me, since I want to use the on='keys' argument of df.join().
I have a DataFrame df which looks like this:
keys values
0 0 0.088344
1 0 0.088344
2 0 0.088344
3 0 0.088344
4 0 0.088344
5 1 0.560857
6 1 0.560857
7 1 0.560857
8 2 0.978736
9 2 0.978736
10 2 0.978736
11 2 0.978736
12 2 0.978736
13 2 0.978736
14 2 0.978736
Then I have a Series s (the result of some df.groupby(...).apply()) with the same keys:
keys
0 0.183328
1 0.239322
2 0.574962
Name: new_values, dtype: float64
Basically I want to replace the 'values' in the df with the values in the Series, by keys so every keys block gets the same new value. Currently, I do it as follows:
df = df.join(s, on='keys')
df['values'] = df['new_values']
df = df.drop('new_values', axis=1)
The obtained (and desired) result is then:
keys values
0 0 0.183328
1 0 0.183328
2 0 0.183328
3 0 0.183328
4 0 0.183328
5 1 0.239322
6 1 0.239322
7 1 0.239322
8 2 0.574962
9 2 0.574962
10 2 0.574962
11 2 0.574962
12 2 0.574962
13 2 0.574962
14 2 0.574962
That is, I join it as a new column, and by using on='keys' it gets the correct shape. Then I assign 'values' to be 'new_values' and drop the 'new_values' column. This of course works perfectly; the only problem is that I find it extremely ugly.
Is there a better way to do this?
You could try something like:
df = df[df.columns[df.columns!='values']].join(s, on='keys')
Make sure s is named 'values' instead of 'new_values'.
To my knowledge, pandas doesn't have the ability to join with "force overwrite" or "overwrite with warning".
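Another option, not from the original answer but standard pandas: skip the join entirely. Since s is indexed by the key values, Series.map can look up each row's new value and overwrite the column in a single assignment:
# overwrite 'values' by looking up each row's key in s
df['values'] = df['keys'].map(s)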
