In pandas DataFrame, how to add column showing random selection result? - python

I've seen everywhere how to randomly select DataFrame rows in pandas (with and without numpy). What I haven't found is how to add a column to a DataFrame that indicates whether a row was randomly selected. Specifically, I need to
1) group rows by values in column A
2) randomly select 10 rows in each group without replacement
3) add a column B to indicate whether each row was selected (TRUE/FALSE).
The result should be the original DataFrame (i.e., ungrouped) with an added column of TRUE/FALSE for every row (meaning, within its group, the row was selected during random selection).
I'm using python 3.6.2, pandas 0.20.3, numpy 1.13.1.
Edit in response to comments:
For this small sample of data, let's say we randomly select 2 rows without replacement per ImageType group. Yes, the sample does not have at least 2 rows of every ImageType; the dataset is kept small to avoid making a really long post.
The data looks like this (there are thousands of rows):
+-----------+---------------------+
| ImageType | FileName |
+-----------+---------------------+
| 9 | PIC_001_01_0_9.JPG |
| 9 | PIC_022_17_0_9.JPG |
| 38 | PIC_100_00_0_38.jpg |
| 9 | PIC_293_12_0_9.JPG |
| 9 | PIC_381_14_0_9.JPG |
| 33 | PIC_001_17_2_33.JPG |
| 9 | PIC_012_07_0_9.JPG |
| 28 | PIC_306_00_0_28.jpg |
| 28 | PIC_178_08_0_28.JPG |
| 26 | PIC_225_11_0_26.JPG |
| 18 | PIC_087_16_0_18.JPG |
| 9 | PIC_089_18_0_9.JPG |
| 19 | PIC_090_18_0_19.JPG |
| 9 | PIC_091_18_0_9.JPG |
| 19 | PIC_092_18_2_19.JPG |
| 23 | PIC_270_14_0_23.JPG |
| 13 | PIC_271_14_0_13.JPG |
+-----------+---------------------+
My actual code only reads from a .csv, but to recreate the sample data above:
import pandas as pd

df = pd.DataFrame({'ImageType': ['9', '9', '38', '9', '9', '33', '9', '28', '28', '26',
                                 '18', '9', '19', '9', '19', '23', '13'],
                   'FileName': ['PIC_001_01_0_9.JPG', 'PIC_022_17_0_9.JPG',
                                'PIC_100_00_0_38.jpg', 'PIC_293_12_0_9.JPG',
                                'PIC_381_14_0_9.JPG', 'PIC_001_17_2_33.JPG',
                                'PIC_012_07_0_9.JPG', 'PIC_306_00_0_28.jpg',
                                'PIC_178_08_0_28.JPG', 'PIC_225_11_0_26.JPG',
                                'PIC_087_16_0_18.JPG', 'PIC_089_18_0_9.JPG',
                                'PIC_090_18_0_19.JPG', 'PIC_091_18_0_9.JPG',
                                'PIC_092_18_2_19.JPG', 'PIC_270_14_0_23.JPG',
                                'PIC_271_14_0_13.JPG']})
# group by ImageType
# select 2 rows randomly in each group, without replacement
# add a column to original DataFrame to indicate selected rows

def get_sample(df, n=2):
    if len(df) <= n:
        df['Sampled'] = True
    else:
        s = df.sample(n=n)
        df['Sampled'] = df.apply(lambda x: x.name in s.index, axis=1)
    return df
grouped = df.groupby('ImageType')
new_df = grouped.apply(get_sample)
print(new_df)
FileName ImageType Sampled
0 PIC_001_01_0_9.JPG 9 False
1 PIC_022_17_0_9.JPG 9 False
2 PIC_100_00_0_38.jpg 38 True
3 PIC_293_12_0_9.JPG 9 True
4 PIC_381_14_0_9.JPG 9 False
5 PIC_001_17_2_33.JPG 33 True
6 PIC_012_07_0_9.JPG 9 False
7 PIC_306_00_0_28.jpg 28 True
8 PIC_178_08_0_28.JPG 28 True
9 PIC_225_11_0_26.JPG 26 True
10 PIC_087_16_0_18.JPG 18 True
11 PIC_089_18_0_9.JPG 9 True
12 PIC_090_18_0_19.JPG 19 True
13 PIC_091_18_0_9.JPG 9 False
14 PIC_092_18_2_19.JPG 19 True
15 PIC_270_14_0_23.JPG 23 True
16 PIC_271_14_0_13.JPG 13 True
If the number of rows in a group is less than or equal to the sample size, all of them are marked as sampled.
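As a possible variant (a sketch of my own, not part of the answer above), the membership test can be vectorized with Index.isin instead of a row-wise apply, and min(n, len(group)) covers groups smaller than the sample size:

def get_sample_isin(group, n=2):
    # Sample up to n rows; groups smaller than n are sampled in full.
    sampled = group.sample(n=min(n, len(group)))
    group = group.copy()
    # Vectorized membership test instead of a row-wise apply.
    group['Sampled'] = group.index.isin(sampled.index)
    return group

new_df = df.groupby('ImageType', group_keys=False).apply(get_sample_isin)
print(new_df)

The result should have the same shape as the answer above; only the way the Sampled flag is computed differs.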

Related

Finding frequency of items in cell of column pandas

I have a DataFrame with almost 500 rows and 3 columns.
One of the columns holds a string of dates in each cell; the dates within a cell are unique, but some dates appear in more than one cell, and some cells seem empty.
I'm trying to find the frequency of each date across the cells.
df:
   | Number_of_dates | Date
---+-----------------+---------------------------------------------------
 0 | 0.0             | []
 1 | 3.0             | ['2006-01-01' '2006-03-22' '2019-07-29']
 2 | 8.0             | ['2006-01-01' '2006-04-13' '2006-07-18' '2006-...
 3 | 1.0             | ['2006-07-18']
 4 | 1.0             | ['2019-07-29']
 5 | 0.0             | []
 6 | 397.0           | ['2019-01-02' '2019-01-03' '2019-01-04' '2019-...
Result:
df_1:
   | Date       | Frequency
---+------------+----------
 0 | 2006-01-01 | 2
 1 | 2006-03-22 | 1
 2 | 2006-04-13 | 1
 3 | 2006-07-18 | 2
 4 | 2019-07-29 | 3
It would be very helpful if you could provide some guidance.
Thanks in advance
Additional information:
I noticed that each cell holds a string value rather than a list.
Sample DataFrame
d = {"Date":[ "['2005-02-02' '2005-05-04' '2005-08-03' '2005-11-02' '2006-02-01' '2006-05-03']",
"['2006-01-31' '2006-02-01' '2006-03-16'\n '2006-06-13']",
"['2005-10-12' '2005-10-13' '2005-10-14'\n '2005-10-17']",
"[]",
"['2005-07-25' '2005-07-26' '2005-07-27'\n '2005-07-28' '2005-07-29' '2005-08-01' '2005-08-02' '2005-08-03'\n '2005-08-04' '2005-08-05']",
"['2005-03-15' '2005-03-16' '2005-03-17'\n '2005-03-18' '2005-03-21' '2005-03-22' '2005-03-23' '2005-03-24' \n'2005-03-28' '2005-03-29' '2005-03-30' '2005-03-31' '2005-04-01'\n '2005-04-04']",
"['2005-03-16' '2005-03-17' '2005-07-27'\n '2006-06-13']",
"['2005-02-02' '2005-05-04' '2005-03-16' '2005-03-17']",
"[]"
]
}
df = pd.DataFrame(d)
Use DataFrame.explode with GroupBy.size:
# create lists from the sample data strings
df['Date'] = df['Date'].str.strip('[]').str.split()
df_1 = df.explode('Date').groupby('Date').size().reset_index(name='Frequency')
print(df_1.head(10))
Date Frequency
0 '2005-02-02' 2
1 '2005-03-15' 1
2 '2005-03-16' 3
3 '2005-03-17' 3
4 '2005-03-18' 1
5 '2005-03-21' 1
6 '2005-03-22' 1
7 '2005-03-23' 1
8 '2005-03-24' 1
9 '2005-03-28' 1
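If the goal is to match the desired output exactly (dates without the surrounding quote characters), a small variant of the same idea could strip the quotes before exploding. This is my addition, not part of the answer above, and it starts again from the raw string column:

import pandas as pd

df = pd.DataFrame(d)  # the sample data from the question

# Strip the brackets and quote characters, then split into lists.
df['Date'] = (df['Date'].str.strip('[]')
                        .str.replace("'", "", regex=False)
                        .str.split())

df_1 = (df.explode('Date')
          .groupby('Date')
          .size()
          .reset_index(name='Frequency'))
print(df_1.head())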

Elegant way to read emacs org-mode tables into a python pandas dataframe

When working with python pandas I often like to create tables with emacs org-mode. To read the table I do something like
import pandas as pd
from numpy import *
D = pd.read_csv('file.dat',sep='|')
D = D.drop(D.columns[0], axis=1)
D = D.drop(D.columns[-1], axis=1)
D = D.rename(columns=lambda x: x.strip())
Is there a more elegant (in particular shorter) way to read the org-mode table into a pandas dataframe? Maybe there is also an elegant way to keep table and python source in the same org-file.
Here's an answer to the modified question (keeping the table and the source code in the Org mode file). I've stolen the pandas part from Quang Hoang's answer:
* foo
Here's a table:
#+NAME: foo
| a | b | c |
|----+-----+------|
| 1 | 1 | 1 |
| 2 | 4 | 8 |
| 3 | 9 | 27 |
| 4 | 16 | 64 |
| 5 | 25 | 125 |
| 6 | 36 | 216 |
| 7 | 49 | 343 |
| 8 | 64 | 512 |
| 9 | 81 | 729 |
| 10 | 100 | 1000 |
#+TBLFM: $2=pow($1, 2) :: $3 = pow($1, 3)
Here's a source block that initializes the variable `tbl' with the table `foo' above
and does some pandas things to it, as suggested by Quang Hoang in his answer.
To evaluate the code block, press `C-c C-c' inside it.
You will then get the result below:
#+begin_src python :var tbl=foo :results output
import pandas as pd
D = pd.DataFrame(tbl).iloc[:, 1:-1]
print(D)
#+end_src
#+RESULTS:
#+begin_example
1
0 1
1 4
2 9
3 16
4 25
5 36
6 49
7 64
8 81
9 100
#+end_example
See the Org manual for (much) more information about source blocks.
EDIT: To preserve the column names (the first row of the table), you can add :colnames no to the source block header. The column names are then available inside the source block as tbl[0], and you can pass them to the DataFrame constructor to label the columns, as follows (n.b. in contrast to the example above, the DataFrame here is the complete table; I just use a couple of different methods to select pieces of it to print out, including the D.c column access you asked about in a comment):
#+begin_src python :var tbl=foo :results output :colnames no
import pandas as pd
D = pd.DataFrame(tbl, columns=tbl[0])
print(D.c)
print("===========")
print(D.iloc[1:, 0:-1])
#+end_src
#+RESULTS:
#+begin_example
0 c
1 1
2 8
3 27
4 64
5 125
6 216
7 343
8 512
9 729
10 1000
Name: c, dtype: object
===========
a b
1 1 1
2 2 4
3 3 9
4 4 16
5 5 25
6 6 36
7 7 49
8 8 64
9 9 81
10 10 100
#+end_example
Why :colnames no (and not :colnames yes) is needed to add the column names is something that I have never understood: one of these days, I should post a question on the Org mode mailing list about it...
Try with
D = pd.read_csv('file.dat', sep='\s*\|\s*').iloc[:, 1:-1]
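One small caveat to that one-liner (my note, not part of the answer): a regex separator is only supported by the Python parser engine, so pandas falls back to it with a ParserWarning. Spelling out the engine and using a raw string makes that explicit:

import pandas as pd

# Regex separators require the Python parser engine; passing it explicitly
# avoids the ParserWarning about falling back from the C engine.
D = pd.read_csv('file.dat', sep=r'\s*\|\s*', engine='python').iloc[:, 1:-1]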

How do you identify which IDs have an increasing value over time in another column in a Python dataframe?

Let's say I have a data frame with 3 columns:
| id | value | date |
+====+=======+===========+
| 1 | 50 | 1-Feb-19 |
+----+-------+-----------+
| 1 | 100 | 5-Feb-19 |
+----+-------+-----------+
| 1 | 200 | 6-Jun-19 |
+----+-------+-----------+
| 1 | 500 | 1-Dec-19 |
+----+-------+-----------+
| 2 | 10 | 6-Jul-19 |
+----+-------+-----------+
| 3 | 500 | 1-Mar-19 |
+----+-------+-----------+
| 3 | 200 | 5-Apr-19 |
+----+-------+-----------+
| 3 | 100 | 30-Jun-19 |
+----+-------+-----------+
| 3 | 10 | 25-Dec-19 |
+----+-------+-----------+
ID column contains the ID of a particular person.
Value column contains the value of their transaction.
Date column contains the date of their transaction.
Is there a way in Python to identify ID 1 as the ID with increasing transaction values over time?
I'm looking for a way to extract ID 1 as my desired ID with increasing transaction values, filter out ID 2 because it doesn't have enough transactions to analyze a trend, and also filter out ID 3 because its transaction values decline over time.
Perhaps group by the id, and check that the sorted values are the same whether sorted by values or by date:
>>> df.groupby('id').apply( lambda x:
... (
... x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value']
... ).all()
... )
id
1 True
2 True
3 False
dtype: bool
EDIT:
To make id=2 not True, we can do this instead:
>>> df.groupby('id').apply( lambda x:
... (
... (x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value'])
... & (len(x) > 1)
... ).all()
... )
id
1 True
2 False
3 False
dtype: bool
import numpy as np

df['new'] = df.groupby(['id'])['value'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
              np.where(x.diff() < 0, 'decrease', '--')))
df = df.groupby('id').new.agg(['last'])
df
Output:
last
id
1 increase
2 --
3 decrease
Only increasing ID:
increasingList = df[(df['last']=='increase')].index.values
print(increasingList)
Result:
[1]
Assuming this won't happen
1 50
1 100
1 50
If so, then:
df['new'] = df.groupby(['id'])['value'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
              np.where(x.diff() < 0, 'decrease', '--')))
df
Output:
value new
id
1 50 --
1 100 increase
1 200 increase
2 10 --
3 500 --
3 300 decrease
3 100 decrease
Concat strings:
df = df.groupby(['id'])['new'].apply(lambda x: ','.join(x)).reset_index()
df
Intermediate Result:
id new
0 1 --,increase,increase
1 2 --
2 3 --,decrease,decrease
Check whether "decrease" appears in a row, or whether the row contains only "--", and drop those rows:
df = df.drop(df[df['new'].str.contains("dec")].index.values)
df = df.drop(df[(df['new']=='--')].index.values)
df
Result:
id new
0 1 --,increase,increase
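A possible alternative sketch (my own, not from the answers above): parse the date strings, sort within each id, and keep ids whose value series is strictly increasing and has more than one transaction. The date format string is an assumption based on the sample table.

import pandas as pd

# Assumes dates look like '1-Feb-19' as in the question's table.
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')

def strictly_increasing(values):
    # More than one transaction, non-decreasing, and no repeated values,
    # i.e. strictly increasing.
    return len(values) > 1 and values.is_monotonic_increasing and values.is_unique

increasing = (df.sort_values('date')
                .groupby('id')['value']
                .apply(strictly_increasing))
print(increasing[increasing].index.tolist())  # expected: [1]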

How to use a regular expression in pandas dataframe with different records in a column?

I'm stuck on a small problem with Python and regular expressions.
I have a pandas table whose records are constructed in different orders, see below.
+----------------------------------------------+
| Total |
+----------------------------------------------+
| Total Price: 4 x 2 = 8 |
| Total Price 200 Price_per_piece 10 Amount 20 |
+----------------------------------------------+
I want to split the records in the ‘Total’ column into 3 other columns, as below.
Do I first need to split the rows into 2 subsets and apply different regular expressions, or do you have other solutions/ideas?
+-------+-----------------+--------+
| Total | Price_per_piece | Amount |
+-------+-----------------+--------+
| 8 | 4 | 2 |
| 200 | 10 | 20 |
+-------+-----------------+--------+
Try this one:
dtotal = ({"Total":["Total Price: 4 x 2 = 8","Total Price 200 Price_per_piece 10 Amount 20"]})
dt = pd.DataFrame(dtotal)
data = []
for item in dt['Total']:
regex = re.findall(r"(\d+)\D+(\d+)\D+(\d+)",item)
regex = (map(list,regex))
data.append(list(map(int,list(regex)[0])))
dftotal = pd.DataFrame(data, columns=['Total','Price_per_piece','Amount'])
print(dftotal)
Output:
Total Price_per_piece Amount
0 4 2 8
1 200 10 20
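Note that the first row of that output does not line up with the desired layout (8 | 4 | 2), because in the "A x B = C" format the total comes last. A possible per-format sketch using str.extract is shown below; the mapping of the "x" format to Price_per_piece and Amount is my assumption, not part of the answer above.

import pandas as pd

dtotal = {"Total": ["Total Price: 4 x 2 = 8",
                    "Total Price 200 Price_per_piece 10 Amount 20"]}
dt = pd.DataFrame(dtotal)

# Format 1: "Total Price: <piece> x <amount> = <total>"
fmt1 = dt['Total'].str.extract(
    r'Total Price:\s*(?P<Price_per_piece>\d+)\s*x\s*(?P<Amount>\d+)\s*=\s*(?P<Total>\d+)')
# Format 2: "Total Price <total> Price_per_piece <piece> Amount <amount>"
fmt2 = dt['Total'].str.extract(
    r'Total Price\s+(?P<Total>\d+)\s+Price_per_piece\s+(?P<Price_per_piece>\d+)\s+Amount\s+(?P<Amount>\d+)')

# Prefer the first pattern where it matched, fall back to the second.
dftotal = fmt1.combine_first(fmt2)[['Total', 'Price_per_piece', 'Amount']].astype(int)
print(dftotal)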

Use pandas groupby.size() results for arithmetical operation

I've run into the following problem, which I'm stuck on and unfortunately cannot resolve by myself or with similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which indicates the ID of a user. The same user may have several entries in this data frame:
|   | userID | col2 | col3 |
+---+--------+------+------+
| 1 |      1 |    a |    b |
| 2 |      1 |    c |    d |
| 3 |      2 |    a |    a |
| 4 |      3 |    d |    e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I then want to use in another simple calculation, such as a division.
But when I try to save the result of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculation in a separate column?
Thanks for your help!
Edit:
To make clear what my output should look like: the upper DataFrame is my main data frame, so to speak. Besides this frame I have a second frame that looks like this:
|   | userID | value | value/appearances |
+---+--------+-------+-------------------+
| 1 |      1 |    10 | 10 / 2 = 5        |
| 3 |      2 |    20 | 20 / 1 = 20       |
| 4 |      3 |    30 | 30 / 1 = 30       |
So in the 'value/appearances' column I basically want the number in the value column divided by the number of appearances of that user in the main DataFrame. For the user with ID=1 this would be 10/2, as this user has a value of 10 and 2 rows in the main DataFrame.
I hope this makes it a bit clearer.
IIUC you want to do the following, groupby on 'userID' and call transform on the grouped column and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When this is assigned back to the df, it aligns on the df's index, so it introduces NaN for the last row:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
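For the division the question actually asks about (value / appearances), one possible sketch is to map the group sizes from the main frame onto the second frame. The layout of the second frame is assumed from the edit above:

import pandas as pd

# Main DataFrame from the question.
df = pd.DataFrame({'userID': [1, 1, 2, 3],
                   'col2': ['a', 'c', 'a', 'd'],
                   'col3': ['b', 'd', 'a', 'e']})

# Second frame with one value per user (layout assumed from the edit).
values = pd.DataFrame({'userID': [1, 2, 3], 'value': [10, 20, 30]})

# Number of rows per user in the main frame, mapped onto the second frame.
appearances = df.groupby('userID').size()
values['value/appearances'] = values['value'] / values['userID'].map(appearances)
print(values)
#    userID  value  value/appearances
# 0       1     10                5.0
# 1       2     20               20.0
# 2       3     30               30.0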
