I've seen everywhere how to randomly select DataFrame rows in pandas (with and without numpy). What I haven't found is how to add a column to a DataFrame that indicates whether a row was randomly selected. Specifically, I need to
1) group rows by values in column A
2) randomly select 10 rows in each group without replacement
3) add a column B to indicate whether each row was selected (TRUE/FALSE).
The result should be the original DataFrame (i.e., ungrouped) with an added TRUE/FALSE column for every row, indicating whether that row was selected during the random sampling within its group.
I'm using python 3.6.2, pandas 0.20.3, numpy 1.13.1.
Edit in response to comments:
For this small sample of data, let's say we randomly select 2 rows without replacement per group when grouping by ImageType. Yes, the data sample does not have at least 2 rows of every ImageType; the dataset is kept small to avoid a really long post.
The data looks like this (there are thousands of rows):
+-----------+---------------------+
| ImageType | FileName |
+-----------+---------------------+
| 9 | PIC_001_01_0_9.JPG |
| 9 | PIC_022_17_0_9.JPG |
| 38 | PIC_100_00_0_38.jpg |
| 9 | PIC_293_12_0_9.JPG |
| 9 | PIC_381_14_0_9.JPG |
| 33 | PIC_001_17_2_33.JPG |
| 9 | PIC_012_07_0_9.JPG |
| 28 | PIC_306_00_0_28.jpg |
| 28 | PIC_178_08_0_28.JPG |
| 26 | PIC_225_11_0_26.JPG |
| 18 | PIC_087_16_0_18.JPG |
| 9 | PIC_089_18_0_9.JPG |
| 19 | PIC_090_18_0_19.JPG |
| 9 | PIC_091_18_0_9.JPG |
| 19 | PIC_092_18_2_19.JPG |
| 23 | PIC_270_14_0_23.JPG |
| 13 | PIC_271_14_0_13.JPG |
+-----------+---------------------+
My actual code only reads from a .csv, but to recreate the sample data above:
import pandas as pd
df = pd.DataFrame({'ImageType': ['9','9','38','9','9','33','9','28','28','26',
'18','9','19','9','19','23','13'],
'FileName': ['PIC_001_01_0_9.JPG','PIC_022_17_0_9.JPG',
'PIC_100_00_0_38.jpg','PIC_293_12_0_9.JPG',
'PIC_381_14_0_9.JPG','PIC_001_17_2_33.JPG',
'PIC_012_07_0_9.JPG','PIC_306_00_0_28.jpg',
'PIC_178_08_0_28.JPG','PIC_225_11_0_26.JPG',
'PIC_087_16_0_18.JPG','PIC_089_18_0_9.JPG',
'PIC_090_18_0_19.JPG','PIC_091_18_0_9.JPG',
'PIC_092_18_2_19.JPG','PIC_270_14_0_23.JPG',
'PIC_271_14_0_13.JPG']})
# group by ImageType
# select 2 rows randomly in each group, without replacement
# add a column to original DataFrame to indicate selected rows
def get_sample(df, n=2):
    # groups with at most n rows are marked as fully sampled
    if len(df) <= n:
        df['Sampled'] = True
    else:
        # mark the randomly drawn rows by checking membership in the sample's index
        s = df.sample(n=n)
        df['Sampled'] = df.apply(lambda x: x.name in s.index, axis=1)
    return df
grouped = df.groupby('ImageType')
new_df = grouped.apply(get_sample)
print(new_df)
FileName ImageType Sampled
0 PIC_001_01_0_9.JPG 9 False
1 PIC_022_17_0_9.JPG 9 False
2 PIC_100_00_0_38.jpg 38 True
3 PIC_293_12_0_9.JPG 9 True
4 PIC_381_14_0_9.JPG 9 False
5 PIC_001_17_2_33.JPG 33 True
6 PIC_012_07_0_9.JPG 9 False
7 PIC_306_00_0_28.jpg 28 True
8 PIC_178_08_0_28.JPG 28 True
9 PIC_225_11_0_26.JPG 26 True
10 PIC_087_16_0_18.JPG 18 True
11 PIC_089_18_0_9.JPG 9 True
12 PIC_090_18_0_19.JPG 19 True
13 PIC_091_18_0_9.JPG 9 False
14 PIC_092_18_2_19.JPG 19 True
15 PIC_270_14_0_23.JPG 23 True
16 PIC_271_14_0_13.JPG 13 True
If a group has no more rows than the sample size, all of its rows are marked as sampled.
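A shorter variant of the same idea, as a sketch (starting from the original df, and assuming its default integer index is unique as in the example): draw the sample within each group once, then mark membership with Index.isin instead of the row-wise apply.
import pandas as pd

# sample at most 2 rows per ImageType; min() covers groups smaller than 2
sampled = df.groupby('ImageType', group_keys=False).apply(
    lambda g: g.sample(n=min(len(g), 2)))
df['Sampled'] = df.index.isin(sampled.index)
print(df)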
I have a DataFrame with almost 500 rows and 3 columns.
One of the columns holds a string of dates in each cell; some dates appear in more than one cell and some cells are empty.
I'm trying to find the frequency of each date across all cells.
df
   | Number_of_dates | Date
---+-----------------+---------------------------------------------------
 0 | 0.0             | []
 1 | 3.0             | ['2006-01-01' '2006-03-22' '2019-07-29']
 2 | 8.0             | ['2006-01-01' '2006-04-13' '2006-07-18' '2006-...
 3 | 1.0             | ['2006-07-18']
 4 | 1.0             | ['2019-07-29']
 5 | 0.0             | []
 6 | 397.0           | ['2019-01-02' '2019-01-03' '2019-01-04' '2019-...
Result:
df_1
   | Date       | Frequency
---+------------+-----------
 0 | 2006-01-01 | 2
 1 | 2006-03-22 | 1
 2 | 2006-04-13 | 1
 3 | 2006-07-18 | 2
 4 | 2019-07-29 | 3
It would be very helpful if you could provide some guidance.
Thanks in advance
Additional information:
I noticed that each cell holds a string value instead of a list.
Sample DataFrame
d = {"Date":[ "['2005-02-02' '2005-05-04' '2005-08-03' '2005-11-02' '2006-02-01' '2006-05-03']",
"['2006-01-31' '2006-02-01' '2006-03-16'\n '2006-06-13']",
"['2005-10-12' '2005-10-13' '2005-10-14'\n '2005-10-17']",
"[]",
"['2005-07-25' '2005-07-26' '2005-07-27'\n '2005-07-28' '2005-07-29' '2005-08-01' '2005-08-02' '2005-08-03'\n '2005-08-04' '2005-08-05']",
"['2005-03-15' '2005-03-16' '2005-03-17'\n '2005-03-18' '2005-03-21' '2005-03-22' '2005-03-23' '2005-03-24' \n'2005-03-28' '2005-03-29' '2005-03-30' '2005-03-31' '2005-04-01'\n '2005-04-04']",
"['2005-03-16' '2005-03-17' '2005-07-27'\n '2006-06-13']",
"['2005-02-02' '2005-05-04' '2005-03-16' '2005-03-17']",
"[]"
]
}
df = pd.DataFrame(d)
Use DataFrame.explode with GroupBy.size:
#create list from sample data
df['Date'] = df['Date'].str.strip('[]').str.split()
df_1 = df.explode('Date').groupby('Date').size().reset_index(name='Frequency')
print (df_1.head(10))
Date Frequency
0 '2005-02-02' 2
1 '2005-03-15' 1
2 '2005-03-16' 3
3 '2005-03-17' 3
4 '2005-03-18' 1
5 '2005-03-21' 1
6 '2005-03-22' 1
7 '2005-03-23' 1
8 '2005-03-24' 1
9 '2005-03-28' 1
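The date strings above still carry their surrounding quote characters. If plain dates are wanted, a small extension of the same approach (a sketch, assuming you start again from the raw strings in the sample DataFrame rather than the lists created above) is to strip the quotes before splitting:
# strip the brackets and quotes, split into lists, then count as before
df['Date'] = (df['Date'].str.strip('[]')
                        .str.replace("'", "")
                        .str.split())
df_1 = df.explode('Date').groupby('Date').size().reset_index(name='Frequency')
print(df_1.head())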
When working with python pandas I often like to create tables with emacs org-mode. To read the table I do something like
import pandas as pd
from numpy import *
D = pd.read_csv('file.dat',sep='|')
D = D.drop(D.columns[0], axis=1)
D = D.drop(D.columns[-1], axis=1)
D = D.rename(columns=lambda x: x.strip())
Is there a more elegant (in particular shorter) way to read the org-mode table into a pandas dataframe? Maybe there is also an elegant way to keep table and python source in the same org-file.
Here's an answer to the modified question (keeping the table and the source code in the Org mode file). I've stolen the pandas part from Quang Hoang's answer:
* foo
Here's a table:
#+NAME: foo
| a | b | c |
|----+-----+------|
| 1 | 1 | 1 |
| 2 | 4 | 8 |
| 3 | 9 | 27 |
| 4 | 16 | 64 |
| 5 | 25 | 125 |
| 6 | 36 | 216 |
| 7 | 49 | 343 |
| 8 | 64 | 512 |
| 9 | 81 | 729 |
| 10 | 100 | 1000 |
#+TBLFM: $2=pow($1, 2) :: $3 = pow($1, 3)
Here's a source block that initializes the variable `tbl' with the table `foo' above
and does some pandas things to it, as suggested by Quang Hoang in his answer.
To evaluate the code block, press `C-c C-c' in the code block.
You will then get the result below:
#+begin_src python :var tbl=foo :results output
import pandas as pd
D = pd.DataFrame(tbl).iloc[:, 1:-1]
print(D)
#+end_src
#+RESULTS:
#+begin_example
1
0 1
1 4
2 9
3 16
4 25
5 36
6 49
7 64
8 81
9 100
#+end_example
See the Org manual for (much) more information about source blocks.
EDIT: To preserve the column names (the first row of the table), you can add :colnames no to the source block header. The column names are then available inside the source block as tbl[0], and you can pass that to the DataFrame constructor to label the columns, as follows. (N.b. in contrast to the above, the DataFrame here is the complete table; I just use a couple of different methods to select pieces of it to print out, including the D.c attribute access you asked about in a comment.)
#+begin_src python :var tbl=foo :results output :colnames no
import pandas as pd
D = pd.DataFrame(tbl, columns=tbl[0])
print(D.c)
print("===========")
print(D.iloc[1:, 0:-1])
#+end_src
#+RESULTS:
#+begin_example
0 c
1 1
2 8
3 27
4 64
5 125
6 216
7 343
8 512
9 729
10 1000
Name: c, dtype: object
===========
a b
1 1 1
2 2 4
3 3 9
4 4 16
5 5 25
6 6 36
7 7 49
8 8 64
9 9 81
10 10 100
#+end_example
Why :colnames no (and not :colnames yes) is needed to add the column names is something that I have never understood: one of these days, I should post a question on the Org mode mailing list about it...
Try with:
D = pd.read_csv('file.dat', sep=r'\s*\|\s*', engine='python').iloc[:, 1:-1]
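For context, a minimal sketch of how this replaces the original drop/rename steps (assuming the org table is saved verbatim to file.dat and contains no horizontal rule lines, which would otherwise come through as junk rows):
import pandas as pd

# the regex separator swallows the padding spaces around each '|', so the
# column names come out already stripped; a regex sep also implies the
# python parser engine
D = pd.read_csv('file.dat', sep=r'\s*\|\s*', engine='python')

# the leading and trailing '|' on each row produce two empty, unnamed
# columns; drop them
D = D.iloc[:, 1:-1]
print(D.columns.tolist())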
Let's say I have a data frame with 3 columns:
| id | value | date |
+====+=======+===========+
| 1 | 50 | 1-Feb-19 |
+----+-------+-----------+
| 1 | 100 | 5-Feb-19 |
+----+-------+-----------+
| 1 | 200 | 6-Jun-19 |
+----+-------+-----------+
| 1 | 500 | 1-Dec-19 |
+----+-------+-----------+
| 2 | 10 | 6-Jul-19 |
+----+-------+-----------+
| 3 | 500 | 1-Mar-19 |
+----+-------+-----------+
| 3 | 200 | 5-Apr-19 |
+----+-------+-----------+
| 3 | 100 | 30-Jun-19 |
+----+-------+-----------+
| 3 | 10 | 25-Dec-19 |
+----+-------+-----------+
The ID column contains the ID of a particular person.
The Value column contains the value of their transaction.
The Date column contains the date of their transaction.
Is there a way in Python to identify ID 1 as the ID whose transaction values increase over time?
I'm looking for a way to extract ID 1 as my desired ID with increasing transaction values, filter out ID 2 because it doesn't have enough transactions to analyze a trend, and also filter out ID 3 because its trend of transactions is declining over time.
Perhaps group by the id, and check that the sorted values are the same whether sorted by values or by date:
>>> df.groupby('id').apply( lambda x:
... (
... x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value']
... ).all()
... )
id
1 True
2 True
3 False
dtype: bool
EDIT:
To make id=2 not True, we can do this instead:
>>> df.groupby('id').apply( lambda x:
... (
... (x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value'])
... & (len(x) > 1)
... ).all()
... )
id
1 True
2 False
3 False
dtype: bool
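A variant of the same idea, as a sketch (assuming the date column is parsed as datetimes; the format string below matches the sample data and is otherwise an assumption): sort by date once and ask whether each group's values are monotonically increasing, requiring at least two transactions.
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')  # e.g. '1-Feb-19'
increasing = (df.sort_values('date')
                .groupby('id')['value']
                .apply(lambda s: len(s) > 1 and s.is_monotonic_increasing))
print(increasing[increasing].index.tolist())   # expected: [1]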
import numpy as np

df['new'] = df.groupby(['id'])['value'].transform(lambda x: \
                np.where(x.diff() > 0, 'increase',
                np.where(x.diff() < 0, 'decrease', '--')))
df = df.groupby('id').new.agg(['last'])
df
Output:
last
id
1 increase
2 --
3 decrease
Only increasing ID:
increasingList = df[(df['last']=='increase')].index.values
print(increasingList)
Result:
[1]
Assuming this won't happen (a value that rises and then falls again within the same id):
1 50
1 100
1 50
If so, then:
df['new'] = df.groupby(['id'])['value'].transform(lambda x: \
                np.where(x.diff() > 0, 'increase',
                np.where(x.diff() < 0, 'decrease', '--')))
df
Output:
value new
id
1 50 --
1 100 increase
1 200 increase
2 10 --
3 500 --
3 300 decrease
3 100 decrease
Concat strings:
df = df.groupby(['id'])['new'].apply(lambda x: ','.join(x)).reset_index()
df
Intermediate Result:
id new
0 1 --,increase,increase
1 2 --
2 3 --,decrease,decrease
Check whether "decrease" appears in a row, or whether the row is only "--", and drop those rows:
df = df.drop(df[df['new'].str.contains("dec")].index.values)
df = df.drop(df[(df['new']=='--')].index.values)
df
Result:
id new
0 1 --,increase,increase
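The same filtering can be written more compactly, as a sketch along the lines of the steps above (starting again from the original data frame, and assuming, as in the answer, that rows are already ordered by date within each id and that at least two transactions are required):
def strictly_increasing(values):
    # differences between consecutive transactions; every step must be positive
    diffs = values.diff().dropna()
    return len(diffs) > 0 and (diffs > 0).all()

mask = df.groupby('id')['value'].apply(strictly_increasing)
print(mask[mask].index.values)   # expected: [1]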
I'm stuck on a little problem with Python and regular expressions.
I have a pandas table whose records are constructed in different orders, see below.
+----------------------------------------------+
| Total |
+----------------------------------------------+
| Total Price: 4 x 2 = 8 |
| Total Price 200 Price_per_piece 10 Amount 20 |
+----------------------------------------------+
I want to separate the records in the ‘Total’ column into 3 other columns like below.
Do I first need to split the rows into 2 subsets and apply different regular expressions, or do you have other solutions/ideas?
+-------+-----------------+--------+
| Total | Price_per_piece | Amount |
+-------+-----------------+--------+
| 8 | 4 | 2 |
| 200 | 10 | 20 |
+-------+-----------------+--------+
Try this one:
dtotal = ({"Total":["Total Price: 4 x 2 = 8","Total Price 200 Price_per_piece 10 Amount 20"]})
dt = pd.DataFrame(dtotal)
data = []
for item in dt['Total']:
regex = re.findall(r"(\d+)\D+(\d+)\D+(\d+)",item)
regex = (map(list,regex))
data.append(list(map(int,list(regex)[0])))
dftotal = pd.DataFrame(data, columns=['Total','Price_per_piece','Amount'])
print(dftotal)
Output:
Total Price_per_piece Amount
0 4 2 8
1 200 10 20
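Note that the two record formats list the numbers in different orders, so a single blind capture mislabels the first row (in the output above, 4 ends up under Total although the desired result was 8). A sketch that reorders the captures per format (assuming '=' only appears in the 'a x b = c' style of record) could look like this:
import re
import pandas as pd

dt = pd.DataFrame({"Total": ["Total Price: 4 x 2 = 8",
                             "Total Price 200 Price_per_piece 10 Amount 20"]})

def parse_total(text):
    # grab the three numbers in order of appearance
    a, b, c = map(int, re.findall(r"\d+", text))
    if "=" in text:
        # "Total Price: <piece> x <amount> = <total>"
        return {"Total": c, "Price_per_piece": a, "Amount": b}
    # "Total Price <total> Price_per_piece <piece> Amount <amount>"
    return {"Total": a, "Price_per_piece": b, "Amount": c}

dftotal = pd.DataFrame([parse_total(t) for t in dt["Total"]])
print(dftotal)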
I've run into the following problem, which I'm stuck on and unfortunately cannot resolve by myself or through similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which indicates the ID of a user. The same user may have several entries in this data frame:
|   | userID | col2 | col3 |
+---+--------+------+------+
| 1 | 1      | a    | b    |
| 2 | 1      | c    | d    |
| 3 | 2      | a    | a    |
| 4 | 3      | d    | e    |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I then want to use in another simple calculation, such as a division.
But when I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
Edit:
To make clear how my output should look: the upper dataframe is my main data frame, so to speak. Besides this frame I have a second frame that looks like this:
|   | userID | value | value/appearances |
+---+--------+-------+-------------------+
| 1 | 1      | 10    | 10 / 2 = 5        |
| 3 | 2      | 20    | 20 / 1 = 20       |
| 4 | 3      | 30    | 30 / 1 = 30       |
So I basically want the 'value/appearances' column to hold the number in the value column divided by the number of appearances of that user in the main dataframe. For the user with ID=1 this would be 10/2, as this user has a value of 10 and 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID', call transform on the grouped column, and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, the result (which is indexed by userID) aligns with the df's index instead, so the counts land on the wrong rows and NaN is introduced for the last row:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
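To also get the second frame from the question (value divided by the number of appearances), a minimal sketch (the frames and the hypothetical value column are reconstructed here from the question's example numbers):
import pandas as pd

# main frame with repeated userIDs
df = pd.DataFrame({'userID': [1, 1, 2, 3],
                   'col2': ['a', 'c', 'a', 'd'],
                   'col3': ['b', 'd', 'a', 'e']})

# second frame with one value per user
df2 = pd.DataFrame({'userID': [1, 2, 3], 'value': [10, 20, 30]})

# count appearances per user in the main frame, then align on userID via map
appearances = df['userID'].value_counts()
df2['value/appearances'] = df2['value'] / df2['userID'].map(appearances)
print(df2)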