I have the following dataframe, cr_df, which shows the rate at which ID1 converts to ID2
ID1 ID2 Conversion Rate
0 1 A 0.046562
1 1 B 0.315975
2 1 C 0.577998
3 1 D 0.059465
4 2 A 0.6
5 2 B 0.4
Then I have another dataframe, raw_df, in the format of ID1 such as:
ID1 Value
0 1 100
1 2 200
My goal is to output a dataframe final_df, in the ID2 format that looks something like:
ID2 Value
0 C 100
1 A 200
Here the mapping from ID1 consists of drawing a random value between 0 and 1 and picking the ID2 based on the conversion rates.
How can I achieve this in pandas? (Do I need to use .apply?)
Given this setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'ID1': [1]*4+[2]*2, 'ID2':list('ABCDAB'),
'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1,2], 'Value':[100, 200]})
you could define a function random_id2:
def random_id2(x):
    return np.random.choice(x['ID2'], p=x['Conversion Rate'].values)
and use groupby/apply:
id2 = df.groupby(['ID1']).apply(random_id2)
to obtain the Series
ID1
1 C
2 A
dtype: object
You could then build final_df by mapping raw_df['ID1'] values to id2 values:
final_df = raw_df.copy()
final_df['ID1'] = final_df['ID1'].map(id2)
final_df = final_df.rename(columns={'ID1': 'ID2'})
import numpy as np
import pandas as pd
df = pd.DataFrame({
'ID1': [1]*4+[2]*2, 'ID2':list('ABCDAB'),
'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1,2], 'Value':[100, 200]})
def random_id2(x):
    return np.random.choice(x['ID2'], p=x['Conversion Rate'].values)
id2 = df.groupby(['ID1']).apply(random_id2)
final_df = raw_df.copy()
final_df['ID1'] = final_df['ID1'].map(id2)
final_df = final_df.rename(columns={'ID1': 'ID2'})
print(final_df)
yields
ID2 Value
0 C 100
1 A 200
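Note that the groupby/apply approach draws one ID2 per unique ID1, so every raw_df row sharing an ID1 gets the same mapping. If you instead want an independent draw for each raw_df row, a minimal sketch (reusing df and raw_df from the setup above):
# build a lookup of (ID2 labels, probabilities) for each ID1
groups = {k: (g['ID2'].values, g['Conversion Rate'].values)
          for k, g in df.groupby('ID1')}
final_df = raw_df.copy()
final_df['ID2'] = final_df['ID1'].apply(
    lambda k: np.random.choice(groups[k][0], p=groups[k][1]))
final_df = final_df[['ID2', 'Value']]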
You can do a combination of the following (a sketch combining the steps follows below):
To make a weighted random choice of the rows, make a weighted selection of range(len(df)) with the weights given by df['Conversion Rate'], e.g. with np.random.choice.
To select the rows with the given indices, use .iloc.
To join the resulting dataframe with the second one, use merge.
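A minimal sketch combining those steps, assuming cr_df and raw_df hold the tables from the question:
import numpy as np
import pandas as pd

def pick_id2(group):
    # weighted selection of a positional row index within the group
    i = np.random.choice(len(group), p=group['Conversion Rate'].values)
    return group['ID2'].iloc[i]

picks = cr_df.groupby('ID1').apply(pick_id2).rename('ID2').reset_index()
final_df = pd.merge(raw_df, picks, on='ID1')[['ID2', 'Value']]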
Related
I would like to give a name to groups of columns and rows in my Pandas DataFrame to achieve the same result as a merged Excel table:
However, I can't find any way to give an overarching name to groups of columns/rows like what is shown.
I tried wrapping the tables in an array, but the dataframes don't display:
import numpy as np
import pandas as pd
labels = ['a', 'b', 'c']
df = pd.DataFrame(np.ones((3,3)), index=labels, columns=labels)
labeledRowsCols = pd.DataFrame([df, df])
labeledRowsCols = pd.DataFrame(labeledRowsCols.T, index=['actual'], columns=['predicted 1', 'predicted 2'])
print(labeledRowsCols)
predicted 1 predicted 2
actual NaN NaN
You can set hierarchical indices for both the rows and columns.
import pandas as pd
df = pd.DataFrame([[3,1,0,3,1,0],[0,3,0,0,3,0],[2,1,3,2,1,3]])
col_ix = pd.MultiIndex.from_product([['Predicted: Set 1', 'Predicted: Set 2'], list('abc')])
row_ix = pd.MultiIndex.from_product([['True label'], list('abc')])
df = df.set_index(row_ix)
df.columns = col_ix
df
# returns:
Predicted: Set 1 Predicted: Set 2
a b c a b c
True label a 3 1 0 3 1 0
b 0 3 0 0 3 0
c 2 1 3 2 1 3
Exporting this to Excel should have the merged cells as in your example.
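For example, DataFrame.to_excel merges the MultiIndex header cells when merge_cells=True, which is the default (a sketch; the filename is just a placeholder, and an Excel writer engine such as openpyxl must be installed):
df.to_excel('confusion_matrix.xlsx', merge_cells=True)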
I have a dataframe df with 3 columns: A is an object id, B is a flag, and C is a value measured on object A with flag B.
I want to compute the average value of C grouped by [A, B] and store the results as three new columns:
C0: meanC (or NaN) when B = 0
C1: meanC (or NaN) when B = 1
C2: meanC (or NaN) when B = 2
Below there's an example of how I am trying to transform the dataframe df into res.
import numpy as np
import pandas as pd
df = pd.DataFrame({
"A":[0,0,0,0,0,1,2,2,3,3,3],
"B":[0,1,2,0,1,2,0,2,0,1,1],
"C":[.654,.123,1.45,6.1,0.322,1.77,9.234,2.54,1,6.77,6.438]})
grouped = df.groupby(["A","B"]).agg("mean")
# how to transform grouped into res?
res = pd.DataFrame({
"A":[0,1,2,3],
"C0":[3.377,np.nan,9.234,1],
"C1":[0.2225,np.nan,np.nan,6.604],
"C2":[1.45,1.77,2.54,np.nan]})
Add unstack with add_prefix:
res = df.groupby(["A","B"])['C'].mean().unstack().add_prefix('C').reset_index()
Or use pivot_table with default mean aggregate function:
res = df.pivot_table(index="A",columns="B",values='C').add_prefix('C').reset_index()
print (res)
B A C0 C1 C2
0 0 3.377 0.2225 1.45
1 1 NaN NaN 1.77
2 2 9.234 NaN 2.54
3 3 1.000 6.6040 NaN
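To answer the literal question of transforming the already-computed grouped into res, the same chain works on it directly:
# grouped has an (A, B) MultiIndex; unstack moves level B into the columns
res = grouped['C'].unstack().add_prefix('C').reset_index()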
I have a fairly large dataframe, df2 (~50,000 rows x 2,000 columns). The column headings are sample names. Separately, I have a dataframe, df1, with the list of samples I want to include in my analysis as its index. I want to use the list of samples from the df1 index to select only the columns of df2 for those samples, discarding the rest. I also want to preserve the sample order from the df1 index.
Example data:
# df1
data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'],
'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
'Year':[2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
df1.set_index('Sample')
# df2
data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
'Sample_A': [0,1,0,0,1],
'Sample_B':[0,0,1,0,0],
'Sample_C':[1,0,0,0,1],
'Sample_D':[0,0,1,1,0]}
df2 = pd.DataFrame(data2)
df2.set_index('Num')
First I generate the list of samples I want from the index of df1, e.g.
samples = df1['Sample'].tolist()
'samples' is then,
['Sample_A', 'Sample_D', 'Sample_E']
And using 'samples', my desired output dataframe, df3, should look like:
index Sample_A Sample_D
Value_1 0 0
Value_2 1 0
Value_3 0 1
Value_4 0 1
Value_5 1 0
But if I use
df3 = df2[samples]
Then I get the error message:
"['Sample_E'] not in index"
So how do I ignore samples that are not found in df2 to avoid this error message?
UPDATE
The solution that worked -
# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# Only include samples that are found in df2 as well
final_samples = list(set(df2.columns) & set(samples))
# Make new df with columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
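One caveat: set intersection does not preserve the sample order from the df1 index. If order matters, a list comprehension keeps it; a small sketch:
# keep only the samples present in df2, in the original order
final_samples = [s for s in samples if s in df2.columns]
df3 = df2.loc[:, final_samples]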
If your data comes from a CSV file in the first place, you can select just those columns at read time:
df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
To select all of the rows and only some of the columns, use a single colon for the row indexer:
>>> df.loc[:, ['Sample_A','Sample_D']]
Applied to the dataset you provided:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
Sample_A Sample_D
Num
Value_1 0 0
Value_2 1 0
Value_3 0 1
Value_4 0 1
Value_5 1 0
=====================================
Note: in older pandas versions, .loc with labels missing from the columns produced NaN columns as shown below; pandas 1.0+ raises the "not in index" KeyError you saw, which is why reindex is the safer choice:
>>> df3 = df2.loc[:, samples]
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
OR
>>> df3 = df2.reindex(columns=samples)
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
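If you want the reindex behaviour but without the NaN columns for the missing samples, one option (a sketch; note it would also drop a column that is genuinely all-NaN) is:
>>> df3 = df2.reindex(columns=samples).dropna(axis=1, how='all')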
You can do it this way; the columns list is in the order you actually want.
import pandas as pd
data = {'index': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
'Sample_A': [0,1,0,0,1],
'Sample_B':[0,0,1,0,0],
'Sample_C':[1,0,0,0,1],
'Sample_D':[0,0,1,1,0]}
df = pd.DataFrame(data)
df.set_index('index')
df1 = df[['index']+['Sample_A','Sample_D']]
output:
index Sample_A Sample_D
0 Value_1 0 0
1 Value_2 1 0
2 Value_3 0 1
3 Value_4 0 1
4 Value_5 1 0
But to ignore the missing columns, keep only the ones that actually belong to the dataframe you're analyzing:
samples = ['index', 'Sample_A', 'Sample_D', 'Extra_Sample']
final_samples = list(set(df.columns) & set(samples))
Now you can pass final_samples, which contains only columns present in the dataframe:
df3 = df[final_samples]
I have two data frames: one is user-item-rating and the other is side information about the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I would like to map the item and user names to integer IDs. I know there is factorize:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I'd like to ensure that the integer codes of the items are consistent across the two data frames. So the resulting data frames are:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!
Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5.0
1 2 1 4.0
2 3 2 1.0
# df2
item extra
0 1 10
1 2 20
Another way to work around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5.0 10
1 2 1 4.0 10
2 3 2 1.0 20
Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create the factorization
idx, levels = pd.factorize(df1['item'])
# replace the items in the first dataframe with the new integer codes
df1['item'] = idx
# create a dictionary mapping each original item to its new integer code
d = {code: i for i, code in enumerate(levels)}
# apply this mapping to the second dataframe
df2['item'] = df2['item'].apply(lambda code: d[code])
This approach only works if every item in df2 also appears in df1.
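A variation that tolerates unseen items is to build a pd.Categorical on df2 with the categories learned from df1; items missing from the categories get the code -1 instead of raising a KeyError. A sketch:
idx, levels = pd.factorize(df1['item'])
df1['item'] = idx + 1
# items absent from levels receive code -1, i.e. 0 after the +1 shift
df2['item'] = pd.Categorical(df2['item'], categories=levels).codes + 1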
I am using xlwings to replace my VB code with Python, but since I am not an experienced programmer, I was wondering which data structure to use.
Data is in .xls in 2 columns and has the following form; in VB I lift this into a basic two-dimensional array arrCampaignsAmounts(i, j):
Col 1: 'market_channel_campaign_product'; Col 2: '2334.43 $'
Then I concatenate words from 4 columns on another sheet into a similar string, stored in another 2-dim array arrStrings(i, j):
'Austria_Facebook_Winter_Active vacation'; 'rowNumber'
Finally, I search for strings from the first array within the strings from the second array; when found, I write the amount into the row number taken from arrStrings(i, 2).
Would I use 4 lists for this task?
Two dictionaries?
Something else?
Definitely use pandas DataFrames. Here are references and some very simple DataFrame examples.
# reference: http://pandas.pydata.org/pandas-docs/stable/10min.html
# reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
import numpy as np
import pandas as pd
def df_dupes(df_in):
    '''
    Returns [row, count] pairs for each unique row in the dataframe.
    '''
    # accept plain lists/tuples as well as DataFrames
    if isinstance(df_in, (list, tuple)):
        df_in = pd.DataFrame(df_in)
    return df_in.groupby(df_in.columns.tolist(), as_index=False).size()
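# Example usage of df_dupes (illustrative input):
# two identical rows [1, 2] and one row [3, 4] -> counts 2 and 1
print(df_dupes([[1, 2], [1, 2], [3, 4]]))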
def df_filter_example(df):
    '''
    In [96]: df
    Out[96]:
       A  B  C  D
    0  1  4  9  1
    1  4  5  0  2
    2  5  5  1  0
    3  1  3  9  6
    '''
    df = pd.DataFrame([[1,4,9,1],[4,5,0,2],[5,5,1,0],[1,3,9,6]], columns=['A','B','C','D'])
    return df[(df.A == 1) & (df.D == 6)]
def df_compare(df1, df2, compare_col_list, join_type):
    '''
    df_compare compares 2 dataframes.
    Returns the left, right, inner or outer join.
    df1 is the first/left dataframe
    df2 is the second/right dataframe
    compare_col_list is a list of column names that must match between df1 and df2
    join_type = 'inner', 'left', 'right' or 'outer'
    '''
    return pd.merge(df1, df2, how=join_type, on=compare_col_list)
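# Example usage of df_compare: an inner join on a shared key column
# (hypothetical frames)
left = pd.DataFrame({'id': [1, 2, 3], 'x': list('abc')})
right = pd.DataFrame({'id': [2, 3, 4], 'y': list('BCD')})
print(df_compare(left, right, ['id'], 'inner'))  # rows with id 2 and 3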
def df_compare_examples():
    df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0       1   2   3
    1       4   5   6
    2       7   8   9 '''
    df2 = pd.DataFrame([[4,5,6],[7,8,9],[10,11,12]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0       4   5   6
    1       7   8   9
    2      10  11  12 '''
    # One can see that df1 contains 1 row ([1,2,3]) not in df2 and
    # df2 contains 1 row ([10,11,12]) not in df1.
    # Assume c1 is not relevant to the comparison. So, we merge on cols c2 and c3.
    df_merge = pd.merge(df1, df2, how='outer', on=['c2','c3'])
    print(df_merge)
    '''    c1_x  c2  c3  c1_y
    0       1.0   2   3   NaN
    1       4.0   5   6   4.0
    2       7.0   8   9   7.0
    3       NaN  11  12  10.0 '''
    ''' One can see that columns c2 and c3 are returned. We also received
    columns c1_x and c1_y, where c1_x is the value of column c1
    in the first dataframe and c1_y is the value of c1 in the second
    dataframe. As such,
    any row that contains c1_y = NaN is a row from df1 not in df2 &
    any row that contains c1_x = NaN is a row from df2 not in df1. '''
    df1_unique = pd.merge(df1, df2, how='left', on=['c2','c3'])
    df1_unique = df1_unique[df1_unique['c1_y'].isnull()]
    print(df1_unique)
    df2_unique = pd.merge(df1, df2, how='right', on=['c2','c3'])
    df2_unique = df2_unique[df2_unique['c1_x'].isnull()]
    print(df2_unique)
    df_common = pd.merge(df1, df2, how='inner', on=['c2','c3'])
    print(df_common)
def delete_column_example():
    print('create df')
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
    print('drop (delete/remove) column')
    col_name = 'b'
    df.drop(col_name, axis=1, inplace=True)  # or: df = df.drop(col_name, axis=1)
def delete_rows_example():
    print('\n\ncreate df')
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['col_1','col_2','col_3'])
    print(df)
    print('\n\nappend rows')
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, pd.DataFrame([[11,22,33]], columns=['col_1','col_2','col_3'])],
                   ignore_index=True)
    print(df)
    print('\n\ndelete rows based on a column value (keep only rows where col_1 == 4)')
    df = df[df.col_1 == 4]
    print(df)
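def delete_rows_via_drop_example():
    # a sketch of the same filter as above, phrased as an explicit drop:
    # pass the unwanted index labels to df.drop
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['col_1','col_2','col_3'])
    df = df.drop(df[df.col_1 != 4].index)  # keeps only the row where col_1 == 4
    print(df)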