Generating random dataframes using unique elements from an existing dataframe using pandas - python

I am trying to do some data manipulation using pandas. I have an Excel file with two columns, x and y. The number of elements in x corresponds to the number of connections (n_arrows) it makes with an element in column y, and the number of unique elements in column x corresponds to the number of unique points (n_nodes). What I want to do is generate a random data frame (10^4 times) with the unique elements in column x and the elements in column y. The code I was trying to work on is attached. Any suggestion will be appreciated.
import pandas as pd
import numpy as np
df = pd.read_csv('/home/amit/Desktop/playing_with_pandas.csv')
num_nodes = df.drop_duplicates(subset='x', keep="last")
n_arrows = [32]  # the 32 rows correspond to 32 arrows
n_nodes = [10]
n_arrows_random = np.random.randn(df.x)  # this line fails: randn expects integer dimensions, not a Series

Here are 2 methods:
Solution 1: If you need the x and y values to be independently random:
Given a sample df (thanks @AmiTavory):
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 2, 3, 4]})
Using numpy.random.choice, you can select random values from your x column and from your y column:
def simulate_df(df, size_of_simulated_df):
    return pd.DataFrame({'x': np.random.choice(df.x, size_of_simulated_df),
                         'y': np.random.choice(df.y, size_of_simulated_df)})
>>> simulate_df(df, 10)
x y
0 1 3
1 1 3
2 1 4
3 1 4
4 2 1
5 2 3
6 1 2
7 1 4
8 1 2
9 1 3
The function simulate_df returns random values sampled from your original dataframe in the x and y columns. The size of your simulated dataframe can be controlled by the argument size_of_simulated_df, which should be an integer representing the number of rows you want.
Solution 2: As per your comments, based on your task, you might want to return a dataframe of random rows, maintaining the x->y correspondence. Here is a vectorized pandas way to do that:
def simulate_df(df=df, size_of_simulated_df=10):
    return df.sample(size_of_simulated_df, replace=True).reset_index(drop=True)
>>> simulate_df()
x y
0 1 2
1 2 4
2 2 4
3 2 4
4 1 1
5 1 3
6 1 3
7 1 1
8 1 1
9 1 3
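A small usage note (not in the original answer): df.sample accepts a random_state, which helps if the simulations need to be reproducible:
# Reproducible draw: the same seed always returns the same simulated rows.
simulated = df.sample(10, replace=True, random_state=42).reset_index(drop=True)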
Assigning your simulated DataFrames for future reference:
In the likely scenario you want to do some sort of calculation on your simulated dataframes, I'd recommend saving them to some sort of dictionary structure using a loop like this:
dict_of_dfs = {}
for i in range(100):
    dict_of_dfs['df' + str(i)] = simulate_df(df, len(df))
Or a dictionary comprehension like this:
dict_of_dfs = {'df' + str(i): simulate_df(df, len(df)) for i in range(100)}
You can then access any one of your simulated dataframes in the same way you would access any dictionary value:
# Access the 48th simulated dataframe:
>>> dict_of_dfs['df47']
x y
0 1 4
1 2 1
2 1 4
3 2 3
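Alternatively (a minor variation, not in the original answer), a plain list works too, since the keys are just sequential labels:
# Store the simulations in order; index 47 is the 48th dataframe.
list_of_dfs = [simulate_df(df, len(df)) for _ in range(100)]
list_of_dfs[47]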

Related

Creating a DataFrame with repeating values

I'm trying to create a dataframe in Pandas that has two variables ("date" and "time_of_day"), where "date" is 120 observations long with 30 days (each day has four observations: 1,1,1,1; 2,2,2,2; etc.) and the second variable "time_of_day" repeats the values 1,2,3,4 thirty times.
The closest I found to this question was here: How to create a series of numbers using Pandas in Python, which got me the below code, but I'm receiving an error that it must be a 1-dimensional array.
df = pd.DataFrame({'date': np.tile([pd.Series(range(1,31))],4), 'time_of_day': pd.Series(np.tile([1, 2, 3, 4],30 ))})
So the final dataframe would look something like
date  time_of_day
1     1
1     2
1     3
1     4
2     1
2     2
2     3
2     4
Thanks much!
You need np.repeat for date and np.tile for time_of_day; the original attempt fails because np.tile([pd.Series(range(1, 31))], 4) produces a 2-D array, which pandas rejects as a single column:
df = pd.DataFrame({'date': np.repeat(range(1, 31), 4),
                   'time_of_day': np.tile([1, 2, 3, 4], 30)})
print(df.head(10))
date time_of_day
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 3 1
9 3 2
Or you could use pd.MultiIndex.from_product, with the same result:
df = (
    pd.MultiIndex.from_product([range(1, 31), range(1, 5)],
                               names=['date', 'time_of_day'])
      .to_frame(index=False)
)
Or product from itertools:
from itertools import product
df = pd.DataFrame(product(range(1,31), range(1,5)), columns=['date','time_of_day'])
Or the newer cross merge (how='cross', added in pandas 1.2):
out = pd.DataFrame(range(1,31)).merge(pd.DataFrame([1, 2, 3, 4]),how='cross')
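Note that with unnamed one-column frames, the overlapping labels come out as 0_x and 0_y; naming the columns up front avoids a rename afterwards (a small variation on the line above):
out = (pd.DataFrame({'date': range(1, 31)})
         .merge(pd.DataFrame({'time_of_day': [1, 2, 3, 4]}), how='cross'))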

Pandas Dataframe Advanced Split

I have a big DataFrame that I need to split into two (A and B), with the same number of rows for each value of a certain column in A and in B. That column has over 700 unique values, all of them strings. Here is an example:
DataFrame
Price Type
1 X
2 Y
3 Y
4 X
5 X
6 X
7 Y
8 Y
When splitting it (randomly), I should get two values of X, and two values of Y in DataFrame A and DataFrame B, like:
A
Price Type
1 X
5 X
2 Y
3 Y
B
Price Type
4 X
6 X
7 Y
8 Y
Thanks in advance!
You can use groupby().cumcount() to enumerate the rows within Type, then %2 to divide rows into two groups:
df['groups'] = df.groupby('Type').cumcount()%2
A,B = df[df['groups']==0], df[df['groups']==1]
Output:
**A**
Price Type groups
0 1 X 0
1 2 Y 0
4 5 X 0
6 7 Y 0
**B**
Price Type groups
2 3 Y 1
3 4 X 1
5 6 X 1
7 8 Y 1
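Note that the cumcount split above is deterministic. Since the question asks for a random split, one option (a sketch, not from the original answer) is to shuffle the rows first and then alternate within each Type as before:
# Shuffle, then alternate rows within each Type between the two halves.
shuffled = df.sample(frac=1, random_state=0)
shuffled['groups'] = shuffled.groupby('Type').cumcount() % 2
A, B = shuffled[shuffled['groups'] == 0], shuffled[shuffled['groups'] == 1]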
Could you group by the value of Type, assign A/B to half of each group as a new column, and then copy only the rows with the corresponding label? If you need an exact split, you could base it on the size of the group.
You can use the array_split feature of the numpy library, like below:
import numpy as np
df_split = np.array_split(df, 2)
df1 = df_split[0]
df2 = df_split[1]

Making a Multiindexed Pandas Dataframe Non-Symmetric

I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
>>> Output
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
4 5 1 1 5
In this dataframe, the zero-th row and fifth row are symmetric in the sense that if the entire A and B columns of the zero-th row are flipped, it becomes identical to the fifth one. Similarly, the second row is symmetric with itself.
I am planning to remove these rows from my original dataframe, thus making it 'non-symmetric'. The specific plans are as follow:
If a row with higher index is symmetric with a row with lower index, keep the lower one and remove the higher one. For example, from the above dataframe, keep the zero-th row and remove the fifth row.
If a row is symmetric with itself, remove that row. For example, from the above dataframe, remove the second row.
My attempt was to first zip the four lists into a tuple list, remove the symmetric tuples by a simple if-statement, unzip them, and merge them back into a dataframe. However, this turned out to be inefficient, making it unscalable for large dataframes.
How can I achieve this in an efficient manner? I guess utilizing several built-in pandas methods is necessary, but it seems quite complicated.
Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
test['idx'] = test.index * 2 # adding auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2,3,0,1,4]] # creating flipped DF
test2.columns = test.columns # fixing column names
test2['idx'] = test2.index * 2 + 1 # for flipped DF column 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values(by='idx')
df = df.set_index('idx')
print(df)
A B
a b a b
idx
0 1 5 5 1
1 5 1 1 5
2 2 4 2 4
3 2 4 2 4
4 3 3 3 3
5 3 3 3 3
6 4 2 4 2
7 4 2 4 2
8 5 1 1 5
9 1 5 5 1
df = df.drop_duplicates() # remove rows with duplicates
df = df[df.index%2 == 0] # remove rows with odd idx (flipped)
df = df.reset_index()[['A', 'B']]
print(df)
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
The idea is to create flipped rows with odd indexes, so that they will be placed under their original rows after reindexing. Then delete duplicates, keeping rows with lower indices. For cleanup simply delete remaining rows with odd indices.
Note that row [3,3,3,3] stayed. There should be a separate filter to take care of self-symmetric rows. Since your definition of self-symmetric is unclear (other rows have certain degree of symmetry too), I leave this part to you. Should be straightforward.
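One possible sketch for that separate filter (editor's addition, assuming "self-symmetric" means the A block equals the B block, so that swapping them leaves the row unchanged):
# Drop rows identical to their own A/B-flipped version,
# i.e. rows where the A block equals the B block column-for-column.
self_symmetric = (df['A'].values == df['B'].values).all(axis=1)
df = df[~self_symmetric]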

Delete columns with extremely unequally distributed values from pandas dataframe

Given the following code:
import numpy as np
import pandas as pd
arr = np.array([
[1,2,9,1,1,1],
[2,3,3,1,0,1],
[1,4,2,1,2,1],
[2,3,1,1,2,1],
[1,2,3,1,8,1],
[2,2,5,1,1,1],
[1,3,8,7,4,1],
[2,4,7,8,3,3]
])
# 1,2,3,4,5,6 <- Number of the columns.
df = pd.DataFrame(arr)
for col in df.columns:
    print({x: list(df[col]).count(x) for x in set(df[col])})
I want to delete from the dataframe all the columns in which one value occurs more often than all the other values of the column together. In this case I would like to drop the columns 4 and 6 (see comment) since the number 1 occurs more often than all the other numbers in these columns together (6 > 2 in column 4 and 7 > 1 in column 6). I don't want to drop the first column (4 = 4). How would I do that?
Another option is to do a value counts on each column and, if the maximum count is smaller than or equal to half the number of rows of the data frame, select that column:
df.loc[:, df.apply(lambda col: max(col.value_counts()) <= df.shape[0]/2)]
# 0 1 2 4
#0 1 2 9 1
#1 2 3 3 0
#2 1 4 2 2
#3 2 3 1 2
#4 1 2 3 8
#5 2 2 5 1
#6 1 3 8 4
#7 2 4 7 3

Pandas: outer product of row and col sums

In Pandas, I am trying to manually code a chi-square test. I am comparing row 0 with row 1 in the dataframe below.
data
2 3 5 10 30
0 3 0 6 5 0
1 33324 15833 58305 54402 38920
For this, I need to calculate the expected count for each cell as cell(i,j) = rowSum(i) * colSum(j) / sumAll. In R, I can do this simply by taking the outer() product:
Exp_counts <- outer(rowSums(data), colSums(data), "*")/sum(data) # Expected cell counts
I used numpy's outer product function to imitate the outcome of the above R code:
import numpy as np
pd.DataFrame(np.outer(data.sum(axis=1),data.sum(axis=0))/ (data.sum().sum()), index=data.index, columns=data.columns.values)
2 3 5 10 30
0 2 1 4 3 2
1 33324 15831 58306 54403 38917
Is it possible to achieve this with a Pandas function?
A complete solution using only pandas built-in methods:
def outer_product(row):
    numerator = df.sum(1).mul(row.sum(0))
    denominator = df.sum(0).sum(0)
    return numerator.floordiv(denominator)

df.apply(outer_product)
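As a quick usage check (editor's sketch, rebuilding the two-row table from the question):
import pandas as pd

# The example data from the question.
df = pd.DataFrame({2: [3, 33324], 3: [0, 15833], 5: [6, 58305],
                   10: [5, 54402], 30: [0, 38920]})
print(df.apply(outer_product))  # floor-divided expected counts, as shown above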
