Pandas Dataframe Advanced Split - python

I have a big DataFrame that I need to split into two (A and B), with the same number of rows for each value of a certain column in both A and B. That column has over 700 unique values, all of them strings. Here is an example:
DataFrame
Price Type
1 X
2 Y
3 Y
4 X
5 X
6 X
7 Y
8 Y
When splitting it (randomly), I should get two values of X, and two values of Y in DataFrame A and DataFrame B, like:
A
Price Type
1 X
5 X
2 Y
3 Y
B
Price Type
4 X
6 X
7 Y
8 Y
Thanks in advance!

You can use groupby().cumcount() to enumerate the rows within Type, then %2 to divide rows into two groups:
df['groups'] = df.groupby('Type').cumcount() % 2
A, B = df[df['groups'] == 0], df[df['groups'] == 1]
Output:
**A**
Price Type groups
0 1 X 0
1 2 Y 0
4 5 X 0
6 7 Y 0
**B**
Price Type groups
2 3 Y 1
3 4 X 1
5 6 X 1
7 8 Y 1
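The question asks for a random split, and cumcount enumerates rows in their existing order, so a minimal tweak is to shuffle the frame first. A sketch (the random_state is only illustrative):
# shuffle all rows, then enumerate within each Type as before
df = df.sample(frac=1, random_state=42)
df['groups'] = df.groupby('Type').cumcount() % 2
A, B = df[df['groups'] == 0], df[df['groups'] == 1]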

Could you group by the value of Type, assign A/B to each half of the group as a new column, and then copy only the rows with the corresponding A/B label? If you need an exact split, you could base the assignment on the size of each group.
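A minimal sketch of that idea, assuming pandas >= 1.1 for GroupBy.sample: draw half of each Type group at random for A, and let B be the complement.
# randomly draw half of each Type group for A; B is everything else
A = df.groupby('Type').sample(frac=0.5, random_state=0)
B = df.drop(A.index)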

You can use the array_split feature of the numpy library, like below. Note that np.array_split splits purely by row position, so the Type counts in the two halves are only balanced if the rows are arranged accordingly:
import numpy as np
df_split = np.array_split(df, 2)
df1 = df_split[0]
df2 = df_split[1]

Related

Drop rows grouped by string value - pandas

I'm aiming to drop rows in a pandas df where a row is equal to a specific value. However, I want to extend this so that it also drops associated rows grouped by a separate column. For instance, I want to drop all rows where Label == A or D, but I also want to drop the associated rows in Num from the same group.
import pandas as pd
df = pd.DataFrame({
'Num' : [1,1,1,2,2,3,3,4,4,4],
'Label' : ['X','X','A','Y','Y','Y','Y','Y','Y','D'],
})
df = df.groupby('Num').filter(lambda x: (x['Label'].isin(['A','D'])).any())
intended output:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
You are close, just add negation:
df.groupby('Num').filter(lambda x: ~x['Label'].isin(['A','D']).any())
Output:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
Let us try using isin without groupby:
out = df.loc[~df.Num.isin(df.loc[df.Label.isin(['A','D']),'Num'])]
Out[108]:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
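To unpack that nested isin one-liner, here is a step-by-step equivalent (the intermediate name bad_nums is just illustrative):
# Num values belonging to any group that contains an A or D label
bad_nums = df.loc[df.Label.isin(['A', 'D']), 'Num']
# keep only rows whose Num is outside those groups
out = df.loc[~df.Num.isin(bad_nums)]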

Merge and overwrite common columns in two DataFrames

I have a dataframe A with 80 columns, and I grouped A and summed 20 of the columns,
E.g.
New_df = A.groupby(['X','Y','Z'])[['a','b','c', ...]].sum().reset_index()
Then I want to overwrite the values of the common columns in A with the values from New_df.
You can do:
cols1 = set(A.columns.tolist())
cols2 = set(New_df.columns.tolist())
common_cols = list(cols1.intersection(cols2))
A[common_cols] = New_df[common_cols]
to find the columns that the two dfs have in common, then replace those in the first with the columns from the second.
For example, given an initial A:
x y
0 1 a
1 2 b
2 3 c
and New_df:
z y
0 4 d
1 5 e
2 6 f
And we wind up with final 'A', with y column taken from New_df:
x y
0 1 d
1 2 e
2 3 f
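Note that A[common_cols] = New_df[common_cols] aligns on the row index, so both frames need compatible indices. If that holds, a built-in alternative worth knowing is DataFrame.update, which overwrites the shared columns in place (skipping NaNs in New_df by default):
# overwrite A's values wherever New_df has a non-NaN value in a shared column/row
A.update(New_df)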

Generating random dataframes using unique elements from an existing dataframe using pandas

I am trying to do some data manipulation using pandas. I have an Excel file with two columns, x and y. The number of elements in x corresponds to the number of connections (n_arrows) it makes with an element in column y. The number of unique elements in column x corresponds to the number of unique points (n_nodes). What I want to do is generate a random data frame (10^4 times) with the unique elements in column x and the elements in column y. The code I was trying to work on is attached. Any suggestion will be appreciated.
import pandas as pd
import numpy as np
df = pd.read_csv('/home/amit/Desktop/playing_with_pandas.csv')
num_nodes = df.drop_duplicates(subset='x', keep="last")
n_arrows = [32]  # 32 rows corresponds to 32 arrows
n_nodes = [10]
n_arrows_random = np.random.randn(df.x)
Here are 2 methods:
Solution 1: If you need the x and y values to be independently random:
Given a sample df (thanks @AmiTavory):
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 2, 3, 4]})
Using numpy.random.choice, you can do this to select random values from your x column and random values from your y column:
def simulate_df(df, size_of_simulated_df):
    return pd.DataFrame({'x': np.random.choice(df.x, size_of_simulated_df),
                         'y': np.random.choice(df.y, size_of_simulated_df)})
>>> simulate_df(df, 10)
x y
0 1 3
1 1 3
2 1 4
3 1 4
4 2 1
5 2 3
6 1 2
7 1 4
8 1 2
9 1 3
The function simulate_df returns random values sampled from your original dataframe in the x and y columns. The size of your simulated dataframe can be controlled by the argument size_of_simulated_df, which should be an integer representing the number of rows you want.
Solution 2: As per your comments, based on your task, you might want to return a dataframe of random rows, maintaining the x->y correspondence. Here is a vectorized pandas way to do that:
def simulate_df(df=df, size_of_simulated_df=10):
    return df.sample(size_of_simulated_df, replace=True).reset_index(drop=True)
>>> simulate_df()
x y
0 1 2
1 2 4
2 2 4
3 2 4
4 1 1
5 1 3
6 1 3
7 1 1
8 1 1
9 1 3
Assigning your simulated Dataframes for future reference:
In the likely scenario you want to do some sort of calculation on your simulated dataframes, I'd recommend saving them to some sort of dictionary structure using a loop like this:
dict_of_dfs = {}
for i in range(100):
    dict_of_dfs['df' + str(i)] = simulate_df(df, len(df))
Or a dictionary comprehension like this:
dict_of_dfs = {'df' + str(i): simulate_df(df, len(df)) for i in range(100)}
You can then access any one of your simulated dataframes in the same way you would access any dictionary value:
# Access the 48th simulated dataframe:
>>> dict_of_dfs['df47']
x y
0 1 4
1 2 1
2 1 4
3 2 3
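From there, any per-simulation calculation is a loop or comprehension over the dictionary. For example, collecting the mean of x across all simulated frames (the name means is illustrative):
# one summary statistic per simulated dataframe
means = {name: sim.x.mean() for name, sim in dict_of_dfs.items()}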

Summing a column in Pandas Groupby

I want to group by column A, and sum over column C and return the results immediately into the dataframe. I know that I need to use groupby, and I know that I need to use sum, but I cannot figure out how to get these functions to interact seamlessly and in one line of code.
Have
A B C
0 x text 3
1 x text 7
2 y text 5
Want
A B C D
0 x text 3 10
1 x text 7 10
2 y text 5 5
Call transform on the groupby to add the aggregated column back to the original df:
In [28]:
df['D'] = df.groupby('A')['C'].transform('sum')
df
Out[28]:
A B C D
0 x text 3 10
1 x text 7 10
2 y text 5 5
transform returns a Series with its index aligned to the original df, so you can add it as a new column.
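For comparison, here is an equivalent two-step version that aggregates first and then maps the per-group sums back onto the rows; transform collapses this into a single call:
# aggregate to one sum per group, then broadcast back via the group key
sums = df.groupby('A')['C'].sum()
df['D'] = df['A'].map(sums)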

Python: create a new column from existing columns

I am trying to create a new column based on both columns. Say I want to create a new column z, and it should be the value of y when it is not missing and be the value of x when y is indeed missing. So in this case, I expect z to be [1, 8, 10, 8].
x y
0 1 NaN
1 2 8
2 4 10
3 8 NaN
You can use apply with the option axis=1. Then your solution is pretty concise.
df['z'] = df.apply(lambda row: row.y if pd.notnull(row.y) else row.x, axis=1)
The new column 'z' gets its values from column 'y' using df['z'] = df['y']. This brings over the missing values, so fill them in from column 'x' using fillna. Chain these two actions:
>>> df['z'] = df['y'].fillna(df['x'])
>>> df
x y z
0 1 NaN 1
1 2 8 8
2 4 10 10
3 8 NaN 8
Use np.where:
In [3]:
df['z'] = np.where(df['y'].isnull(), df['x'], df['y'])
df
Out[3]:
x y z
0 1 NaN 1
1 2 8 8
2 4 10 10
3 8 NaN 8
Here it uses the boolean condition and if true returns df['x'] else df['y']
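A related pandas one-liner, not used in the answers above but worth mentioning, is Series.combine_first, which takes values from y and falls back to x wherever y is missing:
# y where available, otherwise x
df['z'] = df['y'].combine_first(df['x'])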
Let's say the DataFrame is called df. First copy the y column:
df["z"] = df["y"].copy()
Then set the NaN locations of z to the values of x at those same locations:
import numpy as np
mask = np.isnan(df.z)
df.loc[mask, 'z'] = df.x[mask]  # .loc avoids chained-assignment issues
>>> df
x y z
0 1 NaN 1
1 2 8 8
2 4 10 10
3 8 NaN 8
I'm not sure if I understand the question, but would this be what you're looking for?
pd.notnull(y[i]) is False for missing values, so those fall back to x[i]:
z = []
for i in range(len(x)):
    if pd.notnull(y[i]):
        z.append(y[i])
    else:
        z.append(x[i])
The update method does almost exactly this. The only caveat is that update works in place, so you must first create a copy:
df['z'] = df.x.copy()
df.z.update(df.y)
In the above example you start with x and replace each value with the corresponding value from y, as long as the new value is not NaN.
