Adding multiple columns randomly to a dataframe from columns in another dataframe - python

I've looked everywhere but can't find a solution.
Let's say I have two tables:
Year
1
2
3
4
and
ID Value
1 10
2 50
3 25
4 20
5 40
I need to pick rows at random from the 2nd table (keeping each ID with its matching Value) to add to the first table - so if ID=3 is picked randomly, I also add Value=25, i.e. end up with something like:
Year ID Value
1 3 25
2 1 10
3 1 10
4 5 40

IIUC, you want to sample whole rows (ID and Value together) with replacement:
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True).to_numpy()
Output:
Year ID Value
0 1 4 20
1 2 4 20
2 3 2 50
3 4 3 25
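For reference, a minimal self-contained sketch of this approach, recreating the question's two tables under the answer's `df_year` / `df_id` names. Sampling whole rows is what keeps each ID paired with its Value:

```python
import pandas as pd

# Recreate the two tables from the question
df_year = pd.DataFrame({"Year": [1, 2, 3, 4]})
df_id = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                      "Value": [10, 50, 25, 20, 40]})

# Sample whole rows (ID and Value together) with replacement,
# one row per year, so each sampled ID keeps its matching Value
sampled = df_id.sample(n=len(df_year), replace=True).reset_index(drop=True)
df_year[["ID", "Value"]] = sampled

print(df_year)
```

`replace=True` matters here: without it, `sample` would fail whenever the first table has more rows than the second.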

Related

Change value of column based on specific id in pandas dataframe

I have the below sorted dataframe and I want to set the last value of each id in the id column to 0
id value
1 500
1 50
1 36
2 45
2 150
2 70
2 20
2 10
I am able to set the last value of the entire value column to 0 using df['value'].iloc[-1] = 0. How can I set the last value for each id (both id 1 and id 2) to get the output below?
id value
1 500
1 50
1 0
2 45
2 150
2 70
2 20
2 0
You can use drop_duplicates with keep='last' to get the last row of each id, then use the index of those rows to set the value to 0:
df.loc[df['id'].drop_duplicates(keep='last').index, 'value'] = 0
print(df)
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
df.loc[~df.id.duplicated(keep='last'), 'value'] = 0
Broken down
m = df.id.duplicated(keep='last')
df.loc[~m, 'value'] = 0
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
How it works
m = df.id.duplicated(keep='last')  # True for every row of an id except its last occurrence
~m reverses that, so the last occurrence of each id becomes True
df.loc[~m, 'value'] = 0  # loc uses the boolean mask to write 0 into exactly those rows
If you are willing to use numpy, here is a fast solution:
import numpy as np
import pandas as pd

# Recreate example
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 2],
    "value": [500, 50, 36, 45, 150, 70, 20, 10]
})

# Solution: 0 where the row is the last occurrence of its id, else keep the value
df["value"] = np.where(~df.id.duplicated(keep="last"), 0, df["value"].to_numpy())
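Another equivalent way to find the last row of each id, sketched here with groupby.tail (which works even if the rows of an id are not contiguous):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 2],
    "value": [500, 50, 36, 45, 150, 70, 20, 10],
})

# tail(1) returns the last row of each group; use its index to assign
df.loc[df.groupby("id").tail(1).index, "value"] = 0
print(df)
```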

Assign the frequency of each value to dataframe with new column

I'm trying to set up a DataFrame that contains a column called Frequency.
This column should show, in every row, how often that row's value appears in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example
I already tried it with value_counts(), however I only receive a value in the last row where each number appears.
In the case of the example:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, preferably appended to the same dataframe.
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
Use pandas.Series.map:
df['Frequency']=df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency']=df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index holds the category values and whose values are the counts. So you can use map or pandas.Series.replace to build a series with each category value replaced by its count, and finally assign this series to the Frequency column.
You can also do it using groupby, like below:
df.groupby("Category") \
.apply(lambda g: g.assign(frequency = len(g))) \
.reset_index(level=0, drop=True)
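Putting the transform answer together as a runnable sketch, with the question's example data recreated:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({"Category": [1, 3, 3, 4, 7, 7, 7, 8]})

# transform('count') broadcasts the per-group count back to every row,
# so the result has exactly the same length as df
df["Frequency"] = df.groupby("Category")["Category"].transform("count")
print(df)
```

Selecting the `Category` column before `transform` is what makes the result a single column that can be assigned directly.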

I want to add new column on the basis of another column data in pandas

I have multiple csv files which I merged together. In order to identify each individual csv's data in the merged file, I wish to create a new column in pandas called Serial.
The Serial column should be numbered based on the data in the Sequence Number column (for example 1,1,1,..., 2,2,..., 3,3,3,3 for every new csv). I have attached a snapshot of the csv file as well.
Sequence Number
1
2
3
4
5
1
2
1
2
3
4
I want output Like this-
Serial Sequence Number
1 1
1 2
1 3
1 4
1 5
2 1
2 2
3 1
3 2
3 3
3 4
Use DataFrame.insert to add the column in the first position, filled with a boolean mask comparing to 1 with Series.eq (==) followed by a cumulative sum with Series.cumsum:
df.insert(0, 'Serial', df['Sequence Number'].eq(1).cumsum())
print (df)
Serial Sequence Number
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
10 3 4
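If each file's sequence is not guaranteed to restart at exactly 1 (an assumption beyond the question, shown here as a sketch), a slightly more defensive variant is to start a new serial wherever the sequence number stops increasing:

```python
import pandas as pd

df = pd.DataFrame({"Sequence Number": [1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4]})

# True at each restart: the first row, plus any row where the
# sequence number does not increase relative to the previous row
restarts = df["Sequence Number"].diff().fillna(0).le(0)
df.insert(0, "Serial", restarts.cumsum())
print(df)
```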

Calculate difference sequentially by groups in pandas

I'm trying to calculate the difference between two columns sequentially as efficiently as possible. My DataFrame looks like this:
category sales initial_stock
1 2 20
1 6 20
1 1 20
2 4 30
2 6 30
2 5 30
2 7 30
And I want to calculate a variable final_stock, like this:
category sales initial_stock final_stock
1 2 20 18
1 6 20 12
1 1 20 11
2 4 30 26
2 6 30 20
2 5 30 15
2 7 30 8
Thus, final_stock first equals initial_stock - sales, and then it equals the previous final_stock minus sales, within each category. I managed to do this with for loops, but it is quite slow and my feeling says there's probably a one- or two-liner solution to this problem. Do you have any ideas?
Thanks
Use groupby and cumsum on "sales" to get the cumulative stock sold per category, then subtract from "initial_stock":
df['final_stock'] = df['initial_stock'] - df.groupby('category')['sales'].cumsum()
df
category sales initial_stock final_stock
0 1 2 20 18
1 1 6 20 12
2 1 1 20 11
3 2 4 30 26
4 2 6 30 20
5 2 5 30 15
6 2 7 30 8
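The answer above as a self-contained sketch, with the question's data recreated:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "category": [1, 1, 1, 2, 2, 2, 2],
    "sales": [2, 6, 1, 4, 6, 5, 7],
    "initial_stock": [20, 20, 20, 30, 30, 30, 30],
})

# Running total of sales within each category, subtracted from the
# (constant per category) initial stock
df["final_stock"] = df["initial_stock"] - df.groupby("category")["sales"].cumsum()
print(df)
```

The trick is that subtracting a running sum is equivalent to the sequential `final_stock.shift() - sales` recurrence, so no loop is needed.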

Creating panda column based off of values from other columns

So I'm working with a pandas dataframe that looks like this:
Current Panda Table
I want to sum all of the times for each individual property in a given week; my idea is to append this to the data frame like this:
Dataframe2
Then to simplify things I'd create a new data frame that looks like this:
Property Name Week Total_weekly_time
A 1 60
A 2 xx
B 1 xx
etc. etc.
I'm new to pandas and trying to learn the ins and outs. Any answers are much appreciated, as well as references to learn pandas better.
I think you need transform if you need a new column with the same dimensions as df after groupby:
df['Total_weekly_time'] = df.groupby(['Property Name', 'Week #'])['Duration'].transform('sum')
print (df)
Property Name Week # Duration Total_weekly_time
0 A 1 10 60
1 A 1 10 60
2 A 2 5 5
3 B 1 20 70
4 B 1 20 70
5 B 1 20 70
6 C 2 10 10
7 C 3 30 50
8 A 1 40 60
9 A 4 40 40
10 B 1 5 70
11 B 1 5 70
12 C 3 10 50
13 C 3 10 50
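For the second part of the question (the simplified per-property, per-week table), a plain groupby aggregation with as_index=False produces exactly that shape. A sketch, assuming the column names shown in the answer above:

```python
import pandas as pd

# Small illustrative data in the shape shown in the answer
df = pd.DataFrame({
    "Property Name": ["A", "A", "A", "B", "B", "C"],
    "Week #": [1, 1, 2, 1, 1, 2],
    "Duration": [10, 10, 5, 20, 20, 10],
})

# One row per (property, week) pair with the summed duration
summary = (df.groupby(["Property Name", "Week #"], as_index=False)["Duration"]
             .sum()
             .rename(columns={"Duration": "Total_weekly_time"}))
print(summary)
```

Unlike transform, this collapses the groups, which is what the desired "Property Name / Week / Total_weekly_time" table calls for.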
Pandas docs
