I have a dataframe df which looks like this:
CustomerId Age
1 25
2 18
3 45
4 57
5 34
I have a list called "Price" which looks like this:
Price = [123,345,1212,11,677]
I want to add that list to the dataframe. Here is my code:
df['Price'] = Price
It seems to work but when I print the dataframe the field called "Price" contains all the metadata information such as Name, Type... as well as the value of the Price.
How can I create a column called "Price" containing only the values of the Price list so that the dataframe looks like:
CustomerId Age Price
1 25 123
2 18 345
3 45 1212
4 57 11
5 34 677
In my Opinion, the most elegant solution is to use assign:
df.assign(Price=Price)
CustomerId Age Price
1 25 123
2 18 345
3 45 1212
4 57 11
5 34 677
note that assign actually returns a DataFrame.
Assign creates a new Column 'Price' (left Price) with the content of the list Price (right Price)
import pandas as pd
df['Price'] = pd.Series(Price)
if you use this you will not get the error if you have less values in the series than to your dataframe otherwise you will get the error that will tell you have less vales in the list and that can not be appended.
I copy pasted your example into a dataframe using pandas.read_clipboard and then added the column like this:
import pandas as pd
df = pd.read_clipboard()
Price = [123,345,1212,11,677]
df.loc[:,'Price'] = Price
df
Generating this:
CustomerId Age Price
0 1 25 123
1 2 18 345
2 3 45 1212
3 4 57 11
4 5 34 677
You can add pandas series as column.
import pandas as pd
df['Price'] = pd.Series(Price)
Related
I have some questions about combining the first 2 columns in pandas/python with n/a value
long story: I need to read an excel and alter those changes. I can not change anything in excel, so any change has been done by python.
Here is the excel input
and the expected expect output will be
I manage to read it in, but when I try to combine the first 2 columns, I have some problems. since in excel, the first row is merged, so once it is read in. only one row has value, but the rest of the row is all N/A.
such as below:
Year number 2016
Month Jan
Month 2016-01
Grade 1 100
NaN 2 99
NaN 3 98
NaN 4 96
NaN 5 92
NaN Total 485
Is there any function that can easily help me to combine the first two columns and make it as below:
Year 2016
Month Jan
Month 2016-01
Grade 1 100
Grade 2 99
Grade 3 98
Grade 4 96
Grade 5 92
Grade Total 485
Anything will be really appreciated.
I searched and google the key word for so long but did not find any answer that fits my situation here.
d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1, 100
NaN,2, 99
NaN,3, 98
NaN,4, 96
NaN,5, 92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df
df['Year'] = df.Year.fillna(method='ffill')
df = df.fillna('') # skip this step if your data from excel does not have nan in col 2.
df['Year'] = df.Year + ' ' + df.number.astype('str')
df = df.drop('number',axis=1)
df
I have a dataframe with data from ecommerce panel.
It has orders and returns mixed together.
Each row has orderID - it's the same number for normal orders and for corresponding returns that come back from customers.
My data looks like this:
orderID
Shop
Revenue
Note
44
0
-32
Return
45
0
-100
Return
44
1
14
45
3
20
Something else
46
2
50
47
1
80
Something
48
2
222
For each return I want to find a 'Shop' column value that corresponds to original order.
For example : 'orderID' == 44 comes twice: once as return (with 'Shop' == 0) and once as normal order (with 'Shop' == 1).
I want to replace all the 0 values with 'Shop' column with values from earlier orders
My desired output looks like this:
orderID
Shop
Revenue
Note
44
1
-32
Return
45
3
-100
Return
44
1
14
45
3
20
Something else
46
2
50
47
1
80
Something
48
2
222
I know how to do it in Google Sheets (first I filter table removing 'Shop'==0 values and then I vlookup for numbers in this filtered array)
I know how to filter this table using Pandas but I don't know how to write it.
I assume that I will need to write a temporary column first, where I store both types of values - for normal orders (just copied) and for returns.
Original dataframe is 1 000 000+ rows
My data in .csv is available here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQAJ4tMc_Bcvv-4FsUy3E7sG0m9hm-nLTVLj-LwlSEns-YJ1pbq6gSKp5mj5lZqRI2EgHOsOutwnn1I/pub?gid=0&single=true&output=csv
Thank you for any advice!
IIUC, using map:
m = df.query('Shop != 0').set_index('orderID')['Shop']
df['Shop'] = df['orderID'].map(m)
print(df)
Output:
orderID Shop Revenue Note
0 44 1 -32 Return
1 45 3 -100 Return
2 44 1 14 NaN
3 45 3 20 Something else
4 46 2 50 NaN
5 47 1 80 Something
6 48 2 222 NaN
Create a pd.Series using query to filter out zero shops then set_index and map shops to orderID​.
This works if there is a 1-1 shop to order mapping. If you have multiple shops per order, then you'll need logic to determine which shop valid.
If you have duplicate order to the same shop, then you need to drop_duplicates first.
I have the following dataframe:
Name rollNumber external_roll_number testDate marks
0 John 34 234 2021-04-28 15
1 John 34 234 2021-03-28 25
I would like to convert it like this:
Name rollNumber external_roll_number testMonth marks testMonth marks
0 John 34 234 April 15 March 25
If the above is not possible then I would atleast want it to be like this:
Name rollNumber external_roll_number testDate marks testDate marks
0 John 34 234 2021-04-28 15 2021-03-28 25
How can I convert my dataframe to the desired output? This change will be based on the Name column of the rows.
EDIT 1
I tried using pivot_table like this but I did not get the desired result.
merged_df_pivot = pd.pivot_table(merged_df, index=["name", "testDate"], aggfunc="first", dropna=False).fillna("")
When I try to iterate through the merged_df_pivot like this:
for index, details in merged_df_pivot.iterrows():
I am again getting two rows and also I was not able to add the new testMonth column by the above method.
core is unstack() month to be columns
detail then to re-structure month-by month marks columns to required structure
generally consider bad practice to have duplicate column names, hence have suffixed them
df = pd.read_csv(io.StringIO(""" Name rollNumber external_roll_number testDate marks
0 John 34 234 2021-04-28 15
1 John 34 234 2021-03-28 25
"""), sep="\s+")
df["testDate"] =pd.to_datetime(df["testDate"])
df = df.assign(testMonth = df["testDate"].dt.strftime("%B")).drop(columns="testDate")
dft = (df.set_index([c for c in df.columns if c!="marks"])
.unstack("testMonth") # make month a column
.droplevel(0, axis=1) # remove unneeded level in columns
# create columns for months from column names and rename marks columns
.pipe(lambda d: d.assign(**{f"testMonth_{i+1}":c
for i,c in enumerate(d.columns)}).rename(columns={c:f"marks_{i+1}"
for i,c in enumerate(d.columns)}))
.reset_index()
)
output
Name
rollNumber
external_roll_number
marks_1
marks_2
testMonth_1
testMonth_2
0
John
34
234
15
25
April
March
I have a data frame which looks like this:
student_id
session_id
reading_level_id
st_week
end_week
1
3334
3
3
3
1
3335
2
4
4
2
3335
2
2
2
2
3336
2
2
3
2
3337
2
3
3
2
3339
2
3
4
...
There are multiple session_id's, st_weeks and end_weeks for every student_id. Im trying to group the data by 'student_id' and I want to calculate the difference between the maximum(end_week) and the minimum (st_week) for each student.
Aiming for an output that would look something like this:
Student_id
Diff
1
1
2
2
....
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, and pass False argument to the as_index parameter (this works for a dataframe, and returns a dataframe);
Next, use a named aggregation to get the `max week for end week and the min week for st_week for each group
Get the difference between max_wk and end_wk
Finally, keep only the required columns
(
df.groupby("student_id", as_index=False)
.agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
.assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
.loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
There's probably a more efficient way to do this, but I broke this into separate steps for the grouping to get max and min values for each id, and then created a new column representing the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
return g.end_week.max() - g.st_week.min()
df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64
Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': pd.np.mean})
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string in to pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
It turns out that pd.groupby() returns an object with the original data stored in obj. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
def name_join(list_names, concat='-'):
return concat.join(list_names)
print('create dataframe\n')
df = pandas.DataFrame({'name':['john', 'jason', 'jane', 'jack', 'james'], 'age':[1,36,32,26,30], 'family':[1,1,1,2,2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby, one way is to do DataFrame.groupby.filter(lambda x:True), this gets back to the original DataFrame.