I have a subset of a dataframe here:
data = {'Name': ['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6'],
        'Time': [1, 2, 3, 4, 5, 6],
        'Week': [1, 2, 3, 2, 3, 2]}
dfx = pd.DataFrame(data)
I need to sum up all the times for each week, so Week 1's time is 1, Week 2's is 2+4+6, and Week 3's is 3+5. I also need it to look through the 'Week' column and find all the different weeks; for this example there are 3, but for another dataframe it could be 2 or 4.
The end result is to look through a column in a dataframe, find the unique values (1, 2, 3, ..., n), group the rows by each of those values, and sum up the time for each group.
I have tried a handful of ways but nothing is really working how I would like. I appreciate any help or ideas.
Expected output (the individual times for each week, then their sum):
         Times    Sum
Week 1:  1        1
Week 2:  2 4 6    12
Week 3:  3 5      8
The output can be either individual dataframes of the data or one dataframe that has all three rows with all the numbers and the sum at the end.
import pandas as pd

data = {'Name': ['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6'],
        'Time': [1, 2, 3, 4, 5, 6],
        'Week': [1, 2, 3, 2, 3, 2]}
dfx = pd.DataFrame(data)
dfx = dfx.groupby('Week')['Time'].sum()
print(dfx)
Output:
Week
1 1
2 12
3 8
You can group by "Week", select the "Time" column, and pass multiple functions (such as the list constructor and sum) to GroupBy.agg to get both the individual times and the total:
out = dfx.groupby('Week')['Time'].agg(Times=list, Total=sum)
Output:
Times Total
Week
1 [1] 1
2 [2, 4, 6] 12
3 [3, 5] 8
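If you also want the list of distinct weeks, or one small DataFrame per week (the other output shape mentioned in the question), both fall out of the same grouping. A minimal sketch using the data above (dfx is rebuilt first, since the earlier snippet reassigned it to the grouped result):
dfx = pd.DataFrame(data)              # the original six-row frame
weeks = dfx['Week'].unique()          # array([1, 2, 3])
per_week = {wk: grp for wk, grp in dfx.groupby('Week')}
print(per_week[2])                    # the ch2, ch4, ch6 rows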
This is my source DataFrame
df = pd.DataFrame({'uid': [1, 2, 3, 5, 6],
'grades': [69.233627, 70.130900, 83.357011, 88.206387, 74.342212]})
This is my target DataFrame
df2 = pd.DataFrame({'uid': [1, 2, 9],
'grades': [0.0,0.0,0.0]})
I'm trying to update the target DataFrame with values from the source DataFrame wherever the uid matches:
for i in df2['uid']:
    if len(df[df['uid'] == i]) > 0:
        df2.loc[df2['uid'] == i, 'grades'] = df.loc[df['uid'] == i, 'grades']
I've got what I need
>>> df2
uid grades
0 1 69.233627
1 2 70.130900
2 9 0.000000
I'd just like to know is there a simpler way to do the job?
Use DataFrame.update after setting the index of both DataFrames to the uid column:
df = df.set_index('uid')
df2 = df2.set_index('uid')
df2.update(df)
df2 = df2.reset_index()
print(df2)
uid grades
0 1 69.233627
1 2 70.130900
2 9 0.000000
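If you'd rather not set and reset indexes, a map-based lookup does the same job; this is just an alternative sketch, starting again from the original df and df2 (with uid as a regular column), not part of the original answer:
# Look up each uid's grade in df; uids missing from df keep their old value
df2['grades'] = df2['uid'].map(df.set_index('uid')['grades']).fillna(df2['grades'])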
I have a Pandas dataframe df with 102 columns. Each column is named differently, say A, B, C etc., giving the original dataframe the following structure:
          Column A  Column B  Column C  ...
Row 1
Row 2
...
Row n
I would like to change the column names from A, B, C etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there any simple way to rename all column names to F1 to F102 automatically, instead of renaming each column individually?
df.columns = ["F" + str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
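Putting the note into code, the same rename with the bound computed from the frame itself:
# Works for any column count, not just 102
df.columns = ["F" + str(i) for i in range(1, len(df.columns) + 1)]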
One way to do this is to extract the column names and the row values as separate lists, rename the names positionally in a loop, and rebuild the DataFrame:
import pandas as pd
d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)
cols = list(dataFrame.columns.values)  # the original column names as a list
index = 1                              # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename the column positionally
    index += 1
vals = dataFrame.values.tolist()       # the row values
newDataFrame = pd.DataFrame(vals, columns=cols)  # rebuild with the new names
print(newDataFrame)
Output:
F1 F2 F3
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 4 4 4
6 3 3 3
7 2 2 2
8 1 1 1
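The loop and rebuild can also be collapsed into a single set_axis call, which on recent pandas returns a renamed copy; a minimal sketch, assuming the same dataFrame:
# Rename all columns positionally in one step
newDataFrame = dataFrame.set_axis(["F" + str(i) for i in range(1, dataFrame.shape[1] + 1)], axis=1)
print(newDataFrame)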
I have two pandas dataframes that are loaded from CSV files. Each has two columns: column A is an id, with the same values in the same order in both CSVs, and column B is a numerical value.
I need to create a new CSV with column A identical to the two inputs and with column B the average of the two initial CSVs.
I am creating two dataframes like
df1=pd.read_csv(path).set_index('A')
df2=pd.read_csv(otherPath).set_index('A')
If I do
newDf = (df1['B'] + df2['B'])/2
newDf.to_csv(...)
then newDf has the ids in the wrong order in column A.
If I do
df1['B'] = (df1['B'] + df2['B'])/2
df1.to_csv(...)
I get an error on the first line saying "ValueError: cannot reindex from a duplicate axis".
It seems like this should be trivial, what am I doing wrong?
Try using merge instead of setting an index.
Say we have these dataframes:
df1 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [3, 4, 5, 6, 7]})
df2 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [7, 4, 3, 10, 23]})
Then we merge them and create a new column with the mean of both B columns.
together = df1.merge(df2, on='A')
together.loc[:, "mean"] = (together['B_x']+ together['B_y']) / 2
together = together[['A', 'mean']]
And together is now:
A mean
0 1 5.0
1 2 4.0
2 3 4.0
3 4 8.0
4 5 15.0
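Since the end goal is a new CSV, you can then write together out; the filename here is just a placeholder:
# index=False keeps pandas' row index out of the file
together.to_csv('averages.csv', index=False)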
Suppose you have a data frame
df = pd.DataFrame({'a':[1,2,3,4],'b':[2,4,6,8],'c':[2,4,5,6]})
and you want to replace specific values in columns 'a' and 'c' (but not 'b'). For example, replacing 2 with 20, and 4 with 40.
The following will not work since it is setting values on a copy of a slice of the DataFrame:
df[['a','c']].replace({2:20, 4:40}, inplace=True)
A loop will work:
for col in ['a', 'c']:
    df[col].replace({2: 20, 4: 40}, inplace=True)
But a loop seems inefficient. Is there a better way to do this?
According to the documentation on replace, you can specify a dictionary for each column:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [2, 4, 5, 6]})
lookup = {col: {2: 20, 4: 40} for col in ['a', 'c']}
df.replace(lookup, inplace=True)
print(df)
Output
a b c
0 1 2 20
1 20 4 40
2 3 6 5
3 40 8 6
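If you prefer to avoid inplace altogether, you can also replace on the two-column slice and assign it back; a small sketch of the same replacement:
# replace returns a new frame for the slice; assigning writes it back into df
df[['a', 'c']] = df[['a', 'c']].replace({2: 20, 4: 40})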
So I have this DataFrame, built so that for id equal to 2 there are two different values in both the num and my_date columns:
import pandas as pd
from datetime import datetime

a = pd.DataFrame({'id': [1, 2, 3, 2],
                  'my_date': [datetime(2017, 1, i) for i in range(1, 4)] + [datetime(2017, 1, 1)],
                  'num': [2, 3, 1, 4]})
For convenience, this is the DataFrame printed out:
   id    my_date  num
0   1 2017-01-01    2
1   2 2017-01-02    3
2   3 2017-01-03    1
3   2 2017-01-01    4
If I want to count the number of unique values for each id, I'd do
grouped_a = a.groupby('id').agg({'my_date': pd.Series.nunique,
                                 'num': pd.Series.nunique}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
which gives a weird result: the unique counts for the datetime column come out wrong.
It looks like counting unique values on the datetime type (which pandas converts to datetime64[ns]) is not working?
It is a bug; see pandas GitHub issue 14423.
But you can use SeriesGroupBy.nunique, which works nicely:
grouped_a = a.groupby('id').agg({'my_date': 'nunique',
                                 'num': 'nunique'}).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print(grouped_a)

   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
If the DataFrame has only these 3 columns, you can use:
grouped_a = a.groupby('id').agg(['nunique']).reset_index()
grouped_a.columns = ['id', 'num_unique_my_date', 'num_unique_num']
print(grouped_a)

   id  num_unique_my_date  num_unique_num
0   1                   1               1
1   2                   2               2
2   3                   1               1
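On pandas 0.25 or newer, named aggregation expresses the same thing without renaming columns afterwards; a sketch of the equivalent call:
# Each keyword argument is output_name=(source_column, aggregation)
grouped_a = a.groupby('id').agg(num_unique_my_date=('my_date', 'nunique'),
                                num_unique_num=('num', 'nunique')).reset_index()
print(grouped_a)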