I have a DataFrame with 40 columns (columns 0 through 39) and I want to group them four at a time:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.binomial(1, 0.2, (100, 40)))
new_df["0-3"] = df[0] + df[1] + df[2] + df[3]
new_df["4-7"] = df[4] + df[5] + df[6] + df[7]
...
new_df["36-39"] = df[36] + df[37] + df[38] + df[39]
Can I do this in a single statement (or in a better way than summing them separately)? The column names in the new DataFrame are not important.
You could select out the columns and sum on the row axis, like this.
df['0-3'] = df.loc[:, 0:3].sum(axis=1)
A couple of things to note:
Summing like this will ignore missing data, while df[0] + df[1] ... propagates it. Pass skipna=False if you want that behavior.
There isn't necessarily a performance benefit; it may actually be a little slower.
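To apply the same idea to all ten groups, a small loop over the column blocks works. A minimal sketch, assuming the df built in the question:
# Build the summed DataFrame one 4-column block at a time
new_df = pd.DataFrame()
for start in range(0, 40, 4):
    new_df[f'{start}-{start + 3}'] = df.loc[:, start:start + 3].sum(axis=1)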
Here's another way to do it:
new_df = df.transpose()
new_df['Group'] = new_df.index // 4
new_df = new_df.groupby('Group').sum().transpose()
Note that // is integer (floor) division; in Python 3, plain / would produce float labels instead of the intended groups of four.
I don't know if it is the best way to go, but I ended up using a MultiIndex:
df.columns = pd.MultiIndex.from_product((range(10), range(4)))
new_df = df.groupby(level=0, axis=1).sum()
Update: probably because of the index, this was faster than the alternatives. The same thing can be done with df.groupby(df.columns // 4, axis=1).sum(), which is faster if you also count the time needed to construct the MultiIndex. However, changing the index is a one-time operation, and since I update the df and take the sum thousands of times, the MultiIndex approach was faster for me.
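For reference, a minimal sketch of that one-liner, assuming the 100x40 df built in the question:
# Integer-divide the column labels to form groups of four: 0,0,0,0,1,1,1,1,...
new_df = df.groupby(df.columns // 4, axis=1).sum()
print(new_df.shape)  # (100, 10)
Note that groupby(..., axis=1) is deprecated in recent pandas releases; transposing first, as in the answer above, avoids the warning.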
Consider a list comprehension:
df = # your data
df_slices = [df.iloc[:, 4*x:4*x+4] for x in range(10)]
Or more generally
df_slices = [df.iloc[:, 4*x:4*x+4] for x in range(len(df.columns) // 4)]
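If the end goal is the summed DataFrame from the question, the slices can then be reduced and stitched back together. A possible follow-up, assuming pandas is imported as pd:
new_df = pd.concat([s.sum(axis=1) for s in df_slices], axis=1)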
Related
I have the following data in the format shown below.
I next perform recasting, groupby and averaging (see code) to reduce data dimensionality.
df_mod = pd.read_csv('wet_bulb_hr.csv')
# Mod Date: parse the date column
df_mod['wbt_date'] = pd.to_datetime(df_mod['wbt_date'])
# Mod Time: fold the hour-of-day column into the date
df_mod['wbt_time'] = df_mod['wbt_time'].astype('int')
df_mod['wbt_date'] = df_mod['wbt_date'] + \
    pd.to_timedelta(df_mod['wbt_time'] - 1, unit='h')
df_mod['wet_bulb_temperature'] = \
    df_mod['wet_bulb_temperature'].astype('float')
df = df_mod
df = df.drop(['wbt_time', '_id'], axis=1)
#df_novel = df.mean()
df = df.groupby([df.wbt_date.dt.year, df.wbt_date.dt.month]).mean()
After writing to an output file, I get an output that looks like this.
Investigating further, I can understand why: all my processing has left a dataframe with only the averaged value column, but what I really need is for the two wbt_date groupby columns (year and month) to be exported as well. That does not happen because groupby moves them into the index.
My question: how do I generate an index and get the groupby wbt_date columns back as a new single column (e.g. YYYY-MM), so that the output is:
You can flatten the MultiIndex into a plain Index in YYYY-MM format with a list comprehension:
df = df.groupby([df.wbt_date.dt.year,df.wbt_date.dt.month]).mean()
df.index = [f'{y}-{m}' for y, m in df.index]
df = df.rename_axis('date').reset_index()
Or use month period by Series.dt.to_period:
df = df.groupby(df.wbt_date.dt.to_period('m')).mean().reset_index()
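A tiny self-contained example of the second variant, selecting just the temperature column to keep it minimal (the column names and values here are made up purely for illustration):
import pandas as pd
df = pd.DataFrame({
    'wbt_date': pd.to_datetime(['2019-01-01 01:00', '2019-01-15 02:00', '2019-02-01 01:00']),
    'wet_bulb_temperature': [24.5, 25.1, 26.0],
})
out = df.groupby(df.wbt_date.dt.to_period('m'))['wet_bulb_temperature'].mean().reset_index()
print(out)
#   wbt_date  wet_bulb_temperature
# 0  2019-01                  24.8
# 1  2019-02                  26.0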
Try this,
# rename the existing index; on reset_index() it will be added as a new column
df.index.rename("wbt_year", inplace=True)
df.reset_index(inplace=True)
df['month'] = df['wbt_year'].astype(str) + "-" + df['wbt_date'].astype(str)
Output:
>>> df['month']
0 2019-0
1 2018-1
2 2017-2
Say I have a list whose elements are pandas DataFrames.
lst = [df1, df2, df3, df4]
Every df is a time series with a DatetimeIndex. df1 & df2 have values every 15 minutes and df3 & df4 have values every hour. I want to concatenate them all, but before that I need to make some changes to df3 & df4.
First is resampling and renaming the columns, which I did with this code:
[df.resample('15min').bfill() for df in lst]  # this works for all, but I want it only for df3 & df4 - code 1
for df in lst[2:4]:
    df.resample('15min').bfill()  # this runs but does nothing - code 2
for df in lst[0:2]:  # same column name for df1, df2 - code 3
    df.columns = ['heat']
for df in lst[2:4]:
    df.columns = ['energy']  # same column name for df3, df4 - code 4
Do I need to make an object to save the values from the second snippet? And is there a better way to combine snippets 1, 3 and 4 into a one-liner while doing the slicing?
This is the same as when I want to divide every df by a value: I have to use a list comprehension, because the plain for loop doesn't work.
lst = [x/1000 for x in lst] #this works
for x in lst:
    x.values / 1000  # this doesn't
If you want to use a list comprehension for resampling, you can do:
lst = [df.resample('15min').bfill() for df in lst]
which will not affect df1 and df2 since they are already 15-minutes spaced.
If you want to rename the columns in one loop:
for i, df in enumerate(lst):
    df.columns = ['heat'] if i < 2 else ['energy']
I don't think you can both resample and rename columns in a single list-comprehension.
It seems that you expect that operations are done in-place (here is a good description of the difference).
In a list comprehension, the operation returns a value that is stored in the resulting list. E.g.
in [x/1000 for x in lst], x/1000 is such an operation.
If you do not store the result of an operation somewhere it is not going to be available (retrievable):
for x in lst:
    x.values / 1000
The result of x.values / 1000 is not stored anywhere. However, you can place it in a different list, like this way:
another_list = []
for x in lst:
    another_list.append(x.values / 1000)
Then the results of the x.values / 1000 operation are available in another_list in the very same way as the original DataFrames are in lst.
Therefore, the solution to your issue is to store the resampled result back into the list:
for i in range(2, 4):
    lst[i] = lst[i].resample('15min').bfill()
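Putting the pieces together, a rough end-to-end sketch (the four DataFrames here are made-up stand-ins for your data):
import numpy as np
import pandas as pd
# Hypothetical example data: two 15-minute series and two hourly series
idx_15 = pd.date_range('2020-01-01', periods=8, freq='15min')
idx_1h = pd.date_range('2020-01-01', periods=2, freq='h')
df1 = pd.DataFrame({'a': np.arange(8)}, index=idx_15)
df2 = pd.DataFrame({'b': np.arange(8)}, index=idx_15)
df3 = pd.DataFrame({'c': np.arange(2)}, index=idx_1h)
df4 = pd.DataFrame({'d': np.arange(2)}, index=idx_1h)
lst = [df1, df2, df3, df4]
# Resample everything to 15 minutes (a no-op for df1/df2), store the result
# back into the list, and rename the columns in the same loop
for i, df in enumerate(lst):
    lst[i] = df.resample('15min').bfill()
    lst[i].columns = ['heat'] if i < 2 else ['energy']
# Concatenate column-wise; the indexes are aligned on the union of timestamps
combined = pd.concat(lst, axis=1)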
I'm new to Python pandas. I'm facing a problem finding the difference between two list-like columns within a pandas DataFrame.
Example Input with ; separator:
ColA; ColB
A,B,C,D; B,C,D
A,C,E,F; A,C,F
Expected Output:
ColA; ColB; ColC
A,B,C,D; B,C,D; A
A,C,E,F; A,C,F; E
What I want to do is similar to:
df['ColC'] = np.setdiff1d(df['ColA'].str.split(','), df['ColB'].str.split(','))
But it returns an error:
raise ValueError('Length of values does not match length of index',data,index,len(data),len(index))
Kindly advise
You can apply a lambda function on the DataFrame to find the difference like this:
import pandas as pd
# creating DataFrame (can also be loaded from a file)
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
# apply a lambda function to get the difference
df['ColC'] = df[['ColA','ColB']].apply(lambda x: [i for i in x[0] if i not in x[1]], axis=1)
Please notice: this will find the asymmetric difference ColA - ColB.
Result:
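For the example DataFrame above, df should then look roughly like this:
           ColA    ColB    ColC
0  [A, B, C, D]  [B, C]  [A, D]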
A lot faster way to do this would be a simple set subtract:
import pandas as pd
#Creating a dataframe
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
#Finding the difference
df['ColC'] = df['ColA'].map(set) - df['ColB'].map(set)
As the dataframe grows in number of rows, it becomes computationally pretty expensive to do any row-by-row operation.
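For completeness, a rough sketch that starts from the ';'-separated text in the question, splits the comma-joined strings, takes the set difference, and joins the result back into a string (StringIO stands in for the real file):
import pandas as pd
from io import StringIO
# Recreate the question's example input
raw = "ColA;ColB\nA,B,C,D;B,C,D\nA,C,E,F;A,C,F\n"
df = pd.read_csv(StringIO(raw), sep=';')
# Split the strings into lists, take the asymmetric difference, and join back
col_a = df['ColA'].str.split(',')
col_b = df['ColB'].str.split(',')
df['ColC'] = [','.join(sorted(set(a) - set(b))) for a, b in zip(col_a, col_b)]
print(df)
#       ColA   ColB ColC
# 0  A,B,C,D  B,C,D    A
# 1  A,C,E,F  A,C,F    E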
include_cols_path = sys.argv[5]
with open(include_cols_path) as f:
    include_cols = f.read().splitlines()
include_cols is a list of strings
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas()
df1 is a dataframe of a large file. I would like to only retain the columns with names that contain any of the strings in include_cols.
final_cols = [col for col in df1.columns.values if col in include_cols]
df1 = df1[final_cols]
Doing this in pandas is certainly a dupe. However, it seems that you are converting a spark DataFrame to a pandas DataFrame.
Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the spark side using select():
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()
You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in spark.
EDIT
I misunderstood your question originally. Based on your comments, I think this is what you're looking for:
selected_columns = [c for c in df1.columns if any([x in c for x in include_cols])]
pandas_df = df1.select(selected_columns).toPandas()
Explanation:
Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() function returns True if at least one of the conditions is True.
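To see what that comprehension does on plain strings (the names here are made up for illustration):
include_cols = ['temp', 'hum']
columns = ['temp_c', 'humidity', 'row_id']
selected = [c for c in columns if any(x in c for x in include_cols)]
print(selected)  # ['temp_c', 'humidity']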
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
For example:
df1 = pd.DataFrame(data=np.random.random((5, 5)), columns=list('ABCDE'))
include_cols = ['A', 'C', 'Z']
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
>>> A C
0 0.247271 0.761153
1 0.390240 0.050055
2 0.333401 0.823384
3 0.821196 0.929520
4 0.210226 0.406168
The '|'.join(include_cols) part creates an OR condition from all elements of the input list, A|C|Z in the example above. The condition is True if one of the elements is contained in a column name, which is checked with the .str.contains() method on the column names.
I am attempting to write a function that will sum a set of specified columns in a pandas DataFrame.
First, some background. The data each have a column with a name (e.g., "var") and a number next to that name in sequential order (e.g., "var1, var2"). I know I can sum, say, 5 columns together with the following code:
import pandas as pd
data = pd.read_csv('data_file.csv')
data['var_total'] = data.var1 + data.var2 + data.var3 + data.var4 + data.var5
However, this can be repetitive when you have var1-var30 to sum. I figured there must be some elegant solution to summing them more quickly, since the column names are predictable and uniform. Is there a function I can write or a built-in pandas function that will let me sum these more quickly?
You could do something like this:
data['var_total'] = data.filter(regex='var[0-9]+').sum(axis=1)
This will first filter your dataframe to retain only the columns whose names contain var followed by one or more digits. Then it will sum across the resulting filtered DataFrame, row by row.
I think you're looking for the filter method of DataFrame; you can pass it either a string or a regular expression, and it will just return the columns whose names match it. Then you can just call sum or whatever else you want on the resulting columns:
pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']})
othercol var1 var2
0 abc 1 2
pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']}).filter(like='var')
var1 var2
0 1 2
pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']}).filter(like='var').sum(axis=1)
0 3
By the way, note that I've called sum(axis=1) to return the row-wise sums; by default, sum returns the sum of each column.
Even if you are writing out all the column names, there are a couple of ways to do the sum a bit more elegantly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1': np.random.randint(1, 10, 10),
                   'var2': np.random.randint(1, 10, 10),
                   'var3': np.random.randint(1, 10, 10)})
# Use the sum method:
df[['var1', 'var2', 'var3']].sum(axis='columns')
# Use eval
df.eval('var1 + var2 + var3')
Then you can always use the standard Python tools for manipulating strings to put together the list of column names:
cols = ['var' + str(n) for n in range(1, 3 + 1)]
cols
Out[9]: ['var1', 'var2', 'var3']
df[cols].sum(axis='columns')