I have a pandas dataframe with a lot of actual columns (column names ending with _act) and projected columns (column names ending with _proj). Besides the actual and projected columns there is also a date column. Now I want to add an error column for each pair, placed beside its projected column (i.e., act, proj, error). Sample dataframe:
date a_act a_proj b_act b_proj .... z_act z_proj
2020 10 5 9 11 .... 3 -1
.
.
What I want:
date a_act a_proj a_error b_act b_proj b_error .... z_act z_proj z_error
2020 10 5 5 9 11 -2 .... 3 -1 4
.
.
What's the best way to achieve this, as I have a lot of actual and projected columns?
You could do:
df = df.set_index('date')
# create new columns
columns = df.columns[df.columns.str.endswith('act')].str.replace('act', 'error')
# compute differences
diffs = pd.DataFrame(data=df.values[:, ::2] - df.values[:, 1::2], index=df.index, columns=columns)
# concat
res = pd.concat((df, diffs), axis=1)
# reorder columns
res = res.reindex(sorted(res.columns), axis=1)
print(res)
Output
a_act a_error a_proj b_act b_error b_proj z_act z_error z_proj
date
2020 10 5 5 9 -2 11 3 4 -1
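If you need the exact ordering from the question (each _error column directly after its _proj column, rather than the alphabetical order produced by sorting), a minimal sketch is to insert the error columns into the original df in a loop; this assumes every actual/projected pair shares a common prefix:
# assumes each *_act column has a matching *_proj column with the same prefix
for act_col in [c for c in df.columns if c.endswith('_act')]:
    prefix = act_col[:-4]                      # 'a' from 'a_act'
    proj_col = prefix + '_proj'
    # insert the error column right after its *_proj column
    df.insert(df.columns.get_loc(proj_col) + 1,
              prefix + '_error',
              df[act_col] - df[proj_col])
df.insert modifies the frame in place, and get_loc is recomputed on each iteration, so the target positions stay correct as new columns are added.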
Related
I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table:
Since you can't insert the row and push the others back directly, a clever trick you can use is to create a new order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
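Another option, assuming the week labels in the index are strings (as the dtype = object in the question suggests), is to build the desired order explicitly and reindex:
# put '53' first, then every other week in its existing order
new_order = ['53'] + [w for w in df.index if w != '53']
df = df.reindex(new_order)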
I want to drop a group (all rows in the group) if the sum of values in a group is equal to a certain value.
The following code provides an example:
>>> from numpy.random import randn
>>> df = pd.DataFrame(randn(10,10), index=pd.date_range('20130101',periods=10,freq='T'))
>>> df = pd.DataFrame(df.stack(), columns=['Values'])
>>> df.index.names = ['Time', 'Group']
>>> df.head(12)
Values
Time Group
2013-01-01 00:00:00 0 0.541795
1 0.060798
2 0.074224
3 -0.006818
4 1.211791
5 -0.066994
6 -1.019984
7 -0.558134
8 2.006748
9 2.737199
2013-01-01 00:01:00 0 1.655502
1 0.376214
>>> df['Values'].groupby('Group').sum()
Group
0 3.754481
1 -5.234744
2 -2.000393
3 0.991431
4 3.930547
5 -3.137915
6 -1.260719
7 0.145757
8 -1.832132
9 4.258525
Name: Values, dtype: float64
So the question is: how can I, for instance, drop all group rows where the grouped sum is negative? In my actual dataset I want to drop the groups whose sum or mean is zero.
Using GroupBy + transform with sum, followed by Boolean indexing:
res = df[df.groupby('Group')['Values'].transform('sum') > 0]
From the pandas documentation, filtration seems more suitable:
df2 = df.groupby('Group').filter(lambda g: g['Values'].sum() >= 0)
(Old answer):
This worked for me:
# Change the index to *just* the `Group` column
df.reset_index(inplace=True)
df.set_index('Group', inplace=True)
# Then create a filter using the groupby object
gb = df['Values'].groupby('Group')
gb_sum = gb.sum()
val_filter = gb_sum[gb_sum >= 0].index
# Print results
print(df.loc[val_filter])
The condition on which you filter can be changed accordingly.
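For reference, here is a minimal self-contained sketch of the transform approach on made-up data, together with the variant for dropping groups whose sum is zero (as in the actual dataset):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 10),
                  index=pd.date_range('20130101', periods=10, freq='T'))
df = pd.DataFrame(df.stack(), columns=['Values'])
df.index.names = ['Time', 'Group']

# keep only the groups whose total is positive
res = df[df.groupby('Group')['Values'].transform('sum') > 0]

# to drop groups whose sum (or mean) is exactly zero instead
res_nonzero = df[df.groupby('Group')['Values'].transform('sum') != 0]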
I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique, you get n random subsamples of some data, compute the desired statistic (let's say the median) on each subsample, and then compare the values across subsamples. This gives you a measure of the variance of the obtained median over the dataset.
I implemented this in a class but reduced it to an MWE given by the following function:
import numpy as np
import pandas as pd
def bootstrap_median(df, n=5000, fraction=0.1):
    if isinstance(df, pd.DataFrame):
        columns = df.columns
    else:
        columns = None
    # Get the values as an ndarray
    arr = np.array(df.values)
    # Get the bootstrap sample through random permutations
    sample_len = int(len(arr)*fraction)
    if sample_len < 1:
        sample_len = 1
    sample = []
    for n_sample in range(n):
        sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
    sample = np.array(sample)
    # Compute the median on each sample
    temp = np.median(sample, axis=1)
    # Get the mean and std of the estimate across samples
    m = np.mean(temp, axis=0)
    s = np.std(temp, axis=0)/np.sqrt(len(sample))
    # Convert output to DataFrames if necessary and return
    if columns is not None:
        m = pd.DataFrame(data=m[None, ...], columns=columns)
        s = pd.DataFrame(data=s[None, ...], columns=columns)
    return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
data group
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
(9.5161999999999995, 0.056585753613431718)
So far so good because bootstrap_median returns a tuple of two elements. However, if I do this after a groupby
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1 (4.5356, 0.0409710449952)
2 (14.5006, 0.0403772204095)
The values inside each cell are tuples, as one would expect from apply. I can unpack the result (call it out) into two DataFrames by iterating over its elements like this:
index = []
data1 = []
data2 = []
for g, (m, s) in out.iteritems():
index.append(g)
data1.append(m)
data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
E[median]
group
1 4.5356
2 14.5006
In: dfs
Out:
std[median]
group
1 0.0409710449952
2 0.0403772204095
This is a bit cumbersome, and my question is whether there is a more pandas-native way to "unpack" a dataframe whose values are tuples into separate DataFrames.
This question seemed related but it concerned string regex replacements and not unpacking true tuples.
I think you need to change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1 m 4.480400
s 0.040542
2 m 14.565200
s 0.040373
Name: data, dtype: float64
Then it is possible to select by xs:
print (df1.xs('s', level=1))
group
1 0.040542
2 0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1 4.4804
2 14.5652
Name: data, dtype: float64
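A related option: unstacking the inner level of the MultiIndex turns the result into a DataFrame with one column per statistic, so both values are visible at once:
print (df1.unstack())
            m         s
group
1      4.4804  0.040542
2     14.5652  0.040373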
Also, if you need a one-column DataFrame, add to_frame:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
data
group
1 m 4.476800
s 0.041100
2 m 14.468400
s 0.040719
print (df1.xs('s', level=1))
data
group
1 0.041100
2 0.040719
print (df1.xs('m', level=1))
data
group
1 4.4768
2 14.4684
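If you would rather keep bootstrap_median returning a plain tuple, another sketch is to expand the tuple-valued result after the apply, by building a DataFrame from the list of tuples:
out = df.groupby('group')['data'].apply(bootstrap_median)
res = pd.DataFrame(out.tolist(), index=out.index,
                   columns=['E[median]', 'std[median]'])
dfm = res[['E[median]']]
dfs = res[['std[median]']]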
I have two csv files. Depending upon the value of a cell in csv file 1, I should be able to search for that value in a column of csv file 2 and get the corresponding value from another column in csv file 2.
I am sorry if this is very confusing. It will probably become clearer with an illustration.
CSV file 1
Car Mileage
A 8
B 6
C 10
CSV file 2
Score Mileage(Min) Mileage(Max)
1 1 3
2 4 6
3 7 9
4 10 12
5 13 15
And my desired output CSV file is something like this
Car Mileage Score
A 8 3
B 6 2
C 10 4
Car A is given a score of 3 based on its mileage of 8: look up that mileage in csv file 2, find which range it falls in, and take the corresponding score value for that range.
Any help will be appreciated
Thanks in advance
As of writing this, the current stable release is v0.21.
To read your files, use pd.read_csv -
df0 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
df0
Car Mileage
0 A 8
1 B 6
2 C 10
df1
Score Mileage(Min) Mileage(Max)
0 1 1 3
1 2 4 6
2 3 7 9
3 4 10 12
4 5 13 15
To find the Score, use pd.IntervalIndex by calling IntervalIndex.from_tuples. This should be really fast -
v = df1.loc[:, 'Mileage(Min)':'Mileage(Max)'].apply(tuple, 1).tolist()
idx = pd.IntervalIndex.from_tuples(v, closed='both') # you can also use `from_arrays`
df0['Score'] = df1.iloc[idx.get_indexer(df0.Mileage.values), 'Score'].values
df0
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
Other methods of creating an IntervalIndex are outlined here.
To write your result, use pd.DataFrame.to_csv -
df0.to_csv('file3.csv')
Here's a high level outline of what I've done here.
First, read in your CSV files
Use pd.IntervalIndex to build an interval index tree. So, searching is now logarithmic in complexity.
Use idx.get_indexer to find the index of each value in the tree
Use the index to locate the Score value in df1, and assign this back to df0. Note that I call .values, otherwise, the values will be misaligned when assigning back.
Write your result back to CSV
For more information on Intervalindex, take a look at this SO Q/A - Finding matching interval(s) in pandas Intervalindex
Note that IntervalIndex is new in v0.20, so if you have an older version, make sure you update your version with
pip install --upgrade pandas
You can use IntervalIndex, new in version 0.20.0+:
First create DataFrames by read_csv:
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
Create IntervalIndex by from_arrays:
s = pd.IntervalIndex.from_arrays(df2['Mileage(Min)'], df2['Mileage(Max)'], 'both')
print (s)
IntervalIndex([[1, 3], [4, 6], [7, 9], [10, 12], [13, 15]],
              closed='both',
              dtype='interval[int64]')
Select by the Mileage values via the IntervalIndex, and assign to the new column as an array (values), because otherwise the indices are not aligned and you get:
TypeError: incompatible index of inserted column with frame index
df1['Score'] = df2.set_index(s).loc[df1['Mileage'], 'Score'].values
print (df1)
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
And finally write to a file with to_csv:
df1.to_csv('file3.csv', index=False)
Setup
data = [(1,1,3), (2,4,6), (3,7,9), (4,10,12), (5,13,15)]
df = pd.DataFrame(data, columns=['Score','MMin','MMax'])
car_data = [('A', 8), ('B', 6), ('C', 10)]
car = pd.DataFrame(car_data, columns=['Car','Mileage'])
def find_score(x, df):
    result = -99
    for idx, row in df.iterrows():
        if x >= row.MMin and x <= row.MMax:
            result = row.Score
    return result
car['Score'] = car.Mileage.apply(lambda x: find_score(x, df))
Which yields
In [58]: car
Out[58]:
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
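Since the score ranges here are sorted and non-overlapping, a merge_asof sketch is another option (using the same Setup frames as above); it matches each Mileage to the closest MMin at or below it, then discards any row that overshoots MMax:
# both frames must be sorted on their join keys
car_sorted = car.sort_values('Mileage')
merged = pd.merge_asof(car_sorted, df[['Score', 'MMin', 'MMax']],
                       left_on='Mileage', right_on='MMin',
                       direction='backward')
# mileages falling in a gap between ranges get no score
merged.loc[merged['Mileage'] > merged['MMax'], 'Score'] = None
result = merged[['Car', 'Mileage', 'Score']]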
I have one dataframe (df1) like the following:
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
I have another very big dataframe (df2) which has a column named Absolute_Time. Absolute_Time has the same format as ATime of df1. So what I want to do is, for example, for all Absolute_Time values that lie in the range from row 0 to row 1 of ETime of df1, subtract row 0 of Difference of df1, and so on.
Here's an attempt to accomplish what you might be looking for, starting with:
print(df1)
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
Next, create a new DataFrame with random times within the range of df1:
from random import randrange
df2 = pd.DataFrame({'Absolute Time':[randrange(start=df1.ATime.iloc[0], stop=df1.ATime.iloc[-1]) for i in range(100)]})
df2 = df2.sort_values('Absolute Time').reset_index(drop=True)
np.searchsorted provides you with the index positions where df2 should be inserted in df1 (for the columns in question):
df2.index = np.searchsorted(df1.ATime.values, df2.loc[:, 'Absolute Time'].values)
Assigning the new index and merging produces a new DataFrame. Filling the missing Difference values forward allows subtracting in the next step:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='left').fillna(method='ffill').dropna().astype(int)
df['Absolute Time Adjusted'] = df['Absolute Time'].sub(df.Difference)
print(df.head())
ATime ETime Difference Absolute Time \
1 1444911144979 1715672 1444909429307 1444911018916
1 1444911144979 1715672 1444909429307 1444911138087
2 1444911285683 1856374 1444909429309 1444911138087
3 1444911432742 2003430 1444909429312 1444911303233
3 1444911432742 2003430 1444909429312 1444911359690
Absolute Time Adjusted
1 1589609
1 1708780
2 1708778
3 1873921
3 1930378
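As a hedged alternative to the searchsorted-plus-merge steps above, pd.merge_asof can align each Absolute Time with the last df1 row whose ATime is at or before it in a single call (the boundary handling differs slightly from the searchsorted version):
aligned = pd.merge_asof(df2, df1, left_on='Absolute Time', right_on='ATime',
                        direction='backward')
aligned['Absolute Time Adjusted'] = aligned['Absolute Time'].sub(aligned['Difference'])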