So I have an Excel file that looks like this:
Name R s l2 max_amplitude ref_amplitude
R_0.3_s_0.5_l2_0.1 0.3 0.5 0.1 1.45131445 1.45131445
R_0.3_s_0.5_l2_0.6 0.3 0.5 0.6 3.52145743 3.52145743
...
R_1.1_s_2.0_l2_1.6 1.1 2.0 1.6 5.07415199 5.07415199
R_1.1_s_2.0_l2_2.1 1.1 2.0 2.1 5.78820419 5.78820419
R_1.1_s_2.0_l2_2.6 1.1 2.0 2.6 5.84488964 5.84488964
R_1.1_s_2.0_l2_3.1 1.1 2.0 3.1 6.35387516 6.35387516
Using the pandas module I import the data into a data frame:
import pandas as pd
df = pd.read_excel("output_var.xlsx", header=0)
Everything seems to be OK: typing
df
at the command line produces:
R s l2 max_amplitude ref_amplitude
0 0.3 0.5 0.1 1.451314 1.451314
1 0.3 0.5 0.6 3.521457 3.521457
2 0.3 0.5 1.1 4.770226 4.770226
...
207 1.1 2.0 2.1 5.788204 5.788204
208 1.1 2.0 2.6 5.844890 5.844890
209 1.1 2.0 3.1 6.353875 6.353875
[210 rows x 5 columns]
Now I need to do some calculations based on the value of R, so I need to slice the data frame. Column R contains 5 different values: 0.3, 0.5, 0.7, 0.9 and 1.1. Each of these 5 values has 42 rows (5 × 42 = 210).
To remove the duplicates from "R" I try
set(df.R)
which returns:
{0.29999999999999999,
0.5,
0.69999999999999996,
0.89999999999999991,
0.90000000000000002,
1.1000000000000001}
Aside from representing 0.3 as 0.29999… etc., there are 6 (instead of 5) different values for R. It seems that sometimes 0.9 gets represented as 0.89999999999999991 and sometimes as 0.90000000000000002.
This can be (partially) solved with:
set(round(df.R,1))
which (at least) returns 5 values:
{0.29999999999999999,
0.5,
0.69999999999999996,
0.90000000000000002,
1.1000000000000001}
But now I come to the dangerous part. If I want to do the slicing according to the known values of R (0.3, 0.5, 0.7, 0.9 and 1.1), then
len(df[df.R==0.3])
returns
42
and
len(df[df.R==0.9])
returns
41
One row gets lost! (Remember, there are 42 rows for each of the 5 R values, giving a total of 210 rows in the file.)
How to deal with this problem?
Don't check floats for equality. There are well-known issues with floating-point arithmetic (check here for example).
Instead, check for closeness (really, really close):
import numpy as np
len(df[np.isclose(df.R, 0.9)])
Normally, if you don't convert the Series to a set, pandas handles that for you. So if you want to drop duplicates, I'd suggest using pandas methods:
df.drop_duplicates('R')
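If you do need to slice by the nominal R values themselves, one option (my suggestion, not from the answer above) is to work from a rounded copy of the column, so that 0.89999… and 0.90000… both collapse to a single 0.9:
import numpy as np
import pandas as pd

df = pd.read_excel("output_var.xlsx", header=0)

# round R once to the one-decimal precision used in the Name column
r_rounded = df.R.round(1)

for r in sorted(r_rounded.unique()):
    subset = df[np.isclose(df.R, r)]  # or equivalently df[r_rounded == r]
    print(r, len(subset))             # each of the 5 groups should have 42 rows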
I have tried running np.random.normal(1.75,0.20,1000) multiple times and it always returns only positive values in the array.
Why does it always return only positive values? Isn't it supposed to contain some negative values too?
In order to see a negative number with a mean of 1.75 and a sigma of 0.20, you would need to draw a number that is at least 8.75 sigma below the mean.
The probability of seeing a number at least 7 sigma away from the mean (counting both tails) is about 1 in 390682215445.
And the probability for 8.75 sigma is far smaller still.
You are making only 1000 tries.
For probabilities: see here
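To put numbers on this, here is a small sketch (my addition, assuming SciPy is available) that computes the tail probabilities directly:
from scipy.stats import norm

# probability of a draw at least 7 sigma from the mean, counting both tails
p7 = 2 * norm.sf(7)
print(p7, 1 / p7)      # ~2.56e-12, i.e. roughly 1 in 3.9e11

# probability that a draw from N(1.75, 0.20) is negative:
# a single lower tail, 8.75 sigma below the mean
p_neg = norm.sf(8.75)  # same as norm.cdf(-8.75)
print(p_neg)           # on the order of 1e-18

# expected number of negative values among 1000 draws
print(1000 * p_neg)    # utterly negligible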
The standard deviation you have inserted is such that most (99.7%) of the numbers that will be drawn will be greater than (1.75 - 3*0.20) = 1.15 and smaller than (1.75 + 3*0.20) = 2.35.
Look up this empirical rule:
Put simply: 99.7% of the values lie within 3 standard deviation from the mean.
Negatives are very unlikely with those parameters; they're just too far away. When I round to one digit, even for a million (rather than your 1000) values I get a distribution like this:
0.8 4
0.9 32
1.0 208
1.1 1099
1.2 4991
1.3 16614
1.4 43976
1.5 92127
1.6 149513
1.7 191929
1.8 191118
1.9 150418
2.0 91883
2.1 43602
2.2 16344
2.3 4777
2.4 1142
2.5 186
2.6 35
2.7 2
Code (Try it online!):
import numpy as np
from collections import Counter
a = np.random.normal(1.75,0.20,1000000)
ctr = Counter(round(x, 1) for x in a)
for x, count in sorted(ctr.items()):
    print(x, count)
I have a data frame with columns containing values for different countries, and I would like a function that shifts the rows of each column independently, leaving the dates alone. For each country I have a "profile shifter" which is used to shift that country's column.
If the profile shifter for a country is -3, that country's column is shifted 3 rows downwards, and the last 3 values wrap around to become the first 3 values of the column. If a profile shifter is +3, the column is shifted upwards instead, and the values pushed off the top wrap around to become the last values in that column.
After the rows have been shifted, instead of having the default NaN values appear in the empty cells, I want the preceding or succeeding values to wrap around and fill them. The function should return a data frame.
Sample Dataset:
Datetime ARG AUS BRA
1/1/2050 0.00 0.1 2.1 3.1
1/1/2050 1.00 0.2 2.2 3.2
1/1/2050 2.00 0.3 2.3 3.3
1/1/2050 3.00 0.4 2.4 3.4
1/1/2050 4.00 0.5 2.5 3.5
1/1/2050 5.00 0.6 2.6 3.6
Country Profile Shifters:
Country ARG AUS BRA
UTC -3 -2 4
Desired Output:
Datetime ARG AUS BRA
1/1/2050 0.00 0.3 2.4 3.4
1/1/2050 1.00 0.4 2.5 3.5
1/1/2050 2.00 0.5 2.1 3.1
1/1/2050 3.00 0.1 2.2 3.2
1/1/2050 4.00 0.2 2.3 3.3
This is what I have been trying for days now, but it's not working:
cols = df1.columns
for i in cols:
    if i == 'ARG':
        x = df1.iat[0:3, 0]
        df1['ARG'] = df1.ARG.shift(periods=-3)
        df1['ARG'].replace(to_replace=np.nan, x)
    elif i == 'AUS':
        df1['AUS'] = df1.AUS.shift(periods=2)
    elif i == 'BRA':
        df1['BRA'] = df1.BRA.shift(periods=1)
    else:
        pass
This works but is far from being 'good pandas'. I hope that someone will come along and give a nicer, cleaner 'more pandas' answer.
Imports used:
import pandas as pd
import datetime as datetime
Offset data setup:
offsets = pd.DataFrame({"Country" : ["ARG", "AUS", "BRA"], "UTC Offset" : [-3, -2, 4]})
Produces:
Country UTC Offset
0 ARG -3
1 AUS -2
2 BRA 4
Note that the timezone offset data I've used here is in a slightly different structure from the example data (country codes by rows, rather than columns). Also worth pointing out that Australia and Brazil have several time zones, so there is no one single UTC offset which applies to those whole countries (only one in Argentina though).
Sample data setup:
sampleDf = pd.DataFrame()
for i in range(6):
    dt = datetime.datetime(2050, 1, 1, i)
    sampleDf = sampleDf.append({'Datetime' : dt,
                                'ARG' : i / 10,
                                'AUS' : (i + 10) / 10,
                                'BRA' : (i + 20) / 10},
                               ignore_index=True)
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.0 2.0
1 2050-01-01 01:00:00 0.1 1.1 2.1
2 2050-01-01 02:00:00 0.2 1.2 2.2
3 2050-01-01 03:00:00 0.3 1.3 2.3
4 2050-01-01 04:00:00 0.4 1.4 2.4
5 2050-01-01 05:00:00 0.5 1.5 2.5
Code to shift cells:
for idx, offsetData in offsets.iterrows(): # See note 1
countryCode = offsetData["Country"]
utcOffset = offsetData["UTC Offset"]
dfRowCount = sampleDf.shape[0]
wrappedOffset = (dfRowCount + utcOffset) if utcOffset < 0 else \
(-dfRowCount + utcOffset) # See note 2
countryData = sampleDf[countryCode]
sampleDf[countryCode] = pd.concat([countryData.shift(utcOffset).dropna(),
countryData.shift(wrappedOffset).dropna()]).sort_index() # See note 3
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.3 1.2 2.2
1 2050-01-01 01:00:00 0.4 1.3 2.3
2 2050-01-01 02:00:00 0.5 1.4 2.4
3 2050-01-01 03:00:00 0.0 1.5 2.5
4 2050-01-01 04:00:00 0.1 1.0 2.0
5 2050-01-01 05:00:00 0.2 1.1 2.1
Notes
Iterating over rows in pandas like this (to me) indicates 'you've run out of pandas skill, and are kind of going against the design of pandas'. What I have here works, but it won't benefit from any/many of the efficiencies of using pandas, and would not be appropriate for a large dataset. Using itertuples rather than iterrows is supposed to be quicker, but I think neither is great, so I went with what seemed most readable for this case.
This solution does two shifts: first the data is shifted by the timezone offset, then everything else is shifted to fill in what would otherwise be NaN holes left by the first shift. This line calculates the size of that second shift.
Finally, the results of the two shifts are concatenated together (after dropping any NaN values from both of them) and assigned back to the original (unshifted) column. sort_index puts them back in order based on the index, rather than having the two shifted parts one-after-another.
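As a possible cleaner alternative (my sketch, not part of the answer above): the wrap-around shift being built here out of two shift calls is exactly what numpy.roll does in one step, so the body of the loop can collapse to a single assignment per country:
import numpy as np

# run against a freshly built sampleDf and the offsets frame from above
for _, offsetData in offsets.iterrows():
    country = offsetData["Country"]
    utcOffset = offsetData["UTC Offset"]
    # np.roll(values, k) moves every element k places towards the end and wraps
    # the overflow to the front, matching shift(k) plus the wrapped fill above
    sampleDf[country] = np.roll(sampleDf[country].to_numpy(), utcOffset)
This still iterates, but only once per country, and the per-column work is a single vectorised roll rather than two shifts, two dropnas, a concat and a sort.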
Suppose you have numerical data for some function z = f(x, y) saved in a pandas dataframe, where x is the index values, y is the column values, and the dataframe is populated with the z data. For example:
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
Is there a simple pandas command, or maybe a one-liner chaining a few simple commands, which returns the (x, y) values corresponding to some attribute of the data, specifically in my case min(z)? In the example data I'd be looking for (1.0, 0.6).
I'm really just hoping there's an answer that doesn't involve parsing the data out into some other structure, because sure, I could just linearize the data into a numpy array and map the array index back to (x, y). But if there's something cleaner/more elegant that I'm simply not finding, I'd love to learn about it.
Using pandas.DataFrame.idxmin & pandas.Series.idxmin
import pandas as pd
# df view
0.0 0.1 0.2 0.3 0.4 0.5 0.6
1.0 0.0 -0.002961 -0.005921 -0.008883 -0.011845 -0.014808 -0.017772
1.1 0.0 -0.002592 -0.005184 -0.007777 -0.010371 -0.012966 -0.015563
1.2 0.0 -0.002084 -0.004168 -0.006253 -0.008340 -0.010428 -0.012517
# min column
min_col_name = df.min().idxmin()
# min column index if needed
min_col_idx = df.columns.get_loc(min_col_name)
# min row index
min_row_idx = df[min_col_name].idxmin()
another option:
(df.min(axis=1).idxmin(), df.min().idxmin())
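For reference, here is a self-contained check against the sample frame above (reconstructed by hand):
import pandas as pd

df = pd.DataFrame(
    [[0.0, -0.002961, -0.005921, -0.008883, -0.011845, -0.014808, -0.017772],
     [0.0, -0.002592, -0.005184, -0.007777, -0.010371, -0.012966, -0.015563],
     [0.0, -0.002084, -0.004168, -0.006253, -0.008340, -0.010428, -0.012517]],
    index=[1.0, 1.1, 1.2],
    columns=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
)

# (x, y) location of min(z): row label of the smallest row-minimum,
# column label of the smallest column-minimum
print((df.min(axis=1).idxmin(), df.min().idxmin()))  # (1.0, 0.6)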
I have a pandas dataframe containing about 2 million rows, which looks like the following example:
ID V1 V2 V3 V4 V5
12 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
01 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
02 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
07 3.8 2.9 1.1 1.6 1.5
19 0.9 1.2 1.8 2.6 9.0
19 0.5 0.4 0.6 0.7 1.8
06 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
18 0.9 1.2 1.8 2.6 9.0
I want to create three subsets of this data such that the sets of IDs are mutually exclusive, and each subset includes all of the rows for its IDs from the main dataframe.
As of now, I am collecting the unique IDs as a list and randomly shuffling it. Using this list, I'm selecting all rows from the dataframe whose ID belongs to a fraction of the list.
import numpy as np
import random
distinct = list(set(df.ID.values))
random.shuffle(distinct)
X1, X2 = distinct[:1000000], distinct[1000000:2000000]
df_X1 = df.loc[df['ID'].isin(list(X1))]
df_X2 = df.loc[df['ID'].isin(list(X2))]
This works as expected for smaller data; however, for larger data the run doesn't complete even after many hours. Is there a more efficient way to do this? I'd appreciate any responses.
I think the slowdown comes from the list passed to isin inside the .loc slice. I tried a different approach using numpy and a boolean index that seems to double the speed.
First, set up the dataframe. I wasn't sure how many unique IDs you had, so I selected 50. I was also unsure how many columns, so I arbitrarily chose 10,000 columns and rows.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 10000))
ID = np.random.randint(0, 50, 10000)
df['ID'] = ID
Then I try to use mostly numpy arrays and avoid the nested list using a boolean index.
# Create a numpy array from the ID columns
a_ID = np.array(df['ID'])
# use the numpy unique method to get a unique array
# a = np.unique(np.array(df['ID']))
a = np.unique(a_ID)
# shuffle the unique array
np.random.seed(100)
np.random.shuffle(a)
# cut the shuffled array in half
X1 = a[0:25]
# create a boolean mask
mask = np.isin(a_ID, X1)
# set the index to the mask
df.index = mask
df.loc[True]
When I ran your code on my sample df, it took 817 ms; the code above runs in 445 ms.
Not sure if this helps. Good question, thanks.
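Another angle (my sketch, not from the answer above): instead of calling isin once per subset, assign every unique ID to a partition up front and do a single map plus groupby; the subsets are mutually exclusive by construction. The round-robin split of IDs below is an assumption about how you want them divided:
import numpy as np
import pandas as pd

# assuming df has an 'ID' column as in the question
rng = np.random.default_rng(0)

unique_ids = df['ID'].unique()
rng.shuffle(unique_ids)

# map each ID to one of 3 partitions, round-robin over the shuffled IDs
partition_of = pd.Series(np.arange(len(unique_ids)) % 3, index=unique_ids)

# one vectorised lookup + groupby instead of repeated isin calls
parts = {k: g for k, g in df.groupby(df['ID'].map(partition_of))}
df_X1, df_X2, df_X3 = parts[0], parts[1], parts[2]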
I want to apply a function f to many slices within each row of a pandas DataFrame.
For example, DataFrame df would look as such:
df = pandas.DataFrame(np.round(np.random.normal(size=(2,49)), 2))
So, I have a dataframe of 2 rows by 49 columns, and my function needs to be applied to every consecutive slice of 7 data points in both rows, so that the resulting dataframe has the same shape as the input dataframe.
I was doing it as such:
df1=df.copy()
df1.T[:7], df1.T[7:14], df1.T[14:21],..., df1.T[43:50] = f(df.T.iloc[:7,:]), f(df.T.iloc[7:14,:]),..., f(df.T.iloc[43:50,:])
As you can see, that's a whole lot of redundant code, so I would like to create a loop or something that applies the function to every subsequent slice of 7 data points.
I have no idea how to approach this. Is there a more elegant way to do this?
I thought I could maybe use a transform function for this, but in the pandas documentation I can only see it applied to a dataframe that has been grouped, not to slices of the data.
Hopefully this is clear.. let me know.
Thank you.
To avoid redundant code you can just do a loop like this:
STEP = 7
for i in range(0, len(df1.T), STEP):
    df1.T[i:i+STEP] = f(df1.T[i:i+STEP])  # could also do an apply here somehow, depending on what you want to do
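A variant of the same loop (my sketch, with a made-up placeholder f for illustration) that slices columns by position with iloc instead of going through the transpose, so the assignment unambiguously writes back into the frame:
import numpy as np
import pandas as pd

# placeholder function, purely for illustration: centre each block on zero
def f(block):
    return block - block.values.mean()

df = pd.DataFrame(np.round(np.random.normal(size=(2, 49)), 2))
df1 = df.copy()

STEP = 7
for start in range(0, df.shape[1], STEP):
    df1.iloc[:, start:start + STEP] = f(df.iloc[:, start:start + STEP])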
Don't Repeat Yourself
You don't provide any examples of your desired output, so here's my best guess at what you want...
If your data are lumped into groups of seven, then you need to come up with a way to label them as such.
In other words, if you want to work with arbitrary arrays, use numpy. If you want to work with labeled, meaningful data and its associated metadata, then use pandas.
Also, pandas works more efficiently when operating on (and displaying!) row-wise data. So that means storing data long (49x2), not wide (2x49).
Here's an example of what I mean, using a small wide-ish dataset with grouping labels assigned to the columns ahead of time. Let's say you're reading in the data as follows:
import pandas
import numpy
from io import StringIO # python 3
# from StringIO import StringIO # python 2
datafile = StringIO("""\
A,B,C,D,E,F,G,H,I,J
0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9
2.0,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9
""")
df = pandas.read_csv(datafile)
print(df)
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
You could add a cluster value to the columns, like so:
cluster_size = 3
col_vals = []
for n, col in enumerate(df.columns):
    cluster = int(n / cluster_size)
    col_vals.append((cluster, col))

df.columns = pandas.Index(col_vals)
print(df)
0 1 2 3
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
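As an aside (my suggestion, not part of the answer): the same clustered columns can be built without the explicit loop using pandas.MultiIndex.from_arrays, starting from the original single-level columns:
# equivalent to the loop above; run instead of it, on the original columns
cluster_size = 3
clusters = [n // cluster_size for n in range(df.shape[1])]
df.columns = pandas.MultiIndex.from_arrays([clusters, df.columns])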
By default, the groupby method tries to group rows, but you can group columns (I just figured this out) by passing axis=1 when you create the object. So the sum of each cluster of columns for each row is as follows:
df.groupby(axis=1, level=0).sum()
0 1 2 3
0 0.3 1.2 2.1 0.9
1 3.3 4.2 5.1 1.9
2 6.3 7.2 8.1 2.9
But again, if all you're doing is more "global" operations, there's no need for any of this.
In-place column cluster operation
df[0] *= 5
print(df)
0 1 2 3
A B C D E F G H I J
0 0.0 0.5 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 5.0 5.5 6.0 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 10.0 10.5 11.0 2.3 2.4 2.5 2.6 2.7 2.8 2.9
In-place row operation
df.T[0] += 20
0 1 2 3
A B C D E F G H I J
0 20.0 20.5 21.0 20.3 20.4 20.5 20.6 20.7 20.8 20.9
1 5.0 5.5 6.0 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 10.0 10.5 11.0 2.3 2.4 2.5 2.6 2.7 2.8 2.9
Operate on the entire dataframe at once
def myFunc(x):
    return 5 + x**2
myFunc(df)
0 1 2 3
A B C D E F G H I J
0 405.00 425.25 446.00 417.09 421.16 425.25 429.36 433.49 437.64 441.81
1 30.00 35.25 41.00 6.69 6.96 7.25 7.56 7.89 8.24 8.61
2 105.00 115.25 126.00 10.29 10.76 11.25 11.76 12.29 12.84 13.41
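And to tie this back to the original question of applying some function f to each cluster of columns while keeping the frame's shape, one possible pattern (my sketch, with a made-up per-cluster f) is to apply the function group by group and concatenate the pieces back together:
# hypothetical per-cluster function: subtract each cluster's overall mean
def f(block):
    return block - block.values.mean()

result = pandas.concat(
    [f(group) for _, group in df.groupby(axis=1, level=0)],
    axis=1,
)
print(result)  # same shape and labels as df, transformed cluster by cluster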