I would like to change certain DataFrame values only if a condition is met a given number n of consecutive times.
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12,0]=-40
df.iloc[10:12,1]=-40
Which gives me this DF:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 -1.045209
4 40.000000 0.598657 -1.268399
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 1.744822
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 1.416020
10 -1.337494 -40.000000 -1.195780
11 -0.703669 -40.000000 0.657519
12 -40.000000 -0.288235 -0.840145
13 -1.084869 -0.298030 -1.592004
14 -0.617568 -1.046210 -0.531523
Now, if I do
a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')
I get
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 NaN 0.598657 NaN
5 NaN 0.442297 -0.016363
6 NaN -0.316817 NaN
7 NaN 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
which is fair. However, I would like to replace the values with NaN only when the condition is met by at most 2 consecutive entries (so I can interpolate later). For example, I want the result to be
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
Apparently there's no ready-to-use method to do this. The solution I found that most closely resembles my problem was this one, but I couldn't make it work for me.
Any ideas?
See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum() which breaks the data into contiguous runs of the same value.
In [23]: cond = abs(df) > abs(df.std())
In [24]: for c in df.columns:
    ...:     grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
    ...:     fill = (df.groupby(grouper)[c].transform('size') <= 2)
    ...:     df.loc[fill, c] = np.nan
In [25]: df
Out[25]:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.
The cond[c] != cond[c].shift(1) compares the current row's condition to the previous row's. This has the effect of 'marking' with the value True each row where a run of identical values begins.
The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally, the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using False == 0).
So now you have numbered groups of contiguous values that meet your condition; the next step performs a groupby to count how many values are in each group (transform('size')).
Finally, a new bool condition is used to assign missing values to those groups with 2 or fewer values meeting the condition.
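To see the run-numbering trick in isolation, here is a minimal sketch on a made-up boolean series. The names starts, run_id, run_len and short_run are mine, not from the answer above. It also ANDs the final mask with cond, which is an extra safeguard I am assuming is wanted, so that rows not meeting the condition are never blanked even when group 0 happens to be small.

```python
import pandas as pd

# Hypothetical toy condition: True marks values that exceed the threshold.
cond = pd.Series([False, True, True, False, True, True, True, False, True])

# True wherever the value differs from the previous row, i.e. where a run starts.
starts = cond != cond.shift(1)

# The cumulative sum numbers the runs 1, 2, 3, ...; multiplying by cond
# resets every run where the condition is False to 0.
run_id = starts.cumsum() * cond

# Length of the run that each row belongs to.
run_len = run_id.groupby(run_id).transform('size')

# Rows that meet the condition and sit in a run of at most 2 entries.
short_run = cond & (run_len <= 2)

print(run_id.tolist())
print(short_run.tolist())
```

Here the run of three True values is left alone, while the run of two and the single trailing True are selected.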
I would like to check whether a list of orchids (input_df) contains orchid species that appear on one of six lists. I import these lists from an xlsx file with six sheets as a dictionary containing the six lists as DataFrames (orchid_checklists).
import pandas as pd
orchid_checklists = pd.read_excel('\\orchid_checklists.xlsx', sheet_name=None)
input_df = pd.read_excel('\\input.xlsx')
input_df['Orchideen-Checkliste'] = ''
With the following for loop with if condition I am trying to add the name of the Checklist into the field corresponding to the item in input_df['Input Name'] in the column ['Orchideen-Checkliste'] to visualize to what checklist one should refer.
for item in input_df['Input Name']:
    for list_name, sheet in orchid_checklists.items():
        genus = item.split(' ')[0]
        if genus in sheet['referenced'].values:
            input_df['Orchideen-Checkliste'] = list_name
        else:
            pass
In my test input list there is one species called "Bulbophyllum pachyrachis" that should be found. Unfortunately the name of the list "CL_Bulbophyllum" is put into all rows. I can't figure out why.
In the next step I want to check whether the species name is also in the column "exceptions" of any of my checklists. If it is, that checklist is not the correct one. In these cases the full species name (e.g. "Aerangis ellisii", see CL_App_I and CL_III below) is found in the column "referenced" of another list.
I haven't started coding this exception, because I am still stuck on the part before, but any pointers on how to approach this are very welcome.
This is the input data:
Input Name Orchideen-Checkliste
0 Sobralia madisonii
1 Stelis cocornaensis
2 Stelis gelida
3 Braemia vittata
4 Brassia escobariana
5 Aspasia silvana
6 Bulbophyllum maximum
7 Bulbophyllum pachyrachis
8 Chondroscaphe amabilis
9 Dresslerella hispida
10 Elleanthus sodiroi
11 Maxillaria mathewsii
orchid_checklists:
CL_III
referenced exceptions
0 Aerangis Aerangis ellisii
1 Angraecum NaN
2 Ascocentrum NaN
3 Bletilla NaN
4 Brassavola NaN
5 Calanthe NaN
6 Catasetum NaN
7 Miltonia NaN
8 Miltoniopsis NaN
9 Renanthera NaN
10 Renantherella NaN
11 Rhynchostylis NaN
12 Rossioglossum NaN
13 Vanda NaN
14 Vandopsis NaN
CL_App_I
referenced exceptions
0 Paphiopedilum NaN
1 Phragmipedium NaN
2 Aerangis ellisii NaN
3 Cattleya jongheana NaN
4 Cattleya lobata NaN
5 Dendrobium cruentum NaN
6 Mexipedium xerophyticum NaN
7 Peristeria elata NaN
8 Renanthera imshootiana NaN
CL_Bulbophyllum
referenced exceptions
0 Acrochaene NaN
1 Bulbophyllum NaN
2 Chaseella NaN
3 Codonosiphon NaN
4 Drymoda NaN
5 Monomeria NaN
6 Monosepalum NaN
7 Pedilochilus NaN
8 Succoglossum NaN
9 Sunipia NaN
10 Trias NaN
Thank you in advance for your help!
input_df['Orchideen-Checkliste'] = list_name
assigns the value to every row of that column, because no row indexer is specified.
Without changing your process too much: enumerate the items in input_df['Input Name'] when iterating and use the enumeration to specify the row for the assignment (this relies on input_df having the default 0, 1, 2, ... index, as in your example).
for index, item in enumerate(input_df['Input Name']):
    for list_name, sheet in orchid_checklists.items():
        genus = item.split(' ')[0]
        if genus in sheet['referenced'].values:
            input_df.loc[index, 'Orchideen-Checkliste'] = list_name
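For the exceptions part of the question, one possible approach is sketched below. It is only an illustration under my own assumptions: the helper find_checklist is hypothetical, the checklists are cut-down stand-ins for the real sheets, and "first matching checklist wins" is a rule I picked, not something stated in the question. A checklist is skipped when the full species name appears in its "exceptions" column, and otherwise matches on either the full name or the genus in "referenced".

```python
import pandas as pd

# Hypothetical miniature versions of the checklists from the question.
orchid_checklists = {
    'CL_III': pd.DataFrame({'referenced': ['Aerangis', 'Vanda'],
                            'exceptions': ['Aerangis ellisii', None]}),
    'CL_App_I': pd.DataFrame({'referenced': ['Paphiopedilum', 'Aerangis ellisii'],
                              'exceptions': [None, None]}),
    'CL_Bulbophyllum': pd.DataFrame({'referenced': ['Bulbophyllum'],
                                     'exceptions': [None]}),
}


def find_checklist(name, checklists):
    """Return the first checklist that covers `name`, or '' if none does."""
    genus = name.split(' ')[0]
    for list_name, sheet in checklists.items():
        referenced = set(sheet['referenced'].dropna())
        exceptions = set(sheet['exceptions'].dropna())
        if name in exceptions:
            continue  # explicitly excluded from this checklist
        if name in referenced or genus in referenced:
            return list_name
    return ''


# Illustrative input names, not the full list from the question.
input_df = pd.DataFrame({'Input Name': ['Aerangis ellisii',
                                        'Aerangis citrata',
                                        'Bulbophyllum pachyrachis',
                                        'Stelis gelida']})
input_df['Orchideen-Checkliste'] = [
    find_checklist(name, orchid_checklists) for name in input_df['Input Name']
]
print(input_df)
```

Building the column in one assignment also sidesteps the row-indexer problem, since each row gets its own value.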
I am creating a long list of what seem to be tuples that I would like to later convert into a DataFrame, but there are certain common sequences of characters that prevent this from being possible. An example of a fraction of the output:
0,"GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
Name: 0, dtype: object"
Whereas the desired output would be:
GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
How can I get rid of the 0 and " at the start, and then the trash at the end past the TIME_ELAPSED_PERIOD? The int at the start and the one in the bottom row increases by 1 until the end of my program, which could likely go upwards of around 320,000, so I will need the code to be able to adapt for a range of int values. I think it would be easiest to do this after the creation of my list, so it shouldn't be necessary for me to show you any of my code. Just a systematic manipulation of characters should do the trick. Thanks!
Provided that your input data is in the form of a list, you can try the following to meet your requirements:
inputlist = Your_list_to_be_corrected  # Assign your input list here
# Now, remove the rows in the list that have the format "Name: 0, dtype: object"
inputlist = [x for x in inputlist if "dtype: object" not in x]
# Now, correct the rows containing GAME_ID by removing the int number and special characters
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if "GAME_ID" in element:
        inputlist[index] = 'GAME_ID' + element.split(sep, 1)[1]
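If each list element is instead one whole multi-line string (an assumption on my part, since the question does not show how the list is built), a regular-expression variant could look like the sketch below. The function name clean_record and the shortened sample record are hypothetical.

```python
import re

# Hypothetical, shortened record in the shape shown in the question.
raw = '0,"GAME_ID 21900001\nEVENTNUM 2\nTIME_ELAPSED_PERIOD 0\nName: 0, dtype: object"'


def clean_record(text):
    # Drop a leading integer, comma and opening quote, e.g. '0,"' or '319999,"'.
    text = re.sub(r'^\d+,"', '', text)
    # Drop the trailing 'Name: <int>, dtype: object"' line, newline included.
    text = re.sub(r'\n?Name: \d+, dtype: object"?\s*$', '', text)
    return text


cleaned = clean_record(raw)
print(cleaned)
```

Because both patterns use \d+, the same code handles any row number, whether 0 or 319999.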
I have the following dataframe (called data_coh):
roi mag phase coherence
0 1 0.699883 0.0555903 NaN
1 2 0.640482 0.1053 NaN
2 3 0.477865 1.14926 NaN
3 4 0.128119 2.28403 NaN
4 5 0.563046 2.53091 NaN
5 6 0.58869 0.94647 NaN
6 7 0.428383 1.13915 NaN
7 8 0.164036 1.95959 NaN
8 9 0.27912 3.07456 NaN
9 10 0.244237 2.78111 NaN
10 11 0.696592 2.61011 NaN
11 12 0.237346 3.01836 NaN
For every row, I want to calculate its coherence value as follows (note that I want
to use the imaginary unit j):
import math
import cmath
for roin, val in enumerate(data_coh):
    data_coh.loc[roin,'coherence'] = mag*math.cos(phase) + mag*math.sin(phase)*j
First of all, it is not able to perform the computation (which is calculating a complex number based on magnitude and phase). J is a complex unit (from cmath).
But in addition, even when j is left out, the allocation to the rows is not done
correctly. Why is that, and how can it be corrected?
No need to iterate or to import math or cmath, just pandas and numpy:
import pandas as pd
import numpy as np
# 1j is Python's literal for the imaginary unit; the imaginary part is mag*sin(phase)
df['coherence'] = df['mag'] * (np.cos(df['phase']) + 1j*np.sin(df['phase']))
# Result
df
roi mag phase coherence
0 1 0.699883 0.05559 0.698802+0.038887j
1 2 0.640482 0.10530 0.636934+0.067318j
2 3 0.477865 1.14926 0.195525+0.436033j
3 4 0.128119 2.28403 -0.083826+0.096890j
4 5 0.563046 2.53091 -0.461279+0.322866j
5 6 0.588690 0.94647 0.344119+0.477638j
6 7 0.428383 1.13915 0.179221+0.389091j
7 8 0.164036 1.95959 -0.062182+0.151794j
8 9 0.279120 3.07456 -0.278493+0.018696j
9 10 0.244237 2.78111 -0.228539+0.086149j
10 11 0.696592 2.61011 -0.600502+0.353041j
11 12 0.237346 3.01836 -0.235546+0.029175j
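The formula in the question, mag*cos(phase) + mag*sin(phase)*j, is the polar form mag * e^(j*phase), so it can also be written with np.exp. A minimal sketch, using only the first two rows of data_coh as sample data and a variable named expected that I introduce purely as a cross-check:

```python
import numpy as np
import pandas as pd

# First two rows of the question's data, used as a small sample.
data_coh = pd.DataFrame({'roi': [1, 2],
                         'mag': [0.699883, 0.640482],
                         'phase': [0.0555903, 0.1053]})

# Polar form: mag * exp(j * phase) == mag*cos(phase) + j*mag*sin(phase).
data_coh['coherence'] = data_coh['mag'] * np.exp(1j * data_coh['phase'])

# Cross-check against the explicit cos/sin formula from the question.
expected = (data_coh['mag'] * np.cos(data_coh['phase'])
            + 1j * data_coh['mag'] * np.sin(data_coh['phase']))
print(np.allclose(data_coh['coherence'], expected))
```

The magnitude of each resulting complex number equals mag, which gives a quick sanity check on real data.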
I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable solution for normal means would be
df.mean()
Note that the df has NaN in the columns and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 Nan 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 Nan 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts that I don't see it.
I would appreciate any help.
You could use masked arrays. First drop the rows where the Value column is NaN.
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
     ...:     np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
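As an alternative to masked arrays, the same NaN-aware weighted mean can be sketched with plain pandas arithmetic. The small frame and the names cols, w, values and valid_w below are made up for illustration; the idea is that a weight only counts in the denominator where both the value and the weight are present.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with NaN in a data column and in the weights.
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [10.0, np.nan, 30.0, 40.0],
    'Value': [1.0, 2.0, 3.0, np.nan],
})

cols = ['A', 'B']
w = df['Value']
values = df[cols]

# Weight per cell: w where the value is present, 0 where it is missing,
# NaN where the weight itself is missing (sum() skips NaN by default).
valid_w = values.notna().astype(float).mul(w, axis=0)

# Weighted sum of the present values divided by the sum of their weights.
weighted_mean = values.mul(w, axis=0).sum() / valid_w.sum()
print(weighted_mean)
```

For column B this gives (10*1 + 30*3) / (1 + 3) = 25, since the row with a missing value and the row with a missing weight are both left out.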
I have a pandas dataframe from which I am trying to create a tsplot with seaborn, and I am getting duplicate index errors. My question is twofold. First off, when I look at the data there are no duplicate indexes (sample df is given below):
BOLD campaign time type
0 2.735430 6041148 3 Default
1 1.356943 6041148 3 None
2 NaN 6041148 3 Vertical
3 9.550452 6013866 6 Default
4 1.000000 6013866 6 None
5 NaN 6013866 6 Vertical
6 322.675089 6086810 8 Default
7 1.508849 6086810 8 None
8 773.393385 6086810 8 Vertical
9 43.396084 6046619 10 Default
10 26.124405 6046619 10 None
11 NaN 6046619 10 Vertical
12 103.955111 6065909 10 Default
13 1.000000 6065909 10 None
14 NaN 6065909 10 Vertical
15 9.744664 6013866 9 Default
16 9.031970 6013866 9 None
17 NaN 6013866 9 Vertical
18 10.980322 6065742 8 Default
19 0.803821 6065742 8 None
However when I go to plot this dataframe with
sns.tsplot(df, time="time", unit="campaign", condition="type", value="BOLD")
I get
ValueError: Index contains duplicate entries, cannot reshape
Can someone explain why this is? I do not see any duplicate entries (or indices). I also tried using drop_duplicates and got the same result.
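One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the error message is the one pandas raises when a pivot/reshape finds more than one row for the same key combination, so the duplicates in question might be repeated (campaign, time, type) combinations rather than repeated row labels. A sketch of that check on made-up data (the frame and the names keys, dupes and aggregated are hypothetical):

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's frame;
# rows 0 and 3 share the same campaign, time and type.
df = pd.DataFrame({
    'BOLD': [2.7, 1.4, 9.6, 3.1],
    'campaign': [6041148, 6041148, 6013866, 6041148],
    'time': [3, 3, 6, 3],
    'type': ['Default', 'None', 'Default', 'Default'],
})

keys = ['campaign', 'time', 'type']

# Every row whose key combination occurs more than once.
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# One way to collapse them: average BOLD per key combination.
aggregated = df.groupby(keys, as_index=False)['BOLD'].mean()
print(aggregated)
```

Note that drop_duplicates without a subset only removes rows that are identical in every column, so rows that differ only in BOLD would survive it.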