I have some data in which the index is a threshold and the values are tnrs (true negative rates) for two classes, 0 and 1.
I want to get a dataframe, indexed by the tnr, of the threshold that corresponds to that tnr, for each class. Essentially, I want this:
I am able to achieve this effect by using the following:
pd.concat([pd.Series(data[0].index.values, index=data[0]),
           pd.Series(data[1].index.values, index=data[1])],
          axis=1)
Or, generalizing to any number of columns:
def invert_dataframe(df):
    return pd.concat([pd.Series(df[col].index.values,
                                index=df[col]) for col in df.columns],
                     axis=1)
However, this seems extremely hacky and error prone. Is there a better way to do this, and is there maybe native Pandas functionality that would do this?
You can use stack with pivot:
data = pd.DataFrame({0: [10, 20, 31], 10: [4, 22, 36],
                     1: [7, 5, 6]}, index=[2.1, 1.07, 2.13])
print (data)
0 1 10
2.10 10 7 4
1.07 20 5 22
2.13 31 6 36
df = data.stack().reset_index()
df.columns = list('abc')
df = df.pivot(index='c', columns='b', values='a')
print (df)
b 0 1 10
c
4 NaN NaN 2.10
5 NaN 1.07 NaN
6 NaN 2.13 NaN
7 NaN 2.10 NaN
10 2.10 NaN NaN
20 1.07 NaN NaN
22 NaN NaN 1.07
31 2.13 NaN NaN
36 NaN NaN 2.13
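For readability, the same stack-then-pivot round trip can carry meaningful names instead of single-letter columns. A minimal sketch over the same data frame:

# Long form: one (threshold, class, tnr) triple per row.
long = data.stack().rename_axis(['threshold', 'class']).reset_index(name='tnr')

# Invert: tnr becomes the index, with one threshold column per class.
df = long.pivot(index='tnr', columns='class', values='threshold')
print(df)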
I have two dictionaries of data frames LP3 and ExeedenceDict. The ExeedenceDict is a dictionary of 4 dataframes with keys 'two','ten','twentyfive','onehundred'. The LP3 dictionary has keys 'LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston'
Edit: I am not sure of the most concise way to title this question, but I think the title suits what I am asking.
There is a column in each dataframe within the ExeedenceDict that has row values equal to the keys in the LP3 dictionary.
Below is a 'blank' dataframe for two in the ExeedenceDict that I created using the code:
ExeedenceDF = []
cols = ['Location', 'Size', 'Annual Exceedence', 'With Reg Skew', 'Without Reg Skew', '5% Lower', '95% Upper']
for _ in range(4):  # one blank frame per return period
    frame = pd.DataFrame(columns=cols)
    frame['Location'] = LP_names
    frame['Size'] = [39.8, 24, 34, 29.7, 21.2, 53.7, 61.7, 27.6, 31.6]
    ExeedenceDF.append(frame)
ExeedenceDict = {'two': ExeedenceDF[0], 'ten': ExeedenceDF[1], 'twentyfive': ExeedenceDF[2], 'onehundred': ExeedenceDF[3]}
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 NaN NaN NaN NaN NaN
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Below is the dataframe for the key LP_DevilMalad in the LP3 dictionary. This dictionary was built by reading in data from 10 Excel spreadsheets using the code:
LP_names = ['LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston']
for i, df in enumerate(LP_Data):
    LP_Data[i] = LP_Data[i].dropna()
    LP_Data[i]['Annual Exceedence'] = 1 / LP_Data[i]['Annual Exceedence']
    LP_Data[i] = LP_Data[i].loc[LP_Data[i]['Annual Exceedence'].isin([2, 10, 25, 100])]
LP3 = {k: v for (k, v) in zip(LP_names, LP_Data)}
'LP_DevilMalad':
    Annual Exceedence  With Reg Skew  Without Reg Skew  Log Variance of Est  5% Lower  95% Upper
6                 2.0           21.4              22.4               0.0091      14.1       31.2
9                10.0           46.5              44.7               0.0119      32.1       85.7
10               25.0           60.2              54.6               0.0166      40.6      136.2
12              100.0           81.4              67.4               0.0270      51.3      250.6
I am having issues matching the keys of LP3 to the row values of the Location column in each dataframe within the ExeedenceDict, with the goal of a script that does all of this iteratively, perhaps with some sort of dictionary comprehension. The caveat is that the two dataframe takes just the row at index 6 of each LP3 dataframe, ten takes the row at index 9, twentyfive the row at index 10, and onehundred the row at index 12.
The goal dataframe for key two in ExeedenceDict, based on the two dataframes above, would look something like this, noting that the remaining rows would be filled with the index-6 values from the rest of the dataframes in the LP3 dictionary:
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 2 21.4 22.4 14.1 31.2
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
I can't test it without a reproducible example, but I would do something along these lines:
index_map = {
    "two": 6,
    "ten": 9,
    "twentyfive": 10,
    "onehundred": 12
}
col_of_interest = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]

for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    for lp_val in df['Location'].values:
        df.loc[df['Location'] == lp_val, col_of_interest] = LP3[lp_val].loc[lp_index, col_of_interest].values
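Building on the dictionary-comprehension idea from the question, the same fill can be done with one lookup frame per return period. A minimal sketch, assuming every Location value is a key in LP3 and the row order of each target frame matches its Location column:

import pandas as pd

for key, row_idx in index_map.items():
    target = ExeedenceDict[key]
    # One column per location, holding the fixed LP3 row for this return period.
    lookup = pd.DataFrame({loc: LP3[loc].loc[row_idx, col_of_interest]
                           for loc in target['Location']})
    # Transpose so each location becomes a row, then assign positionally.
    target[col_of_interest] = lookup.T.values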
I have been trying to find out how to append a total row that sums the columns.
There is an elegant solution to this problem here: [SOLVED] Pandas dataframe total row
However, when using this method, I have noticed a warning message:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
I have tried to use this alternative in order to avoid relying on legacy code, but when I used the concat method, it appended the sums vertically as new rows in a single new column.
Code I used:
pd.concat([df, df.sum(numeric_only=True)])
Result:
a b c d e
0 2200.0 14.30 NaN 2185.70 NaN
1 3200.0 20.80 NaN 3179.20 NaN
2 6400.0 41.60 NaN 6358.40 NaN
3 NaN NaN NaN NaN 11800.00 <-- Appended using Concat
4 NaN NaN NaN NaN 76.70 <-- Appended using Concat
5 NaN NaN NaN NaN 0.00 <-- Appended using Concat
6 NaN NaN NaN NaN 11723.30 <-- Appended using Concat
What I want:
a b c d
0 2200.0 14.30 NaN 2185.70
1 3200.0 20.80 NaN 3179.20
2 6400.0 41.60 NaN 6358.40
3 11800.00 76.70 0.00 11723.30 <-- Appended using Concat
Is there an elegant solution to this problem using the concat method?
Convert the sum (which is a pandas.Series) to a DataFrame and transpose before concat:
>>> pd.concat([df,df.sum().to_frame().T], ignore_index=True)
a b c d
0 2200.0 14.3 NaN 2185.7
1 3200.0 20.8 NaN 3179.2
2 6400.0 41.6 NaN 6358.4
3 11800.0 76.7 0.0 11723.3
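If you would rather label the appended row (say, Total) than take the next integer index, rename the one-row frame before concatenating. A minimal sketch, with the frame rebuilt from the example values above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2200.0, 3200.0, 6400.0],
                   'b': [14.3, 20.8, 41.6],
                   'c': [np.nan, np.nan, np.nan],
                   'd': [2185.7, 3179.2, 6358.4]})

# Sum the numeric columns, make the Series a one-row frame, and label it.
total = df.sum(numeric_only=True).to_frame().T
total.index = ['Total']
print(pd.concat([df, total]))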
I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file
into this pandas dataframe:
print(df.head(16))
                   date technician alkalinity colour     uv    ph turbidity temperature
0   2020-02-01 00:00:00  Catherine       24.5     33   0.15  7.24      1.53           3
1            Unnamed: 2        NaN        NaN    NaN    NaN   NaN      2.31         NaN
2            Unnamed: 3        NaN        NaN    NaN    NaN   NaN      2.08         NaN
3            Unnamed: 4        NaN        NaN    NaN    NaN   NaN      2.20         NaN
4            Unnamed: 5     Michel         24     35  0.152  7.22      1.59           3
5            Unnamed: 6        NaN        NaN    NaN    NaN   NaN      1.66         NaN
6            Unnamed: 7        NaN        NaN    NaN    NaN   NaN      1.71         NaN
7            Unnamed: 8        NaN        NaN    NaN    NaN   NaN      1.53         NaN
8   2020-02-02 00:00:00  Catherine         24    NaN  0.145  7.21      1.44           3
9           Unnamed: 10        NaN        NaN    NaN    NaN   NaN      1.97         NaN
10          Unnamed: 11        NaN        NaN    NaN    NaN   NaN      1.91         NaN
11          Unnamed: 12        NaN        NaN   33.0    NaN   NaN      2.07         NaN
12          Unnamed: 13     Michel         24     34   0.15  7.24      1.76           3
13          Unnamed: 14        NaN        NaN    NaN    NaN   NaN      1.84         NaN
14          Unnamed: 15        NaN        NaN    NaN    NaN   NaN      1.72         NaN
15          Unnamed: 16        NaN        NaN    NaN    NaN   NaN      1.85         NaN
From here, I want to combine the rows so that I only have one row for each date, where the value in each column is the mean over the rows being combined, i.e.:
print(new_df.head(2))
date time alkalinity colour uv ph turbidity temperature
0 2020-02-01 00:00:00 24.25 34 0.151 7.23 1.83 3
1 2020-02-02 00:00:00 24 33.5 0.148 7.23 1.82 3
How can I accomplish this when I have Unnamed values in my date column? Thanks!
Try setting the 'Unnamed' values to NaN and then using ffill:

import numpy as np

df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan
df.date = df.date.ffill()
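From there, the per-date means the question asks for are one groupby away. A minimal sketch, assuming the measurement columns are (or have been cast to) numeric:

# Average every numeric column per date; non-numeric columns drop out.
new_df = df.groupby('date', as_index=False).mean(numeric_only=True)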
If I understand correctly, you want to drop rows that contain 'Unnamed' in the date column, right?
Please look here:
https://stackoverflow.com/a/27360130/12790501
The solution would be something like this:
df = df.drop(df[df.date.str.contains('Unnamed')].index)
Edit:
No, I would like to replace those Unnamed values with the date so I
could then use the groupby('date') function to return the mean values
for the columns
so in that case you could just iterate over the whole table:

last_date = ''
for i in df.index:
    if 'Unnamed' not in df.at[i, 'date']:
        last_date = df.at[i, 'date']
    else:
        df.at[i, 'date'] = last_date
If the 'date' column is of type object, i.e. string, then you can write a loop over the numbers in the 'Unnamed: N' labels, provided they follow the pattern shown:

for n in range(2, 9):
    df.loc[df['date'] == 'Unnamed: ' + str(n), 'date'] = your_value
I am trying to pull data from text values in a pandas DataFrame.
df = pd.DataFrame(['{58={1=4.5}, 50={1=4.0}, 42={1=3.5}, 62={1=4.75}, 54={1=4.25}, 46={1=3.75}}',
                   '{a={1=15.0}, b={1=14.0}, c={1=13.0}, d={1=15.5}, e={1=14.5}, f={1=13.5}}',
                   '{58={1=15.5}, 50={1=14.5}, 42={1=13.5}, 62={1=16.0}, 54={1=15.0}, 46={1=14.0}}'])
I have tried
df.apply(pd.Series)
pd.DataFrame(df.tolist(),index=df.index)
json_normalize(df)
But with no success.
I want to have new columns 58, 50, a, b, c, etc., and the values without the '1=' prefix; I don't mind the NaNs. How do I do that? What is this format?
Really appreciate your help.
With a targeted replacement to prepare a valid JSON string:

In [184]: new_df = pd.DataFrame(df.apply(lambda s: s.str.replace(r'(\w+)=\{1=([^}]+)\}', '"\\1":\\2'))[0]
     ...:                         .apply(pd.io.json.loads).tolist())
In [185]: new_df
Out[185]:
42 46 50 54 58 62 a b c d e f
0 3.5 3.75 4.0 4.25 4.5 4.75 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN 15.0 14.0 13.0 15.5 14.5 13.5
2 13.5 14.00 14.5 15.00 15.5 16.00 NaN NaN NaN NaN NaN NaN
There is a way to do it by editing the strings so that your data looks like dictionaries. There is probably a smarter way using regex, but that will depend on what can be assumed about the rest of the data you have available.
My steps below are:
Change strings to transform your data into a dict-like structure
Use literal_eval to transform each str into a dict
Unfold the df into a new dataframe
from ast import literal_eval
# Parenthesized chain: backslash continuations can't carry inline comments.
df[0] = (df[0].str.replace('={1=', "':", regex=False)  # drop '1=' and the left inner dict sign {
              .str.replace('}, ', ",'", regex=False)   # drop the right inner dict sign }
              .str.replace('}}', '}', regex=False)     # drop the outermost extra }
              .str.replace('{', "{'", regex=False)     # quote the first key
              .apply(literal_eval))                    # read each string as a dict
pd.DataFrame(df[0].values.tolist()) # unfold as a new dataframe
Out[1]:
58 50 42 62 54 46 a b c d e f
0 4.5 4.0 3.5 4.75 4.25 3.75 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN 15.0 14.0 13.0 15.5 14.5 13.5
2 15.5 14.5 13.5 16.00 15.00 14.00 NaN NaN NaN NaN NaN NaN
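The "smarter way using regex" mentioned above could skip the string surgery and parse each entry directly. A minimal sketch, assuming every entry follows the key={1=value} shape:

import re
import pandas as pd

pattern = re.compile(r'(\w+)=\{1=([^}]+)\}')

# Each match is a (key, value) pair; build one dict per row, then a frame.
parsed = df[0].map(lambda s: {k: float(v) for k, v in pattern.findall(s)})
new_df = pd.DataFrame(parsed.tolist())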
Following up on this answer: Is there a way to do a weight-average rolling sum over a grouping?
rsum = pd.rolling_apply(g.values,p,lambda x: np.nansum(w*x),min_periods=p)
rolling_apply is deprecated now. How would you change this to work with the current API?
As of pandas 0.18+, use Series.rolling with apply:
w = np.array([0.1, 0.1, 0.2, 0.6])

df.groupby('ID').VALUE.apply(
    lambda s: s.rolling(window=4).apply(lambda x: np.dot(x, w), raw=False))
0 NaN
1 NaN
2 NaN
3 146.0
4 166.0
5 NaN
6 NaN
7 NaN
8 2.5
9 NaN
10 NaN
11 NaN
12 35.5
13 21.4
14 NaN
15 NaN
16 NaN
17 8.3
18 9.8
19 NaN
Name: VALUE, dtype: float64
The raw argument is new in 0.23 (set it to choose between passing a Series and an ndarray to the function), so remove it if you're having trouble on older versions.
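On current pandas, raw=True is usually the better choice for a plain dot product, since the lambda then receives NumPy arrays rather than Series. A minimal sketch over the same hypothetical df with ID and VALUE columns:

import numpy as np

w = np.array([0.1, 0.1, 0.2, 0.6])
rsum = df.groupby('ID')['VALUE'].apply(
    lambda s: s.rolling(window=4).apply(lambda a: np.dot(a, w), raw=True))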