Trouble rearranging columns with same name in pandas dataframe in Python

Trouble rearranging columns with same name in pandas dataframe in Python - python

I have a pandas dataframe whose columns I have turned into a list and edited and rearranged. I'm trying to reassign the columns like as follows:
sapie_columns = sapie_df_working.columns.tolist()
sapie_columns = [sapie_columns[-1]] + sapie_columns[3:-1]
sapie_df_working = sapie_df_working[sapie_columns]
but it turns my dataframe (initially with 32 columns) into one with 164 columns. I think this is because a number of the existing columns have the same column name (i.e., "90% CI Lower Bound"). I'm curious about why is is happening and how I can rearrange and edit my dataframe's columns as I want to.
For reference, here is a snippet of my dataframe:
# sapie_df_working
2 State FIPS Code County FIPS Code Postal Code Name Poverty Estimate, All Ages 90% CI Lower Bound 90% CI Upper Bound Poverty Percent, All Ages 90% CI Lower Bound 90% CI Upper Bound ... 90% CI Upper Bound Median Household Income 90% CI Lower Bound 90% CI Upper Bound Poverty Estimate, Age 0-4 90% CI Lower Bound 90% CI Upper Bound Poverty Percent, Age 0-4 90% CI Lower Bound 90% CI Upper Bound
3 00 000 US United States 38371394 38309115 38433673 11.9 11.9 11.9 ... 14.9 67340 67251 67429 3146325 3133736 3158914 16.8 16.7 16.9
4 01 000 AL Alabama 714568 695249 733887 14.9 14.5 15.3 ... 20.7 53958 53013 54903 66169 61541 70797 23.3 21.7 24.9
5 01 001 AL Autauga County 6242 4930 7554 11.2 8.8 13.6 ... 19.3 67565 59132 75998 . . . . . .
6 01 003 AL Baldwin County 20189 15535 24843 8.9 6.8 11 ... 16.1 71135 66540 75730 . . . . . .
7 01 005 AL Barbour County 5548 4210 6886 25.5 19.3 31.7 ... 47.2 38866 33510 44222 . . . . . .

df = df[specific_column_names] is indeed producing this result because of duplicate column names. Filtering with column names in this case is tricky, as it's unclear exactly which column is being referred to.
In case of duplicate column names I would instead use column indices to filter the DataFrame.
Let's look at an example:
>>> import pandas as pd
>>> mock_data = [[11.29, 33.1283, -1.219, -33.11, 930.1, 33.91, 0.1213, 0.134], [9.0, 99.101, 9381.0, -940.11, 55.41, -941.1, -1.3913, 1933.1], [-192.1, 0.123, 0.1243, 0.213, 751.1, 991.1, -1.333, 9481.1]]
>>> mock_columns = ['a', 'b', 'c', 'a', 'd', 'b', 'g', 'a']
>>> df = pd.DataFrame(columns=mock_columns, data=mock_data)
>>> df
a b c a d b g a
0 11.29 33.1283 -1.2190 -33.110 930.10 33.91 0.1213 0.134
1 9.00 99.1010 9381.0000 -940.110 55.41 -941.10 -1.3913 1933.100
2 -192.10 0.1230 0.1243 0.213 751.10 991.10 -1.3330 9481.100
>>> columns = df.columns.tolist()
>>> filtered_column_indices = [len(columns) - 1] + list(range(3, len(columns) - 1))
>>> df.iloc[:, filtered_column_indices]
a a d b g
0 0.134 -33.110 930.10 33.91 0.1213
1 1933.100 -940.110 55.41 -941.10 -1.3913
2 9481.100 0.213 751.10 991.10 -1.3330
In the example, instead of extracting column names with [sapie_columns[-1]] + sapie_columns[3:-1], I extracted the equivalent indices and used that to filter the DataFrame using iloc.

Related

Matching a data frame row (pandas) to a separate data frame row and performing a calculation if the first column matches

newbie python/coder who is trying to make data logger downloads and calculations a smoother process as a side project.
Anyways I have two data frames.
The first is "data" which contains the following (number of rows shortened for simplicity):
Logger Name Date and Time Battery Temp(C) Sensor Reading(dg) Sensor Temp(C) Array #
0 TDX 10/1/2021 13:35 2.93 15.59 8772.737 14.5 833
1 TDX 10/1/2021 13:36 2.93 15.59 8773.426 14.5 834
2 TDX 10/1/2021 13:36 2.93 15.59 8773.570 14.5 835
3 TDX 10/1/2021 13:37 2.93 15.59 8773.793 14.5 836
The second is "param" which has parameters which contains values that I use to make calculations:
Transducer_ID elevation_tom elevation_toc elevation_ground elevation_tos calculation gage_factor xd_zero_reading thermal_factor xd_temp_at_zero_reading piezo_elev piezo_downhole_depth
0 TDX NaN NaN 1000 NaN linear -0.04135 9138 0.003119 24.8 1600 400
1 Test NaN NaN 1000 NaN linear -0.18320 8997 -0.170100 22.6 800 200
Now what I hope the code will be able to do is make a new column in "data" called "Linear P" which populates based on this calculation that uses variables from both dataframes:
[digits_zero_digits - Sensor Reading(dg)] * abs(gage_factor). Now this is not a problem if "param" only had one Transducer ID and the same number of rows as "data", but in reality it has lots of rows with different IDs.
So my question is this. What is the best way to accomplish my goal? Is it to loop over the column or is there something more efficient using the pandas library?
Thanks in advance!
edit: the output I am looking for is this
Logger Name Date and Time Battery Voltage(v) Internal Temp(C) Sensor Reading(dg) Sensor Temp(C) Array # Linear P
0 TDX 10/1/2021 13:35 2.93 15.59 8772.737 14.5 833 15.103625
1 TDX 10/1/2021 13:36 2.93 15.59 8773.426 14.5 834 15.075135
2 TDX 10/1/2021 13:36 2.93 15.59 8773.570 14.5 835 15.069181
3 TDX 10/1/2021 13:37 2.93 15.59 8773.793 14.5 836 15.059959

Just figured out a way to do it that seems pretty efficient.
I simply remove the data in "param" that I do not need:
z = data.iloc[0,0]
param = param[param.Transducer_ID == z]
With the data filtered I pull out only the needed values from param:
x = piezo_param.iloc[0, 7]
y = piezo_param.iloc[0, 6]
And perform the calculation:
data['Linear P'] = (x - data['Sensor Reading(dg)']) * abs(y)
Let me know if this seems like the best way to get the job done!

The more efficient way would be based on my experience :
join the two data frame using (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html).
make calculation on the result dataframe ( df["Linear P"] = df["Sensor Reading(dg)"] * ... ) .
here is an example of my process :
import pandas as pd
df1 = pd.DataFrame({'Names': ['a', 'a'],
'var1': [35, 15,],
'var2': [15, 40]})
df2 = pd.DataFrame({'Names1': ['a', 'E'],
'var3': [35, 15,],
'var4': [15, 40]})
final_df = df1.merge(df2, left_on='Names', right_on='Names1', how='left' )
final_df["Linear P"] = final_df["var3"] * final_df["var2"] - abs(final_df["var2"])
print(final_df)

How to sum all rows from multiple columns

I want to do several operations that are repeated for several columns but I can't do it with a list-comprehension or with a loop.
The dataframe I have is concern_polls and I want to rescale the percentages and the total amounts.
text very somewhat \
0 How concerned are you that the coronavirus wil... 19.00 33.00
1 How concerned are you that the coronavirus wil... 26.00 32.00
2 Taking into consideration both your risk of co... 13.00 26.00
3 How concerned are you that the coronavirus wil... 23.00 32.00
4 How concerned are you that you or someone in y... 11.00 24.00
.. ... ... ...
625 How worried are you personally about experienc... 33.09 36.55
626 How do you feel about the possibility that you... 30.00 31.00
627 Are you concerned, or not concerned about your... 34.00 35.00
628 Are you personally afraid of contracting the C... 28.00 32.00
629 Taking into consideration both your risk of co... 22.00 40.00
not_very not_at_all url
0 23.00 11.00 https://morningconsult.com/wp-content/uploads/...
1 25.00 7.00 https://morningconsult.com/wp-content/uploads/...
2 43.00 18.00 https://d25d2506sfb94s.cloudfront.net/cumulus_...
3 24.00 9.00 https://morningconsult.com/wp-content/uploads/...
4 33.00 20.00 https://projects.fivethirtyeight.com/polls/202...
.. ... ... ...
625 14.92 12.78 https://docs.google.com/spreadsheets/d/1cIEEkz...
626 14.00 16.00 https://www.washingtonpost.com/context/jan-10-...
627 19.00 12.00 https://drive.google.com/file/d/1H3uFRD7X0Qttk...
628 16.00 15.00 https://leger360.com/wp-content/uploads/2021/0...
629 21.00 16.00 https://docs.cdn.yougov.com/4k61xul7y7/econTab...
[630 rows x 15 columns]
Variables very, somewhat, not_very and not_at_all they are represented as percentages of the column SAMPLE_SIZE, not shown in the sample share. The percentages don't always add up to 100% so I want to rescale it
To do this, I take the following steps: I calculate the sum of the columns -> variable I sum calculate the amount per %. This step could leave it as a variable and not create a new column in it df. I calculate the final amounts
The code I have so far is this:
sums = concern_polls['very'] + concern_polls['somewhat'] + concern_polls['not_very'] + concern_polls['not_at_all']
concern_polls['Very'] = concern_polls['very'] / sums * 100
concern_polls['Somewhat'] = concern_polls['somewhat'] / sums * 100
concern_polls['Not_very'] = concern_polls['not_very'] / sums * 100
concern_polls['Not_at_all'] = concern_polls['not_at_all'] / sums * 100
concern_polls['Total_Very'] = concern_polls['Very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Somewhat'] = concern_polls['Somewhat'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_very'] = concern_polls['Not_very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_at_all'] = concern_polls['Not_at_all'] / 100 * concern_polls['sample_size']
I have tried to raise the function with "list comprehension" but I can't.
Could someone make me a suggestion?
The problems that I find is that I want to add all the rows of several columns, but they are not all of the df doing repetitive operations on several columns, but they are not all of the df
Thank you.

df[newcolumn] = df.apply(lambda row : function(row), axis=1)
is your friend here I think.
"axis=1" means it does it row by row.
As an example :
concern_polls['Very'] = concern_polls.apply(lambda row: row['very'] / sums * 100, axis=1)
And if you want sums to be the total of each of those df columns it'll be
sums = concern_polls[['very', 'somewhat', 'not_very', 'not_at_all']].sum().sum()

How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

I need to apply a function that splits multiple numbers from the fields of a dataframe.
In this dataframe there a all the kids' measurements that are needed for a school: Name, Height, Weight, and Unique Code, and their dream career.
The name is only formed of alpha-characters. But some kids might have both first name and middle name. (e.g. Vivien Ester)
The height is known to be >= 100 cm for every child.
The weight is known to be < 70 kg for every child.
The unique code is known to be any number, but it is associated with the letters 'AX', for every child. But the AX may not always be stick to the number (e.g. 7771AX), it might be a space next to it. (e.g. 100 AX)
Every kid has its dream career
They could appear in any order, but they always follow the rules from above. However, for some kids some measurements could not appear (e.g.: height or unique code or both are missing or all are missing).
So the dataframe is this:
data = { 'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']
}
df = pd.DataFrame (data, columns = ['Dream Career','Measurements'])
And it looks like this:
Dream Career Measurements
0 Scientist Rachel 24.3 100.25 100 AX
1 Astronaut 100.5 Tuuli 30.1
2 Software Engineer Michael 28.0 7771AX 113.75
3 Doctor Vivien Ester 40AX 115.20
4 Fashion Designer Oliver 40.5
5 Teacher Julien 35.1 678 AX 111.1
6 Architect Bee 20.0 100.80 88AX
I try to split all of these measurements into different columns, based on the specified rules.
So the final dataframe should look like this:
Dream Career Names Weight Height Unique Code
0 Scientist Rachael 24.3 100.25 100AX
1 Astronaut Tuuli 30.1 100.50 NaN
2 Software Engineer Michael 28.0 113.75 7771AX
3 Doctor Vivien Ester NaN 115.20 40AX
4 Fashion Designer Oliver 40.5 NaN NaN
5 Teacher Julien 35.1 111.10 678AX
6 Architect Bee 10.0 100.80 88AX
I tried this code and it works very well, but only on single strings. And I need to do this while in the dataframe and still keep every's kid's associate dream career (so the order is not lost).
num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
def get_weight_height(s):
nums = re.findall(num_rx, s)
height = np.nan
weight = np.nan
if (len(nums) == 0):
height = np.nan
weight = np.nan
elif (len(nums) == 1):
if float(nums[0]) >= 100:
height = nums[0]
weight = np.nan
else:
weight = nums[0]
height = np.nan
elif (len(nums) == 2):
if float(nums[0]) >= 100:
height = nums[0]
weight = nums[1]
else:
height = nums[1]
weight = nums[0]
return height, weight
class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}
def hasNumbers(inputString):
return any(char.isdigit() for char in inputString)
def extract_measurements(string, substring_name):
height = np.nan
weight = np.nan
unique_code = np.nan
name = ''
if hasNumbers(string):
num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
nums = re.findall(num_rx, string)
if (substring_name in string):
special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
if special_match:
unique_code = special_match.group()
string = string.replace(unique_code, '')
unique_code = unique_code + substring_name
if len(nums) >= 2 & len(nums) <= 3:
height, weight = get_weight_height(string)
else:
height, weight = get_weight_height(string)
name = " ".join(re.findall("[a-zA-Z]+", string))
name = name.replace(substring_name,'')
return format(float(height), '.2f'), float(weight), unique_code, name
And I apply it like this:
string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)
The results are very good.
I tried to apply the function on the dataframe, but I don't know how, I tried this as I got inspired from this and this and this... but they are all different than my problem:
df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])
I cannot figure out how to apply it on my dataframe. Please help me.
I am at the very beginning, I highly appreciate all the help if you could possibly help me!

Use apply for rows (axis=1) and choose 'expand' option. Then rename columns and concat to the original df:
pd.concat([df,(df.apply(lambda row : extract_measurements(row['Measurements'], class_code['Small']), axis = 1, result_type='expand')
.rename(columns = {0:'height', 1:'weight', 2:'unique_code', 3:'name'})
)], axis = 1)
output:
Dream Career Measurements height weight unique_code name
-- ----------------- -------------------------- -------- -------- ------------- ------------
0 Scientist Rachel 24.3 100.25 100 AX 100 100 100AX Rachel
1 Astronaut 100.5 Tuuli 30.1 100 100 nan Tuuli
2 Software Engineer Michael 28.0 7771AX 113.75 100 100 7771AX Michael
3 Doctor Vivien Ester 40AX 115.20 100 100 40AX Vivien Ester
4 Fashion Designer Oliver 40.5 100 100 nan Oliver
5 Teacher Julien 35.1 678 AX 111.1 100 100 678AX Julien
6 Architect Bee 20.0 100.80 88AX 100 100 88AX Bee
(note I stubbed def get_weight_height(string) function because your coded did not include it, to always return 100,100)

#piterbarg's answer seems efficient given the original functions, but the functions seem verbose to me. I'm sure there's a simpler solution here that what I'm doing, but what I have below replaces the functions in OP with I think the same results.
First changing the column names to snake case for ease:
df = pd.DataFrame({
'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
'Fashion Designer', 'Teacher', 'Architect'],
'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
'Bee 20.0 100.80 88AX']
})
First the strings in .measurements are turned into lists. From here on list comphrehensions will be applied to each list to filter values.
df.measurements = df.measurements.str.split()
0 [Rachel, 24.3, 100.25, 100, AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678, AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
The second step is filtering out the 'AX' from .measurements and appending 'AX' to all integers. This assumes this example is totally reproducible and all the height/weight measurements are floats, but a different differentiator could be used if this isn't the case.
df.measurements = df.measurements.apply(
lambda val_list: [val for val in val_list if val!='AX']
).apply(
lambda val_list: [str(val)+'AX' if val.isnumeric() else val
for val in val_list]
)
0 [Rachel, 24.3, 100.25, 100AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
.name and .unique_code are pretty easy to grab. With .unique_code I had to apply a second lambda function to insert NaNs. If there are missing values for .name in the original df the same thing will need to be done there. For cases of multiple names, these are joined together separated with a space.
df['name'] = df.measurements.apply(
lambda val_list: ' '.join([val for val in val_list if val.isalpha()])
)
df['unique_code'] = df.measurements.apply(
lambda val_list: [val for val in val_list if 'AX' in val]
).apply(
lambda x: np.nan if len(x)==0 else x[0]
)
For height and weight I needed to create a column of numerics first and work off that. In cases where there are missing values I'm having to come back around to deal with those.
import re
df['numerics'] = df.measurements.apply(
lambda val_list: [float(val) for val in val_list
if not re.search('[a-zA-Z]', val)]
)
df['height'] = df.numerics.apply(
lambda val_list: [val for val in val_list if val < 70]
).apply(
lambda x: np.nan if len(x)==0 else x[0]
)
df['weight'] = df.numerics.apply(
lambda val_list: [val for val in val_list if val >= 100]
).apply(
lambda x: np.nan if len(x)==0 else x[0]
)
Finally, .measurements and .numerics are dropped, and the df should be ready to go.
df = df.drop(columns=['measurements', 'numerics'])
dream_career name unique_code height weight
0 Scientist Rachel 100AX 24.3 100.25
1 Astronaut Tuuli NaN 30.1 100.50
2 Software Engineer Michael 7771AX 28.0 113.75
3 Doctor Vivien Ester 40AX NaN 115.20
4 Fashion Designer Oliver NaN 40.5 NaN
5 Teacher Julien 678AX 35.1 111.10
6 Architect Bee 88AX 20.0 100.80

How do I calculate an average of a range from a series within in a dataframe?

Im new to Python and working with data manipulation
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you observe above, some of the lifespans are in a range like 14--16. The datatype of [Lifespan] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers i.e. 15. I do not want any ranges. Just the average as a single digit. How do I do this?

Using split and expand=True
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
.astype(str).str.split('--', expand=True)
.astype(float).mean(axis=1)
)
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5

Display mean and deviation values on grouped boxplot in Python

I want to display mean and standard deviation values above each of the boxplots in the grouped boxplot (see picture).
My code is
import pandas as pd
import seaborn as sns
from os.path import expanduser as ospath
df = pd.read_excel(ospath('~/Documents/Python/Kandidatspeciale/TestData.xlsx'),'Ark1')
bp = sns.boxplot(y='throw angle', x='incident angle',
data=df,
palette="colorblind",
hue='Bat type')
bp.set_title('Rubber Comparison',fontsize=15,fontweight='bold', y=1.06)
bp.set_ylabel('Throw Angle [degrees]',fontsize=11.5)
bp.set_xlabel('Incident Angle [degrees]',fontsize=11.5)
Where my dataframe, df, is
Bat type incident angle throw angle
0 euro 15 28.2
1 euro 15 27.5
2 euro 15 26.2
3 euro 15 27.7
4 euro 15 26.4
5 euro 15 29.0
6 euro 30 12.5
7 euro 30 14.7
8 euro 30 10.2
9 china 15 29.9
10 china 15 31.1
11 china 15 24.9
12 china 15 27.5
13 china 15 31.2
14 china 15 24.4
15 china 30 9.7
16 china 30 9.1
17 china 30 9.5
I tried with the following code. It needs to be independent of number of x (incident angles), for instance it should do the job for more angles of 45, 60 etc.
m=df.mean(axis=0) #Mean values
st=df.std(axis=0) #Standard deviation values
for i, line in enumerate(bp['medians']):
x, y = line.get_xydata()[1]
text = ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i])
bp.annotate(text, xy=(x, y))
Can somebody help?

This question brought me here since I was also looking for a similar solution with seaborn.
After some trial and error, you just have to change the for loop to:
for i in range(len(m)):
bp.annotate(
' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i]),
xy=(i, m[i]),
horizontalalignment='center'
)
This change worked for me (although I just wanted to print the actual median values). You can also add changes like the fontsize, color or style (i.e., weight) just by adding them as arguments in annotate.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Trouble rearranging columns with same name in pandas dataframe in Python - python

Related

Matching a data frame row (pandas) to a separate data frame row and performing a calculation if the first column matches

How to sum all rows from multiple columns

How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

How do I calculate an average of a range from a series within in a dataframe?

Display mean and deviation values on grouped boxplot in Python

Categories

Resources