Average two identically formatted DataFrames in pandas - python

I have two pandas dataframes that are loaded from CSV files. Each has two columns: column A is an id and has the same values in the same order in both CSVs; column B is a numerical value.
I need to create a new CSV with column A identical to the first two files and column B holding the average of the two initial CSVs' B columns.
I am creating the two dataframes like this:
df1 = pd.read_csv(path).set_index('A')
df2 = pd.read_csv(otherPath).set_index('A')
If I do
newDf = (df1['B'] + df2['B'])/2
newDf.to_csv(...)
then newDf has the ids in the wrong order in column A.
If I do
df1['B'] = (df1['B'] + df2['B'])/2
df1.to_csv(...)
I get an error on the first line saying "ValueError: cannot reindex from a duplicate axis".
It seems like this should be trivial, what am I doing wrong?

Try using merge instead of setting an index.
I.e., say we have these dataframes:
df1 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [3, 4, 5, 6, 7]})
df2 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [7, 4, 3, 10, 23]})
Then we merge them and create a new column with the mean of both B columns.
together = df1.merge(df2, on='A')
together.loc[:, "mean"] = (together['B_x'] + together['B_y']) / 2
together = together[['A', 'mean']]
And together is:
   A  mean
0  1   5.0
1  2   4.0
2  3   4.0
3  4   8.0
4  5  15.0
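If you'd rather not merge, a minimal alternative sketch (assuming both frames really do share the same set of ids) is to stack the frames and average column B per id; sort=False keeps the ids in their original order, which was the asker's complaint about the set_index approach:

import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3, 4, 5, 6, 7]})
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [7, 4, 3, 10, 23]})

# Stack both frames, then average B within each id, preserving id order.
newDf = pd.concat([df1, df2]).groupby('A', sort=False, as_index=False)['B'].mean()
newDf.to_csv('averaged.csv', index=False)  # 'averaged.csv' is a placeholder path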

Related

Match column to another column containing array

I have a very junior question in Python - I have a dataframe with a column containing some IDs, and a separate dataframe that contains 2 columns, one of which is an array:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column 'letter' that, for a given 'some_id', looks up df2, checks whether this id is in df2['some_ids'], and returns df2['letter'].
I tried this:
df1['letter'] = df2[df1['some_id'].isin(df2['some_ids'])].letter
and get NaNs - any suggestion where I made a mistake?
Create a dictionary by flattening the nested lists in a dict comprehension, then use Series.map:
d = {x: a for a,b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d)
Or map by a Series created with DataFrame.explode and DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
Or use a left join after renaming the column:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print(df1)
   some_id letter
0        1      A
1        2      A
2        3      B
3        4      B
4        5      C
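For a self-contained run of the explode route (this sketch assumes every some_id appears in exactly one list, so the exploded ids are unique):

import pandas as pd

df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]],
                   columns=['letter', 'some_ids'])

# One row per (letter, id) pair, then an id -> letter lookup Series.
lookup = df2.explode('some_ids').set_index('some_ids')['letter']
df1['letter'] = df1['some_id'].map(lookup)
print(df1)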

Converting column names from a list

I am reading multiple CSV files into pandas dataframes as a list before concatenating them together. The files after the first have different column names, but I want to convert those names to match the first file, so that I can combine the rows under the same column names.
I can read them in as a list like:
dfs = (pd.read_csv(f) for f in x)
However, when I concatenate them together, the resulting dataframe keeps both sets of columns side by side. Here's example data showing the outcome:
fs = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['bgif', 'datasetkey', 'occurrenceid'])
ds = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['v1', 'v2', 'v3'])
df_row_merged = pd.concat([fs, ds], ignore_index=True)
So I was wondering how I could change the headers of the files to match the first one, as I presume this would bind them together?
Use np.concatenate to keep only the values.
IIUC, something like this should work:
dfs = [fs, ds]
df_row_merged = pd.DataFrame(np.concatenate(dfs), columns=dfs[0].columns)
>>> df_row_merged
       bgif  datasetkey  occurrenceid
0 -0.414690    0.842747     -1.653554
1  0.556024    0.577895      0.852845
2 -0.151411    0.558659     -1.219965
3 -0.702385   -0.895022     -1.123310
4  0.356573    2.121478      0.321810
5  3.349352   -0.746372     -0.849632
6  1.142182    0.175079      0.179597
7 -0.755518    0.365921     -0.212967
8 -1.559804   -0.024858     -0.233414
9 -0.602356    1.521461      0.747047
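Alternatively, to literally change the headers to match the first file and stay in pandas, a sketch like this should work (x is the asker's list of CSV paths, and it assumes every file has the same number of columns as the first):

import pandas as pd

dfs = [pd.read_csv(f) for f in x]  # x: the list of CSV paths from the question
# Relabel every frame's columns to those of the first file, then stack rows.
dfs = [d.set_axis(dfs[0].columns, axis=1) for d in dfs]
df_row_merged = pd.concat(dfs, ignore_index=True)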

Filter for rows in pandas dataframe where values in a column are greater than x or NaN

I'm trying to figure out how to filter a pandas dataframe so that the values in a certain column are either greater than a certain value or are NaN. Let's say my dataframe looks like this:
df = pd.DataFrame({"col1":[1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})
I've tried:
df = df[df["col2"] >= 5 | df["col2"] == np.nan]
and:
df = df[df["col2"] >= 5 | np.isnan(df["col2"])]
But the first causes an error, and the second excludes rows where the value is NaN. How can I get the result to be this:
pd.DataFrame({"col1":[2, 3, 4], "col2":[5, np.nan, 7]})
Please try:
df[df.col2.isna() | df.col2.gt(4)]
   col1  col2
1     2   5.0
2     3   NaN
3     4   7.0
Also, you can fill NaN with the threshold and filter on that column:
df[df['col2'].fillna(5) >= 5]
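For what it's worth, the asker's first attempt fails only because of operator precedence: | binds tighter than >= in Python, and == np.nan is always False because NaN never compares equal to itself. Parenthesizing the comparisons and using isna() fixes it:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})

# Parentheses are required around each comparison when combining with |.
print(df[(df["col2"] >= 5) | df["col2"].isna()])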

Filling a column with values from another dataframe

I want to fill a column of df2 (~100,000 rows) with values from the same column of df (~1,000,000 rows). df often contains the same row several times but with wrong data, so I always want to take the first value of column 'C'.
df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
                  columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[100, 0], [101, 0]], columns=['A', 'C'])
for i in range(0, len(df2.index)):
    # My question:
    # df2.loc[i, 'C'] = the first value of df's 'C' column where the 'A' value
    # matches. E.g. the first value for 100 would be 2, and the first value
    # for 101 would be 8.
In the end, my output should be a table like this:
df2=pd.DataFrame([[100,2],[101,8]], columns=['A', 'C'])
You can try this:
df2['C'] = df.groupby('A')['C'].first().values
Which will give you:
     A  C
0  100  2
1  101  8
first() returns the first value of every group.
Then you want to assign the values to the df2 column; unfortunately, you cannot assign the result directly like this:
df2['C'] = df.groupby('A')['C'].first()
Because the above line will result in:
     A    C
0  100  NaN
1  101  NaN
(You can read about the cause here: Adding new column to pandas DataFrame results in NaN)
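A sketch of an alignment-safe variant (assuming every id in df2['A'] also appears in df): mapping by 'A' looks values up by label, so it does not rely on the groupby result lining up positionally with df2's rows, which the .values approach silently assumes:

import pandas as pd

df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
                  columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[100, 0], [101, 0]], columns=['A', 'C'])

# Look up each id's first 'C' value by label rather than by position.
df2['C'] = df2['A'].map(df.groupby('A')['C'].first())
print(df2)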

How to name Pandas Dataframe Columns automatically?

I have a pandas dataframe df with 102 columns. Each column is named differently, say A, B, C etc., giving the original dataframe the following structure:
Column A. Column B. Column C. ....
Row 1.
Row 2.
---
Row n
I would like to change the column names from A, B, C etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there a simple way to automatically rename all column names to F1 to F102, instead of renaming each column individually?
df.columns = ["F" + str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
Another way is to pull the dataframe apart into a pair of lists and rewrite the column-names list using a loop index:
import pandas as pd

d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)
cols = list(dataFrame.columns.values)  # the original column names as a list
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename the column based on index
    index += 1  # add one to index
vals = dataFrame.values.tolist()  # get the values for the rows
newDataFrame = pd.DataFrame(vals, columns=cols)  # new dataframe with the new column names and the row values
print(newDataFrame)
Output:
   F1  F2  F3
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
4   5   5   5
5   4   4   4
6   3   3   3
7   2   2   2
8   1   1   1
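As a side note, a non-mutating variant of the comprehension approach that computes the column count from the frame itself (plain pandas, nothing specific to this answer):

import pandas as pd

d = {'Column A': [1, 2, 3], 'Column B': [4, 5, 6], 'Column c': [7, 8, 9]}
df = pd.DataFrame(d)

# set_axis returns a renamed copy instead of mutating df in place.
renamed = df.set_axis([f"F{i}" for i in range(1, df.shape[1] + 1)], axis=1)
print(renamed.columns.tolist())  # ['F1', 'F2', 'F3']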
