Converting column names from a list - python

I am reading multiple CSV files into a list of pandas data frames before concatenating them together. All the files after the first have different column names, but I want to rename those columns to match the first file's, so that I can stack the rows under the same column names.
I can read them in as a list like:
dfs = [pd.read_csv(f) for f in x]
However, when I concatenate them, the resulting data frame keeps both sets of columns side by side. Here's example data reproducing the outcome:
import numpy as np
import pandas as pd

fs = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['bgif', 'datasetkey', 'occurrenceid'])
ds = pd.DataFrame(np.random.randn(5, 3),
                  index=[1, 2, 3, 4, 5],
                  columns=['v1', 'v2', 'v3'])
df_row_merged = pd.concat([fs, ds], ignore_index=True)
So I was wondering how I could change the headers of the other files to match the first, since I presume that would let the rows bind together properly?

Use np.concatenate to keep only the values (so the mismatched column labels are discarded).
IIUC, something like this should work:
dfs = [fs, ds]
df_row_merged = pd.DataFrame(np.concatenate(dfs), columns=dfs[0].columns)
>>> df_row_merged
bgif datasetkey occurrenceid
0 -0.414690 0.842747 -1.653554
1 0.556024 0.577895 0.852845
2 -0.151411 0.558659 -1.219965
3 -0.702385 -0.895022 -1.123310
4 0.356573 2.121478 0.321810
5 3.349352 -0.746372 -0.849632
6 1.142182 0.175079 0.179597
7 -0.755518 0.365921 -0.212967
8 -1.559804 -0.024858 -0.233414
9 -0.602356 1.521461 0.747047
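If you would rather rename the headers themselves, as the question asks, one possible sketch is to read the files into a list and reuse the first file's columns for the rest before concatenating (assuming x is your list of file paths and every file has the same number of columns):

dfs = [pd.read_csv(f) for f in x]
first_cols = dfs[0].columns
for df in dfs[1:]:
    df.columns = first_cols  # overwrite headers with the first file's column names
df_row_merged = pd.concat(dfs, ignore_index=True)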

Related

merge all columns in a data frame into a list

I have a CSV file with no headers and want to take all the columns in the data frame and "append" them into a single list.
for example I have df:
1 2 7
3 4 8
5 6 9
Here is pseudocode for what I want to perform:
import pandas as pd

df = pd.read_csv('file.csv', header=None)
data = []
for i in range(df.shape[1]):  # number of columns; in this case we have 3
    data.extend(df[i].tolist())  # collect the values from each column
# desired result: data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
IIUC, you can flatten and then sort:
import numpy as np
np.sort(df.values.flatten())
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
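If you want the values in column order (matching the column-by-column loop in the pseudocode) rather than sorted, one possible sketch is to flatten in column-major order (assuming the same file.csv as above):

import pandas as pd

df = pd.read_csv('file.csv', header=None)
data = df.to_numpy().flatten(order='F').tolist()  # column by column, as a plain list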

How to name Pandas Dataframe Columns automatically?

I have a Pandas dataframe df with 102 columns. Each column is named differently, say A, B, C, etc., giving the original dataframe the following structure:
        Column A  Column B  Column C  ...
Row 1
Row 2
...
Row n
I would like to change the column names from A, B, C, etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there a simple way to rename all columns to F1 through F102 automatically, instead of renaming each column individually?
df.columns=["F"+str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
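For example, a version of the same one-liner written against the frame's own width rather than the hard-coded 103 might look like this (a sketch, assuming df is the 102-column dataframe from the question):

df.columns = ["F" + str(i) for i in range(1, df.shape[1] + 1)]  # F1 ... F<number of columns>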
Another way is to pull the column names and row values out into lists, rename the names list in a loop, and rebuild the dataframe:
import pandas as pd

d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)

cols = list(dataFrame.columns.values)  # list of the original column names
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename the column name based on index
    index += 1  # add one to index

vals = dataFrame.values.tolist()  # get the values for the rows
newDataFrame = pd.DataFrame(vals, columns=cols)  # new dataframe with the new column names and values from rows
print(newDataFrame)
Output:
F1 F2 F3
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 4 4 4
6 3 3 3
7 2 2 2
8 1 1 1
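The same renaming can also be done without rebuilding the dataframe, for example with rename and a dict comprehension (a sketch reusing the dataFrame defined above; mapping is just a name introduced here):

mapping = {old: "F" + str(i) for i, old in enumerate(dataFrame.columns, start=1)}
newDataFrame = dataFrame.rename(columns=mapping)  # map each original name to F1, F2, ...
print(newDataFrame)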

Average two identically formatted data frames in pandas

I have two pandas dataframes that are loaded from CSV files. Each has two columns: column A is an id, with the same values in the same order in both CSVs, and column B is a numerical value.
I need to create a new CSV with column A identical to the originals and with column B the average of the two initial CSVs' B columns.
I am creating two dataframes like
df1=pd.read_csv(path).set_index('A')
df2=pd.read_csv(otherPath).set_index('A')
If I do
newDf = (df1['B'] + df2['B'])/2
newDf.to_csv(...)
then newDf ends up with the ids in the wrong order in column A.
If I do
df1['B'] = (df1['B'] + df2['B'])/2
df1.to_csv(...)
I get an error on the first line saying "ValueError: cannot reindex from a duplicate axis".
It seems like this should be trivial, what am I doing wrong?
Try using merge instead of setting an index.
E.g., say we have these dataframes:
df1 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [3, 4, 5, 6, 7]})
df2 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [7, 4, 3, 10, 23]})
Then we merge them and create a new column with the mean of both B columns.
together = df1.merge(df2, on='A')
together.loc[:, "mean"] = (together['B_x']+ together['B_y']) / 2
together = together[['A', 'mean']]
And together is:
A mean
0 1 5.0
1 2 4.0
2 3 4.0
3 4 8.0
4 5 15.0
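Another way to get the average without setting an index is to stack the two frames and group by A (a sketch reusing df1 and df2 from above; note that any duplicated ids would be collapsed into a single averaged row, and 'averaged.csv' is just a placeholder path):

combined = pd.concat([df1, df2], ignore_index=True)
averaged = combined.groupby('A', as_index=False, sort=False)['B'].mean()  # keep A's original order
averaged.to_csv('averaged.csv', index=False)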

Elegant way to assign header names to pandas dataframe when assigning new columns in for loop?

I have a for loop that iteratively adds columns to a pandas dataframe. I wish to also name these new columns based on a list. I have a convoluted way to do it now; is there a more elegant way?
When assigning a new column with .assign(), you have to specify a column name, but the keyword cannot be a variable. So I use a dummy name and afterwards change the column names based on a list I defined earlier. This doesn't seem very elegant, though.
The dataframe columns should be [wavelength, layers[0]_n, layers[0]_k, ... layers[z]_n, layers[z]_k]
layers = ['Ag', 'SiO2', 'Au']

colnames = ['wavelength']
for l in layers:
    colnames.append(l + '_n')
    colnames.append(l + '_k')

n = pd.read_csv('matdata\\' + layers[0] + '.csv')
n = n.iloc[:, [0]]  # keep only the wavelength column (as a DataFrame, so .assign() works)
for l in layers:
    data = pd.read_csv('matdata\\' + l + '.csv')  # read the appropriate file
    n = n.assign(a=data.iloc[:, 1].values)  # dummy column name
    n = n.assign(b=data.iloc[:, 2].values)  # dummy column name
n.columns = colnames
Because I don't have access to your CSVs, etc., I am creating some fake data to simulate the process...
Let's start with several DataFrames:
n = pd.DataFrame([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]],
                 columns=['x', 'y', 'z'])
dfb = pd.DataFrame([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])
layers = ['Ag', 'SiO2']
for layer in layers:
    n[layer] = dfb.iloc[:, 1].values
Yields:
x y z Ag SiO2
0 1 2 3 2 2
1 4 5 6 5 5
2 7 8 9 8 8
Using this technique rather than .assign() lets a variable supply the column name as each column is created.
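Applied back to the original loop, the same direct-assignment idea might look roughly like this (a sketch, assuming every matdata CSV has wavelength, n, and k in its first three columns and shares the same wavelength grid, as in the question):

import pandas as pd

layers = ['Ag', 'SiO2', 'Au']

# Start from the wavelength column of the first layer's file.
n = pd.read_csv('matdata\\' + layers[0] + '.csv').iloc[:, [0]]
n.columns = ['wavelength']

for l in layers:
    data = pd.read_csv('matdata\\' + l + '.csv')
    n[l + '_n'] = data.iloc[:, 1].values  # column name built directly from the layer name
    n[l + '_k'] = data.iloc[:, 2].values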

How to delete the randomly sampled rows of a dataframe, to avoid sampling them again?

I have a dataframe (df) of 12 rows x 5 columns. I sample 1 row from each label and create a new dataframe (df1) of 3 rows x 5 columns. The next time I sample more rows from df, I need to avoid choosing the ones that are already in df1. So how can I delete the already-sampled rows from df?
import pandas as pd
import numpy as np
# 12x5
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
#3x5
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
#My attempt. It should be a 9x5 dataframe
df2 = pd.concat(f.drop(idx) for idx, f in df1.groupby('label'))
df
df1
df2
Starting with this DataFrame:
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
Your first sample is this:
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
For the second sample, you can drop df1's indices from df:
pd.concat(g.sample(1) for idx, g in df.drop(df1.index).groupby('label'))
Out:
0 1 2 3 4 label
2 0.188005 0.765640 0.549734 0.712261 0.334071 1
4 0.599812 0.713593 0.366226 0.374616 0.952237 2
8 0.631922 0.585104 0.184801 0.147213 0.804537 3
This is not an inplace operation. It doesn't modify the original DataFrame. It just drops the rows, returns a copy, and samples from that copy. If you want it to be permanent, you can do:
df2 = df.drop(df1.index)
And sample from df2 afterwards.
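If you need several rounds of sampling without replacement, one possible sketch is to keep a shrinking working copy and drop each round's indices from it (reusing df from above; pool and samples are names introduced here):

samples = []
pool = df.copy()  # working copy so the original df stays untouched

for _ in range(3):  # three rounds of one-row-per-label sampling
    picked = pd.concat(g.sample(1) for _, g in pool.groupby('label'))
    samples.append(picked)
    pool = pool.drop(picked.index)  # sampled rows can't be picked again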
