How to name Pandas Dataframe Columns automatically?

How to name Pandas Dataframe Columns automatically? - python

I have a Pandas dataframe df with 102 columns. Each column is named differently, say A, B, C etc. to give the original dataframe following structure
Column A. Column B. Column C. ....
Row 1.
Row 2.
---
Row n
I would like to change the columns names from A, B, C etc. to F1, F2, F3, ...., F102. I tried using df.columns but wasn't successful in renaming them this way. Any simple way to automatically rename all column names to F1 to F102 automatically, insteading of renaming each column name individually?

df.columns=["F"+str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)

One way to do this is to convert it to a pair of lists, and convert the column names list to the index of a loop:
import pandas as pd
d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)
cols = list(dataFrame.columns.values) #convert original dataframe into a list containing the values for column name
index = 1 #start at 1
for column in cols:
cols[index-1] = "F"+str(index) #rename the column name based on index
index += 1 #add one to index
vals = dataFrame.values.tolist() #get the values for the rows
newDataFrame = pd.DataFrame(vals, columns=cols) #create a new dataframe containing the new column names and values from rows
print(newDataFrame)
Output:
F1 F2 F3
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 4 4 4
6 3 3 3
7 2 2 2
8 1 1 1

Related

Insert Row in Dataframe at certain place

I have the following Dataframe:
Now i want to insert an empty row after every time the column "Zweck" equals 7.
So for example the third row should be an empty row.

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
df = pd.DataFrame(np.insert(df.values, ind, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
a b f
0 1 1 1
1
2 2 2 7
3 3 3 3
4 4 4 4
5
6 5 5 7
Here the dataframe is overwritten, as the append operation will be resource intensive. As a result, the required strings with values 33 appear. This is necessary because np.insert does not allow string values to be substituted. Columns are renamed to their original state with: df.rename. Finally, we find lines with df['a'] == 33 to set to empty values.

How can I find and store how many columns it takes to reach a value greater than the first value in each row?

import pandas as pd
df = {'a': [3,4,5], 'b': [1,2,3], 'c': [4,3,3], 'd': [1,5,4], 'e': [9,4,6]}
df1 = pd.DataFrame(df, columns = ['a', 'b', 'c', 'd', 'e'])
dg = {'b': [2,3,4]}
df2 = pd.DataFrame(dg, columns = ['b'])
Original dataframe is df1. For each row, I want to find the first time a value is bigger than the value in the first column and store it in a new dataframe.
df1
a b c d e
0 3 1 4 1 9
1 4 2 3 5 4
2 5 3 3 4 6
df2 is the resulting dataframe. For example, for df1 row 1; the first value is 3 and the first value bigger than 3 is 4 (column c). Hence in df2 row 1, we store 2 (there are two columns from column a to c). For df1 row 2, the first value is 4 and the first value bigger than 4 is 5 (column d). Hence in df2 row 2, we store 3 (there are three columns from column a to d). For df1 row 3, the first value is 5 and the first value bigger than 5 is 6 (column e). Hence in df2 row 3, we store 4 (there are four columns from column a to e).
df2
b
0 2
1 3
2 4
I would appreciate the help.

In your case we can do sub , if the value gt than 0 , we get the id with idxmax
s=df1.columns.get_indexer(df1.drop('a',1).sub(df1.a,0).ge(0).idxmax(1))
array([1, 1, 3])
df['New']=s

You can get the column names by comparing the entire DataFrame index wise against the first columns, replacing false values with NaNs and applying first_valid_index row wise, eg:
names = (
df1.gt(df1.iloc[:, 0], axis=0)
.replace(False, pd.NA) # or use np.nan
.apply(pd.Series.first_valid_index, axis=1)
)
That'll give you:
0 c
1 d
2 e
Then you can convert those to offsets:
offsets = df1.columns.get_indexer(names)
# array([2, 3, 4])

Average two identically formatted Data Frames in Panda's

I have two pandas dataframes that are loaded from CSV files. Each has two columns, column A is an id and is the same value and order in both CSVs. Column B is a numerical value.
I need to create a new CSV with column A identical to the first two and with column B, the average of the two initial CSVs.
I am creating two dataframes like
df1=pd.read_csv(path).set_index('A')
df2=pd.read_csv(otherPath).set_index('A')
If I do
newDf = (df1['B'] + df2['B'])/2
newDf.to_csv(...)
then the newDF has the ids in the wrong order in column A
If i do
df1['B'] = (df1['B'] + df2['B'])/2
df1.to_csv(...)
I get an error on the first line saying "Value Error: cannot reindex from a duplicate axis"
It seems like this should be trivial, what am I doing wrong?

Try using merge instead of setting an index.
I.e. We have these dataframes:
df1 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [3, 4, 5, 6, 7]})
df2 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [7, 4, 3, 10, 23]})
Then we merge them and create a new column with the mean of both B columns.
together = df1.merge(df2, on='A')
together.loc[:, "mean"] = (together['B_x']+ together['B_y']) / 2
together = together[['A', 'mean']]
And the together is:
A mean
0 1 5.0
1 2 4.0
2 3 4.0
3 4 8.0
4 5 15.0

How to delete the randomly sampled rows of a dataframe, to avoid sampling them again?

I have dataframe (df) of 12 rows x 5 columns. I sample 1 row from each label and create a new dataframe (df1) of 3 rows x 5 columns. I need that the next time I sample more rows from df I will not choose the same ones that are already in df1. So how can I delete the already sampled rows from df?
import pandas as pd
import numpy as np
# 12x5
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
#3x5
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
#My attempt. It should be a 9x5 dataframe
df2 = pd.concat(f.drop(idx) for idx, f in df1.groupby('label'))
df
df1
df2

Starting with this DataFrame:
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
Your first sample is this:
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
For the second sample, you can drop df1's indices from df:
pd.concat(g.sample(1) for idx, g in df.drop(df1.index).groupby('label'))
Out:
0 1 2 3 4 label
2 0.188005 0.765640 0.549734 0.712261 0.334071 1
4 0.599812 0.713593 0.366226 0.374616 0.952237 2
8 0.631922 0.585104 0.184801 0.147213 0.804537 3
This is not an inplace operation. It doesn't modify the original DataFrame. It just drops the rows, returns a copy, and samples from that copy. If you want it to be permanent, you can do:
df2 = df.drop(df1.index)
And sample from df2 afterwards.

Nested List to Pandas Dataframe with headers

Basically I am trying to do the opposite of How to generate a list from a pandas DataFrame with the column name and column values?
To borrow that example, I want to go from the form:
data = [['Name','Rank','Complete'],
['one', 1, 1],
['two', 2, 1],
['three', 3, 1],
['four', 4, 1],
['five', 5, 1]]
which should output:
Name Rank Complete
One 1 1
Two 2 1
Three 3 1
Four 4 1
Five 5 1
However when I do something like:
pd.DataFrame(data)
I get a dataframe where the first list should be my colnames, and then the first element of each list should be the rowname
EDIT:
To clarify, I want the first element of each list to be the row name. I am scrapping data so it is formatted this way...

One way to do this would be to take the column names as a separate list and then only give from 1st index for pd.DataFrame -
In [8]: data = [['Name','Rank','Complete'],
...: ['one', 1, 1],
...: ['two', 2, 1],
...: ['three', 3, 1],
...: ['four', 4, 1],
...: ['five', 5, 1]]
In [10]: df = pd.DataFrame(data[1:],columns=data[0])
In [11]: df
Out[11]:
Name Rank Complete
0 one 1 1
1 two 2 1
2 three 3 1
3 four 4 1
4 five 5 1
If you want to set the first column Name column as index, use the .set_index() method and send in the column to use for index. Example -
In [16]: df = pd.DataFrame(data[1:],columns=data[0]).set_index('Name')
In [17]: df
Out[17]:
Rank Complete
Name
one 1 1
two 2 1
three 3 1
four 4 1
five 5 1

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to name Pandas Dataframe Columns automatically? - python

df.columns=["F"+str(i) for i in range(1, 103)] Note: Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g. len(df.columns) + 1, or df.shape[1] + 1. (Thanks to ALollz for this tip in his comment.)

Related

Insert Row in Dataframe at certain place

How can I find and store how many columns it takes to reach a value greater than the first value in each row?

Average two identically formatted Data Frames in Panda's

How to delete the randomly sampled rows of a dataframe, to avoid sampling them again?

Nested List to Pandas Dataframe with headers

Categories

Resources