Get the column names of a python numpy array - python

I have a csv data file with a header indicating the column names.
xy wz hi kq
0 10 5 6
1 2 4 7
2 5 2 6
I run:
X = np.array(pd.read_csv('gbk_X_1.csv').values)
I want to get the column names:
['xy', 'wz', 'hi', 'kq']
I read this post but the solution provides me with None.

Use the following code:
import re
with open('f.csv', 'r') as f:
    alllines = f.readlines()
columns = re.sub(' +', ' ', alllines[0])  # collapse repeated spaces in the header line
columns = columns.strip().split(' ')      # split on a single space
print(columns)
Assume the CSV file looks like this:
xy wz hi kq
0 10 5 6
1 2 4 7
2 5 2 6
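If the file really is comma-separated, the standard library csv module can read the header directly, with no regex needed. A minimal sketch, using an in-memory string as a stand-in for f.csv:

```python
import csv
import io

# Stand-in for the contents of f.csv (comma-separated).
text = "xy,wz,hi,kq\n0,10,5,6\n1,2,4,7\n2,5,2,6\n"
reader = csv.reader(io.StringIO(text))
columns = next(reader)  # the first row is the header
print(columns)  # ['xy', 'wz', 'hi', 'kq']
```

For a space-delimited file like the one shown above, pass delimiter=' ' to csv.reader instead.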

Let's assume your csv file looks like
xy,wz,hi,kq
0,10,5,6
1,2,4,7
2,5,2,6
Then use pd.read_csv to dump the file into a dataframe
df = pd.read_csv('gbk_X_1.csv')
The dataframe now looks like
df
xy wz hi kq
0 0 10 5 6
1 1 2 4 7
2 2 5 2 6
Its three main components are the
data which you can access via the values attribute
df.values
array([[ 0, 10,  5,  6],
       [ 1,  2,  4,  7],
       [ 2,  5,  2,  6]])
index which you can access via the index attribute
df.index
RangeIndex(start=0, stop=3, step=1)
columns which you can access via the columns attribute
df.columns
Index(['xy', 'wz', 'hi', 'kq'], dtype='object')
If you want the columns as a list, use the tolist method
df.columns.tolist()
['xy', 'wz', 'hi', 'kq']
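Putting the pieces together for the original question, a sketch that uses an in-memory string as a stand-in for gbk_X_1.csv:

```python
import io

import pandas as pd

# Stand-in for the contents of gbk_X_1.csv from the question.
csv_text = "xy,wz,hi,kq\n0,10,5,6\n1,2,4,7\n2,5,2,6\n"
df = pd.read_csv(io.StringIO(csv_text))
X = df.values                # the data as a NumPy array, shape (3, 4)
names = df.columns.tolist()  # ['xy', 'wz', 'hi', 'kq']
print(names)
```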

Related

Insert Row in Dataframe at certain place

I have the following Dataframe:
Now I want to insert an empty row after every time the column "Zweck" equals 7.
So, for example, the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
df = pd.DataFrame(np.insert(df.values, ind, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
a b f
0 1 1 1
1
2 2 2 7
3 3 3 3
4 4 4 4
5
6 5 5 7
The dataframe is overwritten here because appending row by row would be resource intensive. Rows carrying the marker value 33 are inserted first; this is necessary because np.insert does not allow string values to be substituted into a numeric array. Since np.insert drops the column names, df.rename restores them from ren_dict. Finally, the rows where df['a'] == 33 are located and set to empty strings.
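An alternative sketch that avoids np.insert and the marker value entirely: give each original row an even index label, reserve the odd slot after each matching row, and reindex with an empty-string fill. (This variant places the blank row after each f == 7 row.)

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 2, 3, 4, 5],
                   'f': [1, 7, 3, 4, 7]})

df.index = range(0, 2 * len(df), 2)                    # even labels for original rows
empty_slots = [i + 1 for i in df.index[df['f'] == 7]]  # odd slot after each match
df = df.reindex(sorted(list(df.index) + empty_slots), fill_value='')
df = df.reset_index(drop=True)
print(df)
```

reindex fills the new labels with '' via fill_value, so no post-processing pass is needed.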

Data Preprocessing in Python using Pandas

I am trying to preprocess one of the columns in my data frame. The issue is that I have [[ content1] , [content2], [content3]] in the relations column, and I want to remove the brackets.
I have tried the following:
df['value'] = df['value'].str[0]
The output that I get is
[content 1]
print(df)
id value
1 [[str1],[str2],[str3]]
2 [[str4],[str5]]
3 [[str1]]
4 [[str8]]
5 [[str9]]
6 [[str4]]
The expected output should be:
id value
1 str1,str2,str3
2 str4,str5
3 str1
4 str8
5 str9
6 str4
It looks like you have lists of lists. You can try to unnest and join:
df['value'] = df['value'].apply(lambda x: ','.join([e for l in x for e in l]))
Or:
from itertools import chain
df['value'] = df['value'].apply(lambda x: ','.join(chain.from_iterable(x)))
NB. If you get an error, please provide it and the type of the column (df.dtypes)
From what I can see, your values are stored as strings; sampling the same data:
Sample Data:
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]', '[[str8]]', '[[str9]]', '[[str4]]']})
print(df)
id value
0 1 [[str1],[str2],[str3]]
1 2 [[str4],[str5]]
2 3 [[str1]]
3 4 [[str8]]
4 5 [[str9]]
5 6 [[str4]]
Result:
df['value'] = df['value'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(df)
id value
0 1 str1,str2,str3
1 2 str4,str5
2 3 str1
3 4 str8
4 5 str9
5 6 str4
Note: the error AttributeError: Can only use .str accessor with string values means the column is not being treated as str, so cast it with astype(str) and then do the replace operation. regex=False is needed because '[' is a regex metacharacter.
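If the values are strings, both replacements can also be done in a single pass with a character class and regex=True. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'value': ['[[str1],[str2],[str3]]', '[[str4],[str5]]']})
# [\[\]] matches either bracket, so one replace strips them all.
df['value'] = df['value'].str.replace(r'[\[\]]', '', regex=True)
print(df['value'].tolist())  # ['str1,str2,str3', 'str4,str5']
```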
You can use Python's built-in regex module, re.
This is the solution:
import pandas as pd
import re
# make the test data
data = [
    [1, '[[str1],[str2],[str3]]'],
    [2, '[[str4],[str5]]'],
    [3, '[[str1]]'],
    [4, '[[str8]]'],
    [5, '[[str9]]'],
    [6, '[[str4]]']
]
# convert the data to a DataFrame
df = pd.DataFrame(data, columns = ['id', 'value'])
print(df)
# remove '[' and ']' from the 'value' column
df['value'] = df.apply(lambda x: re.sub(r"[\[\]]", "", x['value']), axis=1)
print(df)

Delete empty dataframes from a list with dataframes

This is a list of dataframes.
import pandas as pd
data=[pd.DataFrame([1,2,3],columns=['a']),pd.DataFrame([]),pd.DataFrame([]),
pd.DataFrame([3,4,5,6,7],columns=['a'])]
I am trying to delete the empty dataframes from the above list that contains dataframes.
Here is what I have tried:
for i in data:
del i.empty()
data
which gives:
File "<ipython-input-33-d07b32efe793>", line 2
del i.empty()
^ SyntaxError: cannot delete function call
Important: the result needs to be stored back in the data variable as well.
Try this:
import pandas as pd
data = [pd.DataFrame([1, 2, 3], columns=['a']), pd.DataFrame([]),
pd.DataFrame([]),
pd.DataFrame([3, 4, 5, 6, 7], columns=['a'])]
for i in range(len(data) - 1, -1, -1):
    if data[i].empty:
        del data[i]
print(data)
The problem with your code is that df.empty is a property that returns True or False; what you want is to delete item i when data[i].empty is True.
Please note that we iterate over a reversed range (down to index 0) so that deleting items does not shift the positions of items we have not yet visited, which would otherwise cause a list index out of range error.
We can use filter:
data = list(filter(lambda df: not df.empty, data))
or a list comprehension:
data = [df for df in data if not df.empty]
print(data)
[ a
0 1
1 2
2 3, a
0 3
1 4
2 5
3 6
4 7]
You can do this:
[i for i in data if len(i)>0]
Output:
[ a
0 1
1 2
2 3, a
0 3
1 4
2 5
3 6
4 7]

How to name Pandas Dataframe Columns automatically?

I have a Pandas dataframe df with 102 columns. Each column is named differently, say A, B, C etc. to give the original dataframe following structure
Column A. Column B. Column C. ....
Row 1.
Row 2.
---
Row n
I would like to change the column names from A, B, C, etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there a simple way to rename all columns to F1 through F102 automatically, instead of renaming each column individually?
df.columns = ["F" + str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
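The same idea written with rename and the calculated column count, so no magic number appears. A sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
# Build a {old_name: new_name} mapping from the current columns.
df = df.rename(columns={old: f"F{i}" for i, old in enumerate(df.columns, start=1)})
print(df.columns.tolist())  # ['F1', 'F2', 'F3']
```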
One way to do this is to work with a pair of lists, the column names and the row values, and rename the column names using a loop index:
import pandas as pd
d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)
cols = list(dataFrame.columns.values)  # the original column names as a list
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename the column based on the index
    index += 1
vals = dataFrame.values.tolist()  # the row values
newDataFrame = pd.DataFrame(vals, columns=cols)  # new dataframe with the new column names and the original rows
print(newDataFrame)
Output:
F1 F2 F3
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 4 4 4
6 3 3 3
7 2 2 2
8 1 1 1

Match rows between dataframes and preserve order

I work in python and pandas.
Let's suppose that I have a dataframe like that (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like that for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to do a matching/merging between df_1 and df_2 with using as keys the columns A and B.
I tried with isin() to do this:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before without matching the common row 5 1 1 for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think that this is why isin() above at my code does not work since it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way, but it does not preserve the order of the rows in the way I want, and it is pretty tricky or inefficient to fix that.
Finally, keep in mind that with my actual dataframes way more than only 2 columns (e.g. 15) will be used as keys for the matching so it is better that you come up with something concise even for bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. It should scale easily to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1; otherwise changes to df_2 will carry over to df_1 as well.
So generating the data first:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy() # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
And now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:]==x[cols_to_compare].values).all(1)) else np.nan, axis=1)
What it does is check, for each row, whether its values in the columns of interest also appear together in any row of df_1.
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
Someone (I do not remember the username) suggested the following (which I think works) and then deleted their post for some reason:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
You can accomplish this using two for loops:
for row in df_2.iterrows():
    for row2 in df_1.iterrows():
        if [row[1]['A'], row[1]['B']] == [row2[1]['A'], row2[1]['B']]:
            df_2.loc[row[0], 'C'] = row2[1]['C']  # .loc avoids chained-indexing assignment
Just modify the line below:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
It works fine! (Note, though, that this checks membership in column A and column B independently rather than as (A, B) pairs, so it can over-match on other data.)
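For what it's worth, a left merge does preserve the row order of the left frame, so the pair-wise matching can also be sketched with .merge() on the question's data (df_2 is built from a copy of df_1 here):

```python
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = np.array([[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy()
df_2['B'] -= 1

# A left merge keeps df_2's row order and fills C where (A, B) matches a row of df_1.
out = df_2[['A', 'B']].merge(df_1, on=['A', 'B'], how='left')
print(out)
```

Only the row (5, 1) finds a match, so C becomes [NaN, 1, NaN, NaN] while the original row order is kept.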
