Convert row to column header for Pandas DataFrame - python

The data I have to work with is a bit messy: it has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header

In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1  foo  bar  baz
0    1    2    3
2    4    5    6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1  foo  bar  baz
0    1    2    3
2    4    5    6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
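A quick sketch of that pitfall, using a hypothetical frame whose index repeats the label 1:

```python
import pandas as pd

# Hypothetical frame whose index is not unique: two rows share the label 1.
df = pd.DataFrame([(1, 2, 3), ('foo', 'bar', 'baz'), (4, 5, 6)],
                  index=[0, 1, 1])

# Label-based drop removes *both* rows labelled 1, not just the header row.
dropped = df.drop(1)

# Positional removal keeps the other row that happens to share the label.
kept = df.iloc[pd.RangeIndex(len(df)).drop(1)]
print(len(dropped), len(kept))  # 1 2
```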

This works (pandas v0.19.2):
df.rename(columns=df.iloc[0])
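Since rename returns a new frame, the header promotion and the row drop can be chained in one expression; a minimal sketch with made-up data:

```python
import pandas as pd

# Made-up frame whose first row holds the real header names.
df = pd.DataFrame([('a', 'b'), (1, 2), (3, 4)])

# rename returns a copy, so the header row can be promoted and dropped in one go.
out = df.rename(columns=df.iloc[0]).drop(df.index[0])
print(list(out.columns))  # ['a', 'b']
```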

It may be easier to recreate the data frame. This also gives you a chance to re-infer the column dtypes from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
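One caveat, shown in a small sketch with made-up data: df.values on a mixed frame comes back as an object array, so the rebuilt columns start out as dtype object; an explicit infer_objects() call converts them back:

```python
import pandas as pd

df = pd.DataFrame([('a', 'b'), (1, 2), (3, 4)])

headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)

# The rows came from an object ndarray, so both columns are dtype object here;
# infer_objects() re-infers them as int64.
new_df = new_df.infer_objects()
print(new_df.dtypes)
```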

To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace=True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace=True)
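Putting both in-place steps together on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame([('x', 'y'), (1, 2)])

# Both calls mutate df directly; neither returns a new frame.
df.rename(columns=df.iloc[0], inplace=True)
df.drop(df.index[0], inplace=True)
print(list(df.columns), len(df))  # ['x', 'y'] 1
```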

You can specify the row index in the read_csv or read_html constructors via the header parameter, which represents "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]
   pears  apples  lemons  plums  other
0     40      50      61     72     85

Keeping it Python simple
Pandas DataFrames have a columns attribute, so why not use it with standard Python? It is much clearer what you are doing:
import pandas as pd

table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
         ['testsala.cxf', '86', '95', '92', '87'],
         ['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
         ['630.cxf', '18', '8', '11', '18'],
         ['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
         ['dedo uv no filtro.cxf', '52', '93', '48', '58']]
data = pd.DataFrame(table[1:], columns=table[0])
or, if the header row is not the first but, say, the one at (zero-based) index 10 (list.pop removes and returns it, so it will not also appear in the data):
columns = table.pop(10)
data = pd.DataFrame(table, columns=columns)
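A runnable sketch of the pop approach, with a made-up table whose header sits at (zero-based) index 2:

```python
import pandas as pd

# Made-up table: two junk rows, then the header, then the data.
table = [['junk', 'junk'],
         ['junk', 'junk'],
         ['name', 'score'],
         ['alice', '10'],
         ['bob', '20']]

# list.pop is zero-based: pop(2) removes and returns the third row.
columns = table.pop(2)

# The rows before the header are junk, so slice them off as well.
data = pd.DataFrame(table[2:], columns=columns)
print(list(data.columns), len(data))  # ['name', 'score'] 2
```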

Related

How to split multiple dictionaries in row into new rows using Pandas

I have the following dataframe with multiple dictionaries in a list in the Rules column.
SetID SetName Rules
0 Standard_1 [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]
1 Standard_2 [{'RulesID': '12', 'RuleName': 'name_arg'}]
The desired output is:
SetID SetName RulesID RuleName
0 Standard_1 10 name_abc
0 Standard_1 11 name_xyz
1 Standard_2 12 name_arg
It might be possible that there are more than two dictionaries inside of the list.
I am thinking about a pop, explode or pivot function to build the dataframe but I have no clue how to start.
Any advice would be much appreciated!
EDIT: To build the dataframe you can use the following dataframe constructor:
# initialize list of lists
data = [[0, 'Standard_1', [{'RulesID': '10', 'RuleName': 'name_abc'}, {'RulesID': '11', 'RuleName': 'name_xyz'}]], [1, 'Standard_2', [{'RulesID': '12', 'RuleName': 'name_arg'}]]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['SetID', 'SetName', 'Rules'])
You can use explode:
tmp = df.explode('Rules').reset_index(drop=True)
df = pd.concat([tmp, pd.json_normalize(tmp['Rules'])], axis=1).drop('Rules', axis=1)
Output:
>>> df
   SetID     SetName RulesID  RuleName
0      0  Standard_1      10  name_abc
1      0  Standard_1      11  name_xyz
2      1  Standard_2      12  name_arg
One-liner version of the above (inside the lambda, the piped frame is x, not tmp):
df.explode('Rules').reset_index(drop=True).pipe(lambda x: pd.concat([x, pd.json_normalize(x['Rules'])], axis=1)).drop('Rules', axis=1)

How to split multiple values that are in the same column in a safe way

As part of a program I read a pandas data frame. One of its columns contains many values separated by : in the same cell. To know what these values mean, there is another column that says what each value is.
I want to split these values and put them in new columns. The problem is that not every input my program receives has exactly the same kind of data frame, and the values can appear in a different order.
With an example is easier to explain:
df1
Column1 Column2
GT:AV:AD 0.1:123:23
GT:AV:AD 0.2:456:24
df2
Column1 Column2
GT:AD:AV 0.4:23:123
GT:AD:AV 0.5:12:323
Before being aware of this issue, what I did to split this data and put it in new columns was something like this:
file_data["GT"] = file_data[name_sample].apply(lambda x: x.split(":")[1])
file_data["AD"] = file_data[name_sample].apply(lambda x: x.split(":")[2])
If what I want is GT and AD (if they are present in the input data frame), how can I do this in a safer way?
import pandas as pd
df = pd.DataFrame({"col1":["GT:AV:AD","GT:AD:AV"],"col2":["0.1:123:23","0.4:23:123"]})
df["keyvalue"] = df.apply(lambda x:dict(zip(x.col1.split(":"),x.col2.split(":"))), axis=1)
print(df)
output
col1 col2 keyvalue
0 GT:AV:AD 0.1:123:23 {'GT': '0.1', 'AV': '123', 'AD': '23'}
1 GT:AD:AV 0.4:23:123 {'GT': '0.4', 'AD': '23', 'AV': '123'}
Explanation: I create the column keyvalue holding keys (from col1) and values (from col2) as dicts, using the dict(zip(keys_list, values_list)) construct. apply with axis=1 applies the function to each row; lambda is used in Python to create an anonymous function. If you would rather have a pandas.DataFrame than a column of dicts, you can do:
df2 = df.apply(lambda x:dict(zip(x.col1.split(":"),x.col2.split(":"))), axis=1).apply(pd.Series)
print(df2)
output
    GT   AV  AD
0  0.1  123  23
1  0.4  123  23
Have a look at this answer:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))
print(dictionary) # {'a': 1, 'b': 2, 'c': 3}
You need to split your column 1 into a list of keys and column 2 into a list of values.
This way you will have dictionary["GT"], etc.
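Building on the dict(zip(...)) idea, here is a sketch (column and key names taken from the question) that extracts only the keys you care about, regardless of their order in col1; dict.get returns None when a key is absent, so a missing field does not raise:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["GT:AV:AD", "GT:AD:AV"],
                   "col2": ["0.1:123:23", "0.4:23:123"]})

# Map each row's keys to its values, then pull out only the wanted keys.
pairs = df.apply(lambda r: dict(zip(r.col1.split(":"), r.col2.split(":"))), axis=1)
for key in ["GT", "AD"]:
    df[key] = pairs.apply(lambda d, k=key: d.get(k))

print(df[["GT", "AD"]])
```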

Pandas / Python remove duplicates based on specific row values

I am trying to remove duplicates based on multiple criteria:
Find duplicates in column df['A']
Check column df['Status'] and prioritize OK over Open, and Open over Close
If duplicates have the same status, pick the latest one based on df['Col_1']
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'],
                   'Status': ['OK', 'Close', 'Close', 'OK', 'OK', 'Open', 'Open', 'Open', np.nan],
                   'Col_1': [2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000]})
df
Expected output:
I have tried different solutions, like the links below (map or loc), but I am unable to find the correct way:
Pandas : remove SOME duplicate values based on conditions
Create an ordered Categorical to prioritize Status, then sort by all columns, remove duplicates keeping the first row per column A, and finally restore the original order by sorting the index:
c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = df.sort_values(['A','Status','Col_1']).drop_duplicates('A').sort_index()
print (df)
     A Status  Col_1
0   11     OK   2000
2   12  Close   2000
3  NaN     OK   2000
4   13     OK   2000
6   14   Open   2000
8   15    NaN   2000
EDIT: If you need to avoid the NaN rows being removed, add a helper column:
df['test'] = df['A'].isna().cumsum()
c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = (df.sort_values(['A','Status','Col_1', 'test'])
.drop_duplicates(['A', 'test'])
.sort_index())
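The ordered Categorical is what decides which duplicate survives; a minimal sketch with just two conflicting rows:

```python
import pandas as pd

df = pd.DataFrame({'A': ['11', '11'],
                   'Status': ['Close', 'OK'],
                   'Col_1': [2001, 2000]})

# 'OK' sorts before 'Open' before 'Close', so the OK row wins the dedup.
c = ['OK', 'Open', 'Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
out = df.sort_values(['A', 'Status', 'Col_1']).drop_duplicates('A').sort_index()
print(out)
```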

PYTHON BEGINNER: How to create a pandas dataframe from a list of python dictionaries?

I am looking for a way to create a pandas DataFrame from a list of dictionaries and then add it to an excel file using pandas.
The first dictionary has 3 values (integers) and the second one has one value which corresponds to a set of words. The keys of the two dictionaries are the same, but to be sure there is no error in the excel file I prefer to have them in the DataFrame.
d1 = {'1': ['45', '89', '96'], '2': ['78956', '50000', '100000'], '3': ['0', '809', '656']}
d2 = {'1': ['connaître', 'rien', 'trouver', 'être', 'emmerder', 'rien', 'suffire', 'mettre', 'multiprise'], '2': ['trouver', 'être', 'emmerder'], '3' : ['con', 'ri', 'trou', 'êt', 'emmer',]}
Every attempt gives me an error; I am really stuck and need a solution.
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1], na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
df1 = pd.DataFrame(d1).T.reset_index()
df1['value1_d2'] = ''
# iterate over the dict and add the lists of words in the new column
for k, v in d2.items():
    df1.at[int(k) - 1, 'value1_d2'] = v
# print(df1)
df1.columns = ['id', 'value_1_Dict1', 'value_2_Dict1', 'value_3_Dict1', 'value_2_Dict2']
cols = df1.columns.tolist()
cols = cols[-1:] + cols[:-1]
df1 = df1[cols]
print(df1)
df = pd.concat([df, df1], axis=1)
df.to_excel('exit.xlsx')
I do not get an error, but the filling of the dataframe starts after the real columns, like in the example, and I have more than 2000 lines.
Expected output: I add it in an existing file :
   score  freq                             value1_d2  id  value1  value2  value3
0    0.5     2      ['connaître', 'rien', 'trouver']   1      45      89      96
1    0.8     5       ['trouver', 'être', 'emmerder']   2   78956    5000  100000
2    0.1     5  ['con', 'ri', 'trou', 'êt', 'emmer']   3       0     809      65
When trying to add it to the excel file I get the following error. I want to start writing from the first column so that the keys will be the same.
Is there a way to solve it using pandas? (I have to use pandas for this seminar.)
Thank you.
This way you can add the lists of words in a cell:
df1 = pd.DataFrame(d1)
# the new column needs to have dtype object
df1['value1_d2'] = ''
# iterate over the dict and add the lists of words in the new column
for k, v in d2.items():
    df1.at[int(k) - 1, 'value1_d2'] = v
I used the info in this post as well.
When reading a dictionary into a dataframe you can use:
>>> d1 = {'1': ['45', '89', '96'], '2': ['78956', '50000', '100000'], '3': ['0', '809', '656']}
>>> df1 = pd.DataFrame.from_dict(d1)
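Putting the pieces together for the data in the question, here is a sketch that builds one row per key and attaches the word lists as a single object column (the column names value1..value3 and words are made up for the example):

```python
import pandas as pd

d1 = {'1': ['45', '89', '96'], '2': ['78956', '50000', '100000'],
      '3': ['0', '809', '656']}
d2 = {'1': ['connaître', 'rien', 'trouver'], '2': ['trouver', 'être', 'emmerder'],
      '3': ['con', 'ri', 'trou', 'êt', 'emmer']}

# Transpose d1 so its keys become the row index, one row per key.
df1 = pd.DataFrame(d1).T
df1.columns = ['value1', 'value2', 'value3']

# pd.Series(d2) is indexed by the same keys, so assignment aligns the lists.
df1['words'] = pd.Series(d2)
print(df1)
```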

move column in pandas dataframe

I have the following dataframe:
   a  b   x  y
0  1  2   3 -1
1  2  4   6 -2
2  3  6   9 -3
3  4  8  12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
   a  y  b   x
0  1 -1  2   3
1  2 -2  4   6
2  3 -3  6   9
3  4 -4  8  12
To make it more bullet proof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]
cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
similarly, if you want this column to be e.g. third column from the beginning:
df.insert(2, "name", column_to_move)
You can use the approach below. It's very simple, and similar to the answer given by Charlie Haley.
df1 = df.pop('b')  # remove column b and store it in df1
df2 = df.pop('x')  # remove column x and store it in df2
df['b'] = df1  # add the b series back as a 'new' column
df['x'] = df2  # add the x series back as a 'new' column
Now you have your dataframe with the columns 'b' and 'x' at the end. You can see this video from OSPY: https://youtu.be/RlbO27N3Xg4
Similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
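This relies on Python evaluating call arguments left to right: len(df.columns) - 1 is computed while the column is still present, which lands at the end position once pop has removed it. A quick check on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'x': [3], 'y': [4]})

# With 4 columns, len(df.columns) - 1 evaluates to 3 *before* df.pop('b')
# runs, and 3 is the end position of the 3-column frame left after the pop.
df.insert(len(df.columns) - 1, 'b', df.pop('b'))
df.insert(len(df.columns) - 1, 'x', df.pop('x'))
print(list(df.columns))  # ['a', 'y', 'b', 'x']
```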
This function will reorder your columns without losing data. Any omitted columns remain in the center of the data set:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
    # A list comprehension (rather than set arithmetic) preserves the
    # original order of the columns that stay in the middle.
    middle = [c for c in columns if c not in first_cols + last_cols + drop_cols]
    new_order = first_cols + middle + last_cols
    return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method;
from pandas import DataFrame

def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    Re-arrange the columns in a dataframe to place the desired columns at the desired index.

    ex usage: df = move_columns(df, ['Rev'], 2)

    :param df: the dataframe to re-arrange
    :param cols_to_move: the names of the columns to move, as a list
    :param new_index: the 0-based location to place the columns
    :return: a dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. This keeps the column manipulation in NumPy / pandas operations rather than explicit Python loops; note that Index.difference returns the remaining labels in sorted order, not necessarily their original order.
import numpy as np

cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S.: The package can be found here: https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
Move any column to the last column of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
Move any column to the first column of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the column that you want to move.
I use the Pokémon database as an example; the columns of my database are:
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd

df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cols_end = ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cols_end, start=len(cols) - len(cols_end)):
    cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)
