Pandas read multiindexed csv with blanks

I'm struggling to properly load a CSV that has a multi-line header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0

Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].ffill()
so that columns now looks like
In [314]: columns
Out[314]:
     0  1
0  NaN  A
1  NaN  B
2    C  X
3    C  Y
4    C  Z
5    D  X
6    D  Y
7    D  Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].ffill()
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
   A  B  C        D
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
1  3  4  5  6  7  8  9  0

There is no magic way to tell pandas how you want your index to look. The closest you can get is to specify most of it yourself, like this:
names = ['A', 'B',
         ('C', 'X'), ('C', 'Y'), ('C', 'Z'),
         ('D', 'X'), ('D', 'Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])
Gives:
     C        D
     X  Y  Z  X  Y  Z
A B
1 2  3  4  5  6  7  8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
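The dynamic approach described above could be sketched roughly as follows. This is not the answerer's code: it reads the two header rows with the standard csv module rather than pd.read_csv(..., nrows=1), and the in-memory string raw is an assumed stand-in for the real file. Columns that appear before the first group label (A and B here) get an empty second level.

```python
import csv
import io

import pandas as pd

# Stand-in for the file contents (assumption: same layout as in the question)
raw = ",,C,,,D,,\nA,B,X,Y,Z,X,Y,Z\n1,2,3,4,5,6,7,8\n"

# Read just the two header rows
reader = csv.reader(io.StringIO(raw))
top, bottom = next(reader), next(reader)

names = []
group = ""
for upper, lower in zip(top, bottom):
    if upper:  # a new group label starts here
        group = upper
    # Columns before the first group (A, B) get an empty second level
    names.append((group, lower) if group else (lower, ""))

# Load the data rows only, then attach the rebuilt header
df = pd.read_csv(io.StringIO(raw), header=None, skiprows=2)
df.columns = pd.MultiIndex.from_tuples(names)
print(df)
```

With a real file you would open it twice (once for the header rows, once for the data) instead of using io.StringIO.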

You can read the file using:
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)

Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)

I used a technique to flatten from the multi-index columns and make one column. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
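A small illustration of the flattening idea, as a sketch on made-up data. One assumed variation from the line above: it uses strip("_") instead of strip(), so that a column whose second level is empty becomes "A" rather than "A_" (plain strip() only removes whitespace).

```python
import pandas as pd

# Hypothetical frame with two header levels; ("A", "") has an empty second level
cols = pd.MultiIndex.from_tuples([("A", ""), ("C", "X"), ("C", "Y")])
df = pd.DataFrame([[1, 2, 3]], columns=cols)

# Join the levels with "_" and trim the separator left behind by empty levels
flat = ["_".join(col).strip("_") for col in df.columns.values]
df.columns = flat
print(df.columns.tolist())
```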

Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then you can iterate over each column header, clean it up, turn it into a tuple, then re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
    [tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.

Related

Removing columns selectively from multilevel index dataframe

Say we have a dataframe like this and want to remove columns when certain conditions are met.
df = pd.DataFrame(
    np.arange(2, 14).reshape(-1, 4),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays([
        ['data1', 'data2','data1','data2'],
        ['F', 'K','R','X'],
        ['C', 'D','E','E']
    ], names=['meter', 'Sleeper','sweeper'])
)
df
Then let's say we want to remove columns only when meter == data1 and sweeper == E,
so I tried
df = df.drop(('data1','E'),axis = 1)
KeyError: 'E'
second try
df.drop(('data1','E'), axis = 1, level = 2)
KeyError: "labels [('data1', 'E')] not found in level"
Pandas: drop a level from a multi-level column index?

Seems drop doesn't support selection over split levels ([0,2] here). We can create a mask with the conditions instead using get_level_values:
# keep where not ((level0 is 'data1') and (level2 is 'E'))
col_mask = ~((df.columns.get_level_values(0) == 'data1')
             & (df.columns.get_level_values(2) == 'E'))
df = df.loc[:, col_mask]
We can also do this by integer location, excluding the positions that fall in a particular index slice. However, this is less clear and less flexible overall:
idx = pd.IndexSlice['data1', :, 'E']
cols = [i for i in range(len(df.columns))
        if i not in df.columns.get_locs(idx)]
df = df.iloc[:, cols]
Either approach produces df:
meter   data1 data2
Sleeper     F     K   X
sweeper     C     D   E
A           2     3   5
B           6     7   9
C          10    11  13
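If this comes up often, the mask approach could be wrapped in a small helper. The sketch below is only an illustration: drop_columns_where and its conditions argument are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_columns_where(df, conditions):
    """Drop columns whose level values match every level -> value pair."""
    match = np.ones(len(df.columns), dtype=bool)
    for level, value in conditions.items():
        # Narrow the match to columns that also satisfy this level's condition
        match &= df.columns.get_level_values(level) == value
    return df.loc[:, ~match]


df = pd.DataFrame(
    np.arange(2, 14).reshape(-1, 4),
    index=list("ABC"),
    columns=pd.MultiIndex.from_arrays(
        [["data1", "data2", "data1", "data2"],
         ["F", "K", "R", "X"],
         ["C", "D", "E", "E"]],
        names=["meter", "Sleeper", "sweeper"],
    ),
)
result = drop_columns_where(df, {"meter": "data1", "sweeper": "E"})
print(result)
```

Levels can be given by name, as here, or by integer position, since get_level_values accepts both.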

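You have to do them individually, since they are on different levels. Note that this drops every column where meter is data1 or sweeper is E, not only the columns that match both: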
df.drop('data1', axis=1, level='meter').drop('E', axis = 1, level='sweeper')
Out[833]:
meter   data2
Sleeper     K
sweeper     D
A           3
B           7
C          11

Replace column names with quotations with no quotations

I am trying to replace my column names that have quotations and simply remove the quotations but when I try this:
for x in df.columns:
    x = x.replace('"', '')
    print(x)
Nothing happens and the quotations are still there.

I would do something like this
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
CODE
import pandas as pd
df=pd.DataFrame({"a":[1,2],'"b"':[3,4]})
print('BEFORE')
print(df)
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
print('AFTER')
print(df)
OUTPUT
BEFORE
   a  "b"
0  1    3
1  2    4
AFTER
   a  b
0  1  3
1  2  4

You can remove it by writing the following code:
col=[]
for x in df.columns:
    x = x.replace('"', '')
    col.append(x)
df.columns=col
To learn more about column renaming, see "Renaming columns in pandas".

One canonical solution to this problem is using pandas str.replace on the header directly (this is "vectorized"):
df = pd.DataFrame({"a": [1, 2], '"b"': [3, 4]})
df.columns = df.columns.str.replace('"', '')
df
   a  b
0  1  3
1  2  4

Run a function that requires multiple arguments through multiple columns - Pandas

Hi, I currently have a function that splits values in a single cell that are delimited by a newline. However, the function below only lets me pass in one column at a time. Is there a way to pass in multiple columns, or even the whole dataframe?
A sample would be like this
A B C
1\n2\n3 2\n5 A
The code is below
def tidy_split(df, column, sep='|', keep=False):
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
It currently works when I run
df1 = tidy_split(df, 'A', '\n')
After running the function on column A only:
A B C
1 2\n5 A
2 2\n5 A
3 2\n5 A
I was hoping to pass in more than one column, in this case splitting column 'B' as well. I have tried passing in a lambda and using apply, but the function requires the positional argument 'column'. Would appreciate any help given! I was wondering whether a loop is possible.
EDIT: Desired output as each number refer to something important
Before
A B C
1\n2\n3 2\n5 A
After
A B C
1 2 A
2 5 A
3 n/a A

Input:
A B C
0 1\n2\n3 2\n5 A
Code:
import pandas as pd
# the input frame shown above
df = pd.DataFrame({'A': ['1\n2\n3'], 'B': ['2\n5'], 'C': ['A']})
cols = df.columns.tolist()
# create list in each cell by detecting '\n'
for col in cols:
    df[col] = df[col].apply(lambda x: str(x).split("\n"))
# empty dataframe to store result
dfs = pd.DataFrame()
# loop over rows to construct small dataframes
# and then accumulate each to the resulting dataframe
for ind, row in df.iterrows():
    a_vals = row['A']
    b_vals = row['B'] + ["n/a"] * (len(a_vals) - len(row['B']))
    c_vals = row['C'] + [row['C'][0]] * (len(a_vals) - len(row['C']))
    temp = pd.DataFrame({'A': a_vals, 'B': b_vals, 'C': c_vals})
    dfs = pd.concat([dfs, temp], axis=0, ignore_index=True)
Output:
A B C
0 1 2 A
1 2 5 A
2 3 n/a A
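The same idea could be generalised so the column names are not hard-coded. This is a rough sketch under stated assumptions, not a drop-in replacement: split_rows and its fill parameter are made-up names, a cell holding a single value is repeated down the new rows (as column C is above), and shorter lists are padded with the fill value. It assumes the frame has at least one row.

```python
import pandas as pd


def split_rows(df, sep="\n", fill="n/a"):
    """Split every cell on `sep` and expand each input row into several rows."""
    pieces = []
    for _, row in df.iterrows():
        parts = {col: str(val).split(sep) for col, val in row.items()}
        longest = max(len(p) for p in parts.values())
        padded = {}
        for col, p in parts.items():
            # A single value is repeated; a shorter list is padded with `fill`
            pad = p[0] if len(p) == 1 else fill
            padded[col] = p + [pad] * (longest - len(p))
        pieces.append(pd.DataFrame(padded))
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({"A": ["1\n2\n3"], "B": ["2\n5"], "C": ["A"]})
out = split_rows(df)
print(out)
```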

move column in pandas dataframe

I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.

You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
...    + ['b', 'x']]
   a  y  b   x
0  1 -1  2   3
1  2 -2  4   6
2  3 -3  6   9
3  4 -4  8  12
To make it more bullet proof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
        + [c for c in cols_at_end if c in df]]

cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want

For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
Similarly, if you instead want this column to be, e.g., the third column from the beginning:
df.insert(2, "name", column_to_move )

You can use the approach below. It's very simple, and similar to the good answer given by Charlie Haley.
df1 = df.pop('b') # remove column b and store it in df1
df2 = df.pop('x') # remove column x and store it in df2
df['b']=df1 # add b series as a 'new' column.
df['x']=df2 # add x series as a 'new' column.
Now you have your dataframe with the columns 'b' and 'x' in the end. You can see this video from OSPY : https://youtu.be/RlbO27N3Xg4

Similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
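Why this lands at the end may not be obvious. Python evaluates the arguments left to right, so len(df.columns)-1 is computed before pop() shrinks the frame, and the result equals the column count after the pop. A small sketch on the question's data (only two rows, for brevity):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 4], "x": [3, 6], "y": [-1, -2]})

for name in ["b", "x"]:
    # The position is evaluated first (4 - 1 = 3), then pop() leaves 3 columns,
    # so inserting at position 3 appends the column at the end
    df.insert(len(df.columns) - 1, name, df.pop(name))

print(df.columns.tolist())
```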

This function will reorder your columns without losing data. Any omitted columns remain in the center of the data set, in their original order:
def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # Filter with a list comprehension: set arithmetic would scramble the order
    excluded = set(first_cols) | set(last_cols) | set(drop_cols)
    middle = [c for c in columns if c not in excluded]
    new_order = list(first_cols) + middle + list(last_cols)
    return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]

Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)

An alternative, more generic method:
from pandas import DataFrame
def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    This method re-arranges the columns in a dataframe to place the desired columns at the desired index.
    ex Usage: df = move_columns(df, ['Rev'], 2)
    :param df:
    :param cols_to_move: The names of the columns to move. They must be a list
    :param new_index: The 0-based location to place the columns.
    :return: Return a dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]

You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
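One caveat worth checking on your own data: Index.difference sorts its result by default, so the remaining columns only keep their order when they are already sorted (as a and y are here). The sketch below uses a made-up frame with unsorted columns to show the difference, and an isin mask as one order-preserving alternative.

```python
import pandas as pd

# Hypothetical frame whose remaining columns (z, a) are not in sorted order
df = pd.DataFrame([[1, 2, 3, 4]], columns=["z", "b", "a", "x"])
cols_to_move = ["b", "x"]

# Index.difference sorts its result by default: z and a come back as a, z
sorted_rest = df.columns.difference(cols_to_move).tolist()

# A boolean mask keeps the remaining columns in their original order
kept = df.columns[~df.columns.isin(cols_to_move)].tolist()

df = df[kept + cols_to_move]
print(sorted_rest, kept, df.columns.tolist())
```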

You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S : The package can be found here. https://pypi.org/project/movecolumn/

You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])

Move any column to the last position of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
Move any column to the first position of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the name of the column that you want to move.

I use a Pokémon dataset as an example. The columns of my dataset are:
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cos_end= ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cos_end, start=(len(cols)-len(cos_end))):
    cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)

Appending a list or series to a pandas DataFrame as a row?

So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?

df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]

Sometimes it's easier to do all the appending outside of pandas, then, just create the DataFrame in one shot.
>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
>>> df
  col1 col2
0    a    b
1    e    f
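For the iterative case from the question, the same idea could look like this: collect the rows in a plain Python list inside the loop and build the DataFrame once at the end. The loop body and the column names are placeholders for whatever actually produces your rows.

```python
import pandas as pd

rows = []
for i in range(3):
    # Placeholder computation; each iteration yields one row as a list
    rows.append([i, i * 2, i ** 2])

# Build the DataFrame in one shot after the loop
df = pd.DataFrame(rows, columns=list("ABC"))
print(df)
```

Appending to a list is cheap, whereas growing a DataFrame row by row copies data on each step.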

Here's a simple and dumb solution:
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)

Could you do something like this?
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
Does anyone have a more elegant solution?

Following on from Mike Chirico's answer: if you want to append a list after the dataframe is already populated...
>>> new_rows = [['f','g']]
>>> df = df.append(pd.DataFrame(new_rows, columns=['col1','col2']),ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
2 f g

There are several ways to append a list to a Pandas Dataframe in Python. Let's consider the following dataframe and list:
import pandas as pd
# Dataframe
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["col1", "col2"])
# List to append
new_row = [5, 6]  # named new_row to avoid shadowing the built-in list
Option 1: append the list at the end of the dataframe with pandas.DataFrame.loc.
df.loc[len(df)] = new_row
Option 2: convert the list to dataframe and append with pandas.DataFrame.append().
df = df.append(pd.DataFrame([new_row], columns=df.columns), ignore_index=True)
Option 3: convert the list to series and append with pandas.DataFrame.append().
df = df.append(pd.Series(new_row, index = df.columns), ignore_index=True)
Each of the above options should output something like:
>>> print (df)
col1 col2
0 1 2
1 3 4
2 5 6
Reference : How to append a list as a row to a Pandas DataFrame in Python?
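Note for newer pandas versions: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append-based options fail there. A roughly equivalent sketch using pd.concat, where new_row is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["col1", "col2"])
new_row = [5, 6]

# Wrap the list in a one-row DataFrame with matching columns, then concatenate
df = pd.concat(
    [df, pd.DataFrame([new_row], columns=df.columns)],
    ignore_index=True,
)
print(df)
```

The df.loc[len(df)] = ... option still works, provided the frame has a default integer index.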

Converting the list to a data frame within the append function works, also when applied in a loop
import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame(data=[mylist]))

Here's a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you're adding then it shouldn't be an issue.
import pandas as pd
import numpy as np
def addRow(df,ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.
    :param df: <DataFrame> The original dataframe
    :param ls: <list> The new row to be added
    :return: <DataFrame> The dataframe with the newly appended row
    """
    numEl = len(ls)
    newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))
    df = df.append(newRow, ignore_index=True)
    return df

If you want to add a Series and use the Series' index as columns of the DataFrame, you only need to append the Series between brackets:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: row=pd.Series([1,2,3],["A","B","C"])
In [4]: row
Out[4]:
A 1
B 2
C 3
dtype: int64
In [5]: df.append([row],ignore_index=True)
Out[5]:
A B C
0 1 2 3
[1 rows x 3 columns]
Without ignore_index=True you don't get a proper index.

simply use loc:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6

As mentioned at https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you first need to convert the list to a Series and then append the Series to the DataFrame.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)

Consider a DataFrame A of N x 2 dimensions with a default integer index. To add one more row, use the following.
A.loc[A.shape[0]] = [3,4]

The simplest way:
my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values
Edit:
Don't forget that the length of the new list must equal the number of rows in the DataFrame.
