Remove column index from dataframe - python

I extracted multiple dataframes from an Excel sheet by passing coordinates (start & end). I used the function below to extract the data according to those coordinates, but when I convert the result into a dataframe, I'm not sure where the integer index that shows up as the columns is coming from. I want to remove these index columns and make the second row the column headers. This is my dataframe:
           0    1    2    3    4    5    6
   Cols/Rows    A   A2    B   B2    C   C2
0          A   50   50  150  150  200  200
1          B  200  200  250  300  300  300
2          C  350  500  400  400  450  450
def extract_dataframes(sheet):
    ws = sheet['pivots']
    cordinates = [('A1', 'M8'), ('A10', 'Q17'), ('A19', 'M34'), ('A36', 'Q51')]
    multi_dfs_list = []
    for i in cordinates:
        # collect the cell values of one rectangular range as a list of rows
        data_rows = []
        for row in ws[i[0]:i[1]]:
            data_cols = []
            for cell in row:
                data_cols.append(cell.value)
            data_rows.append(data_cols)
        multi_dfs_list.append(data_rows)
    # one dataframe per extracted range, keyed by its position
    multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
    return multi_dfs
I tried to delete the index, but it isn't working.
Note: when I run

>>> multi_dfs[0].columns  # first dataframe
RangeIndex(start=0, stop=13, step=1)

Change

multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}

to

multi_dfs = {i: pd.DataFrame(df[1:], columns=df[0]) for i, df in enumerate(multi_dfs_list)}
From the docs:

columns : Index or array-like
    Column labels to use for the resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

I think you need:

df = pd.read_excel(file, skiprows=1)
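A related option, assuming the labels really sit on the second row of the sheet (file is the path from the answer above): pandas' header parameter both skips the rows above and promotes that row to the column index in one step.

df = pd.read_excel(file, header=1)  # 0-based: row 1 becomes the header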

Related

How to split dataframe by specific string in rows

I have a dataframe like this:

df = pd.DataFrame({"a": ["x1", 12, 14, "x2", 32, 9]})
df
Out[10]:
    a
0  x1
1  12
2  14
3  x2
4  32
5   9
I would like to split it into multiple dataframes (in this case, two) whenever a row begins with "x", and that row should then become the column name. Maybe split the dataframe and put the pieces inside a dictionary?
The output should look like this:

x1
Out[12]:
   x1
0  12
1  14

x2
Out[13]:
   x2
0  32
1   9
Could anyone help me?
You can try cumsum on str.startswith, then groupby on that:

for k, d in df.groupby(df['a'].str.startswith('x').fillna(0).cumsum()):
    # manipulate the data to get the desired output
    sub_df = pd.DataFrame(d.iloc[1:].to_numpy(), columns=d.iloc[0].to_numpy())
    # do something with it
    print(sub_df)
    print('-' * 10)
Output:

   x1
0  12
1  14
----------
   x2
0  32
1   9
----------
Something like this should work (note the double brackets in df.loc, so we slice a one-column DataFrame rather than a Series, which has no .columns):

import pandas as pd

df = pd.DataFrame({"a": ["x1", 12, 14, "x2", 32, 9]})

## Get the row indices of values starting with x
ixs = []
for j in df.index:
    if isinstance(df.loc[j, 'a'], str):
        if df.loc[j, 'a'].startswith('x'):
            ixs.append(j)

dicto = {}
for i, val in enumerate(ixs):
    start_ix = ixs[i]
    if i == len(ixs) - 1:
        end_ix = df.index[-1]
    else:
        end_ix = ixs[i + 1] - 1
    new_df = df.loc[start_ix:end_ix, ['a']].reset_index(drop=True)
    new_df.columns = new_df.iloc[0]          # the x row becomes the header
    new_df.drop(new_df.index[0], inplace=True)
    dicto[i] = new_df
A groupby is like a dictionary, so we can explicitly make it one:

dfs = {f'x{k}': d for k, d in df.groupby(df['a'].str.startswith('x').fillna(False).cumsum())}

for k in dfs:
    dfs[k].columns = dfs[k].iloc[0].values  # Make the x row the header.
    dfs[k] = dfs[k].iloc[1:]                # Drop the x row.
    print(dfs[k], '\n')
Output:

   x1
1  12
2  14

   x2
4  32
5   9
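If you also want the 0-based index shown in the asker's desired output, a small tweak of my own (not part of the original answer) resets the index when dropping the x row:

dfs[k] = dfs[k].iloc[1:].reset_index(drop=True)  # renumber rows from 0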

Pandas loop into variables adding suffix and transforming original column

I would like to loop over some variable names and the equivalent column with an added suffix "_plus".
# original dataset
raw_data = {'time': [2, 1, 4, 2],
            'zone': [5, 1, 3, 0],
            'time_plus': [5, 6, 2, 3],
            'zone_plus': [0, 9, 6, 5]}
df = pd.DataFrame(raw_data, columns=['time', 'zone', 'time_plus', 'zone_plus'])
df

# desired dataset
df['time'] = df['time'] * df['time_plus']
df['zone'] = df['zone'] * df['zone_plus']
df
I would like to do the multiplication in a more elegant way, through a loop, since I have many variables with this pattern: original name * transformed variable with the _plus suffix. Something similar to this, or better:

my_list = ['time', 'zone']
for i in my_list:
    df[i] = df[i] * df[i + "_plus"]
Try:

for c in df.filter(regex=r".*(?<!_plus)$", axis=1):
    df[c] *= df[c + "_plus"]

print(df)

The negative lookbehind (?<!_plus)$ keeps only the columns whose names do not end with _plus.
Prints:

   time  zone  time_plus  zone_plus
0    10     0          5          0
1     6     9          6          9
2     8    18          2          6
3     6     0          3          5
Or:

for c in df.columns:
    if not c.endswith("_plus"):
        df[c] *= df[c + "_plus"]
raw_data = {'time': [2, 1, 4, 2],
            'zone': [5, 1, 3, 0],
            'time_plus': [5, 6, 2, 3],
            'zone_plus': [0, 9, 6, 5]}
df = pd.DataFrame(raw_data, columns=['time', 'zone', 'time_plus', 'zone_plus'])

# Take every column that doesn't have a "_plus" suffix
cols = [i for i in list(df.columns) if "_plus" not in i]

# Calculate the new columns
for col in cols:
    df[col + "_2"] = df[col] * df[col + "_plus"]

I decided to create the new columns with a "_2" suffix; this way we don't mess up the original data.
for c in df.columns:
    if f"{c}_plus" in df.columns:
        df[c] *= df[f"{c}_plus"]

Get row numbers from duplicate rows

I need to read an Excel file and highlight duplicate rows, without editing the file or adding new columns/rows. I read the Excel file with:

df = pd.read_excel(path2, sheet_name='Sheet1')

and with

df.drop_duplicates(subset=df.columns.difference(['Mark 4']))

I get all duplicate rows, excluding 'Mark 4'. The problem is that I can't extract those row numbers to use them with

df.style.applymap(color_negative_red)

to highlight those rows in Excel, since they are not included in the df. I've tried

dfToList = redovi['unique_row_to_index'].tolist()

but since there's no unique row I can't extract the data.
The output of df.drop_duplicates(subset=df.columns.difference(['Mark 4'])) is:

    Type1 Type2
0       w     A
11      w     A
12      w     A
18      w     A
19      w     A
20      w     A

[6 rows x 170 columns]

I need to extract those row numbers, which are not part of the Excel columns, and use them as a list for future formatting.
You can use a custom function with DataFrame.duplicated and keep=False to build a mask of the rows that are duplicated over the specified column names:

df = pd.DataFrame({'Type1': ['w'] * 3 + ['a'],
                   'Type2': ['A'] * 3 + ['b'],
                   'Mark 4': range(4)})
print(df)

  Type1 Type2  Mark 4
0     w     A       0
1     w     A       1
2     w     A       2
3     a     b       3
Test:

print(df.duplicated(subset=df.columns.difference(['Mark 4']), keep=False))
0     True
1     True
2     True
3    False
dtype: bool
def highlight(x):
    c = 'background-color: red'
    # start with an empty style for every cell
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # mask of duplicated rows, ignoring 'Mark 4'
    m = x.duplicated(subset=x.columns.difference(['Mark 4']), keep=False)
    # apply the red background to the masked rows
    df1 = df1.mask(m, c)
    return df1

df.style.apply(highlight, axis=None)
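Since the end goal is a highlighted Excel file, note that the styled result can be written straight back out; a minimal sketch, assuming openpyxl is installed ('out.xlsx' is a placeholder file name, not from the question):

df.style.apply(highlight, axis=None).to_excel('out.xlsx', engine='openpyxl', index=False)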

Find all duplicate columns in a collection of data frames

Given a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input is three data frames df1, df2 and df3:

df1 = pd.DataFrame({'a': [1, 5], 'b': [3, 9], 'e': [0, 7]})

   a  b  e
0  1  3  0
1  5  9  7

df2 = pd.DataFrame({'d': [2, 3], 'e': [0, 7], 'f': [2, 1]})

   d  e  f
0  2  0  2
1  3  7  1

df3 = pd.DataFrame({'b': [3, 9], 'c': [8, 2], 'e': [0, 7]})

   b  c  e
0  3  8  0
1  9  2  7

The output is the list ['b', 'e'].
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
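The question asks for a list, and either result converts directly; a one-liner of my own:

res = list(res)  # e.g. ['e', 'b']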
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns of a dataframe, so we can use map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
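For the three example frames this gives the same pair as above (Counter keeps insertion order, so the order here follows the frames):

print(res)  # ['b', 'e']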
Here is my code for this problem, comparing only two data frames without concatenating them:
def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1': df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in df1
    for x in range(df1.shape[1]):
        # Select the column at the xth index.
        col = df1.iloc[:, x]
        # Iterate over all the columns in df2
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select the column at the yth index.
            otherCol = df2.iloc[:, y]
            # Check if the two columns are equal
            if col.equals(otherCol):
                duplicateColumnNames.append(df2.columns.values[y])
        df_compare.loc[df_compare["df1"] == df1.columns.values[x], "df2"] = str(duplicateColumnNames)
    return df_compare
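A quick usage sketch with df1 and df2 from the question above; note the function compares column values, not names, so it flags columns whose contents match exactly:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [3, 9], 'e': [0, 7]})
df2 = pd.DataFrame({'d': [2, 3], 'e': [0, 7], 'f': [2, 1]})
print(getDuplicateColumns(df1, df2))
#   df1    df2
# 0   a     []
# 1   b     []
# 2   e  ['e']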

Moving columns down and replicating keys in pandas

I have the following dataframe:

ID     first  mes1.1  mes1.2  ...  mes1.10  mes2.[1-10]   mes3.[1-10]
123df  John   5.5     130     ...  45       [12,312,...]  [123,346,53]
...

where I have abbreviated columns using [] notation. So this dataframe has 31 columns: first, mes1.[1-10], mes2.[1-10], and mes3.[1-10]. Each row is keyed by a unique index: ID.
I would like to form a new table where I've replicated all key column values (represented here by ID and first) and moved the mes2 and mes3 columns (20 of them) "down", giving me something like this:

ID     first  mes1  mes2  ...  mes10
123df  John   5.5   130   ...  45
123df  John   341   543   ...  53
123df  John   123   560   ...  567
...
# How I set up your dataframe (please include a reproducible df next time)
df = pd.DataFrame(np.random.rand(6, 31),
                  index=["ID" + str(i) for i in range(6)],
                  columns=['first'] + ['mes{0}.{1}'.format(i, j) for i in range(1, 4) for j in range(1, 11)])
df['first'] = 'john'

Then there are two ways to do this.

# Generate the new underlying array
first = np.repeat(df['first'].values, 3)[:, np.newaxis]
new_vals = df.values[:, 1:].reshape(18, 10)
new_vals = np.hstack((first, new_vals))

# Create the new df
m = pd.MultiIndex.from_product((df.index, range(1, 4)), names=['ID', 'MesNum'])
pd.DataFrame(new_vals, index=m, columns=['first'] + list(range(1, 11)))

or using only Pandas:

df.columns = ['first'] + list(range(1, 11)) * 3
pieces = [df.iloc[:, i:i + 10] for i in range(1, 31, 10)]
df2 = pd.concat(pieces, keys=['first', 'second', 'third'])
# sortlevel() has been removed from pandas; sort_index(level=...) is the modern equivalent
df2 = df2.swaplevel(1, 0).sort_index(level=0)
df2.insert(0, 'first', df['first'].repeat(3).values)
