Append multiple Excel sheets and create an identifier column using pandas - python

I am in a situation where I would like to append multiple Excel sheets from a single workbook on top of each other and build an identifier column.
The identifier column will be built by extracting a word (the one within brackets in a column header), essentially creating a new column and storing that extracted information in it. Here is an example:
My Excel workbook has two sheets, "Sheet1" and "Sheet2", and their headers look like this:
Sheet1:
a b c d(Connect1)
1 2 3 4
11 22 33 44
Sheet2:
a b c d(Connect2)
5 6 7 8
What I want is to append these two sheets together so that the resultant dataframe looks like the following:
identifier a b c d
Connect1 1 2 3 4
Connect1 11 22 33 44
Connect2 5 6 7 8
The idea is that the identifier should be placed corresponding to each and every row when we are appending the sheets on top of each other.
How do I achieve this?

After importing each sheet, add the identifier column to each df and concatenate them:
sheet1['identifier'] = "Connect1"
sheet2['identifier'] = "Connect2"
new = pd.concat([sheet1, sheet2], axis=0)
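The answer above hardcodes the identifiers; the extraction from the bracketed header can be generalized. A minimal sketch of that idea follows, assuming the sheets come from `pd.read_excel('workbook.xlsx', sheet_name=None)` (the path is hypothetical; the dict below is built inline so the sketch is self-contained and mirrors what that call returns):

```python
import re
import pandas as pd

# Stand-in for: sheets = pd.read_excel('workbook.xlsx', sheet_name=None)
sheets = {
    'Sheet1': pd.DataFrame({'a': [1, 11], 'b': [2, 22],
                            'c': [3, 33], 'd(Connect1)': [4, 44]}),
    'Sheet2': pd.DataFrame({'a': [5], 'b': [6],
                            'c': [7], 'd(Connect2)': [8]}),
}

frames = []
for name, df in sheets.items():
    df = df.copy()
    for col in df.columns:
        # Look for a bracketed word at the end of the header, e.g. "d(Connect1)"
        m = re.search(r'\((.+?)\)$', col)
        if m:
            df['identifier'] = m.group(1)                    # extracted word per row
            df = df.rename(columns={col: col[:m.start()]})   # "d(Connect1)" -> "d"
    frames.append(df)

new = pd.concat(frames, ignore_index=True)[['identifier', 'a', 'b', 'c', 'd']]
```

This works regardless of how many sheets the workbook has, as long as exactly one column per sheet carries the bracketed identifier.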

Related

using set() with pandas

Can set() be used to read the data in a specific column in pandas? For example, I have the following output from a DataFrame df1:
df1= [
0 -10 2 5
1 24 5 10
2 30 3 6
3 30 2 1
4 30 4 5
]
where the first column is the index. I tried first to isolate the second column
[-10
24
30
30
30]
using the following: x = pd.DataFrame(df1, columns=[0]). Then I transposed the column using XX = x.T, and then I used the set() function.
However, instead of obtaining [-10 24 30], I got [0 1 2 3 4].
So set() read the index instead of reading the first column.
set() takes an iterable.
Using a pandas DataFrame as an iterable yields the column names in turn.
Since you've transposed the dataframe, your index values are now column names, so when you use the transposed dataframe as an iterable you get those index values.
If you want to use set() to get the values in the column, you can use:
x = pd.DataFrame(df1, columns=[0])
set(x.iloc[:, 0].values)
But if you just want the unique values in column 0, you can use
df1[0].unique()
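A short sketch of the behaviors described above, using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame([[-10, 2, 5], [24, 5, 10], [30, 3, 6], [30, 2, 1], [30, 4, 5]])

# Iterating a DataFrame yields its column labels, so set() sees 0, 1, 2:
print(set(df1))        # {0, 1, 2}

# After transposing, the old index values become column labels:
print(set(df1.T))      # {0, 1, 2, 3, 4} - exactly what the asker observed

# A Series iterates over its values, so pass the column itself:
print(set(df1[0]))     # {-10, 24, 30}

# Or, without set(), the unique values in order of appearance:
print(df1[0].unique())
```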

Python Pandas Dynamically Read Excel Sheet with Multiple Header Rows of Different Column Size

I have an Excel sheet that I am trying to read into a dataframe. The sheet has multiple header rows, each of which can have a varying number of columns. Some of the columns are similar, but not always. Is there a way I can split the rows into separate dataframes?
The data for example would be:
A B C D
1 1 1 1
2 2 2 2
A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
A B C
1 1 1
The ideal output would be three separate dataframes with their respective rows and column headers.
.read_excel has header, skiprows and skipfooter arguments that let you do this, provided that you can detect or know ahead of time what row each header is on. With these and usecols you can define any "window" on the sheet as your df. Combining multiple windows can then be accomplished with concat, merge, append, and join, per usual.
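The windowing idea can be sketched as follows. Since read_csv accepts the same header, skiprows, and nrows arguments as read_excel, the example runs on plain CSV text so it needs no Excel file; the (skiprows, nrows) pairs are worked out from the example layout, and in practice you would first detect which rows are headers (e.g. by scanning for rows starting with "A"):

```python
import io
import pandas as pd

# Plain-text stand-in for the sheet in the question.
raw = """A,B,C,D
1,1,1,1
2,2,2,2
A,B,C,D,E
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
A,B,C
1,1,1
"""

# Each (rows_to_skip, rows_of_data) pair defines one "window".
windows = [(0, 2), (3, 3), (7, 1)]

frames = [
    pd.read_csv(io.StringIO(raw), skiprows=start, nrows=n)
    for start, n in windows
]
```

Each entry in `frames` is one of the three dataframes, with its own header and row count.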

How to replace dataframes column value for all the csv files in a folder by other dataframe column value?

I have the following dataframe sheet1
Id Snack Price
5 Orange 55
7 Apple 53
8 Muskmelon 33
I have the other dataframe replace
Snack Cat
Orange a
Apple b
Muskmelon c
For replacing column value with other column value this is the code
sheet1['Snack'] = sheet1['Snack'].map(replace.set_index('Snack')['Cat'])
So I will get this after the above code.
Id Snack Price
5 a 55
7 b 53
8 c 33
How do I do the same operation for all the csv sheets present in the folder.
Input: https://www.dropbox.com/sh/1mbgjtrr6t069w1/AADC3ZrRZf33QBil63m1mxz_a?dl=0
Output: Replace Snack column sheet values with replace dataframe cat values for all the files in a folder.
I believe you need glob for the list of files, then loop: read each file into a DataFrame, map, and finally save back:
import glob
import pandas as pd

s = replace.set_index('Snack')['Cat']
for file in glob.glob('files/*.csv'):
    df = pd.read_csv(file)
    df['Snack'] = df['Snack'].map(s)
    df.to_csv(file, index=False)
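For completeness, here is a self-contained version of the same loop that builds a throwaway folder with one sample csv (the folder and filename are made up for the demonstration), replaces the Snack values in place, and writes the file back:

```python
import glob
import os
import tempfile
import pandas as pd

# Build a temporary folder with one sample csv so the loop can run end to end.
folder = tempfile.mkdtemp()
pd.DataFrame({'Id': [5, 7, 8],
              'Snack': ['Orange', 'Apple', 'Muskmelon'],
              'Price': [55, 53, 33]}).to_csv(os.path.join(folder, 'sheet1.csv'),
                                             index=False)

replace = pd.DataFrame({'Snack': ['Orange', 'Apple', 'Muskmelon'],
                        'Cat': ['a', 'b', 'c']})
s = replace.set_index('Snack')['Cat']

for file in glob.glob(os.path.join(folder, '*.csv')):
    df = pd.read_csv(file)              # read each csv in turn
    df['Snack'] = df['Snack'].map(s)    # replace Snack with its Cat value
    df.to_csv(file, index=False)        # write back in place
```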

Pandas CSV output only the data in a certain row (to_csv)

I need to output only a particular row from a pandas dataframe to a CSV file. In other words, the output needs to have only the data in row X, in a single line separated by commas, and nothing else. The problem I am running into with to_csv is that I cannot find a way to write just the data; I always receive an extra line with a column count.
data.to_csv(filename, index=False)
gives
0,1,2,3,4,5
X,Y,Z,A,B,C
The first line is just the dataframe's default column labels, not the data. I need just the data. Is there any way to do this simply, or do I need to break out of pandas and manipulate the data further in python?
Note: the preceding example has only 1 row of data, but it would be nice to have the syntax for choosing the row too.
You can try this:
df = pd.DataFrame({'A': ['a','b','c','d','e','f'], 'B': [1,2,3,4,5,6]})
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
You can select the row you want, in this case, I select the row at index 1:
df.iloc[1:2].to_csv('test.csv', index=False, header=False)
The output to the csv file looks like this (make sure you use header=False):
b,2
You can use this
data.to_csv(filename, index=False, header=False)
The header parameter means:
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
You can find more specific info in the pandas.DataFrame.to_csv documentation.
It seems like you want to filter rows from the existing dataframe and write them to a .csv file.
For that, filter your data first, then apply the to_csv command.
Here is the command:
df[df.index.isin([3,4])]
if this is your data
>>> df
A B
0 X 1
1 Y 2
2 Z 3
3 A 4
4 B 5
5 C 6
Then this would be your expected filtered content, and you can apply to_csv on top of it.
>>> df[df.index.isin([3,4])]
A B
3 A 4
4 B 5
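Putting the two answers together as one sketch: select the row with iloc (or an index filter) and write it with header=False. A convenient way to check the result is that to_csv with no path returns the text instead of writing a file:

```python
import pandas as pd

df = pd.DataFrame({'A': ['X', 'Y', 'Z', 'A', 'B', 'C'], 'B': [1, 2, 3, 4, 5, 6]})

# Select row 1 as a one-row dataframe and emit only its data;
# with no path argument, to_csv returns the CSV text.
line = df.iloc[[1]].to_csv(index=False, header=False)
print(line)   # "Y,2"
```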

Is reading Excel data by column labels supported in pandas?

Below are: 1) the Excel file I'm trying to read from, 2) what I want to do in non-legal code, and 3) what I have been trying so far.
1) Excel file
A | B | C
1 Name1 Name2 Name3
2 33 44 55
3 23 66 77
4 22 33 99
2) Non-legal code:
frame = pd.read_excel(path, 'Sheet1', parse_cols="Name1,Name2,Name3")
In the example I can assume that the column names are unique.
3) Tried so far:
What I have been trying so far is to use parse_cols, but I don't think what I'm trying to do is supported by pandas.
Per the documentation, there is no support for what you are trying to do. parse_cols can select columns by number or by Excel column letter, but not by header label:
parse_cols : int or list, default None
If None then parse all columns,
If int then indicates last column to be parsed
If list of ints then indicates list of column numbers to be parsed
If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)
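A note for newer pandas versions: parse_cols was replaced by usecols, and since pandas 0.24 read_excel's usecols does accept a list of header labels, e.g. pd.read_excel(path, usecols=['Name1', 'Name3']). Since read_csv takes the same usecols argument, the idea can be sketched without needing an Excel engine:

```python
import io
import pandas as pd

# CSV stand-in for the spreadsheet in the question.
raw = "Name1,Name2,Name3\n33,44,55\n23,66,77\n22,33,99\n"

# Selecting columns by their header labels; read_excel's usecols
# accepts the same list-of-names form in modern pandas.
frame = pd.read_csv(io.StringIO(raw), usecols=['Name1', 'Name3'])
```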
