Is reading Excel data by column labels supported in pandas? - python

Below, 1) shows the Excel file I'm trying to read from, 2) shows what I want to do in non-legal (pseudo) code, and 3) shows what I have been trying so far.
1) Excel file
    A      B      C
1   Name1  Name2  Name3
2   33     44     55
3   23     66     77
4   22     33     99
2) Non-legal code:
frame = pd.read_excel(path, 'Sheet1', parse_cols="Name1,Name2,Name3")
In the example I can assume that the column names are unique.
3) Tried so far:
What I have been trying so far is to use parse_cols, but I don't think what I'm trying to do is supported by pandas.

Per the documentation, there is no support for what you are trying to do. You can select columns by column number or by Excel column letter, but not by header label:
parse_cols : int or list, default None
If None then parse all columns,
If int then indicates last column to be parsed
If list of ints then indicates list of column numbers to be parsed
If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)
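As a workaround, you can read the sheet and subset by label afterwards; and in newer pandas (where parse_cols was renamed usecols), a list of header labels is accepted directly. A minimal sketch, assuming the headers from the example above:

import pandas as pd

# Read the whole sheet, then select columns by their header labels.
frame = pd.read_excel(path, sheet_name='Sheet1')[['Name1', 'Name2', 'Name3']]

# Newer pandas: usecols accepts header labels directly.
# frame = pd.read_excel(path, sheet_name='Sheet1', usecols=['Name1', 'Name2', 'Name3'])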

Related

Append multiple excel sheets and create an identifier column using pandas

I am in a situation where I would like to append multiple excel sheets coming from a single workbook on top of each other and build an identifier column.
The identifier column will be built by extracting a word (within brackets in a column header), essentially creating a new column and storing that extracted word in it. Here is an example:
My excel workbook has two sheets, "Sheet1" and "Sheet2", and their headers look like this:
Sheet1:
a b c d(Connect1)
1 2 3 4
11 22 33 44
Sheet2:
a b c d(Connect2)
5 6 7 8
What I want is to append these two sheets together so that the resultant dataframe looks like the following:
identifier a b c d
Connect1 1 2 3 4
Connect1 11 22 33 44
Connect2 5 6 7 8
The idea is that the identifier should be filled in for each and every row when the sheets are appended on top of each other.
How do I achieve this?
After importing each sheet, add the identifier column to each df and concatenate them:
sheet1['identifier'] = "Connect1"
sheet2['identifier'] = "Connect2"
new = pd.concat([sheet1, sheet2], axis=0)
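If the identifier really must be pulled out of the bracketed word in the header rather than hardcoded, something along these lines should work. This is a sketch under my assumptions about the header format; the file name workbook.xlsx and the regex are mine:

import re
import pandas as pd

frames = []
for sheet in ['Sheet1', 'Sheet2']:
    df = pd.read_excel('workbook.xlsx', sheet_name=sheet)  # hypothetical file name
    # Find the header of the form "d(Connect1)" and pull out the bracketed word.
    bracketed = [c for c in df.columns if '(' in c][0]
    identifier = re.search(r'\((.*?)\)', bracketed).group(1)
    df = df.rename(columns={bracketed: bracketed.split('(')[0]})  # "d(Connect1)" -> "d"
    df['identifier'] = identifier
    frames.append(df)

new = pd.concat(frames, ignore_index=True)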

How to delete rows from a csv file?

I was able to pull the rows that I would like to delete from a CSV file, but I can't make the drop() function work.
from glob import iglob
import pandas as pd

data = pd.read_csv(next(iglob('*.csv')))
data_top = data.head()
data_top = data_top.drop(axis=0)
What needs to be added?
Example of a CSV file. It should delete everything until it reaches the row containing Employee.
  creation date  Unnamed: 1      Unnamed: 2
0 NaN            type of client  NaN
1 age            NaN             NaN
2 NaN            birth date      NaN
3 NaN            NaN             days off
4 Employee       Salary          External
5 Dan            130e            yes
6 Abraham        10e             no
7 Richmond       201e            third-party
If it is just the top 5 rows you want to delete, then you can do it as follows:
data = pd.read_csv(next(iglob('*.csv')))
data.drop([0,1,2,3,4], axis=0, inplace=True)
Along with axis, you also pass the labels to drop: a single label or a list (row index labels for axis=0, column names for axis=1).
There are, of course, many other ways to achieve this too. Especially if the case is that the index of rows you want to delete is not just the top 5.
edit: inplace added as pointed out in comments.
Considering the comments and further explanations, and assuming you know the name of the column and have a positional index, you can try the following:
data = pd.read_csv(next(iglob('*.csv')))
row = data[data['creation date'] == 'Employee']
n = row.index[0]
data.drop(labels=list(range(n)), inplace=True)
The main goal is to find the index of the row that contains the value 'Employee'. To achieve that, assuming there are no other rows that contain that word, you can filter the dataframe to match the value in question in the specific column.
After that, you extract the index value, which you will use to create the list of labels (given a positional index) that you will drop from the dataframe, as @MAK7 stated in his answer.
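The same idea in a compact, self-contained form; a sketch assuming a default RangeIndex and that the marker row is present, with the column and marker names taken from the question:

from glob import iglob
import pandas as pd

data = pd.read_csv(next(iglob('*.csv')))
# Index of the first row whose 'creation date' cell equals 'Employee' ...
n = data['creation date'].eq('Employee').idxmax()
# ... then keep that row and everything after it.
data = data.iloc[n:]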

Pandas Merge/Join

I have a dataframe called Bob with Columns = [A,B] and A has only unique values like a serial ID. Shape is (100,2)
I have another dataframe called Anna with Columns [C,D,E,F], where C has the same values as A in Bob, but there are duplicates. Column D is a category (phone/laptop/ipad) that is determined by the serial ID found in C. Shape of Anna is (500,4).
Example rows in Anna:
C     D       E   F
K103  phone   12  17
K103  phone   14  23
G221  laptop  25  6
I want to create a new dataframe that has columns A, B, D by looking up the value of A in Anna['C']. The final dataframe should have shape (100,3).
I'm finding this difficult with pd.merge (I tried left/inner/right joins) because it keeps creating two rows in the new dataframe with the same values, i.e. K103 will show up twice in the new dataframe.
Tell me if this works; I'm thinking of this while typing it, so I couldn't actually check.
df = Bob.merge(Anna[['C','D']].drop_duplicates(keep='last'), how='left', left_on='A', right_on='C')
Let me know if it doesn't work, I'll create a sample dataset and edit it with the correct code.
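A small self-contained version of that idea with dummy data. I've added subset='C' so deduplication keys on the serial ID alone (since D is determined by C the result is the same here, but it guards against stray differences in D), and dropped the redundant C column after the merge:

import pandas as pd

Bob = pd.DataFrame({'A': ['K103', 'G221'], 'B': [1, 2]})
Anna = pd.DataFrame({'C': ['K103', 'K103', 'G221'],
                     'D': ['phone', 'phone', 'laptop'],
                     'E': [12, 14, 25],
                     'F': [17, 23, 6]})

# One row per serial ID, so the left join cannot multiply rows in Bob.
lookup = Anna[['C', 'D']].drop_duplicates(subset='C', keep='last')
df = Bob.merge(lookup, how='left', left_on='A', right_on='C').drop(columns='C')
print(df)  # columns A, B, D; one row per value of A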

Pandas CSV output only the data in a certain row (to_csv)

I need to output only a particular row from a pandas dataframe to a CSV file. In other words, the output needs to have only the data in row X, in a single line separated by commas, and nothing else. The problem I am running into with to_csv is that I cannot find a way to output just the data; I always receive an extra line with the default numeric column names.
data.to_csv(filename, index=False)
gives
0,1,2,3,4,5
X,Y,Z,A,B,C
The first line is just the default integer column names (the header), not the data. I need just the data. Is there any way to do this simply, or do I need to break out of pandas and manipulate the data further in python?
Note: the preceding example has only 1 row of data, but it would be nice to have the syntax for choosing row too.
You can try this:
df = pd.DataFrame({'A': ['a','b','c','d','e','f'], 'B': [1,2,3,4,5,6]})
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
You can select the row you want; in this case, I select the row at index 1:
df.iloc[1:2].to_csv('test.csv', index=False, header=False)
The output to the csv file looks like this (make sure you use header=False):
b,2
You can use this
data.to_csv(filename, index=False, header=False)
where the header parameter means:
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
You can find more specific info in pandas.DataFrame.to_csv.
It seems like you are looking to filter data from the existing dataframe and write it to a .csv file.
For that you need to filter your data first, then apply the to_csv command.
Here is the command:
df[df.index.isin([3,4])]
if this is your data
>>> df
A B
0 X 1
1 Y 2
2 Z 3
3 A 4
4 B 5
5 C 6
Then this would be your expected filtered content, and you can apply to_csv on top of it:
>>> df[df.index.isin([3,4])]
A B
3 A 4
4 B 5
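Putting the two steps together (the file name filtered.csv is just an example), the row selection feeds straight into to_csv with header=False so no column line is written:

import pandas as pd

df = pd.DataFrame({'A': ['X', 'Y', 'Z', 'A', 'B', 'C'], 'B': [1, 2, 3, 4, 5, 6]})

# Keep only rows 3 and 4, then write them without the header line.
df[df.index.isin([3, 4])].to_csv('filtered.csv', index=False, header=False)
# filtered.csv now contains exactly:
# A,4
# B,5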

Pandas: Get top 10 values AFTER grouping

I have a pandas data frame with a column 'id' and a column 'value'. It is already sorted by first id (ascending) and then value (descending). What I need is the top 10 values per id.
I assumed that something like the following would work, but it doesn't:
df.groupby("id", as_index=False).aggregate(lambda (index,rows) : rows.iloc[:10])
What I get is just a list of ids, the value column (and other columns that I omitted for the question) aren't there anymore.
Any ideas how it might be done, without iterating through each of the single rows and appending the first ten to another data structure?
Is this what you're looking for?
df.groupby('id').head(10)
I would like to answer this by giving an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a','a','b','c','a','c','b'],
                            [4,6,1,8,9,4,1],
                            [12,11,7,1,5,5,7],
                            [123,54,146,96,10,114,200]]).T,
                  columns=['item','date','hour','value'])
df['value'] = pd.to_numeric(df['value'])
This gives you the dataframe:
  item date hour  value
0    a    4   12    123
1    a    6   11     54
2    b    1    7    146
3    c    8    1     96
4    a    9    5     10
5    c    4    5    114
6    b    1    7    200
Now the frame is grouped by item below, and the first 2 values of each group are displayed:
df.groupby(['item'])['value'].head(2)
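The question says the frame is already sorted; if it were not, sorting first and then taking head gives the top values per group. A sketch with hypothetical data using the question's 'id' and 'value' columns:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'value': [5, 9, 7, 3, 8]})

# Sort by id ascending, value descending, then take the first rows per group.
top = (df.sort_values(['id', 'value'], ascending=[True, False])
         .groupby('id')
         .head(2))  # head(10) for the top 10 per id
print(top)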
