Non-standard Excel to pandas data frame - python

I've loaded a non-standard Excel table with the help of openpyxl and done part of the work of converting it to a pandas dataframe. But now I'm stuck with this problem.
I want to select just a range of rows and columns and get the data from them, e.g. take the cells from row 4 to row 12 and from column J to column X. I hope you understand me.
Sorry for my English.

You can try something like this:
import pandas as pd
df = pd.read_excel('data.xlsx', skiprows=4, usecols='J:X', nrows=9)
If the number of rows is not fixed, you can use your second column as a delimiter and keep only the rows where it has a value:
df = pd.read_excel('data.xlsx', skiprows=4, usecols='J:X')
df = df[df.iloc[:, 1].notna()]
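Alternatively, since the table is already loaded with openpyxl, you can slice the cell range directly and hand it to pandas (a minimal sketch, assuming the workbook is data.xlsx and the sheet of interest is the active one):
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')
ws = wb.active
# rows 4 to 12, columns J to X (J is column 10, X is column 24)
rows = ws.iter_rows(min_row=4, max_row=12, min_col=10, max_col=24, values_only=True)
df = pd.DataFrame(list(rows))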

You could skip the rows as you read the Excel file into a DataFrame, initially dropping the first 4 rows, and then manipulate the DataFrame as follows.
The first line reads the file while skipping the first 4 rows.
The second line drops a range of rows from the dataframe (startRow and endRow being the integer index labels of the first and last row to remove).
The third line drops 2 columns from the dataframe.
df = pd.read_excel('fileName.xlsx', skiprows=4)
df.drop(range(startRow, endRow + 1), inplace=True)
df = df.drop(['column1', 'column2'], axis=1)

Related

moving a row next to another row in a pandas data frame

I am trying to reshape a data frame from 2 rows into 1 row, but I am encountering some issues. Do you have any idea how to do that? Here is the code and df:
Thanks!
If you are looking to convert two rows into one, you can do the following.
Stack the dataframe and reset the index at level=1, which converts the data and the column labels into a long stack: each column header ends up as a value in a column called level_1, and the data ends up in another column called 0.
Then set level_1 as the index, which moves the column names into the index.
Remove the index name (level_1), then transpose the dataframe.
The code is shown below.
df3=df3.stack().reset_index(level=1).set_index('level_1')
df3.index.name = None
df3=df3.T
Output
df3
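As a self-contained illustration (a sketch with a made-up two-row df3, since the original data is not shown in the question), the same chain turns two rows into a single row with repeated column labels:
import pandas as pd

df3 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # hypothetical two-row frame
df3 = df3.stack().reset_index(level=1).set_index('level_1')
df3.index.name = None
df3 = df3.T
print(df3)   # one row, columns a, b, a, b with values 1, 3, 2, 4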

How to convert the header row to a normal row in pandas

I have an Excel sheet where I skipped multiple rows and finally arrived at a dataframe with a little structure. But I have a dataframe which looks like this (bold are headers).
There are some columns on top which I hid in this screenshot as well. When reading a dataframe from Excel while skipping rows, there is multi-level indexing.
I wanted the numbers in the header to come in as a row. Please advise how to achieve this.
Thank you in advance
You can skip the header with header=None if you use .read_csv:
df = pd.read_csv(file_path, header=None, usecols=[3,6])
The following will add your current column labels as the last row in the dataframe. You could then move this row into position 0, or rename the columns, if necessary. (Note that DataFrame.append was removed in pandas 2.0; with newer versions use pd.concat instead.)
row = pd.Series(df.columns, index=df.columns)
df = df.append(row, ignore_index=True)
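If the header row should end up on top rather than at the bottom, a concat-based alternative (a minimal sketch, assuming df is the frame from the question) is:
import pandas as pd

row = pd.Series(df.columns, index=df.columns)                # current header as a one-row Series
df = pd.concat([row.to_frame().T, df], ignore_index=True)    # header row moved to position 0
df.columns = range(df.shape[1])                              # optional: replace header with positional labels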

python - append only select columns as rows

The original file has multiple columns, but there are lots of blanks, and I want to rearrange it so that there is one nice column with the info. Starting with 910 rows, 51 cols (the newFile df) -> want 910+x rows, 3 cols (the final df). The final df has 910 rows.
newFile sample
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3 + j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3 + j]], ignore_index=True)
I have this piece of code that goes through newFile and, if column 3+j is not null, copies columns 0, 1 and 3+j to a new row. I tried append(), but it adds not only rows but also a bunch of columns with NaNs again (like the original file).
Any suggestions?
Your problem is that you are using a DataFrame and keeping the column names, so appending a row that only fills some columns leaves NaN in all the remaining columns.
On top of that, your code is quite inefficient because of the double for loop.
Here is my solution using melt():
import numpy as np
import pandas as pd

# creating an example df
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)), columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))
# reshaping df to long format, keeping the first two columns as identifiers
df = df.melt(id_vars=df.columns[0:2])
# dropping the rows whose value is null
df.dropna(subset=['value'], inplace=True)
# if you want to keep the information about which column the value came from, stop here; otherwise drop it
df.drop(['variable'], axis=1, inplace=True)
print(df)
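Applied directly to the question's newFile (a sketch; columns are referenced by position to mirror the 0, 1 and 3+j pattern of the loop above), the same approach becomes:
final = newFile.melt(id_vars=newFile.columns[:2], value_vars=newFile.columns[3:])
final = final.dropna(subset=['value']).drop(columns='variable')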

Use multiple rows as column header for pandas

I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
    "./Data.xlsx",
    sheet_name="Customer Care",
    header=[0, 1, 2]
)
This will tell pandas to read the first three rows of the Excel file as MultiIndex column labels.
If you want to modify the rows after loading them, set them as the columns afterwards:
# set the first three rows as the columns
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# delete the first three rows (because they are now also the columns)
df = df.iloc[3:]
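If the resulting three-level MultiIndex is awkward to work with downstream, one option (a sketch, assuming the header cells are plain text; blank and NaN levels are skipped) is to flatten the levels into single column names:
# join the three header levels into one label per column
df.columns = ['_'.join(str(level) for level in col if str(level) not in ('nan', '')) for col in df.columns]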

Adding a new column to a pandas dataframe

I have a dataframe df with one column and 500k rows (the df with its first 5 elements is given below). I want to add new data to the existing column. The new data is a matrix of 200k rows and 1 column. How can I do that? I also want to add a new column named op.
X098_DE_time
0.046104
-0.037134
-0.089496
-0.084906
-0.038594
We can use the concat function after renaming the column of the second dataframe:
df2.rename(columns={'op': 'X098_DE_time'}, inplace=True)
new_df = pd.concat([df, df2], axis=0)
Note: if we don't rename the df2 column, the resulting new_df will have 2 different columns.
To add a new column you can use
df["op"] = list_of_values
where list_of_values has one entry per row of df.
