append only select columns as rows

The original file has multiple columns but with lots of blanks, and I want to rearrange the data so that there is one nice column with the info. Starting with 910 rows and 51 cols (newFile df), I want to end up with 910+x rows and 3 cols (final df).
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3+j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3+j]], ignore_index=True)
I have this piece of code to go through newFile: if column 3+j is not null, it copies columns 0, 1, and 3+j into a new row. I tried append(), but it adds not only rows but also a bunch of columns full of NaNs again (like the original file).
Any suggestions?!

Your problem is that you are using a DataFrame and keeping the column names, so adding a new column with a value will fill the new column with NaN for the rest of the dataframe.
Plus, your code is really inefficient because of the double for loop.
Here is my solution using melt():
import numpy as np
import pandas as pd

# creating an example df with 51 columns
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)),
                  columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))
# reshaping df to long format, keeping the first two columns as identifiers
df = df.melt(id_vars=df.columns[0:2])
# dropping the rows whose value is null
df.dropna(subset=['value'], inplace=True)
# if you want to keep the information about which column each value came from,
# stop here; otherwise drop it:
df.drop(columns=['variable'], inplace=True)
print(df)
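Applied to the asker's frame, the same approach might look like this (a minimal sketch, assuming newFile's identifier columns are the first two and the 48 value columns start at index 3, as in the original loop):
final = newFile.melt(
    id_vars=newFile.columns[[0, 1]],     # keep columns 0 and 1
    value_vars=newFile.columns[3:51],    # the 48 columns the loop scanned
)
final = final.dropna(subset=['value']).drop(columns=['variable'])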

Related

moving a row next to another row in a pandas data frame

I am trying to reformat a data frame from 2 rows to 1 row, but I am encountering some issues. Do you have any idea how to do that? Here are the code and df:
Thanks!
If you are looking to convert two rows into one, you can do the following...
Stack the dataframe and reset the index at level=1, which will convert the data and columns into a stack. This ends up with each of the column headers as a column (called level_1) and the data as another column (called 0).
Then set the index to level_1, which moves the column names into the index.
Remove the index name (level_1), then transpose the dataframe.
Code is shown below.
df3 = df3.stack().reset_index(level=1).set_index('level_1')
df3.index.name = None
df3 = df3.T
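For instance, on a small made-up frame (a minimal sketch, not the asker's data):
import pandas as pd

df3 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df3 = df3.stack().reset_index(level=1).set_index('level_1')
df3.index.name = None
df3 = df3.T
print(df3)
#    a  b  a  b
# 0  1  3  2  4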

Non-standard Excel to pandas data frame

I've got a non-standard Excel table that I read with the help of openpyxl, and I've done part of the work of converting it to a pandas dataframe, but now I'm stuck with this problem.
I want to select just a range of rows and columns and get the data from them, e.g. take cells from row 4 to 12 and from column J to X. I hope you understand me.
Sorry for my English.
You can try something like this:
df = pd.read_excel('data.xlsx', skiprows=4, usecols='J:X', nrows=9)
If the number of rows is not fixed, you can use your second column as a delimiter.
df = pd.read_excel('data.xlsx', skiprows=4, usecols='J:X')
df = df[df.iloc[:, 1].notna()]
You could skip the rows as you read the Excel file into a DataFrame, initially dropping the first 4 rows, and then manipulate the DataFrame as follows.
The first line reads the file while skipping the first 4 rows.
The second line drops a range of rows from the dataframe (startRow and endRow being integer row-index values).
The third line drops 2 columns from the dataframe.
df = pd.read_excel('fileName.xlsx', skiprows=4)
df.drop(df.index[startRow:endRow + 1], inplace=True)  # drops the whole row range
df = df.drop(['column1', 'column2'], axis=1)          # assign back, or pass inplace=True
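Alternatively, since the question asks for rows 4 to 12 and columns J to X specifically, plain positional slicing with iloc would also work (a minimal sketch, assuming a file named 'data.xlsx' with no header row):
import pandas as pd

# read the raw sheet with no header, then slice by position
raw = pd.read_excel('data.xlsx', header=None)
block = raw.iloc[3:12, 9:24]  # rows 4..12, columns J..X (0-based, end-exclusive)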

Delete rows from a pandas DataFrame based on a conditional expression in another dataframe

I have two pandas dataframes, df1 and df2, with both equal number of rows. df2 has 11 rows which contain NaN values. I know how to drop the empty rows in df2, by applying:
df2.dropna(subset=['HIGH'], inplace=True)
But now I want to delete these same rows from df1 (the rows with the same row numbers that have been deleted from df2). I tried the following but this does not seem to work.
df1.drop(df2[df2['HIGH'] == 'NaN'].index, inplace=False)
Any other suggestions?
You can get all the rows containing NaN values with:
is_NaN = df2.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = df2[row_has_NaN]
After that you can delete the rows with NaN (like you said in the question).
Now you can take every index out of rows_with_NaN and delete those rows from df1 (which should have the same index, like you said).
I hope this is correct! (No test done)
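Put together, that might look like this (a minimal sketch with made-up data, assuming df1 and df2 share the same index as stated):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'HIGH': [10.0, np.nan, 30.0]})

drop_index = df2[df2['HIGH'].isna()].index  # index labels where 'HIGH' is NaN
df1 = df1.drop(drop_index)                  # drop the same labels from df1
df2 = df2.dropna(subset=['HIGH'])           # as in the question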

Adding a new column to a pandas dataframe

I have a dataframe df with one column and 500k rows (the first 5 elements are given below). I want to add new data to the existing column; the new data is a matrix of 200k rows and 1 column. How can I do that? I also want to add a new column named op.
X098_DE_time
0.046104
-0.037134
-0.089496
-0.084906
-0.038594
We can use the concat function after renaming the column in the second dataframe.
df2.rename(columns={'op': 'X098_DE_time'}, inplace=True)
new_df = pd.concat([df, df2], axis=0)
Note: if we don't rename the df2 column, the resulting new_df will have 2 different columns.
To add a new column you can use
df["new column"] = list_of_values  # the list must have one value per row

Parsing a Multi-Index Excel File in Pandas

I have a time-series Excel file with a tri-level column MultiIndex that I would like to parse, if possible. There are some results on Stack Overflow on how to do this for an index, but not for the columns, and the parse function's header argument does not seem to take a list of rows.
The ExcelFile looks like the following:
Column A is all the time series dates starting on A4
Column B has top_level1 (B1) mid_level1 (B2) low_level1 (B3) data (B4-B100+)
Column C has null (C1) null (C2) low_level2 (C3) data (C4-C100+)
Column D has null (D1) mid_level2 (D2) low_level1 (D3) data (D4-D100+)
Column E has null (E1) null (E2) low_level2 (E3) data (E4-E100+)
...
So there are two low_level values, many mid_level values, and a few top_level values, but the trick is that the null top and mid level values are assumed to be the value to their left. So, for instance, all the columns above would have top_level1 as the top multi-index value.
My best idea so far is to use transpose, but then it fills in Unnamed: # everywhere and doesn't seem to work. In pandas 0.13, read_csv seems to have a header parameter that can take a list, but this doesn't seem to work with parse.
You can fillna the null values. I don't have your file, but you can test this:
# headers as rows for now
df = pd.read_excel(xls_file, 0, header=None, index_col=0)
# fill in the null values in the "headers"
df = df.fillna(method='ffill', axis=1)
# create MultiIndex column names from the first three rows
df.columns = pd.MultiIndex.from_arrays(df[:3].values, names=['top', 'mid', 'low'])
# just the name of the index
df.index.name = 'Date'
# remove the 3 rows which are already used as column names
df = df[pd.notnull(df.index)]
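Once the MultiIndex is in place, columns can be selected level by level; a quick sketch using the level names from the question:
print(df['top_level1'])                          # everything under top_level1
print(df['top_level1']['mid_level1'])            # narrow down to one mid level
print(df.xs('low_level2', axis=1, level='low'))  # all low_level2 columns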
