Parsing a Multi-Index Excel File in Pandas - python

I have a time series Excel file with a three-level column MultiIndex that I would like to parse. There are some answers on Stack Overflow about doing this for an index, but not for the columns, and the parse function's header argument does not seem to take a list of rows.
The ExcelFile looks like the following:
Column A is all the time series dates starting on A4
Column B has top_level1 (B1) mid_level1 (B2) low_level1 (B3) data (B4-B100+)
Column C has null (C1) null (C2) low_level2 (C3) data (C4-C100+)
Column D has null (D1) mid_level2 (D2) low_level1 (D3) data (D4-D100+)
Column E has null (E1) null (E2) low_level2 (E3) data (E4-E100+)
...
So there are two low_level values, many mid_level values, and a few top_level values, but the trick is that the top- and mid-level cells are null and are assumed to take the value of the cell to their left. So, for instance, all the columns above would have top_level1 as the top MultiIndex value.
My best idea so far is to use transpose, but it fills in Unnamed: # everywhere and doesn't seem to work. In pandas 0.13, read_csv seems to have a header parameter that can take a list, but this doesn't seem to work with parse.

You can fillna the null values. I don't have your file, but you can test
import pandas as pd

# read the sheet with the header rows as ordinary data rows for now
df = pd.read_excel(xls_file, 0, header=None, index_col=0)
# forward-fill the null cells in the "header" rows
# (note: this also forward-fills nulls in the data rows)
df = df.fillna(method='ffill', axis=1)
# build MultiIndex column names from the first three rows
df.columns = pd.MultiIndex.from_arrays(df[:3].values, names=['top', 'mid', 'low'])
# just name the index
df.index.name = 'Date'
# drop the three header rows, which have no date in the index
df = df[pd.notnull(df.index)]
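For what it's worth, newer pandas versions can build the MultiIndex directly from the header rows in read_excel, which avoids the manual fill; a minimal sketch, assuming the layout described above (dates in column A, three header rows on top):

import pandas as pd

# Sketch only: passing a list of header rows makes read_excel build the
# MultiIndex itself, forward-filling the blank header cells.
# 'xls_file' is from the question.
df = pd.read_excel(xls_file, sheet_name=0, header=[0, 1, 2], index_col=0)
df.columns.names = ['top', 'mid', 'low']
df.index.name = 'Date'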

Related

How to Create New Column in Pandas From Two Existing Dataframes

I have two pandas dataframes, call them A and B. In A is most of my data (it's 110 million rows long), while B contains information I'd like to add (it's a dataframe that lists all the identifiers and counties). In dataframe A, I have a column called identifier. In dataframe B, I have two columns, identifier and county. I want to be able to merge the dataframes such that a new dataframe is created where I preserve all of the information in A, while also adding a new column county where I use the information provided in B to do so.
You need to use pd.merge
import pandas as pd

data_A = {'incident_date': ['ert', 'szd', 'vfd', 'dvb', 'sfa', 'iop'],
          'incident': ['A', 'B', 'A', 'C', 'B', 'F']}
data_B = {'incident': ['A', 'B', 'C', 'D', 'E'],
          'number': [1, 1, 3, 23, 23]}
df_a = pd.DataFrame(data_A)
df_b = pd.DataFrame(data_B)
In order to preserve your df_A, which has millions of rows, use a left merge:
df_ans = df_a.merge(df_b[['number', 'incident']], on='incident', how='left')
The output:
print(df_ans)
Note: there is a NaN in the last row since incident 'F' is not present in the second DataFrame.
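For reference, that print should show roughly the following (the number column becomes float because of the NaN):

  incident_date incident  number
0           ert        A     1.0
1           szd        B     1.0
2           vfd        A     1.0
3           dvb        C     3.0
4           sfa        B     1.0
5           iop        F     NaN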

How to convert cells into columns in pandas? (python) [duplicate]

The problem is, when I transpose the DataFrame, the header of the transposed DataFrame becomes the Index numerical values and not the values in the "id" column. See below original data for examples:
Original data that I wanted to transpose (but keep the 0,1,2,... Index intact and change "id" to "id2" in final transposed DataFrame).
The DataFrame after I transpose; notice the headers are the Index values and NOT the "id" values (the "id" values are what I was expecting and needed).
Logic Flow
First this helped to get rid of the numerical index that got placed as the header: How to stop Pandas adding time to column title after transposing a datetime index?
Then this helped to get rid of the index numbers as the header, but now "id" and "index" got shuffled around: Reassigning index in pandas DataFrame
But now my id and index values got shuffled for some reason.
How can I fix this so the columns are [id2,600mpe, au565...]?
How can I do this more efficiently?
Here's my code:
DF = pd.read_table(data, sep="\t", index_col=[0]).transpose()  # index_col=[0] so the index values don't become their own row during transposition
m, n = DF.shape
DF.reset_index(drop=False, inplace=True)
DF.head()
This didn't help much: Add indexed column to DataFrame with pandas
If I understand your example, what seems to happen is that the transpose takes your actual index (the 0...n sequence) as the column headers. First, if you want to preserve the numerical index, you can store it as id2:
DF['id2'] = DF.index
Now if you want id to be the column headers then you must set that as an index, overriding the default one:
DF.set_index('id', inplace=True)
DF.T
I don't have your data reproduced, but this should give you the values of id across columns.
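A minimal self-contained sketch of that flow; the id values come from the question, while the gene columns and numbers are made up for illustration:

import pandas as pd

DF = pd.DataFrame({'id': ['600mpe', 'au565'],
                   'gene_a': [1.2, 3.4],
                   'gene_b': [5.6, 7.8]})
DF['id2'] = DF.index            # keep the original 0..n index around as a column
DF = DF.set_index('id').T       # the 'id' values become the column headers

print(DF)
# id      600mpe  au565
# gene_a     1.2    3.4
# gene_b     5.6    7.8
# id2        0.0    1.0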

Deleting rows in a dataset when they are missing data in a specific column using Python

I'm trying to identify which rows have a value of nan in a specific column (index 2), and either delete the rows that have nan or move the ones that don't have nan into their own dataframe. Any recommendations on how to go about either way?
I've tried to create a vector with all of the rows and specified column, but the data type object is giving me trouble. Also, I tried creating a list and adding all of the rows that != 'nan' in that specific column to the list.
patientsDD = patients.iloc[:, 2].values
ddates = []
for value in patients[:, 2]:
    if value != 'nan':
        ddates.append(value)
I'm expecting that it returns all of the rows that != 'nan' in index 2, but nothing is added to the list, and the error I am receiving is '(slice(None, None, None), 2)' is an invalid key.
I'm a newbie to all of this, so I really appreciate any help!
You can use pandas' .isna() (note the ~ to invert the mask):
patients[~patients.iloc[:, 2].isna()]
Instead of deleting the rows that are NaN, you can select only the rows that are not NaN.
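If it helps, here is a small self-contained sketch of that idea; the patients frame and its columns are made up here, only the column position (index 2) comes from the question:

import numpy as np
import pandas as pd

patients = pd.DataFrame({'name': ['a', 'b', 'c'],
                         'dob': [1990, 1985, 1978],
                         'ddate': [2019, np.nan, 2020]})

mask = patients.iloc[:, 2].notna()   # True where column index 2 has a value
with_value = patients[mask]          # rows that are not missing that column
missing = patients[~mask]            # rows that are missing it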
You can try this (assuming df is the name of your data frame):
import numpy as np
df1 = df[np.isfinite(df['index 2'])]
This will give you a new data frame df1 with only the rows that have a finite value in the column index 2. You can also try this:
import pandas as pd
df1 = df[pd.notnull(df['index 2'])]
If you want to drop all the rows that have NaN values in any of the columns, you can use this:
df1 = df.dropna()
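Note that dropna() with no arguments drops rows with NaN in any column; if you only care about one column, the subset parameter restricts the check (using the same placeholder column label as above):
df1 = df.dropna(subset=['index 2'])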

python - append only select columns as rows

The original file has multiple columns, but there are lots of blanks, and I want to rearrange it so that there is one nice column with the info. I'm starting with 910 rows and 51 columns (the newFile df) and want 910+x rows and 3 columns (the final df); at the moment my final df has 910 rows.
(newFile sample not shown)
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3+j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3+j]], ignore_index=True)
I have this piece of code to go through newFile and, if column 3+j is not null, copy columns 0, 1, and 3+j to a new row. I tried append(), but it adds not only rows but also a bunch of columns full of NaNs again (like the original file).
Any suggestions?!
Your problem is that you are appending to a DataFrame that keeps all the original column names, so each appended row only carries a few columns and the remaining columns get filled with NaN. Plus, your code is really inefficient given the double for loop.
Here is my solution using melt()
# creating an example df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)),
                  columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))
# reshaping df to long form, keeping the first two columns as identifiers
df = df.melt(id_vars=df.columns[0:2])
# dropping the rows whose value is null
df.dropna(subset=['value'], inplace=True)
# stop here if you want to keep the information about which column each value came from; otherwise drop it
df.drop(columns=['variable'], inplace=True)
print(df)
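If, as in the question, columns 0 and 1 are the identifiers and column index 2 should be left out of the reshape entirely, value_vars can restrict which columns get melted; a sketch using the question's newFile:

long_df = newFile.melt(id_vars=newFile.columns[[0, 1]],
                       value_vars=newFile.columns[3:])
long_df = long_df.dropna(subset=['value'])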

Pandas merge how to avoid unnamed column

There are two DataFrames that I want to merge:
DataFrame A columns: index, userid, locale (2000 rows)
DataFrame B columns: index, userid, age (300 rows)
When I perform the following:
pd.merge(A, B, on='userid', how='outer')
I got a DataFrame with the following columns:
index, Unnamed:0, userid, locale, age
The index column and the Unnamed:0 column are identical. I guess the Unnamed:0 column is the index column of DataFrame B.
My question is: is there a way to avoid this Unnamed column when merging two DFs?
I can drop the Unnamed column afterwards, but just wondering if there is a better way to do it.
In summary, what's happening is that the index is being saved to the file, and when you read the file back, the column previously saved as the index is loaded as a regular column.
There are a few ways to deal with this:
Method 1
When saving a pandas.DataFrame to disk, use index=False like this:
df.to_csv(path, index=False)
Method 2
When reading from file, you can define the column that is to be used as index, like this:
df = pd.read_csv(path, index_col='index')
Method 3
If method #2 does not suit you for some reason, you can always set the column to be used as index later on, like this:
df.set_index('index', inplace=True)
After this point, your dataframe should look like this:
      userid locale  age
index
0      A1092  EN-US   31
1      B9032  SV-SE   23
I hope this helps.
Either don't write the index when saving the DataFrame to a CSV file (df.to_csv('...', index=False)), or, if you have to deal with CSV files you can't change/edit, use the usecols parameter:
A = pd.read_csv('/path/to/fileA.csv', usecols=['userid', 'locale'])
in order to get rid of the Unnamed:0 column ...
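Putting it together, a small sketch of the two reads plus the merge; fileB.csv and its columns are assumed here to mirror DataFrame B from the question:

import pandas as pd

# reading only the needed columns means the saved index never comes along
A = pd.read_csv('/path/to/fileA.csv', usecols=['userid', 'locale'])
B = pd.read_csv('/path/to/fileB.csv', usecols=['userid', 'age'])

merged = pd.merge(A, B, on='userid', how='outer')   # columns: userid, locale, age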
