Pandas Cleaning up - python

I have an Excel file in this format and I am trying to read it into pandas and clean it up:
I read in the file with read_excel and created a MultiIndex header starting at row 7 ([2013, 2016, 2017, ...]):
df = pd.read_excel(PATH_CY_TABLE, header=[7, 8, 9])
This is how it read in:
Ideally, I want to clean it up to look something like this:
What steps can I follow to get it into this format?
A couple of things I have tried:
1. Removing level 1 of the MultiIndex, where the column names appear as 'Unnamed...':
df.columns = df.columns.get_level_values(1)
This gives me an error: IndexError: Too many levels: Index has only 1 level, not 2
2. Stacking the column index:
df.stack()
This gives me an error: TypeError: '>' not supported between instances of 'str' and 'int'
I tried this:
df.columns = df.columns.get_level_values(0)
And this gave me the first level of the MultiIndex as [2013, 2013, 2013, 2016, 2016, 2016, ...]. But I want the output df to have two levels of column labels here: level 0 and the last level.
As a first step, I am looking to remove the 'Unnamed...' column names. I have tried to post the df as text output instead of pictures, but I'm unsure how to do that correctly; when I copy and paste from a Jupyter notebook, it all comes out mangled. I am quite new to posting questions here, so I'm still working my way around.
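If the aim is just to blank out those auto-generated labels, one option is to rebuild the column MultiIndex and replace every 'Unnamed...' entry with an empty string. A minimal sketch with made-up header values (the tuples below are stand-ins for whatever your file actually produces):
import pandas as pd
# Made-up 3-level header similar to what read_excel(header=[7, 8, 9]) produces;
# blank header cells come back as 'Unnamed: ...' labels.
cols = pd.MultiIndex.from_tuples([
    (2013, 'Unnamed: 1_level_1', 'Number'),
    (2013, 'Unnamed: 2_level_1', 'MOE1 (±)'),
    (2017, 'Unnamed: 3_level_1', 'Number'),
])
df = pd.DataFrame([[100, 5, 120]], columns=cols)
# Blank out every label that pandas auto-generated
df.columns = pd.MultiIndex.from_tuples(
    [tuple('' if str(label).startswith('Unnamed') else label for label in t)
     for t in df.columns]
)
print(df.columns.tolist())
# [(2013, '', 'Number'), (2013, '', 'MOE1 (±)'), (2017, '', 'Number')]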

I still wasn't able to find a better way to post my output, but I worked out a way to clean up the file to the desired output:
I sliced level 0 of the MultiIndex to match the year I want (2017):
df1 = df
df1 = df1.iloc[:, df1.columns.get_level_values(0) == 2017]
Out:
Number MOE1 (±) Rate
Total..........................................… 323156.0 123.0 X
NaN NaN NaN NaN
Any health plan……………….……...… 294613.0 662.0 91.2
NaN NaN NaN NaN
.Any private plan2,3……………………… 217007.0 1158.0 67.2
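To get down to two header levels (the year on top and Number / MOE1 (±) / Rate underneath), the middle 'Unnamed' level can simply be dropped. A sketch continuing from the df1 sliced above, assuming the three header levels read with header=[7, 8, 9] are still present:
# Drop the middle (all 'Unnamed') level, keeping the year and the bottom labels
df1.columns = df1.columns.droplevel(1)
# Or, equivalently, keep only the levels you want by position:
# df1.columns = pd.MultiIndex.from_arrays(
#     [df1.columns.get_level_values(0), df1.columns.get_level_values(-1)])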

Related

Pandas loop over 2 dataframes and drop duplicates

I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract those unique values. So my idea was to start by merging the 2 data frames together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is perfect and the length of df is the sum of the two lengths, but then I realised where I got stuck.
The concat did its job and concatenated based on the index, so now I have tons of NaN.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values because the df structure now looks like this:
IDNumber    id
5555        NaN
so I guess that drop_duplicates is looking for exactly matching rows, and as none exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it's not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the id column as NaN.
In fact, I tried to add a new column to csv2 and passed the ids from csv1 as the values, and I can confirm that they are all NaN.
So I believe this is the source of the problem, which then affects everything else.
Can anyone help me understand how I can solve this issue?
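The trailing commas in csv1 are the likely culprit: each data row has two fields while the header row only has one ('id'), so read_csv uses the extra leading column as the index and the empty trailing field becomes the NaN 'id' column. The documented fix for files with a delimiter at the end of each line is index_col=False; a sketch, assuming the file looks like the sample above:
import pandas as pd
# index_col=False stops pandas from promoting the extra leading column to
# the index when every row ends with a trailing comma.
unique_users = pd.read_csv('./csv1.csv', index_col=False)
# dtype=str would additionally preserve the leading zeros in ids like 007559,
# e.g. pd.read_csv('./csv1.csv', index_col=False, dtype=str)
print(unique_users['id'].head())
Once the ids come back as real values, the merge/isin answers below apply as written.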
You can achieve this using df.merge():
# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# using isin() method
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# another approach
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because when you concatenate, Pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["id"]]).drop_duplicates().reset_index(drop=True)
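For a concrete picture of what drop_duplicates buys you here, a tiny sketch with made-up values (not your real ids): with keep=False, every value that appears in both Series is dropped entirely, leaving only the values unique to either side.
import pandas as pd
s1 = pd.Series([906018, 7559, 910475])   # stand-in for the csv1 ids
s2 = pd.Series([7559, 912414, 910475])   # stand-in for the csv2 ids
# keep=False drops every value that occurs more than once across the
# concatenation, i.e. everything present in both Series.
print(pd.concat([s1, s2]).drop_duplicates(keep=False).tolist())
# [906018, 912414]
If you only want the csv1 values that are absent from csv2, the isin/merge approach above is the more direct route.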

Create a new column in pandas using a value of a row

First of all, this is not a duplicate! I have searched several SO questions as well as the pandas docs, and I have not found anything conclusive on how to create a new column from a row value (like this and this)!
Imagine I have the following table: I open an .xls and create a dataframe from it. As this is a small example created from the real problem, I made this simple Excel table, which can easily be reproduced:
What I want now is to find the row that contains "Population Month Year" (I will be looking at different .xls files, but the structure is the same: population, month and year).
xls='population_example.xls'
sheet_name='Sheet1'
df = pd.read_excel(xls, sheet_name=sheet_name, header=0, skiprows=2)
df
What I thought is:
Get the value of that row with startswith
Create a column, parsing that value in Python to get the month and year.
I have tried several things similar to this:
dff=df[s.str.startswith('Population')]
dff
But the errors won't stop coming. With the code above, specifically:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have several guesses:
I am not properly understanding how Series in pandas work, even after reading the docs. I had not even thought of using them, but startswith looks like the thing I am looking for.
If I handle this properly, I might run into a NaN error, but I cannot use df.dropna() yet, as I would lose that row value (Population April 2017)!
Edit:
The problem with using this:
df[df['Area'].str.startswith('Population')]
is that it will also check the NaN values.
And this:
df['Area'].str.startswith('Population')
will give me a True/False/NaN set of values, which I am not sure how to use.
Thanks to @Erfan, I got to the solution:
Using the line of code from the comments properly (not the way I was trying), I managed to do this:
dff=df[df['Area'].str.startswith('Population', na=False)]
dff
Which would output: Population and household forecasts, 2016 to 20... NaN NaN NaN NaN NaN NaN
Now I can access this value like
value=dff.iloc[0][0]
value
To get the string I was looking for: 'Population and household forecasts, 2016 to 2041, prepared by .id , the population experts, April 2019.'
And I can python around with this to create the desired column. Thank you!
You could try:
import pandas as pd
import numpy as np
pd.DataFrame({'Area': [f'Whatever{i+1}' for i in range(3)] + [np.nan, 'Population April 2017.'],
'Population': [3867, 1675, 1904, np.nan, np.nan]}).to_excel('population_example.xls', index=False)
df = pd.read_excel('population_example.xls').fillna('')
population_date = df[df.Area.str.startswith('Population')].Area.values[0].lstrip('Population ').rstrip('.').split()
Result:
['April', '2017']
Or (if Population Month Year is always on the last row):
df.iloc[-1, 0].lstrip('Population ').rstrip('.').split()
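To finish the original task of creating new columns from that row value, the parsed pieces can simply be broadcast onto the frame. A sketch continuing from the snippet above, where population_date came out as ['April', '2017'] (the column names Month and Year are just placeholders):
# Broadcast the extracted month and year onto every row as new columns
df['Month'], df['Year'] = population_date
print(df[['Month', 'Year']].head())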

Create multiindex from existing dataframe

I've spent hours browsing everywhere now trying to create a MultiIndex from a dataframe in pandas. This is the dataframe I have (posting an Excel sheet mockup; I do have this in a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access it via new_df.index, where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
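For the visually inclined, a tiny self-contained illustration (the values are made up to mirror the question's column names):
import pandas as pd
df = pd.DataFrame({
    'user_id':     [1, 1, 2],
    'account_num': [10, 10, 20],
    'dates':       ['2020-01', '2020-02', '2020-01'],
    'sales':       [100, 150, 80],
})
out = df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
print(out.index.names)   # ['user_id', 'account_num', 'dates'] -> a MultiIndex
print(out)               # sales summed per (user_id, account_num, dates)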
For the clarification of future users, I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
optionally with inplace=True, does the job.
Here, type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
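Attaching that index back to the frame is then a one-liner (a continuation of the snippet above; the arrays come from the same frame, so the lengths match by construction):
currentDataFrame.index = midx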
There are two ways to do it; albeit not exactly like you have shown, they work.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
         A      B  C    D
A
bar 1  bar    one  5  5.0
    3  bar  three  2  1.0
    5  bar    two  6  NaN
foo 2  foo    two  3  8.0
    4  foo    two  4  2.0
nil 0  nil    one  1  NaN
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
           C    D
A   B
bar one    5  5.0
    three  2  1.0
    two    6  NaN
foo two    3  8.0
    two    4  2.0
nil one    1  NaN
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However, what comes out is the vardataframe.head() result below, which does not properly change the index of the table from Symbol back to numeric. And this hurts me a line or two later when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve the integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe, but only exists in its index. Querying a non-existent column with ix gives all NaN. As @user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()
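Putting the pieces together, a minimal sketch with made-up data (np.var in the question swapped for the built-in .var()):
import pandas as pd
voldataframe = pd.DataFrame({
    'Symbol':     ['A', 'A', 'AA', 'AA'],
    'volatility': [0.01, 0.03, 0.02, 0.05],
})
# Either keep Symbol out of the index up front ...
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
# ... or compute first and pull Symbol back out of the index afterwards
vardataframe = voldataframe.groupby('Symbol').var().reset_index()
print(vardataframe)   # columns: Symbol, volatility; index: 0, 1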

Fill pandas Panel object with data

This is probably very very basic but I can't seem to find a solution anywhere. I'm trying to construct a 3D panel object in pandas and then fill it with data which I read from several csv files. An example of what I'm trying to do would be the following:
import numpy as np
import pandas as pd
year = np.arange(2000,2005)
obs = np.arange(1,5)
variables = ['x1','x2']
data = pd.Panel(items = obs, major_axis = year, minor_axis = variables)
So that data[i] gives me all the data belonging to one of the observation units in the panel:
data[1]
x1 x2
2000 NaN NaN
2001 NaN NaN
2002 NaN NaN
2003 NaN NaN
2004 NaN NaN
Then, I read in data from a csv which gives me a DataFrame that looks like this (I'm just creating an equivalent object here to make this a working example):
x1data = pd.DataFrame(data = zip(year, np.random.randn(5)), columns = ['year', 'x1'])
x1data
year x1
0 2000 -0.261514
1 2001 0.474840
2 2002 0.021714
3 2003 -1.939358
4 2004 1.167545
Now I would like to replace the NaNs in the x1 column of data[1] with the data that is in the x1data dataframe. My first idea (given that I'm coming from R) was to simply make sure that I select an object from x1data that has the same dimensions as the x1 column in my panel and assign it to the panel:
data[1].x1 = x1data.x1
However, this doesn't work, which I guess is due to the fact that in x1data the years are a column of the dataframe, whereas in the panel they are whatever it is that shows up to the left of the columns (the "row names"; would this be an index)?
As you can probably tell from my question I'm far from really understanding what's going on in the pandas data structure so any help would be greatly appreciated!
I'm guessing this question didn't elicit a lot of replies as it was simply too basic, but just in case anyone ever comes across this and is as clueless as I was, the very simple answer is to access the panel using the .iloc method, as:
data.iloc[item, major_axis, minor_axis]
where each of the arguments can be single elements or lists, in order to write on slices of the panel. My question above would have been solved by
data.iloc[1, np.arange(2000,2005), 'x1'] = np.asarray(x1data.x1)
or
data.iloc[1, year, 'x1'] = np.asarray(x1data.x1)
Note that had I not used np.asarray, nothing would have happened, as data.iloc[] creates an object that has the years as its index, while x1data.x1 has an index starting at 0.
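Worth noting for anyone reading this now: pd.Panel was deprecated and then removed in pandas 1.0, so the snippets above only run on old versions. A rough equivalent of the same fill using a DataFrame with an (obs, year) MultiIndex, as a sketch rather than a drop-in replacement:
import numpy as np
import pandas as pd
year = np.arange(2000, 2005)
obs = np.arange(1, 5)
variables = ['x1', 'x2']
# Rows indexed by (observation unit, year) instead of a 3D Panel
idx = pd.MultiIndex.from_product([obs, year], names=['obs', 'year'])
data = pd.DataFrame(index=idx, columns=variables, dtype=float)
x1data = pd.DataFrame({'year': year, 'x1': np.random.randn(5)})
# Fill x1 for observation unit 1, addressing the rows by (obs, year) labels
data.loc[(1, x1data['year'].tolist()), 'x1'] = x1data['x1'].to_numpy()
print(data.loc[1])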
