Preparing an aggregate dataframe for publication - python

I have a Pandas aggregate dataframe like this:
import pandas as pd
agg_df = pd.DataFrame({'v1': ['item', 'item', 'item', 'item', 'location', 'status', 'status'],
                       'v2': ['bed', 'lamp', 'candle', 'chair', 'home', 'new', 'used'],
                       'count': ['2', '2', '2', '1', '7', '4', '3']})
agg_df
I want to prepare it for academic publication and I need a new dataframe like this:
# item bed 2
# lamp 2
# candle 2
# chair 1
# location home 7
# status new 4
# used 3
How can I create such a dataframe?

For display only, it is possible to use a MultiIndex:
df = agg_df.set_index(['v1','v2'])
print (df)
count
v1 v2
item bed 2
lamp 2
candle 2
chair 1
location home 7
status new 4
used 3
If you need to replace duplicated values, use Series.duplicated with Series.mask:
agg_df['v1'] = agg_df['v1'].mask(agg_df['v1'].duplicated(),'')
print (agg_df)
v1 v2 count
0 item bed 2
1 lamp 2
2 candle 2
3 chair 1
4 location home 7
5 status new 4
6 used 3
If you need to remove the index and column headers, use DataFrame.to_string:
print (agg_df.to_string(index=False, header=False))
item bed 2
lamp 2
candle 2
chair 1
location home 7
status new 4
used 3
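Since the goal is an academic publication, one more option (assuming a LaTeX target) is to export the MultiIndex version built above with DataFrame.to_latex; multirow=True merges the repeated v1 labels but requires \usepackage{multirow} in the LaTeX preamble:
# a sketch reusing df = agg_df.set_index(['v1','v2']) from the first snippet
print(df.to_latex(multirow=True))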

You can do that using:
import pandas as pd
agg_df = pd.DataFrame({'v1': ['item', 'item', 'item', 'item', 'location', 'status', 'status'],
                       'v2': ['bed', 'lamp', 'candle', 'chair', 'home', 'new', 'used'],
                       'count': ['2', '2', '2', '1', '7', '4', '3']})
agg_df.set_index(["v1","v2"])
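Note that set_index returns a new DataFrame rather than modifying agg_df in place, so assign the result back if you want to keep it:
agg_df = agg_df.set_index(["v1", "v2"])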

Related

How to get index level0 from multiindexed dataframe by specifying level1, level2, level3?

My question is as in the title. For example, I would like to write dataframe.index.Magic_command[None,'valueA','row','p1'] and get the value 'car', or
dataframe.index.Magic_command[None,'valueC','row','p1','1'] and receive the value 'bike' as output.
Here is the example code:
import numpy as np
import pandas as pd
# multiindex array
arr = [np.array(['car', 'car', 'car','car', 'car', 'car', 'car', 'car', 'car', 'truck', 'truck', 'truck', 'truck', 'truck', 'truck','truck', 'truck', 'truck','bike','bike', 'bike','bike','bike', 'bike','bike','bike', 'bike']),
np.array(['valueA', 'valueA','valueA', 'valueA','valueA', 'valueA','valueA', 'valueA','valueA','valueB','valueB','valueB','valueB','valueB','valueB','valueB','valueB','valueB', 'valueC','valueC','valueC','valueC','valueC','valueC','valueC','valueC','valueC']),
np.array(['row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row','row']),
np.array(['p1','p1','p1','p2','p2','p2','p3','p3','p3','p1','p1','p1','p2','p2','p2','p3','p3','p3','p1','p1','p1','p2','p2','p2','p3','p3','p3',]),
np.array(['1','2','3','1','2','3','1','2','3','1','2','3','1','2','3','1','2','3','1','2','3','1','2','3','1','2','3',])]
# forming multiindex dataframe
dataFrame = pd.DataFrame(np.random.randn(27, 3), index=arr, columns=['Col 1', 'Col 2', 'Col 3'])
dataFrame.index.names = ['level 0', 'level 1','level 2','level 3','level 4']
print(dataFrame)
From above I get this dataframe:
Col 1 Col 2 Col 3
level 0 level 1 level 2 level 3 level 4
car valueA row p1 1 0.088282 0.645195 -0.102823
2 -0.659527 -1.820909 -1.774308
3 0.859338 0.971282 0.517606
p2 1 1.205428 -0.277596 0.527442
2 0.366879 0.149401 -0.087129
3 -0.084490 -1.802438 2.000927
p3 1 -1.651197 0.340212 -2.170045
2 0.625551 -0.327191 -1.376346
3 -0.112555 -0.727614 -0.949196
truck valueB row p1 1 0.735279 0.324148 -0.588617
2 -1.398363 0.056191 -0.051693
3 -1.948123 0.316405 1.127997
p2 1 -0.899230 0.552561 -0.014481
2 -1.159626 -1.008341 0.569346
3 -0.862040 -1.654220 -0.187640
p3 1 1.177478 0.563265 -0.799456
2 0.631338 0.660141 0.801916
3 -0.361715 -0.070938 -0.113358
bike valueC row p1 1 -1.246785 0.344593 -1.363045
2 1.199800 -0.483610 0.385470
3 -0.820398 1.550655 2.625559
p2 1 0.772196 0.956007 -0.921774
2 1.102925 0.152290 -0.553291
3 0.538580 1.305551 -0.924003
p3 1 -0.025790 -0.134343 0.197256
2 -0.851465 -0.324827 0.057217
3 -0.994596 0.361060 0.797949
Use DataFrame.loc with IndexSlice (slicing all of level 0), then read level 0 of the resulting index:
a = dataFrame.loc[pd.IndexSlice[:, 'valueC', 'row', 'p1', '1'], :].index.get_level_values(0)[0]
print (a)
bike
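An alternative sketch (assuming the values of levels 1-4 are known) is DataFrame.xs, which drops the selected levels and leaves only level 0 in the index:
a = dataFrame.xs(('valueC', 'row', 'p1', '1'), level=['level 1', 'level 2', 'level 3', 'level 4']).index[0]
print(a)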

How to add a column into Pandas with a condition

Here is a simple pandas DataFrame:
import pandas as pd

data = {'Name': ['John', 'Dav', 'Ann', 'Mike', 'Dany'],
        'Number': ['2', '3', '2', '4', '2']}
df = pd.DataFrame(data, columns=['Name', 'Number'])
df
I would like to add a third column named "color" where the value is 'red' if Number = 2 and 'Blue' if Number = 3.
This dataframe has just 5 rows. In reality it has thousands of rows, so I cannot just add the column manually.
You can use .map:
dct = {2: "Red", 3: "Blue"}
df["color"] = df["Number"].astype(int).map(dct) # remove .astype(int) if the values are already integer
print(df)
Prints:
Name Number color
0 John 2 Red
1 Dav 3 Blue
2 Ann 2 Red
3 Mike 4 NaN
4 Dany 2 Red
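If the numbers that match neither key (like 4 above) should get a default label instead of NaN, one option is to chain fillna (the 'Other' label here is just a placeholder):
df["color"] = df["Number"].astype(int).map(dct).fillna("Other")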

How can I compare the row values of select columns with the same columns in another dataframe?

I have two data frames with headers as follows:
df1 = pd.DataFrame(columns=['STATE', 'COUNTY', 'QUANTITY'])
df2 = pd.DataFrame(columns=['FIPS', 'STATE', 'COUNTY'])
I want to create a 3rd data frame:
df3 = pd.DataFrame(columns=['FIPS', 'QUANTITY'])
Such that each row in df1 will have its STATE and COUNTY values compared with every row in df2 until a match is found. Once a match is found, the FIPS value from df2 and the QUANTITY value from df1 will be appended to df3.
Basically, I want a data frame that has the FIPS and QUANTITY values per county/state, and the CSV I am reading doesn't come with FIPS values.
The Code:
import pandas as pd
import numpy as np
a = [['1', '5', '10'], ['2', '6', '12'], ['3', '7', '11']]
b = [['005', '2', '6'], ['101', '1', '5'], ['201', '3', '7']]
df1 = pd.DataFrame(a, columns=['STATE', 'COUNTY', 'QUANTITY'])
df2 = pd.DataFrame(b, columns=['FIPS', 'STATE', 'COUNTY'])
df3 = pd.DataFrame(columns=['FIPS', 'QUANTITY'])
print(df1)
print(df2)
df3['QUANTITY'] = np.where((df1['STATE'] == df2['STATE']) & (df1['COUNTY'] == df2['COUNTY']),
                           df1['QUANTITY'], np.nan)
df3['FIPS'] = np.where((df1['STATE'] == df2['STATE']) & (df1['COUNTY'] == df2['COUNTY']),
                       df2['FIPS'], np.nan)
Has the Result:
STATE COUNTY QUANTITY
0 1 5 10
1 2 6 12
2 3 7 11
FIPS STATE COUNTY
0 005 2 6
1 101 1 5
2 201 3 7
FIPS QUANTITY
0 NaN NaN
1 NaN NaN
2 201 11
I'm looking for something that gives me:
STATE COUNTY QUANTITY
0 1 5 10
1 2 6 12
2 3 7 11
FIPS STATE COUNTY
0 005 2 6
1 101 1 5
2 201 3 7
FIPS QUANTITY
0 101 10
1 005 12
2 201 11
I am comfortable doing such computations in VBA, C++, and MATLAB; however, I have no clue how to compare elements of dataframes by index in Python.
Use DataFrame.merge with the default inner join and then select the columns you need:
df3 = df1.merge(df2, on=['STATE','COUNTY'])[['FIPS','QUANTITY']]
print (df3)
FIPS QUANTITY
0 101 10
1 005 12
2 201 11
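If some STATE/COUNTY pairs in df1 have no match in df2 and you want to keep those rows with a missing FIPS rather than drop them (an assumption about the desired behavior), the same merge works as a left join:
df3 = df1.merge(df2, on=['STATE', 'COUNTY'], how='left')[['FIPS', 'QUANTITY']]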
Maybe you can try something like this:
df3 = pd.merge(df1, df2, left_on = ['STATE', 'COUNTY'], right_on= ['STATE', 'COUNTY']) # merge the two dataframes with STATE and COUNTY as join keys
df3 = df3.drop(['STATE', 'COUNTY'], axis = 1) # drop columns you don't need
df3

Dropping duplicate rows but keeping certain values Pandas

I have two similar dataframes that I concatenated; they have a lot of repeated values because they are basically the same data set, but for different years.
The problem is that one of the sets has some values missing, whereas the other sometimes has these values.
For example:
Name Unit Year Level
Nik 1 2000 12
Nik 1 12
John 2 2001 11
John 2 2001 11
Stacy 1 8
Stacy 1 1999 8
.
.
I want to drop duplicates on the subset = ['Name', 'Unit', 'Level'], since some repetitions don't have years.
However, that can leave me with the rows that have no Year, and I'd like to keep the rows that do have a Year instead:
Name Unit Year Level
Nik 1 2000 12
John 2 2001 11
Stacy 1 1999 8
.
.
How do I keep these values rather than the blanks?
Use sort_values with the default parameter na_position='last' (so it can be omitted), and then drop_duplicates:
print (df)
Name Unit Year Level
0 Nik 1 NaN 12
1 Nik 1 2000.0 12
2 John 2 2001.0 11
3 John 2 2001.0 11
4 Stacy 1 NaN 8
5 Stacy 1 1999.0 8
subset = ['Name', 'Unit', 'Level']
df = df.sort_values('Year').drop_duplicates(subset)
Or:
df = df.sort_values(subset + ['Year']).drop_duplicates(subset)
print (df)
Name Unit Year Level
5 Stacy 1 1999.0 8
1 Nik 1 2000.0 12
2 John 2 2001.0 11
Another solution uses GroupBy.first to return the first non-missing value of Year per group:
df = df.groupby(subset, as_index=False, sort=False)['Year'].first()
print (df)
Name Unit Level Year
0 Nik 1 12 2000.0
1 John 2 11 2001.0
2 Stacy 1 8 1999.0
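If more columns than Year needed the same treatment, the same idea extends to all non-key columns, since GroupBy.first skips missing values per column (a sketch under that assumption):
df = df.groupby(subset, as_index=False, sort=False).first()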
One solution that comes to mind is to first sort the concatenated dataframe by year with the sort_values function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
then drop duplicates with the keep='first' parameter:
df = df.sort_values('Year').drop_duplicates(subset=['Name', 'Unit', 'Level'], keep='first')
I would suggest that you look at the creation step of your merged dataset.
When merging the data sets you can do so on multiple indices i.e.
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
With the outer join you collect all data sets and remove duplicates right away. The only thing left is to combine the Year columns, which you can do like so:
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (pd.notna(x['Year']) and x['Year'] != '') else x['Year_r'], axis=1)
This fills the gaps and afterwards you are able to simply drop the 'Year_r' column.
The benefit here is that not only NaN values of missing years are covered but also missing Years which are represented as empty strings.
Following a small working example:
import pandas as pd
import numpy as np
left = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo', 'Peter', 'Adam'],
'Unit': ['2', '4', '6', '2', '4', '12'],
'Year': ['', '2009', '1954', '2025', '2012', '2024'],
'Level': ['L1', 'L1', 'L0', 'L4', 'L3', 'L10']})
right = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo'],
'Unit': ['2', '4', '6', '2'],
'Year': ['2010', '2009', '1954', '2025'],
'Level': ['L1', 'L1', 'L0', 'L4']})
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (pd.notna(x['Year']) and x['Year'] != '') else x['Year_r'], axis=1)
df
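To finish the cleanup described above, the helper column can then be dropped:
df = df.drop(columns=['Year_r'])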

pandas python flag transactions across rows

I have data as below. I would like to flag transactions:
when the same employee has one of ('Car Rental', 'Car Rental - Gas') in the expense type column and 'Car Mileage' on the same day. In this case employee a's and c's transactions would be highlighted. Employee b's transactions won't be highlighted, as they don't meet the condition: he doesn't have a 'Car Mileage'.
I want the zflag column. Different numbers in that column indicate groups of instances where the above condition was met.
import pandas as pd

d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
     'date': ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3'],
     'usd': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
     'expense type': ['Car Mileage', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Mileage', 'Car Rental', 'food', 'wine'],
     'zflag': ['1', '1', '1', ' ', ' ', ' ', ' ', '2', '2', ' ', ' ']}
df = pd.DataFrame(data=d)
df
Out[253]:
date emp expense type usd zflag
0 1 a Car Mileage 1 1
1 1 a Car Rental 2 1
2 1 a Car Rental - Gas 3 1
3 1 a food 4
4 2 b Car Rental 5
5 2 b Car Rental - Gas 6
6 2 b food 7
7 3 c Car Mileage 8 2
8 3 c Car Rental 9 2
9 3 c food 10
10 3 c wine 11
I would appreciate it if I could get pointers on which functions to use. I am thinking of using groupby, but I'm not sure.
I understand that date + emp will be my primary key.
Here is an approach. It's not the cleanest, but what you're describing is quite specific. Some of this might be simplified with a function.
temp_df = (df.groupby(["emp", "date"])["expense type"]
             .apply(lambda x: 1 if "Car Mileage" in x.values
                    and any(k in x.values for k in ["Car Rental", "Car Rental - Gas"]) else 0)
             .rename("zzflag"))
temp_df = temp_df.loc[temp_df != 0].cumsum()  # keep only qualifying groups and number them 1, 2, ...
final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0)
Steps:
Groupby emp/date and search for criteria, 1 if met, 0 if not
Remove rows with 0's and cumsum to produce unique values
Join back to the original frame
Edit:
To answer your question below. The join works because after you run .reset_index() that takes "emp" and "date" from the index and moves them to columns.
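An alternative sketch, not the approach above: number the qualifying groups with GroupBy.ngroup and restrict the flag to the car-related rows, as in the sample output (the zflag2 column name is just a placeholder for comparing against the question's zflag):
import pandas as pd

# assumes the df built in the question above
car_types = ['Car Rental', 'Car Rental - Gas']

# True for every row whose (emp, date) group has both 'Car Mileage' and a rental-type expense
qualifies = df.groupby(['emp', 'date'])['expense type'].transform(
    lambda s: ('Car Mileage' in s.values) and s.isin(car_types).any()
).astype(bool)

# flag only the car-related rows inside qualifying groups
mask = qualifies & df['expense type'].isin(['Car Mileage'] + car_types)

df['zflag2'] = ''
df.loc[mask, 'zflag2'] = df.loc[mask].groupby(['emp', 'date']).ngroup() + 1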
