I have two dataframes, each of which looks like
date country value
20100101 country1 1
20100102 country1 2
20100103 country1 3
date country value
20100101 country2 4
20100102 country2 5
20100103 country2 6
I want to merge them into one dataframe looking like
date country1 country2
20100101 1 4
20100102 2 5
20100103 3 6
Is there any clever way to do this in pandas?
This looks like a pivot table, which in pandas is done with unstack for some bizarre reason.
An example analogous to the one used in Wes McKinney's "Python for Data Analysis" book:
# opersystem can be a column name or an array/Series of OS labels, as in the book example
bytz = df.groupby(['tz', opersystem])
counts = bytz.size().unstack().fillna(0)
(group by time zone and operating system in rows, then pivot so that operating system becomes the columns, just like your "country*" values).
P.S. For concatenating dataframes you can use pandas.concat. It's also often a good idea to call .reset_index on the resulting dataframe, because in some (many?) cases duplicate values in the index can make pandas go haywire, throwing strange exceptions on .apply and the like.
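Putting those two pieces together for the frames in the question, a minimal sketch (assuming the two frames are named df1 and df2 and have exactly the columns shown above) could look like this:
import pandas as pd

# Rebuild the example frames from the question
df1 = pd.DataFrame({'date': [20100101, 20100102, 20100103],
                    'country': ['country1', 'country1', 'country1'],
                    'value': [1, 2, 3]})
df2 = pd.DataFrame({'date': [20100101, 20100102, 20100103],
                    'country': ['country2', 'country2', 'country2'],
                    'value': [4, 5, 6]})

# Stack the two frames, then pivot so each country becomes its own column
merged = pd.concat([df1, df2])
result = merged.pivot(index='date', columns='country', values='value')
print(result)
# country   country1  country2
# date
# 20100101         1         4
# 20100102         2         5
# 20100103         3         6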
I have a data frame with 101 columns currently. The first column is called "Country/Region" and the other 100 are dates in MM/DD/YY format from 1/22/20 to 4/30/20, like the example below. I would like to combine repeated country entries such as 'Australia' below and have their values in the date columns added together, so that there is one row per country. I would like to keep ALL date columns.
I have tried to use the groupby() and agg() functions, but I do not know how to sum() that many columns together without calling every single one. Is there a way to do this without calling all 100 columns individually?
Country/Region | 1/22/20 | 1/23/20 | ... | 4/29/20 | 4/30/20
Afghanistan 0 0 ... 1092 1176
Australia 0 0 ... 10526 12065
Australia 0 0 ... 56289 4523
This should work:
df.pivot_table(index='Country/Region', aggfunc='sum')
Did you already try this? It should also give the expected result.
df.groupby('Country/Region').sum()
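For instance, on a small frame shaped like the one in the question (column names from the question, values made up for illustration), the groupby approach collapses the duplicate country rows while keeping every date column:
import pandas as pd

df = pd.DataFrame({'Country/Region': ['Afghanistan', 'Australia', 'Australia'],
                   '1/22/20': [0, 0, 0],
                   '4/30/20': [1176, 12065, 4523]})

# Sum every numeric column per country without listing the columns
print(df.groupby('Country/Region').sum())
#                 1/22/20  4/30/20
# Country/Region
# Afghanistan           0     1176
# Australia             0    16588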
You can do this:
df.iloc[:,1:].sum(axis=1)
I am working with a large dataset which I've stored in a pandas dataframe. All of the methods I've written to operate on this dataset work on dataframes, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object, which isn't very useful to me when I want to use dataframe-only methods.
I've searched tons of other posts but haven't found a satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: it is much too large for me to manually select groups and concatenate them into a dataframe; I need something automated.)
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
import pandas as pd

df = pd.DataFrame({'author': ['gatsby', 'king', 'michener', 'michener', 'king', 'king', 'tolkein', 'gatsby'],
                   'b': range(13, 21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a dataframe. You just need to apply some function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way and may be faster; note the dataframe names df and dfg.
df[dfg['b'].transform('count') > 0]
It computes the per-group count with transform, keeps everything with a count greater than zero (so all rows), and the resulting boolean series is applied as a mask against the original dataframe, df.
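If all you really need is the plain rows back out of the GroupBy object, another option (just a sketch, not from the answers above) is to rebuild the frame from the groups directly:
# Iterating a GroupBy yields (key, group) pairs; concatenating the groups
# gives back a regular DataFrame, with rows ordered by group key.
df_rebuilt = pd.concat([group for _, group in dfg])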
I have a pandas dataframe as follows:
You will note here that there are many rows with the same code_module, code_presentation, id_student combination.
What I want to do is merge all of these duplicate rows and, in doing so, sum the sum_click values within each group.
As an example, the top rows would be merged into one row looking as follows:
code_module code_presentation id_student sum_click
0 AAA 2013J 28400 18
In SQL terms, the primary key should be the code_module, code_presentation, id_student combination.
In my attempts so far, I tried to use groupby in the following way:
df.groupby(['id_student', 'code_presentation', 'code_module']).aggregate({'sum_click': 'sum'})
But this didn't work: it gave student ids that aren't even in my dataset, and I don't understand why.
Also, groupby doesn't seem to be quite what I'm looking for, as its output has a different structure from a standard pandas dataframe, which is what I'm after.
The problem can be seen in the following output
sum_click
id_student code_presentation code_module
6516 2014J AAA 2791
8462 2013J DDD 646
2014J DDD 10
11391 2013J AAA 934
Rows 1 and 2 (indexing from 0) should be distinct rows, not grouped together as they are here.
Try this -
df.groupby(['code_module', 'code_presentation', 'id_student']).agg(sum_clicks=('sum_click', 'sum')).reset_index()
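The .reset_index() at the end is what turns the grouped result (which carries the three keys in a MultiIndex, as in the output you posted) back into a flat dataframe with ordinary columns. A roughly equivalent sketch, using as_index=False so the keys never leave the columns in the first place (column names taken from the question):
df.groupby(['code_module', 'code_presentation', 'id_student'], as_index=False)['sum_click'].sum()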
DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 [Min, Max, Days Unused], which are all integers.
I had an iterative solution where I access DataFrame1 one row at a time, check for a match in DataFrame2, and once found append the column values from there to the original DataFrame.
Is there a better way? It is slowing my computer to a crawl, as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'Days Unused') to df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'Days Unused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
This sounds like an opportunity to use pandas' built-in functions for joining datasets - you should be able to join on MedDescription with the desired columns from DataFrame2. The join function in pandas is very efficient and should far outperform your method of looping through the rows.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is the way I used to join the two DataFrames; it seems to work, although it dropped one of the indexes that contained the devices.
I would like to sort the following dataframe:
Region LSE North South
0 Cn 33.330367 9.178917
1 Develd -36.157025 -27.669988
2 Wetnds -38.480206 -46.089908
3 Oands -47.986764 -32.324991
4 Otherg 323.209834 28.486310
5 Soys 34.936147 4.072872
6 Wht 0.983977 -14.972555
I would like to sort it so the LSE column is reordered based on the list:
lst = ['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht']
Of course, the other columns will need to be reordered accordingly as well. Is there any way to do this in pandas?
The improved support for Categoricals in pandas version 0.15 allows you to do this easily:
df['LSE_cat'] = pd.Categorical(
df['LSE'],
categories=['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht'],
ordered=True
)
df.sort('LSE_cat')
Out[5]:
Region LSE North South LSE_cat
3 3 Oands -47.986764 -32.324991 Oands
2 2 Wetnds -38.480206 -46.089908 Wetnds
1 1 Develd -36.157025 -27.669988 Develd
0 0 Cn 33.330367 9.178917 Cn
5 5 Soys 34.936147 4.072872 Soys
4 4 Otherg 323.209834 28.486310 Otherg
6 6 Wht 0.983977 -14.972555 Wht
If this is only a temporary ordering then keeping the LSE column as a Categorical may not be what you want, but if this ordering is something that you want to be able to make use of a few times in different contexts, Categoricals are a great solution.
In later versions of pandas, sort has been replaced with sort_values, so you would instead need:
df.sort_values('LSE_cat')
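In newer pandas (1.1 and later), you can also skip the helper column entirely by passing a key function to sort_values - a sketch, assuming df and lst are defined as above:
# Map each LSE value to its position in lst and sort by that rank
order = {name: i for i, name in enumerate(lst)}
df_sorted = df.sort_values('LSE', key=lambda col: col.map(order))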