Pandas DataFrame: sorting values by group - python

The census data comes from the CSV linked in the code below; this is the raw data I wish to manipulate.
import pandas as pd

census_df = df = pd.read_csv('https://raw.githubusercontent.com/Qian-Han/coursera-Applied-Data-Science-with-Python/master/Introduction-to-Data-Science-in-Python/original_data/census.csv')
sortedit = census_df.sort_values(by=['STNAME', 'CENSUS2010POP'], ascending=False)
I am trying to order the data in descending order by the column 'CENSUS2010POP'.
I also want to order the data alphabetically by state, which is why I have included the 'STNAME' column in the call above.
However, I only want to keep the 3 highest values of 'CENSUS2010POP' from each state ('STNAME').
Thus, if there are 146 states in total, I should have (146 x 3) rows in my new dataframe (and thus in the 'CENSUS2010POP' column).
I would be very grateful if anybody could give me a helping hand.

IIUC, groupby with .nlargest to create an index filter, chained with sort_values:
df2 = df.iloc[df.groupby('STNAME')['CENSUS2010POP']
                .nlargest(3)
                .index.get_level_values(1)] \
        .sort_values(['STNAME', 'CENSUS2010POP'], ascending=True)
print(df['STNAME'].nunique())
51
print(df2.shape)
(152, 100)
print(df2[['STNAME','CENSUS2010POP']])
          STNAME  CENSUS2010POP
49       Alabama         412992
37       Alabama         658466
0        Alabama        4779736
76        Alaska          97581
71        Alaska         291826
...          ...            ...
3137   Wisconsin         947735
3096   Wisconsin        5686986
3182     Wyoming          75450
3180     Wyoming          91738
3169     Wyoming         563626

[152 rows x 2 columns]

try this:
df = census_df.groupby(["STNAME"]).apply(lambda x: x.sort_values(["CENSUS2010POP"], ascending=False)).reset_index(drop=True)
df.groupby('STNAME').head(3)[['STNAME', 'CENSUS2010POP']]
The first statement returns the dataframe sorted by CENSUS2010POP within each STNAME.
The second statement returns the top 3 rows per state.
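For reference, the same top-3-per-state rows can also be obtained without apply: sort once, then take the head of each group. A minimal sketch, assuming census_df has been loaded as in the question:

# Sort by state (alphabetical) and population (descending) in one pass,
# then keep the first 3 rows of each state group.
top3 = (
    census_df.sort_values(['STNAME', 'CENSUS2010POP'], ascending=[True, False])
             .groupby('STNAME')
             .head(3)
)
print(top3[['STNAME', 'CENSUS2010POP']])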

Related

How to transpose pandas dataframe from multiple rows with same name to single column?

I have tried to convert multiple rows with the same name into a single column, but did not get it exactly right. I have written code for a single country, shown below.
import pandas as pd

url = 'https://raw.githubusercontent.com/gambler2020/Data_Analysis/master/population/population.csv'
df = pd.read_csv(url)
df
[image of the dataframe]
The above code reads and displays the data frame from GitHub. The problem is that there are many countries, each with hundreds of rows, and I want to convert each country's rows into a single column. The following code converts a single country's rows into one column.
dff = df["Year"].head(259)
dff
a1 = df[df["Entity"] == "Paraguay"]
a1 = a1.rename(columns={"Population (historical estimates)": "Paraguay"})
a1 = a1["Paraguay"]
a1 = a1.reset_index(drop=True)
a1
dff = pd.concat([dff, a1], axis=1)
dff
[image of the output result]
This image shows a sample of the desired dataframe, but doing this by hand for each country would take a long time because there are hundreds of countries. How do I write code that converts the raw dataframe into the desired output for all the countries?
pandas.DataFrame.pivot
result = df.pivot(index='Year', columns='Entity', values='Population (historical estimates)').reset_index()
result.columns.name = None
result
###
     Year  Afghanistan        Africa    Albania     Algeria  American Samoa  …
0  -10000      14737.0  2.276110e+05     1199.0     12090.0             NaN  …
1   -9000      20405.0  3.230350e+05     1999.0     20150.0             NaN  …
2   -8000      28253.0  4.629670e+05     3332.0     33583.0             NaN  …
3   -7000      39120.0  6.700190e+05     5554.0     55973.0             NaN  …
4   -6000      54166.0  9.791040e+05     9256.0     93289.0             NaN  …
..    ...          ...           ...        ...         ...             ...  …
254  2017   36296111.0  1.244222e+09  2884169.0  41389174.0         55617.0  …
255  2018   37171922.0  1.275921e+09  2882735.0  42228415.0         55461.0  …
256  2019   38041757.0  1.308064e+09  2880913.0  43053054.0         55312.0  …

[259 rows x 245 columns]
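One caveat: DataFrame.pivot raises a ValueError if there are duplicate (Entity, Year) pairs in the data. If that happens with your file, a hedged fallback is pivot_table, which aggregates the duplicates (using 'first' here is just one possible choice):

# pivot_table tolerates duplicate index/column pairs by aggregating them.
result = (
    df.pivot_table(index='Year', columns='Entity',
                   values='Population (historical estimates)',
                   aggfunc='first')
      .reset_index()
)
result.columns.name = None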

How to iterate over columns and check condition by group

I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan         48
2      2002  AFG      nan         49
3      2003  AFG      nan         50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work: it filters out countries where all values are NaN in either of (inflation, GDP):
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns you can work on a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# check where the sum is equal to 0 - this means there are no values in that column for a specific country
group_by = df.groupby(['country']).agg({'inflation': sum, 'GDP': sum}).reset_index()
# extract only countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), :]['country'])
# keep only the rows containing those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls after it is reshaped:
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)  # delete rows where any value is null
To convert back to long, you can use pd.melt.
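A rough sketch of that reshape-and-drop approach, using the column names from the question (note that dropping with how='any' is stricter than the groupby/filter answers above, which only drop a country when a whole variable is missing for every year):

# Long -> wide: one row per country, one column per (variable, year) pair.
wide = df.pivot(index='country', columns='year', values=['inflation', 'GDP'])

# Delete rows (countries) where any value is null, as in the dropna call above.
wide = wide.dropna(axis=0, how='any')

# Wide -> long again: move the year level back into rows.
long_again = wide.stack(level='year').reset_index()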

How to sort a MultiIndex Pandas Pivot Table based on a certain column

I am new to Python and am trying to play around with the Pandas Pivot Tables. I have searched and searched but none of the answers have been what I am looking for. Basically, I am trying to sort the below pandas pivot table
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "TIME": ["FQ1", "FQ2", "FQ2", "FQ2"],
    "NAME": ["Robert", "Miranda", "Robert", "Robert"],
    "TOTAL": [900, 42, 360, 2000],
    "TYPE": ["Air", "Ground", "Air", "Ground"],
    "GROUP": ["A", "A", "A", "A"]})
pt = pd.pivot_table(data=df,
                    values=["TOTAL"], aggfunc=np.sum,
                    index=["GROUP", "TYPE", "NAME"],
                    columns="TIME",
                    fill_value=0,
                    margins=True)
Basically I am hoping to sort the "Type" and the "Name" column based on the sum of each row.
The end goal in this case would be "Ground" type appearing first before "Air", and within the "Ground" type, I'm hoping to have Robert appear before Miranda, since his sum is higher.
Here is how it appears now:
                      TOTAL
TIME                    FQ1   FQ2   All
GROUP TYPE   NAME
A     Air    Robert     900   360  1260
      Ground Miranda      0    42    42
             Robert       0  2000  2000
All                     900  2402  3302
Thanks to anyone who is able to help!!
Try this. Because your column header is a MultiIndex, you need to use a tuple to access the columns:
pt.sort_values(['GROUP', 'TYPE', ('TOTAL', 'All')],
               ascending=[True, True, False])
Output:
                      TOTAL
TIME                    FQ1   FQ2   All
GROUP TYPE   NAME
A     Air    Robert     900   360  1260
      Ground Robert       0  2000  2000
             Miranda      0    42    42
All                     900  2402  3302
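To see where the tuple comes from: with values=["TOTAL"] and margins=True, the pivot table's columns form a MultiIndex of (value, TIME) pairs, and the margin column is ('TOTAL', 'All'). A quick way to check (the commented output is what I would expect for this example, not verified against your data):

# Inspect the MultiIndex column labels of the pivot table.
print(pt.columns.tolist())
# [('TOTAL', 'FQ1'), ('TOTAL', 'FQ2'), ('TOTAL', 'All')]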

Pandas merge result missing rows when joining on strings

I have a data set that I've been cleaning and to clean it I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:
Count  Region             Period     ACV  PRJ
167    REMAINING US WEST  3/3/2018     5   57
168    REMAINING US WEST  3/31/2018   10   83
169    SAN FRANCISCO      1/13/2018   99   76
170    SAN FRANCISCO      1/20/2018   34   21
df2 looks something like this:
Count  MKTcode  Region
11     RSMR0    REMAINING US SOUTH
12     RWMR0    REMAINING US WEST
13     SFR00    SAN FRANCISCO
I've tried merging them with this code:
df3 = pd.merge(df1, df2, on='Region', how='inner')
but for some reason pandas is not interpreting the Region columns as the same data and the merge is turning up NaN data in the MKTcode column and it seems to be appending df2 to df1, like this:
Count  Region            Period     ACV  PRJ  MKTcode
193    WASHINGTON, D.C.  3/3/2018    36   38  NaN
194    WASHINGTON, D.C.  3/31/2018   12    3  NaN
195    ATLANTA           NaN        NaN  NaN  ATMR0
196    BOSTON            NaN        NaN  NaN  B2MRN
I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as different elements.
The MKTcode and Region columns in df2 have only 12 observations, each occurring only once, whereas df1 has several repeating instances in the Region column (multiples of the same city). Is there a way I can just create a list of the 12 MKTcodes I need and perform a merge that matches each one with the regions I designate? Like a one-to-many match?
Thanks.
When a merge isn't working as expected, the first thing to do is look at the offending columns.
The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use
for df in (df1, df2):
    # Strip the column(s) you're planning to join on
    df['Region'] = df['Region'].str.strip()
Now, merging should work as expected,
pd.merge(df1, df2, on='Region', how='inner')
   Count_x  Region             Period     ACV  PRJ  Count_y  MKTcode
0  167      REMAINING US WEST  3/3/2018     5   57  12       RWMR0
1  168      REMAINING US WEST  3/31/2018   10   83  12       RWMR0
2  169      SAN FRANCISCO      1/13/2018   99   76  13       SFR00
3  170      SAN FRANCISCO      1/20/2018   34   21  13       SFR00
Another possibility, if you're still getting NaNs, is a difference in whitespace characters between words. For example, a value with a double space or a tab between words will not compare as equal with one separated by single spaces, even though both may print as 'REMAINING US WEST'.
This time, the fix is to use str.replace:
for df in (df1, df2):
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)
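If NaNs still appear after normalising whitespace, a quick diagnostic is to compare the key columns directly and see which values have no exact counterpart; a small sketch, assuming the frames are named df1 and df2 as above:

# Regions present in one frame but with no exact match in the other.
missing_in_df2 = set(df1['Region']) - set(df2['Region'])
missing_in_df1 = set(df2['Region']) - set(df1['Region'])
print(missing_in_df2)
print(missing_in_df1)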

Manipulations with Lat-Lon and Time Series Pandas

I am trying to do some file merging with Latitude and Longitude.
Input File1.csv
Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794
File2.csv
Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700
Now the expected output is
File2.csv
Name,lat,lon,timeseries1,timeseries(n) #timeseries is 24 hrs format 17:45:00
London,80.5234,121.0452,500,2103 #Addition of all three values
Paris,90.4414,130.0252,400,1391
New York,110.5324,90.0023,700,2055
With plain Python, NumPy and dictionaries this would be straightforward (key = sum of values), but I want to use pandas.
Please suggest how to start, or maybe point me to an example. I have not seen anything like dictionary types in pandas keyed by latitude and longitude.
Perform a groupby aggregation on the first df, call sum and then merge this with the other df:
In [12]:
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
df1.merge(gp, on='Name')
Out[14]:
       Name       Lat       Lon  timeseries1  timeseries(n)
0    London   80.5234  121.0452          500           2103
1     Paris   90.4414  130.0252          400           1391
2  New York  110.5324   90.0023          700           2055
the aggregation looks like this:
In [15]:
gp
Out[15]:
       Name  timeseries(n)
0    London           2103
1  New York           2055
2     Paris           1391
Your csv files can be loaded using read_csv, so something like:
df = pd.read_csv('File1.csv')
df1 = pd.read_csv('File2.csv')
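Putting the pieces together, a rough end-to-end version (the output file name here is just illustrative):

import pandas as pd

df = pd.read_csv('File1.csv')    # per-reading records with 'timeseries(n)'
df1 = pd.read_csv('File2.csv')   # one row per city with 'timeseries1'

# Sum the timeseries values per city, then attach them to File2's rows.
totals = df.groupby('Name', as_index=False)['timeseries(n)'].sum()
out = df1.merge(totals, on='Name')
out.to_csv('File2_with_totals.csv', index=False)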
