Pandas: Constructing a cross table from Pandas DataFrame - python

I have a DataFrame generated from a CSV file with the list of districts of Buenos Aires Province (Argentina). The CSV has columns such as population and surface area for each of these districts, plus two columns with categorical variables. The first one, called "REGION", indicates whether the district is located in the north or the south of the province. The second one, called "PERTENENCIA" (belonging), indicates whether the district belongs to the metropolitan area of Buenos Aires City (Greater Buenos Aires, GBA) or lies in the interior of the province (outside GBA), so it takes the values "GBA" or "INTERIOR", respectively. Since the metropolitan area of Buenos Aires is in the north of the province, every district that belongs to GBA is also categorized as north (no district is categorized as both south and GBA).
My table looks like this ("MUNICIPIO" is the district, "POBLACION" is population, and "SUPERFICIE" is surface area):
     MUNICIPIO     REGION  PERTENENCIA  POBLACION  SUPERFICIE
0    ALSINA        SUR     INTERIOR     ...        ...
1    ADOLFO GONZ.  SUR     INTERIOR     ...        ...
2    ALBERTI       NORTE   INTERIOR     ...        ...
3    ALT. BROWN    SUR     GBA          ...        ...
4    ARRECIFES     NORTE   INTERIOR     ...        ...
5    AVELLANEDA    NORTE   GBA          ...        ...
...
140  ZARATE        NORTE   INTERIOR     ...        ...
The issue is this: I need to study the frequency of those districts jointly, by both region and belonging. I'm making a stacked bar chart for that purpose, and also a nested pie chart.
For that, I'd like to generate a cross table with the total number of districts in each category, something like this:
       GBA  INTERIOR  TOTAL
NORTE   33        41     74
SUR      0        67     67
TOTAL   33       108    141
Right now I have something like this to calculate the values manually:
cant_mun_gba=municipios['PERTENENCIA'].value_counts()['GBA']
cant_mun_interior=municipios['PERTENENCIA'].value_counts()['INTERIOR']
cant_mun_norte=municipios['REGION'].value_counts()['NORTE']
cant_mun_sur=municipios['REGION'].value_counts()['SUR']
cant_mun_norte_interior = cant_mun_norte - cant_mun_gba
cant_mun_norte_gba = cant_mun_gba
cant_mun_sur_interior=cant_mun_sur
cant_mun_sur_gba=0
Although this works, it's pretty ugly, and I'd also like to have the cross table itself, just to display it.
Is there a way to achieve this?
Thanks a lot!

Try pd.crosstab
pd.crosstab(municipios['REGION'], municipios['PERTENENCIA'])
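If you also want the TOTAL row and column from the example, crosstab can add them via margins. A small sketch, assuming the same municipios DataFrame (margins_name just renames the default "All" label):
tabla = pd.crosstab(municipios['REGION'], municipios['PERTENENCIA'],
                    margins=True, margins_name='TOTAL')
# For the stacked bar chart, drop the margins first, e.g.:
# tabla.drop('TOTAL').drop(columns='TOTAL').plot(kind='bar', stacked=True)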

Related

How to find if elements of a column in a data frame are string-contained by the elements of a column of another data frame?

I have a data frame tweets_df that looks like this:
sentiment id date text
0 0 1502071360117424136 2022-03-10 23:58:14+00:00 AngelaRaeBoon1 Same Alabama Republicans charge...
1 0 1502070916318121994 2022-03-10 23:56:28+00:00 This ’ w/the sentencing JussieSmollett But mad...
2 0 1502057466267377665 2022-03-10 23:03:01+00:00 DannyClayton Not hard find takes smallest amou...
3 0 1502053718711316512 2022-03-10 22:48:08+00:00 I make fake scenarios getting fights protectin...
4 0 1502045714486022146 2022-03-10 22:16:19+00:00 WipeHomophobia Well people lands wildest thing...
.. ... ... ... ...
94 0 1501702542899691525 2022-03-09 23:32:41+00:00 There 's reason deep look things kill bad peop...
95 0 1501700281729433606 2022-03-09 23:23:42+00:00 Shame UN United Dictators Shame NATO Repeat We...
96 0 1501699859803516934 2022-03-09 23:22:01+00:00 GayleKing The difference Ukrainian refugees IL...
97 0 1501697172441550848 2022-03-09 23:11:20+00:00 hrkbenowen And includes new United States I un...
98 0 1501696149853511687 2022-03-09 23:07:16+00:00 JLaw_OTD A world women minorities POC LGBTQ÷ d...
And the second dataFrame globe_df that looks like this:
Country Region
0 Andorra Europe
1 United Arab Emirates Middle east
2 Afghanistan Asia & Pacific
3 Antigua and Barbuda South/Latin America
4 Anguilla South/Latin America
.. ... ...
243 Guernsey Europe
244 Isle of Man Europe
245 Jersey Europe
246 Saint Barthelemy South/Latin America
247 Saint Martin South/Latin America
I want to delete all rows of the dataframe tweets_df which have 'text' that does not contain a 'Country' or 'Region'.
This was my attempt:
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
for entry in globe_df['Country']:
    tweet_index = tweets_df[entry in tweets_df['text']].index  # if tweets that *contain*, not equal...... entry in tweets_df['text'] .... (in) or (not in)?
    tweets_df.drop(tweet_index, inplace=True)
print(tweets_df)
Edit: Also, fuzzy, case-insensitive matching with stemming would be preferred when searching the 'text' for countries and regions.
Ex) If the text contained 'Ukrainian', 'british', 'engliSH', etc... then it would not be deleted
Convert country and region values to a list and use str.contains to filter out rows that do not contain these values.
# case-insensitive
vals = globe_df.stack().to_list()
tweets_df = tweets_df[tweets_df['text'].str.contains('|'.join(vals), regex=True, case=False)]
or (also case-insensitive):
vals="({})".format('|'.join(globe_df.stack().str.lower().to_list())) #make all letters lowercase
tweets_df['matched'] = tweets_df.text.str.lower().str.extract(vals, expand=False)
tweets_df = tweets_df.dropna()
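One caveat when building the pattern by joining raw values: country or region names containing regex metacharacters (parentheses, dots, etc.) can break the pattern. A small sketch of a safer variant, assuming the same globe_df and tweets_df frames:
import re
# Escape each value so it is matched literally inside the alternation
pattern = '|'.join(map(re.escape, globe_df.stack().to_list()))
tweets_df = tweets_df[tweets_df['text'].str.contains(pattern, case=False, regex=True)]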
# Import data
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
# Get the country and region columns as lists
globe_df_country = globe_df['Country'].values.tolist()
globe_df_region = globe_df['Region'].values.tolist()
# Merge the lists, since you want to check with an OR condition
merged_list = globe_df_country + globe_df_region
# If you want to update a df while iterating over it, it's best to work on a copy
df_tweets2 = tweets_df.copy()
for index, row in tweets_df.iterrows():
    # Check whether the words of this row's text intersect with merged_list
    if [i for i in merged_list if i in row['text'].split()] == []:
        df_tweets2 = df_tweets2.drop(index)
tweets_df_new = df_tweets2.copy()
print(tweets_df_new)
You can try using pandas.Series.str.contains to find the values (here entry stands for a row of globe_df, e.g. from globe_df.iterrows()):
tweets_df[tweets_df['text'].str.contains('{}|{}'.format(entry['Country'], entry['Region']))]
After building a boolean mask like this, keep the rows where the mask is True (i.e. drop the rows where it is False).

How to do a point in polygon query efficiently using geopandas?

I have a shapefile that has all the counties for the US, and I am doing a bunch of queries at a lat/lon point and then finding what county the point lies in. Right now I am just looping through all the counties and doing pnt.within(county). This isn't very efficient. Is there a better way to do this?
Your situation looks like a typical case where spatial joins are useful. The idea of spatial joins is to merge data using geographic coordinates instead of using attributes.
Three possibilities in geopandas:
intersects
within
contains
It seems like you want within, which is possible using the following syntax:
geopandas.sjoin(points, polygons, how="inner", op='within')
Note: you need to have rtree installed to be able to perform such operations; if you need this dependency, install it with pip or conda.
Example
As an example, let's plot European cities. The two example datasets are
import geopandas
import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
countries = world[world['continent'] == "Europe"].rename(columns={'name':'country'})
countries.head(2)
pop_est continent country iso_a3 gdp_md_est geometry
18 142257519 Europe Russia RUS 3745000.0 MULTIPOLYGON (((178.725 71.099, 180.000 71.516...
21 5320045 Europe Norway -99 364700.0 MULTIPOLYGON (((15.143 79.674, 15.523 80.016, ...
cities.head(2)
name geometry
0 Vatican City POINT (12.45339 41.90328)
1 San Marino POINT (12.44177 43.93610)
cities is a worldwide dataset and countries is a Europe-wide dataset.
Both datasets need to be in the same projection system (CRS). If not, use .to_crs before merging.
data_merged = geopandas.sjoin(cities, countries, how="inner", op='within')
Finally, to see the result, let's draw a map:
f, ax = plt.subplots(1, figsize=(20,10))
data_merged.plot(ax=ax)
countries.plot(ax=ax, alpha=0.25, linewidth=0.1)
plt.show()
and the merged dataset brings together the information we need:
data_merged.head(5)
name geometry index_right pop_est continent country iso_a3 gdp_md_est
0 Vatican City POINT (12.45339 41.90328) 141 62137802 Europe Italy ITA 2221000.0
1 San Marino POINT (12.44177 43.93610) 141 62137802 Europe Italy ITA 2221000.0
192 Rome POINT (12.48131 41.89790) 141 62137802 Europe Italy ITA 2221000.0
2 Vaduz POINT (9.51667 47.13372) 114 8754413 Europe Austria AUT 416600.0
184 Vienna POINT (16.36469 48.20196) 114 8754413 Europe Austria AUT 416600.0
Here I used an inner join, but that's a parameter you can change if, for instance, you want to keep all points, including those that do not fall within any polygon.
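Applied to the original question, a minimal sketch might look like the following (the shapefile path and the lon/lat points are hypothetical; in geopandas 0.10+ the keyword is predicate instead of op):
import geopandas
from shapely.geometry import Point

counties = geopandas.read_file('us_counties.shp')  # hypothetical path to the counties shapefile
points = geopandas.GeoDataFrame(
    geometry=[Point(-87.65, 41.85), Point(-73.97, 40.78)],  # hypothetical lon/lat query points
    crs=counties.crs,
)
# One row per point, carrying the attributes of the county it falls in
joined = geopandas.sjoin(points, counties, how='left', predicate='within')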

Assign a Row to Data Frame Header that Starts with a Specific String from Excel- Pandas

I have many Excel files that are in different formats. Some of them look like this, the normal case with a single header row, which can be read into pandas directly:
# First Column Second Column Address City State Zip
1 House The Clairs 4321 Main Street Chicago IL 54872
2 Restaurant The Monks 6323 East Wing Miluakee WI 45458
and some of them are in various formats with multiple header rows:
Table 1
Comp ID Info
# First Column Second Column Address City State Zip
1 Office The Fairs 1234 Main Street Seattle WA 54872
2 College The Blanks 4523 West Street Madison WI 45875
3 Ground The Brewers 895 Toronto Street Madrid IA 56487
Table2
Comp ID Info
# First Column Second Column Address City State Zip
1 College The Banks 568 Old Street Cleveland OH 52125
2 Professional The Circuits 695 New Street Boston MA 36521
(The original post included a screenshot showing how this actually looks in Excel.)
As you can see above, there are three different levels of headers. Every file is guaranteed to have a row that starts with First Column.
For an individual file like this, I can read it as below, which is just fine:
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel(xls, 'mysheet', header=[2])
However, I need a final data frame like this (appended together with the files that have only one header):
First Column Second Column Address City State Zip
0 House The Clair 4321 Main Street Chicago IL 54872
1 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
2 Office The Fairs 1234 Main Street Seattle WA 54872
3 College The Blanks 4523 West Street Madison WI 45875
4 Ground The Brewers 895 Toronto Street Madrid IA 56487
5 College The Banks 568 Old Street Cleveland OH 52125
6 Professional The Circuits 695 New Street Boston MA 36521
Since I have many files, I want to read each file in my folder and clean them up so that each keeps only one header row. Had I known the index position of the row that I need as the header, I could simply do something like in this post.
However, as some of those files can have multiple header rows (I showed 2 extra header rows in the example above; some have 4) in different formats, I want to iterate through each file and set the row that starts with First Column as the header at the beginning of the file.
Additionally, I want to drop those rows in the middle of the file that contain First Column.
After I create cleaned files whose headers start with First Column, I can append the data frames and create the output file I need. How can I achieve this in pandas? Any help or suggestions would be great.
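A minimal sketch of one possible approach, assuming (as stated) that every file contains a row with First Column and that any repeated header rows inside a file should simply be dropped (the folder pattern is hypothetical):
import glob
import pandas as pd

frames = []
for path in glob.glob(r'mypath\*.xlsx'):
    raw = pd.read_excel(path, header=None)
    # Locate the first row that contains 'First Column' and use it as the header
    header_idx = raw.index[raw.eq('First Column').any(axis=1)][0]
    df = raw.iloc[header_idx + 1:].copy()
    df.columns = raw.iloc[header_idx]
    # Drop repeated header rows that appear in the middle of the file
    df = df[~df.eq('First Column').any(axis=1)]
    frames.append(df)
result = pd.concat(frames, ignore_index=True)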

Make row operations faster in pandas

I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.
Here is the original dataset (a link and a sample screenshot were provided in the original post).
The task is to convert the data from monthly values to quarterly values, i.e. I need to aggregate the 2000-01, 2000-02 and 2000-03 data into 2000-Q1 and so on. The new value for 2000-Q1 should be the mean of those three values.
Likewise, 2000-04, 2000-05 and 2000-06 would become 2000-Q2, with the new value being their mean.
Here is how I solved the problem.
First I defined a function quarter_rows() which takes a row of data (as a Series), loops over every third element by column index, replaces values in place with the mean computed as explained above, and returns the row:
import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).ix[:, '2000-01' : ]
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result (a sample was shown in the original post).
But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks
I think that instead of apply you can resample by quarters and aggregate with mean, but first convert the column names to month periods with to_period:
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
Testing:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
# for testing, select only the first 10 rows and the columns from Jan 2000 to Jun 2000
housing3 = housing.set_index(["State","RegionName"]).ix[:10, '2000-01' : '2000-06']
print (housing3)
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
State RegionName
NY New York NaN NaN NaN NaN NaN NaN
CA Los Angeles 204400.0 207000.0 209800.0 212300.0 214500.0 216600.0
IL Chicago 136800.0 138300.0 140100.0 141900.0 143700.0 145300.0
PA Philadelphia 52700.0 53100.0 53200.0 53400.0 53700.0 53800.0
AZ Phoenix 111000.0 111700.0 112800.0 113700.0 114300.0 115100.0
NV Las Vegas 131700.0 132600.0 133500.0 134100.0 134400.0 134600.0
CA San Diego 219200.0 222900.0 226600.0 230200.0 234400.0 238500.0
TX Dallas 85100.0 84500.0 83800.0 83600.0 83800.0 84200.0
CA San Jose 364100.0 374000.0 384700.0 395700.0 407100.0 416900.0
FL Jacksonville 88000.0 88800.0 89000.0 88900.0 89600.0 90600.0
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
2000Q1 2000Q2
State RegionName
NY New York NaN NaN
CA Los Angeles 207066.666667 214466.666667
IL Chicago 138400.000000 143633.333333
PA Philadelphia 53000.000000 53633.333333
AZ Phoenix 111833.333333 114366.666667
NV Las Vegas 132600.000000 134366.666667
CA San Diego 222900.000000 234366.666667
TX Dallas 84466.666667 83866.666667
CA San Jose 374266.666667 406566.666667
FL Jacksonville 88600.000000 89700.000000
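If you are on a recent pandas where the axis keyword of resample is deprecated, one possible variant (a sketch, starting from the same monthly-period columns as above) is to group the transposed frame by quarterly periods instead:
# housing3.columns is already a monthly PeriodIndex at this point
housing3 = housing3.T.groupby(housing3.columns.asfreq('Q')).mean().T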

Group by and find top n value_counts pandas

I have a dataframe of taxi data with two columns that looks like this:
Neighborhood Borough Time
Midtown Manhattan X
Melrose Bronx Y
Grant City Staten Island Z
Midtown Manhattan A
Lincoln Square Manhattan B
Basically, each row represents a taxi pickup in that neighborhood in that borough. Now, I want to find the top 5 neighborhoods in each borough with the highest number of pickups. I tried this:
df['Neighborhood'].groupby(df['Borough']).value_counts()
Which gives me something like this:
borough
Bronx High Bridge 3424
Mott Haven 2515
Concourse Village 1443
Port Morris 1153
Melrose 492
North Riverdale 463
Eastchester 434
Concourse 395
Fordham 252
Wakefield 214
Kingsbridge 212
Mount Hope 200
Parkchester 191
......
Staten Island Castleton Corners 4
Dongan Hills 4
Eltingville 4
Graniteville 4
Great Kills 4
Castleton 3
Woodrow 1
How do I filter it so that I get only the top 5 from each? I know there are a few questions with a similar title but they weren't helpful to my case.
I think you can use nlargest - you can change 1 to 5:
s = df['Neighborhood'].groupby(df['Borough']).value_counts()
print s
Borough
Bronx Melrose 7
Manhattan Midtown 12
Lincoln Square 2
Staten Island Grant City 11
dtype: int64
print s.groupby(level=[0,1]).nlargest(1)
Bronx Bronx Melrose 7
Manhattan Manhattan Midtown 12
Staten Island Staten Island Grant City 11
dtype: int64
(Edit: specified the level info because additional columns were getting created.)
You can do this in a single line by slightly extending your original groupby with 'nlargest':
>>> df.groupby(['Borough', 'Neighborhood']).Neighborhood.value_counts().nlargest(5)
Borough Neighborhood Neighborhood
Bronx Melrose Melrose 1
Manhattan Midtown Midtown 1
Manhatten Lincoln Square Lincoln Square 1
Midtown Midtown 1
Staten Island Grant City Grant City 1
dtype: int64
Solution for getting the top n from every group:
df.groupby(['Borough']).Neighborhood.value_counts().groupby(level=0, group_keys=False).head(5)
.value_counts().nlargest(5) in the other answers only gives a single overall top 5, not the top 5 for every group, which doesn't make sense to me either.
group_keys=False avoids a duplicated index.
Because value_counts() already sorts the counts, head(5) is all that's needed.
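If you then want the result as a flat DataFrame rather than a MultiIndexed Series, a small sketch (the 'pickups' column name is just illustrative):
top5 = (df.groupby(['Borough']).Neighborhood.value_counts()
          .groupby(level=0, group_keys=False).head(5)
          .rename('pickups')
          .reset_index())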
df['Neighborhood'].groupby(df['Borough']).value_counts().head(5)
head(5) returns the first 5 rows of the result.
Try this one (just change the number in head() to your choice):
# top 3 : total counts of 'Neighborhood' in each Borough
Z = df.groupby('Borough')['Neighborhood'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
You can also try the code below to get only the top 10 values of the value counts.
'country_code' and 'raised_amount_usd' are column names.
groupby_country_code=master_frame.groupby('country_code')
arr=groupby_country_code['raised_amount_usd'].sum().sort_index()[0:10]
print(arr)
[0:10] slices the first 10 entries of the result; you can choose your own slice.
