Pandas import two csv files and plot specific data - python

link 1
link 2 (I copied the table and created a csv file)
I need to plot the total population from file 1 against the Adherents total for New Jersey, as a line or bar graph, to compare them.
I've tried append to combine both CSVs, but the result comes out weird.
import pandas as pd
import matplotlib.pyplot as plt
clifton_data = pd.read_csv('cliftondata2010census.csv')
religion = pd.read_csv('2010_ Top Five States by Adherence Rate - Sheet1.csv')
all_data = clifton_data.append(religion)
all_data.plot()
all_data.plot(kind='line', x='1', y='2')  # line plot of column '1' vs column '2'
all_data.plot(kind='density')

Here is a quick guide to get you started. I hope it helps.
From link 2, you see
Massachusetts 641 2,940,199 449.05
Rhode Island 159 466,598 443.30
New Jersey 729 3,235,290 367.99
Connecticut 399 1,252,936 350.56
New York 1,630 6,286,916 324.43
Copy the text above, paste it, and save the data as congregation.txt.
Link 1 is broken. However, assuming the population data are as follows,
Massachusetts 3,141,270
Rhode Island 530,698
New Jersey 4,335,399
Connecticut 2,134,935
New York 10,366,556
Similarly, copy the text above, paste it, and save the data as population.txt.
Then, you can run something like this
import pandas as pd
import matplotlib.pyplot as plt
con = pd.read_csv('congregation.txt', sep=r'[ \t]{2,}', header=None, index_col=False, engine='python')
pop = pd.read_csv('population.txt', sep=r'[ \t]{2,}', header=None, index_col=False, engine='python')
# note concat and not append
# con[0] is state, con[2] is congregation, pop[1] is population
# print(con.head()) and print(pop.head()) to visualize if you are still confused
df = pd.concat([con[[0, 2]], pop[1]], axis=1)
df.columns = ['State', 'Congregation', 'Population']
# need to do some cleaning here to convert numbers with commas to integers
df['Congregation'] = df['Congregation'].apply(lambda t: t.replace(',', '')).astype(int)
df['Population'] = df['Population'].apply(lambda t: t.replace(',', '')).astype(int)
df.set_index('State', inplace=True)
print(df.head())
# at this stage your df looks like this
#                Congregation  Population
# State
# Massachusetts       2940199     3141270
# Rhode Island         466598      530698
# New Jersey          3235290     4335399
# Connecticut         1252936     2134935
# New York            6286916    10366556
Note: I am retaining the other states here for the sake of demonstration; otherwise, with only New Jersey, the bar plot would look empty.
ax = df.plot.bar()
plt.show()
Edit: I meant 'Adherent' not 'Congregation'. I made a mistake there.
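If you really only need New Jersey, you can select that single row before plotting. A minimal sketch building on the df above:
# double brackets keep the selection as a DataFrame
ax = df.loc[['New Jersey']].plot.bar(rot=0)
plt.show()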

Related

Looking for Simple Python Help: Counting the Number of Vehicles in a CSV by their Fuel Type

[Screenshots attached: my data in Excel, and my code]
Hello Everyone!
I am brand new to python and have some simple data I want to separate and graph in a bar chart.
I have a data set on the cars currently being driven in California. They are separated by Year, Fuel type, Zip Code, Make, and 'Light/Heavy'.
I want to tell python to count the number of Gasoline cars, the number of diesel cars, the number of battery electric cars, etc.
How could I separate this data and then graph it on a bar chart? I am assuming it is quite easy, but I have been learning Python myself for maybe a week.
I attached the data set, as well as some code that I have so far. It returns 'TRUE' when I try to make subseries of the data such as 'gas', 'diesel', etc. I am assuming Python is just telling me "yes, it says gasoline there". I now just hope to gather all the "Gasoline" rows in the 'Fuel' column and add them up by the number in the 'Vehicles' column.
Any help would be very much appreciated!!!
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('~/Desktop/PYTHON/californiavehicles.csv')
print(df.head())
print(df.describe())
X = df['Fuel']
y = df['Vehicles']
gas = df[(df['Fuel']=='Gasoline','Flex-Fuel')]
diesel = df[(df['Fuel']=='Diesel and Diesel Hybrid')]
hybrid = df[(df['Fuel']=='Hybrid Gasoline', 'Plug-in Hybrid')]
electric = df[(df['Fuel']=='Battery Electric')]
I tried to create a subseries of the data. I haven't tried to include the numbers in 'vehicles' yet because I don't know how.
This will let you use the built-in conveniences of pandas. The short answer is to use this line:
df.groupby("Fuel").sum().plot.bar()
Long answer with home-made data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
N = 1000
placeholder = [pd.NA] * N
types = np.random.choice(["Gasoline", "Diesel", "Hybrid", "Battery"], size=N)
nr_vehicles = np.random.randint(low=1, high=100, size=N)
df = pd.DataFrame(
    {
        "Date": placeholder,
        "Zip": placeholder,
        "Model year": placeholder,
        "Fuel": types,
        "Make": placeholder,
        "Duty": placeholder,
        "Vehicles": nr_vehicles,
    }
)
df.groupby("Fuel").sum().plot.bar()
You mentioned it's a CSV specifically. Read the file in line by line, split each line on commas (which produces a list for the current row), then, if currentrow[3] equals the fuel type, increment your count.
Example:
gas_cars = 0
with open("data.csv", "r") as file:
    for line in file:
        row = line.split(",")
        if row[3] == "Gasoline":
            gas_cars += int(row[6])  # num cars for that car make
        # ...
        # ...
        # ...
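One caveat: a plain line.split(",") breaks if any field contains a quoted comma. Python's built-in csv module handles quoting; here is a sketch under the same assumptions about column positions (Fuel in column 3, Vehicles in column 6):
import csv

gas_cars = 0
with open("data.csv", "r", newline="") as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row, if the file has one
    for row in reader:
        if row[3] == "Gasoline":  # the column positions are assumptions
            gas_cars += int(row[6])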

How do I filter out elements in a column of a data frame based upon if it is in a list?

I'm trying to filter out bogus locations from a column in a data frame. The column is filled with locations taken from tweets, and some of the locations aren't real. I am trying to separate them from the valid locations. Below is the code I have; however, it is not producing the right output, and instead only returns France. I'm hoping someone can identify what I'm doing wrong here, or suggest another way to try. Let me know if I didn't explain it well enough. Also, I assign variables both outside and inside the function for testing purposes.
import pandas as pd
cn_csv = pd.read_csv("~/Downloads/cntry_list.csv") #this is just a list of every country along with respective alpha 2 and alpha 3 codes, see the link below to download csv
country_names = cn_csv['country']
results = pd.read_csv("~/Downloads/results.csv") #this is a dataframe with multiple columns, one being "source location" See edit below that displays data in "Source Location" column
src_locs = results["Source Location"]
locs_to_list = list(src_locs)
new_list = [entry.split(', ') for entry in locs_to_list]
def country_name_check(input_country_list):
    cn_csv = pd.read_csv("~/Downloads/cntrylst.csv")
    country_names = cn_csv['country']
    results = pd.read_csv("~/Downloads/results.csv")
    src_locs = results["Source Location"]
    locs_to_list = list(src_locs)
    new_list = [entry.split(', ') for entry in locs_to_list]
    valid_names = []
    tobe_checked = []
    for i in new_list:
        if i in country_names.values:
            valid_names.append(i)
        else:
            tobe_checked.append(i)
    return valid_names, tobe_checked
print(country_name_check(src_locs))
EDIT 1: Adding the link for the cntry_list.csv file. I downloaded the csv of the table data. https://worldpopulationreview.com/country-rankings/country-codes
Since I am unable to share a file on here, here is the "Source Location" column data:
Source Location
She/her
South Carolina, USA
Torino
England, UK
trying to get by
Bemidiji, MN
St. Paul, MN
Stockport, England
Liverpool, England
EH7
DLR - LAX - PDX - SEA - GEG
Barcelona
Curitiba
kent
Paris, France
Moon
Denver, CO
France
If your goal is to find and list country names, both valid and not, you may filter the initial results DataFrame:
# make a list from the unique values of Source Location that match values from country_names
valid_names = list(results[results['Source Location']
                           .isin(country_names)]['Source Location']
                   .unique())
# with ~, select the unique values that don't match country_names values
tobe_checked = list(results[~results['Source Location']
                            .isin(country_names)]['Source Location']
                    .unique())
Your unwanted result of only France being returned could be solved by trying this simpler approach. However, the problem in your code may be in reading cntrylst outside of the function, as indicated by ScottC.
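If you do want to keep the question's split logic (so that e.g. 'Paris, France' counts as valid because one of its parts is a country), note that each i in the original loop is a list, and a list is never found in country_names.values. A sketch that tests the parts individually instead, assuming the same cntry_list.csv and results.csv:
valid_names = []
tobe_checked = []
for entry in results["Source Location"].dropna():
    parts = entry.split(", ")
    # keep the entry if any comma-separated part is a known country
    if any(part in country_names.values for part in parts):
        valid_names.append(entry)
    else:
        tobe_checked.append(entry)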

GeoPandas plot shapefile by ignoring some administrative areas

Shapefile Data: The entire world (with 5 administrative areas) from https://gadm.org/data.html
import geopandas as gpd
World = gpd.read_file("~/gadm36.shp")
World = World[['NAME_0', 'NAME_1', 'NAME_2', 'geometry']]  # keep only these columns
World.head()
In this GeoDataFrame I originally have 60 columns (NAME_0 for the country name, NAME_1 for the region, ...).
For now, I am interested in studying the number of users of my website in Germany
Germany = World[World['NAME_0'].isin(['Germany'])]
Now, here is my website users data by region (NAME_1); I renamed the first column to match the shapefile.
GER = pd.read_csv("~/GER.CSV",sep=";")
GER
Now I merge my data to GeoDataFrame on NAME_1 to plot users in regions
merged_ger = Germany.merge(GER, on = 'NAME_1', how='left')
merged_ger['Users'] = merged_ger['Users'].fillna(0)
The problem here is that NAME_1 is repeated according to NAME_2. Thus, the total number of users in the merged data greatly exceeds the original number:
print(merged_ger['Users'].sum())
print(GER['Users'].sum())
7172411.0
74529
So plotting the data using this code
import matplotlib.pyplot as plt
merged_ger.plot(column='Users')
is obviously wrong
How can I merge the data in this case without duplication and without affecting the final plot?
Or, how do I ignore the rest of the administrative areas in a shapefile?
Wouldn't mapping a dictionary of users per region help?
GER_users = dict(zip(GER.NAME_1, GER.Users))
Germany['Users'] = Germany['NAME_1'].map(GER_users)
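Alternatively, if you prefer to keep the merge approach, GeoDataFrame.dissolve can collapse the NAME_2 subdivisions into a single polygon per region first, so each region (and its user count) appears exactly once. A sketch:
# one row per region: geometries are unioned, other columns keep their first value
regions = Germany.dissolve(by='NAME_1').reset_index()
merged_ger = regions.merge(GER, on='NAME_1', how='left')
merged_ger['Users'] = merged_ger['Users'].fillna(0)
merged_ger.plot(column='Users')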

concat pdf tables into one excel table using python

I'm using tabula in order to concatenate all the tables in the following pdf file into one table in Excel format.
Here's my code:
from tabula import read_pdf
import pandas as pd
allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)
new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
I also tried the following:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: (screenshot)
But the issue I am facing is that from page 91 onward, the data is not formatted correctly within the excel file.
I've debugged the pages individually and I couldn't figure out why they are formatted wrongly, especially as they are in the same format.
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've run the code for two pages, 90 and 91.
Starting from row 48 you will see the difference, where you will notice that name and address are placed into one cell, and city and state are placed into one cell as well.
I dug into the source code and it has the option columns, where you can manually define column boundaries. When you set columns, you have to use guess=False.
tabula-py uses the program tabula-java, and in its documentation I found that it needs values in points or percents (not pixels). So I used the program inkscape to measure the boundaries in points.
from tabula import read_pdf
import pandas as pd
# display all columns in dataframe
pd.set_option('display.width', None)
columns = [210, 350, 420, 450] # boundaries in points
#columns = ['210,350,420,450'] # boundaries in points
pages = '90-92'
#pages = [90,91,92]
#pages = list(range(90,93))
#pages = 'all' # read all pages
tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  columns=columns,
                  guess=False)
df = pd.concat(tables).reset_index(drop=True)
#df.rename(columns=df.iloc[0], inplace=True)  # convert first row to headers
#df.drop(df.index[0], inplace=True)  # remove first row with headers
# display
#for x in range(0, len(df), 20):
#    print(df.iloc[x:x+20])
#    print('----------')
print(df.iloc[45:50])
#df.to_csv('output-pdf.csv')
#print(df[ df['State'].str.contains(' ') ])
#print(df[ df.iloc[:,3].str.contains(' ') ])
Result:
0 1 2 3 4
45 JARRARD, GARY 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
46 JARRARD, GARY 2219 COLORADO BLVD DENTON TX (940) 380-1661
47 MASON HARRISON, RATLIFF ENTERPRISES 1815 W. UNIVERSITY DRIVE DENTON TX (940) 387-5431
48 MASON HARRISON, RATLIFF ENTERPRISES 109 N. LOOP #288 DENTON TX (940) 484-2904
49 MASON HARRISON, RATLIFF ENTERPRISES 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
EDIT:
It may also need the option area (also in points) to skip the headers; otherwise you will have to remove the first row on the first page.
I didn't check all rows, but it may need some changes in the column boundaries.
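For example, a sketch of passing area, where the four values are [top, left, bottom, right] in points (the numbers below are placeholders, not measured values):
tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  columns=columns,
                  guess=False,
                  area=[80, 20, 790, 590])  # placeholders - measure your own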
EDIT:
A few rows cause problems, probably because the text in City is too long.
col3 = df.iloc[:, 3]
print(df[col3.str.contains(' ')])
Result:
0 1 2 3 4
1941 UMSTATTD RESTAURANTS, LLC 120 WEST US HIGHWAY 54 EL DORADO SPRING MS O (417) 876-5755
2079 SIMONS, GARY 1412 BURLINGTON NORTH KANSAS CIT MY O (816) 421-5941
2763 GRISHAM, ROBERT (RB) 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
2764 STAUFFER, JACOB 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830

How to group data in a DataFrame and also show the number of rows in that group?

First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', 'city' (which is a name), and also the number of cafes in the same city. So, it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I have tried to use groupby but the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']] * 5,
          [['bean_dream', 'boston', '3456']] * 4,
          [['coffee_today', 'jersey', '7643']] * 3,
          [['coffee_today', 'DC', '8902']] * 3,
          [['starbucks', 'nowwhere', '2674']] * 2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
       city    id  cafe_count
0  new_york  1234           5
1    boston  3456           4
2    jersey  7643           3
3        DC  8902           3
4  nowwhere  2674           2
And part of the example dataframe, city_directory.tail(), looks like this:
            shop      city    id
12  coffee_today        DC  8902
13  coffee_today        DC  8902
14  coffee_today        DC  8902
15     starbucks  nowwhere  2674
16     starbucks  nowwhere  2674
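For the exact three-column result the question asks for, groupby plus size() may be more direct. A sketch, assuming each row of the question's cafe_df_merged is one cafe:
result = (cafe_df_merged
          .groupby(['city_number', 'city'])
          .size()
          .reset_index(name='number_of_cafe'))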
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
