import pandas as pd
dane = pd.read_csv('WHO-COVID-19-global-data _2.csv')
dane
dane.groupby('Country')[['Cumulative_cases']].sum()
KeyError: 'Country'
I don't know why this code doesn't run. Can anyone help?
There are spaces at the beginning of the column names in dane.
Remove them with the following line:
# Strip leading/trailing whitespace from every column name
dane.rename(columns=lambda x: x.strip(), inplace=True)
dane.groupby('Country')[['Cumulative_cases']].sum()
Cumulative_cases
Country
Afghanistan 5702767
Albania 1300156
Algeria 5561691
American Samoa 0
Andorra 273756
... ...
Wallis and Futuna 14
Yemen 256353
Zambia 1323403
Zimbabwe 692447
occupied Palestinian territory, including east ... 4057017
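As an alternative sketch (not the approach above): you can assign stripped names directly, or, assuming the stray spaces sit right after the comma delimiter, strip them at read time.
# Equivalent one-liner on the columns themselves
dane.columns = dane.columns.str.strip()
# Or drop spaces that follow the delimiter while parsing,
# which also cleans up headers like ' Country'
dane = pd.read_csv('WHO-COVID-19-global-data _2.csv', skipinitialspace=True)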
Scenario
If column1 = 'Value' then column2 = 'AAA'
How can we use Faker to generate mock data for these dependent columns? We need to consider both positive and negative data.
You can build on the country list that ships with Faker, like this:
import pandas as pd
from faker.providers import date_time

# Faker's date_time provider ships with a list of country records;
# build a DataFrame from it and sample 1000 rows with replacement
df = (pd.DataFrame(date_time.Provider.countries, columns=['name', 'alpha-2-code'])
        .rename(columns={'name': 'country', 'alpha-2-code': 'country_code'})
        .sample(n=1000, replace=True, ignore_index=True, random_state=2022))
Output:
>>> df
country country_code
0 Rwanda RW
1 Grenada GD
2 Oman OM
3 Moldova MD
4 Saint Vincent and the Grenadines VC
.. ... ...
995 Iceland IS
996 Seychelles SC
997 Israel IL
998 Equatorial Guinea GQ
999 Republic of Ireland IE
[1000 rows x 2 columns]
Or use pycountry.
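Since pycountry is mentioned, here is a minimal sketch of the same idea (assuming pycountry is installed; the column names are mine):
import pandas as pd
import pycountry

# pycountry exposes ISO 3166 country records with .name and .alpha_2 attributes
df = pd.DataFrame([(c.name, c.alpha_2) for c in pycountry.countries],
                  columns=['country', 'country_code'])
For the dependent-column scenario itself (if column1 = 'Value' then column2 = 'AAA'), here is a hedged sketch that generates both positive rows (the rule holds) and negative rows (the rule is deliberately violated, for negative testing). The 50/50 split and the lexify filler are assumptions, not part of the question:
import random
import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(2022)
random.seed(2022)

rows = []
for _ in range(10):
    if random.random() < 0.5:
        # positive case: column1 == 'Value' implies column2 == 'AAA'
        rows.append({'column1': 'Value', 'column2': 'AAA'})
    else:
        # negative case: the dependency is broken on purpose
        # (three random letters, almost surely not 'AAA')
        rows.append({'column1': 'Value', 'column2': fake.lexify('???').upper()})

df = pd.DataFrame(rows)
print(df)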
Could someone help? I am only trying to remove any apostrophes from string text in my data frame, and I am not sure what I am missing.
I have tried regular expressions, replace, and renaming, but I can't seem to get rid of them.
country designation points price \
0 US Martha's Vineyard 96.0 235.0
1 Spain Carodorum Selección Especial Reserva 96.0 110.0
2 US Special Selected Late Harvest 96.0 90.0
3 US Reserve 96.0 65.0
4 France La Brûlade 95.0 66.0
province region_1 region_2 variety \
0 California Napa Valley Napa Cabernet Sauvignon
1 Northern Spain Toro NaN Tinta de Toro
2 California Knights Valley Sonoma Sauvignon Blanc
3 Oregon Willamette Valley Willamette Valley Pinot Noir
4 Provence Bandol NaN Provence red blend
winery last_year_points
0 Heitz 94
1 Bodega Carmen Rodríguez 92
2 Macauley
df.columns=df.columns.str.replace("''","")
df.Designation=df.Designation.str.replace("''","")
import re
re.sub("\'+",'',df.Designation)
df.rename(Destination={'Martha's Vineyard:'Mathas'}, inplace=True)
Error message: SyntaxError: invalid syntax
The code snippet below solves your problem using an inline lambda and the string replace method.
df = pd.DataFrame({'Name': ["Tom's", "Jerry's", "Harry"]})
print(df, '\n')

      Name
0    Tom's
1  Jerry's
2    Harry

# Remove any apostrophes using a lambda and str.replace
df['Name'] = df['Name'].apply(lambda x: str(x).replace("'", ""))
print(df, '\n')

     Name
0    Toms
1  Jerrys
2   Harry
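A vectorized alternative, as a sketch (the assumption that every object column should be cleaned goes beyond what was asked):
# str.replace on each object (string) column avoids a Python-level apply
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace("'", "", regex=False)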
I am trying to fuzzy merge two dataframes in Python using the code below:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
#all_data_st = pd.merge(prospectus, filings, on='NamePeriod')
filings['key']=filings.NamePeriod.apply(lambda x : [process.extract(x, prospectus.NamePeriod, limit=1)][0][0][0])
all_data_st = filings.merge(prospectus,left_on='key',right_on='NamePeriod')
all_data_st.to_excel('merged_file_fuzzy.xlsx')
The idea is to fuzzy merge based on two columns of each dataframe, Name and Year. I tried to combine these two in one field (NamePeriod) and then merge on that, but I am getting the following error:
TypeError: expected string or bytes-like object
Any idea how to perform this fuzzy merge? Here is how these columns look in the dataframes:
print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])
Name ... NamePeriod
0 NaN ... NaN
1 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2019
2 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2018
3 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2017
4 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2016
... ... ...
15922 Huitao Technology Co., Ltd. ... NaN
15923 Leaping Group Co., Ltd. ... NaN
15924 PUYI, INC. ... NaN
15925 Puhui Wealth Investment Management Co., Ltd. ... NaN
15926 Tidal Royalty Corp. ... NaN
[15927 rows x 3 columns]
prospectus_issuer_name fyear NamePeriod
0 ALCAN ALUM LTD 1990 ALCAN ALUM LTD 1990
1 ALCAN ALUM LTD 1991 ALCAN ALUM LTD 1991
2 ALCAN ALUM LTD 1992 ALCAN ALUM LTD 1992
3 AMOCO CDA PETE CO 1992 AMOCO CDA PETE CO 1992
4 AMOCO CDA PETE CO 1992 AMOCO CDA PETE CO 1992
... ... ...
1798 KOREA GAS CORP 2016 KOREA GAS CORP 2016
1799 KOREA GAS CORP 2016 KOREA GAS CORP 2016
1800 PETROLEOS MEXICANOS 2016 PETROLEOS MEXICANOS 2016
1801 PETROLEOS MEXICANOS 2016 PETROLEOS MEXICANOS 2016
1802 BOC AVIATION PTE LTD GLOBAL 2016 BOC AVIATION PTE LTD GLOBAL 2016
[1803 rows x 3 columns]
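(A hedged aside: the TypeError above is consistent with the NaN values visible in the NamePeriod column; fuzzywuzzy's matchers expect strings, so dropping those rows before matching should avoid it.)
# Assumption: the NaN NamePeriod rows are what trips the string matcher
filings = filings.dropna(subset=['NamePeriod'])
prospectus = prospectus.dropna(subset=['NamePeriod'])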
Here is the full code I am trying to run:
import pandas as pd
from rapidfuzz import fuzz, utils

prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'

prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)

filings.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
prospectus.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([filings, prospectus], ignore_index=True)

df3.dropna(subset=["name"], inplace=True)
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)
This gives me the error: IndexError: list index out of range.
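(A hedged guess at the cause: dropna leaves gaps in df3's integer index, so the positional lookup names[i1] can run past the end of the list. Resetting the index right after the drop realigns the two:)
# Assumption: realign the index so names[i] matches df3's rows again
df3.dropna(subset=['name'], inplace=True)
df3.reset_index(drop=True, inplace=True)
names = [utils.default_process(x) for x in df3['name']]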
To summarize the problem:
- there are two DataFrames that both have a key for the name and the year
- you would like to merge the two DataFrames and remove all duplicate elements, duplicates being rows that have the same year and a very similar name
I am working with the following two example DataFrames:
import pandas as pd
df1 = pd.DataFrame({
    'Name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
    'Period': [2019, 2019, 2018]})
df2 = pd.DataFrame({
    'prospectus_issuer_name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
    'fyear': [2019, 2019, 1992]})
My approach to this problem would be to start by concatenating the two DataFrames:
df1.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
df2.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([df1, df2], ignore_index=True)
Afterwards it is possible to iterate over this new DataFrame and remove all duplicate rows. I am using RapidFuzz here, since it is faster than FuzzyWuzzy (I am the author). The following code creates a list of preprocessed names ahead of time, since entries might be compared multiple times and preprocessing takes a large share of the runtime. It then iterates over the rows and compares each one only with rows that have a higher index (rows with a lower index have already been compared, since ratio(a, b) == ratio(b, a)) and the same year. Filtering on the year first means the slow string-matching algorithm runs far less often. For all rows with the same year and a very similar name, the first row is kept and the others are deleted. You might have to play around with the score_cutoff and the matching algorithm to see which ones fit your needs best.
from rapidfuzz import fuzz, utils

# Preprocess all names once up front
names = [utils.default_process(x) for x in df3['name']]

for i1, row1 in df3.iterrows():
    # Only compare against later rows from the same year
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        # WRatio returns 0 when the score falls below score_cutoff
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)
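If you still want the original left-join behaviour (attach the best prospectus match to each filing) rather than deduplication, here is a hedged sketch with rapidfuzz, assuming both frames still carry a NamePeriod column:
from rapidfuzz import fuzz, process

choices = prospectus['NamePeriod'].dropna().tolist()

def best_match(text):
    # extractOne returns (match, score, index), or None below the cutoff
    match = process.extractOne(str(text), choices, scorer=fuzz.WRatio, score_cutoff=90)
    return match[0] if match else None

filings['key'] = filings['NamePeriod'].dropna().apply(best_match)
all_data = filings.merge(prospectus, left_on='key', right_on='NamePeriod', how='left')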
I'm trying to read a CSV file that's on GitHub with Python using pandas. I have looked all over the web and tried some solutions that I found on this website, but they do not work. What am I doing wrong?
I have tried this:
import pandas as pd
url = 'https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv'
df = pd.read_csv(url,index_col=0)
#df = pd.read_csv(url)
print(df.head(5))
You should provide the URL to the raw content. Try using this:
import pandas as pd
url = 'https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv'
df = pd.read_csv(url, index_col=0)
print(df.head(5))
Output:
alpha-2 ... intermediate-region-code
name ...
Afghanistan AF ... NaN
Åland Islands AX ... NaN
Albania AL ... NaN
Algeria DZ ... NaN
American Samoa AS ... NaN
Add ?raw=true at the end of the GitHub URL to get the raw file link.
In your case,
import pandas as pd
url = 'https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv?raw=true'
df = pd.read_csv(url,index_col=0)
print(df.head(5))
Output:
alpha-2 alpha-3 country-code iso_3166-2 region \
name
Afghanistan AF AFG 4 ISO 3166-2:AF Asia
Åland Islands AX ALA 248 ISO 3166-2:AX Europe
Albania AL ALB 8 ISO 3166-2:AL Europe
Algeria DZ DZA 12 ISO 3166-2:DZ Africa
American Samoa AS ASM 16 ISO 3166-2:AS Oceania
sub-region intermediate-region region-code \
name
Afghanistan Southern Asia NaN 142.0
Åland Islands Northern Europe NaN 150.0
Albania Southern Europe NaN 150.0
Algeria Northern Africa NaN 2.0
American Samoa Polynesia NaN 9.0
sub-region-code intermediate-region-code
name
Afghanistan 34.0 NaN
Åland Islands 154.0 NaN
Albania 39.0 NaN
Algeria 15.0 NaN
American Samoa 61.0 NaN
Note: This works only with GitHub links and not with GitLab or Bitbucket links.
You can copy/paste the URL and change two things:
Remove "blob"
Replace github.com with raw.githubusercontent.com
For instance this link:
https://github.com/mwaskom/seaborn-data/blob/master/iris.csv
Works this way:
import pandas as pd
pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
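If you do this often, a small helper (hypothetical, just the two string rules above) keeps it in one place:
import pandas as pd

def github_blob_to_raw(url: str) -> str:
    # Apply the two rules: swap the host and drop the '/blob' segment
    return url.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')

df = pd.read_csv(github_blob_to_raw('https://github.com/mwaskom/seaborn-data/blob/master/iris.csv'))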
I recommend either using pandas, as you tried and as others here have explained, or, depending on the application, the Python CSV handler CommaSeperatedPython, which is a minimalistic wrapper around the native csv library.
That library returns the contents of a file as a two-dimensional string array. It is in a very early stage though, so if you want to do large-scale data analysis, I would suggest pandas.
First convert the GitHub CSV file URL to its raw form in order to access the data, as described above:
import pandas as pd
url_data = 'https://raw.githubusercontent.com/oderofrancis/rona/main/Countries-Continents.csv'
data_csv = pd.read_csv(url_data)
data_csv.head()
I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows where the value in column 'country' makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows with proportions less than 0.01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can filter the table of proportions with something like this (assuming the proportions are stored in a column named proportion):
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
# Boolean mask of countries that make up more than 1% of rows
country_filter = wine.country.value_counts(normalize=True) > 0.01
# Index of the countries that pass the threshold
country_index = country_filter[country_filter].index
wine = wine[wine.country.isin(country_index)]
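An equivalent sketch in two lines, mapping each row's country to its normalized count (same 1% threshold as above):
# Keep only rows whose country accounts for more than 1% of the data
freq = wine['country'].value_counts(normalize=True)
wine = wine[wine['country'].map(freq) > 0.01]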