How do you remove sections from a csv file using pandas? - python

I am following along with this project guide and I reached a segment where I'm not exactly sure how the code works. Can someone explain the following block of code please:
to_drop = ['Edition Statement',
'Corporate Author',
'Corporate Contributors',
'Former owner',
'Engraver',
'Contributors',
'Issuance type',
'Shelfmarks']
df.drop(to_drop, inplace=True, axis=1)
This is the format of the csv file before the previous code is executed:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN
Corporate Contributors Former owner Engraver Issuance type \
0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
Which part of the code tells pandas to remove the columns and not rows? What do inplace=True and axis=1 mean?

This is really basic in pandas DataFrames; I'd suggest taking a free tutorial. Anyway, this code block removes the columns that you have stored in to_drop.
So for a data frame whose name is df, we remove columns using this command:
df.drop([...], inplace=True, axis=1)
where in the list we mention the columns we want to drop; axis=1 means drop them column-wise (axis=0, the default, drops rows by index label), and inplace=True simply makes it a permanent change, i.e. the change is applied to the original dataframe itself rather than returned as a modified copy.
You can also write the above command as
df.drop(['Edition Statement',
'Corporate Author',
'Corporate Contributors',
'Former owner',
'Engraver',
'Contributors',
'Issuance type',
'Shelfmarks'], inplace=True, axis=1)
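On recent pandas versions (0.21 and later) the same drop can also be spelled with the columns keyword, which makes the intent explicit and removes the need for axis=1. A minimal sketch with a toy frame (the values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'Identifier': [206, 216],
                   'Edition Statement': [None, None],
                   'Shelfmarks': ['HMNTS 12641.b.30.', 'HMNTS 12626.cc.2.']})

# Equivalent to df.drop(to_drop, inplace=True, axis=1):
df.drop(columns=['Edition Statement', 'Shelfmarks'], inplace=True)
print(df.columns.tolist())  # ['Identifier']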
Here is a quite basic guide to pandas for your future queries: Introduction to pandas

Related

Python/Pandas - How to iterate through rows of a csv data frame in one column, in order to change the values in another column

I'm reading a csv file using pandas, the file has 5 columns and 7,000 rows. The column names are:
df.columns=['agent_name', 'case_id', 'case_type', 'case_subtype', 'regional_team']
I would like to iterate through each row of the column agent_name. If the value in that column matches a name in the regional_team list, then I would like to add the regional team that corresponds to each agent's name into the regional_team column.
The regional_team column currently has NaN values as a placeholder until I figure out how add a regional team for each agent.
So for example, if the value in the agent_name column is John Smith then assign EUR to the regional_team column for that row.
I tried to establish lists to assign agent names to regional lists so that I could reference the lists in my loop statement, but I can't figure out the best way to get this to work in python.
latam_team = ['Jose Gonzales', 'Jennifer Pasale', 'Lorena Lorenzo']
eur_team = ['John Smith', 'Alaister Mckinney', 'Victoria Norton']
nam_team = ['Jenny Rivera', 'Jacob White', 'Emma Tilman']
Regional team nomenclatures which I would like to assign to the regional_team column are:
['LATAM', 'ASPAC', 'EUR', 'NAM']
I would suggest using a dictionary to hold the lists of names, keyed by region, so the dictionary keys become the values written into the regional_team column. You can then use pandas' apply on the agent_name column to check which region's list each name belongs to.
Setup example.csv
agent_name case_id case_type case_subtype regional_team
0 Alaister Mckinney 1690 type0 subtype4 NaN
1 George Harrison 1717 type0 subtype2 NaN
2 John Lennon 1389 type0 subtype2 NaN
3 Jacob White 1540 type3 subtype1 NaN
4 Jenny Rivera 1261 type2 subtype0 NaN
5 John Lennon 1302 type4 subtype4 NaN
6 Emma Tilman 1826 type4 subtype4 NaN
7 Jennifer Pasale 1044 type0 subtype1 NaN
8 John Smith 1088 type0 subtype1 NaN
9 Victoria Norton 1162 type2 subtype4 NaN
Code
import pandas as pd

df = pd.read_csv('example.csv')

agents = {
    'LATAM': ['Jose Gonzales', 'Jennifer Pasale', 'Lorena Lorenzo'],
    'EUR': ['John Smith', 'Alaister Mckinney', 'Victoria Norton'],
    'NAM': ['Jenny Rivera', 'Jacob White', 'Emma Tilman'],
    'ASPAC': ['John Lennon', 'Paul McCartney', 'George Harrison'],
}

def set_team_name(agent_name):
    # Return the region whose roster contains this agent; None if no match.
    for name_reg, name_list in agents.items():
        if agent_name in name_list:
            return name_reg

df['regional_team'] = df['agent_name'].apply(set_team_name)
print(df)
agent_name case_id case_type case_subtype regional_team
0 Alaister Mckinney 1690 type0 subtype4 EUR
1 George Harrison 1717 type0 subtype2 ASPAC
2 John Lennon 1389 type0 subtype2 ASPAC
3 Jacob White 1540 type3 subtype1 NAM
4 Jenny Rivera 1261 type2 subtype0 NAM
5 John Lennon 1302 type4 subtype4 ASPAC
6 Emma Tilman 1826 type4 subtype4 NAM
7 Jennifer Pasale 1044 type0 subtype1 LATAM
8 John Smith 1088 type0 subtype1 EUR
9 Victoria Norton 1162 type2 subtype4 EUR
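If the rosters grow large, an alternative sketch is to invert the same agents dictionary once into a flat name-to-region lookup and use Series.map, which avoids scanning every list for every row:
# Flat {agent: region} lookup built from the agents dict above.
name_to_region = {name: region
                  for region, names in agents.items()
                  for name in names}
df['regional_team'] = df['agent_name'].map(name_to_region)
Agents not found in any roster come back as NaN, matching the apply version, which returns None for them.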

Remove any apostrophes from string - Python Pandas

Could someone help? I am only trying to remove any apostrophes from string text in my data frame; I am not sure what I am missing.
I have tried regular expressions, replace, and renaming, but I can't seem to get rid of them.
country designation points price \
0 US Martha's Vineyard 96.0 235.0
1 Spain Carodorum Selección Especial Reserva 96.0 110.0
2 US Special Selected Late Harvest 96.0 90.0
3 US Reserve 96.0 65.0
4 France La Brûlade 95.0 66.0
province region_1 region_2 variety \
0 California Napa Valley Napa Cabernet Sauvignon
1 Northern Spain Toro NaN Tinta de Toro
2 California Knights Valley Sonoma Sauvignon Blanc
3 Oregon Willamette Valley Willamette Valley Pinot Noir
4 Provence Bandol NaN Provence red blend
winery last_year_points
0 Heitz 94
1 Bodega Carmen Rodríguez 92
2 Macauley
df.columns=df.columns.str.replace("''","")
df.Designation=df.Designation.str.replace("''","")
import re
re.sub("\'+",'',df.Designation)
df.rename(Destination={'Martha's Vineyard:'Mathas'}, inplace=True)
Error message: SyntaxError: invalid syntax
See the code snippet below to solve your problem, using a lambda inline function combined with the string object's replace method.
import pandas as pd

df = pd.DataFrame({'Name': ["Tom's", "Jerry's", "Harry"]})
print(df, '\n')
      Name
0    Tom's
1  Jerry's
2    Harry

# Remove any apostrophes using a lambda and replace
df['Name'] = df['Name'].apply(lambda x: str(x).replace("'", ""))
print(df, '\n')
     Name
0    Toms
1  Jerrys
2   Harry
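Two notes. The SyntaxError in the question comes from the rename line: the apostrophe inside 'Martha's Vineyard' closes the string literal early, so escape it ('Martha\'s Vineyard') or use double quotes ("Martha's Vineyard"). Also, for a whole column, pandas' vectorized string methods do the same job without a lambda; a minimal sketch on the same toy frame:
# Vectorized equivalent of the lambda above.
df['Name'] = df['Name'].str.replace("'", "", regex=False)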

pandas read_csv function reading values as NaN even though there are values in cells

I have a file that I am trying to read into a pandas dataframe. However, some of the cells are coming up as NaN even though there are values in them. The cells that do show up are float values; the ones that come up as NaN were copy-pasted into the spreadsheet, and I'm not sure why that would make a difference. Can anyone help? I have included the file as a link at this location: https://www.dropbox.com/s/30rxw07eaza29df/manhattan_hs_gps.csv?dl=0
I tried this and it worked fine; both encoding='unicode-escape' and encoding='latin-1' work:
df = pd.read_csv('manhattan_hs_gps.csv', encoding='unicode-escape', header=None)
print(df)
0 1 2 3
0 0 A. Philip Randolph Campus High School 40.818500 -73.950000
1 1 Aaron School 40.744800 -73.983700
2 2 Abraham Joshua Heschel School 40.772300 -73.989700
3 3 Academy of Environmental Science Secondary Hig... 40.785200 -73.942200
4 4 Academy for Social Action: A College Board School 40.815400 -73.955300
.. ... ... ... ...
162 164 Xavier High School 40.737900 -73.994600
163 165 Yeshiva University High School for Boys 40.851749 -73.928695
164 166 York Preparatory School 40.774100 -73.979400
165 167 Young Women's Leadership School 40.792900 -73.947200
166 168 Washington Heights Expeditionary Learning School 40.774100 -73.979400
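If you would rather detect the encoding than guess it, here is a small diagnostic sketch using the third-party chardet package (an assumption on my part; the original answer just tried encodings by hand):
import chardet
import pandas as pd

# Let chardet guess the encoding from a chunk of raw bytes.
with open('manhattan_hs_gps.csv', 'rb') as f:
    guess = chardet.detect(f.read(100_000))
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

df = pd.read_csv('manhattan_hs_gps.csv', encoding=guess['encoding'], header=None)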

Add new column to dataframe with value_counts

I have two datasets:
- population: shows the population of USA states, organized alphabetically.
- data: has more than 200,000 rows.
population.head()
state population
0 Alabama 4887871
1 Alaska 737438
2 Arizona 7171646
3 Arkansas 3013825
4 California 39557045
I'm trying to add a new column called "incidents" using the other data set.
I tried: population['incidents'] = data.state.value_counts().sort_index()
but I'm getting the following result:
state population incidents
0 Alabama 4887871 NaN
1 Alaska 737438 NaN
2 Arizona 7171646 NaN
3 Arkansas 3013825 NaN
4 California 39557045 NaN
What can I do to fix this?
EDIT:
data.state.value_counts().sort_index()
Alabama 5373
Alaska 1292
Arizona 2268
Arkansas 2753
California 15975
Colorado 3069
Connecticut 2984
Delaware 1643
District of Columbia 3091
Florida 14610
Georgia 8717
If you want to add a specific column from one dataset to the other, you do it like this:
population['incidents'] = data['columntoappend']
Your RHS (right-hand side) must be a single column, which in your case it is not.
https://www.google.com/amp/s/www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/amp/
The way to do this is as follows, provided that the lengths of your indices are consistent:
population['incidents'] = [x for x in data.state.value_counts().sort_index()]
Your original approach results in NaN because of index alignment: when you assign a Series, pandas matches on index labels, and population has a default integer index (0, 1, 2, ...) while the value_counts result is indexed by state name, so no labels match. The list comprehension sidesteps this by assigning plain values positionally, one per row.
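A sketch that depends neither on sorting nor on both sides covering the same states is to map on the state name (column names mirror the question):
counts = data['state'].value_counts()
population['incidents'] = population['state'].map(counts).fillna(0).astype(int)
map aligns each state name in population against the value_counts index, so row order no longer matters, and fillna(0) covers states with no recorded incidents.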

Merge two pandas dataframes to create a new dataframe with a specific operation

I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gets new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The variables in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group with groupby plus size and unstack, and finally join to the second DataFrame:
import pandas as pd

df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
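The same join can also be written with merge; since df1 keeps 'Company Name' in its index after unstack, the right side joins on its index (a sketch equivalent to the join above):
df3 = df2.merge(df1, left_on='Company Name', right_index=True, how='left')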
EDIT: You need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0' * 6, 'B': '0' * 9}
# expand each unit suffix into zeros, then cast to float; a sort before
# assignment would be a no-op here, since assignment re-aligns on the index
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
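A variant sketch for the unit conversion that multiplies instead of padding with zeros, which also handles fractional values such as '5.2B' (names mirror the EDIT example):
mult = {'M': 1e6, 'B': 1e9}
df['a'] = df['sale'].str[:-1].astype(float) * df['sale'].str[-1].map(mult)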
