Python: reading two txt files and writing one output txt - python

week5_worldarea.txt: week5_worldpop.txt:
China 9388211 China 1415045928
India 2973190 India 1354051854
U.S. 9147420 U.S. 326766748
Indonesia 1811570 Indonesia 266794980
Brazil 8358140 Brazil 210867954
Pakistan 770880 Pakistan 200813818
Nigeria 910770 Nigeria 195875237
Bangladesh 130170 Bangladesh 166368149
Russia 16376870 Russia 143964709
Mexico 1943950 Mexico 130759074
Japan 364555 Japan 127185332
Ethiopia 1000000 Ethiopia 107534882
Philippines 298170 Philippines 106512074
Egypt 995450 Egypt 99375741
Viet-Nam 310070 Viet-Nam 96491146
DR-Congo 2267050 DR-Congo 84004989
Germany 348560 Germany 82293457
Iran 1628550 Iran 82011735
Turkey 769630 Turkey 81916871
Thailand 510890 Thailand 69183173
U.K. 241930 U.K. 66573504
France 547557 France 65233271
Italy 294140 Italy 59290969
Hello, I have two text files as you can see. I want to create a third txt file containing the country name and its population density. For example:
China 150.7258
India 455.420
.....
To do this, I wrote a Python file called untitled2 which contains these functions:
def get_area(x):
    pos1 = x.find(' ')
    area = x[pos1+1:len(x)]
    return area

def get_country(x):
    pos2 = x.find(' ')
    country = x[0:pos2]
    return country

def get_pop(x):
    pos3 = x.find(' ')
    pop = x[pos3+1:len(x)]
    return pop
and another Python file called untitled3 contains:
import untitled2

f1 = open('week5_worldarea.txt', 'r')
f2 = open('week5_worldpop.txt', 'r')
f3 = open('week5_worlddensity1.txt', 'w')
for line1 in f1:
    float(untitled2.get_area(line1))
for line2 in f2:
    float(untitled2.get_pop(line2))
    density = float(untitled2.get_pop(line2)) / float(untitled2.get_area(line1))
for line3 in f1:
    untitled2.get_country(line3)
    f3.write(str(line3) + str(density))
f1.close()
f2.close()
f3.close()
I think there is a problem with the loops, but I don't know how to correct it. Also, I use expressions like pos1 = x.find(' '); is there a way to express this using tabs? I mean, if I write pos1 = x.find('\t'), will it be wrong? Thanks so much.

You want to read the two files in lockstep, writing the density for each country as you compute it. (I'm assuming the two input files list the same countries on corresponding lines.)
with open('week5_worldarea.txt') as area_f, \
     open('week5_worldpop.txt') as pop_f, \
     open('week5_worlddensity1.txt', 'w') as dens_f:
    for area_l, pop_l in zip(area_f, pop_f):
        country1, area = area_l.split()
        country2, pop = pop_l.split()
        assert country1 == country2
        density = int(pop) / int(area)
        dens_f.write(f'{country1} {density}\n')
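As for the question about tabs: str.split() with no arguments splits on any run of whitespace, so the same code works whether the files use spaces or tabs. A quick sketch with a made-up line:

```python
# str.split() with no arguments splits on any run of whitespace,
# so spaces and tabs behave identically.
line_spaces = "China 9388211\n"
line_tabs = "China\t9388211\n"

print(line_spaces.split())  # ['China', '9388211']
print(line_tabs.split())    # ['China', '9388211']
```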

Related

Sorting values in a pandas series in ascending order not working when re-assigned

I am trying to sort a Pandas Series in ascending order.
Top15['HighRenew'].sort_values(ascending=True)
Gives me:
Country
China 1
Russian Federation 1
Canada 1
Germany 1
Italy 1
Spain 1
Brazil 1
South Korea 2.27935
Iran 5.70772
Japan 10.2328
United Kingdom 10.6005
United States 11.571
Australia 11.8108
India 14.9691
France 17.0203
Name: HighRenew, dtype: object
The values are in ascending order.
However, when I then modify the series in the context of the dataframe:
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True)
Top15['HighRenew']
Gives me:
Country
China 1
United States 11.571
Japan 10.2328
United Kingdom 10.6005
Russian Federation 1
Canada 1
Germany 1
India 14.9691
France 17.0203
South Korea 2.27935
Italy 1
Spain 1
Iran 5.70772
Australia 11.8108
Brazil 1
Name: HighRenew, dtype: object
Why is this giving me a different output to the one above?
I would be grateful for any advice.
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).to_numpy()
or
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).reset_index(drop=True)
When you call sort_values, each value keeps its original index label, so assigning the result back into the DataFrame realigns it by index and undoes the sort. Dropping the index (via to_numpy or reset_index) keeps the sorted order.
Thank you to anky for providing me with this fantastic solution!
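A minimal sketch of this index-alignment behaviour, using toy data rather than the HighRenew series:

```python
import pandas as pd

# Toy Series with a non-monotonic value order.
s = pd.Series([3.0, 1.0, 2.0], index=['a', 'b', 'c'])
df = pd.DataFrame({'x': s})

# sort_values keeps each value paired with its index label...
sorted_s = s.sort_values()
print(list(sorted_s.index))    # ['b', 'c', 'a']

# ...so assigning it back realigns by index and undoes the sort:
df['x'] = sorted_s
print(df['x'].tolist())        # [3.0, 1.0, 2.0]

# Dropping the labels (e.g. via to_numpy) keeps the sorted order:
df['x'] = sorted_s.to_numpy()
print(df['x'].tolist())        # [1.0, 2.0, 3.0]
```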

Reading excel file with line breaks and tabs preserved using xlrd

I am trying to read Excel file cells that contain multi-line text. I am using xlrd 1.2.0, but when I print the cell text, or write it to a .txt file, the line breaks and tabs (\n, \t) are not preserved.
Input:
File URL:
Excel file
Code:
import xlrd
filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0,1)
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.

Rounding and sorting dataframe with pandas

https://github.com/haosmark/jupyter_notebooks/blob/master/Coursera%20week%203%20assignment.ipynb
All the way at the bottom of the notebook, in question 3, I'm trying to average, round, and sort the data, but for some reason the rounding and sorting aren't working at all:
i = df.columns.get_loc('2006')
avgGDP = df[df.columns[i:]].copy()
avgGDP = avgGDP.mean(axis=1).round(2).sort_values(ascending=False)
avgGDP
what am I doing wrong here?
This is what df looks like before I apply average, round, and sort.
Your series is actually sorted, the first line being 1.5e+13 and the last one 4.4e+11:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660648e+12
Russian Federation 1.565460e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106714e+12
Iran 4.441558e+11
Rounding doesn't do anything visible here because the smallest value is 4e+11, and rounding it to 2 decimal places doesn't show on this scale. If you want to keep only 2 decimal places in the scientific notation, you can use .map('{:0.2e}'.format), see my note below.
Note: just for fun, you could also calculate the same with a one-liner:
df.filter(regex='^2').mean(1).sort_values()[::-1].map('{:0.2e}'.format)
Output:
Country
United States 1.54e+13
China 6.35e+12
Japan 5.54e+12
Germany 3.49e+12
France 2.68e+12
United Kingdom 2.49e+12
Brazil 2.19e+12
Italy 2.12e+12
India 1.77e+12
Canada 1.66e+12
Russian Federation 1.57e+12
Spain 1.42e+12
Australia 1.16e+12
South Korea 1.11e+12
Iran 4.44e+11
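To see why round(2) is invisible at this scale while the format string is not, here is a quick check using one of the values above:

```python
x = 4.441558e+11

# Rounding to 2 decimal places leaves a number this large unchanged:
print(round(x, 2))            # 444155800000.0

# Formatting in scientific notation is what actually shortens the display:
print('{:0.2e}'.format(x))    # 4.44e+11
```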

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows for countries with a proportion of less than 0.01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can use something like this (assuming you have stored each row's country proportion in a proportion column):
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
country_filter = wine.country.value_counts(normalize=True) > 0.01
country_index = country_filter[country_filter.values == True].index
wine = wine[wine.country.isin(list(country_index))]
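Putting that approach together on toy data (made-up counts, not the real wine dataset):

```python
import pandas as pd

# Toy stand-in: 120 US rows, 79 France rows, 1 Hungary row (0.5% of 200).
wine = pd.DataFrame({'country': ['US'] * 120 + ['France'] * 79 + ['Hungary']})

# Keep only countries that make up at least 1% of the rows.
props = wine['country'].value_counts(normalize=True)
keep = props[props >= 0.01].index
wine = wine[wine['country'].isin(keep)]

print(sorted(wine['country'].unique()))  # ['France', 'US']
print(len(wine))                         # 199
```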

How do I take specific tables from a CSV and write a new file in Python?

I want to take specific tables from a CSV file and return a file for each table. I have something that looks like this:
France city population agriculture
France Paris 2000000 lots
France Nice 500000 some
England city population agriculture
England London 30000 none
England Glasgow 10000 some
and I want to return two files, one with
France city population agriculture
France Paris 2000000 lots
France Nice 500000 some
and the other with
England city population agriculture
England London 30000 none
England Glasgow 10000 some
how do I do this?
Here is a solution without using the csv module (can the csv module separate tables?). It assumes the tables in the file are separated by a blank line:
with open('table.txt') as f:
    text = f.read()
tables = text.split('\n\n')
for itable, table in enumerate(tables):
    fileout = 'table%2.2i.txt' % itable
    with open(fileout, 'w') as f:
        f.write(table.strip())
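If the tables are not separated by blank lines, an alternative sketch (assuming consecutive rows of a table share the same first field, as in the sample shown) is to group on the first column with itertools.groupby:

```python
from itertools import groupby

# Sample input from the question, inlined here for illustration.
data = """France city population agriculture
France Paris 2000000 lots
France Nice 500000 some
England city population agriculture
England London 30000 none
England Glasgow 10000 some"""

# Consecutive lines whose first field matches belong to the same table;
# each group is written to its own file, e.g. France.txt, England.txt.
for country, group in groupby(data.splitlines(),
                              key=lambda line: line.split()[0]):
    with open(f'{country}.txt', 'w') as out:
        out.write('\n'.join(group) + '\n')
```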
