I have looked at other posts here but I can't seem to find any help with what I'm specifically trying to do.
I have data in 'food.txt' representing annual consumption per capita, and I have to open and read the text file into a list of lists named data:
FOOD | 1980 1985 1990 1995 2000 2005
-----+-------------------------------------------------
BEEF | 72.1 68.1 63.9 63.5 64.5 62.4
PORK | 52.1 49.2 46.4 48.4 47.8 46.5
FOWL | 40.8 48.5 56.2 62.1 67.9 73.6
FISH | 12.4 13.7 14.9 14.8 15.2 16.1
This is what I have so far to split it into lines:
data = []
filename = 'food.txt'
with open(filename, 'r') as inputfile:
    for line in inputfile:
        data.append(line.strip().split(','))
This splits the file into separate lines, but I can't use the result as input for the graphs, which is the second part (and the part I do know how to do). I should be able to index into it as below, because that would give only the numerical values, which is what I need:
years = data[0][1:]
porkconsumption = data[2][1:]
Any help would be appreciated, thank you.
I suspect what you have, after processing, is a list of strings like so:
['ABC 345 678','DEF 789 657']
Change your code to say line.strip().split() (splitting on whitespace instead of commas), and your data list will be filled with lists like so:
[['ABC', '345', '678'],['DEF','789','657']]
Then convert the entries into numbers you can plot. Note that your values are floats, and in Python 3 map returns an iterator, so:
pork = list(map(float, data[2][1:]))
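Putting the pieces together, here is a minimal sketch of the whole pipeline, assuming food.txt looks exactly like the sample above (including the -----+--- separator row) and that matplotlib is available; it drops the '|' column separator while splitting:

import matplotlib.pyplot as plt

data = []
with open('food.txt') as inputfile:
    for line in inputfile:
        if not line.strip() or line.startswith('-----'):  # skip blanks and the separator row
            continue
        label, _, values = line.partition('|')            # split off the FOOD label column
        data.append([label.strip()] + values.split())

years = list(map(int, data[0][1:]))    # header row: 1980 1985 ...
pork = list(map(float, data[2][1:]))   # data[1] is BEEF, data[2] is PORK here

plt.plot(years, pork, marker='o')
plt.xlabel('Year')
plt.ylabel('Pork consumption per capita')
plt.show()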
   npop   wht   blk  his   bush   gore
 198326  74.4  21.5  4.7  34224  47465
  20761  82.4  16.8  1.5   5710   2492
 146223  80.0  12.4  1.2  38737  18950
I'm trying to figure out how to calculate the average vote per county obtained by the candidates (bush, gore). I'm using Pandas to work with the datasets. Here's what I have so far:
def averageCandidateVotes(filename, column):
    input = pd.read_csv(filename)
    candidate = input[column]
    avg_vote =   # <- this is where I'm stuck
    return avg_vote
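One way to fill in the missing line is simply Series.mean(); a minimal sketch, assuming the CSV has one row per county and a numeric vote column per candidate ('bush', 'gore') as in the sample table, with 'election.csv' as a hypothetical filename:

import pandas as pd

def averageCandidateVotes(filename, column):
    """Return the mean vote count per county for the given candidate column."""
    votes = pd.read_csv(filename)
    candidate = votes[column]      # e.g. column='bush' or column='gore'
    return candidate.mean()        # Series.mean() averages over the rows, i.e. over counties

# Hypothetical usage:
# print(averageCandidateVotes('election.csv', 'bush'))
# print(averageCandidateVotes('election.csv', 'gore'))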
I have a document.gca file that contains specific information I need, and I'm trying to extract certain pieces of it. A block of text repeats throughout the file with the following pattern:
#Sta/Elev= xx
(here go the number pairs)
#Mann
This block repeats several times. My goal is to capture the number pairs that lie in each such interval, and to repeat this for the whole file. How can I extract them? Say I have this:
Sta/Elev= 259
0 2186.31 .3 2186.14 .9 2185.83 1.4 2185.56 2.5 2185.23
3 2185.04 3.6 2184.83 4.7 2184.61 5.6 2184.4 6.4 2184.17
6.9 2183.95 7.5 2183.69 7.6 2183.59 8 2183.35 8.6 2182.92
10.2 2181.47 10.8 2181.03 11.3 2180.63 11.9 2180.27 12.4 2179.97
13 2179.72 13.6 2179.47 14.1 2179.3 14.3 2179.21 14.7 2179.11
15.7 2178.9 17.4 2178.74 17.9 2178.65 20.1 2178.17 20.4 2178.13
20.4 2178.12 21.5 2177.94 22.6 2177.81 22.6 2177.8 22.9 2177.79
24.1 2177.78 24.4 2177.75 24.6 2177.72 24.8 2177.68 25.2 2177.54
Mann= 3 , 0 , 0
0 .2 0 26.9 .2 0 46.1 .2 0
Bank Sta=26.9,46.1
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2176.01,0.3, 56
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
Type RM Length L Ch R = 1 ,2655 ,11.2,11.1,10.5
XS GIS Cut Line=4
858341.2470677761196439.12427935858354.9998313071196457.53292637
858369.2753539641196470.40256485858387.8228168661196497.81690065
Node Last Edited Time=Aug/05/2019 11:42:02
Sta/Elev= 245
0 2191.01 .8 2190.54 2.5 2189.4 5 2187.76 7.2 2186.4
8.2 2185.73 9.5 2184.74 10.1 2184.22 10.3 2184.04 10.8 2183.55
12.8 2180.84 13.1 2180.55 13.3 2180.29 13.9 2179.56 14.2 2179.25
14.5 2179.03 15.8 2178.18 16.4 2177.81 16.7 2177.65 17 2177.54
17.1 2177.51 17.2 2177.48 17.5 2177.43 17.6 2177.4 17.8 2177.39
18.3 2177.37 18.8 2177.37 19.7 2177.44 20 2177.45 20.6 2177.45
20.7 2177.45 20.8 2177.44 21 2177.42 21.3 2177.41 21.4 2177.4
21.7 2177.32 22 2177.26 22.1 2177.21 22.2 2177.13 22.5 2176.94
22.6 2176.79 22.9 2176.54 23.2 2176.19 23.5 2175.88 23.9 2175.68
24.4 2175.55 24.6 2175.54 24.8 2175.53 24.9 2175.53 25.1 2175.54
25.7 2175.63 26 2175.71 26.3 2175.78 26.4 2175.8 26.4 2175.82
#Mann= 3 , 0 , 0
0 .2 0 22.9 .2 0 43 .2 0
Bank Sta=22.9,43
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2175.68,0.3, 51
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
But I want to select the numbers between Sta/Elev and Mann, and save them as pair vectors for each Sta/Elev block. Right now I have this:
import re

with open('a.g01', 'r') as file:
    file_contents = file.read()

# print(file_contents)
try:
    found = re.search('#Sta/Elev(.+?)#Mann', file_contents).group(1)
except AttributeError:
    found = ''  # apply your error handling
print(found)
found is empty, but I want to capture all the numbers in the interval between '#Sta/Elev' and '#Mann'.
The problem is in your regex; try switching
found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
to
found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
output:
>>> import re
>>> file_contents = 'Sta/ElevthisisatestMann'
>>> found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
>>> print(found)
thisisatest
Edit:
For multiline matching try adding the DOTALL parameter:
found = re.search('Sta/Elev=(.*)Mann',file_contents, re.DOTALL).group(1)
It was not clear to me what the separating string is, since it differs between your examples (Mann= vs. #Mann=), but you can just change that in the regex.
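If you need every block rather than just the first match, here is a minimal sketch using re.findall with a non-greedy group, assuming each block starts at a 'Sta/Elev=' line and ends at the next 'Mann' line (with or without a leading '#'), using your 'a.g01' file:

import re

with open('a.g01', 'r') as f:
    contents = f.read()

# One match per Sta/Elev ... Mann block: DOTALL lets '.' cross newlines,
# and the non-greedy '*?' stops at the nearest Mann line.
blocks = re.findall(r'#?Sta/Elev=\s*\d+(.*?)#?Mann', contents, re.DOTALL)

sta_elev = []
for block in blocks:
    numbers = list(map(float, block.split()))
    # Group the flat list of numbers into (station, elevation) pairs.
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    sta_elev.append(pairs)

print(len(sta_elev), 'blocks found')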
I'm trying to create a choropleth map with zipcode and temperature data to overlay on the counties; however, I keep getting a Javascript error when encoding my data. I've looked at the GitHub issues and found that this sometimes happens when pulling in dataframes, but I also tried using a CSV file as the data source. It seems that the :Q encoding is not recognizing the temp column as a number?
import altair as alt
from vega_datasets import data
counties = alt.topo_feature(data.us_10m.url, 'counties')
source = max_2007_df
alt.Chart(counties).mark_geoshape().encode(
    color='temp:Q'
).transform_lookup(
    lookup='zipcode',
    from_=alt.LookupData(source, 'zipcode', ['temp'])
).project(
    type='albersUsa'
).properties(
    width=500,
    height=300
)
Javascript Error: Failed to execute 'addColorStop' on 'CanvasGradient': The provided float value is non-finite.. This usually means there's a typo in your chart specification. See the JavaScript console for the full traceback.
This is part of the max_2007_df dataframe:
zipcode temp
0 1002 33.6
1 1011 31.8
2 1013 34.1
3 1098 31.9
4 1108 34.3
5 1129 34.1
6 1453 33.3
7 1545 33.5
8 1568 33.4
9 1571 32.8
10 1603 33.5
11 1604 33.8
12 1702 35.5
13 1721 35.5
14 1746 35.5
15 1752 35.5
16 1760 35.5
17 1772 34.4
18 1773 35.5
19 1776 35.5
The map data you reference, data.us_10m, does not have any zipcode information, so it will not work to join this data on zipcode.
If you would like to make the chart you have in mind, you'll need to find a source of geographic data indexed by zipcode rather than by county.
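As one hypothetical alternative until you have zipcode-level geometry, you could plot the zip codes as points over a state map, joining latitude/longitude from the vega_datasets zipcodes table in pandas first (this sketch assumes that table exposes zip_code, latitude, and longitude columns, and reuses your max_2007_df):

import altair as alt
import pandas as pd
from vega_datasets import data

# Join lat/long onto the temperature data; cast zip_code to int so it
# matches the integer zipcodes shown in max_2007_df (an assumption).
zips = data.zipcodes()
zips['zip_code'] = zips['zip_code'].astype(int)
source = max_2007_df.merge(zips[['zip_code', 'latitude', 'longitude']],
                           left_on='zipcode', right_on='zip_code', how='left')

states = alt.topo_feature(data.us_10m.url, 'states')

background = alt.Chart(states).mark_geoshape(
    fill='lightgray', stroke='white'
).project(type='albersUsa').properties(width=500, height=300)

points = alt.Chart(source).mark_circle(size=30).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    color='temp:Q',
    tooltip=['zipcode:N', 'temp:Q']
)

background + points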
My dataframe is like this:
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names in the actors_list column. I found the code below; do you have a better suggestion, especially for big data?
import pandas as pd

df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')", '').str.lower().str.split(',', expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re

p = re.compile("""^u['"](.*)['"]$""")

ser = pd.Series(list(chain.from_iterable(
    x.title().split(', ') for x in df.actors_list.str[1:-1]
))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
It's always better to go with plain Python than to depend on pandas here, since pandas consumes a huge amount of memory when the lists are large.
If the longest list has 1000 elements, then every shorter list gets padded with NaNs when you use expand=True, which is a waste of memory. Try this instead.
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
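For reference, the same plain-Python count can be written more compactly with collections.Counter; a minimal sketch, assuming actors_list has already been converted to real Python lists as in the %%timeit cell above:

from collections import Counter
from itertools import chain

# Flatten all the per-movie actor lists and count occurrences in one pass.
words = Counter(chain.from_iterable(df['actors_list']))
print(words.most_common(5))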
I would use ast.literal_eval to convert the list-like strings to actual lists:
import ast

df.actors_list = df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
According to this benchmark I got the chart below, in which:
coldspeed's code is wen2()
Dark's code is wen4()
my code is wen1()
W-B's code is wen3()
I have the following (sample) dataframe:
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
....
There are no duplicate values in the index.
I am in the unenviable position of having to append to this dataframe using elements from a number of other dataframes. So I'm appending as follows:
names_df = names_df.append({'Age': someage,
                            'height': someheight,
                            'weight': someweight,
                            'haircolor': somehaircolor},
                           ignore_index=True)
My question is: using this method, how do I set the new index value in names_df to the person's name?
The only thing I can think of is to reset the df index before I append and then re-set it afterward. Ugly. There has to be a better way.
Thanks in advance.
I'm not sure what format the data you're appending to the original df comes in, but one way is as follows:
df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
new_name someage someheight someweight somehaircolor
Time Testing:
%timeit df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
1000 loops, best of 3: 408 µs per loop
%timeit df.append(pd.DataFrame({'Age': 'someage', 'height': 'someheight','weight':'someweight','haircolor': 'somehaircolor'}, index=['some_person']))
100 loops, best of 3: 2.59 ms per loop
Here's another way using append. Instead of passing a dictionary, pass a dataframe (created from a dictionary) while specifying the index:
names_df = names_df.append(pd.DataFrame({'Age': 'someage',
                                         'height': 'someheight',
                                         'weight': 'someweight',
                                         'haircolor': 'somehaircolor'},
                                        index=['some_person']))
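Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so going forward the same pattern can be written with pd.concat; a minimal sketch reusing the placeholder values from above:

import pandas as pd

# Same idea as passing a one-row DataFrame to append, but via pd.concat,
# which is the supported path in pandas 2.x (DataFrame.append was removed).
new_row = pd.DataFrame({'Age': 'someage',
                        'height': 'someheight',
                        'weight': 'someweight',
                        'haircolor': 'somehaircolor'},
                       index=['some_person'])
names_df = pd.concat([names_df, new_row])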