I hope this has not been posted yet; I have not found anything that helped me. So I have this data frame df:
Id Numero Voie CodePostal Commune \
1 940010005V-59 59 Rue d'Ablon 94480 Ablon-sur-Seine
2 940010005V-61 61 Rue d'Ablon 94480 Ablon-sur-Seine
3 940010005V-65 65 Rue d'Ablon 94480 Ablon-sur-Seine
Source Latitude Longitude \
1 C+O 48.721350 2.414291
2 C+O 48.722434 2.413538
3 OSM 48.721141 2.415030
Adresse AdresseGPS LatitudeGPS \
1 59 Rue d'Ablon, Ablon-sur-Seine, France 0.0
2 61 Rue d'Ablon, Ablon-sur-Seine, France 0.0
3 65 Rue d'Ablon, Ablon-sur-Seine, France 0.0
LongitudeGPS
1 0.0
2 0.0
3 0.0
I imported it from a csv and added the last three columns using
df = df.assign(AdresseGPS="", LatitudeGPS=0., LongitudeGPS=0.)
What i want to do is modify these last three columns using a function
def funcRow(dataIn):
    dataOut = dataIn
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut
However when I do
df.ix[1,] = funcRow(df.ix[1,])
I get the following error: IndexError: tuple index out of range
I printed both df.ix[1,] and funcRow(df.ix[1,]) and got the following:
print df.ix[1,]
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS
LatitudeGPS 0
LongitudeGPS 0
Name: 1, dtype: object
print funcRow(df.ix[1,])
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS t
LatitudeGPS 1
LongitudeGPS 0
Name: 1, dtype: object
I am quite new to using data frames with Python, so I provided lots of details; I am not sure if everything is relevant. I have tried this using other functions such as loc or iloc instead of ix, but I still get the same error.
Any advice would be very welcome :)
I think the "safest" way to solve this is with .loc[] instead of .ix[].
Try this:
def funcRow(dataIn):
    dataOut = dataIn
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut
df.loc[1,:] = funcRow(df.loc[1,:])
(In case you're not used to .loc[]: the first argument is the row selection, the second argument is the column selection, and passing ":" selects everything along that axis.)
When I run the code above I get a warning message, but it does return the updated dataframe if I print df.
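For a self-contained illustration, here is a minimal sketch of that pattern (the frame is reduced to just the three GPS columns from the question; the .copy() inside funcRow is my addition, so the function edits its own Series rather than the original row):
import pandas as pd

df = pd.DataFrame({'AdresseGPS': ['', ''],
                   'LatitudeGPS': [0., 0.],
                   'LongitudeGPS': [0., 0.]}, index=[1, 2])

def funcRow(dataIn):
    dataOut = dataIn.copy()       # explicit copy: funcRow edits its own Series
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut

df.loc[1, :] = funcRow(df.loc[1, :])
print(df)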
(Bonus: This blog post is an excellent reference when learning loc, iloc and ix: http://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)
According to the Documentation,
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
I think you want to modify the last three columns across the whole dataframe.
If so, you can try:
df.ix[:] = funcRow(df.ix[:])  # for all rows
or
df.ix[start:end] = funcRow(df.ix[start:end])  # for specific rows
or, if you want to access only a particular row, you can use this:
df.ix[n] = funcRow(df.ix[n])
I hope this helps you solve your problem.
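Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions the label-based equivalents of the snippets above would be:
df.loc[:] = funcRow(df.loc[:])                   # the whole frame
df.loc[start:end] = funcRow(df.loc[start:end])   # a label range
df.loc[n] = funcRow(df.loc[n])                   # a single row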
This should work:
df.ix[1] = funcRow(df.ix[1,])
I probably need to take a look at the source code to see why the following doesn't work:
df.ix[1,] = funcRow(df.ix[1,])
I am trying to add a string to a cell when that cell holds only one part, so that every value in the column has two comma-separated parts; that will let me create new columns later using str.split(). Here is what I am doing:
[i if len(df.address.str.split())>1 else 'NA '+ i for i in df.address ]
Here is a sample of the data:
store
Le rateau , 37 rue Lagarde
Cordonnerie Fabien, 42 penasse
33 Charles de Gaule # I want to add 'NA' to rows like this one, so I can have store & address later
LeClerc, 238 Le rabais
....
The desired output would look like this:
store
Le rateau , 37 rue Lagarde
Cordonnerie Fabien, 42 penasse
'NA', 33 Charles de Gaule
LeClerc, 238 Le rabais
Try it via boolean masking:
m = df['store'].str.split(',').str.len().eq(1)
# split the values on ',' and check whether the length is 1
Finally pass that mask:
df.loc[m, 'store'] = 'NA, ' + df.loc[m, 'store']
# pass the boolean Series (stored in the m variable) to the loc accessor and
# modify the values of the store column where the condition holds
Output of df:
store
0 Le rateau , 37 rue Lagarde
1 Cordonnerie Fabien, 42 penasse
2 NA, 33 Charles de Gaule
3 LeClerc, 238 Le rabais
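Putting it together, here is a runnable sketch of the whole approach, including the str.split step the question was ultimately after (the column names store_name and address are my own choice):
import pandas as pd

df = pd.DataFrame({'store': ['Le rateau , 37 rue Lagarde',
                             'Cordonnerie Fabien, 42 penasse',
                             '33 Charles de Gaule',
                             'LeClerc, 238 Le rabais']})

m = df['store'].str.split(',').str.len().eq(1)    # rows without a comma
df.loc[m, 'store'] = 'NA, ' + df.loc[m, 'store']  # prepend the missing store name

# every row now has exactly two parts, so one split yields two columns
df[['store_name', 'address']] = df['store'].str.split(',', n=1, expand=True)
print(df)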
We are discussing data that is imported from Excel:
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
Recently I asked in a post where the answers were clear, straightforward, and brought success:
Changing Values of elements in Pandas Datastructure
However, I went a few steps further and now have a similar (sic!) problem, where the assignment does not change anything.
Let's consider the data structure:
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some countries have a number appended to the name (like Algeria1 or Australia12).
I want to change those names to just Algeria, Australia, and so on.
There are 20 entries in total that are supposed to be changed.
I developed a method to do it, which fails at the last step:
for value in ene2['Country']:
    if type(value) == float:  # to cover NaN values
        continue
    x = re.findall(r"\D+\d", value)  # to find those countries/elements whose names contain a number
    while len(x) > 0:  # non-empty means a number is still present, otherwise the answer is [], which is 0
        for letters in x:  # to touch the letters
            right = letters[:-1]  # and get rid of the last number
            ene2.loc[ene2['Country'] == value, 'Country'] = right  # THIS IS THE ELEMENT WHICH FAILS <= it does not change the value
        x = re.findall(r"\D+\d", value)  # to bring the new value to the while loop
The code above should do the task and finally remove all the numbers from the names; however, the ene2.loc[...] line, which used to work previously, does nothing here where it is nested.
Why does this assignment not work, and how can I overcome the problem a) in an old-style way, b) in the pandas way?
The code suggests you already use pandas, so why not use the built-in str.replace method with a regex?
df = pd.DataFrame(data=["Afghanistan","Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '')
output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country_clean, dtype: object
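Applied back to the question's ene2 frame, the same one-liner strips the trailing digits in one pass (a sketch with made-up rows; NaN entries pass through .str methods untouched, so no float check is needed):
import pandas as pd

ene2 = pd.DataFrame({'Country': ['Afghanistan', 'Algeria1', 'Australia12', None]})
ene2['Country'] = ene2['Country'].str.replace(r'\d+$', '', regex=True)  # remove trailing digits
print(ene2)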
I need to create a new column with data extracted from other columns.
Name Surname Age
Nivea Jones 45
Kelly Pams 68
Matthew Currigan 24
...
I would like to create a new column with only the first letters of the name and surname, i.e.:
Name Surname Age Short FN
Nivea Jones 45 NJ
Kelly Pams 68 KP
Matthew Currigan 24 MC
...
I did as follows:
df['Short FN'] = df['Name'].str.get(0) + df['Surname'].str.get(0)
and it works well. However, I need to build a function that takes two columns (in this case, name and surname) as parameters:
def sh(x, y):
    df['Short FN'] = df[x].str.get(0) + df[y].str.get(0)
    return
and it does not work, probably because I am passing dataframe columns as parameters. Also, I do not know whether I should return anything, and if so, what.
Could you please explain how to create a function that takes columns as parameters and how to use it (it is not clear to me whether I need to iterate through the rows with a for loop)?
You can do this:
def sh(x, y):
    return x[0] + y[0]

df['Short'] = df.apply(lambda x: sh(x['Name'], x['Surname']), axis=1)
print(df)
Name Surname Age Short
0 Nivea Jones 45 NJ
1 Kelly Pams 68 KP
2 Matthew Currigan 24 MC
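A note on the design choice: df.apply(..., axis=1) calls sh once per row in Python, which is flexible when the per-row logic grows more complicated, but on large frames it will be noticeably slower than the vectorized .str.get(0) version from the question.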
There are several ways to do that. The simplest way, assuming df is global (as it seems to be in your case), is:
def short_name(col1, col2):
    return df[col1].str[0] + df[col2].str[0]
Calling short_name("Name", "Surname") produces:
0 NJ
1 KP
2 MC
dtype: object
You can now use it in whatever way you want. For example:
df["sn"] = short_name("Name", "Surname")
print(df)
# produces:
Name Surname Age sn
0 Nivea Jones 45 NJ
1 Kelly Pams 68 KP
2 Matthew Currigan 24 MC
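If you would rather not depend on a global df at all, here is a small variant that takes the frame as an explicit parameter (the signature below is just a suggestion):
import pandas as pd

df = pd.DataFrame({'Name': ['Nivea', 'Kelly', 'Matthew'],
                   'Surname': ['Jones', 'Pams', 'Currigan']})

def short_name(frame, col1, col2):
    # vectorized over the whole column: no row loop is needed
    return frame[col1].str[0] + frame[col2].str[0]

df['Short FN'] = short_name(df, 'Name', 'Surname')
print(df)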
I am new to this field and stuck on this problem. I have two datasets.
all_batsman_df, which has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, which has 31 columns:
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
Now I want to know: how can I merge the two using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding a new column after matching names in both datasets. I know this is not a good approach; kindly suggest whether I can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
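For a self-contained sketch of that idea with made-up rows (note that get_close_matches can return an empty list, so the helper below falls back to the original name rather than indexing [0] directly):
import difflib
import pandas as pd

df_a = pd.DataFrame({'name': ['Glenn Davis', 'Don Mattingly'], 'years': ['1991', '1991']})
df_b = pd.DataFrame({'Name': ['Glen Davis', 'Don Mattingly'], 'Year': ['1991', '1991']})

def closest(x, candidates):
    matches = difflib.get_close_matches(x, candidates, n=1)
    return matches[0] if matches else x  # fall back to the original name

df_b['Name'] = df_b['Name'].apply(lambda x: closest(x, df_a['name']))
merged_df = df_a.merge(df_b, left_on=['name', 'years'], right_on=['Name', 'Year'])
print(merged_df)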
Let me approach your problem by assuming you need a data set with 2 columns, those being 1. 'year' and 2. 'name'.
1. We will first rename all the names which are wrong.
I assume you know all the wrong names in all_batting_statistics_df; fix them using this:
all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work from the smaller data set, the one whose names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only 'year' and 'name'.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Name','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all 31 columns, so I left some out; you will have to add them to the code above.
3. We need to change the column names to match, i.e. 'year' and 'name', using the pandas DataFrame rename:
df_new_1 = all_batting_statistics_df.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, to merge them, we will use this:
all_batsman_df.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those advanced tools; if you like pandas, it's not that difficult and you will find a way. All the best!
So I'm new to Python and ran into a problem using pytrends. I'm trying to compare 5 search terms and store the sum in a CSV.
The problem I'm having right now is I can't seem to isolate an individual element returned. I have the data, I can see it, but I can't seem to isolate an element to be able to do anything meaningful with it.
I found a suggestion elsewhere to use iloc, but that doesn't return anything for what's shown, and if I pass only one parameter it seems to display everything.
It feels really dumb, but I just can't figure this out, nor can I find anything online.
from pytrends.request import TrendReq
import csv
import pandas
import numpy
import time
# Login to Google. Only need to run this once, the rest of requests will use the same session.
pytrend = TrendReq(hl='en-US', tz=360)
with open('database.csv', "r") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)

comparator_string = data[1][0] + " opening"
print("comparator: ", comparator_string, "\n")

# Initialize the search term list with comparator_string as the first item, plus 4 search terms
kw_list = []
kw_list.append(comparator_string)
for x in range(1, 5, 1):
    search_string = data[x][0] + " opening"
    kw_list.append(search_string)
# Create payload and capture API tokens. Only needed for interest_over_time(), interest_by_region() & related_queries()
pytrend.build_payload(kw_list, cat=0, timeframe='today 3-m',geo='',gprop='')
# Interest Over Time
interest_over_time_df = pytrend.interest_over_time()
#time.sleep(randint(5, 10))
#printer = interest_over_time_df.sum()
printer = interest_over_time_df.iloc[1,1]
print("printer: \n",printer)
pytrends returns pandas.DataFrame objects, and there are a number of ways to go about indexing and selecting data.
Let's take the following bit of code, for example:
kw_list = ['apples', 'oranges', 'bananas']
interest_over_time_df = pytrend.interest_over_time()
If you run print(interest_over_time_df) you will see something like this:
apples oranges bananas isPartial
date
2017-10-23 77 15 43 False
2017-10-24 77 15 46 False
2017-10-25 78 14 41 False
2017-10-26 78 14 43 False
2017-10-27 81 17 42 False
2017-10-28 91 17 39 False
...
You'll see an index column date on the left, as well as the four data columns apples, oranges, bananas, and isPartial. You can ignore isPartial for now: that field lets you know if the data point is complete for that particular date.
At this point you can select data by column, by columns + index, etc.:
>>> interest_over_time_df['apples']
date
2017-10-23 77
2017-10-24 77
2017-10-25 78
2017-10-26 78
2017-10-27 81
>>> interest_over_time_df['apples']['2017-10-26']
78
>>> interest_over_time_df.iloc[4] # Give me row 4
apples 81
oranges 17
bananas 42
isPartial False
Name: 2017-10-27 00:00:00, dtype: object
>>> interest_over_time_df.iloc[4, 0] # Give me row 4, value 0
81
You may be interested in pandas.DataFrame.loc, which selects rows by label, as opposed to pandas.DataFrame.iloc, which selects rows by integer:
>>> interest_over_time_df.loc['2017-10-26']
apples 78
oranges 14
bananas 43
isPartial False
Name: 2017-10-26 00:00:00, dtype: object
>>> interest_over_time_df.loc['2017-10-26', 'apples']
78
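Since the original goal was to sum each term and store the result in a CSV, a possible last step would be (the file name is an assumption):
# drop the isPartial flag, sum each column, and write the totals out
sums = interest_over_time_df.drop(columns=['isPartial']).sum()
sums.to_csv('term_sums.csv', header=False)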
Hope that helps.