Modify values in a list comprehension - python

I am trying to prepend a string to a cell when splitting that cell yields only one part, so that every value in the column has two comma-separated groups. That will later help me create new columns using str.split(). Here is what I am doing:
[i if len(df.address.str.split())>1 else 'NA '+ i for i in df.address ]
Here is a sample of the data:
store
Le rateau , 37 rue Lagarde
Cordonnerie Fabien, 42 penasse
33 Charles de Gaule # I want to add 'NA' to rows like this one, so I can have store & address later
LeClerc, 238 Le rabais
....
The desired output would look like this:
store
Le rateau , 37 rue Lagarde
Cordonnerie Fabien, 42 penasse
'NA', 33 Charles de Gaule
LeClerc, 238 Le rabais

Try via boolean masking:
m=df['store'].str.split(',').str.len().eq(1)
# split the values on ',' and check whether each row's length is 1
Finally pass that mask:
df.loc[m,'store']='NA, '+df.loc[m,'store']
# pass the boolean series stored in m to the loc accessor and
# prepend 'NA, ' to the 'store' column wherever the condition holds
Output of df:
store
0 Le rateau , 37 rue Lagarde
1 Cordonnerie Fabien, 42 penasse
2 NA, 33 Charles de Gaule
3 LeClerc, 238 Le rabais
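Putting the answer together end-to-end, here is a minimal runnable sketch; the column name store is taken from the question, while the final shop/address column names are just illustrative:

```python
import pandas as pd

# Toy frame mirroring the sample data from the question
df = pd.DataFrame({'store': [
    'Le rateau , 37 rue Lagarde',
    'Cordonnerie Fabien, 42 penasse',
    '33 Charles de Gaule',
    'LeClerc, 238 Le rabais',
]})

# Rows whose value has no comma-separated store part
m = df['store'].str.split(',').str.len().eq(1)

# Prepend the placeholder only on those rows
df.loc[m, 'store'] = 'NA, ' + df.loc[m, 'store']

# Now every row splits cleanly into two parts
df[['shop', 'address']] = df['store'].str.split(',', n=1, expand=True)
```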

How to remove rows based on another Dataframe?

I've been working with pandas for a while but I haven't figured out how to achieve the following result.
DF A consists of records that contain active and inactive LOBs. I want to remove the inactive LOBs, but which LOBs are inactive differs between states.
DF B has states as columns, with each column listing that state's inactive LOBs.
So I want a resulting DF that does not contain any inactive LOBs.
For example, LOB 78 may be inactive in OH but active in MI.
Reasoning:
In DF A you can see a record with state OH and LOB 78. I do not want this record in DF C, because it is considered inactive: 78 exists in the OH column of DF B.
In DF A you can also see a record with state MI and LOB 78. I do want that record in DF C, because there is no 78 in the MI column of DF B.
DF A has 500k records in it. Time to run isn't an issue, but it would be great if it was less than 5 minutes.
(I read DF B from a list of dicts: [{state: [list of inactive LOBs]}].)
Sample DF A:
Name, state, LOB, ID
a , OH , 66 , 7979
aa , OH , 78 , 12341
bas , OH , 67 , 13434
basd, VT , 99 , 1241234
badf, MI , 77 , 12341234
bbdf, MI , 78 , 12341234
caff, VT , 66 , 2134
cdse, AZ , 01 , 232
sample DF B:
OH , VT , MI
66 , 99 , 77
78 , 23
I want a DF C:
Name, state, LOB, ID
bas , OH , 67 , 13434
bbdf, MI , 78 , 12341234
caff, VT , 66 , 2134
cdse, AZ , 01 , 232
IIUC, you can do an anti left join by first melting dfb:
dfc = pd.merge(
    dfa,
    pd.melt(dfb, var_name="state", value_name="LOB"),
    on=["state", "LOB"],
    how="left",
    indicator=True,
).query('_merge != "both"').drop("_merge", axis=1)
print(dfc)
Name state LOB ID
2 bas OH 67 13434
5 bbdf MI 78 12341234
6 caff VT 66 2134
7 cdse AZ 1 232
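For reference, here is a self-contained version of the same anti-join; the frames are rebuilt from the question's samples, and the melted table is cleaned of its NaN padding so the join keys stay integers:

```python
import pandas as pd

# Rebuild the sample frames from the question
dfa = pd.DataFrame({
    'Name': ['a', 'aa', 'bas', 'basd', 'badf', 'bbdf', 'caff', 'cdse'],
    'state': ['OH', 'OH', 'OH', 'VT', 'MI', 'MI', 'VT', 'AZ'],
    'LOB': [66, 78, 67, 99, 77, 78, 66, 1],
    'ID': [7979, 12341, 13434, 1241234, 12341234, 12341234, 2134, 232],
})
dfb = pd.DataFrame({'OH': [66, 78], 'VT': [99, 23], 'MI': [77, None]})

# Long-form list of inactive (state, LOB) pairs; drop the NaN padding
inactive = pd.melt(dfb, var_name='state', value_name='LOB').dropna()
inactive['LOB'] = inactive['LOB'].astype(int)

# Left join with indicator, then keep only the unmatched (active) rows
dfc = (pd.merge(dfa, inactive, on=['state', 'LOB'], how='left', indicator=True)
       .query('_merge == "left_only"')
       .drop('_merge', axis=1))
```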
You can use multi-index to achieve this as follows:
First, index A using both state and LOB:
A2 = A.set_index(['state', 'LOB'])
Then remove the rows you don't want in A:
# Build the (state, LOB) pairs straight from the list of dicts, without converting it to a DataFrame
to_remove = sum([[(list(d.keys())[0], vi) for vi in list(d.values())[0]] for d in B], [])
C = A2.loc[list(set(A2.index) - set(to_remove))]
After this C will contain only the rows you want. Let me know if it helps.
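A runnable sketch of this set-difference approach, assuming B really is the list of dicts the question mentions (only two sample rows of A and a few sample LOBs shown):

```python
import pandas as pd

# Two sample rows of A, one inactive in its state, one not
A = pd.DataFrame({
    'Name': ['aa', 'bbdf'],
    'state': ['OH', 'MI'],
    'LOB': [78, 78],
    'ID': [12341, 12341234],
})
# B as a list of dicts: {state: [inactive LOBs]}
B = [{'OH': [66, 78]}, {'VT': [99, 23]}, {'MI': [77]}]

# Index A by (state, LOB) so each row is addressable as a pair
A2 = A.set_index(['state', 'LOB'])

# Flatten B into the (state, LOB) pairs to drop
to_remove = [(state, lob) for d in B for state, lobs in d.items() for lob in lobs]

# Keep only the pairs that are not marked inactive
C = A2.loc[list(set(A2.index) - set(to_remove))].reset_index()
```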

python access a column after groupby

I would like to replace null values of stadium attendance (affluence in French) with their means, so I do this to get the mean by season/team:
test = data.groupby(['season','domicile']).agg({'affluence':'mean'})
This code works and gives me what I want (data is a DataFrame):
affluence
season domicile
1999 AS Monaco 10258.647059
AS Saint-Etienne 27583.375000
FC Nantes 28334.705882
Girondins de Bordeaux 30084.941176
Montpellier Hérault SC 13869.312500
Olympique Lyonnais 35453.941176
Olympique de Marseille 51686.176471
Paris Saint-Germain 42792.647059
RC Strasbourg Alsace 19845.058824
Stade Rennais FC 13196.812500
2000 AS Monaco 8917.937500
AS Saint-Etienne 26508.750000
EA Guingamp 13056.058824
FC Nantes 31913.235294
Girondins de Bordeaux 29371.588235
LOSC 16793.411765
Olympique Lyonnais 34564.529412
Olympique de Marseille 50755.176471
Paris Saint-Germain 42716.823529
RC Strasbourg Alsace 13664.875000
Stade Rennais FC 19264.062500
Toulouse FC 19926.294118
....
Now I would like to filter on the season and the team, for example test[test.season == 1999]. However, this doesn't work because I only have the one column 'affluence'. It gives me the error:
'DataFrame' object has no attribute 'season'
I tried :
test = data[['season','domicile','affluence']].groupby(['season','domicile']).agg({'affluence':'mean'})
which gives the same result as above. So I thought of maybe indexing the season/team, but how? And after that, how do I access it?
Thanks
Doing test = data.groupby(['season','domicile'], as_index=False).agg({'affluence':'mean'}) should do the trick for what you're trying to do.
The parameter as_index=False is particularly useful when you do not want to deal with MultiIndexes.
Example:
import pandas as pd
data = {
'A' : [0, 0, 0, 1, 1, 1, 2, 2, 2],
'B' : list('abcdefghi')
}
df = pd.DataFrame(data)
print(df)
# A B
# 0 0 a
# 1 0 b
# 2 0 c
# 3 1 d
# 4 1 e
# 5 1 f
# 6 2 g
# 7 2 h
# 8 2 i
grp_1 = df.groupby('A').count()
print(grp_1)
# B
# A
# 0 3
# 1 3
# 2 3
grp_2 = df.groupby('A', as_index=False).count()
print(grp_2)
# A B
# 0 0 3
# 1 1 3
# 2 2 3
After the groupby operation, the columns you refer to in the groupby become the index. You can access the index via df.index (or test.index in your case).
In your case, you created a MultiIndex. A detailed description of how to handle a dataframe with a MultiIndex can be found in the pandas documentation.
However, you could recreate a standard dataframe again by using:
df = pd.DataFrame({
    'season': test.index.get_level_values('season'),
    'domicile': test.index.get_level_values('domicile'),
    'affluence': test['affluence'].values,
})
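Alternatively, reset_index() turns the index levels back into ordinary columns in one step. A small sketch with made-up attendance numbers:

```python
import pandas as pd

# Toy frame standing in for `data`; the groupby reproduces the MultiIndex
data = pd.DataFrame({
    'season': [1999, 1999, 2000],
    'domicile': ['AS Monaco', 'FC Nantes', 'AS Monaco'],
    'affluence': [10000, 28000, 9000],
})
test = data.groupby(['season', 'domicile']).agg({'affluence': 'mean'})

# Move season/domicile out of the index into regular columns
flat = test.reset_index()

# Filtering now works directly on the columns
subset = flat[flat.season == 1999]
```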

Modify dataframe row - Pandas Python

I hope this has not been posted yet; I have not found anything that helped me. So I have this data frame df:
Id Numero Voie CodePostal Commune \
1 940010005V-59 59 Rue d'Ablon 94480 Ablon-sur-Seine
2 940010005V-61 61 Rue d'Ablon 94480 Ablon-sur-Seine
3 940010005V-65 65 Rue d'Ablon 94480 Ablon-sur-Seine
Source Latitude Longitude \
1 C+O 48.721350 2.414291
2 C+O 48.722434 2.413538
3 OSM 48.721141 2.415030
Adresse AdresseGPS LatitudeGPS \
1 59 Rue d'Ablon, Ablon-sur-Seine, France 0.0
2 61 Rue d'Ablon, Ablon-sur-Seine, France 0.0
3 65 Rue d'Ablon, Ablon-sur-Seine, France 0.0
LongitudeGPS
1 0.0
2 0.0
3 0.0
I imported it from a csv and added the last three columns using
df = df.assign(AdresseGPS="",LatitudeGPS = 0.,LongitudeGPS = 0.)
What I want to do is modify these last three columns using a function:
def funcRow(dataIn):
    dataOut = dataIn
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut
However when I do
df.ix[1,] = funcRow(df.ix[1,])
I get the following error : IndexError: tuple index out of range
I printed both
df.ix[1,] & funcRow(df.ix[1,])
I get the following:
print df.ix[1,]
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS
LatitudeGPS 0
LongitudeGPS 0
Name: 1, dtype: object
print funcRow
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS t
LatitudeGPS 1
LongitudeGPS 0
Name: 1, dtype: object
I am quite new to using data frames with Python, so I provided lots of details; not sure if everything is relevant. I have tried this using other functions such as .loc or .iloc instead of .ix but still get the same error.
Any advice would be very welcome :)
I think the "safest" way to solve this is with .loc[] instead of .ix[].
Try this:
def funcRow(dataIn):
    dataOut = dataIn
    dataOut['AdresseGPS'] = 't'
    dataOut['LatitudeGPS'] = 1
    return dataOut
df.loc[1,:] = funcRow(df.loc[1,:])
(In case you're not used to .loc[]: the first argument is the row selection, the second argument is the column selection, and giving ":" means you choose all).
When I run the code above I get a warning message, but it does return the updated dataframe if I print df.
(Bonus: This blog post is an excellent reference when learning loc, iloc and ix: http://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)
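As a side note, when only a few cells of one row need changing, the helper function isn't strictly necessary. A toy sketch (column names assumed from the question, only a subset shown) of doing it in a single .loc assignment:

```python
import pandas as pd

# Minimal stand-in for the question's df
df = pd.DataFrame({
    'Id': ['940010005V-59', '940010005V-61'],
    'AdresseGPS': ['', ''],
    'LatitudeGPS': [0.0, 0.0],
    'LongitudeGPS': [0.0, 0.0],
}, index=[1, 2])

# Set several cells of one row in one .loc call: row label 1, two columns
df.loc[1, ['AdresseGPS', 'LatitudeGPS']] = ['t', 1]
```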
According to the Documentation,
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
I think you want to modify the last three columns across the whole dataframe.
If so, you can try:
df.ix[:] = funcRow(df.ix[:]) #for whole rows
or
df.ix[start:end]=funcRow(df.ix[start:end]) #for specific rows
or if you want to access only particular row then you can use this,
df.ix[n] = funcRow(df.ix[n])
I hope this helps you solve your problem.
This should work:
df.ix[1] = funcRow(df.ix[1,])
I probably need to take a look at the source code to see why the following doesn't work:
df.ix[1,] = funcRow(df.ix[1,])

How to add unprocessed strings to a DataFrame in python?

I have a string object that looks like this:
Numărul de camere
3 camere
Suprafaţa totală
77 m²
Suprafaţa bucătăriei
11 m²
Tipul clădirii
Dat în exploatare
Etaj
3
Locul de amplasare în casă
In mijlocul casei
Grup sanitar
separat
Balcon/lojă
2
Parcare
acoperită
Încălzire autonomă
✔
This is data parsed from a web site. I want to add the data to a DataFrame:
df = pd.DataFrame(columns=['ID', 'Numarul de camere', 'Suprafata totala',
                           'Suprafata bucatariei', 'Tipul cladirii', 'Etaj',
                           'Amplasarea in bloc', 'Grup sanitar', 'Balcon/loja',
                           'Parcare', 'Incalzire autonoma'])
Every second row is the value of the characteristic above it, and I want to put each value in its place in my DataFrame. How do I do this?
text = """Numărul de camere
3 camere
Suprafaţa totală
77 m²
Suprafaţa bucătăriei
11 m²
Tipul clădirii
Dat în exploatare
Etaj
3
Locul de amplasare în casă
In mijlocul casei
Grup sanitar
separat
Balcon/lojă
2
Parcare
acoperită
Încălzire autonomă
✔ """
# split the string into alternating label/value lines
s = text.split('\n')
import pandas as pd
d = {k: v for k, v in zip(s[0::2], s[1::2])}
df = pd.DataFrame([d])
print(df.head())
# if you want to preserve the order of the columns explicitly
df = pd.DataFrame([s[1::2]], columns=s[0::2], index=['Values'])
print(df.head())
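Since these listings are presumably scraped in a loop, a hedged sketch of collecting one row per listing might look like this (the two-label pages here are abbreviated samples, not the full scraped text):

```python
import pandas as pd

# Two scraped listings, each an alternating label/value string
pages = [
    "Etaj\n3\nParcare\nacoperită",
    "Etaj\n5\nParcare\nsubterană",
]

# Pair up label/value lines and collect one dict per listing
rows = []
for text in pages:
    s = text.split('\n')
    rows.append(dict(zip(s[0::2], s[1::2])))

# One DataFrame with a row per listing; missing labels become NaN
df = pd.DataFrame(rows)
```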

Remove rows where a column contains a specific substring [duplicate]

This question already has answers here:
Search for "does-not-contain" on a DataFrame in pandas
(9 answers)
Closed 2 years ago.
How do I eliminate rows that contain a word I don't want?
I have this DataFrame:
index price description
0 15 Kit 10 Esponjas Para Cartuchos Jato De Tinta ...
1 15 Snap Fill Para Cartuchos Hp 60 61 122 901 21 ...
2 16 Clips Para Cartuchos Hp 21 22 60 74 75 92 93 ...
I'm trying to remove the rows containing the word 'esponja'.
I want a DataFrame like this:
index price description
1 15 Snap Fill Para Cartuchos Hp 60 61 122 901 21 ...
2 16 Clips Para Cartuchos Hp 21 22 60 74 75 92 93 ...
I'm a newbie and I don't have any idea how to resolve this.
Create a boolean mask by checking for strings that contain 'Esponjas', then index into your dataframe with the negated mask.
df[~df['description'].str.contains('Esponjas')]
If you are unsure what's going on, print out what
df['description']
df['description'].str.contains('Esponjas')
~df['description'].str.contains('Esponjas')
do on their own. If you want to perform the substring-check case-insensitively, use case=False as a keyword argument to str.contains.
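A self-contained sketch of the whole thing, using the sample rows from the question and the case-insensitive variant (na=False additionally guards against missing descriptions):

```python
import pandas as pd

df = pd.DataFrame({
    'price': [15, 15, 16],
    'description': [
        'Kit 10 Esponjas Para Cartuchos Jato De Tinta',
        'Snap Fill Para Cartuchos Hp 60 61 122 901 21',
        'Clips Para Cartuchos Hp 21 22 60 74 75 92 93',
    ],
})

# True where the description mentions the word, matched case-insensitively
mask = df['description'].str.contains('esponja', case=False, na=False)

# Negate the mask to keep only the rows without the word
result = df[~mask]
```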
