I have a string object that looks like this:
Numărul de camere
3 camere
Suprafaţa totală
77 m²
Suprafaţa bucătăriei
11 m²
Tipul clădirii
Dat în exploatare
Etaj
3
Locul de amplasare în casă
In mijlocul casei
Grup sanitar
separat
Balcon/lojă
2
Parcare
acoperită
Încălzire autonomă
✔
This is data parsed from a web site. I want to add the data to a DataFrame:
df = pd.DataFrame(columns=['ID','Numarul de camere','Suprafata totala',
'Suprafata bucatariei','Tipul cladirii','Etaj',
'Amplasarea in bloc', 'Grup sanitar', 'Balcon/loja',
'Parcare', 'Incalzire autonoma'])
Every second line is a characteristic value, and I want to add each one to its place in my DataFrame. How can I do this?
text = """Numărul de camere
3 camere
Suprafaţa totală
77 m²
Suprafaţa bucătăriei
11 m²
Tipul clădirii
Dat în exploatare
Etaj
3
Locul de amplasare în casă
In mijlocul casei
Grup sanitar
separat
Balcon/lojă
2
Parcare
acoperită
Încălzire autonomă
✔ """
import pandas as pd
# split the string into alternating label/value lines
s = text.split('\n')
d = dict(zip(s[0::2], s[1::2]))
df = pd.DataFrame([d])
print(df.head())
# if you want to preserve the order of the columns explicitly
# (pd.DataFrame.from_items was removed in pandas 1.0, so build the frame directly)
df = pd.DataFrame([s[1::2]], columns=s[0::2], index=['Values'])
print(df.head())
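Since the target DataFrame in the question uses ASCII column names while the scraped labels carry diacritics, the parsed pairs can be mapped onto those columns. This is only a sketch: the `key_map` dictionary below is a hypothetical mapping you would extend to cover all the labels, and the list `s` stands in for the split text.

```python
import pandas as pd

# Hypothetical mapping from the scraped Romanian labels (with diacritics)
# to the ASCII column names of the target DataFrame.
key_map = {
    'Numărul de camere': 'Numarul de camere',
    'Suprafaţa totală': 'Suprafata totala',
    'Suprafaţa bucătăriei': 'Suprafata bucatariei',
}

s = ['Numărul de camere', '3 camere',
     'Suprafaţa totală', '77 m²',
     'Suprafaţa bucătăriei', '11 m²']

# pair up labels and values, renaming labels through the map
row = {key_map.get(k, k): v for k, v in zip(s[0::2], s[1::2])}
row['ID'] = 1

columns = ['ID', 'Numarul de camere', 'Suprafata totala', 'Suprafata bucatariei']
df = pd.DataFrame([row]).reindex(columns=columns)
print(df)
```

`reindex(columns=...)` forces the column order of the target frame regardless of the order the labels were scraped in.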
Related
I am trying to prepend a string to a cell when that cell holds only one group of strings, so that every row in the column has two comma-separated groups. That will let me create new columns later using str.split(). Here is what I am doing:
[i if len(df.address.str.split())>1 else 'NA '+ i for i in df.address ]
here is sample of the data:
store
Le rateau , 37 rue Lagarde
Cordonnerie Fabien, 42 penasse
33 Charles de Gaule # I want to add 'NA' to rows like this one, so I can have store & address later
LeClerc, 238 Le rabais
....
An output would like this:
store
Le rateau , 37 rue Lagarde
Cordonnerie Fabien, 42 penasse
'NA', 33 Charles de Gaule
LeClerc, 238 Le rabais
Try boolean masking:
m = df['store'].str.split(',').str.len().eq(1)
# split the values on ',' and check whether the result has length 1
Finally pass that mask:
df.loc[m, 'store'] = 'NA, ' + df.loc[m, 'store']
# pass the boolean Series (stored in m) to the .loc accessor and
# modify the 'store' values where the condition holds
output of df:
store
0 Le rateau , 37 rue Lagarde
1 Cordonnerie Fabien, 42 penasse
2 NA, 33 Charles de Gaule
3 LeClerc, 238 Le rabais
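Once every row has two comma-separated parts, the split into store and address columns the question was aiming for can be done with `str.split`; a minimal sketch (the column name `address` is my choice, not from the original data):

```python
import pandas as pd

df = pd.DataFrame({'store': ['Le rateau , 37 rue Lagarde',
                             'Cordonnerie Fabien, 42 penasse',
                             'NA, 33 Charles de Gaule']})

# split on the first comma only (n=1), so the address part stays in one piece
df[['store', 'address']] = df['store'].str.split(',', n=1, expand=True)
# drop stray spaces around both parts
df = df.apply(lambda col: col.str.strip())
print(df)
```

`expand=True` turns the split result into two columns that can be assigned back in one step.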
sample file:
03|02|2|02|F|3|47|P| |AG|AFL|24|20201016| 1 |West |CH|India - LA |CNDO
code:
df1 = pd.read_csv("GM3.txt",sep="|",dtype=object)
df1.to_csv('file_validation.csv',index=None)
output in csv:
3 2 2 2 F 3 47 P AG AFL 24 20201016 1 West CH India - LA CNDO 302
When I print df1.to_csv() it gives me the output below:
0 03 02 2 CH India - LA CNDO
I want the csv to store the values as strings, i.e. 03,02, instead of as integers.
Your code works for me:
import pandas as pd
df1 = pd.read_csv("GM3.txt",sep="|",dtype=object)
df1.to_csv('file_validation.csv',index=None)
and produces a csv with the leading zeros intact.
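A self-contained check, using an in-memory buffer as a stand-in for GM3.txt (header row and column names assumed here), suggests that dtype=str keeps the leading zeros end to end; if the csv still shows 3 instead of 03, the viewer (e.g. Excel) is most likely reinterpreting the values on display:

```python
import io
import pandas as pd

# in-memory stand-in for the pipe-separated file, with a hypothetical header
data = "c1|c2|c3\n03|02|2\n"

df1 = pd.read_csv(io.StringIO(data), sep='|', dtype=str)
out = df1.to_csv(index=False)
print(out)  # leading zeros survive because every column was read as str
```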
I would like to replace null values of stadium attendance (affluence in French) with their means. So I do this to get the mean by season/team:
test = data.groupby(['season','domicile']).agg({'affluence':'mean'})
This code works and gives me what I want (data is a DataFrame):
affluence
season domicile
1999 AS Monaco 10258.647059
AS Saint-Etienne 27583.375000
FC Nantes 28334.705882
Girondins de Bordeaux 30084.941176
Montpellier Hérault SC 13869.312500
Olympique Lyonnais 35453.941176
Olympique de Marseille 51686.176471
Paris Saint-Germain 42792.647059
RC Strasbourg Alsace 19845.058824
Stade Rennais FC 13196.812500
2000 AS Monaco 8917.937500
AS Saint-Etienne 26508.750000
EA Guingamp 13056.058824
FC Nantes 31913.235294
Girondins de Bordeaux 29371.588235
LOSC 16793.411765
Olympique Lyonnais 34564.529412
Olympique de Marseille 50755.176471
Paris Saint-Germain 42716.823529
RC Strasbourg Alsace 13664.875000
Stade Rennais FC 19264.062500
Toulouse FC 19926.294118
....
So now I would like to filter on the season and the team, for example test[test.season == 1999]. However this doesn't work, because I only have the one column 'affluence'. It gives me the error:
'DataFrame' object has no attribute 'season'
I tried:
test = data[['season','domicile','affluence']].groupby(['season','domicile']).agg({'affluence':'mean'})
which gives the same result as above. So I thought of maybe indexing by season/team, but how? And after that, how do I access it?
Thanks
Doing test = data.groupby(['season','domicile'], as_index=False).agg({'affluence':'mean'}) should do the trick for what you're trying to do.
The parameter as_index=False is particularly useful when you do not want to deal with MultiIndexes.
Example:
import pandas as pd
data = {
'A' : [0, 0, 0, 1, 1, 1, 2, 2, 2],
'B' : list('abcdefghi')
}
df = pd.DataFrame(data)
print(df)
# A B
# 0 0 a
# 1 0 b
# 2 0 c
# 3 1 d
# 4 1 e
# 5 1 f
# 6 2 g
# 7 2 h
# 8 2 i
grp_1 = df.groupby('A').count()
print(grp_1)
# B
# A
# 0 3
# 1 3
# 2 3
grp_2 = df.groupby('A', as_index=False).count()
print(grp_2)
# A B
# 0 0 3
# 1 1 3
# 2 2 3
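Coming back to the original goal of replacing missing affluence values with the per-(season, domicile) mean: groupby().transform() can do that in one pass, without building a separate lookup table. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'season':    [1999, 1999, 1999, 2000, 2000],
    'domicile':  ['AS Monaco', 'AS Monaco', 'AS Monaco', 'LOSC', 'LOSC'],
    'affluence': [10000.0, np.nan, 11000.0, 16000.0, np.nan],
})

# fill each NaN with the mean of its (season, domicile) group
data['affluence'] = (data.groupby(['season', 'domicile'])['affluence']
                         .transform(lambda x: x.fillna(x.mean())))
print(data)
```

transform returns a Series aligned with the original rows, which is what makes the direct column assignment work.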
After the groupby operation, the columns you refer to in the groupby become the index. You can access the index via df.index (or test.index in your case).
In your case, you created a MultiIndex. A detailed description of how to handle a dataframe with a MultiIndex can be found in the pandas documentation.
However, you could recreate a standard dataframe again by using:
df = pd.DataFrame({
'season': test.index.get_level_values('season'),
'domicile': test.index.get_level_values('domicile'),
'affluence': test.affluence}
)
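If you keep the MultiIndex in place instead, the selection the question asks about can also be done directly on the index; a small sketch with a few of the values from the question:

```python
import pandas as pd

test = pd.DataFrame(
    {'affluence': [10258.6, 27583.4, 8917.9]},
    index=pd.MultiIndex.from_tuples(
        [(1999, 'AS Monaco'), (1999, 'AS Saint-Etienne'), (2000, 'AS Monaco')],
        names=['season', 'domicile']),
)

# all teams for one season: cross-section on the 'season' level
season_1999 = test.xs(1999, level='season')

# one (season, team) cell: .loc with the full index tuple
monaco_1999 = test.loc[(1999, 'AS Monaco'), 'affluence']
print(season_1999)
print(monaco_1999)
```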
My code merges two files into one. The problem is that when I add one more column to my files, the result is not what I expect. I use Python 2.7.
code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np
# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")
df = (pd.concat([df1,df2])
.set_index(["Cliente",'Fecha'])
.stack()
.unstack(1)
.sort_index(ascending=(True, False)))
m = df.index.get_level_values(1) == 'Impresiones'
df.index = np.where(m, 'Impresiones', df.index.get_level_values(0))
# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
engine='xlsxwriter',
date_format='dd/mm/yyyy',
datetime_format='dd/mm/yyyy') as writer:
df.to_excel(writer, sheet_name='Sheet1')
archivo1:
Fecha Cliente Impresiones Impresiones 2 Revenue
20/12/17 Jose 1312 35 $12
20/12/17 Martin 12 56 $146
20/12/17 Pedro 5443 124 $1,256
20/12/17 Esteban 667 1235 $1
archivo2:
Fecha Cliente Impresiones Impresiones 2 Revenue
21/12/17 Jose 25 5 $2
21/12/17 Martin 6347 523 $123
21/12/17 Pedro 2368 898 $22
21/12/17 Esteban 235 99 $7,890
Expected result:
I tried with:
m1 = df.index.get_level_values(1) == 'Impresiones 2'
df.index = np.where(m1, 'Impresiones 2', df.index.get_level_values(0))
but I get this error: IndexError: Too many levels: Index has only 1 level, not 2
The first bit of the solution is similar to jezrael's answer to your previous question, using concat + set_index + stack + unstack + sort_index.
df = pd.concat([df1, df2])\
.set_index(['Cliente', 'Fecha'])\
.stack()\
.unstack(-2)\
.sort_index(ascending=[True, False])
Now comes the challenging part: we have to incorporate the names from the 0th level into the 1st level, and then reset the index.
I use np.insert to insert names above the revenue entry in the index.
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
Now, I create a new MultiIndex which I then use to reindex df -
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
Now, drop the extra level -
df.index = df.index.droplevel()
df
Fecha 20/12/17 21/12/17
Esteban
Revenue $1 $7,890
Impresiones2 1235 99
Impresiones 667 235
Jose
Revenue $12 $2
Impresiones2 35 5
Impresiones 1312 25
Martin
Revenue $146 $123
Impresiones2 56 523
Impresiones 12 6347
Pedro
Revenue $1,256 $22
Impresiones2 124 898
Impresiones 5443 2368
I hope this has not been posted yet; I have not found anything that helped me. So I have this data frame df:
Id Numero Voie CodePostal Commune \
1 940010005V-59 59 Rue d'Ablon 94480 Ablon-sur-Seine
2 940010005V-61 61 Rue d'Ablon 94480 Ablon-sur-Seine
3 940010005V-65 65 Rue d'Ablon 94480 Ablon-sur-Seine
Source Latitude Longitude \
1 C+O 48.721350 2.414291
2 C+O 48.722434 2.413538
3 OSM 48.721141 2.415030
Adresse AdresseGPS LatitudeGPS \
1 59 Rue d'Ablon, Ablon-sur-Seine, France 0.0
2 61 Rue d'Ablon, Ablon-sur-Seine, France 0.0
3 65 Rue d'Ablon, Ablon-sur-Seine, France 0.0
LongitudeGPS
1 0.0
2 0.0
3 0.0
I imported it from a csv and added the last three columns using
df = df.assign(AdresseGPS="",LatitudeGPS = 0.,LongitudeGPS = 0.)
What I want to do is modify these last three columns using a function:
def funcRow(dataIn):
dataOut = dataIn
dataOut['AdresseGPS'] = 't'
dataOut['LatitudeGPS'] = 1
return(dataOut)
However when I do
df.ix[1,] = funcRow(df.ix[1,])
I get the following error : IndexError: tuple index out of range
I printed both
df.ix[1,] & funcRow(df.ix[1,])
I get the following:
print df.ix[1,]
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS
LatitudeGPS 0
LongitudeGPS 0
Name: 1, dtype: object
print funcRow(df.ix[1,])
Id 940010005V-59
Numero 59
Voie Rue d'Ablon
CodePostal 94480
Commune Ablon-sur-Seine
Source C+O
Latitude 48.7214
Longitude 2.41429
Adresse 59 Rue d'Ablon, Ablon-sur-Seine, France
AdresseGPS t
LatitudeGPS 1
LongitudeGPS 0
Name: 1, dtype: object
I am quite new to using data frames with Python, so I provided lots of details; I am not sure everything is relevant. I have tried this using other accessors such as loc or iloc instead of ix, but I still get the same error.
Any advice would be very welcome :)
I think the "safest" way to solve this is with .loc[] instead of .ix[].
Try this:
def funcRow(dataIn):
dataOut = dataIn
dataOut['AdresseGPS'] = 't'
dataOut['LatitudeGPS'] = 1
return(dataOut)
df.loc[1,:] = funcRow(df.loc[1,:])
(In case you're not used to .loc[]: the first argument is the row selection, the second argument is the column selection, and giving ":" means you choose all).
When I run the code above I get a warning message, but it does return the updated dataframe if I print df.
(Bonus: This blog post is an excellent reference when learning loc, iloc and ix: http://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)
According to the Documentation,
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
I think you want to modify the values of whole rows of the dataframe.
If so, you can try:
df.ix[:] = funcRow(df.ix[:]) #for whole rows
or
df.ix[start:end]=funcRow(df.ix[start:end]) #for specific rows
or if you want to access only particular row then you can use this,
df.ix[n] = funcRow(df.ix[n])
I hope this helps you solve your problem.
This should work:
df.ix[1] = funcRow(df.ix[1,])
I probably need to take a look at the source code to see why the following doesn't work:
df.ix[1,] = funcRow(df.ix[1,])
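For anyone reading this on a current pandas: .ix was deprecated in 0.20 and removed in 1.0, so only the .loc spelling still works. A minimal sketch of the same row update, using just the three added columns from the question (the two-row frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'AdresseGPS': ['', ''],
                   'LatitudeGPS': [0.0, 0.0]}, index=[1, 2])

def funcRow(row):
    out = row.copy()          # work on a copy so the original row is untouched
    out['AdresseGPS'] = 't'
    out['LatitudeGPS'] = 1
    return out

# .loc is the label-based replacement for the removed .ix
df.loc[1, :] = funcRow(df.loc[1, :])
print(df)
```

Assigning the returned Series back through .loc aligns it with the row's columns by label, so only row 1 changes.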