I have two dataframes:
In [31]: df1
Out[31]:
State Score
0 Arizona AZ 62
1 Georgia GG 47
2 Newyork NY 55
3 Indiana IN 74
4 Florida FL 31
and
In [30]: df3
Out[30]:
letter number animal
0 c 3 cat
1 d 4 dog
I want to obtain a csv like this:
1 State Score
2 Arizona AZ 62
3 Georgia GG 47
4 Newyork NY 55
5 Indiana IN 74
6 Florida FL 31
7
8 letter number animal
9 c 3 cat
10 d 4 dog
I was able to obtain it by creating an empty Series, appending it to the first dataframe, and then appending the second dataframe to the CSV, like this:
empty_df = pd.Series([], dtype=pd.StringDtype())
df1.append(empty_df, ignore_index=True).to_csv('foo.csv', index=False)
df3.to_csv('foo.csv', mode='a', index=False)
but I am getting a warning that 'append' is deprecated and that I should use 'concat' instead.
I tried this with concat:
pd.concat([df1, empty_df], ignore_index=True).to_csv('foo.csv', index=False)
df3.to_csv('foo.csv', mode='a', index=False)
but I do not get the empty line between the two sets of data.
Use a pandas.DataFrame filled with np.nan to create the empty row:
import numpy as np
empty_df = pd.DataFrame([[np.nan] * len(df1.columns)], columns=df1.columns)
pd.concat([df1, empty_df], ignore_index=True).to_csv('foo.csv', index=False)
df3.to_csv('foo.csv', mode='a', index=False)
# Output (in Excel): df1's rows, one blank row, then df3's header and rows.
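As a minimal alternative sketch (same file name as above), you can skip the placeholder row entirely and write the blank separator line yourself:
df1.to_csv('foo.csv', index=False)
with open('foo.csv', 'a') as f:
    f.write('\n')  # the blank line between the two blocks
df3.to_csv('foo.csv', mode='a', index=False)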
I have a dataframe that looks like the one below. I want to find the total number of "MISSING" values occurring in the dataframe, but also to group the data by columns A, B, and C.
A B C D E F total_missing
0 Miami Heat FL Basketball 21 MISSING MISSING 11
1 Miami Heat FL Basketball 17 MISSING MISSING 11
2 Miami Heat FL Basketball MISSING 12 23 11
3 Orlando Magic FL Basketball MISSING 5 MISSING 11
4 Orlando Magic FL Basketball 10 MISSING MISSING 11
5 Orlando Magic FL Basketball 5 MISSING MISSING 11
The code below only gives back the total count of the word "MISSING", which is 11, and it appears in all the rows. I just want the total of 11 to appear in one cell, and to be able to group it by columns A, B, and C, which I am not sure how to do with the isin function. Any help would be appreciated.
import pandas as pd
df = pd.read_excel(r'C:\ds_test\basketball.xlsx')
df['total_missing'] = df.isin(["MISSING"]).sum().sum()
print(df['total_missing'])
If I understand you correctly, you want the count per row first (your .sum().sum() collapses everything into a single scalar, which then broadcasts to every row):
dfn = (
    df.assign(Total_missing=df.eq('MISSING').sum(axis=1))  # count of 'MISSING' per row
      .groupby(['A', 'B', 'C'])['Total_missing'].sum()     # aggregate the row counts per group
      .reset_index()
)
A B C Total_missing
0 Miami Heat FL Basketball 5
1 Orlando Magic FL Basketball 6
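For reference, a reproducible sketch of the input (values transcribed from the question's table; 'MISSING' kept as a literal string):
import pandas as pd

df = pd.DataFrame({
    'A': ['Miami Heat'] * 3 + ['Orlando Magic'] * 3,
    'B': ['FL'] * 6,
    'C': ['Basketball'] * 6,
    'D': [21, 17, 'MISSING', 'MISSING', 10, 5],
    'E': ['MISSING', 'MISSING', 12, 5, 'MISSING', 'MISSING'],
    'F': ['MISSING', 'MISSING', 23, 'MISSING', 'MISSING', 'MISSING'],
})
# running the snippet above on this frame reproduces the 5 / 6 totals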
I have a dataset structured as below:
index country city Data
0 AU Sydney 23
1 AU Sydney 45
2 AU Unknown 2
3 CA Toronto 56
4 CA Toronto 2
5 CA Ottawa 1
6 CA Unknown 2
I want to replace 'Unknown' in the city column with the mode of the occurrences of cities per country. The result would be:
...
2 AU Sydney 2
...
6 CA Toronto 2
I can get the city modes with:
city_modes = df.groupby('country')['city'].apply(lambda x: x.mode().iloc[0])
And I can replace values with:
df['column'] = df.column.replace('Unknown', 'something')
But I can't work out how to combine these so that only the 'Unknown' values are replaced, per country, based on the mode of occurrence of cities.
Any ideas?
Use transform to get a Series the same size as the original DataFrame, and set the new values with numpy.where:
import numpy as np

city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
Or:
df.loc[df['city'] == 'Unknown', 'city'] = city_modes
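A quick check against the sample data from the question (a sketch; the index column is left implicit):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['AU', 'AU', 'AU', 'CA', 'CA', 'CA', 'CA'],
    'city': ['Sydney', 'Sydney', 'Unknown', 'Toronto', 'Toronto', 'Ottawa', 'Unknown'],
    'Data': [23, 45, 2, 56, 2, 1, 2],
})
city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
# rows 2 and 6 now read Sydney and Toronto, as in the expected result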
I have a dataframe like this:
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
and I want to get the country and product combination count per user, like below:
first split the countries, then combine with the product and take the count.
Wanted output:
   User Country_Product  Count
0   101        Brazil_x      1
1   101         India_x      2
2   102        Brazil_x      1
3   102        Brazil_z      2
4   102         India_x      1
5   102         India_z      1
6   102         Japan_x      1
Here is one way combining other answers on SO (which just shows the power of searching :D)
import pandas as pd
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
# Making use of: https://stackoverflow.com/a/37592047/7386332
j = (df.Country.str.split(',', expand=True).stack()
.reset_index(drop=True, level=1)
.rename('Country'))
df = df.drop('Country', axis=1).join(j)
# Reformat to get desired Country_Product
df = (df.drop(['Country', 'Product'], axis=1)
        .assign(Country_Product=['_'.join(i) for i in zip(df['Country'], df['Product'])]))
df2 = df.groupby(['User','Country_Product'])['User'].count().rename('Count').reset_index()
print(df2)
Returns:
User Country_Product Count
0 101 Brazil_x 1
1 101 India_x 2
2 102 Brazil_x 1
3 102 Brazil_z 2
4 102 India_x 1
5 102 India_z 1
6 102 Japan_x 1
How about get_dummies:
import numpy as np

(df.set_index(['User', 'Product']).Country
   .str.get_dummies(sep=',')
   .replace(0, np.nan)
   .stack()
   .groupby(level=[0, 1, 2]).sum())
Out[658]:
User Product
101 x Brazil 1.0
India 2.0
102 x Brazil 1.0
India 1.0
Japan 1.0
z Brazil 2.0
India 1.0
dtype: float64
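On pandas 0.25+, str.split plus explode gives the same long format more directly (a sketch, not from the original answers):
counts = (df.assign(Country=df['Country'].str.split(','))
            .explode('Country')
            .groupby(['User', 'Product', 'Country'])
            .size())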
Hi all, I have a dataframe with more than 50,000 records. It has a column named "Country" which has duplicate values.
As part of a machine learning project I am doing label encoding on this column, which will replace the column's 50,000 values with integers. (For those who do not know about label encoding: it takes the unique values of the column and assigns an integer to each, mostly based on alphabetical order, though I am not sure of that.) Say this dataframe is DF1 and the column is "Country".
Now my requirement is that I have to do the same for another dataframe (DF2) manually, i.e. without using the label encoding function.
What I have tried so far, and where I get stuck, is below:
1. I took the unique values of the DF1.Country column and kept them in a new dataframe (temp_df).
2. I tried a right join of DF2 and temp_df with on="Country", but I am getting NaN in a few records. Not sure why.
3. I tried a find-and-replace using the .isin method, but still did not get the desired output.
So my basic question is: how do I fill a column in one dataframe with the values of a column in another dataframe, by matching the values of a shared column between the two dataframes?
UPDATED
Sample code output is given below for better understanding
The Country column in DF2 has repeated values like this:
0 us
1 us
2 gb
3 us
4 au
5 fr
6 us
7 us
8 us
9 us
10 us
11 us
12 ca
13 at
14 us
15 us
16 es
17 fi
18 fr
19 us
20 us
The temp_df dataframe has an integer value for every unique country name, as shown below (note: this dataframe only has unique values, no duplicates):
1 gb 49
2 ca 22
3 au 5
4 de 34
5 fr 48
6 br 17
7 jp 75
8 sv 136
9 no 111
10 se 132
11 es 43
12 nl 110
13 mx 103
14 dk 36
15 ro 127
16 ch 24
17 it 71
18 be 10
19 ru 129
20 kr 78
21 fi 44
22 hk 59
23 ie 65
24 sg 133
25 nz 112
26 ar 3
27 at 4
28 in 68
29 cl 26
30 il 66
Now I have to create a new column in DF2 by taking the integer values from temp_df for each country value in DF2. Hope this helps.
You could use pandas.Series.map to accomplish this:
from io import StringIO
import pandas as pd
# Your data ..
data = """
id,country
0,AT
1,DE
2,UK
3,FR
4,AT
5,UK
6,IT
7,DE
"""
df = pd.read_table(StringIO(data), sep=',', index_col=[0])
# Create a map from your current labels to numeric labels:
country_labels = {c: i for i, c in enumerate(df.country.unique())}
# Use map() to transform your column and re-assign it
df.country = df.country.map(country_labels)
print(df)
which will transform the above data to
country
id
0 0
1 1
2 2
3 3
4 0
5 2
6 4
7 1
As suggested in one of the comments to your question, you could also use replace():
df = df.replace({'country': country_labels })
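As an aside (not part of the original answer), pd.factorize performs the same enumeration in one call when applied to the original string column:
codes, uniques = pd.factorize(df.country)  # codes are integers in order of first appearance
df['country'] = codes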
Try this:
import pandas as pd
# dataframe
df = pd.DataFrame({'Country' : ['z','x', 'x', 'a', 'a', 'b', 'c'], 'Something' : [10, 1, 2, 1, 2, 3, 4]})
# create dictionary for mapping `sorted` countries to integer
country_map = dict(zip(sorted(df.Country.unique()), range(len(df.Country.unique()))))
# country_map should look smthing like:
# {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'z': 4}, where a, b, .. are countries
# replace `Country` coloumn with mapping
df.replace({'Country': country_map })
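To answer the literal question of filling a column in DF2 from temp_df by matching country values, Series.map is a minimal sketch (assuming temp_df has columns 'Country' and 'Code'; 'Code' is a hypothetical name for the integer column):
mapping = temp_df.set_index('Country')['Code']  # 'Code' is an assumed column name
DF2['Country_encoded'] = DF2['Country'].map(mapping)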
I'm a recent convert from Excel to Python. I think what I'm trying to do here would traditionally be done with a VLOOKUP of sorts, but I may be struggling with the terminology and so cannot find the Python solution. I have been using the pandas library for most of my data analysis.
I have two different data frames: one with the weight changes (DF1), and the other with the weights (DF2). I want to go line by line (changes are chronological) and:
1. create a new column in DF1 with the weight before the change (basically extracted from DF2);
2. update the results in DF2, where Weight = Weight + WeightChange.
Note: The data frames do not have the same dimensions; an individual has several weight changes (DF1) but only one weight (DF2).
DF1:
Name WeightChange
1 John 5
2 Peter 10
3 John 7
4 Mary -20
5 Gary -3
DF2:
Name Weight
1 John 180
2 Peter 160
3 Mary 120
4 Gary 150
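For reference, the two frames can be built like this (values transcribed from the tables above):
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Peter', 'John', 'Mary', 'Gary'],
                    'WeightChange': [5, 10, 7, -20, -3]})
df2 = pd.DataFrame({'Name': ['John', 'Peter', 'Mary', 'Gary'],
                    'Weight': [180, 160, 120, 150]})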
Firstly I'd merge df1 and df2 on the 'Name' column to add the weight column to df1.
Then I'd groupby df1 on name and apply a transform to calculate the total weight change for each person. transform returns a Series aligned to the original df, so you can add an aggregated column back to the df.
Then I'd merge this column to df2 and then it's a simple case of adding this total weight change to the existing weight column:
In [242]:
df1 = df1.merge(df2, on='Name', how='left')
df1['WeightChangeTotal'] = df1.groupby('Name')['WeightChange'].transform('sum')
df1
Out[242]:
Name WeightChange Weight WeightChangeTotal
0 John 5 180 12
1 Peter 10 160 10
2 John 7 180 12
3 Mary -20 120 -20
4 Gary -3 150 -3
In [243]:
df2 = df2.merge(df1[['Name','WeightChangeTotal']], on='Name')
df2
Out[243]:
Name Weight WeightChangeTotal
0 John 180 12
1 John 180 12
2 Peter 160 10
3 Mary 120 -20
4 Gary 150 -3
In [244]:
df2['Weight'] = df2['Weight'] + df2['WeightChangeTotal']
df2
Out[244]:
Name Weight WeightChangeTotal
0 John 192 12
1 John 192 12
2 Peter 170 10
3 Mary 100 -20
4 Gary 147 -3
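Note that the merge above duplicates John's row in df2. A sketch that avoids this by aggregating df1 first (starting from the original, unmerged frames):
totals = df1.groupby('Name', as_index=False)['WeightChange'].sum()
df2 = df2.merge(totals, on='Name', how='left')
df2['Weight'] = df2['Weight'] + df2['WeightChange']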
EDIT
To address your desired behaviour for the 'WeightBefore' column:
In [267]:
df1['WeightBefore'] = df1['Weight'] + df1.groupby('Name')['WeightChange'].transform(lambda x: x.shift().cumsum()).fillna(0)
df1
Out[267]:
Name WeightChange Weight WeightBefore
0 John 5 180 180
1 Peter 10 160 160
2 John 7 180 185
3 Mary -20 120 120
4 Gary -3 150 150
So the above groups on 'Name', then applies a shift and a cumulative sum within each group so that we add up the incremental differences; we have to call fillna because the shift produces NaN for each name's first weight change.