Transform a DataFrame back to integers after np.nan? - python

I have a DataFrame that looks like this (original is a lot longer):
Country Energy Supply Energy Supply per Capita % Renewable
0 Afghanistan 321 10 78.6693
1 Albania 102 35 100
2 Algeria 1959 51 0.55101
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.6957
5 Angola 642 27 70.9091
I am trying to replace those pesky '...' with a NaN value using np.nan, but I want to change only those specific '...' values, because if I apply np.nan to the df then all the integers are changed to floats. I am not sure if I am getting this right, so please correct me if I'm wrong. The reason I don't want all the numbers in the df to be floats is that I will have to multiply the integers by large numbers, and the result comes up in scientific notation. I tried using this:
energy = energy.replace('...', np.nan)
But as I said, all numbers from df are turned into float.

If you want to write the data back to a file as integers, df.astype({'col1': 'int32'}) may help.
In NumPy, you may need to split the integer columns and the float columns and operate on them separately; my_npArray.astype(int) may help you there.
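Putting the two steps together, here is a minimal sketch that assumes the column names from the question's sample. It does the replace first and then converts the affected columns back to whole numbers; the nullable 'Int64' dtype (available since pandas 0.24) can hold the NaNs without forcing the column to float:
import numpy as np
import pandas as pd

# Assumes `energy` is the DataFrame shown in the question.
energy = energy.replace('...', np.nan)

# Nullable integer dtype keeps missing values while the rest stay integers.
for col in ['Energy Supply', 'Energy Supply per Capita']:
    energy[col] = pd.to_numeric(energy[col]).astype('Int64')

# If the NaNs get filled or dropped later, a plain int works too:
# energy['Energy Supply'] = energy['Energy Supply'].astype('int64')
Also note that scientific notation is only a display issue for floats; pd.set_option('display.float_format', '{:,.0f}'.format) changes how they are printed without touching the underlying values.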

Related

How to iterate over columns and check condition by group

I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work; it filters out countries where all values are NaN in either of (inflation, GDP):
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note: if you have more than two columns, you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
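To see what the general version does, here's a small reproduction of the sample data from the question (a sketch; dtypes may differ from your real data):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2001, 2002, 2003] * 3,
    'country': ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP': [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

kept = df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
print(kept['country'].unique())  # ['USA'] -- AFG has no inflation data, CHI has no GDP data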
If you want this to consider only a specific range of years rather than all of them, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()
# extract only countries with information on both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), 'country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
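A hedged variant of the same idea uses count() instead of sum(); count() ignores NaN, so it avoids the edge case where a column's real values happen to sum to zero:
# A zero count means the column has no data at all for that country.
counts = df.groupby('country').agg({'inflation': 'count', 'GDP': 'count'}).reset_index()
keep = counts.loc[(counts['inflation'] > 0) & (counts['GDP'] > 0), 'country']
df = df[df['country'].isin(keep)]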
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after it's reshaped:
df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null
To convert back to long, you can use pd.melt.
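A sketch of the whole round trip, using the column names from the question (pivot with a list of values needs pandas >= 1.1; note that how='any' drops a country if even a single year is missing, which is stricter than the question asks for):
wide = df.pivot(index='country', columns='year', values=['inflation', 'GDP'])

# Drop countries that contain any nulls across the pivoted columns.
wide = wide.dropna(axis=0, how='any')

# Back to long format; stack() works directly on the pivoted columns
# (pd.melt works too once the column MultiIndex is flattened).
long_again = wide.stack(level='year').reset_index()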

Issues detecting data input error in a dataframe

I have a dataframe lexdata. I want to check and count the number of null values, and also detect invalid values (strings) in the 'sales' column.
sample data
city year month sales
0 Abilene 2000 1 72.0
1 Abilene 2000 2 ola-k
2 Abilene 2000 3 130.0
3 Abilene 2000 4 lee
4 Abilene 2000 5 141.0
I successfully checked and counted the null values with the following code:
lexdata.isnull().sum()
The challenge is to check for invalid values (strings) in the sales column.
You can use pd.to_numeric and pass 'coerce' to the errors parameter. It will try to convert the values to numeric; anything that cannot be converted becomes NaN, so you can count the null values after conversion:
pd.to_numeric(df['sales'], errors='coerce').isnull().sum()
#output: 2
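A sketch for locating the offending rows rather than just counting them (column names taken from the sample above):
import pandas as pd

# True where the value is present but cannot be parsed as a number.
invalid = lexdata['sales'].notna() & pd.to_numeric(lexdata['sales'], errors='coerce').isna()
print(lexdata.loc[invalid, ['city', 'year', 'month', 'sales']])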

Pandas merge result missing rows when joining on strings

I have a data set that I've been cleaning and to clean it I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:
Count Region Period ACV PRJ
167 REMAINING US WEST 3/3/2018 5 57
168 REMAINING US WEST 3/31/2018 10 83
169 SAN FRANCISCO 1/13/2018 99 76
170 SAN FRANCISCO 1/20/2018 34 21
df2 looks something like this:
Count MKTcode Region
11 RSMR0 REMAINING US SOUTH
12 RWMR0 REMAINING US WEST
13 SFR00 SAN FRANCISCO
I've tried merging them with this code:
df3 = pd.merge(df1, df2, on='Region', how='inner')
but for some reason pandas is not treating the Region columns as the same data: the merge turns up NaN in the MKTcode column and seems to append df2 to df1, like this:
Count Region Period ACV PRJ MKTcode
193 WASHINGTON, D.C. 3/3/2018 36 38 NaN
194 WASHINGTON, D.C. 3/31/2018 12 3 NaN
195 ATLANTA NaN NaN NaN ATMR0
196 BOSTON NaN NaN NaN B2MRN
I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as different elements.
The MKTcode and Region columns in df2 have only 12 observations, each occurring exactly once, whereas df1 has several repeated values in the Region column (multiple rows for the same city). Is there a way I can just create a list of the 12 MKTcodes that I need and perform a merge that matches each region I designate, like a one-to-many match?
Thanks.
When a merge isn't working as expected, the first thing to do is look at the offending columns.
The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
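Before fixing anything, it can help to look at the raw key values; repr() makes hidden whitespace visible. A quick diagnostic sketch, assuming the two frames from the question:
# Region values in df1 that find no exact match in df2.
unmatched = set(df1['Region']) - set(df2['Region'])

# repr() exposes leading/trailing or doubled whitespace that print() hides.
for value in sorted(unmatched):
    print(repr(value))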
Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use
for df in (df1, df2):
    # Strip the column(s) you're planning to join with
    df['Region'] = df['Region'].str.strip()
Now, merging should work as expected,
pd.merge(df1, df2, on='Region', how='inner')
Count_x Region Period ACV PRJ Count_y MKTcode
0 167 REMAINING US WEST 3/3/2018 5 57 12 RWMR0
1 168 REMAINING US WEST 3/31/2018 10 83 12 RWMR0
2 169 SAN FRANCISCO 1/13/2018 99 76 13 SFR00
3 170 SAN FRANCISCO 1/20/2018 34 21 13 SFR00
Another possibility, if you're still getting NaNs, is a difference in whitespace characters between words. For example, a value with a double space or a non-breaking space, such as 'REMAINING  US WEST', will not compare as equal to 'REMAINING US WEST' with single spaces.
This time, the fix is to use str.replace to collapse any run of whitespace into a single space:
for df in (df1, df2):
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)

Slicing in Pandas Dataframes and comparing elements

Morning. Recently I have been trying to use pandas to create large data tables for machine learning (I'm trying to move away from numpy as best I can).
However, I'm running into some issues, namely with slicing pandas DataFrames.
I'd like to return the rows I specify, then reference and compare particular elements with those in other arrays. Here's a small amount of code I've implemented and some outline:
import pandas as pd
import csv
import math
import random as nd
import numpy
#create the pandas dataframe from my csv. The Csv is entirely numerical data
#with exception of the first row vector which has column labels
df = pd.read_csv(r"C:\Users\Python\Downloads\Data for Brent - Secondattampatatdfrandomsample.csv")
#I use panda functionality to return a random sample of the data (a subset
#of the array)
df_sample = pd.DataFrame.sample(df, 10)
It's at this point that I want to compare the first element along each row vector to the original data. Specifically, the first element in any row contains an id number.
If the elements of the original data frame and the sample frame match up, I'd like to compute 3- and 6-month averages of the associated column elements with the matching id number.
I should mention that I'm comfortable moving to numpy and away from pandas, but there are training model methods in pandas that I hear a ton of good things about (my background is on the mathematics side of things and less so program development). Thanks for the input!
edit: here is the sample input for the first 11 row vectors in the dataframe (id, year, month, x, y, z)
id year month x y z
0 2 2016 2 1130 343.627538 163660.060200
1 2 2016 4 859 913.314513 360633.159400
2 2 2016 5 931 858.548056 93608.190030
3 2 2016 6 489 548.314860 39925.669950
4 2 2016 7 537 684.441725 80270.240060
5 2 2016 8 618 673.887072 124041.560000
6 2 2016 9 1030 644.749493 88975.429980
7 2 2016 10 1001 543.312870 54874.599830
8 2 2016 11 1194 689.053707 79930.230000
9 2 2016 12 673 483.644736 27567.749940
10 2 2017 1 912 657.716386 54590.460070
11 2 2017 2 671 682.007537 52514.580380
Here is how the sample data is returned (same n-tuple as before). I used native pandas functions to return a randomly generated subset of 10 row vectors out of almost 9000 entries:
2 2016 1 633 877.9282175 75890.97027
5185 2774 2016 4 184 399.418719 9974.375000
9441 4974 2017 2 239 135.520851 0.000000
5134 2745 2017 2 187 217.220657 7711.333333
8561 4063 2017 1 103 505.714286 18880.000000
3328 2033 2016 11 118 452.152542 7622.000000
3503 2157 2016 3 287 446.668831 8092.588235
5228 2791 2016 2 243 400.166008 12655.250000
9380 4708 2017 2 210 402.690583 5282.352941
1631 1178 2016 10 56 563.716667 16911.500000
2700 1766 2016 1 97 486.764151 6449.625000
I'd like to identify the appropriate positions in the sample array, search for identical elements in the original array, and compute averages (and eventually apply more rigorous statistical modeling) of the associated numerical data.
# Group the original frame by id and average x, y, z.
# This computes the means for every id; filter df by df_sample['id'].unique()
# first if you only want the ids that appear in the sample.
df.groupby('id').mean()[['x', 'y', 'z']].reset_index()
I'm not sure if this is exactly what you want but I'll walk through it to see if it gives you ideas. For each unique id in the sample (I did it for all of them, implement whatever check you like), I grouped the original dataframe by that id (all rows with id == 2 are smushed together) and took the mean of the resulting pandas.GroupBy object as required (which averages the smushed together rows, for each column not in the groupby call). Since this averages your month and year as well, and all I think I care about is x, y, and z, I selected those columns, and then for aesthetic purposes reset the index.
Alternatively, if you wanted the average for that id for each year in the original df, you could do
df.groupby(['id', 'year']).mean()[['x', 'y', 'z']].reset_index()
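The question also mentions 3- and 6-month averages. Here is a sketch of one way to get trailing per-id window averages, assuming one row per (id, year, month) and that a simple row-based window is acceptable (a calendar-aware window would need a proper DatetimeIndex):
# Sort so each id's rows are in chronological order.
df = df.sort_values(['id', 'year', 'month'])

# Trailing 3- and 6-row means of x, y, z within each id.
for window in (3, 6):
    for col in ('x', 'y', 'z'):
        df[f'{col}_avg_{window}m'] = (
            df.groupby('id')[col]
              .transform(lambda s: s.rolling(window, min_periods=1).mean())
        )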

drop_duplicates() with double brackets [[ ]]

I know how horrible an error report like "not working" is, but it really is that simple. I have a data set with a year and a group identifier, year and gvkey.
The code that I used to do was
df = df.reset_index().drop_duplicates([['year', 'gvkey']]).set_index(['year', 'gvkey'], drop=True)
However, df.index.is_unique would return false. Puzzled, I looked at some slice of the data, and indeed:
>>> asd = df.head().reset_index()
>>> asd
Out[575]:
year gvkey sic state naics
0 1966 1000 3089 NaN NaN
1 1966 1000 3089 NaN NaN
2 1972 1000 3089 NaN NaN
3 1976 1000 3089 NaN NaN
4 1984 1001 5812 OK 722
>>> asd.drop_duplicates([['year', 'gvkey']])
Out[576]:
year gvkey sic state naics
0 1966 1000 3089 NaN NaN
1 1966 1000 3089 NaN NaN
4 1984 1001 5812 OK 722
However, following a random twitch, I also tried:
>>> asd.drop_duplicates(['year', 'gvkey'])
Out[577]:
year gvkey sic state naics
0 1966 1000 3089 NaN NaN
2 1972 1000 3089 NaN NaN
3 1976 1000 3089 NaN NaN
4 1984 1001 5812 OK 722
which gave me what I expected. Now I am ultimately confused. What exactly is the difference between the two notations? I always used the double brackets [[ ]] for slicing etc. in Python. Do I need to revise all my code, or is this specific to drop_duplicates()?
From the documentation: when you pass a sequence to the first argument (cols in pandas 0.13.1, later renamed subset), you are giving the names of the columns to be considered when identifying the duplicates.
Therefore, the right syntax uses single brackets [] (or parentheses), because they produce the sequence of labels that you want. Using double brackets produces a sequence containing a list, which in your case does not represent the column labels you are looking for.
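A sketch of the corrected call from the question (in recent pandas the argument is named subset, but a plain list works positionally in both old and new versions):
df = (
    df.reset_index()
      .drop_duplicates(['year', 'gvkey'])      # single brackets: a list of labels
      .set_index(['year', 'gvkey'], drop=True)
)
df.index.is_unique   # should now be True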
drop_duplicates expects a label or list of labels for its first argument. What you've created by putting two sets of brackets is a list of lists of labels. Pandas doesn't know what it's looking at when you do that.
I always used the double brackets [[]] for slicing etc in python
Most likely, either you haven't been doing that as much as you thought, or your code is full of awkwardly formed data structures and weird code to work with them. Under normal circumstances (such as here), double brackets would be an error, and you would already have noticed. I would recommend rechecking the places you've used double brackets; I can't tell whether they should be changed just from this information.
