Input Dataframe(df)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
USA WA 08-01-2020 43345
USA WA 09-01-2020 345
USA WV 10-01-2020 345
.
.
.
.
Expected Output(df1)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
.
.
.
.
As the dataframe above shows, the 'Region' column contains both NaN and non-null values. I'd like to remove every row where 'Region' is not NaN.
Also, after performing that operation, how can I remove the 'Region' column entirely in the fastest possible way (the frame has 10k+ columns)?
FINAL Expected Output
Country Date Value.....
ABW 01-01-2020 123
ABW 02-01-2020 1234
ABW 03-01-2020 3242
USA 04-01-2020 4354
USA 05-01-2020 43543
USA 06-01-2020 34534
USA 07-01-2020 435
Here's the code I tried
df1=df1.isnull(df1['Region'])
Error
df1=df.isnull(df['Region'])
TypeError: isnull() takes 1 positional argument but 2 were given
Following @BEN_YO's suggestion, this is what I did, and it works fine:
filtered_df = df1[df1['Region'].isnull()]
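The second part of the question (removing the Region column afterwards) can be handled in the same chain. The following is a minimal sketch, assuming a small stand-in frame built from the sample rows above; the name `final_df` is only illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame in the question (values copied from the sample rows).
df = pd.DataFrame({
    "Country": ["ABW", "ABW", "USA", "USA", "USA"],
    "Region": [np.nan, np.nan, np.nan, "WA", "WV"],
    "Date": ["01-01-2020", "02-01-2020", "04-01-2020", "08-01-2020", "10-01-2020"],
    "Value": [123, 1234, 4354, 43345, 345],
})

# Keep only the rows whose Region is missing, then drop the column itself.
final_df = df[df["Region"].isna()].drop(columns="Region")
print(final_df)
```

`isna()` and `isnull()` are aliases, so either spelling works. Dropping one column by name does not depend on how many other columns the frame has.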
I have two datasets that look like this:
df1:
Date     City     State  Quantity
2019-01  Chicago  IL     35
2019-01  Orlando  FL     322
...      ...      ...    ...
2021-07  Chicago  IL     334
2021-07  Orlando  FL     4332
df2:
Date     City     State  Sales
2020-03  Chicago  IL     30
2020-03  Orlando  FL     319
...      ...      ...    ...
2021-07  Chicago  IL     331
2021-07  Orlando  FL     4000
The Date column has dtype period[M] in both datasets. I have tried df1.join(df2, how='outer') and df2.join(df1, how='outer'), but the rows don't line up correctly: for example, the 2019-01 rows end up with the sales for 2020-03.
I have not been able to use merge() because I would have to merge on a combination of City, State and Date. How can I join these two datasets so that my output is as follows?
Date     City     State  Quantity  Sales
2019-01  Chicago  IL     35        NaN
2019-01  Orlando  FL     322       NaN
...      ...      ...    ...       ...
2021-07  Chicago  IL     334       331
2021-07  Orlando  FL     4332      4000
You can outer-merge. By not specifying the columns to merge on, you merge on the intersection of the columns in both DataFrames (in this case, Date, City and State).
out = df1.merge(df2, how='outer').sort_values(by='Date')
Output:
Date City State Quantity Sales
0 2019-01 Chicago IL 35.0 NaN
1 2019-01 Orlando FL 322.0 NaN
4 2020-03 Chicago IL NaN 30.0
5 2020-03 Orlando FL NaN 319.0
2 2021-07 Chicago IL 334.0 331.0
3 2021-07 Orlando FL 4332.0 4000.0
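If relying on the implicit column intersection feels fragile, the keys can also be named explicitly; merge() accepts a list of columns, so combining Date, City and State is not a problem. A minimal sketch, assuming small stand-in frames built from the sample rows in the question with Date stored as period[M]:

```python
import pandas as pd

# Stand-in frames; only the rows shown in the question are included.
df1 = pd.DataFrame({
    "Date": pd.PeriodIndex(["2019-01", "2019-01", "2021-07", "2021-07"], freq="M"),
    "City": ["Chicago", "Orlando", "Chicago", "Orlando"],
    "State": ["IL", "FL", "IL", "FL"],
    "Quantity": [35, 322, 334, 4332],
})
df2 = pd.DataFrame({
    "Date": pd.PeriodIndex(["2020-03", "2020-03", "2021-07", "2021-07"], freq="M"),
    "City": ["Chicago", "Orlando", "Chicago", "Orlando"],
    "State": ["IL", "FL", "IL", "FL"],
    "Sales": [30, 319, 331, 4000],
})

# Merge on all three key columns; rows present in only one frame get NaN.
keys = ["Date", "City", "State"]
out = (df1.merge(df2, on=keys, how="outer")
          .sort_values(keys)
          .reset_index(drop=True))
print(out)
```

Naming the keys also guards against an unrelated column that happens to exist in both frames being silently used as a join key.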
I'm trying to transpose a few columns while keeping the other columns as they are. I'm having a hard time with pivot and transpose, as neither gives me the output I need.
Can anyone help?
I have this data frame:
EmpID  Goal  week 1  week 2  week 3  week 4
1      556   54      33      24      54
2      342   32      32      56      43
3      534   43      65      64      21
4      244   45      87      5       22
My expected dataframe output is:
EmpID  Goal  Weeks   Actual
1      556   week 1  54
1      556   week 2  33
1      556   week 3  24
1      556   week 4  54
and so on until all employee IDs are listed.
Something like this.
# Python - melt DF
import pandas as pd
d = {'Country Code': [1960, 1961, 1962, 1963, 1964, 1965, 1966],
'ABW': [2.615300, 2.734390, 2.678430, 2.929920, 2.963250, 3.060540, 4.349760],
'AFG': [0.249760, 0.218480, 0.210840, 0.217240, 0.211410, 0.209910, 0.671330],
'ALB': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 1.12214]}
df = pd.DataFrame(data=d)
print(df)
df1 = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
         .sort_values(['Country Code','Year'])
         .reset_index(drop=True))
print(df1)
df2 = (df.set_index(['Country Code'])
         .stack(dropna=False)
         .reset_index(name='Econometric_Metric')
         .rename(columns={'level_1':'Year'}))
print(df2)
# BEFORE
ABW AFG ALB Country Code
0 2.61530 0.24976 NaN 1960
1 2.73439 0.21848 NaN 1961
2 2.67843 0.21084 NaN 1962
3 2.92992 0.21724 NaN 1963
4 2.96325 0.21141 NaN 1964
5 3.06054 0.20991 NaN 1965
6 4.34976 0.67133 1.12214 1966
# AFTER
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
Also, take a look at the link below for more info.
https://www.dataindependent.com/pandas/pandas-melt/
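Applied to the EmpID frame from the question, the same melt call might look like the sketch below. The frame is rebuilt from the table in the question; the name `long_df` is only illustrative.

```python
import pandas as pd

# The data frame from the question.
df = pd.DataFrame({
    "EmpID": [1, 2, 3, 4],
    "Goal": [556, 342, 534, 244],
    "week 1": [54, 32, 43, 45],
    "week 2": [33, 32, 65, 87],
    "week 3": [24, 56, 64, 5],
    "week 4": [54, 43, 21, 22],
})

# Keep EmpID and Goal as identifiers; unpivot the week columns into rows.
long_df = (df.melt(id_vars=["EmpID", "Goal"], var_name="Weeks", value_name="Actual")
             .sort_values(["EmpID", "Weeks"])
             .reset_index(drop=True))
print(long_df)
```

The sort on Weeks is alphabetical, so with ten or more weeks "week 10" would come before "week 2"; in that case sorting on a numeric week number extracted from the label would be safer.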
I'm trying to clean a PDF to turn it into a file for geocoding. I've been using tabula-py to extract the PDF and have had pretty good results, up to the point of removing rows that are entirely empty. I'm not even sure whether this is an efficient way to do it.
I've tried the majority of the solutions SO has recommended to me and I still can't quite figure it out. I've tried dropna with inplace=True, axis=0 and 1, and how="all". I also tried indexing out the NaN values, and that didn't work either.
import pandas as pd
import tabula
pd.set_option('display.width', 500)
df = tabula.read_pdf("C:\\Users\\Jack\\Documents\\Schoolwork\\Schoolwork\\WICResearch\\RefDocs\\wicclinicdirectory.pdf", pages='all', guess = False, pandas_options={'header': None})
df.columns = ["County_Name", "Clinic_Number", "Clinic_Name", "Address", "City", "Zip_Code", "Phone_Number", "Hours_of_Operation"]
df.drop(["Phone_Number", "Hours_of_Operation"], axis = 1, inplace = True)
#mass of code here that removes unwanted repeated column headers as by product of tabula reading PDFs.
df.drop(["Clinic_Name"], axis = 1, inplace = True)
df[['ClinicNum','ClinicName']] = df.Clinic_Number.apply(lambda x: pd.Series(str(x).split(" ", maxsplit = 1)))
df.drop(["Clinic_Number"], axis = 1, inplace = True)
#df[~df.isin(['NaN', 'NaT']).any(axis=1)]
#df.dropna(axis= 0, how ='all', inplace = True)
NaNIndex = df.index[df.isnull().all(1)]
print(NaNIndex)
print(df)
The above code gives this output:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
0 NaN NaN Ohio WIC Clinic Locations NaN nan NaN
1 NaN NaN NaN NaN Clinic NaN
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
4 NaN NaN NaN NaN nan NaN
5 NaN NaN NaN NaN nan NaN
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
16 NaN NaN NaN NaN nan NaN
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
And what I'd like is:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
I am able to create the data frame I want with the correct headings, but the all-NaN rows are still not removed, or else the entire frame is removed. I'd also like to move the rows that are only partly NaN into the rows they belong to, so that each clinic is on one line.
I'm also not sure how reproducible I can get this as I have fiddled around with tabula quite a bit trying to get this pdf converted.
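One likely reason the all-NaN check finds nothing is visible in the printed output: ClinicNum shows a lowercase `nan`, which is the text produced by `str(x)` in the `apply` line, not a real missing value. A minimal sketch, assuming a small hypothetical frame that mimics a few of the rows above:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame: row 1 looks empty, but ClinicNum
# holds the string "nan" (from str(NaN)) rather than a real NaN.
df = pd.DataFrame({
    "County_Name": ["Adams", np.nan, np.nan, "Allen"],
    "Address": ["9137 State Route 136", np.nan, "Suite E",
                "940 North Cable Road, Suite 4"],
    "ClinicNum": ["100", "nan", "nan", "200"],
    "ClinicName": ["Adams County WIC Program", np.nan, np.nan,
                   "Allen County WIC Program"],
})

# Before the fix, no row counts as entirely null.
print(df.index[df.isnull().all(axis=1)])

# Turn the text "nan" back into a real NaN, then drop fully empty rows.
cleaned = df.replace("nan", np.nan).dropna(axis=0, how="all")
print(cleaned)
```

Rows that carry leftover header text (such as the ones containing only "Clinic") are not empty and would still need a separate filter.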
I have a dataset that has been merged together to fill missing values from one another.
The problem is that I have some columns with missing data that I want to now fill with the values that aren't missing.
The merged data set looks like this for an input:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 nan 1956 nan
Johnson AL 1 nan nan 1956 nan
Johnson AL 2 1 nan 1999 nan
Johnson AL 2 0 nan 1999 nan
Debra AK 1A 0 nan 2000 nan
Debra AK 1B nan 20 nan 1997
Debra AK 2 nan 10 nan 2009
Debra AK 3 nan 1 nan 2008
.
.
What I'd want for an output is this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 1 1956 1956
Johnson AL 2 1 1 1999 1999
Johnson AL 2 0 0 1999 1999
Debra AK 1A 0 0 2000 2000
Debra AK 1B 20 20 1997 1997
Debra AK 2 10 10 2009 2009
Debra AK 3 1 1 2008 2008
.
.
So I want it so that all nan values are replaced by the associated values in their columns - match Number_x to Number_y and Op_x to Op_y.
One thing to note is that when there are two IDs that are the same sometimes their values will be different; like Johnson with ID = 2 which has different numbers but the same op values. I want to keep these because I need to investigate them more.
Also, if the row has two missing values for Number_x and Number_y I want to take that row out - like Johnson with Number_x and Number_y missing as a nan value.
Let us do a groupby with axis=1, grouping the columns by the part of their name before the underscore:
df.groupby(df.columns.str.split('_').str[0], axis=1).first().dropna(subset=['Number','Op'])
ID Name Number Op State
0 1 Johnson 1.0 1956.0 AL
2 2 Johnson 1.0 1999.0 AL
3 2 Johnson 0.0 1999.0 AL
4 1A Debra 0.0 2000.0 AK
5 1B Debra 20.0 1997.0 AK
6 2 Debra 10.0 2009.0 AK
7 3 Debra 1.0 2008.0 AK
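Newer pandas releases deprecate groupby along axis=1, so an alternative that avoids it may be useful. The sketch below uses `Series.combine_first` to take the `_x` value where present and fall back to `_y`; the frame is a shortened stand-in for the one in the question, and the name `out` is only illustrative.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the merged frame in the question.
df = pd.DataFrame({
    "Name": ["Johnson", "Johnson", "Debra", "Debra"],
    "State": ["AL", "AL", "AK", "AK"],
    "ID": ["1", "1", "1A", "1B"],
    "Number_x": [1, np.nan, 0, np.nan],
    "Number_y": [np.nan, np.nan, np.nan, 20],
    "Op_x": [1956, 1956, 2000, np.nan],
    "Op_y": [np.nan, np.nan, np.nan, 1997],
})

out = df[["Name", "State", "ID"]].copy()
for base in ["Number", "Op"]:
    # Prefer the _x value; use _y only where _x is missing.
    out[base] = df[f"{base}_x"].combine_first(df[f"{base}_y"])

# Drop rows where neither source had a value.
out = out.dropna(subset=["Number", "Op"])
print(out)
```

Like the groupby version, this yields one Number and one Op column instead of keeping the `_x`/`_y` pairs, and rows with the same ID but different values are kept for later inspection.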
My pandas data frame looks as follows:
Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 ... 2015
ABW 2.615300 2.734390 2.678430 2.929920 2.963250 3.060540 ... 4.349760
AFG 0.249760 0.218480 0.210840 0.217240 0.211410 0.209910 ... 0.671330
ALB NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.12214
...
How can I transpose it so that it looks as follows?
Country_Code Year Econometric_Metric
ABW 1960 2.615300
ABW 1961 2.734390
ABW 1962 2.678430
...
ABW 2015 4.349760
AFG 1960 0.249760
AFG 1961 0.218480
AFG 1962 0.210840
...
AFG 2015 0.671330
ALB 1960 NaN
ALB 1961 NaN
ALB 1962 NaN
ALB 2015 1.12214
...
Thanks.
I think you need melt with sort_values:
df = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
        .sort_values(['Country Code','Year'])
        .reset_index(drop=True))
Or set_index with stack:
df = (df.set_index(['Country Code'])
        .stack(dropna=False)
        .reset_index(name='Econometric_Metric')
        .rename(columns={'level_1':'Year'}))
print (df.head(10))
Country Code Year Econometric_Metric
0 ABW 1960 2.61530
1 ABW 1961 2.73439
2 ABW 1962 2.67843
3 ABW 1963 2.92992
4 ABW 1964 2.96325
5 ABW 1965 3.06054
6 ABW 1966 NaN
7 ABW 1967 NaN
8 ABW 1968 NaN
9 ABW 2015 4.34976