Replace missing NaN values based on the values of another column (conditions) - Python

Hi, I would like to fill in the NaN values in the area column based on the value of source.
I have tried np.select, but this method also overwrites the other, already correct, values.
landline_area1['area'] = np.select(area_conditions, values)
Table overview
   source codes     area
4    1304  1304    Dover
5    1768  1768  Penrith
6    2077   NaN      NaN
7    1225  1225     Bath
8    1142   NaN      NaN
conditions
area_conditions = [
(landline_area1['source'].str.startswith('20')),
(landline_area1['source'].str.startswith('23')),
(landline_area1['source'].str.startswith('24'))]
values
values = [
'London',
'Southampton / Portsmouth',
'Coventry']
Expected result
   source codes      area
4    1304  1304     Dover
5    1768  1768   Penrith
6    2077   NaN    London
7    1225  1225      Bath
8    1142   NaN Sheffield

Let us try np.select, casting source to str first so that str.startswith works:
# each condition should be built from the string-cast source, e.g.:
# landline_area1['source'].astype(str).str.startswith('20')
s = np.select(area_conditions, values)
landline_area1['area'].fillna(pd.Series(s, index=landline_area1.index), inplace=True)
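For reference, a minimal self-contained sketch of this approach (the sample frame below is made up to mirror the table above, and passing default=None to np.select is an assumption so unmatched rows stay missing instead of becoming 0):
import numpy as np
import pandas as pd

# sample frame shaped like the question's table (values are illustrative)
landline_area1 = pd.DataFrame(
    {'source': [1304, 1768, 2077, 1225, 1142],
     'codes':  [1304, 1768, np.nan, 1225, np.nan],
     'area':   ['Dover', 'Penrith', np.nan, 'Bath', np.nan]},
    index=[4, 5, 6, 7, 8])

# cast source to str once so str.startswith works even on a numeric column
src = landline_area1['source'].astype(str)
area_conditions = [
    src.str.startswith('20'),
    src.str.startswith('23'),
    src.str.startswith('24')]
values = [
    'London',
    'Southampton / Portsmouth',
    'Coventry']

# default=None keeps unmatched rows as missing instead of 0
s = pd.Series(np.select(area_conditions, values, default=None),
              index=landline_area1.index)

# fillna only touches rows where area is NaN, so correct values survive;
# row 8 (1142) stays NaN here because the three prefixes above do not cover it
landline_area1['area'] = landline_area1['area'].fillna(s)
print(landline_area1)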

Related

Calculated mean column for a group transposed in a dataframe

I'm having an issue with my final analysis column. I'm looking to get the mean of each row in the below output.
   ValueSource StackedValues Count   Sum_Weight Group_Count Mean
0      AgeBand           4.0   402  6152.237828        2418  NaN
2      AgeBand           2.0   402  5250.436317        2053  NaN
7      AgeBand           3.0   402  4344.387011        1667  NaN
11     AgeBand           5.0   402  7296.371395        2911  NaN
19     AgeBand           1.0   402  3260.035257        1254  NaN
20     AgeBand           6.0   402  8501.978737        3341  NaN
59     AgeBand           8.0   402 15487.932515        6210  NaN
92     AgeBand           7.0   402 12054.620941        4846  NaN
So for index row 0, the mean would be Sum_Weight / SUM(Sum_Weight), grouped across ValueSource.
I tried the following: Data['Mean'] = Data.groupby("ValueSource")['Sum_Weight'].mean(), but as you can see, it didn't quite work.
The end result would be a Mean column that has a value for each row per ValueSource and StackedValues.
Any help would be much appreciated.
You could do that with groupby and apply, like:
Data['Mean'] = Data.groupby("ValueSource")['Sum_Weight'].apply(lambda x: x / x.sum())
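If apply gives you index-alignment trouble, a hedged alternative sketch (same column names, made-up sample values) uses transform, which always returns a result aligned to the original index:
import pandas as pd

# made-up sample shaped like the question's output
Data = pd.DataFrame({
    'ValueSource': ['AgeBand'] * 4,
    'StackedValues': [4.0, 2.0, 3.0, 5.0],
    'Sum_Weight': [6152.24, 5250.44, 4344.39, 7296.37]})

# transform keeps the original index, so the assignment lines up row by row
Data['Mean'] = Data.groupby('ValueSource')['Sum_Weight'].transform(lambda x: x / x.sum())
print(Data)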

Sorting values and percentages creating inaccuracies

Trying to return the percentage change of value B using the value of B-1. However, when I run the for loop I am using to carry out this calculation, I get 100% from A to B (the first two values).
Here is the table in question, to give you more context:
val pct_of_whole
3612 100.0
2339 65.0
2339 65.0
2208 61.0
1890 52.0
1368 38.0
1365 38.0
1363 38.0
1086 30.0
1058 29.0
So from this table I am trying to return the percentage change from 3612 to 2339, from 2339 to 2339, from 2339 to 2208, and so on.
This is the for loop I am using to carry out the percentage change calculation:
pct_change = [100]
length = len(df_two['val'])
for j in range(1, length):
    pct_change.append(int(df_two['val'][j] / df_two['val'][j-1] * 100))
At this point my chart retains the correct percentage changes. Since I am building a funnel showing drop-offs between each stage, I sort the values from smallest to largest:
df_two = df_two.sort_values('val').reset_index(drop=True)
At this point the percentage changes start looking inaccurate
val pct_of_whole pct_change
1058 29.0 97
1086 30.0 79
1363 38.0 99
1365 38.0 99
1368 38.0 99
1890 52.0 72
2208 61.0 94
2339 65.0 64
2339 65.0 100
3612 100.0 100
Understandably this makes the funnel I build appear inaccurate.
I think when I sort by val, the 2339 with the higher percentage change is incorrectly placed second in order, which is the cause of my confusion.
EDIT: Sorry - now I understand the question... :)
To sort different columns in different directions (i.e. one column ascending, the other descending), you can provide lists for both kwargs, by and ascending:
df.sort_values(['val', 'pct_change'], ascending=[True, False]).reset_index(drop=True)
val pct_of_whole pct_change
0 1058 29.291251 97.421731
1 1086 30.066445 79.677183
2 1363 37.735327 99.853480
3 1365 37.790698 99.780702
4 1368 37.873754 72.380952
5 1890 52.325581 85.597826
6 2208 61.129568 94.399316
7 2339 64.756368 100.000000
8 2339 64.756368 64.756368
9 3612 100.000000 NaN
IIUC, your dataframe columns can be computed like this:
Given a dataframe with column val
df
0 3612
1 2339
2 2339
3 2208
4 1890
5 1368
6 1365
7 1363
8 1086
9 1058
the pct_of_whole can be calculated via
df.val/df.val.loc[0]*100
0 100.000000
1 64.756368
2 64.756368
3 61.129568
4 52.325581
5 37.873754
6 37.790698
7 37.735327
8 30.066445
9 29.291251
Name: val, dtype: float64
...and the pct_change would be
df.val/df.val.shift()*100
0 NaN
1 64.756368
2 100.000000
3 94.399316
4 85.597826
5 72.380952
6 99.780702
7 99.853480
8 79.677183
9 97.421731
Name: val, dtype: float64
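Put together as a small runnable sketch (the column names follow the question and the values are the ones shown above):
import pandas as pd

# the question's values, in the original largest-to-smallest order
df = pd.DataFrame({'val': [3612, 2339, 2339, 2208, 1890, 1368, 1365, 1363, 1086, 1058]})

# percentage of the first (whole) value
df['pct_of_whole'] = df['val'] / df['val'].iloc[0] * 100

# percentage change relative to the previous row, computed before any sorting
df['pct_change'] = df['val'] / df['val'].shift() * 100
print(df)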
So I noticed that after running the for loop the dataframe was in the inverse of the order I wanted.
So I reset the index to give each of the values an index number from 0-9.
df_two = df_two.reset_index()
I then sorted my dataframe by that index column in descending order and reset the index again, dropping it.
df_two = df_two.sort_values('index', ascending=False).reset_index(drop=True)
After this both my dataframe and funnel were in the right order

Trying to ignore Nan in csv file throws a typeerror

I'm loading a local CSV file that contains data. I'm trying to find the smallest float in a column that's a mix of NaN and numbers.
I have tried using the numpy function called np.nanmin, but it throws:
"TypeError: '<=' not supported between instances of 'str' and 'float'"
database = pd.read_csv('database.csv',quotechar='"',skipinitialspace=True, delimiter=',')
coun_weight = database[['Country of Operator/Owner', 'Launch Mass (Kilograms)']]
print(coun_weight)
lightest = np.nanmin(coun_weight['Launch Mass (Kilograms)'])
Any suggestions as to why nanmin might not work?
A link to the entire csv file: http://www.sharecsv.com/s/5aea6381d1debf75723a45aacd40abf8/database.csv
Here is a sample of my coun_weight:
      Country of Operator/Owner  Launch Mass (Kilograms)
1390  China                      NaN
1391  China                      1040
1392  China                      1040
1393  China                      2700
1394  China                      2700
1395  China                      1800
1396  China                      2700
1397  China                      NaN
1398  China                      NaN
1399  China                      NaN
1400  China                      NaN
1401  India                      92
1402  Russia                     45
1403  South Africa               1
1404  China                      NaN
1405  China                      4
1406  China                      4
1407  China                      12
I tested it, and all the problematic values are:
coun_weight = pd.read_csv('database.csv')
print (coun_weight.loc[pd.to_numeric(coun_weight['Launch Mass (Kilograms)'], errors='coerce').isnull(), 'Launch Mass (Kilograms)'].dropna())
1091 5,000+
1092 5,000+
1093 5,000+
1094 5,000+
1096 5,000+
Name: Launch Mass (Kilograms), dtype: object
And the solution is:
coun_weight['Launch Mass (Kilograms)'] = (
    coun_weight['Launch Mass (Kilograms)'].replace('5,000+', 5000).astype(float))
print (coun_weight['Launch Mass (Kilograms)'].iloc[1091:1098])
1091 5000.0
1092 5000.0
1093 5000.0
1094 5000.0
1095 NaN
1096 5000.0
1097 6500.0
Name: Launch Mass (Kilograms), dtype: float64
Then, if you need to find the minimal value with NaNs present, use Series.min, where NaNs are skipped:
print (coun_weight['Launch Mass (Kilograms)'].min())
0.0
Testing whether any 0 values are in the column:
a = coun_weight['Launch Mass (Kilograms)']
print (a[a == 0])
912 0.0
Name: Launch Mass (Kilograms), dtype: float64
Another possible solution is to replace these values with NaN:
coun_weight['Launch Mass (Kilograms)'] = (
    pd.to_numeric(coun_weight['Launch Mass (Kilograms)'], errors='coerce'))
print (coun_weight['Launch Mass (Kilograms)'].iloc[1091:1098])
1091 NaN
1092 NaN
1093 NaN
1094 NaN
1095 NaN
1096 NaN
1097 6500.0
Name: Launch Mass (Kilograms), dtype: float64
Trying to convert the column to float explicitly reveals the problem: you have "5,000+", which does not convert to 'float64'.
coun_weight['Launch Mass (Kilograms)'].astype('float64')
Result:
ValueError: invalid literal for float(): 5,000+
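Putting the cleaning and the minimum together, a minimal sketch (the file path and column name follow the question; coercing the bad strings to NaN, rather than replacing '5,000+' with 5000, is one of the two options shown above):
import pandas as pd

# load the file and keep the two columns of interest
database = pd.read_csv('database.csv', quotechar='"', skipinitialspace=True, delimiter=',')
coun_weight = database[['Country of Operator/Owner', 'Launch Mass (Kilograms)']].copy()

# coerce non-numeric strings such as '5,000+' to NaN so the column becomes float
coun_weight['Launch Mass (Kilograms)'] = pd.to_numeric(
    coun_weight['Launch Mass (Kilograms)'], errors='coerce')

# Series.min skips NaN, so np.nanmin is no longer needed
lightest = coun_weight['Launch Mass (Kilograms)'].min()
print(lightest)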

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
                    total_annotations  unique_tags
date       item_id
2007-04-01 2                       30           14
2007-05-01 2                       32           16
2007-06-01 2                       36           19
2008-07-01 2                       81           33
2008-11-01 2                       82           34
2009-04-01 2                       84           35
2010-03-01 2                       90           35
2010-04-01 2                      100           36
2010-11-01 2                      105           40
2011-05-01 2                      106           40
2011-07-01 2                      108           42
2005-08-01 3                      479          200
2005-09-01 3                      707          269
2005-10-01 3                      980          327
2005-11-01 3                     1176          373
2005-12-01 3                     1536          438
2006-01-01 3                     1854          497
2006-02-01 3                     2206          560
2006-03-01 3                     2558          632
2007-02-01 3                     5650         1019
As you can see, not every consecutive month has an observation for each item. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially, for each group you want to reindex and ffill. The apply gets passed a DataFrame that still has item_id and date in the index, so reset the index, set date as the index again, and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
lambda x: x.reset_index().set_index('date').reindex(idx,method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this to be somewhat more readable. It is also considerably more efficient, since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])

def fill_missing_dates(x, idx=all_dates):
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
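A self-contained sketch of the same idea (the sample frame and the all_dates range are assumptions made up for illustration):
import pandas as pd

# small illustrative frame with the question's structure
df = pd.DataFrame({
    'item_id': [2, 2, 3, 3],
    'date': pd.to_datetime(['2007-04-01', '2007-06-01', '2005-08-01', '2005-10-01']),
    'total_annotations': [30, 36, 479, 980],
    'unique_tags': [14, 19, 200, 327]}).set_index(['item_id', 'date'])

# monthly range to reindex every item onto (shorter than the question's for brevity)
all_dates = pd.date_range('2005-07-01', '2007-12-01', freq='MS')

def fill_missing_dates(x, idx=all_dates):
    # drop the item_id level so the group can be reindexed on dates alone
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = df.groupby(level='item_id').apply(fill_missing_dates)
print(filled_df.head(12))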

Dataframe Merge in Pandas

For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
   rsparid  f1mult  f2mult  f3mult
0        1   0.318   0.636   0.810
1        2   0.348   0.703   0.893
2        3   0.384   0.777   0.000
3        4   0.296   0.590   0.911
4        5   0.231   0.458   0.690
5        6   0.275   0.546   0.839
6        7   0.248   0.486   0.731
7        8   0.430   0.873   0.000
8        9   0.221   0.438   0.655
9       11   0.204   0.399   0.593
When trying to join the above, on the rsparid column, to this Dataframe...
            line_track  line_race  rsparid
line_date
2013-03-23          TP         10     1400
2013-02-23          GP          7      634
2013-01-01          GP          7     1508
2012-11-11         AQU          5       96
2012-10-11         BEL          2      161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks..
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note, the "datalines" column can have thousands more rows than the rspars, thus the left join. I must be doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN
The NaNs mean the two frames have no rsparid values in common. This can be tricky when merging things that look the same in their repr.
The repr of a small DataFrame whose column holds strings (of integers) looks the same as one holding actual integers, and no dtype information is printed when frames are small. You can get this information (and more) by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':map(str,[1400,634,1508,96,161,1011,1007,518,1955,678])})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the reprs here, most likely because of padding of numeric DataFrames. Point is, no one would really be able to see that using this interactively, unless they were specifically looking for the difference.
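Building on that diagnosis, a hedged sketch of the usual fix: cast both key columns to a common dtype before merging (the tiny frames and the choice to cast the string column to int are assumptions for illustration, not the asker's actual data):
import pandas as pd

# keys that look identical in the repr but have different dtypes
rspars = pd.DataFrame({'rsparid': [1400, 634], 'f1mult': [0.216, 0.348]})
datalines = pd.DataFrame({'rsparid': ['1400', '634', '1508']})   # strings, not ints

datalines.info()   # rsparid is dtype object
rspars.info()      # rsparid is dtype int64

# cast both keys to the same dtype; the left merge then matches where it can
# and leaves NaN only for keys genuinely missing from rspars (here 1508)
datalines['rsparid'] = datalines['rsparid'].astype(int)
print(datalines.merge(rspars, how='left', on='rsparid'))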
