Compare Misaligned Series columns Pandas - python

Comparing 2 series objects of different sizes:
IN[248]:df['Series value 1']
Out[249]:
0 70
1 66.5
2 68
3 60
4 100
5 12
Name: Stu_perc, dtype: float64
IN[250]:benchmark_value
#benchmark is a subset of data from df2, based only on certain filters
Out[251]:
0 70
Name: Stu_perc, dtype: int64
Basically I wish to compare df['Series value 1'] with benchmark_value and return, in a column Matching list, the values which are greater than 95% of the benchmark value. Both of these are pandas Series, but their sizes differ, so the comparison fails.
Input given:
IN[252]:df['Matching list']=(df2['Series value 1']>=0.95*benchmark_value)
OUT[253]: ValueError: Can only compare identically-labeled Series objects
Output wanted:
[IN]:
df['Matching list']=(df2['Stu_perc']>=0.95*benchmark_value)
#0.95*Benchmark value is 66.5 in this case.
df['Matching list']
[OUT]:
0 70
1 66.5
2 68
3 NULL
4 100
5 NULL

Because benchmark_value is a Series, select its first value as a scalar with Series.iat and set the non-matching values to NaN with Series.where:
benchmark_value = pd.Series([70], index=[0])
val = benchmark_value.iat[0]
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 NaN
4 100.0 100.0
5 12.0 NaN
A general solution that also works if benchmark_value is empty uses next with iter to return the first value of the Series, falling back to a default value (here 0) if none exists:
benchmark_value = pd.Series([], dtype='float64')
val = next(iter(benchmark_value), 0)
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 60.0
4 100.0 100.0
5 12.0 12.0
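The same result can also be had with numpy.where if you prefer that style; here is a minimal sketch under the same assumptions (val is the scalar benchmark extracted above, and the toy data simply mirrors the question):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Stu_perc': [70, 66.5, 68, 60, 100, 12]})
val = 70  # scalar benchmark value

# keep the percentage where it reaches 95% of the benchmark, otherwise NaN
df2['Matching list'] = np.where(df2['Stu_perc'] >= 0.95 * val,
                                df2['Stu_perc'], np.nan)
print(df2)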

Is your benchmark value a single value?
If yes, you might need to convert benchmark_value, which is a Series, to a plain number (without its index), e.g. by using df['Matching list'] = (df['Stu_perc'] >= 0.95*benchmark_value.values[0])
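Spelled out as a small runnable sketch (toy data mirroring the question; note this produces a boolean column, not the original values):
import pandas as pd

df = pd.DataFrame({'Stu_perc': [70, 66.5, 68, 60, 100, 12]})
benchmark_value = pd.Series([70])

# .values[0] (or .item()) pulls the single number out of the Series, so the
# comparison is Series-vs-scalar and no index alignment is attempted
df['Matching list'] = df['Stu_perc'] >= 0.95 * benchmark_value.values[0]
print(df['Matching list'])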

It seems benchmark_value is a Series with a single row, so it is not an actual number; I believe you need to access that value first.
But the comparison alone will return a Series of booleans. To get just the values that you want, you can use the where function.
Try this:
df['Matching list'] = df2['Stu_perc'].where(df2['Stu_perc'] >= 0.95*benchmark_value[0])

Related

Pandas create column with names of columns with lowest match

I have a Pandas dataframe with points and the corresponding distances to other points. I am able to get the minimal value of the calculated columns; however, I need the column name itself. I am unable to figure out how I can get the column name corresponding to that value into a new column. My dataframe looks like this:
df.head():
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 218.039561
71 100.0 381.0 925.324708 ... 647.707783 169.856557 169.856557
61 225.0 69.0 751.353014 ... 515.152768 122.377490 122.377490
Columns 0 and 1 are datapoints, the rest are distances to datapoints #1 to 7 (in some cases the number of points can differ, which does not really matter for the question). The code I use to compute the minimum is the following:
new = users.iloc[:,2:].min(axis=1)
users["min"] = new
#could also do the following way
#users.assign(Min=lambda users: users.iloc[:,2:].min(1))
This is quite simple, and there is not much to finding the minimum of multiple columns. However, I need to get the column name instead of the value. So my desired output would look like this (in the example all are 7, which is not a rule):
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 7
71 100.0 381.0 925.324708 ... 647.707783 169.856557 7
61 225.0 69.0 751.353014 ... 515.152768 122.377490 7
Is there a simple way to achieve this?
Use df.idxmin:
In [549]: df['min'] = df.iloc[:,2:].idxmin(axis=1)
In [550]: df
Out[550]:
0 1 2 6 7 min
9 58.0 94.0 984.003636 696.667367 218.039561 7
71 100.0 381.0 925.324708 647.707783 169.856557 7
61 225.0 69.0 751.353014 515.152768 122.377490 7
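If you need both the smallest distance and the column it came from, here is a small sketch (the toy frame below only mirrors the shape of the question, with columns 0 and 1 as the points and the rest as distances):
import pandas as pd

df = pd.DataFrame({0: [58.0, 100.0], 1: [94.0, 381.0],
                   2: [984.0, 925.3], 7: [218.0, 169.9]})

dist = df.iloc[:, 2:]                  # distance columns only
df['min'] = dist.min(axis=1)           # smallest distance per row
df['min_col'] = dist.idxmin(axis=1)    # name of the column holding it
print(df)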

Pandas join + fillna of two data frames replaces all values and not only NaN

The following code updates the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The updated data frame grp1 contains the number of sold items. I would like to subtract data frame grp1 from data frame dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna. First of all, the datatypes are changed from int to float, and not only the NaN values but also the non-null values are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform, but this did not change anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after a comment: the indices are of object dtype.
Your indices are string objects. You need to convert these to numeric. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
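"The rest" would then look roughly like this; a hedged sketch with toy stand-ins for dr and grp1, both indexed by the article number stored as strings:
import pandas as pd

dr = pd.DataFrame({'lager_nr': [62808, 62809],
                   'menge_im_lager': [1, 10]},
                  index=['80009359', '80009360'])
grp1 = pd.DataFrame({'ArtMengen1': [3]}, index=['80009360'])

# convert the string indices to numbers so that join can align them
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)

dr_update = dr.join(grp1).fillna(0)
dr_update['menge_im_lager'] = (dr_update['menge_im_lager']
                               - dr_update['ArtMengen1']).astype(int)
print(dr_update)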
You can filter the old stock dataframe dr to match the sold stock, then subtract, and assign back to the original filtered dataframe.
# Filter the old stock dataframe so that you have matching index to the sold dataframe.
# Restrict just for menge_im_lager. Then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
    dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you firstly want the non-matching indices to be in your final dataset, and secondly you want your final dataset to contain integers. You can use an 'outer' join and astype(int) on your dataset.
So, at the join you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)
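Another option, not from the original answers but worth noting as a sketch: Series.sub with fill_value=0 treats a missing counterpart as zero, so articles without sales keep their stock and the result can be cast back to int in one step (toy data below; articles that appear only in grp1 are dropped again when assigning back, because assignment aligns on dr's index):
import pandas as pd

dr = pd.DataFrame({'menge_im_lager': [1, 10, 18]},
                  index=[80009359, 80009360, 80009072])
grp1 = pd.DataFrame({'ArtMengen1': [3, 99]},
                    index=[80009360, 12345678])  # 12345678 is not in stock

dr['menge_im_lager'] = (dr['menge_im_lager']
                        .sub(grp1['ArtMengen1'], fill_value=0)
                        .astype(int))
print(dr)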

iteritems() in dataframe column

I have a dataset of U.S. Education Datasets: Unification Project. I want to find out
Number of rows where enrolment in grades 9 to 12 (column: GRADES_9_12_G) is less than 5,000
Number of rows where enrolment in grades 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.
I am having a problem updating the counts whenever the condition in the if statement is true.
import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape
df = df.iloc[:, -6]
for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1
print(str(count) + str(count1))
df looks like this
0 196386.0
1 30847.0
2 175210.0
3 123113.0
4 1372011.0
5 160299.0
6 126917.0
7 28338.0
8 18173.0
9 511557.0
10 315539.0
11 43882.0
12 66541.0
13 495562.0
14 278161.0
15 138907.0
16 120960.0
17 181786.0
18 196891.0
19 59289.0
20 189795.0
21 230299.0
22 419351.0
23 224426.0
24 129554.0
25 235437.0
26 44449.0
27 79975.0
28 57605.0
29 47999.0
...
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
1467 NaN
1468 NaN
1469 NaN
1470 NaN
1471 NaN
1472 NaN
1473 NaN
1474 NaN
1475 NaN
1476 NaN
1477 NaN
1478 NaN
1479 NaN
1480 NaN
1481 NaN
1482 NaN
1483 NaN
1484 NaN
1485 NaN
1486 NaN
1487 NaN
1488 NaN
1489 NaN
1490 NaN
1491 NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64
In the output I got
00
With Pandas, using loops is almost always the wrong way to go. (The immediate bug in your loop is that count and count1 are reset to 0 on every iteration, so only the last row is ever counted, which is why you get 00.) You probably want something like this instead:
print(len(df.loc[df['GRADES_9_12_G'] < 5000]))
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to query the dataframe for all rows where the column df['GRADES_9_12_G'] is less than 5000. I am then calling Python's builtin len function to return the length of the returned DataFrame, which outputs 184. This is essentially a boolean masking process which returns all True values of your df that meet the condition you give it.
The second query df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)]
uses an & operator which is a bitwise operator that requires both conditions to be met for a row to be returned. We then call the len function on that as well to get an integer value of the number of rows which outputs 52.
To go off your method:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
df = df.iloc[:, -6] # select all rows for your column -6
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
Why did I change the code in my answer instead of using yours?
Simply put, it is more idiomatic pandas. Where we can, it is cleaner to use pandas built-ins than to iterate over dataframes with for loops, as this is what pandas was designed for.
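For completeness, a sketch of the original loop with the counter bug fixed (counters initialised once, before the loop; note that in recent pandas iteritems() has been renamed to items(), and the file name is the one from the question):
import pandas as pd

df = pd.read_csv("states_all.csv")
grades = df["GRADES_9_12_G"]

count = 0    # rows with enrolment below 5,000
count1 = 0   # rows with enrolment between 10,000 and 20,000
for key, value in grades.items():   # iteritems() in older pandas
    if value < 5000:
        count += 1
    elif 10000 < value < 20000:
        count1 += 1
print(count, count1)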

Python pandas show repeated values

I'm trying to read data from a txt file with pandas.read_csv, but it doesn't show the repeated (same) values in the file; for example, I have 2043 in several rows but it is shown only once, not in every row.
My file sample and the result set are in screenshots (not reproduced here); all the circles I've drawn should also be 2043, but they are empty.
My code is:
import pandas as pd
df = pd.read_csv('samplefile.txt', sep='\t', header=None,
                 names=["234", "235", "236"])
You get a MultiIndex, so repeated values in the first index level are simply not displayed.
You can convert MultiIndex to columns by reset_index:
df = df.reset_index()
Or specify every column in the names parameter to avoid the MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t',
                 names=["one", "two", "next", "234", "235", "236"])
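To see why the values seem to disappear, a tiny sketch: with a MultiIndex, pandas prints a repeated value in an outer level only once, and reset_index() turns it back into an ordinary, fully repeated column (the numbers below are made up):
import pandas as pd

df = pd.DataFrame({'234': [1, 2, 3]},
                  index=pd.MultiIndex.from_tuples(
                      [(2043, 'a'), (2043, 'b'), (2044, 'a')],
                      names=['one', 'two']))
print(df)                # 2043 is shown only on its first row
print(df.reset_index())  # 2043 is repeated in a normal column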
A word of warning with MultiIndex, as I was bitten by this yesterday and wasted time trying to troubleshoot a non-existent problem.
If one of your index levels is of type float64 you may find that the indexes are not shown in full. I had a dataframe I was running df.groupby().describe() on, and the variable I was grouping by was originally a long int; at some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, so on printing it appeared that the groupby() had found only one value of the first index with multiple levels of the second index.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd

index = np.random.uniform(low=89908893132829,
                          high=89908893132929,
                          size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
                  index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
Run groupby(['index1', 'index2']).describe() and it looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either one, you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby(['index1', 'index2']) had produced only one level of index1!
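One way to guard against this kind of display confusion, offered here only as a suggestion rather than something from the original thread, is to check the number of distinct values directly and, if the level really holds integers (as it originally did here), to store it with an integer dtype so nothing is rounded when printed (continuing the df built in the example above):
# cheap sanity check: how many distinct values does the level actually hold?
print(df.index.get_level_values('index1').nunique())

# if the values are really integers, make the dtype say so before grouping
df = df.reset_index()
df['index1'] = df['index1'].astype('int64')
df = df.set_index(['index1', 'index2'])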

Replacing a pandas DataFrame value with np.nan when the value ends with '_h'

Below is the code that I have been working with to replace some values with np.NaN. My issue is how to replace '47614750001_h' at index 111 with np.NaN. I can do this directly with drop_list; however, I need to repeat this with different values ending in '_h' over many files and would like to do it automatically.
I have tried some searches on regex, as it seems the way to go, but could not find what I needed.
drop_list = ['dash_code', 'SONIC WELD']
df_clean.replace(drop_list, np.NaN).tail(10)
DASH_CODE Name Quantity
107 1011567 .156 MALE BULLET TERM INSUL 1.0
108 102066901 .032 X .187 FEMALE Q.D. TERM. 1.0
109 105137901 TERM,RING,10-12AWG,INSULATED 1.0
110 101919701 1/4 RING TERM INSUL 2.0
111 47614750001_h HARNESS, MAIN, AC, LIO 1.0
112 NaN NaN 19.0
113 7685 5/16 RING TERM INSUL. 1.0
114 102521601 CLIP,HARNESS 2.0
115 47614808001 CAP, RESISTOR, TERMINATION 1.0
116 103749801 RECPT, DEUTSCH, DTM04-4P 1.0
You can use pd.Series.apply for this with a lambda:
df['DASH_CODE'] = df['DASH_CODE'].apply(lambda x: np.NaN if isinstance(x, str) and x.endswith('_h') else x)
From the documentation:
Invoke function on values of Series. Can be ufunc (a NumPy function
that applies to the entire Series) or a Python function that only
works on single values
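A vectorized alternative along the same lines, sketched here rather than taken from the original answer (the toy frame just mirrors the DASH_CODE column), builds a boolean mask with the .str accessor and masks the column; na=False keeps the row that is already NaN from tripping it up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'DASH_CODE': ['1011567', '47614750001_h', np.nan, '7685']})

mask = df['DASH_CODE'].str.endswith('_h', na=False)  # True only for codes ending in '_h'
df['DASH_CODE'] = df['DASH_CODE'].mask(mask, np.nan)
print(df)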
It may be faster to try to convert all the rows to float using pd.to_numeric:
In [11]: pd.to_numeric(df.DASH_CODE, errors='coerce')
Out[11]:
0 1.011567e+06
1 1.020669e+08
2 1.051379e+08
3 1.019197e+08
4 NaN
5 NaN
6 7.685000e+03
7 1.025216e+08
8 4.761481e+10
9 1.037498e+08
Name: DASH_CODE, dtype: float64
In [12]: df["DASH_CODE"] = pd.to_numeric(df["DASH_CODE"], errors='coerce')
