Using different dataframes to change column values - Python Pandas

I have a dataframe1 like the following:
A B C D
1 111 a 9
2 121 b 8
3 122 c 7
4 121 d 6
5 131 e 5
Also, I have another dataframe2:
Code String
111 s
12 b
13 u
What I want is to create a dataframe like the following:
A B C D
1 111 s 9
2 121 b 8
3 122 c 7
4 121 b 6
5 131 u 5
That is, take the first n digits of column B (where n is the number of digits of a Code in dataframe2); if those digits match the code, replace the value in column C of dataframe1 with the corresponding String from dataframe2.

Is this what you want? The code is not very neat, but it works:
import pandas as pd

# Map each code to its replacement string: {111: ['s'], 12: ['b'], 13: ['u']}
DICT = df2.set_index('Code').T.to_dict('list')
Temp = []
for key, value in DICT.items():
    n = len(str(key))  # number of digits in this code
    D1 = {str(key): value[0]}
    # Rows whose first n digits of B match the code get the new string, else NaN
    T = df1.B.astype(str).str[:n].map(D1)
    Temp2 = df1.B.astype(str).str[:n]
    Tempdf = pd.DataFrame({'Ori': df1.B, 'Now': Temp2, 'C': df1.C})
    # Keep the replacement only where B is the smallest value sharing this prefix
    TorF = Tempdf.groupby('Now')['Ori'].transform('min') == Tempdf['Ori']
    for i in range(len(T)):
        if not TorF[i]:
            T[i] = Tempdf.loc[i, 'C']  # otherwise restore the original C value
    Temp.append(T)

# Stack the per-code results and back-fill so every row picks up its match
df1.C = pd.DataFrame(data=Temp).bfill().T.iloc[:, 0]
Output:
A B C D
0 1 111 s 9
1 2 121 b 8
2 3 122 c 7
3 4 121 b 6
4 5 131 u 5
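For reference, a much shorter sketch that reproduces the same output. It assumes the rule implied by the example: for each code, only the rows whose B equals the smallest B among the rows sharing that code's prefix get the replacement string.

for code, repl in zip(df2['Code'].astype(str), df2['String']):
    # Rows of df1 whose leading digits of B match this code
    mask = df1['B'].astype(str).str.startswith(code)
    if mask.any():
        # Replace C only where B is the smallest matching value
        df1.loc[mask & df1['B'].eq(df1.loc[mask, 'B'].min()), 'C'] = repl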

Related

Find missing numbers in a column dataframe pandas

I have a dataframe with stores and their invoice numbers, and I need to find the missing invoice numbers in each Store's consecutive sequence, for example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!
You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: set(range(s.min(), s.max())).difference(s))
    .explode().reset_index()
)
NB. If you want the missing values sorted, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)) instead.
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Here's an approach:
import pandas as pd
import numpy as np

df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)

# Per-store min and max invoice numbers
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store, row in df2.iterrows():
    # Missing numbers = full range minus the invoices actually present
    df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'], row['max']+1),
                                               df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns=['min','max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.

Count consecutive numbers from a column of a dataframe in Python

I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c whose values count sequentially within each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How can I do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea (labelling runs by subtracting a running index) comes from an old recipe in the Python itertools docs.
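For illustration, a minimal sketch of that recipe in plain Python (not needed for the pandas solution above):

from itertools import groupby

data = [1, 2, 3, 4, 5, 15, 16, 17, 18, 203, 204, 205]
# Consecutive values share the same value-minus-index key,
# so groupby splits the list into runs
runs = [[v for _, v in grp]
        for _, grp in groupby(enumerate(data), key=lambda t: t[1] - t[0])]
print(runs)  # [[1, 2, 3, 4, 5], [15, 16, 17, 18], [203, 204, 205]]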
The walrus operator (:=, an assignment expression) requires Python 3.8+; on older versions you can split it into two statements:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
A simple solution is to flag the rows that continue a run, use cumsum to get a running count, and then subtract the count carried over from earlier runs:
# True where a[i] equals a[i-1] + 1, i.e. the row continues a run
a = df['a'].add(1).shift(1).eq(df['a'])
# Running count, minus the count frozen at each run start, plus 1
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
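An equivalent and arguably more readable idiom (a sketch along the same lines): label each run with diff/cumsum, then number the rows within each run using groupby.cumcount.

# A new run starts wherever the step from the previous value is not 1
group_ids = df['a'].diff().ne(1).cumsum()
df['c'] = df.groupby(group_ids).cumcount() + 1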

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset, not just the summed columns keyed by one unique column. I further need to multiply these summed values with other columns.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 NaN 1 2 2 300
I would like to sum up columns B, C & D based on the unique id and then multiply the result by columns A and E into a new column F. I do not want to sum up the values of columns A & E.
I would like the resultant dataframe to be something like this, where NaN values are skipped during the calculation:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
If the above is unachievable, then I would like something like this, where the rows stay the same but the calculation is as described above, based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
My logic earlier was to apply groupby on columns B, C, D and then multiply, but that is not working out for me. If the above dataframes are unachievable, then please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
First sum columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum', 'D': 'sum',
                               'E': 'first'})
# fillna(1) makes NaN the multiplicative identity, so it is skipped in the product
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
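If you instead want the second layout, which keeps every original row, one sketch (assuming only F needs to be broadcast back) is to map the per-id product onto the original frame:

per_id = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum',
                               'D': 'sum', 'E': 'first'})
F = per_id.fillna(1).astype('int64').agg('prod', axis=1)
df['F'] = df['id'].map(F)  # every row of a given id gets the same F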

Pandas - For Each Index, Put All Columns Into Rows [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 3 years ago.
I'm trying to avoid looping, but the title sort of explains the issue.
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the frame directly
df = pd.DataFrame([{'Index': 333, 1: 'A', 2: 'C', 3: 'F', 4: 'B', 5: 'D'},
                   {'Index': 234, 1: 'B', 2: 'D', 3: 'C', 4: 'A', 5: 'Z'}])
df.set_index('Index', inplace=True)
print(df)
print(df)
1 2 3 4 5
Index
333 A C F B D
234 B D C A Z
I want to preserve the index, and for each column turn it into a row with the corresponding value like this:
newcol value
Index
333 1 A
333 2 C
333 3 F
333 4 B
333 5 D
234 1 B
234 2 D
234 3 C
234 4 A
234 5 Z
It's somewhat of a transpose issue, but not exactly like that. Any ideas?
You need:
df.stack().reset_index(1, name='value').rename(columns={'level_1': 'newcol'})
# OR: df.reset_index().melt('Index', var_name='newcol', value_name='value').set_index('Index')
# (cc: @anky_91)
Output:
newcol value
Index
333 1 A
333 2 C
333 3 F
333 4 B
333 5 D
234 1 B
234 2 D
234 3 C
234 4 A
234 5 Z
Another solution using to_frame and rename_axis:
df.stack().to_frame('value').rename_axis(index=['','newcol']).reset_index(1)
newcol value
333 1 A
333 2 C
333 3 F
333 4 B
333 5 D
234 1 B
234 2 D
234 3 C
234 4 A
234 5 Z

dropping columns by condition where dtypes are string and numeric

I have the following data (# of columns can vary):
NAME ID POTENTIAL_VOTERS VOTES SPOILT_VOTES LEGAL_VOTES אמת ג ודעם ז ... נץ ע פה ף ףץ קנ קץ רק שס voter_turnout
0 תל אביב - יפו 5000 403338 263205 1860 261345 89567 2628 8488 9 ... 34 132 30241 105 124 2667 2906 209 10189 0.647955
1 ירושלים 3000 385888 258879 3593 255286 24696 53948 3148 10 ... 54 215 10752 37 148 1619 18330 121 30579 0.661555
2 חיפה 4000 243274 151318 1758 149560 37805 4894 12363 24 ... 16 103 16826 40 87 1596 1648 142 3342 0.614780
3 ראשון לציון 8300 195958 138998 1188 137810 31492 924 86 8 ... 16 5 19953 26 68 1821 2258 121 4095 0.703263
4 פתח תקווה 7900 177367 125633 1223 124410 22103 4810 85 8 ... 14 9 14661 15 65 1224 3227 74 6946 0.701427
5 אשדוד 70 170193 115145 1942 113203 9694 11132 33 7 ... 14 10 8841 26 74 1322 4180 80 11923 0.665145
6 נתניה 7400 168914 106738 1270 105468 14575 2921 65 5 ... 14 9 11035 40 63 1089 3177 103 8319 0.624389
When I try to remove columns by a condition on their sum (columns whose total is less than 40000 are not needed), using this code:
df.drop([col for col, val in df.sum().items() if val < 40000], axis=1, inplace=True)
I am getting the following error:
TypeError: '<' not supported between instances of 'str' and 'int'
I assume this is because some of the columns are not numeric (as they contain text). Any idea how to solve this?
The problem here is that sum will concatenate all the strings. You need to filter the df to select just the numeric dtypes and then filter those columns:
In[27]:
df = pd.DataFrame({'a': list('abcd'), 'b':np.random.randn(4), 'c':np.arange(4)})
df
Out[27]:
a b c
0 a -0.053771 0
1 b 0.124416 1
2 c -2.024073 2
3 d -2.541324 3
We can select just the numeric dtypes using select_dtypes, passing np.number:
In[28]:
df1 = df.select_dtypes([np.number])
df1
Out[28]:
b c
0 -0.053771 0
1 0.124416 1
2 -2.024073 2
3 -2.541324 3
Now we can filter the columns:
In[29]:
df1.loc[:,df1.sum() > 1]
Out[29]:
c
0 0
1 1
2 2
3 3
You can see that sum returns the strings concatenated:
In[30]:
df.sum()
Out[30]:
a abcd
b -4.49475
c 6
dtype: object
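Putting this together for the original question, a minimal sketch (using the 40000 threshold from the question): drop only the numeric columns whose sum falls below the threshold, leaving the string columns untouched.

num = df.select_dtypes([np.number])
df = df.drop(columns=num.columns[num.sum() < 40000])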
If you need to remove only numeric columns by the condition:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 100005, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [10111, 30000, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print(df)
A B C D E F
0 a 4 7 10111 5 a
1 b 5 8 30000 3 a
2 c 4 9 5 6 a
3 d 100005 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
k = 40000
a = df.loc[:, pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1) > k]
print (a)
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
Detail:
First convert the summed Series with to_numeric and errors='coerce' to turn the non-parseable string sums into NaN:
print (pd.to_numeric(df.sum(), errors='coerce'))
A NaN
B 100027.0
C 33.0
D 40124.0
E 29.0
F NaN
dtype: float64
Then replace the NaNs with k + 1 so that the non-numeric columns pass the filter:
print (pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1))
A 40001.0
B 100027.0
C 33.0
D 40124.0
E 29.0
F 40001.0
dtype: float64
Finally, compare with k:
print (pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1) > k)
A True
B True
C False
D True
E False
F True
dtype: bool
And filter by boolean indexing:
print (df.loc[:, pd.to_numeric(df.sum(), errors='coerce').fillna(k + 1) > k])
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
---
Alternative solution: omit the string columns when summing, then add True for them back to the mask with reindex:
df = df.loc[:, (df.sum(numeric_only=True) > 40000).reindex(df.columns, fill_value=True)]
print (df)
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
Detail:
First sum only the numeric columns using the parameter numeric_only=True:
print (df.sum(numeric_only=True))
B 100027
C 33
D 40124
E 29
dtype: int64
Compare with 40000:
print (df.sum(numeric_only=True) > 40000)
B True
C False
D True
E False
dtype: bool
Add the string columns back with reindex:
print ((df.sum(numeric_only=True) > 40000).reindex(df.columns, fill_value=True))
A True
B True
C False
D True
E False
F True
dtype: bool
Finally, filter:
print (df.loc[:, (df.sum(numeric_only=True) > 40000).reindex(df.columns, fill_value=True)])
A B D F
0 a 4 10111 a
1 b 5 30000 a
2 c 4 5 a
3 d 100005 7 b
4 e 5 1 b
5 f 4 0 b
sum has a numeric_only parameter that you can make use of:
df.drop(
    [col for col, greater in (df.sum(numeric_only=True) > 40000).items()
     if not greater], axis=1, inplace=True
)
