I have a data set with name and id columns. In theory the name should always correspond to the same id, but due to system errors and data quality issues, in practice this is not always the case.
Generally, the wrong ids occur at a negligible rate compared to the right ids. For example, there will be 1,000 rows where the name 'a' and id '1' match, but 2 rows where the name is 'a' and the id is '7'.
So the logic to resolve the proper id would simply be to find the most frequently occurring id for each name.
d = {'id': ['1', '1', '2', '2'], 'name': ['a', 'a', 'a', 'b'], 'value': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
  id name value
0  1    a     1
1  1    a     2
2  2    a     3
3  2    b     4
The first question: what is the best way to find the proper id for each name and drop the rows where the proper id does not occur? The result would be the following:
  id name value
0  1    a     1
1  1    a     2
2  2    b     4
The second part: in scenarios where the mismatched id is actually the proper id of another name, fix the name to match that id instead of dropping the row. Example output:
  id name value
0  1    a     1
1  1    a     2
2  2    b     3
3  2    b     4
The actual data has thousands of names/ids, the example is just a simplification.
Here is my solution. It's a bit of a makeshift job, but it should work as a temporary solution.
d = {'id': ['1', '1', '2', '2', '2', '3', '3', '4', '4'],
     'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
     'value': ['1', '2', '3', '4', '5', '6', '7', '8', '9']}
df = pd.DataFrame(data=d)
Here is the raw DataFrame, before any id corrections:
  id name value
0  1    a     1
1  1    a     2
2  2    a     3
3  2    b     4
4  2    b     5
5  3    b     6
6  3    c     7
7  4    c     8
8  4    c     9
Workflow:
# convert id and value from string to float
df['id'] = df['id'].astype(float)
df['value'] = df['value'].astype(float)
# extract the most repeated id for each name
def most_common(lst):
    return max(set(lst), key=lst.count)

count = dict()
for name in pd.unique(df['name']):
    count[name] = most_common(list(df[df['name'] == name]['id']))
# correct wrong ids: replace any id that differs from the majority id for its name
replace = [[count[name], name] if id_ != count[name] else [id_, name]
           for id_, name in zip(df['id'], df['name'])]
df['id'] = [item[0] for item in replace]
df['name'] = [item[1] for item in replace]
output:
In [3]: count
Out[3]: {'a': 1.0, 'b': 2.0, 'c': 4.0}
In [1]: df
Out[1]:
id name value
0 1.0 a 1.0
1 1.0 a 2.0
2 1.0 a 3.0
3 2.0 b 4.0
4 2.0 b 5.0
5 2.0 b 6.0
6 4.0 c 7.0
7 4.0 c 8.0
8 4.0 c 9.0
This solution might not work if two different ids occur with exactly the same count for the same name.
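For comparison, here is a hedged, more vectorized sketch of the same majority-vote idea using groupby and mode rather than a Python loop. It assumes mode() ties may be broken arbitrarily and that each majority id belongs to a single name:
# majority id per name (assumption: ties broken by mode()'s sort order)
proper_id_of = df.groupby('name')['id'].agg(lambda s: s.mode().iat[0])
# part one: drop rows whose id is not the majority id for their name
df_part1 = df[df['id'] == df['name'].map(proper_id_of)]
# part two: where a mismatched id is the majority id of another name,
# rewrite the name to match that id (ids that are no name's majority id
# are left untouched, an edge case the question leaves open)
name_of = pd.Series(proper_id_of.index.values, index=proper_id_of.values)
mismatch = df['id'] != df['name'].map(proper_id_of)
fixable = mismatch & df['id'].isin(name_of.index)
df_part2 = df.copy()
df_part2.loc[fixable, 'name'] = df_part2.loc[fixable, 'id'].map(name_of)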
Related
I have a pandas DataFrame named dataframe.
I want to add two rows of 0s, one at the start and one at the end of the data frame.
# create DataFrame
df_x = pd.DataFrame({'logvalue': ['20', '20.5', '18.5', '2', '10'],
                     'ID': ['1', '2', '3', '4', '5']})
Output should look like below.
logvalue    ID    violatedInstances
0           0     0
20          1     0
20.5        2     1
18.5        3     0
2           4     1
10          5     1
0           0     0
The output should rearrange the indexes of the dataframe as well.
How can I do this in pandas?
You can use concat:
First create a new dataframe (df_y) that contains the zeroed row
Use the concat function to join this dataframe with the original
Use the reset_index(drop=True) function to reset the index.
Code:
df_x = pd.DataFrame({'logvalue': [20.0, 20.5, 18.5, 2.0, 10.0],
                     'ID': [1, 2, 3, 4, 5],
                     'violatedInstances': [0, 1, 0, 1, 1]})
# Extract the column names from the original dataframe
column_names = df_x.columns
number_of_columns = len(column_names)
row_of_zeros = [0]*number_of_columns
# Create a new dataframe that has a row of zeros
df_y = pd.DataFrame([row_of_zeros], columns=column_names)
# Join the dataframes together
output = pd.concat([df_y, df_x, df_y]).reset_index(drop=True)
print(output)
Output:
   logvalue  ID  violatedInstances
0       0.0   0                  0
1      20.0   1                  0
2      20.5   2                  1
3      18.5   3                  0
4       2.0   4                  1
5      10.0   5                  1
6       0.0   0                  0
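As a minor variant (a hedged note, not required above), pd.concat can fold the index reset into the call itself:
# ignore_index=True renumbers the result from 0, replacing reset_index(drop=True)
output = pd.concat([df_y, df_x, df_y], ignore_index=True)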
Example
df_x = pd.DataFrame({'logvalue': ['20', '20.5', '18.5', '2', '10'],
                     'ID': ['1', '2', '3', '4', '5']})
df_x
logvalue ID
0 20 1
1 20.5 2
2 18.5 3
3 2 4
4 10 5
Code
Use reindex with fill_value:
idx = ['start'] + df_x.index.tolist() + ['end']
df_x.reindex(idx, fill_value=0).reset_index(drop=True)
result:
logvalue ID
0 0 0
1 20 1
2 20.5 2
3 18.5 3
4 2 4
5 10 5
6 0 0
'start' and 'end' in the idx variable can be any labels that are not already in the index of df_x.
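As a quick illustration of that point (a sketch, with arbitrarily chosen labels), any other unused labels behave the same way:
# -1 and len(df_x) are not in the RangeIndex either, so both rows get the 0 fill
idx = [-1] + df_x.index.tolist() + [len(df_x)]
df_x.reindex(idx, fill_value=0).reset_index(drop=True)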
>>> df = pd.DataFrame({'id': ['1', '1', '2', '2', '3', '4', '4', '5', '5'],
... 'value': ['keep', 'y', 'x', 'keep', 'x', 'Keep', 'x', 'y', 'x']})
>>> print(df)
id value
0 1 keep
1 1 y
2 2 x
3 2 keep
4 3 x
5 4 Keep
6 4 x
7 5 y
8 5 x
In this example, the idea would be to keep index values 0, 3, 4, and 5, since they are associated with a duplicated id and a particular value == 'keep'/'Keep', and also index 7 (since it is the first of the duplicates for id 5).
In your case, try idxmax:
out = df.loc[df['value'].eq('keep').groupby(df.id).idxmax()]
Out[24]:
id value
0 1 keep
3 2 keep
4 3 x
5 4 Keep
7 5 y
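Note that the sample mixes 'keep' and 'Keep'. eq('keep') still yields the expected rows here only because idxmax falls back to a group's first index when no True exists; if the capitalised 'Keep' should count as a genuine match (an assumption about intent), a case-insensitive variant is a one-line change:
# lower-case the values before comparing, so 'Keep' also matches
out = df.loc[df['value'].str.lower().eq('keep').groupby(df.id).idxmax()]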
Based on this problem: find duplicated groups in dataframe and this dataframe
df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3']})
How can I mark the different duplicate groups (in the value columns) in an additional column duplicated with a unique label, like "1" for one duplicated group, "2" for the next, and so on? I found examples here on Stack Overflow that identify them as True/False, and one using ngroup, but it did not work.
My real data has 20+ columns and NaNs in between. I created the wide format with pivot_table from the original long format, since I thought finding duplicated entries would be easier in wide form. Duplicates should be found across N-1 columns, whose names I collect into subset with a list comprehension that excludes the identifier column.
This is what I had so far:
df = df_long.pivot_table(index="Y",columns="Z",values="value").reset_index()
subset = [c for c in df.columns if not c=="id"]
df = df.loc[df.duplicated(subset=subset,keep=False)].copy()
We use pandas 0.22, if that matters.
The problem is that when I use
for i, group in df.groupby(subset):
    print(group)
I basically don't get back any groups.
Use groupby with ngroup, as suggested by @Chris:
df['duplicated'] = df.groupby(df.filter(like='value').columns.tolist()).ngroup()
print(df)
# Output:
id value1 value2 value3 duplicated
0 A 1 1 1 0 # Group 0 (all 1)
1 A 2 2 2 1
2 A 3 3 3 2
3 A 4 4 4 3
4 B 1 1 1 0 # Group 0 (all 1)
5 B 2 2 2 1
6 C 1 1 1 0 # Group 0 (all 1)
7 C 2 2 2 1
8 C 3 3 3 2
9 C 4 4 4 3
10 D 1 1 1 0 # Group 0 (all 1)
11 D 2 2 2 1
12 D 3 3 3 2
OK, the last comment above was the correct hint: the NaNs in my real data were the problem, since groupby drops NaN keys and so cannot identify those groups. By using fillna() before groupby, the groups can be identified and ngroup adds the group numbers.
df['duplicated'] = df.fillna(-1).groupby(df.filter(like='value').columns.tolist()).ngroup()
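A small illustration of that pitfall (a made-up toy frame, just for demonstration): rows with NaN in the key columns are dropped by groupby, so ngroup() marks them -1, while filling with a sentinel first keeps them in a real group. On newer pandas (1.1+), groupby(..., dropna=False) achieves the same without the fillna round-trip.
import numpy as np
demo = pd.DataFrame({'value1': [1, np.nan, 1], 'value2': [1, np.nan, 1]})
print(demo.groupby(['value1', 'value2']).ngroup())             # NaN row is marked -1
print(demo.fillna(-1).groupby(['value1', 'value2']).ngroup())  # NaN row gets a real group number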
This has been bugging me for a while now. How can I achieve =INDEX(A:A,MATCH(E1&F1,B:B&C:C,0)) in Python? In Excel, this returns an error if the value is not found.
So I started playing with pd.merge_asof, but every way I try it, it only returns errors.
df_3 = pd.merge_asof(df_1, df_2, on=['x', 'y'], allow_exact_matches=False)
Would give the error:
pandas.tools.merge.MergeError: can only asof on a key for left
Edit:
import pandas as pd
df_1 = pd.DataFrame({'x': ['1', '1', '2', '2', '3', '3', '4', '5', '5', '5'],
                     'y': ['smth1', 'smth2', 'smth1', 'smth2', 'smth1', 'smth2', 'smth1', 'smth1', 'smth2', 'smth3']})
df_2 = pd.DataFrame({'x': ['1', '2', '2', '3', '4', '5', '5'],
                     'y': ['smth1', 'smth1', 'smth2', 'smth3', 'smth1', 'smth1', 'smth3'],
                     'z': ['other1', 'other1', 'other2', 'other3', 'other1', 'other1', 'other3']})
So that's a sample where I could simply do this in Excel with the above formula and get something like this:
x y z
1 smth1 other1
1 smth2 #NA
2 smth1 other1
2 smth2 other2
3 smth1 #NA
3 smth2 #NA
4 smth1 other1
5 smth1 other1
5 smth2 #NA
5 smth3 other3
So, is there an easy way to achieve Excel's INDEX/MATCH formula in pandas?
Let's try merge with how='left':
df_1.merge(df_2, on=['x','y'], how='left')
Output:
x y z
0 1 smth1 other1
1 1 smth2 NaN
2 2 smth1 other1
3 2 smth2 other2
4 3 smth1 NaN
5 3 smth2 NaN
6 4 smth1 other1
7 5 smth1 other1
8 5 smth2 NaN
9 5 smth3 other3
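If you also want an explicit marker for the Excel-style #N/A rows rather than just NaN in z (a hedged extra, not required for the lookup itself), merge's indicator flag helps:
# indicator=True adds a _merge column: 'both' for matches, 'left_only' for misses
merged = df_1.merge(df_2, on=['x', 'y'], how='left', indicator=True)
print(merged[merged['_merge'] == 'left_only'])  # these are the #N/A rows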
I need to use the third row as the labels for a dataframe, but keep the first two rows for other uses. How can you change the column labels of an existing dataframe to the values of an existing row?
So basically this dataframe
A B C D
1 2 3 4
5 7 8 9
a b c d
6 4 2 1
becomes
a b c d
6 4 2 1
And I cannot just set the headers when the file is read in, because I need the first two rows and the labels for some processing.
One way would be just to take a slice and then overwrite the columns:
In [71]:
df1 = df.loc[3:]
df1.columns = df.loc[2].values
df1
Out[71]:
a b c d
3 6 4 2 1
You can then assign back to df a slice of the rows of interest:
In [73]:
df = df[:2]
df
Out[73]:
A B C D
0 1 2 3 4
1 5 7 8 9
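One hedged refinement: taking the slice with .copy() makes df1 an independent frame, which sidesteps view-versus-copy ambiguity if you later assign data into it:
# an explicit copy, so later writes to df1 cannot interact with df
df1 = df.loc[3:].copy()
df1.columns = df.loc[2].values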
First copy the first two rows into a new DataFrame. Then rename the columns using the data contained in the third row (index 2). Finally, delete the first three rows of data.
import pandas as pd
df = pd.DataFrame({'A': {0: '1', 1: '5', 2: 'a', 3: '6'},
                   'B': {0: '2', 1: '7', 2: 'b', 3: '4'},
                   'C': {0: '3', 1: '8', 2: 'c', 3: '2'},
                   'D': {0: '4', 1: '9', 2: 'd', 3: '1'}})
df2 = df.loc[:1, :].copy()
df.columns = df.loc[2].tolist()
df.drop(df.index[:3], inplace=True)
>>> df
a b c d
3 6 4 2 1
>>> df2
A B C D
0 1 2 3 4
1 5 7 8 9
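If the surviving row should be renumbered from 0 (an optional step, assuming downstream code expects a clean index), a reset finishes the job:
df = df.reset_index(drop=True)
print(df)  # same columns a b c d, with the single row now at index 0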