I'm working with a 2D list of numbers similar to the example below, and I am trying to reorder the columns:
D C B A
1 3 2 0
1 3 2 0
1 3 2 0
The first row of the list is reserved for letters to reference each column.
How can I sort this list so that these columns are placed in alphabetical order to achieve the following:
A B C D
0 2 3 1
0 2 3 1
0 2 3 1
I've found examples that make use of lambdas for sorting, but have not found any similar examples that sort columns by characters.
I'm not sure how to achieve this sorting and would really appreciate some help.
zip() the 2D list, sort by the first item, then zip() again.
>>> table = [['D', 'C', 'B', 'A',],
... [1, 3, 2, 0,],
... [1, 3, 2, 0],
... [1, 3, 2, 0]]
>>> for row in zip(*sorted(zip(*table), key=lambda x: x[0])):
... print(*row)
...
A B C D
0 2 3 1
0 2 3 1
0 2 3 1
Assume the values are stored row-by-row in a list, like this:
a = [['D', 'C', 'B', 'A'],
     ['1', '3', '2', '0'],
     ['1', '3', '2', '0']]
To sort this array you can use the following code (the outer list() materializes the zip iterator in Python 3):
list(zip(*sorted(zip(*a), key=lambda column: column[0])))
where column[0] is the value to sort by (use column[1], etc., to sort by a different row).
Output:
[('A', 'B', 'C', 'D'),
('0', '2', '3', '1'),
('0', '2', '3', '1')]
Note:
If you are working with pretty big arrays and execution time matters, consider using NumPy, which has an appropriate method: NumPy sort.
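For column reordering specifically, a NumPy sketch (assuming the whole table can live in one string array) would use argsort on the header row rather than sort itself:

```python
import numpy as np

# the same table as above, as a NumPy array of strings
table = [['D', 'C', 'B', 'A'],
         ['1', '3', '2', '0'],
         ['1', '3', '2', '0']]
a = np.array(table)

# argsort the header row, then take the columns in that order
sorted_a = a[:, a[0].argsort()]
print(sorted_a)
```

Note that converting to a single array makes every cell a string; if the numeric rows matter, convert them back afterwards.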
>>> df = pd.DataFrame({'id': ['1', '1', '2', '2', '3', '4', '4', '5', '5'],
... 'value': ['keep', 'y', 'x', 'keep', 'x', 'Keep', 'x', 'y', 'x']})
>>> print(df)
id value
0 1 keep
1 1 y
2 2 x
3 2 keep
4 3 x
5 4 Keep
6 4 x
7 5 y
8 5 x
In this example, the idea would be to keep index values 0, 3, 4, 5, since they are associated with a duplicate id with a particular value == 'keep', and 7 (since it is the first of the duplicates for id 5).
In your case, try idxmax:
out = df.loc[df['value'].eq('keep').groupby(df.id).idxmax()]
Out[24]:
id value
0 1 keep
3 2 keep
4 3 x
5 4 Keep
7 5 y
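Why this works, as a minimal sketch: the boolean mask is True only where value equals 'keep' exactly, and per-group idxmax returns the index of the first True, or the first row of the group when no True exists (which is how ids 3, 4, and 5 still produce a row):

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '1', '2', '2', '3', '4', '4', '5', '5'],
                   'value': ['keep', 'y', 'x', 'keep', 'x', 'Keep', 'x', 'y', 'x']})

# boolean mask: True where value is exactly 'keep' ('Keep' is False)
mask = df['value'].eq('keep')

# per id: index of the first True, or the group's first row if all False
keep_idx = mask.groupby(df['id']).idxmax()
out = df.loc[keep_idx]
print(out)
```

This relies on idxmax treating booleans numerically (True > False), which is the standard pandas behavior.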
Based on this problem (find duplicated groups in dataframe) and this dataframe:
df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   })
How can I mark the different duplicate groups (in the value columns) in an additional column duplicated, with a unique label like "1" for one duplicated group, "2" for the next, and so on? I found examples here on Stack Overflow that identify them as True/False, and one that uses "ngroup", but it did not work.
My real example has 20+ columns and also NaNs in between. I created the wide format with pivot_table from the original long format, since I thought finding duplicated entries would be easier in wide format. Duplicates should be found across N-1 columns, whose names I collect in subset with a list comprehension that excludes the identifier column.
This is what I had so far:
df = df_long.pivot_table(index="Y",columns="Z",values="value").reset_index()
subset = [c for c in df.columns if not c=="id"]
df = df.loc[df.duplicated(subset=subset,keep=False)].copy()
We use pandas 0.22, if that matters.
The problem is that when I use
for i, group in df.groupby(subset):
    print(group)
I basically don't get back any group.
Use groupby + ngroup, as suggested by @Chris:
df['duplicated'] = df.groupby(df.filter(like='value').columns.tolist()).ngroup()
print(df)
# Output:
id value1 value2 value3 duplicated
0 A 1 1 1 0 # Group 0 (all 1)
1 A 2 2 2 1
2 A 3 3 3 2
3 A 4 4 4 3
4 B 1 1 1 0 # Group 0 (all 1)
5 B 2 2 2 1
6 C 1 1 1 0 # Group 0 (all 1)
7 C 2 2 2 1
8 C 3 3 3 2
9 C 4 4 4 3
10 D 1 1 1 0 # Group 0 (all 1)
11 D 2 2 2 1
12 D 3 3 3 2
OK, the last comment above was the correct hint: the NaNs in my real data were the problem, since groupby cannot use them to identify groups. By applying fillna() before groupby, the groups can be identified, and ngroup adds the group numbers.
df['duplicated'] = df.fillna(-1).groupby(df.filter(like='value').columns.tolist()).ngroup()
I have a data set with name and id columns. In theory the name should always correspond to the same id, but due to system errors and data quality issues, in practice this is not always the case.
Generally, the wrong ids occur at an extremely negligible rate compared to the right ids. For example, there will be 1000 rows where the name 'a' and id '1' match, but 2 rows where the name is 'a' and the id is '7'.
So the logic to resolve the proper id would simply be to find the most frequently occurring id for each name.
d = {'id': ['1', '1', '2', '2',], 'name': ['a', 'a', 'a', 'b'], 'value': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
id name value
0 1 a 1
1 1 a 2
2 2 a 3
3 2 b 4
The first question is: what is the best way to find the proper id for each name and drop the rows where the proper id does not occur? The result would be the following:
id name value
0 1 a 1
1 1 a 2
2 2 b 4
The second part is: in the scenarios where the mismatched id is actually the id of another name, fix the name to match the proper id. Example output:
id name value
0 1 a 1
1 1 a 2
2 2 b 3
3 2 b 4
The actual data has thousands of names/ids, the example is just a simplification.
Here is my solution. It's a bit of a makeshift job, but it should work as a temporary solution.
d = {'id': ['1', '1', '2', '2', '2', '3', '3', '4', '4'],
     'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
     'value': ['1', '2', '3', '4', '5', '6', '7', '8', '9']}
df = pd.DataFrame(data=d)
Here is the raw DataFrame, before any id changes:
id name value
0 1 a 1
1 1 a 2
2 2 a 3
3 2 b 4
4 2 b 5
5 3 b 6
6 3 c 7
7 4 c 8
8 4 c 9
Workflow:
# convert id, value from string to float
df['id'] = [float(id) for id in df['id']]
df['value'] = [float(value) for value in df['value']]
# extract most repeated id for one name
def most_common(lst):
    return max(set(lst), key=lst.count)
count = dict()
for name in pd.unique(df['name']):
temp = {name: most_common(list(df[df['name'] == name]['id']))}
count.update(temp)
# correct wrong id
replace = [[count[name], name] if id != count[name] else [id, name] for id, name in zip(df['id'],df['name'])]
df['id'] = [item[0] for item in replace]
df['name'] = [item[1] for item in replace]
output:
In [3]: count
Out[3]: {'a': 1.0, 'b': 2.0, 'c': 4.0}
In [1]: df
Out[1]:
id name value
0 1.0 a 1.0
1 1.0 a 2.0
2 1.0 a 3.0
3 2.0 b 4.0
4 2.0 b 5.0
5 2.0 b 6.0
6 4.0 c 7.0
7 4.0 c 8.0
8 4.0 c 9.0
This solution might not work if you have the exact same count of two different ids for the same name.
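For comparison, a more idiomatic pandas sketch of the same idea replaces each id with the per-name mode via groupby/transform (this is an alternative formulation, not the answer's code; note that on a tie, mode() returns values in sorted order, so .iat[0] picks the smallest):

```python
import pandas as pd

d = {'id': ['1', '1', '2', '2', '2', '3', '3', '4', '4'],
     'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
     'value': ['1', '2', '3', '4', '5', '6', '7', '8', '9']}
df = pd.DataFrame(data=d)

# replace each id with the most frequent id for its name
df['id'] = df.groupby('name')['id'].transform(lambda s: s.mode().iat[0])
print(df)
```

This keeps the ids as strings and avoids the manual count dictionary entirely.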
I have a complex dictionary, such as the below
d = {'a':[['x','y'],['1','2']],'b':[['x','y'],['3','4']]}
I want to convert it into a set of pd dataframes in a loop, such as
a
x y
1 2
b
x y
3 4
Any suggestions how that can be accomplished?
You can use the first list as the column names and the rest as rows:
import pandas as pd
d = {'a': [['x', 'y'], ['1', '2'], ['1', '2'], ['1', '2'], ['1', '2']],
     'b': [['x', 'y'], ['3', '4'], ['3', '4'], ['3', '4'], ['3', '4']]}
# create dict of dataframes with dict comprehension
df_dict = {k: pd.DataFrame(v[1:], columns=v[0]) for k, v in d.items()}
# iterate through df_dict
for k, v in df_dict.items():
    print(k)
    print(v)
#############
a
x y
0 1 2
1 1 2
2 1 2
3 1 2
b
x y
0 3 4
1 3 4
2 3 4
3 3 4
I am dealing with a pandas Series like the following:
x=pd.Series([1, 2, 1, 4, 2, 6, 7, 8, 1, 1], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
The indices are non-unique, but each index always maps to the same value; for example, 'a' always corresponds to 1 in my sample, 'b' always maps to 2, etc. So if I want to see which value corresponds to each index, I simply need to write
x.mean(level=0)
a 1
b 2
c 4
d 6
e 7
f 8
g 1
dtype: int64
The difficulty arises when the values are strings: I can't call mean() on strings, but I would still like to return a similar result in that case. Any ideas on a good way to do that?
x=pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])
So long as your indices map directly to the values, you can simply call drop_duplicates:
In [83]:
x.drop_duplicates()
Out[83]:
a 1
b 2
c 4
d 6
e 7
f 8
dtype: object
example:
In [86]:
x = pd.Series(['XX', 'hello', 'XX', '4', 'hello', '6', '7', '8'], index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f'])
x
Out[86]:
a XX
b hello
a XX
c 4
b hello
d 6
e 7
f 8
dtype: object
In [87]:
x.drop_duplicates()
Out[87]:
a XX
b hello
c 4
d 6
e 7
f 8
dtype: object
EDIT: a roundabout method would be to reset the index so that the index values become a new column, drop duplicates, and then set the index back again:
In [100]:
x.reset_index().drop_duplicates().set_index('index')
Out[100]:
0
index
a 1
b 2
c 4
d 6
e 7
f 8
g 1
pandas.Series.values are numpy ndarrays. Perhaps doing a values.astype(int) would solve your problem?
You can also ensure that you're getting all of the unique indices without reshaping the array by getting a list of the unique index values and plugging that back into the index using iloc. Numpy's unique method includes a return_index arg which provides a tuple of (unique_values, indices):
In [3]: x.iloc[np.unique(x.index.values, return_index=True)[1]]
Out[3]:
a 1
b 2
c 4
d 6
e 7
f 8
g 1
dtype: int64
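Another sketch that works for strings and keeps every index label (including 'g', which plain drop_duplicates loses because its value repeats 'a''s) is to group by the index level and take the first value in each group:

```python
import pandas as pd

x = pd.Series(['1', '2', '1', '4', '2', '6', '7', '8', '1', '1'],
              index=['a', 'b', 'a', 'c', 'b', 'd', 'e', 'f', 'g', 'g'])

# take the first value seen for each index label; works on any dtype,
# and 'g' survives even though its value duplicates 'a''s
result = x.groupby(level=0).first()
print(result)
```

Since the question guarantees each label always maps to the same value, first() is as good as any other aggregation here.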