pandas DataFrame (easy?) manipulation - python

import pandas as pd

df = pd.DataFrame({'id': ['id1', 'id1', 'id2', 'id2'],
                   'value': ['1', '2', '10', '20'],
                   'index': ['day1', 'day2', 'day1', 'day2']})
how can I transform this data correctly (and concisely) with pandas so that it results in:
       id1  id2
day1     1   10
day2     2   20
Maybe something with groupby but without aggregation? I don't know what to google. Can you help me?
thank you very much

Use a pandas pivot. It reshapes the DataFrame based on the given index and columns:
pd.pivot_table(df, index=['index'], columns=['id'], values='value').reset_index()
Just remember to convert 'value' to float or integer first, since pivot_table aggregates the values.
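For example, a minimal end-to-end sketch with the sample frame above (the printed output assumes pivot_table's default mean aggregation, which returns floats):
df['value'] = df['value'].astype(int)  # pivot_table aggregates, so the values must be numeric
out = pd.pivot_table(df, index='index', columns='id', values='value')
print(out)
id      id1   id2
index
day1    1.0  10.0
day2    2.0  20.0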

How to convert elements in Series to Dataframe in python?

I'm new to Python.
I got a DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_a': [1, 2, 3],
                   'conversions': [[{'action_type': 'type1',
                                     'value': '1',
                                     'value_plus_10': '11'},
                                    {'action_type': 'type2',
                                     'value': '2',
                                     'value_plus_10': '12'}],
                                   np.nan,
                                   [{'action_type': 'type3',
                                     'value': '3',
                                     'value_plus_10': '13'},
                                    {'action_type': 'type4',
                                     'value': '4',
                                     'value_plus_10': '14'}]]})
where each value in the column conversions is either a list or NaN.
A value in conversions looks like this:
print(df['conversions'][0])
>>> [{'action_type': 'type1', 'value': '1', 'value_plus_10': '11'}, {'action_type': 'type2', 'value': '2', 'value_plus_10': '12'}]
But it's kinda hard to manipulate, so I want the elements in conversions to be either a DataFrame or NaN, like this:
print(df['conversions'][0])
>>>
action_type value value_plus_10
0 type1 1 11
1 type2 2 12
print(df['conversions'][1])
>>> nan
print(df['conversions'][2])
>>>
action_type value value_plus_10
0 type3 3 13
1 type4 4 14
Here's what I tried:
df['conversions'] = df['conversions'].apply(lambda x: pd.DataFrame(x) if type(x) == 'list' else x)
which runs without error, but nothing really changes.
I could only find ways to convert a Series to a DataFrame, but what I'm trying to do is convert the elements in a Series to DataFrames.
Is it possible to do? Thanks a lot!
Edit: Sorry for the unclear expected output, hope it's clear now.
You can apply the DataFrame constructor to the conversions column. Note that your original attempt compared type(x) == 'list', i.e. a type object against a string, which is always False, so every element was returned unchanged; use isinstance instead:
df['conversions'] = df['conversions'].apply(lambda x: pd.DataFrame(x) if isinstance(x, list) else x)
print(df['conversions'][0])
Output:
action_type value value_plus_10
0 type1 1 11
1 type2 2 12
Edit: it seems I misread your question (which is a bit unclear, to be fair), since you say that this doesn't give the expected result. Are you trying to get all elements in one DataFrame? In that case you can use concat:
df_out = pd.concat([
    pd.DataFrame(x) for x in df['conversions'] if isinstance(x, list)
])
print(df_out)
print(df_out)
Output:
action_type value value_plus_10
0 type1 1 11
1 type2 2 12
0 type3 3 13
1 type4 4 14
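If you want a fresh 0..n-1 index instead of the repeated 0/1 labels above, pass ignore_index=True to concat:
df_out = pd.concat(
    [pd.DataFrame(x) for x in df['conversions'] if isinstance(x, list)],
    ignore_index=True)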

pandas dataframe boolean indexing with multiple conditions from another df

I'm trying to identify the rows shared between two DataFrames, i.e. rows where several columns hold the same values in the SAME row.
Example:
import pandas as pd
df = pd.DataFrame([{'energy': 'power', 'id': '123'}, {'energy': 'gas', 'id': '456'}])
df2 = pd.DataFrame([{'energy': 'power', 'id': '456'}, {'energy': 'power', 'id': '123'}])
df =
energy id
0 power 123
1 gas 456
df2 =
energy id
0 power 456
1 power 123
Therefore, I'm trying to get the rows from df where energy & id match the same row in df2 exactly.
If I do it like this, I get a wrong result:
df2.loc[(df2['energy'].isin(df['energy'])) & (df2['id'].isin(df['id']))]
because each isin check is evaluated column by column independently, so this matches both rows of df2, whereas I would expect only power / 123 to be matched.
How should I do boolean indexing with multiple "dynamic" conditions based on another df's rows, matching the values for the same rows in the other df?
Hope it's clear.
Use an inner merge on both columns; unlike two independent isin checks, it matches the pair of values row-wise:
pd.merge(df, df2, on=['id','energy'], how='inner')
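With the sample frames above this should keep only the row present in both:
  energy   id
0  power  123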

Pandas / Python remove duplicates based on specific row values

I am trying to remove duplicates based on multiple criteria:
Find duplicates in column df['A']
Check column df['Status'] and prioritize OK over Open and Open over Close
If we have duplicates with the same status, pick the latest one based on df['Col_1']
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'],
                   'Status': ['OK', 'Close', 'Close', 'OK', 'OK', 'Open', 'Open', 'Open', np.nan],
                   'Col_1': [2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000]})
df
Expected output:
I have tried different solutions like the link below (map or loc) but I am unable to find the correct way:
Pandas : remove SOME duplicate values based on conditions
Create an ordered categorical to prioritize Status, then sort by all columns, drop duplicates by column A keeping the first row per the sort order, and finally sort the index to restore the original row order:
c = ['OK', 'Open', 'Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = df.sort_values(['A', 'Status', 'Col_1']).drop_duplicates('A').sort_index()
print(df)
A Status Col_1
0 11 OK 2000
2 12 Close 2000
3 NaN OK 2000
4 13 OK 2000
6 14 Open 2000
8 15 NaN 2000
EDIT: If you need to prevent the NaN rows in A from being treated as duplicates of each other, add a helper column:
df['test'] = df['A'].isna().cumsum()
c = ['OK', 'Open', 'Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = (df.sort_values(['A', 'Status', 'Col_1', 'test'])
        .drop_duplicates(['A', 'test'])
        .sort_index())
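The helper column can then be dropped:
df = df.drop(columns='test')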

pandas pivot table aggfunc troubleshooting

This DataFrame has two columns, both are object type.
Dependents Married
0 0 No
1 1 Yes
2 0 Yes
3 0 Yes
4 0 No
I want to aggregate 'Dependents' based on 'Married'.
table = df.pivot_table(
    values='Dependents',
    index='Married',
    aggfunc=lambda x: x.map({'0': 0, '1': 1, '2': 2, '3': 3}).mean())
This works. Surprisingly, however, the following doesn't:
table = df.pivot_table(values='Dependents',
                       index='Married',
                       aggfunc=lambda x: x.map(int).mean())
It produces None instead.
Can anyone help explain?
Both examples of code provided in your question work. However, they are not the idiomatic way to achieve what you want to do -- particularly the first one.
I think this is the proper way to obtain the expected behavior.
import pandas as pd

# Test data
df = pd.DataFrame({'Dependents': ['0', '1', '0', '0', '0'],
                   'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})
# Converting object to int
df['Dependents'] = df['Dependents'].astype(int)
# Computing the mean by group
df.groupby('Married').mean()
Dependents
Married
No 0.00
Yes 0.33
However, the following code works:
df.pivot_table(values='Dependents', index='Married',
               aggfunc=lambda x: x.map(int).mean())
It is equivalent to (and more readable than) converting to int with map before pivoting the data:
df['Dependents'] = df['Dependents'].map(int)
df.pivot_table(values='Dependents', index='Married')
Edit
If you have a messy DataFrame, you can use to_numeric with the errors parameter set to 'coerce':
If 'coerce', then invalid parsing will be set as NaN
# Test data
df = pd.DataFrame({'Dependents': ['0', '1', '2', '3+', 'NaN'],
                   'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})
df['Dependents'] = pd.to_numeric(df['Dependents'], errors='coerce')
print(df)
Dependents Married
0 0.0 No
1 1.0 Yes
2 2.0 Yes
3 NaN Yes
4 NaN No
print(df.groupby('Married').mean())
Dependents
Married
No 0.0
Yes 1.5
My original question was why method 2 using map(int) doesn't work; none of the above answers it, therefore there is no best answer.
However, looking back, I found that in pandas 0.22 method 2 does work. I guess the problem was in pandas.
To robustly do the aggregation, my solution would be:
df.pivot_table(
    values='Dependents',
    index='Married',
    aggfunc=lambda col: col.map(lambda v: int(v.strip('+'))).mean())
To make it cleaner, I guess you could first convert the column "Dependents" to integer and then do the aggregation.
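A sketch of that two-step version (assuming every value is a string like '0' or '3+', with no missing entries):
df['Dependents'] = df['Dependents'].str.rstrip('+').astype(int)
table = df.pivot_table(values='Dependents', index='Married')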

Convert row to column header for Pandas DataFrame

The data I have to work with is a bit messy. It has header names inside of its data. How can I choose a row from an existing pandas DataFrame and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
This works (pandas v0.19.2):
df.rename(columns=df.iloc[0])
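Note that rename returns a new DataFrame by default, so assign the result back (or pass inplace=True, as shown further below):
df = df.rename(columns=df.iloc[0])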
It would be easier to recreate the data frame.
This would also let pandas infer the column types from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace=True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace=True)
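If you also want a gap-free RangeIndex after dropping the header row, reset it:
df.reset_index(drop=True, inplace=True)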
You can specify the row index in the read_csv or read_html constructors via the header parameter, which represents "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which supposedly are junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
# skipinitialspace=True strips the blanks that follow the commas from the header names
df = pd.read_csv(StringIO(csv), header=2, skipinitialspace=True)
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
Keeping it Python simple
Pandas DataFrames have a columns attribute, so why not use it with standard Python? It is much clearer what you are doing:
import pandas as pd

table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
         ['testsala.cxf', '86', '95', '92', '87'],
         ['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
         ['630.cxf', '18', '8', '11', '18'],
         ['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
         ['dedo uv no filtro.cxf', '52', '93', '48', '58']]
data = pd.DataFrame(table[1:], columns=table[0])
or, if the header is not the first row but, say, the one at list index 10:
columns = table.pop(10)
data = pd.DataFrame(table, columns=columns)
