I want to make a DataFrame with defined labels, but I don't know how to tell pandas to take the labels from the list. I hope someone can help.
import numpy as np
import pandas as pd
df = []
thislist = []
thislist = ["A","D"]
thisdict = {
"A": [1, 2, 3],
"B": [4, 5, 6],
"C": [7, 8, 9],
"D": [7, 8, 9]
}
df = pd.DataFrame(data= thisdict[thislist]) # <- here is my problem
I want to get this:
df = A D
1 7
2 8
3 9
Use:
df = pd.DataFrame(thisdict)[thislist]
print(df)
A D
0 1 7
1 2 8
2 3 9
We could also use DataFrame.drop
df = pd.DataFrame(thisdict).drop(columns = ['B','C'])
or DataFrame.reindex
df = pd.DataFrame(thisdict).reindex(columns = thislist)
or DataFrame.filter
df = pd.DataFrame(thisdict).filter(items=thislist)
We can also use filter to filter thisdict.items()
df = pd.DataFrame(dict(filter(lambda item: item[0] in thislist, thisdict.items())))
print(df)
A D
0 1 7
1 2 8
2 3 9
I think this answer is complemented by @anky_91's solution.
Finally, I recommend reading up on how indexing works in pandas.
IIUC, use .loc[] with the dataframe constructor:
df = pd.DataFrame(thisdict).loc[:,thislist]
print(df)
A D
0 1 7
1 2 8
2 3 9
Use a dict comprehension to create a new dictionary that is a subset of your original so you only construct the DataFrame you care about.
pd.DataFrame({x: thisdict[x] for x in thislist})
A D
0 1 7
1 2 8
2 3 9
If you want to deal with the possibility of missing Keys, add some logic so it's similar to reindex
pd.DataFrame({x: thisdict[x] if x in thisdict else np.nan for x in thislist})
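To see how the missing-key handling behaves, here is a small sketch. The label "Z" is a hypothetical key that is not in thisdict, added only for illustration, and dict.get with a NaN default stands in for the conditional expression above.

```python
import numpy as np
import pandas as pd

thisdict = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [7, 8, 9]}
# "Z" is a hypothetical label that does not exist in thisdict
wanted = ["A", "Z"]

# dict.get falls back to NaN for missing keys; the scalar is broadcast to every row
via_dict = pd.DataFrame({x: thisdict.get(x, np.nan) for x in wanted})

# reindex builds the full frame first, then adds the missing column filled with NaN
via_reindex = pd.DataFrame(thisdict).reindex(columns=wanted)

print(via_dict)
print(via_reindex)
```

Both variants should give a column of NaN for the missing label; the dict version simply never constructs the unwanted columns B, C and D.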
df = pd.DataFrame(thisdict)
df[['A', 'D']]
Another alternative for your input:
thislist = ["A","D"]
thisdict = {
"A": [1, 2, 3],
"B": [4, 5, 6],
"C": [7, 8, 9],
"D": [7, 8, 9]
}
df = pd.DataFrame(thisdict)
and then simply remove the columns that are not in thislist (you can drop them directly from the df or collect them first):
remove_columns = []
for c in df.columns:
    if c not in thislist:
        remove_columns.append(c)
and remove them:
df.drop(columns=remove_columns, inplace=True)
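The loop that collects the unwanted columns can probably be shortened. As a sketch, assuming the same thislist and thisdict as above, Index.difference returns the column labels that are not in the list:

```python
import pandas as pd

thislist = ["A", "D"]
thisdict = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [7, 8, 9]}
df = pd.DataFrame(thisdict)

# Index.difference returns the labels in df.columns that are not in thislist
remove_columns = df.columns.difference(thislist)

# Returning a new frame instead of using inplace=True keeps the original intact until reassigned
df = df.drop(columns=remove_columns)
print(df)
```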
I have the following Dataframe:
Now I want to insert an empty row after every row where the column "Zweck" equals 7.
So for example the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
df = pd.DataFrame(np.insert(df.values, ind, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
a b f
0 1 1 1
1
2 2 2 7
3 3 3 3
4 4 4 4
5
6 5 5 7
Here the DataFrame is rebuilt in one step rather than appended to row by row, since appending would be resource intensive. np.insert adds placeholder rows filled with the value 33 before each row where f equals 7. A numeric placeholder is necessary because np.insert cannot put strings into a numeric array (this also assumes that 33 does not otherwise occur in column a). The columns are renamed back to their original names with df.rename. Finally, we find the rows where df['a'] == 33 and set them to empty values.
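The code above places the blank row before each matching row. If the blank row should come after it, as the question asks, one possible sketch avoids the numeric placeholder by giving the blank rows fractional index labels and sorting. It assumes the default integer index; the names blank and out are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})

# One blank row per match, labelled halfway between the match and the next row
blank_index = df[df['f'] == 7].index + 0.5
blank = pd.DataFrame('', index=blank_index, columns=df.columns)

# Sorting by index slots each blank row right after its matching row
out = pd.concat([df, blank]).sort_index().reset_index(drop=True)
print(out)
```

Because the blank rows hold strings, the resulting columns have object dtype, just as in the approach above.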
I have written this code to drop a column in my dataset but the message is this
df = Data.drop('URL', axis=1)
----> 1 df = Data.drop('URL', axis=1)
AttributeError: '_io.TextIOWrapper' object has no attribute 'drop'
I was hoping to drop the column using pandas, but it did not work.
Hi 👋 Hope you are doing well!
It looks like Data has type _io.TextIOWrapper, which is an open file handle rather than a DataFrame. It needs to be a DataFrame; then you will be able to drop columns:
import pandas as pd
dummy_df = pd.DataFrame(
{
"A": [1, 2, 3],
"B": [4, 5, 6],
"C": [7, 8, 9]
},
)
print(dummy_df)
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
dummy_df = dummy_df.drop(columns=["B", "C"])
print(dummy_df)
# A
# 0 1
# 1 2
# 2 3
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
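A likely cause is that Data was created with open() and never parsed. The following sketch shows the difference; the file name, column names and contents are made up for illustration, and it assumes the data is a CSV file.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV so the example is self-contained (hypothetical file and columns)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.csv")
with open(path, "w") as fh:
    fh.write("URL,Title\nhttps://example.com,first\nhttps://example.org,second\n")

with open(path) as Data:
    # open() returns an _io.TextIOWrapper, which has no .drop method
    print(type(Data).__name__)
    # pd.read_csv parses the file contents into a DataFrame
    frame = pd.read_csv(Data)

# Now .drop is available because frame is a DataFrame
df = frame.drop("URL", axis=1)
print(df)
```

Passing the path directly, as in pd.read_csv(path), would work as well and avoids managing the file handle yourself.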
I have data like this:
data = {'Host' : ['A','A','A','A','A','A','B','B','B'], 'Duration' : ['1','2',None,'4','5',None,'7','8',None], 'Predict' : [None,None,'3',None,None,'6',None,None,'9']}
df = pd.DataFrame(data)
It looks like:
Host Duration Predict
0 A 1 None
1 A 2 None
2 A None 3
3 A 4 None
4 A 5 None
5 A None 6
6 B 7 None
7 B 8 None
8 B None 9
What I expect to get:
A 1, 2, 3
A 4, 5, 6
B 7, 8, 9
I got what I wanted, but I do not like the way I solved it:
def create_vector(group):
    result = []
    df_array = []
    for index, item in enumerate(group.Duration.ravel()):
        if (item != None):
            result.append(item)
        else:
            result.append(group.Predict.ravel()[index])
            result.append(-1)
    result = np.array(list(map(int, result)))
    splitted = np.split(result, np.where(result == -1)[0] + 1)
    for arr in splitted:
        if (len(arr) > 3):
            seq = ', '.join(str(e) for e in arr[:-1])
            df_array.append(seq)
    return pd.DataFrame(df_array,columns=['seq'])
Minimal length of arr must be one 'Duration' plus one 'Predict'
df = df.groupby(['Host']).apply(create_vector)
df = df.reset_index().rename(columns={'level_1': 'Index'})
df = df.drop(columns=['Index'])
Would like to solve this problem using pandas.
Waiting for comments and advice
I believe you can replace the missing values in Duration with the values from the Predict column, which simplifies the solution:
df['new'] = df['Duration'].fillna(df['Predict']).astype(str)
If you need to group every 3 values within each Host group:
g = df.groupby('Host').cumcount() // 3
Or, if you need groups delimited by the non-missing values in the Predict column (this requires the default index):
g = df.index.where(df['Predict'].notna()).to_series().bfill()
# if the values in the Predict column are always unique
# g = df['Predict'].bfill()
df = (df.groupby(['Host', g])['new']
        .apply(', '.join)
        .reset_index(level=1, drop=True)
        .reset_index(name='Seq'))
print (df)
Host Seq
0 A 1, 2, 3
1 A 4, 5, 6
2 B 7, 8, 9
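For reference, here is the fillna and cumcount variant as one self-contained sketch using the data from the question. It assumes every sequence is exactly 3 rows long; the names values, block and out are only illustrative.

```python
import pandas as pd

data = {'Host': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
        'Duration': ['1', '2', None, '4', '5', None, '7', '8', None],
        'Predict': [None, None, '3', None, None, '6', None, None, '9']}
df = pd.DataFrame(data)

# Take Duration where present, otherwise fall back to Predict
values = df['Duration'].fillna(df['Predict']).astype(str)

# Number the rows within each Host and cut them into blocks of 3
block = df.groupby('Host').cumcount() // 3

# Join each block into one string, then drop the helper level and name the result
out = (values.groupby([df['Host'], block])
             .agg(', '.join)
             .reset_index(level=1, drop=True)
             .reset_index(name='Seq'))
print(out)
```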
Another solution reshapes with DataFrame.stack, which removes None and other missing values by default, and then aggregates with join again:
g = df.groupby('Host').cumcount() // 3
df = (df.set_index(['Host', g])
        .stack()
        .astype(str)
        .groupby(level=[0,1])
        .apply(', '.join)
        .reset_index(level=1, drop=True)
        .reset_index(name='Seq')
     )
print (df)
Host Seq
0 A 1, 2, 3
1 A 4, 5, 6
2 B 7, 8, 9
One way would be to melt, dropna to remove invalid values, then groupby and join the valid values:
(df.melt(id_vars='Host')
   .dropna(subset=['value'])
   .groupby('Host').value
   .agg(', '.join)
   .reset_index())
Host value
0 A 1, 2, 3, 4, 5
1 B 6, 7, 8, 9, 0
I want a new column based on certain conditions on existing columns. Below is what I am doing right now, but it takes too much time on large data. Is there a more efficient or faster way to do it?
DF["A"][0] = 0
for x in range(1,rows):
    if(DF["B"][x]>DF["B"][x-1]):
        DF["A"][x] = DF["A"][x-1] + DF["C"][x]
    elif(DF["B"][x]<DF["B"][x-1]):
        DF["A"][x] = DF["A"][x-1] - DF["C"][x]
    else:
        DF["A"][x] = DF["A"][x-1]
If I got you right this is what you want:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
'B': [12, 15, 9, 8, 15],
'C': [3, 9, 12, 6, 8]})
df['A'] = np.where(df.index==0,
                   0,
                   np.where(df['B']>df['B'].shift(),
                            df['A'].shift()+df['C'],
                            np.where(df['B']<df['B'].shift(),
                                     df['A'].shift()-df['C'],
                                     df['A'].shift())))
df
# A B C
#0 0.0 12 3
#1 10.0 15 9
#2 -10.0 9 12
#3 -3.0 8 6
#4 12.0 15 8
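Note that in the question's loop each A[x] depends on the already updated A[x-1], so the column is really a running total: add C when B rises, subtract C when B falls, and add nothing when B is unchanged. A minimal sketch of that reading, assuming A[0] = 0 as in the question and reusing the B and C values from the example above (the name direction is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})

# +1 where B rose, -1 where it fell, 0 where unchanged; the first row has no predecessor
direction = np.sign(df['B'].diff()).fillna(0)

# Each row adds or subtracts its own C; cumsum carries the running total forward from 0
df['A'] = (direction * df['C']).cumsum()
print(df)
```

This avoids the Python-level loop entirely, which is usually what makes the original version slow on large data.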
a new column based on certain conditions of existing columns,
I'm using the DataFrame provided by @zipa:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
'B': [12, 15, 9, 8, 15],
'C': [3, 9, 12, 6, 8]})
First approach
Here's a function that efficiently implements what you specified. It works by leveraging pandas' indexing features, specifically row masks:
def update(df):
    cond_larger = df['B'] > df['B'].shift().fillna(0)
    cond_smaller = df['B'] < df['B'].shift().fillna(0)
    cond_else = ~(cond_larger | cond_smaller)
    for cond, sign in [(cond_larger, +1),   # A[x-1] + C[x]
                       (cond_smaller, -1),  # A[x-1] - C[x]
                       (cond_else, 0)]:     # A[x-1] + 0
        if any(cond):
            df.loc[cond, 'A_updated'] = (df['A'].shift().fillna(0) +
                                         sign * df[cond]['C'])
    df['A'] = df['A_updated']
    df.drop(columns=['A_updated'], inplace=True)
    return df
update(df)
=>
A B C
0 3.0 12 3
1 10.0 15 9
2 -10.0 9 12
3 -3.0 8 6
4 12.0 15 8
Optimized
It turns out you can use mask to achieve the same as above. Here it is applied to column A only (Series.mask), so that columns B and C are left untouched. Note you could combine the conditions into the call of mask, however I find it easier to read like this:
# specify conditions
cond_larger = df['B'] > df['B'].shift().fillna(0)
cond_smaller = df['B'] < df['B'].shift().fillna(0)
cond_else = ~(cond_larger | cond_smaller)
# apply
A_shifted = df['A'].shift().fillna(0)
df['A'] = df['A'].mask(cond_larger, A_shifted + df['C'])
df['A'] = df['A'].mask(cond_smaller, A_shifted - df['C'])
df['A'] = df['A'].mask(cond_else, A_shifted)
=>
(same results as above)
Notes:
I'm assuming default value 0 for A/B[x-1]. If the first row should be treated differently remove or replace .fillna(0). Results will be different.
Conditions are checked in sequence. Depending on whether updates should use the original values in A or those updated in the previous condition you may not need the helper column A_updated
See previous versions of this answer for a history of how I got here
I have a pandas data frame known as "df":
x y
0 1 2
1 2 4
2 3 8
I am splitting it up into two frames, and then trying to merge back together:
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
My goal is to get it back in the same order, but when I concat, I am getting the following:
frames = [df_1, df_2]
solution = pd.concat(frames)
solution.sort_values(by='x', inplace=False)
x y
1 2 4
2 3 8
0 1 2
The problem is I need the 'x' values to go back into the new dataframe in the same order that I extracted. Is there a solution?
Use .loc to specify the order you want. Choose the original index.
solution.loc[df.index]
Or, if you trust the index values in each component, then
solution.sort_index()
setup
df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_1, df_2]
solution = pd.concat(frames)
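The two options only differ when the original index is not already sorted. As a sketch, using a hypothetical frame whose index labels are deliberately out of order:

```python
import pandas as pd

# Hypothetical frame whose index is deliberately not in sorted order
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]}, index=[10, 3, 7])
df_1 = df[df['x'] == 1]
df_2 = df[df['x'] != 1]
solution = pd.concat([df_2, df_1])  # rows now in order 3, 7, 10

# .loc with the original index restores the original row order: 10, 3, 7
restored = solution.loc[df.index]

# sort_index orders by label instead: 3, 7, 10
by_label = solution.sort_index()

print(restored)
print(by_label)
```

With the default 0, 1, 2 index from the question both calls give the same result, which is why sort_index is enough there.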
Try this:
In [14]: pd.concat([df_1, df_2.sort_values('y')])
Out[14]:
x y
0 1 2
1 2 4
2 3 8
When you are sorting the solution using
solution.sort_values(by='x', inplace=False)
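you need to specify inplace=True, or assign the result back to solution, because with inplace=False sort_values returns a new DataFrame and leaves solution unchanged. That would take care of it.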
Based on these assumptions on df:
Columns x and y are not necessarily ordered.
The index is ordered.
Just order your result by index:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_2, df_1]
solution = pd.concat(frames).sort_index()
Now, solution looks like this:
x y
0 1 2
1 2 4
2 3 8