I am filling a DataFrame by transposing some numpy arrays:
na_price_ar = []
for symbol in syms[:5]:
    price_p = Share(symbol)
    closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    dump = np.array(closes_p)
    na_price_ar.append(dump)
    print(symbol)
df = pd.DataFrame(na_price_ar).transpose()
The DataFrame df is filled correctly; however, the column names are 0, 1, 2, ..., and I would like to rename them with the values of syms[:5]. I googled it and found this:
for symbol in syms[:5]:
    df.rename(columns={'' + str(i) + '': symbol}, inplace=True)
    i = i + 1
But if I check the variable df, I still have the same column names.
Any ideas?
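For what it's worth, the rename loop fails for two reasons: i is never initialized, and after the transpose the column labels are the integers 0 through 4, not the strings '0' through '4', so the string keys match nothing and rename silently does nothing. A minimal fix (standard pandas, sketched under the assumption that syms holds your five symbols):

df.columns = syms[:5]

or, if you prefer rename, use the integer positions as keys:

df.rename(columns=dict(enumerate(syms[:5])), inplace=True)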
Instead of using a list of arrays and transposing, you could build the DataFrame from a dict whose keys are symbols and whose values are arrays of column values:
import numpy as np
import pandas as pd

np.random.seed(2016)
syms = 'abcde'
na_price_ar = {}
for symbol in syms[:5]:
    # price_p = Share(symbol)
    # closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    # dump = np.array(closes_p)
    dump = np.random.randint(10, size=3)
    na_price_ar[symbol] = dump
    print(symbol)

df = pd.DataFrame(na_price_ar)
print(df)
yields
a b c d e
0 3 3 8 2 4
1 7 8 7 6 1
2 2 4 9 3 9
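One note on this approach: pd.DataFrame preserves the dict's insertion order on Python 3.7+, but if you want to be explicit about column order (or support older versions), you can pass it yourself:

df = pd.DataFrame(na_price_ar, columns=list(syms[:5]))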
You can use:
na_price_ar = [['A','B','C'],[0,2,3],[1,2,4],[5,2,3],[8,2,3]]
syms = ['q','w','e','r','t','y','u']
df = pd.DataFrame(na_price_ar, index=syms[:5]).transpose()
print(df)
q w e r t
0 A 0 1 5 8
1 B 2 2 2 2
2 C 3 4 3 3
You can use df.columns[number] as the dictionary key in the .rename() method:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1], 'd': [4, 1, 3, 1], 'e': [5, 2, 6, 0]}
df = pd.DataFrame(dic)
syms = ['f', 'g', 'h', 'i', 'j']  # example new names
number = 0
for symbol in syms[:5]:
    df.rename(columns={df.columns[number]: symbol}, inplace=True)
    number = number + 1
and the result is
f g h i j
0 4 4 5 4 5
1 1 2 7 1 2
2 3 1 9 3 6
3 1 4 1 1 0
I have two data frames: df1 has 26000 rows, df2 has 25000 rows.
I'm trying to find data points that are in df1 but not in df2, and vice versa.
This is what I wrote (below), but when I cross-check it, it shows me shared data points.
import pandas as pd

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df_join = pd.concat([df1, df2], axis=1).drop_duplicates(keep=False)
only_df1 = df_join.loc[df_join[df2.columns.to_list()].isnull().all(axis=1), df1.columns.to_list()]
Order doesn't matter; I just want to know whether a data point exists in one or the other data frame.
With two dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})
print(df1)
print(df2)
a b
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
a b
0 2 1
1 3 1
2 4 1
3 5 1
4 6 1
You could do:
df_differences = df1.merge(df2, how='outer', indicator=True)
print(df_differences)
Result:
a b _merge
0 1 1 left_only
1 2 1 both
2 3 1 both
3 4 1 both
4 5 1 both
5 6 1 right_only
And then:
only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])
print(only_df1)
print()
print(only_df2)
a b
0 1 1
a b
5 6 1
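For what it's worth, your original concat idea also works if you stack the frames vertically (the default axis=0) instead of side by side; dropping every duplicated row then leaves only the rows unique to one frame. A sketch that assumes neither frame contains internal duplicates; unlike the indicator approach, it does not tell you which frame a row came from:

combined = pd.concat([df1, df2])
# keep=False drops all copies of any row present in both frames
unique_rows = combined.drop_duplicates(keep=False)
print(unique_rows)

   a  b
0  1  1
4  6  1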
I'm a Python newbie and need help with a specific task. My main goal is to identify, within each row, all values greater than 0 together with their column names, and to write these values below each other into another column within the same row.
Here is what I tried:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
# create a new column that sums up the row
df['summary'] = 'NoData'
# print the header
print(df.columns.values)
A B C D summary
0 0 4 2 1 NoData
1 2 1 9 0 NoData
2 0 3 0 1 NoData
3 5 0 6 6 NoData
# get length of rows and columns
row = len(df.index)
column = len(df.columns)
# If a value at a specific index is greater than 0, take the column name
# and the value at that index and write them into the 'summary' column,
# with all values greater than 0 within a row written below each other.
for i in range(row):
    for j in range(column):
        if df.iloc[i][j] > 0:
            df.at[i, 'summary'] = df.columns(df.iloc[i][j]) + '\n'
I hope it is clear what I want to achieve. Here is a picture of how the result should look in the 'summary' column.
You don't really need a for loop.
Starting with df:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
You can do:
# Define a helper function
def f(val, col_name):
    # You can modify this function in order to customize the summary string
    return "" if val == 0 else str(val) + col_name + "\n"

# Assign the summary column
df["summary"] = df.apply(lambda x: x.apply(f, args=(x.name,))).sum(axis=1).str[:-1]
Output:
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
It works for longer column names as well:
one two three four summary
0 0 4 2 1 4two\n2three\n1four
1 2 1 9 0 2one\n1two\n9three
2 0 3 0 1 3two\n1four
3 5 0 6 6 5one\n6three\n6four
Try this:
import pandas as pd
import numpy as np

table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
print('\n\n-------------BREAK-----------\n\n')

def func(line):
    templist = ''
    list_col = line.index.values.tolist()
    temp = line.values.tolist()
    for x in range(0, len(temp)):
        if temp[x] <= 0:
            pass
        else:
            # check whether the string is still empty, not whether x == 0,
            # so rows whose first column is 0 don't get a leading newline
            if templist == '':
                templist = f"{temp[x]}{list_col[x]}"
            else:
                templist = f"{templist}\n{temp[x]}{list_col[x]}"
    return templist

df['summary'] = df.apply(func, axis=1)
print(df)
Output:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
-------------BREAK-----------
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
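For comparison, the same summary can be built with a plain list comprehension over the rows, which avoids the leading-newline pitfall entirely; a compact sketch, starting again from the original frame:

df = pd.DataFrame(table)
# join() only inserts '\n' between items, so no leading newline appears
df['summary'] = [
    '\n'.join(f'{v}{c}' for c, v in row.items() if v > 0)
    for _, row in df.iterrows()
]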
I would like to emulate an Excel formula in Pandas. I've tried this:
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
df['c'] = lambda x: df.a + df.b + 1  # Displays <function <lambda> ...> instead of the result
df['d'] = df.a + df.b + 1 # Static computation
df.a *= 2
df # Result of column c and d not updated :(
a b c d
0 6 5 <function <lambda> at 0x7f2354ddcca0> 9
1 4 3 <function <lambda> at 0x7f2354ddcca0> 6
2 2 2 <function <lambda> at 0x7f2354ddcca0> 4
3 0 1 <function <lambda> at 0x7f2354ddcca0> 2
What I expect is:
df
a b c
0 6 5 12
1 4 3 8
2 2 2 5
3 0 1 2
df.a /= 2
a b c
0 3 5 9
1 2 3 6
2 1 2 4
3 0 1 2
Is it possible to have a dynamically computed column in Pandas?
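Plain pandas has no reactive columns: a stored column holds values, not formulas. A common workaround (a sketch, not true reactivity) is to recompute the derived column on demand instead of storing it:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})

# Recompute 'c' whenever an up-to-date view is needed
def with_c(frame):
    return frame.assign(c=frame.a + frame.b + 1)

df.a *= 2
print(with_c(df))  # 'c' reflects the doubled 'a': 12, 8, 5, 2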
Maybe this code might give you a step in the right direction:
import pandas as pd

c_list = []
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
c_list2 = list(map(lambda x: x + df.b + 1, list(df.a)))
for i in range(0, 4):
    c_list.append(pd.DataFrame(c_list2[i])["b"][i])
df['c'] = c_list
df['d'] = df.a + df.b  # Static computation
df.a *= 2
df
Reactivity between columns in a DataFrame does not seem practically feasible. My cellopype package does give you Excel-like reactivity between DataFrames. Here's my take on your question:
pip install cellopype
import pandas as pd
from cellopype import Cell
# define source df and wrap it in a Cell:
df_ab = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
cell_ab = Cell(recalc=lambda: df_ab.copy())
# define the dependent/reactive Cell (with a single column 'c')
cell_c = Cell(
    recalc=lambda df: pd.DataFrame(df.a + df.b, columns=['c']),
    sources=[cell_ab]
)
# and get its value
print(cell_c.value)
c
0 8
1 5
2 3
3 1
# change source df and recalc its Cell...
df_ab.loc[0,'a']=100
cell_ab.recalc()
# cell_c has been updated in response
print(cell_c.value)
c
0 105
1 5
2 3
3 1
Also see my response to this question.
Given a dataframe, I want to obtain a list of distinct dataframes which together concatenate into the original.
The split points are given as row indices, like so:
import pandas as pd
import numpy as np
data = {"a": np.arange(10)}
df = pd.DataFrame(data)
print(df)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
separate_by = [1, 5, 6, ]
should give a list of
df1 =
a
0 0
df2 =
a
1 1
2 2
3 3
4 4
df3 =
a
5 5
df4 =
a
6 6
7 7
8 8
9 9
How can this be done in pandas?
Try:
groups = (pd.Series(1, index=separate_by)
            .reindex(df.index, fill_value=0)
            .cumsum())
out = {k: v for k, v in df.groupby(groups)}
then for example, out[2]:
a
5 5
Similar logic:
groups = np.zeros(len(df))
groups[separate_by] = 1
groups = np.cumsum(groups)
out = {k: v for k, v in df.groupby(groups)}
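In either variant, if you want the pieces as an ordered list rather than a dict, the group keys increase with position, so a comprehension preserves the original order:

dfs = [piece for _, piece in df.groupby(groups)]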
separate_by = [1, 5, 6]
separate_by.append(len(df))
separate_by.insert(0, 0)
# .iloc slices are half-open, so each piece ends just before the next split point
dfs = [df.iloc[separate_by[i]:separate_by[i + 1]] for i in range(len(separate_by) - 1)]
Let us try
d = dict(tuple(df.groupby(df.index.isin(separate_by).cumsum())))
d[0]
Out[364]:
a
0 0
d[2]
Out[365]:
a
5 5
I have a df that looks like below. I would like to get rows based on the 'D' column and my list, without the list's order being changed or its values deduplicated.
A B C D
0 a b 1 1
1 a b 1 2
2 a b 1 3
3 a b 1 4
4 c d 2 5
5 c d 3 6 #df
My list
l = [4, 2, 6, 4] # my list
df.loc[df['D'].isin(l)].to_csv('output.csv', index=False)
When I use isin(), the result is deduplicated and its order changes, and df.loc[df['D'] == value] only gives me the last match.
A B C D
3 a b 1 4
1 a b 1 2
5 c d 3 6
3 a b 1 4 # desired output
Any good way to do this? Thanks,
A solution without a loop, using merge:
In [26]: pd.DataFrame({'D':l}).merge(df, how='left')
Out[26]:
D A B C
0 4 a b 1
1 2 a b 1
2 6 c d 3
3 4 a b 1
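One nuance: merge moves the key column D to the front. To restore the original column order, reindex the result with the original columns (a small follow-up sketch):

result = pd.DataFrame({'D': l}).merge(df, how='left')[df.columns]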
You're going to have to iterate over your list, get filtered copies, and then concat them all together:
l = [4, 2, 6, 4]  # you shouldn't use "list" as a name, since list is a builtin
cache = {}
masked_dfs = []
for v in l:
    try:
        filtered_df = cache[v]
    except KeyError:
        filtered_df = df[df['D'] == v]
        cache[v] = filtered_df
    masked_dfs.append(filtered_df)
new_df = pd.concat(masked_dfs)
UPDATE: modified my answer to cache results so that you don't have to repeat the search for duplicate values.
Just collect the indices of the values you are looking for, put them in a list, and then use that list to slice the data:
import pandas as pd

df = pd.DataFrame({
    'C': [6, 5, 4, 3, 2, 1],
    'D': [1, 2, 3, 4, 5, 6]
})
l = [4, 2, 6, 4]
i_locs = [ind for elem in l for ind in df[df['D'] == elem].index]
df.loc[i_locs]
results in
C D
3 3 4
1 5 2
5 1 6
3 3 4
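For completeness: since the values in D are unique here, you can also make D the index and look the list up directly; .loc with a list of labels preserves both the order and the repeats (a sketch using the same df as above):

result = df.set_index('D').loc[l].reset_index()
print(result)

   D  C
0  4  3
1  2  5
2  6  1
3  4  3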