I have multiple lists that I would like to convert to some sort of table format.
list1 has
1
2
3
list2 has
4
5
6
etc.
I would like to save this into a table format such as
list_1, list_2
1, 4
2, 5
3, 6
I've tried:
col_a_c_df = pd.DataFrame(columns=['Column A and C', 'Column A and B and C', 'Column A and D and F', 'Column A and B and D and F'],
                          data=[col_a_c, col_a_b_c, col_a_d_f, col_a_b_d_f])
col_a_c_df.to_csv("result.csv")
but it raises: ValueError: 4 columns passed, passed data had 17181 columns
How do I do this?
You can call the DataFrame constructor after zipping both lists, where A and B represent the column names and a and b are the lists:
df = pd.DataFrame(columns=['A','B'], data=zip(a, b))
If the lists are of uneven lengths, use itertools.zip_longest, which pads the shorter list with None:
from itertools import zip_longest
df = pd.DataFrame(columns=['A','B'], data=zip_longest(a, b))
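Putting it together for the example in the question, a minimal end-to-end sketch (the column names are taken from the desired output):
import pandas as pd
from itertools import zip_longest

list1 = [1, 2, 3]
list2 = [4, 5, 6]

# zip pairs the lists element-wise; zip_longest would pad the shorter
# list with None (an empty field in the CSV) if the lengths differed
df = pd.DataFrame(zip(list1, list2), columns=['list_1', 'list_2'])
df.to_csv("result.csv", index=False)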
You can do it like this:
list1 = [1,2,3]
list2 = [4,5,6]
df = pd.DataFrame({'list1': list1, 'list2': list2})
list1 list2
0 1 4
1 2 5
2 3 6
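Since the original goal was a CSV file, writing it out is one more line; index=False drops the 0, 1, 2 row labels from the file:
df.to_csv("result.csv", index=False)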
Here is a reference example that builds the frame from a NumPy array; note the transpose, since each inner list should become a column:
import numpy as np
import pandas as pd

feature = ['Column A and C', 'Column A and B and C', 'Column A and D and F', 'Column A and B and D and F']
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
]).T
df = pd.DataFrame(data)
df.columns = feature
print(df)
df.to_csv("result.csv")
Related
I would like to conditionally replace values in a column that contains a series of arrays.
Example dataset below: (my real dataset contains many more columns and rows)
index lists condition
0 ['5 apples', '2 pears'] B
1 ['3 apples', '3 pears', '1 pumpkin'] A
2 ['4 blueberries'] A
3 ['5 kiwis'] C
4 ['1 pumpkin'] B
... ... ...
For example, if the condition is A and the row contains '1 pumpkin', then I would like to replace the value with XXX. But if the condition is B and the row contains '1 pumpkin', then I would like to replace the value with YYY.
Desired output
index lists condition
0 ['5 apples', '2 pears'] B
1 ['3 apples', '3 pears', 'XXX'] A
2 ['4 blueberries'] A
3 ['5 kiwis'] C
4 ['YYY'] B
... ... ...
The goal is eventually to replace all such values; '1 pumpkin' is just one example. Importantly, I would like to maintain the list structure. Thanks!
You can explode the list column, build the conditions for np.select, then rebuild the lists:
s = df.explode('lists')
cond = s['lists'] == '1 pumpkin'
c1 = cond & s['condition'].eq('A')
c2 = cond & s['condition'].eq('B')
s['lists'] = np.select([c1, c2], ['XXX', 'YYY'], default=s['lists'].values)
df['lists'] = s.groupby(level=0)['lists'].agg(list)
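For a runnable check, here is the question's sample data as a self-contained sketch; the snippet above then applies unchanged (explode preserves the original row index, which is what groupby(level=0) uses to rebuild the lists):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'lists': [['5 apples', '2 pears'],
              ['3 apples', '3 pears', '1 pumpkin'],
              ['4 blueberries'],
              ['5 kiwis'],
              ['1 pumpkin']],
    'condition': ['B', 'A', 'A', 'C', 'B'],
})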
You can define a function with the logic you want and then call df.apply(function, axis=1) to run that logic over each row:
def pumpkin(row):
    if '1 pumpkin' in row['lists']:
        # work on a copy so the original list is not mutated in place
        data = row['lists'][:]
        if row['condition'] == 'A':
            data[data.index('1 pumpkin')] = 'XXX'
        elif row['condition'] == 'B':
            data[data.index('1 pumpkin')] = 'YYY'
        return data
    return row['lists']

df['lists'] = df.apply(pumpkin, axis=1)
Output
lists condition
0 [5 apples, 2 pears] B
1 [3 apples, 3 pears, XXX] A
2 [4 blueberries] A
3 [5 kiwis] C
4 [YYY] B
I have a dictionary like this:
{'a': {'col_1': [1, 2], 'col_2': ['a', 'b']},
'b': {'col_1': [3, 4], 'col_2': ['c', 'd']}}
When I try to convert this to a dataframe I get this:
col_1 col_2
a [1, 2] [a, b]
b [3, 4] [c, d]
But what I need is this:
col_1 col_2
a 1 a
2 b
b 3 c
4 d
How can I get this format? Maybe I should change my input format as well?
Thanks for the help =)
You can use pd.DataFrame.from_dict setting orient='index' so the dictionary keys are set as the dataframe's indices, and then explode all columns by applying pd.Series.explode:
pd.DataFrame.from_dict(d, orient='index').apply(pd.Series.explode)
col_1 col_2
a 1 a
a 2 b
b 3 c
b 4 d
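If you are on pandas 1.3 or newer, DataFrame.explode also accepts a list of columns, which avoids the per-column apply:
pd.DataFrame.from_dict(d, orient='index').explode(['col_1', 'col_2'])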
You could run a generator comprehension and apply pandas concat; the comprehension works on the values of the dictionary, which are themselves dictionaries:
pd.concat(pd.DataFrame(entry).assign(key=key) for key,entry in data.items()).set_index('key')
col_1 col_2
key
a 1 a
a 2 b
b 3 c
b 4 d
Update: this still uses concatenation, but there is no need to assign the key to the individual dataframes; passing keys= to concat builds the index directly:
(pd.concat([pd.DataFrame(entry)
            for key, entry in data.items()],
           keys=data)
   .droplevel(-1))
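This produces the same frame, with the dictionary keys as the index:
  col_1 col_2
a     1     a
a     2     b
b     3     c
b     4     d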
I have two dataframes. The first one (let's call it A) has a column (let's call it 'col1') whose elements are lists of strings. The other one (let's call it B) has a column (let's call it 'col2') whose elements are strings. I want to do a join between these two dataframes where B.col2 is in the list in A.col1. This is a one-to-many join.
Also, I need the solution to be scalable, since I want to join two dataframes with hundreds of thousands of rows.
I have tried concatenating the values in A.col1 into a new column (let's call it 'col3') and joining with the condition A.col3.contains(B.col2). However, my understanding is that this condition triggers a cartesian product between the two dataframes, which I cannot afford considering the size of the dataframes.
from pyspark.sql.functions import udf

def joinIds(IdList):
    return "__".join(IdList)

joinIds_udf = udf(joinIds)
pnr_corr = pnr_corr.withColumn('joinedIds', joinIds_udf(pnr_corr.pnrCorrelations.correlationPnrSchedule.scheduleIds))
pnr_corr_skd = pnr_corr.join(skd, pnr_corr.joinedIds.contains(skd.id), how='inner')
This is a sample of the join that I have in mind:
dataframe A:
listColumn
["a","b","c"]
["a","b"]
["d","e"]
dataframe B:
valueColumn
a
b
d
output:
listColumn valueColumn
["a","b","c"] a
["a","b","c"] b
["a","b"] a
["a","b"] b
["d","e"] d
I don't know if there is an efficient way to do it, but this gives the correct output:
import pandas as pd
from itertools import chain
df1 = pd.Series([["a","b","c"],["a","b"],["d","e"]])
df2 = pd.Series(["a","b","d"])
result = [[[el2, list1] for el2 in df2.values if el2 in list1]
          for list1 in df1.values]
result_flat = list(chain(*result))
result_df = pd.DataFrame(result_flat)
You get:
In [26]: result_df
Out[26]:
0 1
0 a [a, b, c]
1 b [a, b, c]
2 a [a, b]
3 b [a, b]
4 d [d, e]
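The columns come out unnamed (0 and 1); to match the sample output you can name them afterwards:
result_df.columns = ['valueColumn', 'listColumn']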
Another approach is to use the explode() method (new in pandas 0.25) and merge like this:
import pandas as pd
df1 = pd.DataFrame({'col1': [["a","b","c"],["a","b"],["d","e"]]})
df2 = pd.DataFrame({'col2': ["a","b","d"]})
df1_flat = df1.col1.explode().reset_index()
df_merged = pd.merge(df1_flat, df2, left_on='col1', right_on='col2')
df_merged['col2'] = df1['col1'].loc[df_merged['index']].values
df_merged.drop('index', axis=1, inplace=True)
This gives the same result:
col1 col2
0 a [a, b, c]
1 a [a, b]
2 b [a, b, c]
3 b [a, b]
4 d [d, e]
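Since the original question is about Spark dataframes, the same explode idea translates to PySpark, where it turns the contains() condition into an ordinary equi-join that Spark can execute without a cartesian product. A sketch, assuming DataFrames df_a and df_b shaped like the samples above:
from pyspark.sql import functions as F

# one row per array element, then a plain equi-join on the value
a_flat = df_a.withColumn('valueColumn', F.explode('listColumn'))
joined = a_flat.join(df_b, on='valueColumn', how='inner')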
How about:
df['col1'] = [df['col1'].values[i] + [df['col2'].values[i]] for i in range(len(df))]
Where 'col1' is the list of strings and 'col2' is the strings.
You can also drop 'col2' if you don't want it anymore with:
df = df.drop('col2',axis=1)
If I have a dataframe,
df = pd.DataFrame({
    'name' : ['A', 'B', 'C'],
    'john_01' : [1, 2, 3],
    'mary_02' : [4, 5, 6],
})
I'd like to prepend a '#' mark to the name if df['name'] is in a given list. Then I can see something like the result below; does anyone know how to do this with pandas in an elegant way?
name_list = ['A','B','D'] # But we only have A and B in df.
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
If name_list is the same length as the Series name, then you could try this (using .loc, since the old .ix indexer has been removed from pandas):
df1['name_list'] = ['A','B','D']
df1.loc[df1.name == df1.name_list, 'name'] = '#' + df1.name
This would only prepend a '#' when the value of name and name_list are the same for the current index.
In [81]: df1
Out[81]:
john_01 mary_02 name name_list
0 1 4 #A A
1 2 5 #B B
2 3 6 C D
In [82]: df1.drop('name_list', axis=1, inplace=True) # Drop assist column
If the two are not the same length, and you therefore don't care about the index, then you could try this:
In [84]: name_list = ['A','B','D']
In [87]: df1.loc[df1.name.isin(name_list), 'name'] = '#'+df1.name
In [88]: df1
Out[88]:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
I hope this helps.
Use the df.loc[row_indexer, column_indexer] indexer together with the isin method of a Series object:
df.loc[df.name.isin(name_list), 'name'] = '#'+df.name
print(df)
The output:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
http://pandas.pydata.org/pandas-docs/stable/indexing.html
You can use isin to check whether the name is in the list, and use numpy.where to prepend #:
df['name'] = np.where(df['name'].isin(name_list), '#', '') + df['name']
df
Out:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
import pandas as pd

def exclude_list(x):
    list_exclude = ['A', 'B']
    if x in list_exclude:
        x = '#' + x
    return x

df = pd.DataFrame({
    'name' : ['A', 'B', 'C'],
    'john_01' : [1, 2, 3],
    'mary_02' : [4, 5, 6],
})

df['name'] = df['name'].apply(exclude_list)
print(df)
When I make a dataframe from a list, an error occurs.
My code is:
a=[1,2,3,4,5]
b=['a','b','c','d','e']
df=pd.DataFrame(a,columns=[b])
I want this dataframe output:
a b c d e
1 2 3 4 5
The error ends with: assert(len(items) == len(values))
What should I do to solve this problem?
There are strict requirements on the shape and form of the data being passed. You can pass just the data and transpose it, so the values become a single row, and then overwrite the column names:
In [166]:
a=[1,2,3,4,5]
b=['a','b','c','d','e']
df=pd.DataFrame(data=a).T
df.columns=b
df
Out[166]:
a b c d e
0 1 2 3 4 5
Another method would be to construct a dict, using a list comprehension to wrap each of your data elements in a list:
In [170]:
df=pd.DataFrame(dict(zip(b,[[x] for x in a])))
df
Out[170]:
a b c d e
0 1 2 3 4 5
inline dict output:
In [169]:
dict(zip(b,[[x] for x in a]))
Out[169]:
{'a': [1], 'b': [2], 'c': [3], 'd': [4], 'e': [5]}
You are actually passing the columns parameter as [['a','b','c','d','e']]. It needs to be a single list, not a list of lists.
Also, when you send in a as the data, you are actually creating 5 rows with 1 column. Instead you want to send in [a], which creates 1 row with 5 columns.
Try:
df = pd.DataFrame([a], columns=b)
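A quick check of the result; wrapping a in a list makes it one row of five values, and b (a flat list, not [b]) supplies the five column names:
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame([a], columns=b)
print(df)
#    a  b  c  d  e
# 0  1  2  3  4  5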