I have a pandas data frame with 2 columns:
{'A':[1, 2, 3],'B':[4, 5, 6]}
I want to create a new column where:
{'C':[1 4,2 5,3 6]}
Setup
df = pd.DataFrame({'A':[1, 2, 3],'B':[4, 5, 6]})
Solution
Keep in mind, per your expected output, [1 4,2 5,3 6] isn't a thing. I'm interpreting you to mean either [(1, 4), (2, 5), (3, 6)] or ["1 4", "2 5", "3 6"].
First assumption
df.apply(lambda x: tuple(x.values), axis=1)
0 (1, 4)
1 (2, 5)
2 (3, 6)
dtype: object
Second assumption
df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
0 1 4
1 2 5
2 3 6
dtype: object
If you don't mind a zip object, you can use df['C'] = zip(df.A, df.B) (Python 2). If you want tuples, cast the zip object with list() — in Python 3, zip returns an iterator, so the cast is required. Please refer to this post. zip is pretty handy in this kind of scenario.
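A minimal sketch of the zip-based approach, with the list() cast needed on Python 3:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# cast the zip iterator to a list so pandas stores one tuple per row
df['C'] = list(zip(df.A, df.B))
print(df['C'].tolist())  # [(1, 4), (2, 5), (3, 6)]
```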
Related
I have a pandas DataFrame with an IntervalIndex. Some of the columns will have several connecting intervals with the same data, and I would like to break the DataFrame into a dictionary of Series and then do an in-place compression where I merge adjacent intervals that have the same data value.
I have what I think is a pretty ugly solution using a while loop and a list of compressed intervals as a hack around the fact that an IntervalIndex is immutable. I am not posting this on code review because I think it is just terrible. Only posting here because it shows what I am trying to do.
Is there some built-in function that I am unaware of or maybe a more pythonic combination I can use to do this?
import pandas as pd
# create data
index = pd.IntervalIndex.from_breaks([1, 2, 3, 4, 5])
data = {"a": [1, 2, 3, 4], "b": [5, 5, 6, 6], "c": [7, 7, 7, 7]}
df = pd.DataFrame(index=index, data=data)
# break the DataFrame into a dict of Series so we can compress each column
dict_of_series = df.to_dict("series")
# combine overlapping bins with redundant data
def compress(s):
    # we need to build the compressed index rather than modify in place, as IntervalIndex is immutable
    compressed_index = [s.index[0]]
    i = 1
    while True:
        try:
            i1 = s.index[i]
            d0 = s.iloc[i - 1]
            d1 = s.iloc[i]
        except IndexError:
            break
        # if intervals don't touch, we cannot combine bins
        if compressed_index[-1].right < i1.left:
            compressed_index.append(i1)
            i += 1
            continue
        # if intervals touch and the data is the same, combine bins
        if d0 == d1:
            s.drop(i1, inplace=True)
            compressed_index[-1] = pd.Interval(compressed_index[-1].left, i1.right)
            continue
        compressed_index.append(i1)
        i += 1
    # replace the index with the compressed index we built
    s.index = pd.IntervalIndex(compressed_index)
print("Original Data for 'a'")
print(dict_of_series["a"])
print()
print("Compressed Data for 'a'")
compress(dict_of_series["a"])
print(dict_of_series["a"])
print()
print("Original Data for 'b'")
print(dict_of_series["b"])
print()
print("Compressed Data for 'b'")
compress(dict_of_series["b"])
print(dict_of_series["b"])
print()
print("Original Data for 'c'")
print(dict_of_series["c"])
print()
print("Compressed Data for 'c'")
compress(dict_of_series["c"])
print(dict_of_series["c"])
Which gives the following result:
Original Data for 'a'
(1, 2] 1
(2, 3] 2
(3, 4] 3
(4, 5] 4
Name: a, dtype: int64
Compressed Data for 'a'
(1, 2] 1
(2, 3] 2
(3, 4] 3
(4, 5] 4
Name: a, dtype: int64
Original Data for 'b'
(1, 2] 5
(2, 3] 5
(3, 4] 6
(4, 5] 6
Name: b, dtype: int64
Compressed Data for 'b'
(1, 3] 5
(3, 5] 6
Name: b, dtype: int64
Original Data for 'c'
(1, 2] 7
(2, 3] 7
(3, 4] 7
(4, 5] 7
Name: c, dtype: int64
Compressed Data for 'c'
(1, 5] 7
Name: c, dtype: int64
Try Series.duplicated:
>>> df.loc[~df['a'].duplicated(), 'a']
(1, 2] 1
(2, 3] 2
(3, 4] 3
(4, 5] 4
Name: a, dtype: int64
>>> df.loc[~df['b'].duplicated(), 'b']
(1, 2] 5
(3, 4] 6
Name: b, dtype: int64
>>> df.loc[~df['c'].duplicated(), 'c']
(1, 2] 7
Name: c, dtype: int64
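Note that duplicated keeps only the first interval of each run, so the interval bounds stay unmerged ((1, 2] rather than (1, 3]). If you also want the merged bounds, here is a minimal vectorized sketch, assuming the intervals are contiguous as produced by from_breaks:

```python
import pandas as pd

index = pd.IntervalIndex.from_breaks([1, 2, 3, 4, 5])
s = pd.Series([5, 5, 6, 6], index=index, name="b")

# label runs of consecutive equal values
runs = s.ne(s.shift()).cumsum()

# merge each run: first left bound, last right bound, and the run's value
merged = s.groupby(runs).agg(
    left=lambda g: g.index[0].left,
    right=lambda g: g.index[-1].right,
    value="first",
)
out = pd.Series(
    merged["value"].to_numpy(),
    index=pd.IntervalIndex.from_arrays(merged["left"], merged["right"]),
    name=s.name,
)
print(out)
```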
I have a pandas dataframe:
df = pd.DataFrame({'one' : [1, 2, 3, 4] ,'two' : [5, 6, 7, 8]})
one two
0 1 5
1 2 6
2 3 7
3 4 8
Column "one" and column "two" together comprise (x, y) coordinates.
Let's say I have a list of coordinates: c = [(1,5), (2,6), (20,5)]
Is there an elegant way of obtaining the rows in df with matching coordinates? In this case, given c, the matching rows would be 0 and 1.
Related question: Using pandas to select rows using two different columns from dataframe?
And: Selecting rows from pandas DataFrame using two columns
This approach using pd.merge should perform better than the iterative solutions.
import pandas as pd
df = pd.DataFrame({"one" : [1, 2, 3, 4] ,"two" : [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]
df2 = pd.DataFrame(c, columns=["one", "two"])
pd.merge(df, df2, on=["one", "two"], how="inner")
one two
0 1 5
1 2 6
You can use
>>> set.union(*(set(df.index[(df.one == i) & (df.two == j)]) for i, j in c))
{0, 1}
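Another option, if the two columns form your lookup key, is to build a MultiIndex and test membership with isin. A sketch (MultiIndex.from_frame needs pandas >= 0.24):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]

# build a MultiIndex from the coordinate columns and test membership against c
mask = pd.MultiIndex.from_frame(df[["one", "two"]]).isin(c)
print(df[mask])
```

Unlike merge, this keeps the original index labels of the matching rows.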
I need to generate a list from a pandas DataFrame. I am able to print the result, but I don't know how to convert it to list format. The code I used to print the result (without converting to a list) is:
df = pandas.DataFrame(processed_data_format, columns=["file_name", "innings", "over", "ball", "individual ball", "runs", "batsman", "wicket_status", "bowler_name", "fielder_name"])
# processed_data_format is the list passed to the DataFrame
t = df.groupby(['batsman','over'])['runs','ball'].sum()
print t
I am getting the result like:
Sangakara  1    10    5
           2     0    2
           3     3    1
sewag      1     2    1
           2     1    1
I would like to convert this data into list format like:
[ [sangakara,1,10,5],[sangakara,2,0,2],[sangakara,3,3,1],[sewag,1,2,1][sewag,2,1,1] ]
You can use to_records:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 3, 4, 5]})
>>> list(df.to_records())
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
This will make a list of tuples. Converting this to a list of lists (if you really need that at all) is easy.
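Applied to a grouped result like t in the question, you would first reset_index so the group keys become ordinary columns again, then take the rows as lists. A sketch with stand-in data (the column names are taken from the question; the values are made up):

```python
import pandas as pd

# minimal stand-in for the cricket data in the question
df = pd.DataFrame({
    "batsman": ["Sangakara", "Sangakara", "sewag"],
    "over": [1, 2, 1],
    "runs": [10, 0, 2],
    "ball": [5, 2, 1],
})

t = df.groupby(["batsman", "over"])[["runs", "ball"]].sum()

# move the group keys out of the index, then take the rows as lists
result = t.reset_index().values.tolist()
print(result)
```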
I know how to get the most frequent element of a list of lists, e.g.
a = [[3,4], [3,4],[3,4], [1,2], [1,2], [1,1],[1,3],[2,2],[3,2]]
print max(a, key=a.count)
should print [3, 4] even though the most frequent number is 1 for the first element and 2 for the second element.
My question is how to do the same kind of thing with Pandas.DataFrame.
For example, I'd like to know the implementation of the following method get_max_freq_elem_of_df:
def get_max_freq_elem_of_df(df):
# do some things
return freq_list
df = pd.DataFrame([[3,4], [3,4],[3,4], [1,2], [1,2], [1,1],[1,3],[2,2],[4,2]])
x = get_max_freq_elem_of_df(df)
print x # => should print [3,4]
Please notice that the DataFrame.mode() method does not work. For the above example, df.mode() returns [1, 2], not [3, 4].
Update
The answers below have explained why DataFrame.mode() doesn't work.
You could use groupby.size and then find the max:
>>> df.groupby([0,1]).size()
0  1
1  1    1
   2    2
   3    1
2  2    1
3  4    3
4  2    1
dtype: int64
>>> df.groupby([0,1]).size().idxmax()
(3, 4)
In Python you'd use Counter*:
In [11]: from collections import Counter
In [12]: c = Counter(df.itertuples(index=False))
In [13]: c
Out[13]: Counter({(3, 4): 3, (1, 2): 2, (1, 3): 1, (2, 2): 1, (4, 2): 1, (1, 1): 1})
In [14]: c.most_common(1) # get the top 1 most common items
Out[14]: [((3, 4), 3)]
In [15]: c.most_common(1)[0][0] # get the item (rather than the (item, count) tuple)
Out[15]: (3, 4)
* Note that your solution
max(a, key=a.count)
(although it works) is O(N^2), since on each iteration it needs to iterate through a (to get the count), whereas Counter is O(N).
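On newer pandas (1.1+), DataFrame.value_counts counts unique rows directly, which gives a one-liner alternative. A sketch:

```python
import pandas as pd

df = pd.DataFrame([[3, 4], [3, 4], [3, 4], [1, 2], [1, 2], [1, 1], [1, 3], [2, 2], [4, 2]])

# value_counts counts unique rows, sorted by frequency (most common first)
most_common = df.value_counts().idxmax()
print(list(most_common))  # [3, 4]
```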
I am new to Python and this seems to be a bit tricky for me:
I have 3 columns:
column1 : id
column2 : size
column3 : rank
I now want to reorder by the id column while keeping each (size, rank) pair together,
so it would look like:
id:1 (size,rank):4,5
id:2 (size,rank):5,8
So the id column has to be reordered from 1 to 1000 without messing up the (size, rank) tuples.
I tried to do:
combined = zip(size,rank)
id, combined = zip(*sorted(zip(id, combined)))
Is this correct? And if yes, how can I separate the tuples into two arrays, size and rank, again?
I heard about zip(*combined)?
unziped = zip(*combined)
then size equals unziped[0] and rank equals unziped[1] ?
Thank you for help!
ADDED (the data comes from NumPy's genfromtxt function):
size= [x[2] for x in mydata]
rank= [x[1] for x in mydata]
Your main problem is that you are using the id column both as an identifier and as the value by which the data is ordered. This is wrong. Leave the id column alone; sort the data by size only; once it is sorted, use enumerate to list the "order".
Here is an example, which sorts the data by the second column (size), then prints the data along with their "rank" or "order":
>>> data = ((1, 4, 5), (3, 6, 7), (4, 3, 3), (19, 32, 0))
>>> data_sorted = sorted(data, key=lambda x: x[1])
>>> for k, v in enumerate(data_sorted):
...     print('{}: {}'.format(k + 1, v))
...
1: (4, 3, 3)
2: (1, 4, 5)
3: (3, 6, 7)
4: (19, 32, 0)
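To answer the unzipping part of the question: yes, zip(*...) transposes rows back into columns. A sketch using the data above:

```python
# zip(*rows) transposes a sequence of rows into per-column tuples
data = ((1, 4, 5), (3, 6, 7), (4, 3, 3), (19, 32, 0))
data_sorted = sorted(data, key=lambda x: x[1])

ids, size, rank = zip(*data_sorted)
print(ids)   # (4, 1, 3, 19)
print(size)  # (3, 4, 6, 32)
print(rank)  # (3, 5, 7, 0)
```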