Append a Single Line by Elements into a Pandas Dataframe - python

I have the following (sample) dataframe:
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
....
There are no duplicate values in the index.
I am in the unenviable position of having to append to this dataframe using elements from a number of other dataframes. So I'm appending as follows:
names_df = names_df.append({'Age': someage,
                            'height': someheight,
                            'weight': someweight,
                            'haircolor': somehaircolor},
                           ignore_index=True)
My question is: using this method, how do I set the new index value in names_df to the person's name?
The only thing I can think of is to reset the df index before I append and then re-set it afterward. Ugly. There has to be a better way.
Thanks in advance.

I am not sure in what format you are getting the data that you are appending to the original df, but one way is as follows:
df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
new_name someage someheight someweight somehaircolor
Time Testing:
%timeit df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
1000 loops, best of 3: 408 µs per loop
%timeit df.append(pd.DataFrame({'Age': 'someage', 'height': 'someheight','weight':'someweight','haircolor': 'somehaircolor'}, index=['some_person']))
100 loops, best of 3: 2.59 ms per loop
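If the incoming values arrive as a dict keyed by column rather than a positional list, a small variation of the .loc approach lets pandas align on column names, so the dict order does not matter. A minimal sketch with illustrative data:
import pandas as pd

names_df = pd.DataFrame(
    {'Age': [35, 26], 'height': [5.5, 5.25],
     'weight': [145, 110], 'haircolor': ['brown', 'blonde']},
    index=['joe', 'mary'])

new_row = {'haircolor': 'red', 'Age': 44, 'weight': 185, 'height': 6.02}
# A Series assigned via .loc aligns on the dataframe's columns, so key order is irrelevant
names_df.loc['pete'] = pd.Series(new_row)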

Here's another way using append. Instead of passing a dictionary, pass a dataframe (created from a dictionary) while specifying the index:
names_df = names_df.append(pd.DataFrame({'Age': 'someage',
                                          'height': 'someheight',
                                          'weight': 'someweight',
                                          'haircolor': 'somehaircolor'},
                                         index=['some_person']))
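Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea is usually written with pd.concat. A sketch using the same placeholder values as above:
new_row = pd.DataFrame({'Age': [someage],
                        'height': [someheight],
                        'weight': [someweight],
                        'haircolor': [somehaircolor]},
                       index=['some_person'])
names_df = pd.concat([names_df, new_row])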

Related

pandas getting most frequent names from a column which has list of names

My dataframe is like this:
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names in the actors_list column. I found this code. Do you have a better suggestion, especially for big data?
import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re
p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
    x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
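On newer pandas (0.25+), Series.explode offers another concise route once the strings are parsed into real lists. A minimal sketch, assuming the same df as above:
import ast

actors = df['actors_list'].apply(ast.literal_eval)   # parse the string representation into lists
counts = actors.explode().value_counts()
print(counts.head())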
It's often better to go with plain Python than to rely on pandas here, since pandas consumes a huge amount of memory if the lists are large.
If the longest list has 1000 elements, then all the shorter lists will be padded with NaNs when you use expand=True, which is a waste of memory. Try this instead.
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
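The counting loop above can also be written with collections.Counter, which does the same dictionary bookkeeping internally. A sketch, assuming actors_list has already been converted to real Python lists as in the previous step:
from collections import Counter
from itertools import chain

# one pass over every actor in every row
word_counts = Counter(chain.from_iterable(df['actors_list']))
print(word_counts.most_common(6))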
I will use ast to convert the list-like strings into real lists:
import ast
df.actors_list = df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
Timing this code against the others, I got the chart below (image not reproduced here), in which
coldspeed's code is wen2()
Dark's code is wen4()
my code is wen1()
W-B's code is wen3()

optimize a string query with pandas. large data

I have a dataframe data, which has close to 4 million rows. It is a list of cities of the world, and I need to query by city name as fast as possible.
The best I found so far takes 346 ms, by indexing on the city name:
d2=data.set_index("city",inplace=False)
timeit d2.loc[['PARIS']]
1 loop, best of 3: 346 ms per loop
This is still much too slow. I wonder if I could achieve a faster query with group-by (and how to write such a query). Each city has around 10 rows in the dataframe (duplicate city).
I searched for several days and could not find a clear solution on the internet.
Thank you.
Setup
df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(100000)]*10,columns=['city','value'])
Baseline
df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop
Solution
Build a lookup dict and then use iloc:
idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop
It seems this approach is almost 100 times faster than the baseline.
Comparing to other approaches:
%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop
%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
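For reference, pandas also exposes GroupBy.indices, which builds essentially the same city-to-positions mapping without the apply. A minimal sketch (timings will differ by machine):
idx_dict = df.groupby('city').indices          # {city: array of integer positions}
paris_rows = df.iloc[idx_dict['Paris9999']]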
Working with the array data for the index, comparing against the needed index and then using the mask from the comparison might be one option when looking for performance. A sample case might make things clear.
1) Input dataframes :
In [591]: df
Out[591]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [592]: d2 = df.set_index("city",inplace=False)
In [593]: d2
Out[593]:
population
city
Delhi 1000
Paris 56
NY 89
Paris 36
Delhi 300
Paris 52
Paris 34
Delhi 40
NY 89
Delhi 450
2) Indexing with .loc :
In [594]: d2.loc[['Paris']]
Out[594]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
3) Use mask based indexing :
In [595]: d2[d2.index.values=="Paris"]
Out[595]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
4) Finally timings :
In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop
In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop
Further boost
Going further with using array data, we can extract the entire input dataframe as array and index into it. Thus, an implementation using that philosophy would look something like this -
def full_array_based(d2, indexval):
    df0 = pd.DataFrame(d2.values[d2.index.values == indexval])
    df0.index = [indexval] * df0.shape[0]
    df0.columns = d2.columns
    return df0
Sample run and timings -
In [635]: full_array_based(d2, "Paris")
Out[635]:
population
Paris 56
Paris 36
Paris 52
Paris 34
In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
If we are allowed to pre-process and set up a dictionary that can be indexed to extract the city-string-based data from the input dataframe, here's one solution using NumPy to do so -
def indexed_dict_numpy(df):
    cs = df.city.values.astype(str)
    sidx = cs.argsort()
    scs = cs[sidx]
    idx = np.concatenate(([0], np.flatnonzero(scs[1:] != scs[:-1]) + 1, [cs.size]))
    return {n: sidx[i:j] for n, i, j in zip(cs[sidx[idx[:-1]]], idx[:-1], idx[1:])}
Sample run -
In [10]: df
Out[10]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [11]: dict1 = indexed_dict_numpy(df)
In [12]: df.iloc[dict1['Paris']]
Out[12]:
city population
1 Paris 56
3 Paris 36
5 Paris 52
6 Paris 34
Runtime test against @Allen's solution to set up a similar dictionary with 4 million rows -
In [43]: # Setup 4 miliion rows of df
...: df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(400000)]*10,\
...: columns=['city','value'])
...: np.random.shuffle(df.values)
...:
In [44]: %timeit df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
1 loops, best of 3: 2.01 s per loop
In [45]: %timeit indexed_dict_numpy(df)
1 loops, best of 3: 1.15 s per loop
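Another low-effort option worth noting: sorting the city index makes it monotonic, which lets .loc use binary search instead of scanning the whole index. A sketch on the original dataframe (the speedup depends on the data):
d2_sorted = data.set_index('city').sort_index()   # monotonic index enables fast label lookups
d2_sorted.loc[['PARIS']]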

Python: Using apply efficiently on a pandas GroupBy object

I am trying to perform a task which is conceptually simple, but my code seems to be way too expensive. I am looking for a faster way, potentially utilizing pandas' built-in functions for GroupBy objects.
The starting point is a DataFrame called prices, with columns=['item', 'store', 'day', 'price'], in which each observation is the most recent price update specific to an item-store combination. The problem is that some price updates are the same as the last price update for the same item-store combination. For example, let us look at a particular piece:
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
66386 56 85376 211 6.00
69477 69 85376 211 5.95
In this example I would like the observation where day equals 56 to be dropped (because price is the same as the last observation in this group). My code is:
def removeSameLast(df):
    shp = df.shape[0]
    lead = df['price'][1:shp]
    lag = df['price'][:shp-1]
    diff = np.array(lead != lag)
    boo = np.array(1)
    boo = np.append(boo, diff)
    boo = boo.astype(bool)
    df = df.loc[boo]
    return df
gCell = prices.groupby(['item_id', 'store_id'])
prices = gCell.apply(removeSameLast)
This does the job, but is ugly and slow. Sorry for being a noob, but I assume that this can be done much faster. Could someone please propose a solution? Many thanks in advance.
I would suggest going for a simple solution using the shift function from Pandas. This would remove the use of the groupby and your function call.
The idea is to see where the Series [5.95, 6, 5.95, 6, 6, 5.95] is equal to the shifted one, [nan, 5.95, 6, 5.95, 6, 6], and delete (or just not select) the rows where this condition happens.
>>> mask = ~np.isclose(prices['price'], prices['price'].shift())
>>> prices[mask]
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
69477 69 85376 211 5.95
Simple benchmark:
%timeit prices = gCell.apply(removeSameLast)
100 loops, best of 3: 4.46 ms per loop
%timeit mask = df.price != df.price.shift()
1000 loops, best of 3: 183 µs per loop
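One caveat: a plain shift compares across group boundaries, so the first row of one item-store pair is checked against the last row of the previous pair. If that matters, a group-aware variant shifts within each group instead. A sketch:
# shift within each (item_id, store_id) group so the first row of a group is always kept
prev_price = prices.groupby(['item_id', 'store_id'])['price'].shift()
prices_dedup = prices[prices['price'].ne(prev_price)]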

Txt file into a list of lists in Python

I have looked at other posts here but I can't seem to find any help with what I'm specifically trying to do.
I have data in 'food.txt'. It represents annual consumption per capita, and I have to open and read the txt file into a list of lists named data.
FOOD | 1980 1985 1990 1995 2000 2005
-----+-------------------------------------------------
BEEF | 72.1 68.1 63.9 63.5 64.5 62.4
PORK | 52.1 49.2 46.4 48.4 47.8 46.5
FOWL | 40.8 48.5 56.2 62.1 67.9 73.6
FISH | 12.4 13.7 14.9 14.8 15.2 16.1
This is what I have so far, to split it into lines:
data = []
filename = 'food.txt'
with open('food.txt', 'r') as inputfile:
    for line in inputfile:
        data.append(line.strip().split(','))
This separates the file into separate lines, but I can't use this as input for the graphs, which is the second part (and the part I know how to do). I should be able to index it like I put below, because that would give only the numerical values, which is what I need.
years = data[0][1:]
porkconsumption = data[2][1:]
Any help would be appreciated, thank you.
I suspect what you have, after processing, is a list with strings in it like so:
['ABC 345 678','DEF 789 657']
Change your code to say line.strip().split() (splitting on whitespace instead of commas), and you will see your data list filled with lists like so:
[['ABC', '345', '678'], ['DEF', '789', '657']]
Then convert these entries into numbers you can plot. The values are decimals, so use float rather than int (and wrap in list() on Python 3):
pork = list(map(float, data[2][1:]))
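Since the file shown above also contains '|' separators and a dashed rule under the header, a slightly fuller sketch (assuming exactly that layout) drops those before splitting and then converts the rows:
data = []
with open('food.txt') as inputfile:
    for line in inputfile:
        if set(line.strip()) <= set('-+'):   # skip the '-----+----' rule (and blank lines)
            continue
        data.append(line.replace('|', ' ').split())

years = list(map(int, data[0][1:]))      # [1980, 1985, 1990, 1995, 2000, 2005]
pork = list(map(float, data[2][1:]))     # PORK row values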

Python and Pandas - Moving Average Crossover

There is a Pandas DataFrame object with some stock data. SMAs are moving averages calculated from the previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates, when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row where:
the long SMA(45) value was bigger than the short SMA(15) value for longer than the short SMA period (15) and then became smaller, or
the long SMA(45) value was smaller than the short SMA(15) value for longer than the short SMA period (15) and then became bigger.
I'm taking a crossover to mean when the SMA lines -- as functions of time --
intersect, as depicted on this investopedia
page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
If we change your data to
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
so that there are crossings,
then
import pandas as pd
df = pd.read_table('data', sep='\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results, but takes less time than the previous one:
df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True) # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
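To pull the actual crossover dates out of that boolean column, a follow-up selection like the below can be used (a sketch on top of the snippet above):
crossover_dates = df.loc[df['crossover'], 'Date']
print(crossover_dates)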
As an alternative to unutbu's answer, something like the below can also be done to find the indices where SMA_15 crosses SMA_45.
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_15 SMA_45
1 20150128 103.05 100 106
2 20150129 105.10 112 105
