Say I have two pandas Series in python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
0 1 2 3 4
0 g 4 2 1 1
1 1 6 5 4 abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests+BeautifulSoup and I am processing it and trying to write it to a DataFrame one row at a time.
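For context, the pattern I'm aiming for is roughly the following (a sketch with placeholder values, not my real scraping code):
import pandas as pd

rows = []                        # accumulate one list of values per scraped page
for page in ['page1', 'page2']:  # hypothetical loop over scraped pages
    row = ['g', 4, 2, 1, 1]      # placeholder for the values parsed from `page`
    rows.append(row)

# building the DataFrame once at the end also avoids repeated appends
df = pd.DataFrame(rows)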
So if you don't pass an empty list to the DataFrame constructor then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h,ignore_index=True)
df
Out[16]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be that the index dtypes are set differently: with an empty list it is int64, and with no argument at all it is object:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
Unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this, I can confirm that it looks to me like a bug in pandas version 0.12.0, as your original code works fine on a newer version:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h,ignore_index=True)
Out[13]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
I am running pandas 0.13.1 and numpy 1.8.1 (64-bit) on Python 3.3.5. I believe the problem lies in pandas, but I would upgrade both pandas and numpy to be safe; I don't think this is a 32- versus 64-bit Python issue.
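A further note for anyone reading this with a modern pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the idiomatic equivalent of the snippets above is pd.concat (a sketch mirroring the first example):
import pandas as pd

h = pd.Series(['g', 4, 2, 1, 1])
g = pd.Series([1, 6, 5, 4, 'abc'])

# Series -> one-row frame, then concatenate; replaces df.append(g, ignore_index=True)
df1 = pd.concat([pd.DataFrame([h]), g.to_frame().T], ignore_index=True)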
Related
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) and a dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
Output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
I think you can just use .corr, which returns all pairwise correlations between columns, and then select the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
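One caveat, assuming IM also has non-numeric columns (titles, genres, and so on): recent pandas versions raise on DataFrame.corr over non-numeric data unless you pass numeric_only=True:
# restrict the correlation matrix to numeric columns, then pick the one of interest
IM.corr(numeric_only=True)['imdb_score']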
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])
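For this toy frame every column is the same sequence 0..9, so both computed correlations are exactly 1.0:
  var  corr
0   b   1.0
1   c   1.0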
I am trying to convert some Pandas code to Dask.
I have a dataframe that looks like the following:
ListView_Lead_MyUnreadLeads ListView_Lead_ViewCustom2
0 1 1
1 1 0
2 1 1
3 1 1
4 1 1
In Pandas, I can create a Lists column that contains each column's name whenever that row's value is 1, like so:
df['Lists'] = df.dot(df.columns+",").str.rstrip(",").str.split(",")
So the Lists column looks like:
Lists
0 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
1 [ListView_Lead_MyUnreadLeads]
2 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
3 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
4 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
In Dask, the dot function doesn't seem to work the same way. How can I get the same behavior / output?
Any help would be appreciated. Thanks!
Related question in Pandas: How to return headers of columns that match a criteria for every row in a pandas dataframe?
Here are some alternative ways to do it in Pandas; you can try whether they work equally well in Dask.
cols = df.columns.values
df['Lists'] = [list(cols[x]) for x in df.eq(1).values]
or try:
df['Lists'] = df.eq(1).apply(lambda x: list(x.index[x]), axis=1)
The first solution using list comprehension provides better performance if your dataset is large.
Result:
print(df)
ListView_Lead_MyUnreadLeads ListView_Lead_ViewCustom2 Lists
0 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
1 1 0 [ListView_Lead_MyUnreadLeads]
2 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
3 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
4 1 1 [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
Here's a Dask version with map_partitions:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'ListView_Lead_MyUnreadLeads': [1,1,1,1,1], 'ListView_Lead_ViewCustom2': [1,0,1,1,1] })
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    df = df.copy()
    df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")
    return df
ddf.map_partitions(myfunc).compute()
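One extra hint: map_partitions infers the output schema by running myfunc on a tiny dummy frame. If that inference warns or fails on your real data, you can pass meta explicitly (a sketch, assuming the new Lists column holds Python lists, i.e. object dtype):
# explicit schema hint: the original columns plus an object-dtype Lists column
meta = df.head(0).assign(Lists=pd.Series(dtype=object))
ddf.map_partitions(myfunc, meta=meta).compute()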
Good day everyone! I had trouble putting a nested dictionary into separate columns. I fixed that part using the concat and json_normalize functions. But for some reason the code I used removed all the column names and returned NaN as the values of the columns...
Does someone know how to fix this?
Code I used:
import pandas as pd
c = ['photo.photo_replace', 'photo.photo_remove', 'photo.photo_add', 'photo.photo_effect', 'photo.photo_brightness',
'photo.background_color', 'photo.photo_resize', 'photo.photo_rotate', 'photo.photo_mirror', 'photo.photo_layer_rearrange',
'photo.photo_move', 'text.text_remove', 'text.text_add', 'text.text_edit', 'text.font_select', 'text.text_color', 'text.text_style',
'text.background_color', 'text.text_align', 'text.text_resize', 'text.text_rotate', 'text.text_move', 'text.text_layer_rearrange']
df_edit = pd.concat([pd.json_normalize(x)[c] for x in df['editables']], ignore_index=True)
df.columns = df.columns.str.split('.').str[1]
Current problem: (screenshot: the resulting columns are unnamed and hold only NaN)
Result I want: (screenshot: each nested field as its own properly named column)
df = pd.DataFrame({
'A':[1,2,3],
'B':[3,3,3]
})
print(df)
A B
0 1 3
1 2 3
2 3 3
c=['new_name1','new_name2']
df.columns=c
print(df)
new_name1 new_name2
0 1 3
1 2 3
2 3 3
Remember, the length of the list of column names (c) must equal the number of columns.
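Applied to the snippet in the question, my guess (an assumption, since the screenshots aren't visible) is that the rename targeted the wrong frame: df.columns.str.split('.').str[1] yields NaN for any column name without a dot, which would wipe out df's names. Renaming the normalized frame instead should work:
# rename the frame produced by json_normalize, not the original df
df_edit.columns = df_edit.columns.str.split('.').str[1]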
I have a DataFrame like:
import numpy as np
import pandas as pd

df = np.array([[1,5,3,4,5,5,6],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
df = pd.DataFrame(df)
and my expected output is the majority value of each row, like:
0 5
1 2
2 6
I'm new with Pandas. Thank you for any help.
With pandas version 0.13.0, you can use df.mode(axis=1)
(check your version with pd.__version__)
df.mode(axis=1)
0
0 5
1 2
2 6
[3 rows x 1 columns]
The concept you are looking for is a mode, which is the most commonly occurring number in a set. SciPy and Pandas both have ways to handle modes, through scipy.stats.mode and pandas.DataFrame.mode (which works along an axis). So for this example you could say:
import numpy as np
from scipy import stats

arr = np.array([[1,5,3,4,5,5,6],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
# stats.mode returns a (mode, count) result; keep only each row's mode value
results = np.array([stats.mode(row).mode for row in arr]).ravel()
This should return a numpy array holding the mode of each row. To do the same thing with Pandas you can do:
import pandas as pd

df = np.array([[1,5,3,4,5,5,6],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
df = pd.DataFrame(df)
df.mode(axis=1)
The documentation is here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.mode.html
The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df.loc[nan_index, 0]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
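For completeness, there is also a temp-column-free variant of the original map approach: unmapped keys come back as NaN, so you can fill those straight from the source column:
# fall back to the original values wherever the map produced NaN
df[0] = df[0].map(map_array).fillna(df[0])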