Mapping pandas DataFrame rows to a pandas Series - python

I need to create a pandas Series whose elements are each a function of a row from a DataFrame. Specifically, there is a 'metadata' column which is a JSON string, and I want a Series of dicts that combine the parsed JSON with the rest of the columns. Ideally I would want something equivalent to a map method for a dataframe:
df.map(lambda row: json.loads(row.metadata).update({'timestamp':row.timestamp}))
(update is destructive and does not return a new dict but you get the point)
EDIT: You can copy this
metadata timestamp
"{'a':1,'b':2}" 000000001
"{'a':1,'c':2}" 000000002
"{'a':1,'c':2}" 000000003
And load it with
In [8]: import pandas as pd
In [9]: pd.read_clipboard()
Out[9]:
metadata timestamp
0 {'a':1,'b':2} 1
1 {'a':1,'c':2} 2
2 {'a':1,'c':2} 3
The desired result should be a pandas.Series with the contents of this list:
[{"a":1,"b":2,"timestamp":000000001}
{"a":1,"c":2,"timestamp":000000002}
{"a":1,"c":2,"timestamp":000000003}]

What about modifying the strings?
Something like:
new_metadata = df.apply(lambda x: '{},"timestamp":{}}}'.format(x.metadata[:-1], x.timestamp), axis=1)
Which produces:
In [10]: new_metadata
Out[10]:
0 {'a':1,'b':2,"timestamp":1}
1 {'a':1,'c':2,"timestamp":2}
2 {'a':1,'c':2,"timestamp":3}
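For reference, here is a minimal sketch of the dict-based result the question originally asked for, assuming the sample frame from the edit above; row_to_dict is just an illustrative name, and ast.literal_eval is used instead of json.loads only because the sample metadata uses single quotes (real JSON would go through json.loads as in the question):
import ast
import pandas as pd

df = pd.DataFrame({
    "metadata": ["{'a':1,'b':2}", "{'a':1,'c':2}", "{'a':1,'c':2}"],
    "timestamp": [1, 2, 3],
})

def row_to_dict(row):
    d = ast.literal_eval(row.metadata)  # parse the metadata string into a dict
    d["timestamp"] = row.timestamp      # fold in the remaining column(s)
    return d

result = df.apply(row_to_dict, axis=1)  # a Series of dicts, one per row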

Related

pd.merge returning Key Error when sending dataframe into a function

EMAP is a dataframe and I am using the "apply" function to perform some action on every row of the EMAP dataframe.
The function "merge" raises a "Key Error" on the columns of the "row" argument.
But when I use the original dataframe (commented out in the code) inside the merge function, I receive no error.
def merge(row):
    a = row[col_select_Event]
    #a = EMAP[col_select_Event][1:2]
    filtered_RCA = pd.merge(RCA, a, on=col_select_Event, how='inner')
    return a
j = EMAP.apply(merge, axis=1)
EMAP data frame is like this:
A           B   C
Apple       1   abc
Orange      2   abc
Starwberry  3   abc
RCA data frame is like this:
A        B
Apple    1
Orange   2
col_select_Event = ['A','B']
How do I resolve the error?
The apply function always passes each row of a dataframe as a Pandas Series.
So, the "row" argument is of Pandas Series datatype.
EMAP[col_select_Event][1:2] ------> This is of type DataFrame and hence it works
whereas
row[col_select_Event] ---------> This is a Pandas Series
You cannot merge a Pandas Series with a Pandas DataFrame. When using the apply function, the columns of the dataframe become the index of the pandas series.
To enable usage of the merge function, you must convert the pandas series to a pandas dataframe:
row[col_select_Event].to_frame().T
The above code should work.
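A minimal sketch of that fix, assuming the sample EMAP/RCA frames from the question; the infer_objects() call is an extra precaution (not part of the answer above) to restore numeric dtypes that become object when a row is extracted as a Series, so the merge keys line up:
import pandas as pd

EMAP = pd.DataFrame({"A": ["Apple", "Orange", "Starwberry"],
                     "B": [1, 2, 3],
                     "C": ["abc", "abc", "abc"]})
RCA = pd.DataFrame({"A": ["Apple", "Orange"], "B": [1, 2]})
col_select_Event = ['A', 'B']

def merge(row):
    # row arrives as a Series; turn it back into a one-row DataFrame before merging
    a = row[col_select_Event].to_frame().T.infer_objects()
    filtered_RCA = pd.merge(RCA, a, on=col_select_Event, how='inner')
    return filtered_RCA

j = EMAP.apply(merge, axis=1)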

Python Pandas: Filling data frame with pd.Series in each element

The library sktime requires a very "particular" data format. For n time series the T values of each series need to be stored in a pandas Dataframe of pandas Series of length T like this:
DataFrame:
index | Data
0 | pd.Series
1 | pd.Series
... | ...
n-1 | pd.Series
My attempt to fill an empty data frame with n = 2 and T = 3 in a loop by reading from another data frame did not work. Here is my reduced version that uses a constant pd.Series in each row:
import pandas as pd
df = pd.DataFrame(columns=["Data"])
for i in range(2):
    df.loc[i] = pd.Series([2, 4, 5])
Note that from many examples on the site, I know (1) how to fill a normal data frame in a for loop and (2) my attempt is not efficient even if it was working.
pandas doesn't want you to store complex objects in a cell, so if you try to create a DataFrame from Series, pandas will flatten it to a 2-d structure. To avoid that we need to work with a Series; the 1-D structure ensures the Series are placed in a single cell.
Append your Series to a dict, construct the Series of Series with the basic constructor, and make it a DataFrame with Series.to_frame:
d = {}
for i in range(2):
    d[i] = pd.Series([2, 4, 5] * (i + 1))
df = pd.Series(d).to_frame('Data')
# Check they're Series
print(df.applymap(type))
# Data
#0 <class 'pandas.core.series.Series'>
#1 <class 'pandas.core.series.Series'>

Convert pandas data frame to series

I'm somewhat new to pandas. I have a pandas data frame that is 1 row by 23 columns.
I want to convert this into a series. I'm wondering what the most pythonic way to do this is?
I've tried pd.Series(myResults) but it complains ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It's not smart enough to realize it's still a "vector" in math terms.
Thanks!
You can squeeze the single-row dataframe down to a Series (the inverse of to_frame), either directly along the row axis or by transposing first (which still results in a dataframe) and then squeezing.
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df.squeeze(axis=0)
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
Note: To accommodate the point raised by @IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row, in which case the user should implement their desired functionality.
if df.empty:
    # Empty dataframe, so convert to empty Series.
    result = pd.Series()
elif df.shape == (1, 1):
    # DataFrame with one value, so convert to series with appropriate index.
    result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
    # Convert to series per OP's question.
    result = df.T.squeeze()
else:
    # Dataframe with multiple rows. Implement desired behavior.
    pass
This can also be simplified along the lines of the answer provided by @themachinist.
if len(df) > 1:
    # Dataframe with multiple rows. Implement desired behavior.
    pass
else:
    result = pd.Series() if df.empty else df.iloc[0, :]
It's not smart enough to realize it's still a "vector" in math terms.
Say rather that it's smart enough to recognize a difference in dimensionality. :-)
I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
a0 a1 a2 a3 a4
0 0 1 2 3 4
>>> df.iloc[0]
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
You can retrieve the series through slicing your dataframe using one of these two methods:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1, 8))
series1 = df.iloc[0, :]
type(series1)
pandas.core.series.Series
You can also use stack():
df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
After you create df, run:
df.stack()
You obtain your dataframe as a Series.
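For reference, a quick sketch of what stack() returns here: a Series indexed by (row label, column label), so the result carries a MultiIndex rather than the flat index produced by df.iloc[0].
>>> df.stack()
0  a0    0
   a1    1
   a2    2
   a3    3
   a4    4
dtype: int64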
If you have a one-column dataframe df, you can convert it to a series:
df.iloc[:,0] # pandas Series
Since you have a one-row dataframe df, you can transpose it so you're in the previous case:
df.T.iloc[:,0]
Another way -
Suppose myResult is the DataFrame that contains your data in the form of 1 column and 23 rows.
# label your columns by passing a list of names
myResult.columns = ['firstCol']
# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']
print(type(myResult))
In a similar fashion, you can get a Series from a DataFrame with multiple columns.
data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)
This gives a dataframe whose index is the column names of data, with all the data in the "value" column.
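For reference, a sketch of what the melt-based approach above produces (the "value" column viewed as a Series, indexed by the original column names):
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3, 34], "b": [5, 6, 7, 8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)
print(new_data["value"])
# variable
# a     1
# a     2
# a     3
# a    34
# b     5
# b     6
# b     7
# b     8
# Name: value, dtype: int64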
Another way is very simple:
df = df.iloc[3].reset_index(drop=True).squeeze()
squeeze() is the call that converts the result to a Series.

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[1,2,'a'],[1,2,'b'],[1,2,'c']])
Note, this is simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. Well, I imagine that I could take each member of the nested list while keeping the other column values in mind, then use a list comprehension to build more lists, and keep adding lists until I can build a new dataframe... But this seems just a bit too complex. What about a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue that the OP described, but related - and more pandas-like - is the situation where you have a dict of lists with lists of unequal lengths. In that case, you can create a DataFrame in long format like this:
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to store the index in the list
# NOTE this last step results in duplicate indexes
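For completeness, a minimal sketch using DataFrame.explode (available since pandas 0.25), which expands a list-valued column into rows directly; the column names col1, col2 and data are illustrative, not from the question:
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]], columns=['col1', 'col2', 'data'])

# explode() repeats the other columns' values for every element of the list
expanded = df.explode('data').reset_index(drop=True)
print(expanded)
#    col1  col2 data
# 0     1     2    a
# 1     1     2    b
# 2     1     2    c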

Drop non-numeric columns from a pandas DataFrame [duplicate]

This question already has answers here:
How do I find numeric columns in Pandas?
(13 answers)
Closed 3 years ago.
In my application I load text files that are structured as follows:
First non numeric column (ID)
A number of non-numeric columns (strings)
A number of numeric columns (floats)
The number of the non-numeric columns is variable. Currently I load the data into a DataFrame like this:
source = pandas.read_table(inputfile, index_col=0)
I would like to drop all non-numeric columns in one fell swoop, without knowing their names or indices, since this should be doable by reading their dtypes. Is this possible with pandas or do I have to cook up something on my own?
To avoid using a private method you can also use select_dtypes, where you can either include or exclude the dtypes you want.
Ran into it on this post on the exact same thing.
Or in your case, specifically:
source.select_dtypes(['number']) or source.select_dtypes([np.number])
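For instance, a minimal sketch using the same example frame as the answer below:
import numpy as np
import pandas as pd

source = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2], 'C': [(1, 2), (3, 4)]})

numeric_only = source.select_dtypes(include=[np.number])  # or include=['number']
print(numeric_only)
#    B
# 0  1
# 1  2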
It's a private method, but it will do the trick: source._get_numeric_data()
In [2]: import pandas as pd
In [3]: source = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2], 'C': [(1,2), (3,4)]})
In [4]: source
Out[4]:
A B C
0 foo 1 (1, 2)
1 bar 2 (3, 4)
In [5]: source._get_numeric_data()
Out[5]:
B
0 1
1 2
This removes every column whose dtype is not float64:
df = pd.read_csv('sample.csv', index_col=0)
non_floats = []
for col in df:
    if df[col].dtypes != "float64":
        non_floats.append(col)
df = df.drop(columns=non_floats)
I also have another possible solution for dropping the columns with categorical values, using two lines of code: the first line collects the columns with categorical values, and the second drops them (df is our DataFrame):
to_be_dropped = pd.DataFrame(df.categorical).columns
df = df.drop(to_be_dropped, axis=1)
