Convert pandas data frame to series - python

I'm somewhat new to pandas. I have a pandas DataFrame that is 1 row by 23 columns.
I want to convert this into a Series. What is the most pythonic way to do that?
I've tried pd.Series(myResults), but it complains: ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It's not smart enough to realize it's still a "vector" in math terms.
Thanks!

You can squeeze the single-row dataframe into a series (squeeze is the inverse of to_frame); passing axis=0 collapses the length-1 row axis:
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df.squeeze(axis=0)
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
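Since squeeze is described here as the inverse of to_frame, a quick round trip is one way to sanity-check this; a small sketch reusing the df above:
>>> s = df.squeeze(axis=0)  # one-row DataFrame -> Series
>>> s.to_frame().T          # Series -> one-row DataFrame again
   a0  a1  a2  a3  a4
0   0   1   2   3   4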
Note: To accommodate the point raised by @IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row, in which case the user should implement their desired functionality.
if df.empty:
    # Empty dataframe, so convert to an empty Series.
    result = pd.Series()
elif df.shape == (1, 1):
    # DataFrame with one value, so convert to a Series with the appropriate index.
    result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
    # Convert to a series per the OP's question.
    result = df.T.squeeze()
else:
    # Dataframe with multiple rows. Implement desired behavior.
    pass
This can also be simplified along the lines of the answer provided by @themachinist.
if len(df) > 1:
    # Dataframe with multiple rows. Implement desired behavior.
    pass
else:
    result = pd.Series() if df.empty else df.iloc[0, :]

It's not smart enough to realize it's still a "vector" in math terms.
Say rather that it's smart enough to recognize a difference in dimensionality. :-)
I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
a0 a1 a2 a3 a4
0 0 1 2 3 4
>>> df.iloc[0]
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>

You can retrieve the series through slicing your dataframe using one of these two methods:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1, 8))
series1 = df.iloc[0, :]
type(series1)
pandas.core.series.Series

You can also use stack():
df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
After you create df, run:
df.stack()
This returns the dataframe's contents as a Series.
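Note that for a one-row frame, stack() produces a Series with a (row, column) MultiIndex. If you want it indexed by the column labels alone, one possible follow-up (assuming pandas 0.24+ for droplevel):
import pandas as pd

df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
s = df.stack().droplevel(0)  # drop the row level, keep the column labels
print(s)
# a0    0
# a1    1
# a2    2
# a3    3
# a4    4
# dtype: int64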

If you have a one-column dataframe df, you can convert it to a series:
df.iloc[:, 0]  # pandas Series
Since you have a one-row dataframe df, you can transpose it to reach the previous case:
df.T.iloc[:, 0]

Another way:
Suppose myResult is the DataFrame that contains your data in the form of 1 column and 23 rows.
# label your columns by passing a list of names
myResult.columns = ['firstCol']
# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']
print(type(myResult))
In a similar fashion, you can get a series from a DataFrame with multiple columns.
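For example, a small sketch with made-up column names:
import pandas as pd

df = pd.DataFrame({'firstCol': [1, 2], 'secondCol': [3, 4]})
s = df['secondCol']  # selecting any single column returns a Series
print(type(s))       # <class 'pandas.core.series.Series'>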

data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)
This gives a dataframe whose index holds the column names of data, with all the data in the "value" column.
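From there, selecting the "value" column gives you an actual Series (a sketch continuing the snippet above):
series = new_data["value"]  # Series indexed by the original column names of data
print(type(series))         # <class 'pandas.core.series.Series'>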

Another very simple way:
df = df.iloc[3].reset_index(drop=True).squeeze()
squeeze() is the step that collapses a single-row or single-column DataFrame into a Series.

Related

Pandas check if cell is null in any of two dataframes and if it is, make both cells null

I have two dataframes with same shape:
>>> df1.shape
(400,1200)
>>> df2.shape
(400,1200)
I would like to compare cell-by-cell and if a value is missing in one of the dataframes make the equivalent value in the other dataframe NaN as well.
Here's a (pretty inefficient) piece of code that works:
for i in df1.columns:  # iterate over columns
    for j in range(len(df1)):  # iterate over rows
        if pd.isna(df1[i][j]) | pd.isna(df2[i][j]):
            df1[i][j] = np.NaN
            df2[i][j] = np.NaN
How would be a better way to do this? I'm very sure there is.
This is a simple problem to solve with pandas. You can use this code:
df1[df2.isna()] = df2[df1.isna()] = np.nan
It first creates a mask of df2, i.e., a boolean copy of the dataframe containing only True or False values. Each NaN in df2 becomes a True in the mask, and every other value becomes a False.
With pandas, you can use such masks for bulk operations. Passing the mask to df1's [] and assigning a value sets the corresponding entry of df1 wherever the mask is True.
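A small self-contained sketch with made-up data, just to illustrate the mask behavior:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
df2 = pd.DataFrame({'x': [np.nan, 2.0, 3.0]})

df1[df2.isna()] = df2[df1.isna()] = np.nan  # sync NaNs across both frames

print(df1['x'].tolist())  # [nan, nan, 3.0]
print(df2['x'].tolist())  # [nan, nan, 3.0]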

Python Pandas: Filling data frame with pd.Series in each element

The library sktime requires a very "particular" data format. For n time series the T values of each series need to be stored in a pandas Dataframe of pandas Series of length T like this:
DataFrame:
index | Data
0 | pd.Series
1 | pd.Series
... | ...
n-1 | pd.Series
My attempt to fill an empty data frame with n = 2 and T = 3 in a loop by reading from another data frame did not work. Here is my reduced version that uses a constant pd.Series in each row:
import pandas as pd
df = pd.DataFrame(columns=["Data"])
for i in range(2):
    df.loc[i] = pd.Series([2, 4, 5])
Note that from many examples on the site, I know (1) how to fill a normal data frame in a for loop and (2) my attempt is not efficient even if it was working.
pandas doesn't want you to store complex objects in a cell, so if you try to create a DataFrame from Series, pandas will flatten them into a 2-D structure. To avoid that, we need to work with a Series: the 1-D structure ensures each inner Series lands in a single cell.
Append your Series to a dict, construct the Series of Series with the basic constructor, and make it a DataFrame with Series.to_frame:
d = {}
for i in range(2):
    d[i] = pd.Series([2, 4, 5] * (i + 1))

df = pd.Series(d).to_frame('Data')

# Check they're Series
print(df.applymap(type))
#                                   Data
# 0  <class 'pandas.core.series.Series'>
# 1  <class 'pandas.core.series.Series'>
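You can then pull a full Series back out of a single cell, e.g.:
print(df.loc[0, 'Data'])
# 0    2
# 1    4
# 2    5
# dtype: int64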

Process for multiple columns

I have this code, which works for one pandas Series. How do I apply it to all columns of my large dataset? I have tried many solutions, but none worked for me.
c = data["High_banks"]
c2 = pd.to_numeric(c.str.replace(',',''))
data = data.assign(High_banks = c2)
What is the best way to do this?
I think you can do it like this:
df = df.replace(",", "", regex=True)
After that you can convert the datatype.
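For example, something along these lines (a sketch, assuming every column should end up numeric once the commas are stripped):
df = df.replace(",", "", regex=True)
df = df.apply(pd.to_numeric)  # convert each column to a numeric dtype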
You can use a combination of the methods apply and applymap.
Take this for an example:
df = pd.DataFrame([['1,', '2,12'], ['3,356', '4,567']], columns=['a', 'b'])
new_df = (df.applymap(lambda x: x.replace(',', ''))
            .apply(pd.to_numeric, axis=1))
new_df.dtypes
# successfully converted to numeric types
a int64
b int64
dtype: object
The first method, applymap, runs element-wise over the dataframe to remove the commas; then apply runs pd.to_numeric along axis=1 to convert the values to numeric types.

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which the third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nesting and create more rows with identical values in the first and second columns.
The end result should be something like:
pd.DataFrame([[1,2,'a'],[1,2,'b'],[1,2,'c']])
Note, this is a simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress: I have no idea how to solve this. I imagine I could take each member of the nested list while keeping the other column values in mind, then use a list comprehension to build more lists and assemble them into a new dataframe... but that seems a bit too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
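For the general multi-row case, newer pandas (0.25+) also has DataFrame.explode, which performs exactly this expansion; a sketch using the OP's data:
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]])
print(df.explode(2))  # explode column 2, repeating columns 0 and 1 per list element
#    0  1  2
# 0  1  2  a
# 0  1  2  b
# 0  1  2  c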
Not exactly the same issue that the OP described, but related (and more pandas-like) is the situation where you have a dict of lists of unequal lengths. In that case, you can create a DataFrame in long format like this:
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to store the index in the list
# NOTE this last step results in duplicate indexes

Drop non-numeric columns from a pandas DataFrame [duplicate]

This question already has answers here:
How do I find numeric columns in Pandas?
(13 answers)
Closed 3 years ago.
In my application I load text files that are structured as follows:
A first non-numeric column (ID)
A number of non-numeric columns (strings)
A number of numeric columns (floats)
The number of the non-numeric columns is variable. Currently I load the data into a DataFrame like this:
source = pandas.read_table(inputfile, index_col=0)
I would like to drop all non-numeric columns in one fell swoop, without knowing their names or indices, since this should be doable by reading their dtypes. Is this possible with pandas or do I have to cook up something on my own?
To avoid using a private method you can also use select_dtypes, where you can either include or exclude the dtypes you want.
Ran into it on this post on the exact same thing.
Or in your case, specifically:
source.select_dtypes(['number']) or source.select_dtypes([np.number])
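A quick sketch of both directions with a toy frame:
import numpy as np
import pandas as pd

source = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2], 'C': [0.5, 1.5]})
print(source.select_dtypes(include=[np.number]))  # keeps numeric columns B and C
print(source.select_dtypes(exclude=[np.number]))  # keeps non-numeric column A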
It's a private method, but it will do the trick: source._get_numeric_data()
In [2]: import pandas as pd
In [3]: source = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2], 'C': [(1,2), (3,4)]})
In [4]: source
Out[4]:
A B C
0 foo 1 (1, 2)
1 bar 2 (3, 4)
In [5]: source._get_numeric_data()
Out[5]:
B
0 1
1 2
This removes each column whose dtype is not float64:
df = pd.read_csv('sample.csv', index_col=0)
non_floats = []
for col in df:
    if df[col].dtypes != "float64":
        non_floats.append(col)
df = df.drop(columns=non_floats)
I also have another possible solution for dropping the columns with categorical values in two lines of code: define a list holding the columns of categorical values (first line) and drop them (second line). Here df is our DataFrame:
to_be_dropped = pd.DataFrame(df.categorical).columns
df = df.drop(to_be_dropped, axis=1)
