Hi, I am trying to make some contingency tables. I want this in a function so I can reuse it for various columns/dataframes/combinations, etc.
Currently I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame(data={'group': ['A', 'A', 'B', 'B', 'C', 'D'],
                        'class': ['g1', 'g2', 'g2', 'g3', 'g1', 'g2'],
                        'total': ['0-10', '20-30', '0-10', '30-40', '50-60', '20-30'],
                        'sub': ['1-4', '5-9', '10-14', '15-19', '1-4', '15-19'],
                        'n': [3, 14, 12, 11, 21, 9]})
and a function that looks like this:
def cts(tabs, df):
    out = []
    for col in df.loc[:, df.columns != tabs]:
        a = pd.crosstab([df[tabs]], df[col])
        out.append(a)
    return out
cts('group', df)
which works for cross tabulations of one column against the rest. But I want to add two (or more!) levels to the grouping, e.g.
pd.crosstab([df['group'], df['class']], df['total'])
where total is cross tabulated against both group and class.
I think the 'tabs' var in the function should be a list of column names, but when I try to make it a list I get errors about invalid syntax. I hope this makes sense. Thank you!
Try:
def cts(tabs, df):
    out = []
    cols = [col for col in df.columns if col not in tabs]
    for col in df.loc[:, cols]:
        a = pd.crosstab([df[tab] for tab in tabs], df[col])
        out.append(a)
    return out
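For example, with the sample df above you can call the revised function with a list of two index columns (a small usage sketch; note that tabs should now be passed as a list, even for a single column):
tables = cts(['group', 'class'], df)
for t in tables:
    print(t, end='\n\n')   # one crosstab each for 'total', 'sub' and 'n'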
Related
I have a pandas dataframe and I want to filter/select conditions based on elements of an input list. So, for example, I have something like:
filters = ['category', 'name']
# I am just trying to select the columns which would match as follows:
data = {'category_name_courses': ["Spark", "PySpark", "Python", "pandas"], 'category_name_area': ["cloud", "cloud", "prog", "ds"], 'some_other_column': [0, 0, 0, 0]}
x = pd.DataFrame(data)
selections = list()
for col in x.columns:
    if ('name' in col) and ('category' in col):
        selections.append(col)
In my case, this if condition (or some other way of selecting) should be built by 'ANDing' everything from the input list.
IIUC, do you want this?
reg_str = ''.join(f'(?=.*{f})' for f in filters)   # one lookahead per term, so every term must appear
x.filter(regex=reg_str)
Output:
category_name_courses category_name_area
0 Spark cloud
1 PySpark cloud
2 Python prog
3 pandas ds
Your edit shows that you want to filter columns based on their name.
Simply use:
filters = ['category', 'name']
for col in x.columns:
    if all(f in col for f in filters):
        print(col)
Output:
category_name_courses
category_name_area
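If you want the matching columns as a subset of the dataframe rather than just printed names, the same all() test fits in a list comprehension (a small sketch using the x and filters defined above):
selected = [col for col in x.columns if all(f in col for f in filters)]
x[selected]   # only the columns whose names contain every filter term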
older answer: filtering values
You can do almost what you suggested:
x = pd.DataFrame([['flow', 'x', 'category'],['x','x','flow']])
for col in x.columns:
    if ('flow' in x[col].values) and ('category' in x[col].values):
        # Do something with this column...
        print(f'column "{col}" matches')
Using a list of matches:
filters = ['category', 'flow']
for col in x.columns:
    if all(f in x[col].values for f in filters):
        # Do something with this column...
        print(f'column "{col}" matches')
Or, more efficiently, using a set:
filters = set(['category', 'flow'])
for col in x.columns:
    if set(x[col]) >= filters:
        # Do something with this column...
        print(f'column "{col}" matches')
Example:
column "2" matches
Try it (if I understand your problem correctly):
You have a list of conditions, and you want to be sure that the column name includes all of them.
filters = ['category', 'name']
# x is some pandas dataframe
# I am trying to create something which would be equivalent of:
for col in x.columns:
    boo = True
    for condition in filters:
        if condition not in col:
            boo = False
            break
    if boo:
        print(col)   # or do something else with the matching column
I really did not understand the question. I assume you want to find all column names which contain any value in the given list.
The pandas data frame has loads of methods and functionalities. You may be familiar with dt which allows one to apply datetime functionalities to a given column.
In the same way one can use str for string functionalities.
So you can check whether a string occurs in any value (row) of a column (a_df here being a single column, i.e. a Series):
a_df.str.contains("string")
It will return a boolean Series indicating where the condition is met.
One can use | (or) to check multiple strings:
a_df.str.contains("string|something")
So the line should look like:
x.columns.str.contains("|".join(filters))
Now that you have the mask, you can apply it:
x.columns[x.columns.str.contains("|".join(filters))]
Since you have the column names, you can access the column data itself:
x[x.columns[x.columns.str.contains("|".join(filters))]]
Output:
category_name_courses category_name_area
0 Spark cloud
1 PySpark cloud
2 Python prog
3 pandas ds
I have a simple question about style and how to do something correctly.
I want to take all the unique values of certain columns in a pandas dataframe and create a map ['columnName'] -> [valueA,valueB,...]. Here is my code that does that:
listUnVals = {}
for col in df:
    if (col != 'colA') and (col != 'colB'):
        listUnVals[col] = df[col].unique()
I want to exclude some columns like colA and colB. Is there a better way to filter out the columns I don't want than writing an if with chained != conditions? I hoped to create a lambda expression that filters these values, but I can't get it right.
Any answer would be appreciated.
A couple of ways to remove the unneeded columns:
df.columns[~df.columns.isin(['colA', 'colB'])]
Or,
df.columns.difference(['colA', 'colB'])
And you can skip the loop entirely with:
{c: df[c].unique() for c in df.columns[~df.columns.isin(['colA', 'colB'])]}
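As a quick sketch of what that comprehension produces, assuming a toy dataframe named demo with columns colA, colB and colC (the names are only illustrative):
demo = pd.DataFrame({'colA': [1, 2], 'colB': [3, 3], 'colC': ['x', 'y']})
{c: demo[c].unique() for c in demo.columns[~demo.columns.isin(['colA', 'colB'])]}
# {'colC': array(['x', 'y'], dtype=object)}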
You can create a list of unwanted columns and then check membership with in:
>>> unwanted = ['columnA', 'columnB']
>>> for col in df:
...     if col not in unwanted:
...         listUnVals[col] = df[col].unique()
Or using dict comprehension:
{col : df[col].unique() for col in df if col not in unwanted}
I have a data frame in pyspark with more than 100 columns. What I want to do is, for all the column names, add backticks (`) at the start and end of the column name.
For example:
The column name is testing user and I want `testing user`.
Is there a method to do this in pyspark/python? When we apply the code, it should return a data frame.
Use a list comprehension in Python:
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
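A minimal sketch of that conditional renaming; list_of_cols_to_change here is a hypothetical list of the columns you actually want to rename:
from pyspark.sql import functions as F

list_of_cols_to_change = ['testing user']   # hypothetical subset of df.columns
df_new = df.select([
    F.col(c).alias("prefix_" + c + "_suffix" if c in list_of_cols_to_change else c)
    for c in df.columns
])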
To add a prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2, ...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside the select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
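For example (a sketch assuming an existing SparkSession named spark):
sdf = spark.createDataFrame([("Alice", 34)], ["name", "age"])
sdf = add_prefix(sdf, "prefix_")
print(sdf.columns)   # ['prefix_name', 'prefix_age']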
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do it like this:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col

df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (i) [column name] into the print statement
column_list = df.columns.values.tolist()
for column_name in column_list:
print(df."[column_name]".unique()
Update
When I use this, I get "Unexpected EOF Parsing" with no extra details.
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
print(sorted_data[column_name].unique()
What is the difference between your syntax YS-L (above) and the below:
for column_name in sorted_data:
    print(column_name)
    s = sorted_data[column_name].unique()
    for i in s:
        print(str(i))
It can be written more concisely like this:
for col in df:
    print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute accessing makes the code a bit more concise when the target column name is known beforehand, but has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123), or clashes with the built-in DataFrame attribute (e.g. df.index). On the other hand, the [] notation should always work.
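A couple of quick illustrations of those caveats (demo and its column names are only for demonstration):
demo = pd.DataFrame({'col': [1, 2], '123': [3, 4], 'index': [5, 6]})
demo['col']     # works, equivalent to demo.col
demo['123']     # works; demo.123 would be a SyntaxError
demo['index']   # the column; demo.index is the DataFrame's row index instead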
The most upvoted answer is a loop solution, so here is a one-line solution using the pandas apply() method and a lambda function:
print(df.apply(lambda col: col.unique()))
This will get the unique values in proper format:
pd.Series({col:df[col].unique() for col in df})
If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = dict(zip([i for i in df.columns] , [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]))
Then you can access any dataframe easily using the name of the column:
df_dict['column_name']
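For instance, with the sample df from the first question at the top (a small sketch):
df_dict = dict(zip(df.columns, [pd.DataFrame(df[c].unique(), columns=[c]) for c in df.columns]))
df_dict['group']
#   group
# 0     A
# 1     B
# 2     C
# 3     D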
We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that for numeric columns the 'unique' entry will be NaN - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
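With the sample df from the first question at the top, that would give something like this (a sketch; exact formatting may differ):
df.astype('object').describe(include='all').loc['unique', :]
# group    4
# class    3
# total    4
# sub      4
# n        6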
I was seeking a solution to this problem as well, and the code below proved to be the most helpful in my situation:
for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
It gives something like below:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]
The code below will give you a list of unique values for each field; I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
    print(col)
    print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np
for col in list(df):
    print(col)
    print(np.sort(df[col].unique()))
# Collect the unique values of the first seven columns of `card`
# and display them side by side as a transposed dataframe.
cu = []   # unique values per column
i = []    # column names
for cn in card.columns[:7]:
    cu.append(card[cn].unique())
    i.append(cn)

pd.DataFrame(cu, index=i).T
Simply do this:
for i in df.columns:
    print(df[i].unique())
Or, for a single column, the unique values can be printed one per line:
for val in df['column_name'].unique():
    print(val)
Even better, here's code to view all the unique values as a transposed dataframe, column-wise:
columns = [*df.columns]
unique_values = {}
for i in columns:
    unique_values[i] = df[i].unique()

unique = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in unique_values.items()]))
unique.fillna('').T
This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []
for col in df:
    try:
        unique_values = df[col].unique()
        unique_len = len(unique_values)
    except TypeError:  # not all cols are hashable
        unique_values = ""
        unique_len = -1
    perc_unique = unique_len * 100 / df_len
    unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))

df_unique_cols = pd.DataFrame(unique_cols_list, columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"])
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values("unique_len", ascending=False)
print(df_unique_cols)
The best way to do that:
Series.unique()
For example, with students.age.unique() the output will be the different values that occur in the age column of the students data frame.
To get only the number of how many different values:
Series.nunique()
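For example (a sketch with a made-up students frame):
students = pd.DataFrame({'age': [18, 21, 21, 19]})
print(students.age.unique())    # [18 21 19]
print(students.age.nunique())   # 3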
I know we can access columns using table.cols.somecolumn, but I need to apply the same operation on 10-15 columns of my table, so I'd like an iterative solution. I have the names of the columns as strings in a list: ['col1', 'col2', 'col3'].
So I'm looking for something along the lines of:
for col in columnlist:
    thiscol = table.cols[col]
    # apply whatever operation
Try this:
columnlist = ['col1','col2','col3']
for col in columnlist:
    thiscol = getattr(table.cols, col)
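For instance, a sketch of applying the same operation to each of those columns, assuming table is an open PyTables table (the column names and the operation are only illustrative):
columnlist = ['col1', 'col2', 'col3']
for col in columnlist:
    thiscol = getattr(table.cols, col)
    print(col, thiscol[:5])   # e.g. read the first five values of each column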