The following code finds any strings for column B. Is it possible to loop over multiple columns of a dataframe outputting the cells containing strings for each column?
import pandas as pd
for i in df:
    print(df[df['i'].str.contains(r'^[a-zA-Z]+$')])
Link to code above
https://stackoverflow.com/a/65410078/12801962
Here is how to loop through columns
import pandas as pd
colList = ['ColB', 'Some_other', 'ColC']
for col in colList:
    subdf = df[df[col].str.contains(r'^[a-zA-Z]+$')]
    # do something with the sub-DataFrame
Or do it in one combined test and get all the problem rows in one dataframe:
import pandas as pd
subdf = df[((df['ColB'].str.contains(r'^[a-zA-Z]+$')) |
(df['Some_other'].str.contains(r'^[a-zA-Z]+$')) |
(df['ColC'].str.contains(r'^[a-zA-Z]+$')))]
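If the list of columns grows, the same test can be written more compactly; a sketch, assuming all the listed columns hold strings:
import pandas as pd

colList = ['ColB', 'Some_other', 'ColC']
# one boolean column per tested column, then keep rows where any of them matches
mask = df[colList].apply(lambda s: s.str.contains(r'^[a-zA-Z]+$', na=False)).any(axis=1)
subdf = df[mask]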
Not sure if this is what you intend to do:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['ColA'] = ['ABC', 'DEF', 12345, 23456]
df['ColB'] = ['abc', 12345, 'def', 23456]

# np.bool was removed from NumPy; the builtin bool works everywhere
all_trues = pd.Series(np.ones(df.shape[0], dtype=bool))
for col in df:
    # na=False counts non-string cells (which yield NaN) as non-matches
    all_trues &= df[col].str.contains(r'^[a-zA-Z]+$', na=False)
df[all_trues]
Which will give the result:
ColA ColB
0 ABC abc
Try:
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')])
Or, for the values only (no index nor column information):
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')].values)
Note, both of the above only work because you just want to print the matching values in the columns, not return a new structure with filtered entries.
If you tried to make a new DataFrame with cells filtered by the condition, that would lead to ragged arrays, which are not implemented (you could replace the non-matching cells with a marker of your choice, but you cannot cut them away). Another possibility is to select the rows where any or all of the cells match the condition you are testing for; that way the result is a homogeneous array, not a ragged one.
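A quick sketch of that any/all row selection, using the same pattern:
mask = df.astype(str).apply(lambda s: s.str.contains(r'^[a-zA-Z]+$'))
df[mask.any(axis=1)]  # rows where at least one cell matches
df[mask.all(axis=1)]  # rows where every cell matches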
Yet another option would be to return a list of Series, each representing a column, or a dict of colname: Series:
{k: s.loc[s.str.contains(r'^[a-zA-Z]+$')] for k, s in df.astype(str).items()}
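A quick usage sketch, printing each column's matches from that dict:
matches = {k: s.loc[s.str.contains(r'^[a-zA-Z]+$')] for k, s in df.astype(str).items()}
for col, hits in matches.items():
    print(col, hits.tolist())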
Related
I'm trying to group the values of the list below in a dataframe based on Style, Gender and Region, but with the values filled down.
My current attempt gets a dataframe without Style and Region filled down. I'm not sure if this is a good approach, or whether it would be better to manipulate the list lst directly.
import pandas as pd
lst = [
['Tee','Boy','East','12','11.04'],
['Golf','Boy','East','12','13'],
['Fancy','Boy','East','12','11.96'],
['Tee','Girl','East','10','11.27'],
['Golf','Girl','East','10','12.12'],
['Fancy','Girl','East','10','13.74'],
['Tee','Boy','West','11','11.44'],
['Golf','Boy','West','11','12.63'],
['Fancy','Boy','West','11','12.06'],
['Tee','Girl','West','15','13.42'],
['Golf','Girl','West','15','11.48']
]
df1 = pd.DataFrame(lst, columns = ['Style','Gender','Region','Units','Price'])
df2 = df1.groupby(['Style','Region','Gender']).count()
[Screenshots in the original post show the current output (content of df2) and the desired output with Style and Region filled down.]
You just need to use reset_index, which moves the group keys out of the index and back into regular columns:
df2.reset_index(inplace=True)
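Equivalently (a sketch), groupby can be told not to build the index in the first place via its as_index parameter:
df2 = df1.groupby(['Style', 'Region', 'Gender'], as_index=False).count()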
I have a numeric np array which I want to use as a condition/filter over column number 4 of a dataframe (df) to extract a subset of the dataframe (sale_data_sub). However, I am getting an empty sale_data_sub (just the column names and no rows) from this code:
sale_data_sub = df.loc[df[4].isin(sale_condition_arr)].values
sale_condition_arr is a numpy array
df is the original dataframe with 100 columns
sale_data_subset is the desired sub_dataframe
Sorry that I didn't include a working sample.
The issue is that your df dataframe doesn't have headers assigned.
try:
# give your dataframe a header:
df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')

# then proceed to your usual work with df:
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values  # be careful, it's df["4"] not df[4]
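A minimal sketch of that recipe with hypothetical data (assuming the CSV was read with header=None, so the column labels start out as the integers 0..n-1):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))  # integer column labels 0..4
sale_condition_arr = np.array([4, 14])

df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values
print(sale_data_sub)  # the rows whose column "4" value is in the array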
I want to do data inspection and print the count of rows that match a certain value in one of the columns. Below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv has a column "income" which has 3 values '<=50K', '=50K' and '>50K'
and I want to print the number of rows whose income value is '<=50K'.
I was trying something like this:
count = data['income']='<=50K'
That does not work though.
Sum the Boolean selection:
data['income'].eq('<=50K').sum()
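Relatedly, value_counts gives the counts for every distinct income value in one call:
data['income'].value_counts()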
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
mask = data['income']=='<=50K'
print(mask) # notice the boolean Series based on the filter criteria
df2 = data[mask] # next we use that boolean Series to filter data
print(df2)
print(len(df2))
I'm new to Python and especially to pandas, so I don't really know what I'm doing. I have 10 columns with 100,000 rows of 4-letter strings. I need to filter out the rows which don't contain 'DDD' in all of the columns.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
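One caveat (a hedged note): if any of the cells can be NaN, str.contains returns NaN for them; passing na=False, a standard str.contains parameter, makes missing values count as non-matches:
df[df[selected_cols].apply(lambda x: x.str.contains('DDD', na=False)).all(axis=1)]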
You can do this, but only if all of your columns are string type:
for column in df.columns:
    df = df[df[column].str.contains('DDD')]
You can use str.contains, but only on a Series, not on a DataFrame. So to use it we look at each column (which is a Series) one at a time by looping over them:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
...                    ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
...                    ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
...                    ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
...                   columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
...     df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
A B C D
0 DDDA DDDB DDDC DDDD
2 DDDI DDDJ DDDK DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.
I have a dataframe in which the third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[[1,2,'a']],[[1,2,'b']],[[1,2,'c']]])
Note, this is a simplified example. In reality I have multiple rows that I would like to "expand".
As for my progress, I have no idea how to solve this. I imagine I could take each member of the nested list while keeping the other column values, then use a list comprehension to build more lists, and keep adding lists until I can create a new dataframe... but that seems too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue the OP described, but related (and more pandas-like) is the situation where you have a dict of lists of unequal lengths. In that case, you can create a DataFrame in long format like this:
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack()  # reshape to long form
df = df.dropna()  # drop the NaN values generated by the unequal list lengths
df.index = df.index.droplevel(level=0)  # if you don't want the position within each list kept in the index
# NOTE: this last step results in duplicate index values
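For the row expansion the OP originally asked about, newer pandas (0.25+) also offers DataFrame.explode; a minimal sketch with the OP's example:
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]])
# repeats the values in columns 0 and 1 for each element of the list in column 2
print(df.explode(2).reset_index(drop=True))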