I have a dataframe:
d = {'page_number': [0,0,0,0,0,0,1,1,1,1], 'text': ['aa','ii','cc','dd','ee','ff','gg','hh','ii','jj']}
df = pd.DataFrame(data=d)
df
page_number text
0 0 aa
1 0 ii
2 0 cc
3 0 dd
4 0 ee
5 0 ff
6 1 gg
7 1 hh
8 1 ii
9 1 jj
I want to spot the page_number where 'gg' appears. On that same page_number there can be many different substrings, but I'm interested in extracting the row number where 'ii' appears on the same page_number as 'gg' (I'm not interested in the other appearances of the 'ii' substring).
idx=np.where(df['text'].str.contains(r'gg', na=True))[0][0]
won't necessarily help here as it retrieves the row number of 'gg' but not its 'page_number'.
Many thanks
First keep only the 'ii' and 'gg' appearances:
df = df[df['text'].isin(['ii', 'gg'])]
Then, grouping by page number and counting, whenever a page has a count of 2 both substrings appear on that page:
df2 = df.groupby('page_number').count()
df2[df2['text'] == 2]
You can use pandas to retrieve a column value on the basis of another column's value. I hope this retrieves what you are looking for.
df[df['text']=='gg']['page_number']
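To take this one step further and get the row number of 'ii' on that same page, a minimal sketch building on the line above (the variable names are only illustrative) could be:
page = df.loc[df['text'] == 'gg', 'page_number'].iloc[0]  # page_number where 'gg' appears
ii_rows = df.index[(df['page_number'] == page) & (df['text'] == 'ii')]
print(ii_rows)  # for the sample frame this yields index 8
If the match really has to be a substring match rather than an exact one, df['text'].str.contains('gg') and df['text'].str.contains('ii') can replace the equality checks.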
In case you have several 'gg's and 'ii's on any page, this will give you a boolean flag per page_number:
df = df.groupby(by='page_number').agg(lambda x: 'gg' in x.values and 'ii' in x.values)
And this will get you the page numbers:
df[df.text].index
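Note that the snippet above overwrites df with the aggregated frame. If the aggregation is instead kept in a separate variable, the flagged pages can be mapped back to row numbers in the original dataframe; a rough sketch under that assumption:
agg = df.groupby(by='page_number').agg(lambda x: 'gg' in x.values and 'ii' in x.values)
pages = agg[agg['text']].index  # page_numbers containing both 'gg' and 'ii'
rows = df.index[df['page_number'].isin(pages) & df['text'].eq('ii')]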
I have a text file containing some data for a correlation function. It is structured like this: the first two rows are bin numbers and have 45 entries, while the remaining rows hold the values at the given locations and contain 46 entries each. In these remaining rows, i.e., from row 2 onward, the first column is the order of the values.
I want to read this as a pandas dataframe. Since there is a mismatch of dimensions, pandas shows an error:
ParserError: Error tokenizing data. C error: Expected 45 fields in line 9, saw 46
To fix this error, I modify the txt file by adding 'r1' and 'r2' in place of the blank space in the first two rows.
This is a solution if there are only a few files, but unfortunately I have hundreds of files structured in the same way. Is there a way to read the data directly? It would be fine for me to skip the first column entirely from row two onward.
It seems like the first two rows are some kind of MultiIndex columns, and thus the entries for the row index are missing there (45 instead of 46 columns). If my guess is correct, you can extend Oivalf's answer with a pandas MultiIndex:
import pandas as pd
rows_uneven_dimensions = 2
df_first_two_rows = pd.read_csv("test.txt", header=None, nrows=rows_uneven_dimensions, sep='\t')
df_all_other_rows = pd.read_csv("test.txt", header=None, skiprows=rows_uneven_dimensions, sep='\t', index_col=0)  # define the first column as the index
df_all_other_rows.index.name = 'Index'  # optional: set the index name
cols = pd.MultiIndex.from_arrays([df_first_two_rows.iloc[idx] for idx in df_first_two_rows.index], names=("First Level", "Second Level"))  # define the MultiIndex based on the rows of df_first_two_rows
# the level names are just for illustration purposes
df_all_other_rows.columns = cols  # replace the old column index with the new MultiIndex
print(df_all_other_rows)
With the test.txt provided by Oivalf, the result will look like this:
First Level 0
Second Level 1 2 2 3 4 5 6
Index
0 A B C D E F G
1 H I J K L M N
2 O P Q R S T U
3 V W X Y Z A B
I would try to first read all rows except the first two, using the skiprows parameter of the read_csv() function, and afterwards read the first two rows with nrows=2.
That's only a workaround though; maybe there are better solutions.
In the end, combine the resulting dataframes.
Example:
import pandas as pd
rows_uneven_dimensions = 2
df_first_two_rows = pd.read_csv("test.txt", header=None, nrows=rows_uneven_dimensions, sep='\t')
df_all_other_rows = pd.read_csv("test.txt", header=None, skiprows=rows_uneven_dimensions, sep='\t')
frames = [df_first_two_rows, df_all_other_rows]
result = pd.concat(frames, ignore_index=True, axis=0)
test.txt
0 0 0 0 0 0 0
1 2 2 3 4 5 6
0 A B C D E F G
1 H I J K L M N
2 O P Q R S T U
3 V W X Y Z A B
The resulting dataframe's values:
[[0 0 0 0 0 0 0 nan]
[1 2 2 3 4 5 6 nan]
[0 'A' 'B' 'C' 'D' 'E' 'F' 'G']
[1 'H' 'I' 'J' 'K' 'L' 'M' 'N']
[2 'O' 'P' 'Q' 'R' 'S' 'T' 'U']
[3 'V' 'W' 'X' 'Y' 'Z' 'A' 'B']]
The last value in each of the first two rows gets filled with NaN by default in the pd.concat call, since those rows have one field fewer.
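Since the question mentions hundreds of files with the same layout, either approach can be wrapped in a loop over the file names. A rough sketch, where the glob pattern is just an assumption about how the files are named:
import glob
import pandas as pd

rows_uneven_dimensions = 2
frames = {}
for path in glob.glob("*.txt"):  # adjust the pattern to your files
    header = pd.read_csv(path, header=None, nrows=rows_uneven_dimensions, sep='\t')
    body = pd.read_csv(path, header=None, skiprows=rows_uneven_dimensions, sep='\t', index_col=0)
    body.columns = pd.MultiIndex.from_arrays([header.iloc[i] for i in header.index])
    frames[path] = body  # one MultiIndex-column dataframe per file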
I want to create two binary indicators by checking whether the characters in the first and third positions of column 'A' match the characters in the first and third positions of column 'B'.
Here is a sample data frame:
df = pd.DataFrame({'A': ['a%d', 'a%', 'i%'],
                   'B': ['and', 'as', 'if']})
A B
0 a%d and
1 a% as
2 i% if
I would like the data frame to look like below:
A B Match_1 Match_3
0 a%d and 1 1
1 a% as 1 0
2 i% if 1 0
I tried using the following string comparison, but the column just returns '0' values for the match_1 column.
df['match_1'] = np.where(df['A'][0] == df['B'][0], 1, 0)
I am wondering if there is a function that is similar to the substr function found in SQL.
You could use the pandas .str accessor, which can slice the elements:
df['match_1'] = df['A'].str[0].eq(df['B'].str[0]).astype(int)
df['match_3'] = df['A'].str[2].eq(df['B'].str[2]).astype(int)
output:
A B match_1 match_3
0 a%d and 1 1
1 a% as 1 0
2 i% if 1 0
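Note that 'a%', 'as' and 'if' have no third character, so .str[2] yields NaN for those rows; NaN never compares equal, which is why match_3 comes out as 0 there instead of raising an error.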
If you have many positions to test, you can use a loop:
for pos in (1, 3):
    df['match_%d' % pos] = df['A'].str[pos-1].eq(df['B'].str[pos-1]).astype(int)
I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot figure out how to count the total of each cluster of '1's per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
heading of the last column where a '1' occurs in the cluster: count of '1's in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how runs of the same value are clustered by row, with a count for each cluster, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to use DataFrame.set_index with DataFrame.stack to reshape. Then create consecutive-group ids in a new column g by comparing each value with its Series.shift-ed neighbour for inequality and taking the cumulative sum with Series.cumsum. Finally, filter the rows containing only 1 and aggregate with named aggregation via GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1','last'), b=('level_1','size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last, to write to a txt file, use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt","w") as f:
for i, g in df1.groupby('ID'):
f.write(f"{i}\n")
for a, b in g[['a','b']].to_numpy():
f.write(f"\t{a}: {b}\n")
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
The best method for getting the sum of each column depends on how you want to use that information, but in general the core is just to use the sum method on the series and assign the result to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results so you know which total belongs to which column. The two inputs are 1) the dataframe 2) a list of any column names you would like to ignore
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe as a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3,
 '15-16': 2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary
I have a Pandas dataframe as follow:
data = pd.DataFrame({'w1':[0,1,0],'w2':[5,8,0],'w3':[0,0,0],'w4' :[5,1,0], 'w5' : [7,1,0],'condition' : [5,1,0]})
I need to have a column that, for each row, counts the number of columns (other than "condition") whose values are equal to "condition".
The final output should look like below:
I don't want to write a for loop.
As a solution, I wanted to replace the values which are equal to "condition" with 1 and the others with 0 using np.where as below, and then sum the 1s of each row, but that was not helpful:
data = pd.DataFrame(np.where(data.loc[:,data.columns != 'condition'] == data['condition'], 1, 0), columns = data.columns)
That was just an idea (I mean replacing the values with 1 and 0), but any pythonic solution is appreciated.
Compare all columns except the last one to the condition column with DataFrame.eq and count the Trues with sum:
data['new'] = data.iloc[:, :-1].eq(data['condition'], axis=0).sum(axis=1)
Another idea is to compare all columns after removing the condition column:
data['new'] = data.drop('condition', axis=1).eq(data['condition'], axis=0).sum(axis=1)
Thanks to @Sayandip Dutta for the comment; the idea is to compare all columns and then subtract 1:
data['new'] = data.eq(data['condition'], axis=0).sum(axis=1).sub(1)
print (data)
w1 w2 w3 w4 w5 condition new
0 0 5 0 5 7 5 2
1 1 8 0 1 1 1 3
2 0 0 0 0 0 0 5
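In this last variant, .sub(1) compensates for the condition column always matching itself, which would otherwise inflate every count by one.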
We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 1], [0, 0, 1, 0, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, when we do that all the time with for loops, and what the correct way to modify DataFrames is?
Thanks for the help in advance!
Start from copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where every one of these groups has at least one non-zero column, using any and all along axis=1 as in the code above.
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the member values within each family for every record (df2.groupby(level=0).sum()). Now we retain the index values where every family has at least one member present (.gt(0).all()). We create a mask from these values and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0