I got a weird one today. I am scraping several thousand PDFs using tabula-py and, for whatever reason, the same table (from different PDFs) with wrapped text is sometimes auto-merged based on the table's actual split, but on other occasions the pandas DataFrame will have many NaN rows to account for the wrapped text. The ratio is roughly 50:1 in favor of merged tables, so it makes sense to automate the merging process. Here is the example:
Desired DataFrame:
Column1 | Column2 | Column3
A Many Many ... Lots and ... This keeps..
B lots of text.. Many Texts.. Johns and jo..
C ...
D
Scraped (returned) DataFrame:
Column1 | Column2 | Column3
A Many Many Lots This keeps Just
Nan Many Many and lots Keeps Going!
Nan Texts Nan Nan
B lots of Many Texts John and
Nan text here Johnson inc.
C ...
In this case the text should be merged upward, so that "Many Many Many Many Texts" ends up in the Column1 cell of row A, and so on.
I have solved this problem with the below solution, but it feels very dirty. There are a ton of index settings to avoid having to manage the columns and avoid dropping needed values. Is anyone aware of a better solution?
# forward-fill the index labels so wrapped rows share a key with their parent row
df = df.reset_index()
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()
df = df.fillna('')
df = df.set_index('Unnamed: 0')
# join the text fragments within each group, then drop the duplicated rows
df = df.groupby(level=0)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')
Cheers
Similar to Ben's idea:
# fill the missing index
df.index = df.index.to_series().ffill()
(df.stack() # stack to kill the other NaN values
.groupby(level=(0,1)) # groupby (index, column)
.apply(' '.join) # join those strings
.unstack(level=1) # unstack to get columns back
)
Output:
Column1 Column2 Column3
A Many Many Many Many Texts Lots and lots This keeps Just Keeps Going!
B lots of text Many Texts here John and Johnson inc.
Try this:
df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
Out[1390]:
Column1 Column2 \
Unnamed: 0
A Many Many Many Many Texts Lots and lots
B lots of text Many Texts here
Column3
Unnamed: 0
A This keeps Just Keeps Going!
B John and Johnson inc.
I think you can use ffill on the index directly in the groupby. Then use agg instead of transform.
# dummy input
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print (df)
a b
A a 1
NaN b 2
NaN c 3
B d 4
C e 5
NaN f 6
#then groupby on the filled index and agg
new_df = (df.fillna('')
            .groupby(pd.Series(df.index).ffill().values)[df.columns]
            .agg(lambda x: ' '.join(x)))
print (new_df)
a b
A a b c 1 2 3
B d 4
C e f 5 6
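To automate this over the thousands of scraped tables, one option is to wrap the ffill-plus-groupby-agg idea in a small helper. A minimal sketch only: the function name merge_wrapped_rows, the tables variable, and the assumption that the first column of each scraped table carries the row labels are illustrative, not from the answers above.
import pandas as pd

def merge_wrapped_rows(df):
    """Collapse NaN-indexed continuation rows into the labelled row above them."""
    key = df.index.to_series().ffill()            # wrapped rows inherit the label above
    return (df.fillna('')
              .groupby(key.values)
              .agg(lambda col: ' '.join(s for s in col if s)))  # skip empty fragments

# usage on the list of tables returned by tabula-py (illustrative variable name `tables`):
# cleaned = [merge_wrapped_rows(t.set_index(t.columns[0])) for t in tables]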
Related
I have two dataframes, df1 and df2, which both contain duplicate rows. I want to merge these dfs. What I tried so far is to remove the duplicates from one of the dataframes, df2, since I need all the rows from df1.
The question might be a duplicate, but I didn't find any solution/hints for this particular scenario.
data = {'Name':['ABC', 'DEF', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
'Age':[1,2,3,4,2,1,2,4]}
data2 = {'Name':['XYZ', 'NOP', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
'Sex':['M','F','M','M','M','M','F','M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
dfn = df1.merge(df2.drop_duplicates('Name'),on='Name')
print(dfn)
Result of above snippet:
Name Age Sex
0 ABC 1 M
1 ABC 3 M
2 ABC 4 M
3 MNO 4 M
4 XYZ 2 M
5 XYZ 1 M
6 PQR 2 F
This works perfectly well for the above data, but I have a large dataset and this method behaves differently there: I am getting many more rows than expected in dfn.
I suspect that with the larger data and more duplicates I am getting those extra rows, but I cannot afford to delete the duplicate rows from df1.
Apologies, as I am not able to share the actual data; it is too large!
Edit:
A sample result from the actual data (screenshots not included) shows df2 after removing duplicates and the resulting dfn; I have only one entry in df1 for both ABC and XYZ.
Thanks in advance!
Try to drop_duplicates from df1 too:
dfn = pd.merge(df1, df2.drop_duplicates('Name'),
               on='Name', how='left')
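For completeness, a minimal sketch on the sample data above showing both variants: deduplicating only the lookup frame df2, and also deduplicating df1 (which keep policy is right depends on which duplicate rows you actually need).
import pandas as pd

df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

# one Sex value per Name from df2, every row of df1 preserved -> exactly len(df1) rows
dfn = df1.merge(df2.drop_duplicates('Name'), on='Name', how='left')

# if df1 itself contains fully duplicated rows, drop them first as well
dfn2 = df1.drop_duplicates().merge(df2.drop_duplicates('Name'), on='Name', how='left')
print(len(dfn), len(dfn2))   # 8 8 here, because this sample df1 has no exact duplicate rows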
I am trying to split a dataframe column into multiple columns, as follows:
There are three columns overall. Two should remain unchanged in the new dataframe, while the third one is to be split into new columns.
The split is to be done on a specific character (say ":").
The column that requires the split can have a varying number of ":" separators, so the new columns can differ between rows, leaving some column values NULL for some rows. That is okay.
Each newly formed column has a specific name. The maximum number of columns that can be formed is known.
There are four dataframes. Each one has this same formatted column that has to be split.
I came across the following solutions, but they don't work for the reasons mentioned below:
Link
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
This creates columns with names as 0,1,2... I need the new columns to have specific names.
Link
df = df.apply(lambda x:pd.Series(x))
This makes no change to the dataframe; I couldn't understand why.
Link
df['command'], df['value'] = df[0].str.split().str
Here the column names are set properly, but this requires knowing beforehand how many columns will be formed, which in my case is dynamic for each dataframe. For short rows, the split successfully puts NULL values in the extra columns, but running the same code on another dataframe raises an error saying the number of keys should be the same.
I couldn't post comments on those answers as I am new to this community. I would appreciate it if someone could help me understand how to achieve my objective, which is: dynamically use the same code to split one column into many for different dataframes on multiple occasions, while renaming the newly generated columns to predefined names.
For example:
Dataframe 1:
Col1 Col2 Col3
0 A A:B:C A
1 A A:B:C:D:E A
2 A A:B A
Dataframe 2:
Col1 Col2 Col3
0 A A:B:C A
1 A A:B:C:D A
2 A A:B A
Output should be:
New dataframe 1:
Col1 ColA ColB ColC ColD ColE Col3
0 A A B C NaN NaN A
1 A A B C D E A
2 A A B NaN NaN NaN A
New dataframe 2:
Col1 ColA ColB ColC ColD ColE Col3
0 A A B C NaN NaN A
1 A A B C D NaN A
2 A A B NaN NaN NaN A
(If ColE is not there, then also it is fine.)
After this, I will be concatenating these dataframes into one, where I will need counts of all ColA to ColE for individual dataframes against Col1 and Col3 combinations. So, we need to keep this in mind.
You can do it this way:
columns = df.Col2.max().split(':')   # lexicographic max; here the longest value, 'A:B:C:D:E'
#['A', 'B', 'C', 'D', 'E']
new = df.Col2.str.split(":", expand=True)
new.columns = columns
new = new.add_prefix("Col")
df.join(new).drop(columns="Col2")
# Col1 Col3 ColA ColB ColC ColD ColE
#0 A A A B C None None
#1 A A A B C D E
#2 A A A B None None None
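To reuse the same logic across all four dataframes with predefined target names, a small helper along these lines may be cleaner. This is only a sketch; the function name split_named and the ColA..ColE defaults are illustrative assumptions.
import pandas as pd

def split_named(df, col="Col2", sep=":",
                names=("ColA", "ColB", "ColC", "ColD", "ColE")):
    """Split `col` on `sep` into up to len(names) columns with predefined names."""
    parts = df[col].str.split(sep, expand=True)
    parts.columns = list(names)[:parts.shape[1]]   # only as many names as the split produced
    return df.drop(columns=col).join(parts)

df1 = pd.DataFrame({"Col1": ["A", "A", "A"],
                    "Col2": ["A:B:C", "A:B:C:D:E", "A:B"],
                    "Col3": ["A", "A", "A"]})
new_df1 = split_named(df1)   # shorter rows get None in ColD/ColE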
I have multiple categorical columns like Marital Status, Education, Gender, and City, and I want to check all the unique values inside these columns at once, instead of writing this code for every column:
df['Education'].value_counts()
I can only give an example with a few features, but I need a solution for when there are many categorical features and it is not practical to write the code again and again to examine them.
Marital_Status Education City
Married UG LA
Single PHD CA
Single UG Ca
Expected output:
Marital_Status Education City
Married 1 UG 2 LA 1
Single 2 PHD 1 CA 2
Is there any kind of method to do this in Python?
Thanks
Yes, you can get what you're looking for with the following approach (you also don't have to worry about whether your df has more data than the 4 columns you specified):
Get (only) all your categorical columns from your df in a list:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
Then, run a loop over your categorical columns, calling .size() on each grouped object, and store each result (which is a DataFrame) in an empty list.
li = []
for col in cat_cols:
li.append(df.groupby([col]).size().reset_index(name=col+'_count'))
Lastly, concat the newly created dataframes within your list, into 1.
dat = pd.concat(li,axis=1)
All in 1 block:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = []
for col in cat_cols:
li.append(df.groupby([col]).size().reset_index(name=col+'_count'))
dat = pd.concat(li,axis=1)# use axis=1, so that the concatenation is column-wise
Marital Status Marital Status_count ... City City_count
0 Divorced 4.0 ... Athens 4
1 Married 3.0 ... Berlin 2
2 Single 3.0 ... London 2
3 Widowed 2.0 ... New York 2
4 NaN NaN ... Singapore 2
Using value_counts, you can do the following
res = (df
.apply(lambda x: x.value_counts()) # column by column value_counts would be applied
.stack()
.reset_index(level=0).sort_index(axis=0)
.rename(columns={'level_0': 'Value', 0: 'value_counts'}))
Another format of the output:
res['Id'] = res.groupby(level=0).cumcount()
res.set_index('Id', append=True)
Explanation:
Applying value_counts column by column gives a frame whose index is the union of all values across the columns, with NaN where a value does not occur in a given column.
Then, using stack, you can drop those NaN entries and get everything "stacked up", after which you can do the formatting/ordering of the output.
To know how many repeated unique values you have for each column, you can try drop_duplicates() method:
dataset.drop_duplicates()
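If all you need is how many distinct values each categorical column holds, rather than the full per-value counts, nunique() is a compact check. A minimal sketch on the question's sample columns:
import pandas as pd

df = pd.DataFrame({'Marital_Status': ['Married', 'Single', 'Single'],
                   'Education': ['UG', 'PHD', 'UG'],
                   'City': ['LA', 'CA', 'Ca']})

print(df.nunique())          # number of distinct values per column (NaN excluded by default)
print(df['City'].unique())   # or list the distinct values themselves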
I have a dataframe as below (the sample is not shown here).
For a particular row, I want to get the name of the column if that row contains a 1 in that column.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there are multiple 1s per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
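For illustration, a minimal sketch of the dot trick on a made-up 0/1 frame (the semicolon variant handles rows that match several columns):
import pandas as pd

df = pd.DataFrame({'foo': [0, 0], 'bar': [1, 0], 'spam': [1, 1]}, index=['x', 'y'])

print(df.dot(df.columns + ';').str.rstrip(';'))
# x    bar;spam
# y        spam
# dtype: object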
Firstly
Your question is very ambiguous and I recommend reading this link in #sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
(df == 1) # mask
.any(axis=0) # mask
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this mask to automate your solution below.
Next:
Automate getting an output of (<row index>, [<col name>, <col name>, ...]) wherever there is a 1 among the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and build a dict mapping each index label to its sub-df
df_dict = dict(
list(
df.groupby(df.index)
)
)
The next step is a for loop that iterates over each df in df_dict, checks it with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items(): # k: name of index, v: is a df
check = v.columns[(v == 1).any()]
if len(check) > 0:
print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
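As a shorter (though still row-wise) alternative sketch using the same df as above, a single apply along axis=1 gives the same mapping of rows to matching columns:
# list the matching column names per row, then keep only the rows with at least one hit
hits = df.apply(lambda row: row.index[row == 1].tolist(), axis=1)
print(hits[hits.str.len() > 0])
# b    [bar, spam]
# d         [spam]
# dtype: object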
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting the column name divides into two cases.
First case: if you want the result in a new column, the condition should pick a single column (such as min or max), because this only gives one column name per row.
import numpy as np
import pandas as pd

data = {'foo': [0, 0, 3, 0], 'bar': [0, 5, 0, 0], 'baz': [0, 0, 2, 0], 'spam': [0, 1, 0, 1]}
df = pd.DataFrame(data)
df = df.replace(0, np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you are looking for the min or the max:
max_col = df.idxmax(axis=1)
min_col = df.idxmin(axis=1)
out = df.assign(max=max_col, min=min_col)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
Second case: if your condition can be satisfied in multiple columns, for example you are looking for the columns that contain 1, you need a list, because multiple column names cannot fit into a single column of the same dataframe.
str_con = df.astype(str).apply(lambda x: x.str.contains('1.0', case=False, na=False)).any()
df.columns[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or, if you are looking for a numerical condition, for example columns containing a value greater than 1:
num_con = df.apply(lambda x:x>1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these cols have values higher than 1
Happy learning
I'm very new to Python and pandas; I only use them every once in a while when I'm trying to learn and automate an otherwise tedious Excel task. I've come upon a problem for which I haven't been able to find what I'm looking for through Google or here on Stack Overflow.
I currently have 6 different Excel (.xlsx) files that I am able to parse and read into dataframes. However, whenever I try to append them together, they're simply added on as new rows in the final output Excel file. Instead, I want matching values to land on the same row, in new columns rather than new rows, so that I can see whether or not each unique value shows up in the different data sets. A shortened example is as follows:
[df1]
0 Col1 Col2
1 XYZ 41235
2 OAIS 15123
3 ABC 48938
[df2]
0 Col1 Col2
1 KFJ 21493
2 XYZ 43782
3 SHIZ 31299
4 ABC 33347
[Expected Output]
0 Col1 [df1] [df2]
1 XYZ 41235 43782
2 OAIS 15123
3 ABC 48938 33347
4 KFJ 21493
5 SHIZ 31299
I've tried to use a merge; however, the actual data sheets are much more complicated, in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns, which I would want to append after the 23 columns from the [XYZ] values in [df1].
How should I go about that? There are approximately 200 rows in each Excel sheet, and essentially I only need to loop through until a matching unique identifier is found in [df2] for [df1], then in [df3] for [df1], and so on up to [df6], appending those columns onto a new dataframe which would eventually be output as a new Excel file.
df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")
is currently how I am reading the Excel files into dataframes. I'm sure I could loop it; however, I am unsure of the best practice for doing so instead of hard-coding each dataframe initialization.
You need merge with the parameter how = 'outer'
new_df = df1.merge(df2, on = 'Col1',how = 'outer', suffixes=('_df1', '_df2'))
You get
Col1 Col2_df1 Col2_df2
0 XYZ 41235.0 43782.0
1 OAIS 15123.0 NaN
2 ABC 48938.0 33347.0
3 KFJ NaN 21493.0
4 SHIZ NaN 31299.0
For iterative merging, consider storing the dataframes in a list and then running the chained merge with reduce(). Below, a list comprehension over the Excel files builds the list of dataframes, and enumerate() is used to rename Col2 successively to df1, df2, etc.
from functools import reduce
...
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df = reduce(lambda x,y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)
# Col1 df1 df2
# 0 XYZ 41235.0 43782.0
# 1 OAIS 15123.0 NaN
# 2 ABC 48938.0 33347.0
# 3 KFJ NaN 21493.0
# 4 SHIZ NaN 31299.0
Alternatively, use pd.concat and outer join the dataframes horizontally where you need to set Col1 as index:
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df2 = pd.concat(dfList, axis=1, join='outer', copy=False)\
.reset_index().rename(columns={'index':'Col1'})
# Col1 df1 df2
# 0 ABC 48938.0 33347.0
# 1 KFJ NaN 21493.0
# 2 OAIS 15123.0 NaN
# 3 SHIZ NaN 31299.0
# 4 XYZ 41235.0 43782.0
You can use the merge function.
pd.merge(df1, df2, on=['Col1'])
You can use multiple keys by adding them to the on list.
You can read more about the merge function here.
If you need only some of the columns, you can do:
df1.merge(df2[['Col1', 'Col2']], on='Col1')
EDIT:
If you are looping through several dfs, you can loop through all of them except the first and merge them in turn:
df_list = [df2, df3, df4]
for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on='Col1')
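Putting this together for the original six files, here is a rough sketch: the file names are taken from the question, Col1 is assumed to be the shared identifier, and the _df1.._df6 suffixes are just one way to keep each file's extra columns apart.
import pandas as pd
from functools import reduce

files = ["set1.xlsx", "set2.xlsx", "set3.xlsx", "set4.xlsx", "set5.xlsx", "set6.xlsx"]
# suffix every column with its source file number, but keep the join key's name intact
frames = [pd.read_excel(f).add_suffix(f"_df{i}").rename(columns={f"Col1_df{i}": "Col1"})
          for i, f in enumerate(files, 1)]

# outer-merge everything on the shared identifier, then write a single workbook
combined = reduce(lambda left, right: left.merge(right, on="Col1", how="outer"), frames)
combined.to_excel("combined.xlsx", index=False)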