I have created the following dataframe in Python using pandas:
import numpy as np
import pandas as pd
We create a list:
A=["THIS IS A NEW WORLD WE NEED A NEW PARADIGM: FOR THE NATION FOR THE PEOPLE",
"THIS IS A NEW WORLD ORDER;. WE NEED A NEW PARADIGM-: FOR THE NATION FOR THE PEOPLE%",
"THIS IS A NEW WORLD? WE NEED A NEW PARADIGM FOR THE NATION FOR THE PEOPLE PRESENT."]
Next we create a dataframe:
df1=pd.DataFrame()
df1["A"]=A
df1["B"]=["A1", "A2", "A3"]
The dataframe appears as follows:
A B
0 THIS IS A NEW WORLD WE NEED A NEW PARADIGM: FOR THE NATION FOR THE PEOPLE A1
1 THIS IS A NEW WORLD ORDER;. WE NEED A NEW PARADIGM-: FOR THE NATION FOR THE PEOPLE% A2
2 THIS IS A NEW WORLD? WE NEED A NEW PARADIGM FOR THE NATION FOR THE PEOPLE PRESENT. A3
In the above dataframe, column A contains strings in which the sentences are separated by two or more spaces.
How do I transform the dataframe to yield the following dataframe?
A B
0 THIS IS A NEW WORLD A1
1 WE NEED A NEW PARADIGM: A1
2 FOR THE NATION FOR THE PEOPLE A1
3 THIS IS A NEW WORLD ORDER;. A2
4 WE NEED A NEW PARADIGM-: A2
5 FOR THE NATION FOR THE PEOPLE% A2
6 THIS IS A NEW WORLD? A3
7 WE NEED A NEW PARADIGM A3
8 FOR THE NATION FOR THE PEOPLE PRESENT. A3
Could someone please take a look?
If you need to split by 2 or more spaces, add the regex \s{2,} to Series.str.split and then use DataFrame.explode:
df1['A'] = df1['A'].str.split(r'\s{2,}')
df = df1.explode('A')
print (df)
A B
0 THIS IS A NEW WORLD A1
0 WE NEED A NEW PARADIGM: FOR THE NATION FOR THE... A1
1 THIS IS A NEW WORLD ORDER;. A2
1 WE NEED A NEW PARADIGM-: FOR THE NATION FOR TH... A2
2 THIS IS A NEW WORLD? A3
2 WE NEED A NEW PARADIGM A3
2 FOR THE NATION FOR THE PEOPLE PRESENT. A3
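Note that explode keeps the original index (hence the repeated 0, 1 and 2 above). If you want the renumbered 0-8 index from the expected output, chain reset_index with drop=True:
df = df1.explode('A').reset_index(drop=True)
print (df)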
I have a pandas dataframe with a column "Code" (categorical) that has more than 100 unique values. I have multiple rows for the same "Name" and would like to capture all of the information pertaining to a unique "Name" in one row. Therefore, I'd like to transpose the column "Code" with the values from "Counter".
How do I transpose "Code" in such a way that the following table:
Name   Col1  Col2  Col3  Code  Counter
Alice  ...   ...   ...   a1    4
Alice  ...   ...   ...   a2    3
Bob    ...   ...   ...   b1    9
Bob    ...   ...   ...   c2    1
Bob    ...   ...   ...   a2    4
becomes this:
Name   Col1  Col2  Col3  a1  a2  b1  c2
Alice  ...   ...   ...   4   3   0   0
Bob    ...   ...   ...   0   4   9   1
I can't comment yet, but the answer from Yuca (below) should work for you: you can assign the pivot table to a variable and it will be your dataframe. To be sure, you can also wrap it in a DataFrame explicitly:
import pandas as pd
Pivoted = df.pivot(index='Name', columns='Code', values='Counter').fillna(0)
dataframe = pd.DataFrame(data=Pivoted)
Try:
df.pivot(index='Name', columns='Code', values='Counter').fillna(0)
Output:
Code a1 a2 b1 c2
Name
Alice 4.0 3.0 0.0 0.0
Bob 0.0 4.0 9.0 1.0
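Note that fillna leaves the counts as floats, because the NaNs created by the pivot force the columns to float. If you want the integer counts shown in the desired output, cast afterwards:
df.pivot(index='Name', columns='Code', values='Counter').fillna(0).astype(int)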
When reading an Excel spreadsheet into a Pandas DataFrame, Pandas appears to be handling merged cells in an odd fashion. For the most part, it interprets the merged cells as desired, apart from the first merged cell for each column, which is producing NaN values where it shouldn't.
dataframes = pd.read_excel(
"../data/data.xlsx",
sheet_name=[0,1,2], # read the first three sheets as separate DataFrames
header=[0,1], # rows [1,2] in Excel
index_col=[0,1,2], # cols [A,B,C] in Excel
)
I load three sheets, but the behaviour is identical for each, so from now on I will only discuss one of them.
> dataframes[0]
Header 1  H2       H3  Value 1
Overall   Overall
A1        B1       0   10
NaN       NaN      1   11
NaN       B2       0   12
NaN       B2       1   13
A2        B1       0   11
A2        B1       1   12
A2        B2       0   13
A2        B2       1   14
As you can see, A1 loads with NaNs yet A2 (and all beyond it, in the real data) loads fine. Both A1 and A2 are each actually a single merged cell spanning 4 rows in the Excel spreadsheet itself.
What could be causing this issue? It would normally be a simple fix via a fillna(method="ffill") but MultiIndex does not support that. I have so far not found another workaround.
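One possible workaround (a minimal sketch, assuming the NaNs really are missing values inside the index levels): forward-fill each index level as a Series and rebuild the MultiIndex, since fillna cannot be applied to the index directly:
df = dataframes[0]
# forward-fill each of the three index levels separately
filled = [pd.Series(df.index.get_level_values(i)).ffill()
          for i in range(df.index.nlevels)]
df.index = pd.MultiIndex.from_arrays(filled, names=df.index.names)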
This is for Python. I have an excel file named "translation.xlsx" with 3 sheets labeled with the names of people who have translated 8 lines of the same text from Russian into English. Here you can see the English translation in two of the sheets.
I would like to, using pandas if possible but another library is fine, take out row 1 from each sheet and put them together, so I would have
"Bob translation of row 1 , Fed translation of row 1, Raj row 1" together
then
"Bob translation of row 2 , Fed translation of row 2, Raj row 2" together
e.g.
row 1: French man sued Uber for breaking up his marriage (Fed) / French person sues Uber for ruining his marriage (Bob) / Frenchman sues Uber for ruining his marriage (Raj)
The output format is NOT important. It can be lists, a dataframe, Excel, a dictionary, etc. As long as I can see each person's translation of each line next to the others. Labels with the people's names and row numbers are also not important: if it is possible to include them, OK, but if not, no problem.
There's no code here because I couldn't come close, despite trying for a long time.
Use read_excel with sheet_name=None to read all sheets into a dictionary of DataFrames:
dfs = pd.read_excel('a.xlsx', sheet_name=None, header=None)
print (dfs)
OrderedDict([('Bob', 0
0 a
1 b
2 c), ('Fed', 0
0 a1
1 b1
2 c1), ('Raj', 0
0 a1
1 b2
2 c2)])
Then join together by concat:
df = pd.concat(dfs, axis=1)
print (df)
Bob Fed Raj
0 0 0
0 a a1 a1
1 b b1 b2
2 c c1 c2
And last, join the rows together with join and convert the result to a one-column DataFrame:
df1 = df.apply(' / '.join, axis=1).to_frame('out')
print (df1)
out
0 a / a1 / a1
1 b / b1 / b2
2 c / c1 / c2
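If you do want each translator's name next to their line, as in the example in the question, one possible variant is this untested sketch, which relies on the sheet name being the first level of the concatenated columns:
df1 = df.apply(lambda row: ' / '.join(f'{val} ({name})'
                                      for (name, _), val in row.items()), axis=1).to_frame('out')
print (df1)
This gives rows like a (Bob) / a1 (Fed) / a1 (Raj).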
df_Bob = pd.read_excel('translation.xlsx', sheet_name='Bob', header=None)
df_Fed = pd.read_excel('translation.xlsx', sheet_name='Fed', header=None)
df_Raj = pd.read_excel('translation.xlsx', sheet_name='Raj', header=None)
df_concat = pd.concat([df_Bob, df_Fed, df_Raj], axis=1)
df = df_concat.apply(' / '.join, axis=1).to_frame('ColumnName')
print(df)
I have this Pandas Dataframe:
A B
0 xyz Lena
1 NaN J.Brooke
2 NaN B.Izzie
3 NaN B.Rhodes
4 NaN J.Keith
.....
I want to check the values of column B such that, if a row's value begins with B, then new should be written in the adjacent cell of column A; similarly, old if it begins with J. Below is what I'm expecting:
A B
0 xyz Lena
1 old J.Brooke
2 new B.Izzie
3 new B.Rhodes
4 old J.Keith
.....
I'm unable to understand how I can do this. To begin with, I can use startswith(), but then how do I test one row's value and write the required value in the adjacent cell of the other column?
This is a small case; I'm trying a lot of messier things... Pandas is indeed powerful!
Use numpy.select with Series.str.startswith if you need to set new values by conditions:
m1 = df['B'].str.startswith('B')
m2 = df['B'].str.startswith('J')
If you also need to test for missing values, chain the conditions with Series.isna:
m1 = df['B'].str.startswith('B') & df['A'].isna()
m2 = df['B'].str.startswith('J') & df['A'].isna()
df['A'] = np.select([m1, m2], ['new','old'], df['A'])
print (df)
A B
0 xyz Lena
1 old J.Brooke
2 new B.Izzie
3 new B.Rhodes
4 old J.Keith
Or use DataFrame.loc:
df.loc[m1, 'A'] = 'new'
df.loc[m2, 'A'] = 'old'
Try using loc.
I added an .isnull() check so that anything already present in colA is not replaced; if you don't want that, you can drop the check.
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'colA':["xyz",np.nan,np.nan,np.nan,np.nan],
"colB":['Lena','J.Brooke','B.Izzie','B.Rhodes','J.Keith']})
df.loc[df['colA'].isnull() & df['colB'].str.startswith("B"), "colA"] = "new"
df.loc[df['colA'].isnull() & df['colB'].str.startswith("J"), "colA"] = "old"
print(df)
colA colB
0 xyz Lena
1 old J.Brooke
2 new B.Izzie
3 new B.Rhodes
4 old J.Keith
Using pd.Series.fillna:
df['A'] = df['A'].fillna(df['B'].str[0].replace({'J': 'old', 'B': 'new'}))
Output:
A B
0 xyz Lena
1 old J.Brooke
2 new B.Izzie
3 new B.Rhodes
4 old J.Keith
I have a DataFrame with several columns; for simplification, this is a reduced version:
ID geo value
a1 FR 3
a1 ES 7
a1 DE 6
a2 FR 3
a2 ES 5
a2 DE 10
I want to modify some of the values, based on certain conditions; my file is huge.
Ideally I would do:
df[(df.ID=='1') & (df.geo=='DE')]['value']=9999
But this doesn't work, I guess because I am obtaining a copy of my original dataframe instead of the dataframe itself.
Is there any simple way to update values based on complex conditions?
Try this:
condition = (df.ID=='a1') & (df.geo=='DE')
df.loc[condition, 'value'] = 9999
Note: df.ix is deprecated and has been removed in modern pandas; df.loc performs the same label-based selection.
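An alternative is numpy.where, which rebuilds the column from the condition (a small sketch, assuming numpy is imported as np):
import numpy as np
df['value'] = np.where((df.ID == 'a1') & (df.geo == 'DE'), 9999, df['value'])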