Merge two rows pandas dataframe - python

I have this data, and I need to merge the two selected columns into the other row, because the duplicated rows come from my code.
So, how could I do this?

Here is a way to do what your question asks:
df[['State_new', 'Solution_new']] = df[['Power State', 'Recommended Solution']].shift()
mask = ~df['State_new'].isna()
df.loc[mask, 'State'] = df.loc[mask, 'State_new']
df.loc[mask, 'Recommended Solutuin'] = df.loc[mask, 'Solution_new']
df = df.drop(columns=['State_new', 'Solution_new', 'Power State', 'Recommended Solution'])[~df['State'].isna()].reset_index(drop=True)
Explanation:
create versions of the important data from your code shifted down by one row
create a boolean mask indicating which of these shifted rows are not empty
use this mask to overwrite the content of the State and Recommended Solutuin columns (NOTE: using original column labels verbatim from OP's question) with the updated data from your code contained in the shifted columns
drop the columns used to perform the update as they are no longer needed
use reset_index to create a new integer range index without gaps.
In case it's helpful, here is sample code to pull the dataframe in from Excel:
import pandas as pd
df = pd.read_excel('TestBook.xlsx', sheet_name='TestSheet', usecols='AD:AM')
Here's the input dataframe:
MAC RLC RLC 2 PDCCH Down PDCCH Uplink Unnamed: 34 Recommended Solutuin State Power State Recommended Solution
0 122.9822 7119.503 125.7017 1186.507 784.9464 NaN Downtitlt antenna serving cell is overshooting NaN NaN
1 4.1000 7119.503 24.0000 11.000 51.0000 NaN Downtitlt antenna serving cell is overshooting NaN NaN
2 121.8900 2127.740 101.3300 1621.000 822.0000 NaN uptilt antenna bad coverage NaN NaN
3 86.5800 2085.250 94.6400 1650.000 880.0000 NaN uptilt antenna bad coverage NaN NaN
4 64.7500 1873.540 63.8600 1259.000 841.0000 NaN uptilt antenna bad coverage NaN NaN
5 84.8700 1735.070 60.3800 1423.000 474.0000 NaN uptilt antenna bad coverage NaN NaN
6 49.3400 1276.190 59.9600 1372.000 450.0000 NaN uptilt antenna bad coverage NaN NaN
7 135.0200 2359.840 164.1300 1224.000 704.0000 NaN NaN NaN Bad Power Check hardware etc.
8 135.0200 2359.840 164.1300 1224.000 704.0000 NaN uptilt antenna bad coverage NaN NaN
9 163.7200 1893.940 90.0300 1244.000 753.0000 NaN NaN NaN Bad Power Check hardware etc.
10 163.7200 1893.940 90.0300 1244.000 753.0000 NaN uptilt antenna bad coverage NaN NaN
11 129.6400 1163.140 154.3200 663.000 798.0000 NaN NaN NaN Bad Power Check hardware etc.
12 129.6400 1163.140 154.3200 663.000 798.0000 NaN uptilt antenna bad coverage NaN NaN
Here is the sample output:
MAC RLC RLC 2 PDCCH Down PDCCH Uplink Unnamed: 34 Recommended Solutuin State
0 122.9822 7119.503 125.7017 1186.507 784.9464 NaN Downtitlt antenna serving cell is overshooting
1 4.1000 7119.503 24.0000 11.000 51.0000 NaN Downtitlt antenna serving cell is overshooting
2 121.8900 2127.740 101.3300 1621.000 822.0000 NaN uptilt antenna bad coverage
3 86.5800 2085.250 94.6400 1650.000 880.0000 NaN uptilt antenna bad coverage
4 64.7500 1873.540 63.8600 1259.000 841.0000 NaN uptilt antenna bad coverage
5 84.8700 1735.070 60.3800 1423.000 474.0000 NaN uptilt antenna bad coverage
6 49.3400 1276.190 59.9600 1372.000 450.0000 NaN uptilt antenna bad coverage
7 135.0200 2359.840 164.1300 1224.000 704.0000 NaN Check hardware etc. Bad Power
8 163.7200 1893.940 90.0300 1244.000 753.0000 NaN Check hardware etc. Bad Power
9 129.6400 1163.140 154.3200 663.000 798.0000 NaN Check hardware etc. Bad Power

You can use groupby to combine the rows on their shared columns:
df = pd.DataFrame(data)
new_df = df.groupby(['MAC', 'RLC1', 'RLC2', 'POCCH', 'POCCH Up']).sum()
new_df = new_df.reset_index()
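For reference, here is a minimal, self-contained sketch of that groupby idea on made-up data (the column names are illustrative, not the OP's exact labels): taking the first non-null value per group collapses each duplicated pair into one row.
import numpy as np
import pandas as pd

# Hypothetical frame: one measurement duplicated across two rows, each carrying a different text column
df = pd.DataFrame({
    'MAC': [135.02, 135.02, 163.72, 163.72],
    'RLC': [2359.84, 2359.84, 1893.94, 1893.94],
    'State': [np.nan, 'bad coverage', np.nan, 'bad coverage'],
    'Power State': ['Bad Power', np.nan, 'Bad Power', np.nan],
})

# first() skips NaN, so each group keeps the non-null text from either row
merged = df.groupby(['MAC', 'RLC'], as_index=False).first()
print(merged)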

You can do something like:
import numpy as np

fill_cols = ['Power State', 'Recommended Solution 2']
dup_cols = ['MAC_UL', 'RLC_Through_1', 'RLC_Through_2', 'PDCCH Down', 'PDCCH Up']
# Mark every row that is part of a duplicated group
m = df.duplicated(subset=dup_cols, keep=False)
# Replace empty strings with NaN so that ffill can fill them
df_fill = df.loc[m, fill_cols]
df_fill[df_fill['Power State'] == ''] = np.nan
df_fill[df_fill['Recommended Solution 2'] == ''] = np.nan
# Forward-fill within the duplicated rows and write the result back
df.loc[m, fill_cols] = df_fill.ffill()
Get duplicated rows using duplicated
Fill empty values with NaN
Then use ffill
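For context, here is a tiny sketch (made-up values) of what duplicated(keep=False) returns; keep=False is what flags every member of a duplicated group, not just the later occurrences, so both rows of each pair take part in the fill.
import pandas as pd

toy = pd.DataFrame({'a': [1, 1, 2, 3, 3], 'b': ['x', 'x', 'y', 'z', 'z']})
# keep=False marks every occurrence of a duplicated row, not only the repeats
print(toy.duplicated(subset=['a', 'b'], keep=False))
# -> True, True, False, True, True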

Compare and find duplicated values (not entire columns) across a data frame with Python

I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage is repeated across rows (which is OK), but is duplicated in some, but not all, columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries for two columns side by side and delete the second if there is a duplicate?
The duration for this is a maximum of two days, so three days in a row is a new event with the same name starting on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a python or numpy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output:
[image: starting data frame example]
[image: de-duplicated data frame example]
This is a hack and similar to #Params' answer, but would be faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame and repeat an operation as many times as you need to compare all of the columns, then transpose it back to get the result shown in the OP.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Monday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', np.nan, np.nan, np.nan, 'exp_D', 'exp_D'],
    'Tuesday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', 'exp_B', 'exp_C', 'exp_C', 'exp_D', 'exp_D'],
    'Wednesday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, 'exp_B', 'exp_C', 'exp_C', 'exp_C', np.nan],
    'Thursday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'exp_C', np.nan]
})

# Work column-wise by transposing, then blank out repeats of the value one row up
df = df.T
for i in range(int(np.ceil(df.shape[0] / 2))):
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN

How to get Dataframe with Table ID in Pandas?

I want to extract a dataframe from HTML using a URL.
The page contains 59 tables/dataframes.
I want to extract 1 particular table, which can be identified by its ID "ctl00_Menu1".
Following is my trial, which is giving an error.
import pandas as pd
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S12",attrs = {'id': 'ctl00_Menu1'})
As I am at a very early stage with Python, there may be a simple solution, but I am unable to find it. I'd appreciate any help.
I would look at how the URL passes params and probably try to read a dataframe directly from it. I'm unsure if you are trying to develop a function, a script, or are just practising.
If you do the following (notice the 58 at the end of the URL):
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258", attrs={'id': 'ctl00_Menu1'})
it works and gives you table 59.
[ 0 1 2 \
0 Partywise Partywise NaN
1 Partywise NaN NaN
2 Constituencywise-All Candidates NaN NaN
3 Constituencywise Trends NaN NaN
3 4 5 \
0 Constituencywise-All Candidates Constituencywise-All Candidates NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
6 7
0 Constituencywise Trends Constituencywise Trends
1 NaN NaN
2 NaN NaN
3 NaN NaN ]
Unsure if that's the table you want to extract, but most of the time it's easier to pass it as a URL parameter. If you try it without the 58 it works too; I believe the 'ElectionResult' argument might not be a table identifier, hence why you can't find any tables with that name.
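As a side note, pd.read_html always returns a list of DataFrames even when attrs narrows the match to a single table, so you still need to take an element out of that list; a minimal sketch (the URL and id below are just placeholders):
import pandas as pd

# read_html returns a list of all matching tables, even if there is only one
tables = pd.read_html("http://example.com/results.htm", attrs={'id': 'ctl00_Menu1'})
print(len(tables))   # number of tables matched
df = tables[0]       # the DataFrame itself
print(df.head())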

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas DataFrame containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values that take a certain time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array with indicator values is always shorter in length than the DataFrame, because the prices of the first 5 days are used to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-Lib for example, this tulip library does not provide NaN values for these first few empty days...
Is there an easy way to prepend these NaNs to the ndarray?
Or to insert into the df at a certain index and have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
# Pre-fill the column with NaN, then write the indicator values after the warm-up rows
df['mdi_14'] = np.nan
df.loc[df.index[period - 1:], 'mdi_14'] = mdi14
I hope they will fill the first values with NaN in the lib in the future. It's dangerous to leave time series data like this without any label.
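Another option, to answer the "prepend these NaN to the ndarray" part directly, is to pad the front of the indicator output before assigning it; a small sketch, assuming the output is exactly period - 1 elements shorter than the DataFrame, as it is for ti.di:
import numpy as np

period = 14
# Pad the front of the short indicator array with NaN so its length matches the DataFrame
df['mdi_14'] = np.concatenate([np.full(period - 1, np.nan), mdi14])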
Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)   # the "short" indicator data
b = np.full((6, df.shape[1]), np.nan)        # NaN padding for the missing leading rows
c = np.vstack([b, a])                        # prepend the NaN block
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the Python bindings library, tulipy, includes this functionality. Instead, you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
import numpy as np
import pandas as pd
import tulipy as ti

# Create the dataframe with close prices
prices = pd.DataFrame(data=[81.06, 81.59, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
# Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real=prices['close'].to_numpy(), period=5, stddev=2)))
# Dynamically realign the index; note from the tulip library documentation that the
# price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
# Put the indicator values into the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129

Python pandas - value_counts not working properly

Based on this post on Stack Overflow, I tried the value_counts function like this:
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
and it works fine, apart from the fact that although my data has 22 unique genres, after the split I get 42 values, which of course are not unique.
Data example:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
(I have pasted the header and the first row only.)
I have a feeling that the problem is caused by my original data. My column (genres) was a list of lists which contained brackets,
for example: [Action,Indie]
so when Python read it, it treated '[Action', 'Action' and 'Action]' as different values, and the output was 303 different values.
So what I did is this:
new = []
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        i = i[1:-1]  # strip the leading '[' and trailing ']'
        new.append(i)
    else:
        new.append('nan')
You have to remove the first and last [] from the column genres with the function str.strip, and then replace spaces with an empty string with the function str.replace:
import pandas as pd

df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")
df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')
df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))
# temporarily display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
    print df
# output removed for clarity
print df.columns
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
u'initial_price', u'is_free', u'metacritic', u'release_date',
u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
u'WebPublishing'],
dtype='object')
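As an aside, if the goal is simply one indicator column per genre, str.get_dummies gives a similar result in one step; a small sketch on made-up data, assuming the genres column is a bracketed, comma-separated string as in the question:
import pandas as pd

df = pd.DataFrame({'genres': ['[Action, Indie]', '[Casual, Indie]', '[Action]']})
# Strip the brackets and spaces, then build one 0/1 column per genre
cleaned = df['genres'].str.strip('[]').str.replace(' ', '', regex=False)
df = df.join(cleaned.str.get_dummies(sep=','))
print(df)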

Replacing/Stripping certain text from data in pandas?

I've got an issue with Pandas not replacing certain bits of text correctly...
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").replace(" N/A", "Non")
But when I print, it hasn't replaced any of them, as seen below by running print csvdata[-50:].head(50):
Pole KI DE Score STAT CTemp
4429 NaN NaN NaN 42 NaN Data N/A
4430 NaN NaN NaN 23.43 NaN Data (AMI)
4431 NaN NaN NaN 7.05 NaN Data (AMI)
4432 NaN NaN NaN 9.78 NaN Data
4433 NaN NaN NaN 169.68 NaN Data (AMI)
4434 NaN NaN NaN 26.29 NaN Data N/A
4435 NaN NaN NaN 83.11 NaN Data N/A
NOTE: The CSV is rather big so I have to use pandas.set_option('display.max_columns', 250) to be able to print the above.
Anyone know how I can make it replace those parts correctly in pandas?
EDIT: I've tried .str.replace("", "") and tried just .replace("", "").
Example CSV:
No,CDPure,Blank
1,Data Test,
2,Test N/A,
3,Data N/A,
4,Test Data,
5,Bla,
5,Stack,
6,Over (AMI),
7,Flow (AMI),
8,Test (AMI),
9,Data,
10,Ryflex (AMI),
Example Code:
# Import pandas
import pandas
# Open csv (I have to keep it all as dtype object otherwise I can't do the rest of my script)
csvdata = pandas.read_csv('test.csv', dtype=object)
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").str.replace(" N/A", " Non")
# Print
print csvdata.head(11)
Output:
No CDPure Blank CTemp
0 1 Data Test NaN Data Test
1 2 Test N/A NaN Test Non
2 3 Data N/A NaN Data Non
3 4 Test Data NaN Test Data
4 5 Bla NaN Bla
5 5 Stack NaN Stack
6 6 Over (AMI) NaN Over (AMI)
7 7 Flow (AMI) NaN Flow (AMI)
8 8 Test (AMI) NaN Test (AMI)
9 9 Data NaN Data
10 10 Ryflex (AMI) NaN Ryflex (AMI)
str.replace interprets its argument as a regular expression, so you need to escape the parentheses using dcol.str.replace(r" \(AMI\)", "").str.replace(" N/A", "Non").
This does not appear to be adequately documented; the docs mention that split and replace "take regular expressions, too", but don't make it clear that they always interpret their argument as a regular expression.
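For illustration, here is a small sketch of both ways to treat the pattern literally, either escaping the parentheses or (in newer pandas versions) passing regex=False; the sample values are made up:
import pandas as pd

s = pd.Series(["Over (AMI)", "Data N/A", "Test Data"])
# Option 1: escape the regex metacharacters
print(s.str.replace(r" \(AMI\)", "", regex=True).str.replace(" N/A", " Non", regex=True))
# Option 2: newer pandas accepts regex=False for a literal (non-regex) replacement
print(s.str.replace(" (AMI)", "", regex=False).str.replace(" N/A", " Non", regex=False))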
