Column is not appended to pandas DataFrame - python

I want to add subtotals to my dataframe: group by some index level, then append the resulting dataframe to the main one. For some unknown reason temp["Узел"] = "Итого" does nothing: it doesn't add a new column, although temp["Узел2"] = "Итого" adds one. I think it's because my 'pvt' dataframe already has a "Узел" index level, but what impact does that have on the new 'temp' dataframe?
temp = pvt.groupby(level=["Принадл"]).sum()
temp["Узел"] = "Итого"
print(temp)
print(temp["Узел"])
Россия \
ОКТ КЛГ МСК ГОР СЕВ СКВ ЮВС ПРВ КБШ СВР
Принадл
ИП 783 14 172 398 248 1178 460 235 314 644
ПС 93900 5049 89815 36197 85619 55213 91681 26764 33869 154280
... \
СНГ и др. Итого
Принадл
ИП 46 9342
ПС 51529 1299784
[2 rows x 21 columns]
Empty DataFrame
Columns: []
Index: [ИП, ПС]
pandas 0.16.1, numpy 1.9.2
UPD: it's because of the manually added MultiIndex level "Узел", or the multi-level columns... or both. I'm not sure yet.
UPD2: I was able to avoid the problem by temporarily switching to one-level column names before adding the new columns:
columns, temp.columns = temp.columns, [None] * len(temp.columns)  # stash the multi-level columns
temp[...] = ...
<...>
temp.columns = columns  # restore the original columns
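Another route that usually avoids the column swap, a small sketch (works on recent pandas versions; I have not checked it on 0.16): with multi-level columns, address the new column by a full tuple key, one entry per column level:
# assumption: temp has two column levels; "" just fills the second level
temp[("Узел", "")] = "Итого"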

Related

pandas.core.frame.DataFrame rename index problems

In an existing table I got a summary with
df.groupby('bin_fare')['fare'].agg(['count', 'sum', 'mean'])
The result is the table below; bin_fare is the name of the index:
bin_fare count sum mean
1 491 3717.1413 7.570553
2 474 9000.3078 18.987991
3 259 14565.0003 14565.0003
4 84 16268.0375 16268.0375
I tried to rename the index by adding this code:
fare_rate_names = ['cheapest','avarage','above average','expensive']
df.groupby('bin_fare')['fare'].agg(['count','sum','mean']).rename(index=pd.Series(data=fare_rate_names))
But it renames only the first 3 rows!
bin_fare count sum mean
avarage 491 3717.1413 7.570553
above average 474 9000.3078 18.987991
expensive 259 14565.0003 14565.0003
4 84 16268.0375 16268.0375
How can I fix it without adding an element at the beginning of fare_rate_names?
That happens because rename treats the Series as a mapping from its index to its values: pd.Series(fare_rate_names) is indexed 0-3, so bin_fare labels 1-3 are found in it and 4 is not. You can just set a new index instead:
df.index = pd.Series(fare_rate_names)
Or, the more pythonic ("pandastic"?):
df.set_index(pd.Series(fare_rate_names), inplace=True)
Also, you could create a dummy name for the 0th index:
fare_rate_names = ['foo', 'cheapest','avarage','above average','expensive']
df.groupby('bin_fare')['fare'].agg(['count','sum','mean']).rename(index=pd.Series(data=fare_rate_names))
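Alternatively, a small sketch along the same lines: build an explicit mapping from the actual bin_fare labels to the new names, so nothing depends on the Series' positional index:
mapping = dict(zip([1, 2, 3, 4], fare_rate_names))  # map labels 1-4, not positions 0-3
df.groupby('bin_fare')['fare'].agg(['count', 'sum', 'mean']).rename(index=mapping)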

How can I fetch only rows from left dataset but also a certain column from the right dataset, during inner join

I am trying to implement a logic where I have two dataframes, say A (left) and B (right).
I need to find the matching rows of A in B (I understand this can be done via an "inner" join). But my use case says I only need all rows from dataframe A plus one column from the matched record in B, the ID column (I understand this can be done via select). Here the problem arises: after the inner join, the returned dataframe has columns from dataframe B as well, which I don't need; but if I use leftsemi, I can't fetch the ID column from B at all.
For example:
def fetch_match_condition():
    return ["EMAIL", "PHONE1"]
match_condition = fetch_match_condition()
from pyspark.sql.functions import sha2, concat_ws
schema_col = ["FNAME","LNAME","EMAIL","PHONE1","ADD1"]
schema_col_master = schema_col.copy()
schema_col_master.append("ID")
data_member = [["ALAN","AARDVARK","lbrennan.kei#malchikzer.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","8653827956","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
master_df = spark.createDataFrame(data_member,schema_col)
master_df = master_df.withColumn("ID", sha2(concat_ws("||", *master_df.columns), 256))
test_member = [["ALAN","AARDVARK","lbren.kei#malchik.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","99997492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","87463829","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
test_member_1 = [["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"]]
test_df = spark.createDataFrame(test_member,schema_col)
test_df_1 = spark.createDataFrame(test_member_1,schema_col)
matched_df = test_df.join(master_df, match_condition, "inner").select(test_df["*"], master_df["ID"])
master_df = master_df.union(matched_df.select(schema_col_master))
# Here only the second-to-last record will match and get added back to the master along with the same ID it had in master_df
matched_df = test_df_1.join(master_df, match_condition, "inner").select(test_df_1["*"], master_df["ID"])
# Here the problem arises: I need only 2 matches, since test_df_1 has only two records, but I get 3 because the record appended to master_df above is matched as well.
PS: is there a way I can do a leftsemi join and then use withColumn with a UDF that fetches the ID from dataframe B based on the row values of the leftsemi result?
Can anyone propose a solution for this?
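One way to get all the left-hand rows plus only the ID column, a minimal sketch (untested against the data above): select just the join keys and ID from master_df before joining, and de-duplicate on the keys so repeated master records cannot multiply the matches:
# keep only the key columns and ID from the right side
master_keys = master_df.select(*match_condition, "ID").dropDuplicates(match_condition)
# result: every column of test_df_1 plus ID, at most one match per key
matched_df = test_df_1.join(master_keys, match_condition, "inner")
Note that dropDuplicates picks an arbitrary ID if two master records share the same keys but differ in ID.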

Splitting HTML tag data to multiple columns

I am stuck on an issue: my data is in the format given in the image below.
[image: Splitting data to multiple column]
Is there any way in a Python dataframe to segregate this data into multiple columns? For example:
[image: Data in required format]
Can anyone help me out?
I have tried to split it, but it does not work:
df3 = df.technische_daten.str.split(r'\s+(?=</[a-z]+>)', expand=True)
df3[0] = df3[0].str.replace('<li>', '', regex=False)
df3[1] = df3[1].str.replace('</li> ', '', regex=False)
Data Snippet:
<ul><li>Höhe: 248 mm</li><li>Länge: 297 mm</li><li>Breite: 246 mm</li><li>Gewicht: 4,0 kg</li><li>Leerlaufdrehzahl: 5500 U/min</li><li>Sägeblattdurchmesser: 190 mm</li><li>Leistungsaufnahme: 1400 Watt</li><li>Standard: 821552-6,B-02939,195837-9,164095-8</li><li>Bohrung: 30 mm</li><li>Schnittleistung 45°: 48,5 mm</li><li>Vibration Sägen Holz: 2,5 m/s²</li><li>Schnittleistung 0°: 67 mm</li><li>Sägeblatt-Ø / Bohrung: 190/30 mm</li><li>Max. Schnitttiefe 90°: 67 mm</li><li>Schnittleistung 0°/45°: 67/48,5 mm</li></ul>
Helpfully, pandas has a built-in function, DataFrame.to_html, which will make an HTML table for you.
import pandas as pd
df_rows = []
# put the below in a for loop to process all of your rows:
# for row in all_your_data:
# (a single example row is used here instead of the loop)
row = "<ul><li>Höhe: 248 mm</li><li>Länge: 297 mm</li><li>Breite: 246 mm</li><li>Gewicht: 4,0 kg</li><li>Leerlaufdrehzahl: 5500 U/min</li><li>Sägeblattdurchmesser: 190 mm</li><li>Leistungsaufnahme: 1400 Watt</li><li>Standard: 821552-6,B-02939,195837-9,164095-8</li><li>Bohrung: 30 mm</li><li>Schnittleistung 45°: 48,5 mm</li><li>Vibration Sägen Holz: 2,5 m/s²</li><li>Schnittleistung 0°: 67 mm</li><li>Sägeblatt-Ø / Bohrung: 190/30 mm</li><li>Max. Schnitttiefe 90°: 67 mm</li><li>Schnittleistung 0°/45°: 67/48,5 mm</li></ul>"
values = row.split("</li><li>")
# clean the data: strip the opening and closing tags
values[0] = values[0].replace("<ul><li>", "")
values[-1] = values[-1].replace("</li></ul>", "")
dict_of_values = {}
for value in values:
    key, val = value.split(": ", 1)  # maxsplit=1 in case a value contains ": "
    dict_of_values[key] = val
df_rows.append(dict_of_values)
# outside of the for loop
df = pd.DataFrame.from_dict(df_rows, orient='columns')
# use df.drop to remove any columns you do not need
df = df.drop(['Leerlaufdrehzahl', 'Sägeblattdurchmesser'], axis=1)
your_html = df.to_html()
Hopefully this helps.
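If the strings already live in a dataframe column (the question's code uses df.technische_daten), here is a vectorized sketch of the same idea, assuming every entry follows the "key: value" pattern and the dataframe has a default, unnamed index:
# pull out every "<li>key: value</li>" pair, then pivot the keys into columns
pairs = df["technische_daten"].str.extractall(r"<li>([^:<]+): ([^<]+)</li>")
pairs.columns = ["key", "value"]
wide = pairs.reset_index().pivot(index="level_0", columns="key", values="value")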

Identify all instances in dataframe based on the specific column value using loop

I have the following dataframe:
teamId matchId matchPeriod eventSec eventId eventName
190 8516 5237840 1H 721.2 5 Interruption
191 8516 5237840 1H 723.4 3 Free Kick
192 8516 5237840 1H 725.7 8 Pass
193 8516 5237840 1H 727.2 8 Pass
194 8516 5237840 1H 728.5 10 Shot
This goes on for around 1000 rows.
I would like to identify all the instances of 'Shot', then slice out that row AND the previous 4 rows to create a sequence I can work with.
Can anyone help please?
Try this code:
dta # your dataframe
index = dta[dta['eventName'] == 'Shot'].index
result = []
for i in range(5):
    result = result + list(index - i)
result = set(result)
sub = dta[dta.index.isin(result)]
First it selects the index of the rows whose 'eventName' column is 'Shot'. Then we iteratively collect those indices, shifted back by 0-4, into a set, which covers each Shot row and the 4 rows before it.
In the end, we select the rows whose index we have collected.
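A more direct way to get each 5-row sequence separately, a minimal sketch using positional indexing (it assumes a default RangeIndex):
import numpy as np
shot_pos = np.flatnonzero(dta['eventName'].eq('Shot'))
# each element is one Shot row plus up to four rows before it
sequences = [dta.iloc[max(p - 4, 0):p + 1] for p in shot_pos]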
Seems like you want to slice out the rows leading up to each place "Shot" appears. You can find the index values where "Shot" appears and then slice the DataFrame based on those index values.
Add Data to dataframe:
import pandas as pd
from tabulate import tabulate
data = {
    "teamid": [190,191,192,108,190,190,191,192,108,190,190,191,192,108,190,190,191,192,108,190],
    "eventId": [5,2,4,5,6,5,2,4,5,6,5,2,4,5,6,5,2,4,5,6],
    "eventname": ['hello','Free Kick','Pass','Pass','Shot','Interruption','Free Kick','Pass','Pass','Shot','Interruption','Free Kick','Pass','Pass','Shot','Interruption','Free Kick','Pass','Pass','Shot']
}
df = pd.DataFrame(data=data)
print(tabulate(df, headers = 'keys', tablefmt = 'psql'))
Then slice the data and perform your task.
# Search for index values where "Shot" appears
index_values = df[df['eventname'] == 'Shot'].index
# Add -1 at position 0 of the index_values list
index_values = index_values.insert(0, -1)
# Slice the data: each slice runs from just after the previous Shot through
# the current Shot row (the +1 on the stop keeps the Shot row itself)
for i in range(0, len(index_values) - 1):
    # perform your task here
    print(tabulate(df[index_values[i] + 1:index_values[i + 1] + 1], headers='keys', tablefmt='psql'))

Apply operation on columns of CSV file excluding headers and update results in last row

I have a CSV file created like this:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
Now I want a fourth data row to be appended to the existing CSV file as follows:
First column: Remains same: 1213
Second column: Get max value: 898
Third column: Get min value: 009
Fourth column: Get avg value: 422.6
So the final CSV file should be:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.6
Please help me achieve this. It's not mandatory to use pandas.
Thanks in advance!
df.agg(...) accepts a dict where the keys are column names and the values are strings naming the aggregation you want for each column (this assumes you have already read the file with something like df = pd.read_csv('file.csv', sep=';')):
df_agg = df.agg({'keep_same': 'mode', 'get_max': 'max',
                 'get_min': 'min', 'get_avg': 'mean'})[df.columns]
Produces:
keep_same get_max get_min get_avg
0 1213 898 9 422.666667
Then you just append df_agg to df:
df = df.append(df_agg, ignore_index=False)
Result:
keep_same get_max get_min get_avg
0 1213 176 901 517.000000
1 1213 198 9 219.000000
2 1213 898 201 532.000000
0 1213 898 9 422.666667
Notice that the index of the appended row is 0. You can pass ignore_index=True to append if you desire.
Also note that if you plan to do this append operation a lot, it will be very slow. Other approaches do exist in that case but for once-off or just a few times, append is OK.
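For instance, a small sketch of the batched alternative: collect the summary rows first and concatenate once at the end:
# hypothetical: agg_rows holds df_agg plus any other summary frames you build
agg_rows = [df_agg]
df = pd.concat([df] + agg_rows, ignore_index=True)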
Assuming you do not care about the index, you can use loc[-1] to add the row:
df = pd.read_csv('file.csv', sep=';', dtype={'get_min': 'object'})  # read the csv; keep get_min as strings so the leading zero survives
row = [df['keep_same'].values[0], df['get_max'].max(), df['get_min'].min(), df['get_avg'].mean()]  # build the new row
df.loc[-1] = row  # add the row under a new label
df['get_avg'] = df['get_avg'].round(1)  # round to 1 decimal place
df['get_avg'] = df['get_avg'].apply(lambda x: '%g' % x)  # strip the .0 from the other records
df.to_csv('file1.csv', index=False, sep=';')  # write out to csv
out:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.7
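Since pandas is optional here, a minimal standard-library sketch of the same computation (assuming the file is named file.csv):
import csv
with open('file.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    header = next(reader)
    rows = list(reader)
summary = [
    rows[0][0],                                  # keep_same: value is identical in every row
    max((r[1] for r in rows), key=int),          # get_max, kept in its original string form
    min((r[2] for r in rows), key=int),          # get_min, preserves the leading zero in "009"
    str(round(sum(int(r[3]) for r in rows) / len(rows), 1)),  # get_avg
]
with open('file.csv', 'a', newline='') as f:
    csv.writer(f, delimiter=';').writerow(summary)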
