I have data in a txt file and need to separate the data. Apologies, but I am really finding this hard (and maybe hard to explain as well). Below are the top few lines of the txt file (there are 1000 lines). I need all the data between the first * in row 0 and the last *, which is in row 700. I don't want to select by row number, as the numbers can change; I want code which will select the data between the *. Secondly, the data is NOT separated into columns and it is one big row. I want a second piece of code which can separate the data into columns, i.e. Latter REPORT, Calculation Date, Index Code are columns (I can't split on spaces because that splits Calculation and Date into separate columns when they should be one column). Please can someone help me, and thank you!
0
0 *
1 #124 Latter REPORT D51D ...
2 # 1 Calculation Date calc_da...
3 # 2 Index Code modes2_in...
4 # 3 Index Name index_n...
120 #120 5 Years ADPS Growth Rate 5_years...
121 #121 1 Year ADPS Growth Rate 1_year_...
122 #122 Payout Ratio payout_...
123 #123 Reserved 26 reserve...
124 #124 Reserved 27 reserve...
125 *
Assuming the dataframe is called dat, for the first part to find the asterisks:
asterisk_location = dat[0] == '*'                         # boolean mask: which rows are just '*'
asterisk_location = asterisk_location[asterisk_location]  # keep only the True rows
start, finish = asterisk_location.index                   # row positions of the two asterisks (assumes the default 0..n index)
dat = dat.iloc[start+1:finish]                            # everything strictly between them
This also assumes you want to get the region between the first two asterisks. If there's more, you'll have to adjust a bit.
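For the second part of the question (splitting the single text column into real columns), here is a minimal sketch, assuming the whole line sits in dat[0] and that the fields are separated by runs of two or more spaces, so that "Calculation Date" stays in one column. If the file is actually fixed-width, pd.read_fwf is worth a look instead.
# split on two-or-more spaces so multi-word fields stay together
columns = dat[0].str.strip().str.split(r'\s{2,}', expand=True)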
I am trying to select the highest value from this data, but I also need the month it comes from, i.e. I want to print the whole row. Currently I'm using df.max(), which just pulls the highest value. Does anyone know how to do this in pandas?
#current code
accidents["month"] = accidents.Date.apply(lambda s: int(s.split("/")[1]))
temp = accidents.groupby('month').size().rename('Accidents')
#selecting the highest value from the dataframe
temp.max()
The answer given = 10937
The answer I need should look like this (month and number of accidents): 11 10937
temp looks like this:
month
1 9371
2 8838
3 9427
4 8899
5 9758
6 9942
7 10325
8 9534
9 10222
10 10311
11 10937
12 9972
Name: Accidents, dtype: int64
It would also be good to rename the Accidents column to accidents, if anyone can help with that too. Thanks
If the maximum value is unique (in your case it is) you can simply select the matching subset. Since temp is a Series, a boolean mask does it:
temp[temp == temp.max()]
The mask keeps only the row whose value equals the maximum, so the result shows both the month (the index) and the number of accidents.
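Alternatively, a small sketch using the same temp Series from the question: idxmax returns the index label of the largest value directly, and rename sets the Series name (which also covers the renaming question).
month = temp.idxmax()            # index label (month) of the largest value
print(month, temp.max())         # -> 11 10937
temp = temp.rename('accidents')  # the values column is now labelled 'accidents'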
I am trying to make a word cloud from my dataframe, shown below:
Borough Minor Text 2019
Bexley Arson 4
Bexley Burglary - Business 11
Bexley Burglary - Residential 130
Bexley Drug Trafficking 5
I want to visualise the most frequent items in the Minor Text column in a word cloud, but the problem is that the frequency is in the '2019' column as an integer. The actual dataframe is quite large but follows the same format as above. Can anyone suggest how I can transform my 'Minor Text' column so that I can accurately create a word cloud?
Thanks
I don't know your visualization tool's criteria. For example, PowerBI's word cloud does not accept an integer value to control the size of a word; it bases the size on how many times the item is repeated. Therefore, the way I deal with it is by transforming the text into a list and then multiplying it by the integer (hence repeating the text the number of times the integer says). PowerBI then sees that row 3 is repeated 130 times while row 4 only 5 times, making row 3's text 26 times larger than row 4's.
Having explained this, this is the line of code I use:
df['Visual text'] = df['Minor Text'].map(lambda x: [x]) * df['2019']  # e.g. ['Arson'] * 4 -> a list with the text repeated 4 times
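If you end up drawing the cloud in Python rather than PowerBI, a sketch using the wordcloud package (the output file name is just illustrative): its generate_from_frequencies() takes a {word: weight} mapping, so the '2019' counts can drive the word sizes directly, with no repetition trick.
from wordcloud import WordCloud

freqs = df.groupby('Minor Text')['2019'].sum().to_dict()   # e.g. {'Arson': 4, 'Burglary - Residential': 130, ...}
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(freqs)
wc.to_file('minor_text_cloud.png')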
To aggregate and to find values per second, I am doing the following in Python using pandas; however, the output logged to a file doesn't show the columns in the order they appear here. Somehow the column names get sorted, and hence TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
Apart from the above question, I have a couple of other questions:
Despite logging in CSV format, all values/columns appear in one column in Excel; is there any way to load the CSV properly?
Can rows be sorted based on one column (let's say InsertsSec) by default when writing to the CSV file?
Any help here would be really appreciated.
Assume that your DataFrame is something like this:
Deletes Selects Updates Inserts
Name
Xxx 20 10 40 50
Yyy 12 32 24 11
Zzz 70 20 30 20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)
The downside is that all numbers are converted to float, but there is no way around this in pandas, as each column must hold values of a single type.
Then you can e.g. write it to a CSV file (with totals included).
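For the original column-order question: when the row is built from a plain dict, older pandas sorts the keys alphabetically, so one simple option is to put the columns back in the order you want after the aggregation. A sketch, assuming the aggregated frame is called df and uses the column names from your snippet:
col_order = ['DeletesTotal', 'DeletesSec', 'SelectsTotal', 'SelectsSec',
             'UpdateTotal', 'UpdatesSec', 'InsertsTotal', 'InsertsSec',
             'TotalDMLsSec']
df = df[col_order]

# sort by one column before writing, and use a comma separator so Excel
# splits the fields into columns
df = df.sort_values('InsertsSec', ascending=False)
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')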
This is how I ended up doing it:
df.to_excel(vExcelFile,'All')
vSortedDF=df.sort_values(['Deletes%'],ascending=False)
vSortedDF.loc[vSortedDF['Deletes%']> 5, ['DeletesTotal','DeletesSec','Deletes%']].to_excel(vExcelFile,'Top Delete objects')
vExcelFile.save()
For CSV, instead of the separator \t I used ',' and it worked just fine:
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')
I have a pandas dataframe that looks like:
cleanText.head()
source word count
0 twain_ess 988
1 twain_ess works 139
2 twain_ess short 139
3 twain_ess complete 139
4 twain_ess would 98
5 twain_ess push 94
And a dictionary that contains the total word count for each source:
titles
{'orw_ess': 1729, 'orw_novel': 15534, 'twain_ess': 7680, 'twain_novel': 60004}
My goal is to normalize the word counts for each source by the total number of words in that source, i.e. turn them into a percentage. This seems like it should be trivial, but Python seems to make it very difficult (if anyone could explain the rules for in-place operations to me, that would be great).
The caveat comes from needing to filter the entries in cleanText to just those from a single source; I then attempt to divide the counts for this subset in place by the value in the dictionary.
# Adjust total word counts and normalize
for key, value in titles.items():
    # This corrects the total words for overcounting the '' entries
    overcounted = cleanText[cleanText.iloc[:, 0] == key].iloc[0, 2]
    titles[key] = titles[key] - overcounted
    # This is where I divide by total words; however, it does not save in place, or at all for that matter
    cleanText[cleanText.iloc[:, 0] == key].iloc[:, 2] = cleanText[cleanText.iloc[:, 0] == key]['count'] / titles[key]
If anyone could explain how to alter this division statement so that the output is actually saved in the original column that would be great.
Thanks
If I understand correctly:
cleanText['count']/cleanText['source'].map(titles)
Which gives you:
0 0.128646
1 0.018099
2 0.018099
3 0.018099
4 0.012760
5 0.012240
dtype: float64
To re-assign these percentage values into your count column, use:
cleanText['count'] = cleanText['count']/cleanText['source'].map(titles)
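If you also want the over-count correction from the question, a sketch (assuming, as the question's own loop does, that the '' entry is the first row of each source) is to adjust the totals first and then apply the same map-based division:
corrected = {key: value - cleanText.loc[cleanText['source'] == key, 'count'].iloc[0]
             for key, value in titles.items()}
cleanText['count'] = cleanText['count'] / cleanText['source'].map(corrected)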
I have a number of small dataframes, each with a date and the stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US Equity, all_dfs[1] would be Date and MMM US Equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different and the numbers associated with that stock column are always different. So when you call all_dfs[1] this is the dataframe you would see (i.e., all_dfs[1].head()):
IDX Date MMM US equity
0 1/3/2000 47.19
1 1/4/2000 45.31
2 1/5/2000 46.63
3 1/6/2000 50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = df['P_CHG1D'] > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it is returning a syntax error for the all_dfs[i] part.
for i in range (len(df.columns)):
for all_dfs[i]:
df['P_CHG1D'] = df.loc[:,0].pct_change(1) * 100
So I also have 2 problems that I can not figure out:
I don't know how to add columns to every dataframe in the loop. I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
The second part after the =, i.e. df['Stock Name 1'], keeps changing (in this example the column is called MMM US Equity, but the next time it would be the column header of the next dataframe, so it could be IBM US Equity), as each dataframe has a different column name, so I don't know how to refer to it properly in the loop.
I am new to python/pandas so if I am thinking about this the wrong way let me know if there is a better solution.
Consider iterating through the length of all_dfs and referencing each element in the loop by its index. For the first new column, use .iloc to select the stock column by its column position of 2 (the third column):
for i in range(len(all_dfs)):
    all_dfs[i] = all_dfs[i].copy()   # work on a real copy to avoid SettingWithCopyWarning
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
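If you would rather refer to the stock column by name than by position (since the name changes from one dataframe to the next), a small sketch, assuming the price is the last column of each original frame:
for i in range(len(all_dfs)):
    stock_col = all_dfs[i].columns[-1]   # e.g. 'MMM US equity'; assumes the price is the last column
    all_dfs[i]['P_CHG1D'] = all_dfs[i][stock_col].pct_change(1) * 100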