Segment dataframe every time a value repeats in Python using pandas - python

I am trying to generate strings, or at least a different dataframe, from a dataframe that I have. The one that I have is:
Line MM/DD/YYhh:mm:ss.ms.us TEST
9 04/17/2013:44:18.215.500 S
20 04/17/2013:44:18.216.020 U
27 04/17/2013:44:18.216.544 P
34 04/17/2013:44:18.217.064 P
39 04/17/2013:44:18.217.584 L
48 04/17/2013:44:18.218.104 Y
55 04/17/2013:44:18.218.627 P
62 04/17/2013:44:18.219.147 R
69 04/17/2013:44:18.219.667 <CR>
76 04/17/2013:44:18.220.187 <LF>
179 04/17/2013:44:18.721.249 U
184 04/17/2013:44:18.721.769 L
193 04/17/2013:44:18.722.289 <CR>
200 04/17/2013:44:18.722.812 <LF>
304 04/17/2013:44:19.236.017 E
311 04/17/2013:44:19.236.537 R
318 04/17/2013:44:19.237.060 R
327 04/17/2013:44:19.237.580 <CR>
334 04/17/2013:44:19.238.100 <LF>
371 04/17/2013:44:19.649.033 M
376 04/17/2013:44:19.649.553 O
383 04/17/2013:44:19.650.073 D
390 04/17/2013:44:19.650.596 E
395 04/17/2013:44:19.651.116 ?
402 04/17/2013:44:19.651.636 <CR>
409 04/17/2013:44:19.652.156 <LF>
489 04/17/2013:44:20.160.040 T
496 04/17/2013:44:20.160.560 P
505 04/17/2013:44:20.161.084 <CR>
512 04/17/2013:44:20.161.604 <LF>
607 04/17/2013:44:20.642.301 P
614 04/17/2013:44:20.642.821 R
623 04/17/2013:44:20.643.345 <CR>
630 04/17/2013:44:20.643.865 <LF>
I am trying to format the above snippet to strings such that it looks like this
04/17/2013:44:18.220.187-SUPPLYPR<CR><LF>
04/17/2013:44:18.722.812-UL<CR><LF>
.
.
.
It should take the MM/DD/YY value from the row where TEST is <LF>, combine all the TEST values up to and including each <LF>, and make one string per occurrence of <LF>. The raw data I started from was different, and getting it into this dataframe took a lot of work, but now I am stuck on how to get this format. Any ideas or suggestions will be appreciated. Thanks :)

You are looking for groupby:
(df.groupby(df.TEST.shift().eq('<LF>').cumsum())
   .agg({'MM/DD/YYhh:mm:ss.ms.us': 'last',
         'TEST': ''.join})
   .reset_index(drop=True)
)
Output:
MM/DD/YYhh:mm:ss.ms.us TEST
0 04/17/2013:44:18.220.187 SUPPLYPR<CR><LF>
1 04/17/2013:44:18.722.812 UL<CR><LF>
2 04/17/2013:44:19.238.100 ERR<CR><LF>
3 04/17/2013:44:19.652.156 MODE?<CR><LF>
4 04/17/2013:44:20.161.604 TP<CR><LF>
5 04/17/2013:44:20.643.865 PR<CR><LF>
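To see why the grouping key works, and to get the dash-joined strings the question asks for, here is a minimal sketch on a shortened, made-up sample. The column name `stamp` and the variable names `group_id`, `out` and `strings` are assumptions for illustration; the real frame uses `MM/DD/YYhh:mm:ss.ms.us`.

```python
import pandas as pd

# Shortened, made-up sample; "stamp" stands in for the real timestamp column.
df = pd.DataFrame({
    "stamp": ["t1", "t2", "t3", "t4", "t5", "t6", "t7"],
    "TEST": ["S", "U", "<CR>", "<LF>", "U", "L", "<LF>"],
})

# A row starts a new group when the previous row was <LF>;
# cumsum turns those start flags into a running group number.
group_id = df["TEST"].shift().eq("<LF>").cumsum().rename("group")

out = (df.groupby(group_id)
         .agg({"stamp": "last", "TEST": "".join})
         .reset_index(drop=True))

# Join timestamp and text with a dash, as in the desired output.
strings = (out["stamp"] + "-" + out["TEST"]).tolist()
print(strings)
```

Shifting by one row before comparing is what keeps each `<LF>` at the end of its own group instead of at the start of the next one.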

Related

Finding Common Elements (Amazon SDE-1)

Given two lists V1 and V2 of sizes n and m respectively. Return the list of elements common to both the lists and return the list in sorted order. Duplicates may be there in the output list.
Link to the problem : LINK
Example:
Input:
5
3 4 2 2 4
4
3 2 2 7
Output:
2 2 3
Explanation:
The first list is {3 4 2 2 4}, and the second list is {3 2 2 7}.
The common elements in sorted order are {2 2 3}
Expected Time complexity : O(N)
My code:
class Solution:
    def common_element(self,v1,v2):
        dict1 = {}
        ans = []
        for num1 in v1:
            dict1[num1] = 0
        for num2 in v2:
            if num2 in dict1:
                ans.append(num2)
        return sorted(ans)
Problem with my code:
Dictionary lookups take constant time, so this reduced my time complexity, but one of the hidden test cases is failing. My logic is simple and straightforward, and everything seems correct. What's your take? Is the logic wrong, or is the question description missing some vital details?
New Approach
Now I am generating two hashmaps/dictionaries, one for each array. If a number is present in both arrays, I take its minimum frequency and append that number to ans that many times.
class Solution:
    def common_element(self,arr1,arr2):
        dict1 = {}
        dict2 = {}
        ans = []
        for num1 in arr1:
            dict1[num1] = 0
        for num1 in arr1:
            dict1[num1] += 1
        for num2 in arr2:
            dict2[num2] = 0
        for num2 in arr2:
            dict2[num2] += 1
        for number in dict1:
            if number in dict2:
                minFreq = min(dict1[number],dict2[number])
                for _ in range(minFreq):
                    ans.append(number)
        return sorted(ans)
The code is outputting nothing for this test case
Input:
64920
83454 38720 96164 26694 34159 26694 51732 64378 41604 13682 82725 82237 41850 26501 29460 57055 10851 58745 22405 37332 68806 65956 24444 97310 72883 33190 88996 42918 56060 73526 33825 8241 37300 46719 45367 1116 79566 75831 14760 95648 49875 66341 39691 56110 83764 67379 83210 31115 10030 90456 33607 62065 41831 65110 34633 81943 45048 92837 54415 29171 63497 10714 37685 68717 58156 51743 64900 85997 24597 73904 10421 41880 41826 40845 31548 14259 11134 16392 58525 3128 85059 29188 13812.................
Its Correct output is:
4 6 9 14 17 19 21 26 28 32 33 42 45 54 61 64 67 72 77 86 93 108 113 115 115 124 129 133 135 137 138 141 142 144 148 151 154 160 167 173 174 192 193 195 198 202 205 209 215 219 220 221 231 231 233 235 236 238 239 241 245 246 246 247 254 255 257 262 277 283 286 290 294 298 305 305 307 309 311 312 316 319 321 323 325 325 326 329 329 335 338 340 341 350 353 355 358 364 367 369 378 385 387 391 401 404 405 406 406 410 413 416 417 421 434 435 443 449 452 455 456 459 460 460 466 467 469 473 482 496 503 .................
And Your Code's output is:
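Please find the solution below:
def sorted_common_elemen(v1, v2):
    res = []
    remaining = list(v1)  # copy, so the caller's list is not modified
    for elem in v2:
        # keep elem only while an unmatched copy is left in v1
        if elem in remaining:
            res.append(elem)
            remaining.remove(elem)
    return sorted(res)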
Your code ignores the number of times a given element occurs in the list. I think this is a good way to fix that:
class Solution:
    def common_element(self, l0, l1):
        li = []
        for i in l0:
            if i in l1:
                l1.remove(i)
                li.append(i)
        return sorted(li)
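The list-based fix works, but `in` and `remove` on a list are linear, so it is quadratic in the worst case. A minimal sketch of the same idea with `collections.Counter` follows; the function name `common_elements` is an assumption, not part of the original problem's interface. Counting takes O(n + m); the final sort adds O(k log k) for k common elements.

```python
from collections import Counter

def common_elements(v1, v2):
    """Sorted common elements, each repeated min(count in v1, count in v2) times."""
    # Counter & Counter keeps the minimum count of every shared key.
    shared = Counter(v1) & Counter(v2)
    return sorted(shared.elements())

# Example from the problem statement
print(common_elements([3, 4, 2, 2, 4], [3, 2, 2, 7]))  # [2, 2, 3]
```

Unlike the loop above, this does not modify either input list.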

How to use certain rows of a dataframe in a formula

So I have multiple data frames and all need the same kind of formula applied to certain sets within this data frame. I got the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # might use/need it later to check the output
df = pd.read_csv('Dalfsen.csv')
l = []
x = []
y = []
#the formula(trendline)
def rechtzetten(x,y):
    a = (len(x)*sum(x*y)- sum(x)*sum(y))/(len(x)*sum(x**2)-sum(x)**2)
    b = (sum(y)-a*sum(x))/len(x)
    y1 = x*a+b
    print(y1)
METING = df.ID.str.contains("<METING>") #locating the sets
indicatie = np.where(METING == False)[0] #and saving them somewhere
if n in df[n] != indicatie & n+1 != indicatie: #attempt to add parts of the set in l
    append.l
elif n in df[n] != indicatie & n+1 == indicatie: #attempt defining the end of the set and using the formula for the set
    append.l
    rechtzetten(l.x, l.y)
else: #emptying the storage for the new set
    l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not used data
...
</PROFIEL>,,,,,,
tl;dr I'm trying to apply a formula to each profile as shown above. I want to edit the data between 2 numbers of the list indicatie.
For example:
the function rechtzetten(x,y) should be applied to df.x[1:11] and df.y[1:11] (because [0] and [12] are in the list indicatie), and then the same for [14:25], etc.
What I try to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
I can't understand your question clearly, but if you want to replace a specific column of your pandas dataframe with a numpy array, you can simply assign it:
df['Column'] = numpy_array
Can you be more specific?
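If the goal is to run `rechtzetten` once per run of consecutive `<METING>` rows, one possible approach is to number the runs with a cumulative sum over the non-`<METING>` rows and group by that number. The sketch below uses a small hypothetical frame in place of `Dalfsen.csv`, and it makes `rechtzetten` return the trendline values instead of printing them; the names `block_id` and `trends` are assumptions.

```python
import numpy as np
import pandas as pd

def rechtzetten(x, y):
    """Least-squares trendline values a*x + b, same formula as in the question."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    b = (np.sum(y) - a * np.sum(x)) / n
    return a * x + b

# Hypothetical stand-in for the CSV: two profiles with 3 and 2 measurements.
df = pd.DataFrame({
    "ID": ["<PROFIEL>", "<METING>", "<METING>", "<METING>", "</PROFIEL>",
           "<PROFIEL>", "<METING>", "<METING>", "</PROFIEL>"],
    "x": [np.nan, 0.0, 1.0, 2.0, np.nan, np.nan, 0.0, 2.0, np.nan],
    "y": [np.nan, 1.0, 3.0, 5.0, np.nan, np.nan, 4.0, 0.0, np.nan],
})

is_meting = df["ID"].str.contains("<METING>", regex=False)
# Every non-METING row bumps the counter, so each run of consecutive
# METING rows shares one block number.
block_id = (~is_meting).cumsum().rename("block")

trends = []
for _, block in df[is_meting].groupby(block_id[is_meting]):
    trends.append(rechtzetten(block["x"], block["y"]))

for t in trends:
    print(t)
```

This avoids slicing by hand between the numbers in `indicatie`, since the grouping finds the boundaries itself.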

Regex search for word in column using variable inside regex, return matched word and variable name

Basically I'm trying to get the regex matched word + the word from the loop on the same row of the DataFrame if that makes sense.
I've tried creating lists from column names and zip through but that didn't work. I'm also not sure how to use a variable name in the re.findall with regex, otherwise I'd try something like:
result = pd.DataFrame()
for word in file2:
x = re.findall(word.*, file1)
result = result.append(x, word)
I know the above doesn't work, and there are probably a few reasons why. I'd really love some explanation as to why the above mock code wouldn't work and how to make it work. Meanwhile, I came up with the below, which HALF works:
import pandas as pd
from pandas import read_csv
file1 = pd.DataFrame(read_csv('alldatacols.csv'))
file2 = pd.DataFrame(read_csv('trainColumnsT2.csv'))
new_df = pd.DataFrame()
for word in file2['Name']:
    y = file1['0'].str.extractall(r'({}.*)'.format(word))
    new_df = new_df.append(y, ignore_index=True)
new_df.reset_index(drop=True)
print(new_df)
print(new_df.columns)
Result:
0 Id
1 MSSubClass
2 MSZoning_C (all)
3 MSZoning_FV
4 MSZoning_RH
5 MSZoning_RL
6 MSZoning_RM
7 LotFrontage
8 LotArea
9 Street_Grvl
10 Street_Pave
11 Alley_Grvl
12 Alley_Pave
13 LotShape_IR1
14 LotShape_IR2
15 LotShape_IR3
16 LotShape_Reg
17 LandContour_Bnk
18 LandContour_HLS
19 LandContour_Low
20 LandContour_Lvl
21 Utilities_AllPub
22 Utilities_NoSeWa
23 LotConfig_Corner
24 LotConfig_CulDSac
25 LotConfig_FR2
26 LotConfig_FR3
27 LotConfig_Inside
28 LandSlope_Gtl
29 LandSlope_Mod
.. ...
266 PoolArea
267 PoolQC_Ex
268 PoolQC_Fa
269 PoolQC_Gd
270 Fence_GdPrv
271 Fence_GdWo
272 Fence_MnPrv
273 Fence_MnWw
274 MiscFeature_Gar2
275 MiscFeature_Othr
276 MiscFeature_Shed
277 MiscFeature_TenC
278 MiscVal
279 MoSold
280 YrSold
281 SaleType_COD
282 SaleType_CWD
283 SaleType_Con
284 SaleType_ConLD
285 SaleType_ConLI
286 SaleType_ConLw
287 SaleType_New
288 SaleType_Oth
289 SaleType_WD
290 SaleCondition_Abnorml
291 SaleCondition_AdjLand
292 SaleCondition_Alloca
293 SaleCondition_Family
294 SaleCondition_Normal
295 SaleCondition_Partial
Example output I'm looking for:
17 LandContour_Bnk LandContour
18 LandContour_HLS LandContour
19 LandContour_Low LandContour
20 LandContour_Lvl LandContour
274 MiscFeature_Gar2 MiscFeature
275 MiscFeature_Othr MiscFeature
276 MiscFeature_Shed MiscFeature
277 MiscFeature_TenC MiscFeature
281 SaleType_COD SaleType
282 SaleType_CWD SaleType
283 SaleType_Con SaleType
284 SaleType_ConLD SaleType
285 SaleType_ConLI SaleType
286 SaleType_ConLw SaleType
287 SaleType_New SaleType
288 SaleType_Oth SaleType
289 SaleType_WD SaleType
290 SaleCondition_Abnorml SaleCondition
291 SaleCondition_AdjLand SaleCondition
292 SaleCondition_Alloca SaleCondition
293 SaleCondition_Family SaleCondition
294 SaleCondition_Normal SaleCondition
295 SaleCondition_Partial SaleCondition
Please help me understand how to get this over the hump. Thank you!
So I just made another DataFrame, populated by searching for just the word without the .*. Then I concatenated both DataFrames and got what I needed. That was a simple solution I couldn't think of for the last 7 hours while I tried other approaches.
file1 = pd.DataFrame(read_csv('alldatacols.csv'))
file2 = pd.DataFrame(read_csv('trainColumnsT2.csv'))
new_df = pd.DataFrame()
df = pd.DataFrame()
for word in file2['Name']:
    y = file1['0'].str.extractall(r'({}.*)'.format(word))
    x = file1['0'].str.extractall(r'({})'.format(word))
    new_df = new_df.append(y, ignore_index=True)
    df = df.append(x, ignore_index=True)
new_df.reset_index(drop=True)
df.reset_index(drop=True)
result = pd.concat([new_df, df], axis=1)
result.to_csv('result.csv')
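A variable can go into a regex through ordinary string formatting, ideally passed through `re.escape` first. As an alternative to looping and appending (note that `DataFrame.append` was removed in pandas 2.0), here is a sketch that builds one pattern from all the words and extracts both the full match and the word in a single pass. The sample `names` and `words` are made up to stand in for the two CSV files, and the group names `match` and `word` are assumptions.

```python
import re
import pandas as pd

# Made-up stand-ins for file1['0'] and file2['Name'].
names = pd.Series(["Id", "LandContour_Bnk", "LandContour_HLS",
                   "SaleType_COD", "MiscVal"])
words = ["LandContour", "SaleType"]

# Escape each word so regex metacharacters in it are taken literally,
# then join the words into one alternation.
alternation = "|".join(re.escape(w) for w in words)
# Outer group: the word plus the rest of the string. Inner group: the word alone.
pattern = rf"(?P<match>(?P<word>{alternation}).*)"

# str.extract returns one column per named group; rows without a match are NaN.
result = names.str.extract(pattern).dropna()
print(result)
```

One caveat: with an alternation the first matching word wins, so if one word is a prefix of another, the order of `words` matters.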

How can I read a file with a different number of columns in each row?

My data looks like this:
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425
How can I read this file structure in Python? I want to extract a specific column from a row. For example, if I want the value in the second row, second column, how can I do that? I've tried 'loadtxt' with a string data type, but that requires string index slicing, and I could not proceed because the values have different numbers of digits. Moreover, each row has a different number of columns. Can you help me?
Thanks in advance.
Use something like this to split it
split2=[]
split1=txt.split("\n")
for item in split1:
    split2.append(item.split(" "))
I have stored the given data in "data.txt". Try the code below.
res = []
# Read the file and split each line on whitespace
with open('data.txt') as f:
    lines = f.read().split("\n")
for i in lines:
    arr = i.split()
    res.append(arr)
# Print every row; rows may have different lengths
for i in range(len(res)):
    for j in range(len(res[i])):
        print(res[i][j], sep=' ', end=' ', flush=True)
    print()
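Once the rows are stored as a list of lists, a single value is just two index lookups. A minimal sketch on a shortened sample of the data (read from an in-memory string here instead of a file, which is an assumption made to keep the example self-contained):

```python
import io

# Shortened sample of the data; a real file opened with open() works the same way.
text = """0 199 1028 251 1449
1 1214 910 631
2 1199 700 1717 450 1043
"""

# One list of ints per non-empty line; rows keep their own lengths.
rows = [[int(tok) for tok in line.split()]
        for line in io.StringIO(text) if line.strip()]

value = rows[1][1]  # second row, second column (indices start at 0)
print(value)
print([len(r) for r in rows])
```

A plain list of lists fits here because the rows are ragged; a rectangular array would need padding.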

Sum of specific rows in a dataframe (Pandas)

I'm given a set of the following data:
week A B C D E
1 243 857 393 621 194
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
5 712 734 308 385 303
I’m asked to find the sum of each column for specified rows/a specified number of weeks, and then plot those numbers onto a bar chart to compare A-E.
Assuming I have the rows I need (e.g. df.iloc[2:4,:]), what should I do next? My assumption is that I need to create a mask with a single row that includes the sum of each column, but I'm not sure how I go about doing that.
I know how to do the final step (i.e. .plot(kind='bar'), I just need to know what the middle step is to obtain the sums I need.
To select by position, you can use iloc, then sum and Series.plot.bar:
df.iloc[2:4].sum().plot.bar()
Or, if you want to select by index labels (here weeks), use loc:
df.loc[2:4].sum().plot.bar()
The difference is that iloc excludes the last position:
print (df.loc[2:4])
A B C D E
week
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
print (df.iloc[2:4])
A B C D E
week
3 946 252 453 547 436
4 560 100 864 663 949
And if you also need to filter columns by position:
df.iloc[2:4, :4].sum().plot.bar()
And by labels (weeks and column names):
df.loc[2:4, list('ABCD')].sum().plot.bar()
All you need to do is call .sum() on your subset of the data:
df.iloc[2:4,:].sum()
Returns:
week 7
A 1506
B 352
C 1317
D 1210
E 1385
dtype: int64
Furthermore, for plotting, I think you can probably get rid of the week column (as the sum of week numbers is unlikely to mean anything):
df.iloc[2:4,1:].sum().plot(kind='bar')
# or
df[list('ABCDE')].iloc[2:4].sum().plot(kind='bar')
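The two selection styles give different sums on the question's data, which is easy to check without plotting. The sketch below rebuilds the table with `week` as the index (an assumption that matches the first answer's printed frames) and compares position-based and label-based slicing.

```python
import pandas as pd

# The table from the question, with week as the index.
df = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "A": [243, 644, 946, 560, 712],
    "B": [857, 576, 252, 100, 734],
    "C": [393, 534, 453, 864, 308],
    "D": [621, 792, 547, 663, 385],
    "E": [194, 207, 436, 949, 303],
}).set_index("week")

by_position = df.iloc[2:4].sum()  # positions 2 and 3, i.e. weeks 3 and 4
by_label = df.loc[2:4].sum()      # labels 2 through 4 inclusive, i.e. weeks 2, 3, 4

print(by_position)
print(by_label)
```

Either result is a Series indexed by column name, so `.plot.bar()` or `.plot(kind='bar')` can be chained directly onto it.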
