Join 2 text file data based on a column value - python

How can I join two text files in Python and output a third file, adding values from one file only where they have a matching key value in the second file?
Input File1.txt:
GameScore|KeyNumber|Order
85|2568909|2|
84|2672828|1|
80|2689999|5|
65|2123232|3|
Input File2.txt:
KeyName|RecordNumber
Andy|2568909|
John|2672828|
Andy|2672828|
Megan|1000021|
Required Output File3.txt:
KeyName|KeyNumber|GameScore|Order
Andy|2672828|84|1|
Andy|2568909|85|2|
John|2672828|84|1|
Megan|1000021||
Look up each key name and record number in File 2, match the record number against KeyNumber in File 1, and copy over the corresponding GameScore and Order values.
The files have anywhere from 1 to 500,000 records, so the solution needs to handle large inputs.
Edit: I do not have access to any libraries like pandas and am not allowed to install any.
Essentially I need to run a command that triggers a program which reads the two files, compares them, and generates the third file.

You can use pandas to do this:
import pandas as pd
# index_col=False stops pandas from misreading the trailing '|' as an extra index column
df1 = pd.read_csv('Input File1.txt', sep='|', index_col=False)
df2 = pd.read_csv('Input File2.txt', sep='|', header=0, names=['KeyName', 'KeyNumber'], index_col=False)
# A right join keeps every File2 row, including unmatched ones like Megan's
df3 = df1.merge(df2, on='KeyNumber', how='right')
df3 = df3[['KeyName', 'KeyNumber', 'GameScore', 'Order']]
df3.to_csv('File3.txt', sep='|', index=False)
See the documentation for fine-tuning.
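Since the edit rules out pandas and any installed libraries, here is a rough standard-library sketch of the same join (assuming the files are named File1.txt, File2.txt, and File3.txt): load File1 into a dict keyed on KeyNumber in one pass, then stream File2 and look each RecordNumber up, which stays fast even at 500,000 records.
# Hash join using only the standard library.
scores = {}
with open('File1.txt') as f1:
    next(f1)  # skip the GameScore|KeyNumber|Order header
    for line in f1:
        parts = line.rstrip('|\n').split('|')
        if len(parts) >= 3:
            # map KeyNumber -> (GameScore, Order)
            scores[parts[1]] = (parts[0], parts[2])

with open('File2.txt') as f2, open('File3.txt', 'w') as out:
    next(f2)  # skip the KeyName|RecordNumber header
    out.write('KeyName|KeyNumber|GameScore|Order\n')
    for line in f2:
        parts = line.rstrip('|\n').split('|')
        if len(parts) >= 2:
            key_name, record_number = parts[0], parts[1]
            # unmatched record numbers (e.g. Megan's) get empty fields
            game_score, order = scores.get(record_number, ('', ''))
            out.write('%s|%s|%s|%s|\n' % (key_name, record_number, game_score, order))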

Related

Append to csv file column-wise under one header

I'm coding in Python 2.7. I have a csv file called input.csv with 3 headers (Filename, Category, Version) under which certain values already exist.
I want to know how I can reopen the csv file and write a single value repeatedly under the "Version" column, so that whatever was written under "Version" gets overwritten by the new input.
So suppose under the "Version" column I had 3 inputs in 3 rows:
VERSION
55
66
88
It gets rewritten by my new input 10 so it will look like:
VERSION
10
10
10
I know csv files are normally written row-wise, but this time around I just want to write column-wise under that specific "Version" header.
Solution 1:
With pandas, you can use:
import pandas as pd
df = pd.read_csv(file)
df['VERSION'] = 10
df.to_csv(file, index=False)
Solution 2: If there are multiple rows (and you only want the first 3), then you can use:
df.loc[df.index < 3, ['VERSION']] = 10
instead of:
df['VERSION'] = 10
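If pandas isn't available, a rough sketch using only the csv module (written for the Python 2.7 the question mentions, and assuming the file is input.csv with a "Version" header) does the same overwrite:
import csv

# Read everything, rewrite the Version column, write everything back.
with open('input.csv', 'rb') as f:  # binary mode, as the Python 2.7 csv docs require
    rows = list(csv.reader(f))

col = rows[0].index('Version')  # position of the Version header
for row in rows[1:]:
    row[col] = '10'

with open('input.csv', 'wb') as f:
    csv.writer(f).writerows(rows)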

Is it possible to modify output data file names in pySpark?

Simplified case.
Given that I have 5 input files in directory data_directory:
data_2020-01-01.txt,
data_2020-01-02.txt,
data_2020-01-03.txt,
data_2020-01-04.txt,
data_2020-01-05.txt
I read them all into a PySpark RDD and perform some operation on them that doesn't involve any shuffling.
spark = SparkSession.builder.appName("Clean Data").getOrCreate()
sparkContext = spark.sparkContext
input_rdd = sparkContext.textFile("data_directory/*")
result = input_rdd.mapPartitions(lambda x: remove_corrupted_rows(x))
Now I want to save data:
result.saveAsTextFile(
    "results",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
And I get 5 files whose names all contain "part", so I've lost the information about which input file each output file came from:
._SUCCESS.crc
.part-00000.gz.crc
.part-00001.gz.crc
.part-00002.gz.crc
.part-00003.gz.crc
.part-00004.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
part-00002.gz
part-00003.gz
part-00004.gz
Is there any way to keep the input file names, or to introduce my own naming pattern in this case?
Expected desired result:
._SUCCESS.crc
.data_2020-01-01.gz.crc
.data_2020-01-02.gz.crc
.data_2020-01-03.gz.crc
.data_2020-01-04.gz.crc
.data_2020-01-05.gz.crc
_SUCCESS
data_2020-01-01.gz
data_2020-01-02.gz
data_2020-01-03.gz
data_2020-01-04.gz
data_2020-01-05.gz
You could use pyspark.sql.functions.input_file_name() (docs: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name) and then partition your dataframe by the resulting column.
This way, 5 input files should give you a categorical column with 5 distinct values, and partitioning on it should split your output into 5 parts.
Alternatively, if you want full control over the naming pattern, functionally split the dataframe on the input_file_name() column (here into 5 dataframes), repartition (e.g. to 1 partition using coalesce(1)), and then save with custom logic (e.g. a dict mapping, or by extracting the filename from the column and passing it to DataFrameWriter.csv() as the name).
N.B.: When coalescing to 1 partition, be sure each slice fits entirely into memory!
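Here is an untested sketch of that second approach (reading the rows as plain text and leaving out the question's remove_corrupted_rows step). Each slice lands in its own directory named after its source file, e.g. results/data_2020-01-01; renaming the single part file inside each directory is left as a final step.
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("Clean Data").getOrCreate()

# Tag every row with the file it came from
df = spark.read.text("data_directory/*").withColumn("source", input_file_name())

# Functionally split on the source column and save each slice separately
for src in [r["source"] for r in df.select("source").distinct().collect()]:
    name = os.path.splitext(os.path.basename(src))[0]  # e.g. "data_2020-01-01"
    (df.filter(df["source"] == src)
       .drop("source")
       .coalesce(1)  # one output part per slice; only safe if the slice fits in memory
       .write.mode("overwrite")
       .text("results/" + name, compression="gzip"))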

Python [Pandas/docx]: Merging two rows based on common name

I am trying to write a script using python-docx and pandas in Python 3 which performs the following actions:
Take input from a csv file
Merge common values of Column C and add each value into the docx
Export the docx
My raw csv is as below:
SN. Name Instance Severity
1 Name1 file1:line1 CRITICAL
2 Name1 file2:line3 CRITICAL
3 Name2 file1:line1 Low
4 Name2 file1:line3 Low
and so on...
and I want my docx output to look like this screenshot: https://i.stack.imgur.com/1xNc0.png
I am not able to figure out how I can filter the "Instance" values based on "Name" using pandas and later print them into the docx.
Thanks in advance.
The code below selects the relevant columns, groups by 'Name' and 'Severity', and joins the Instance values together:
df2 = df[["Name", "Instance", "Severity"]].copy()  # .copy() avoids SettingWithCopyWarning
df2["Instance"] = df2.groupby(['Name', 'Severity'])['Instance'].transform(lambda x: '\n'.join(x))
Finally, remove the duplicates and transpose to get the desired output:
df2 = df2.drop_duplicates()
df2 = df2.T
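For the docx step itself, here is a rough untested sketch using python-docx (assuming df2 still holds one row per Name/Severity pair, i.e. before the final transpose):
from docx import Document

doc = Document()
table = doc.add_table(rows=1, cols=3)
hdr = table.rows[0].cells
hdr[0].text, hdr[1].text, hdr[2].text = "Name", "Instance", "Severity"

# One table row per merged Name/Severity pair
for _, row in df2.iterrows():
    cells = table.add_row().cells
    cells[0].text = str(row["Name"])
    cells[1].text = str(row["Instance"])  # newline-joined instances
    cells[2].text = str(row["Severity"])

doc.save("output.docx")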

Finding all files associated with an id within a folder of images?

I'm trying to populate a dataframe based on a class label and images in a folder.
I have a folder with over 10,000 images following this name structure: ['leaflet_10000_1.jpg', 'leaflet_10000_2.jpg', 'leaflet_10001_1.jpg', 'leaflet_10001_2.jpg', 'leaflet_10002_1.jpg', 'leaflet_10002_2.jpg', 'leaflet_10003_1.jpg', 'leaflet_10003_2.jpg']
And an accompanying csv file of the structure:
ID,Location,Party,Representative/Candidate,Date
1000,Glasgow North,Liberal Democrats,,02-Apr-10
1001,Erith and Thamesmead,Labour Party,,02-Apr-10
I want to create a new csv file which has the paths of all the images for a given Party. I can filter a certain party out of the full csv file using:
df_ = df.loc[df["Party"] == "Labour Party"]
This gives me the party I am interested in, but how do I create a FULL list of all images associated with it? From the image list shared above, you can see that ID 1001 has 2 images associated with it; this is not a fixed number, and some IDs have 3 to 5 images associated with them.
How do I get this new dataframe populated with all the required paths?
My thought process is to apply str.split(name, '_') to each file name and then search every ID against all the results, but where do I go from there?
You're on the right track!
If all IDs are unique and you want an output dataframe with just the party and image number, you can do something like:
from pathlib import Path
import numpy as np
import pandas as pd
partySer = df.loc[:, ['ID', 'Party']].set_index('ID')

# Get image names
imgFiles = list(Path('./<your-image-path>/').glob('*.jpg'))
imgFiles_str = np.array([str(f) for f in imgFiles])

# Grab just the integer ID from each image name
imgIds = np.array([int(f.stem.split('_')[1]) for f in imgFiles])

# Build dataframe with matching ids
outLst = []
for curId, row in partySer.iterrows():
    matchingImgs = imgFiles_str[imgIds == curId]
    outLst.append({'Party': row['Party'], 'images': matchingImgs})
outDf = pd.DataFrame(outLst)
I haven't tested this code, but it should lead you on the right track.
Let's create a dataframe of your images and extract the ID.
from pathlib import Path
import pandas as pd

# .name keeps the .jpg extension; the ID is the second '_'-separated token
img_df = pd.DataFrame({'imgs': [i.name for i in Path(your_images).glob('*.jpg')]})
img_df['ID'] = img_df['imgs'].str.split('_', expand=True)[1].astype(int)
img_dfg = img_df.groupby('ID', as_index=False).agg(list)
ID imgs
0 10000 [leaflet_10000_1.jpg, leaflet_10000_2.jpg]
1 10001 [leaflet_10001_1.jpg, leaflet_10001_2.jpg]
2 10002 [leaflet_10002_1.jpg, leaflet_10002_2.jpg]
3 10003 [leaflet_10003_1.jpg, leaflet_10003_2.jpg]
Then we just need to merge on the ID column.
df_merged = pd.merge(df, img_dfg, on='ID', how='left')
You can then do any further operations to group or list your images.
What do you want in your DataFrame? You said that you wanted to populate your df with the required paths. If so, then using str.split(name, '_') would give you the following information for every file: its ID and its number.
You can now insert elements into your dataframe using both of these characteristics, adding any other characteristic obtained from the accompanying .csv file that you described. In the end, filtering the dataframe for all elements that match a given criterion should give you what you are looking for.
You seem to think that one ID will mean one line inside the dataframe, but that's incorrect, as each line is described by an (ID, number) pair in your case; thus, your function would already give you the full list of all images associated with the party/ID/other characteristic.
If you want to reduce the size of your dataframe, since all images related to the same ID differ in only one characteristic, you could also have a "Files" column containing a list of all images related to that ID (and thus drop the "number" column), or just the numbers associated with them, since each path is the main path followed by "_number.jpg". This solution would be a lot more efficient.

Delete Rows prior to specific String in Pandas

I have Excel files in the following format:
Sensor 1 meta
Sensor 2 meta
"Summary of Observation"
Sensor 1
Sensor 2
The number of rows before and after "Summary of Observation" is not fixed (i.e. one file may have only sensors 1 and 2, while another may have 1, 2, 3, ...).
In the dataframe, I only want the information after "Summary of Observation".
Right now, I open the Excel file, note the row from which I want the information, and pass it in:
df = pd.read_excel("1.xlsx", skiprows = %put some value here%)
Is there a way to automate this? I don't want to open Excel; rather, I'd like to import only the relevant rows (or delete the others after importing).
After importing the file you can find the index of the marker row and select the data from that point.
# I used the column name `text`; replace it with yours
idx = df[df['text']=='Summary of Observation'].index[0]
df = df[idx+1:]
print(df)
Output:
text
3 Sensor 1
4 Sensor 2
Or, if you want to include the "Summary of Observation" row, just use idx in place of idx+1.
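Putting it together, a short sketch of the whole flow (assuming the marker text sits in the first column and the sheet is read without a header row):
import pandas as pd

# header=None makes the marker row ordinary data with integer column names
raw = pd.read_excel("1.xlsx", header=None)

# Locate the marker and keep only the rows after it
idx = raw[raw[0] == "Summary of Observation"].index[0]
df = raw.iloc[idx + 1:].reset_index(drop=True)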
You can also read the Excel file without a header and use df.loc[df[0]=="Summary of Observation"].index[0] to get the index.
Working code at https://github.com/gklc811/Python3.6/blob/master/stackoverflowsamples/excel.ipynb