In our daily work, we need to explore text files for different purposes, one of these purposes is for data validation or data quality to apply ETL date test in the staging area.
To count the number of rows that match with a pattern in a Text file locally
Normally, we load our files to a staging area in our data warehouse using several technics, but mostly we load the CDRs or text files using the serialization method respecting the type of Database engine,
sometimes, due to the provisioning process, the received files have bad records, this issue cannot figure out with your automated loading process, for that it is better to reconcile the files regularly by checking them manually or creating a scheduled task to provide you details of received files in a daily basis.
Note: the below example works with text files too, just remove gzip related
Below we describe how you can reconcile CDRs files using AWK commands in Hadoop or Locally.
- To count the number of rows that match with a pattern in a compressed file locally
awk '/TRAF:5/ {count++} END{print count}' <(gzip -dc 2002182301011052_TLG.gz)
awk '/TRAF:5/ {count++} END{print count}' 2002182301011052_TLG.txt
- #To count the number of rows that match with the pattern in multiple compressed files
awk '/TRAF:5/ {count++} END{print count}' <(gzip -dc *.gz)
- Using Awk with Hadoop by doing transformation by merging the last three columns and putting it in Hadoop again
hadoop fs -cat /user/oracle/jawad/ds.txt | awk -F"," '{ print $1","$2","$3$4$5 }' | hadoop fs -put - /user/oracle/jawad/newds.txt
- Count the number of rows that match with pattern in the file hosted on HDFS
hadoop fs -cat /user/oracle/jawad/basrtr2.2004221201011052_TLG.gz | gzip -dc | awk '/TRAF:5/ {count++} END{print count}' | hadoop fs -put - /user/oracle/jawad/chk.txt
- Count the number of the rows that match with the pattern in the files hosted on HDFS and put the result in a file after that download it locally
hadoop fs -cat /user/oracle/jawad/basrtr2.2004221201011052_TLG.gz | gzip -dc | awk '/TRAF:5/ {count++} END{print count}' | hadoop fs -get - /home/user/jawad
Comments
Post a Comment