Data Explorer: Explore Files Using AWK Locally and on Hadoop

In our daily work, we need to explore text files for different purposes. One of these purposes is data validation or data quality, for example applying ETL data tests in the staging area.


Normally, we load our files into a staging area in our data warehouse using several techniques, but mostly we load the CDRs or text files using the serialization method appropriate to the database engine.

Sometimes, due to the provisioning process, the received files contain bad records. Your automated loading process cannot catch this issue on its own, so it is better to reconcile the files regularly, either by checking them manually or by creating a scheduled task that reports the details of the received files on a daily basis.
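Such a scheduled task can be sketched in a few lines of shell. The directory, pattern, and report names below are hypothetical (adjust them to your own landing area), and two sample compressed CDR files are generated so the sketch is self-contained:

```shell
#!/bin/sh
# Minimal sketch of a daily reconciliation task. LANDING_DIR, PATTERN and
# REPORT are hypothetical names; point them at your own environment. Two
# sample compressed CDR files are generated here for illustration.
LANDING_DIR=$(mktemp -d)
PATTERN='TRAF:5'
REPORT="$LANDING_DIR/recon_$(date +%Y%m%d).txt"

printf 'TRAF:5|a\nTRAF:1|b\nTRAF:5|c\n' | gzip > "$LANDING_DIR/file1.gz"
printf 'TRAF:2|d\nTRAF:5|e\n' | gzip > "$LANDING_DIR/file2.gz"

# For each received file, count the rows matching the pattern and log
# date, file name and count to the daily report.
for f in "$LANDING_DIR"/*.gz; do
  count=$(gzip -dc "$f" | awk -v p="$PATTERN" '$0 ~ p {c++} END {print c+0}')
  printf '%s %s %s\n' "$(date +%F)" "$(basename "$f")" "$count" >> "$REPORT"
done
cat "$REPORT"
```

Scheduling this with cron gives the daily overview of received files mentioned above.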

Note: the examples below work with plain text files too; just remove the gzip-related parts.

Below we describe how to reconcile CDR files using AWK commands, either locally or on Hadoop.


  • To count the number of rows that match a pattern in a compressed file locally
awk '/TRAF:5/ {count++} END{print count}' <(gzip -dc 2002182301011052_TLG.gz)
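Note that `<(...)` is bash-specific process substitution and will not run under plain `sh`; piping `gzip -dc` into awk is the portable equivalent. A self-contained sketch (the sample file content is invented):

```shell
# The <(...) form is bash process substitution; the pipe below is the
# portable equivalent. The sample file is generated for illustration.
cd "$(mktemp -d)"
printf 'TRAF:5|x\nTRAF:3|y\nTRAF:5|z\n' | gzip > 2002182301011052_TLG.gz
gzip -dc 2002182301011052_TLG.gz | awk '/TRAF:5/ {count++} END{print count}'
# Expected: 2
```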

  • To count the number of rows that match a pattern in a plain text file locally
awk '/TRAF:5/ {count++} END{print count}' 2002182301011052_TLG.txt

  • To count the number of rows that match a pattern in multiple compressed files
awk '/TRAF:5/ {count++} END{print count}' <(gzip -dc *.gz)
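For reconciliation, a per-file breakdown is often more useful than one aggregated total. A sketch using a loop (the sample files are generated in a temp directory for illustration; in practice, point the glob at your received CDRs):

```shell
# Count the matching rows separately for each compressed file. The two
# sample files below are invented so the sketch is self-contained.
cd "$(mktemp -d)"
printf 'TRAF:5\nTRAF:5\nX\n' | gzip > a_TLG.gz
printf 'TRAF:5\nY\n' | gzip > b_TLG.gz

for f in *_TLG.gz; do
  printf '%s %s\n' "$f" "$(gzip -dc "$f" | awk '/TRAF:5/ {count++} END{print count+0}')"
done
# Expected:
# a_TLG.gz 2
# b_TLG.gz 1
```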

  • To use AWK with Hadoop for a transformation, merging the last three columns and writing the result back to HDFS
hadoop fs -cat /user/oracle/jawad/ds.txt | awk -F"," '{ print $1","$2","$3$4$5 }' | hadoop fs -put - /user/oracle/jawad/newds.txt
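Before running the transformation through `hadoop fs -cat` and `hadoop fs -put`, the awk program itself can be dry-run locally on a couple of made-up rows:

```shell
# Dry-run the column-merge transform locally on invented rows; columns 3,
# 4 and 5 are concatenated into a single field.
printf 'a,b,c,d,e\n1,2,3,4,5\n' | awk -F"," '{ print $1","$2","$3$4$5 }'
# Expected:
# a,b,cde
# 1,2,345
```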


  • To count the number of rows that match a pattern in a compressed file hosted on HDFS, storing the result back on HDFS
hadoop fs -cat /user/oracle/jawad/basrtr2.2004221201011052_TLG.gz | gzip -dc | awk '/TRAF:5/ {count++} END{print count}' | hadoop fs -put - /user/oracle/jawad/chk.txt



  • To count the number of rows that match a pattern in a file hosted on HDFS, put the result in a file on HDFS, and then download it locally. Note that unlike hadoop fs -put, hadoop fs -get cannot read from stdin, so the result is written to HDFS first and then fetched:
hadoop fs -cat /user/oracle/jawad/basrtr2.2004221201011052_TLG.gz | gzip -dc | awk '/TRAF:5/ {count++} END{print count}' | hadoop fs -put - /user/oracle/jawad/chk.txt
hadoop fs -get /user/oracle/jawad/chk.txt /home/user/jawad
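If the count is only needed locally, the HDFS round trip can be skipped entirely by redirecting awk's output to a local file. In the sketch below a locally generated .gz stands in for the `hadoop fs -cat` stream, since no cluster is assumed:

```shell
# Redirect the count straight to a local file. The local sample .gz here
# stands in for `hadoop fs -cat <file.gz>` on a real cluster.
cd "$(mktemp -d)"
printf 'TRAF:5\nTRAF:9\nTRAF:5\n' | gzip > sample_TLG.gz
gzip -dc sample_TLG.gz | awk '/TRAF:5/ {count++} END{print count}' > chk.txt
cat chk.txt
# Expected: 2
```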
