WhatsApp is more than a necessity now. From a casual chat application used between friends and family, it has grown into a dependable channel even for formal and serious communication. A lot of information sits in those conversations, stored on your phone all the time. And since WhatsApp does not keep this information on its servers, your phone holds the only backup copy of what you have communicated in the past. I used to email conversations to my own address as a backup, but that backup was just a text file: searching it was a pain, and there was no easy way to look up the conversation from a particular date. Hence I wrote this parser to see if it benefits me in some way or other.
So I followed the usual steps for writing a parser in PHP:
- Read the file
- Parse the single complete message sent in the conversation
- Extract information from that single message (e.g. name, date & time, message)
- Once you have the extracted information you can store it wherever you want (In this script I am using SQLite 3)
Reading the file & parsing the information
I used fgets to read the file line by line. But some messages span more than one line. So I first worked out the pattern with which a message starts, and I keep reading the file line by line until a line matches that pattern. At that point I know the previous message is complete, so I can store it and move on to the newly matched message.
The pattern to match the start of the message:
(([A-Za-z]{1,3}\s[0-9]{1,2},\s[0-9]{1,2}[:]{1}[0-9]{1,2}\s[APM]{2})|([A-Za-z]{1,3}\s[0-9]{1,2},\s[0-9]{1,4},\s[0-9]{1,2}[:]{1}[0-9]{1,2}\s[APM]{2}))\s-\s(.*):\s([\s\S]*)
I am not an expert in regex, but this pattern is what I was able to build, and it works fine AFAIK.
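To see what the pattern captures, here is a minimal sketch that runs it against a few sample lines, one without a year, one with a year, and one continuation line. The sample lines are made up for illustration (their format is inferred from the pattern), and `$pattern` and `describeLine` are hypothetical names, not part of the original script.

```php
<?php
// Pattern from above, wrapped in delimiters. Group 1 wraps both timestamp
// alternatives, group 2 is the timestamp without a year, group 3 the one
// with a year, group 4 the author and group 5 the message text.
$pattern = '/(([A-Za-z]{1,3}\s[0-9]{1,2},\s[0-9]{1,2}[:]{1}[0-9]{1,2}\s[APM]{2})'
         . '|([A-Za-z]{1,3}\s[0-9]{1,2},\s[0-9]{1,4},\s[0-9]{1,2}[:]{1}[0-9]{1,2}\s[APM]{2}))'
         . '\s-\s(.*):\s([\s\S]*)/';

// Hypothetical helper: returns the captured parts, or null if the line
// does not look like the start of a message.
function describeLine($pattern, $line)
{
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    return array(
        'datetime' => $m[1],
        'author'   => $m[4],
        'message'  => $m[5],
    );
}

// Made-up sample lines
$samples = array(
    "Jan 5, 10:30 PM - John: Hello there",
    "Dec 25, 2013, 9:05 AM - Jane: Merry Christmas",
    "this line continues the previous message",
);

foreach ($samples as $line) {
    var_dump(describeLine($pattern, $line));
}
```

One thing this makes easy to check: because `(.*)` is greedy, a first line that itself contains ": " would push part of the message into the author field, so it may be worth testing the pattern against a few of your own exports.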
And here is the script that keeps reading line by line until it can conclude that a message is complete:
$handle = @fopen('path-of the-file-to-parse', "r");
if ($handle) {
    // $RegEx holds the start-of-message pattern shown above
    $regEx = '/' . $RegEx . '/';
    $tmpBuffer = "";
    $line = 0;
    while (($buffer = fgets($handle)) !== false) {
        $matches = array();
        preg_match($regEx, $buffer, $matches);
        if (count($matches) == 6 && $line != 0) {
            // This line starts a new message, so the message buffered so far is complete: store it
            parseAndStore($tmpBuffer);
            $tmpBuffer = $buffer;
        } else {
            // Not the start of a message, so this line belongs to the current message: append it
            $tmpBuffer .= $buffer;
        }
        $line++;
    }
    // Store the very last message of the conversation
    if ($tmpBuffer !== "") {
        parseAndStore($tmpBuffer);
    }
    fclose($handle);
}
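The buffering logic can be tried in isolation on an in-memory sample instead of a real export. This is only a sketch under assumptions: `splitMessages` is a hypothetical helper, the conversation is made up, and the "is the buffer still empty" check stands in for the `$line != 0` test used in the script.

```php
<?php
// Same start-of-message pattern as above, wrapped in delimiters.
$pattern = '/(([A-Za-z]{1,3}\s[0-9]{1,2},\s[0-9]{1,2}[:]{1}[0-9]{1,2}\s[APM]{2})'
         . '|([A-Za-z]{1,3}\s[0-9]{1,2},\s[0-9]{1,4},\s[0-9]{1,2}[:]{1}[0-9]{1,2}\s[APM]{2}))'
         . '\s-\s(.*):\s([\s\S]*)/';

// Hypothetical helper: reads an open stream line by line and returns an
// array of complete (possibly multi-line) messages.
function splitMessages($handle, $pattern)
{
    $messages = array();
    $tmpBuffer = "";
    while (($buffer = fgets($handle)) !== false) {
        if (preg_match($pattern, $buffer) === 1 && $tmpBuffer !== "") {
            // A new message starts here, so the buffered one is complete.
            $messages[] = rtrim($tmpBuffer, "\r\n");
            $tmpBuffer = $buffer;
        } else {
            // Continuation of the current message (or the very first line).
            $tmpBuffer .= $buffer;
        }
    }
    if ($tmpBuffer !== "") {
        // Flush the last message of the conversation.
        $messages[] = rtrim($tmpBuffer, "\r\n");
    }
    return $messages;
}

// Made-up conversation held in memory instead of a file on disk.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "Jan 5, 10:30 PM - John: Hello\nare you there?\nJan 5, 10:31 PM - Jane: Yes\n");
rewind($handle);

$messages = splitMessages($handle, $pattern);
fclose($handle);

print_r($messages);
```

The two-line message from John comes back as a single entry, and the final message is only emitted by the flush after the loop, which is why that last step matters.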
Extracting the information from the message
The same regex does the job again: we use it to extract the date and time, the person's name and the message. There are two timestamp formats, though. Messages from the current year omit the year, while messages from past years include it. The regex takes care of both.
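function parseAndStore($msg)
{
    // Assumes $RegEx (the pattern shown above) is defined in the global scope
    global $RegEx;
    $arrRecord = array();
    $matches = array();
    $regEx = '/' . $RegEx . '/i';
    preg_match($regEx, $msg, $matches);
    if (count($matches) == 6) {
        // Group 2 is the timestamp without a year, group 3 the one with a year.
        // preg_match fills the unmatched group with an empty string, so isset() cannot tell them apart.
        $arrRecord["datetime"] = ($matches[2] !== "") ? $matches[2] : $matches[3];
        $arrRecord["author"] = $matches[4];
        $arrRecord["message"] = $matches[5];
    }
    if (count($arrRecord) > 0) {
        // store the message
        return true;
    }
    return false;
}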
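For the storage step, a minimal sketch using PHP's bundled SQLite3 class might look like the following. The table name `messages`, its columns, the helper `storeRecord` and the in-memory database are all assumptions made for illustration, and the sketch assumes the sqlite3 extension is enabled; the actual script may use a different schema.

```php
<?php
// Hypothetical helper: inserts one parsed record using a prepared statement,
// so quotes inside the message text cannot break the SQL.
function storeRecord(SQLite3 $db, array $arrRecord)
{
    $stmt = $db->prepare(
        'INSERT INTO messages (sent_at, author, message) VALUES (:sent_at, :author, :message)'
    );
    $stmt->bindValue(':sent_at', $arrRecord['datetime'], SQLITE3_TEXT);
    $stmt->bindValue(':author', $arrRecord['author'], SQLITE3_TEXT);
    $stmt->bindValue(':message', $arrRecord['message'], SQLITE3_TEXT);
    return $stmt->execute() !== false;
}

// In-memory database for the sketch; pass a file path instead to persist the data.
$db = new SQLite3(':memory:');
$db->exec('CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sent_at TEXT,
    author TEXT,
    message TEXT
)');

// Made-up record in the same shape as $arrRecord above.
storeRecord($db, array(
    'datetime' => 'Jan 5, 10:30 PM',
    'author'   => 'John',
    'message'  => "Hello\nare you there?",
));

// Count everything a given author said.
$stmt = $db->prepare('SELECT COUNT(*) FROM messages WHERE author = :author');
$stmt->bindValue(':author', 'John', SQLITE3_TEXT);
$count = $stmt->execute()->fetchArray(SQLITE3_NUM)[0];
echo $count, "\n";
```

Here the timestamp is stored as plain text exactly as it appears in the export, so looking up a particular date would mean matching on that text or converting it to a sortable format first.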
That's it. You can get the complete script on GitHub: Whatsapp Email Conversation Parser
Let me know if you have any suggestions or doubts.