Using AI for extracting Usernames, Emails, Phone Numbers, and Personal Names from large datasets
Extracting relevant information from large blobs of text, such as text files, PDFs, Excel spreadsheets and JSON files, can be a time-consuming and frustrating task. Most people in OSINT will acknowledge this, since processing and exploiting the collected data is, on average, one of the most time-consuming parts of an investigation.
Within Open-Source Intelligence (OSINT) we very often collect large amounts of data in the form of text. During investigations there often is a need to extract valuable information from the text that can be used to pivot or help answer research questions.
Not every OSINT investigator feels comfortable extracting usernames, emails, phone numbers and personal names from data. They may have collected the data in various formats, which makes them think they need specialised software or specific knowledge to extract these key data points.
One could think of crafting Python or Bash scripts that use regular expressions to iterate over the collected data and extract what you need. And yes, this is one way to achieve these goals.
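To give an idea of what that scripted route looks like, here is a minimal sketch in Python. The two patterns and the function name `extract_identifiers` are my own illustrative assumptions, not a complete solution: they only catch plainly written email addresses and @-style handles.

```python
import re

# Deliberately simple patterns for illustration; real-world data needs broader ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A handle is an "@" that is not preceded by a word character or a period,
# so the domain part of an email address is not mistaken for a username.
HANDLE_RE = re.compile(r"(?<![\w.])@([A-Za-z0-9_.]{2,30})")


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return unique emails and @handles found in text, in order of appearance."""
    emails = list(dict.fromkeys(EMAIL_RE.findall(text)))
    # Strip a trailing period that belongs to the sentence, not to the handle.
    handles = list(dict.fromkeys(h.rstrip(".") for h in HANDLE_RE.findall(text)))
    return {"emails": emails, "handles": handles}


sample = (
    "Contact jane.doe@example.org or ping @jane_doe on the forum. "
    "Backup address: jane.doe@example.org"
)
print(extract_identifiers(sample))
```

Anything that deviates from these patterns is silently missed, which is exactly where this approach becomes laborious.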
However, with the help of AI-powered tools like ChatGPT and Google Bard, you can create prompts that efficiently extract usernames, emails, phone numbers and personal names, so you do not need Python programming skills or extensive knowledge of regular expressions. In this blog, I will walk you through the process of creating such prompts step by step, with practical examples.
Keep in mind that the options and examples below are not exhaustive. They are meant to encourage you to start thinking about how to create similar prompts or refine the ones discussed in this blog.
Understand the Data Structure
Before creating prompts, it's important to understand the structure and format of the data you will be working with.
Different file types may require different approaches for extraction.
For instance, a PDF may have its text embedded, while an Excel file may contain multiple sheets.
Sometimes we also find data in XML, JSON and many other formats. Familiarise yourself with the data structure to plan your prompts accordingly.
You can even use OSINT techniques to find information that will help you familiarise yourself with specific data structures or file types.
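One way to get a feel for a nested format such as JSON is to flatten it into simple "path: value" lines before handing it to a prompt. The sketch below uses only the Python standard library; the function name `flatten_json` and the sample record are hypothetical and made up for illustration.

```python
import json


def flatten_json(value, path=""):
    """Yield (path, text) pairs for every non-null scalar in a parsed JSON document."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_json(child, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{path}[{index}]")
    elif value is not None:
        yield path, str(value)


# Made-up sample record, only here to show the resulting layout.
raw = '{"user": {"handle": "@jane_doe", "contacts": ["jane.doe@example.org", "555-0100"]}}'
lines = [f"{path}: {text}" for path, text in flatten_json(json.loads(raw))]
print("\n".join(lines))
```

The flattened lines show at a glance which fields exist and where identifiers are likely to live, which helps when deciding what to ask for in a prompt.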
Identify Patterns and Regular Expressions
To extract specific information, you need to identify patterns that are unique to the data you are looking for. For example, an email address typically follows the format "email@example.com."
Spend some time analysing the data and identifying common patterns for usernames, email addresses, phone numbers, and personal names. Also spend some time identifying common and maybe even uncommon enumeration variants of the data you're trying to find and extract.
Sometimes privacy-aware people and suspects go to great lengths to obfuscate their personal identifiers, such as phone numbers, aliases and emails.
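To make the idea of a pattern concrete, here is a hedged sketch of a regular expression for a few common US-style phone number layouts. The pattern `PHONE_RE` is my own assumption for illustration; other national formats and obfuscated variants will need their own patterns.

```python
import re

# Matches layouts such as 555-010-0123, (555) 010-0123 and 5550100123,
# optionally preceded by a +country code. Longer digit runs are ignored.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+\d{1,3}[\s.-]?)?(?:\(\d{3}\)\s?|\d{3}[\s.-]?)\d{3}[\s.-]?\d{4}(?!\d)"
)

# Made-up sample text using fictitious 555-01xx numbers.
text = "Call 555-010-0123, (555) 010-0199 or +1 555.010.0142; order id 1234567890123."
print(PHONE_RE.findall(text))
```

Writing a pattern like this forces you to spell out which variants you expect, and that same list of variants is what makes a prompt precise.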
Create Prompts for ChatGPT or Google Bard or equivalents
Now that you have a better understanding of the data structures in your collected data and have identified the patterns, it's time to create prompts. Both ChatGPT and Google Bard can be utilised to recognise and extract information using specific prompts.
As long as your seed data (the large text files I discussed earlier) is obtained from open sources, there should be no issues sharing that data with these AI tools. Keep in mind that if you intend to use data coming from a closed source, you are sharing this data with a third party.
Let's take a look at some basic practical examples:
Extracting Usernames:
Prompt: "Extract all usernames from the given text in the next prompt."
Extracting Email Addresses:
Prompt: "Find all email addresses mentioned in the text in the next prompt."
Extracting Phone Numbers:
Prompt: "Identify and extract all phone numbers from the provided text in the next prompt."
Extracting Personal Names:
Prompt: "Extract personal names from the given text in the next prompt."
All of the above prompts will run and will extract much of the data you are after, but they might not be specific or precise enough. The key to success when prompting AI is to be very precise and specific about what should be done and what should be taken into account.
Thinking about Regular Expressions to create more advanced prompts
It's important to understand the challenge at hand. People often employ various techniques to obfuscate their personal information in text files.
To enhance the accuracy of extraction, I've found that incorporating the principles of regular expressions in your prompts helps. As investigators, we often do not know how specific text is shared online. For example, a phone number can be written in many variants.
How would you find the examples of a phone number below?
As you can see, a phone number is not always digits only. There might be symbols, alphanumeric characters, underscores, periods and emoji, to name a few things you should take into consideration.
This means we should craft our prompts to take as many of these possibilities into account as we can. Think of usernames that contain periods, underscores or maybe even emoji. You will be amazed at how many ways personal identifiers are shared online.
These are some examples of topics you could take into consideration when crafting your prompts:
Misspellings or alterations: Users may intentionally misspell or alter their usernames, emails, or phone numbers to avoid detection.
Symbol substitution: Special characters or symbols may be used as replacements for letters or digits. For example, "@" instead of "a" in an email address.
Disguised formatting: Users may employ unconventional formatting techniques like inserting spaces, dashes, or parentheses to hide their personal information.
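For comparison, this is roughly what undoing two of these techniques looks like when done by hand in Python. It is a minimal sketch under my own assumptions: the function names are hypothetical, and only the "[at]" / "(dot)" substitutions and separator-padded phone numbers are handled.

```python
import re


def deobfuscate_email(text: str) -> str:
    """Undo common '[at]' and '(dot)' style substitutions (illustrative, not exhaustive)."""
    text = re.sub(r"\s*[\[\(\{]\s*at\s*[\]\)\}]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*", ".", text, flags=re.IGNORECASE)
    return text


def normalise_phone(candidate: str) -> str:
    """Keep a leading '+' and the digits; drop spaces, dashes, dots and parentheses."""
    digits = re.sub(r"\D", "", candidate)
    return ("+" if candidate.strip().startswith("+") else "") + digits


# Made-up examples using a reserved example domain and a fictitious number.
print(deobfuscate_email("jane.doe [at] example (dot) org"))
print(normalise_phone("+1 (555) 010-0199"))
```

Every new trick requires another rule in code like this, whereas a prompt can describe the whole family of variations in plain language.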
Here are some practical examples of prompts that will try to extract information keeping in mind the many variations of how some piece of information may appear in your dataset.
Extracting Usernames:
Prompt: "Find all usernames mentioned in the text. A username typically starts with an @ symbol and may consist of alphanumeric characters, underscores, periods or emoji."
Extracting Email Addresses:
Prompt: "Identify and extract all email addresses from the given content in the next prompt. An email address typically follows the format 'firstname.lastname@example.org' and may contain alphanumeric characters, underscores, periods, or dashes."
Extracting Phone Numbers:
Prompt: "Extract all phone numbers mentioned in the next prompt. Phone numbers may vary in format, but common patterns include XXX-XXX-XXXX, (XXX) XXX-XXXX, or XXXXXXXXXX. They may include country codes and may contain alphanumeric characters, underscores, periods, dashes or emoji."
Extracting Personal Names:
Prompt: "Find and extract personal names from the provided data in the next prompt. Personal names generally consist of a first name and a last name, with the first letter capitalised, and may contain alphanumeric characters, underscores, periods, dashes and emoji. There might be names that are not capitalised or names that have one or more middle names or initials between the first and last name."
Creating effective prompts may require some extra refinement. Test your prompts on different datasets where you know the outcome and adjust them as needed. Fine-tuning your prompts will improve their accuracy in extracting the desired information.
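One possible way to make this testing measurable is to compare what a prompt returned with a hand-labelled answer set. The sketch below is an assumption on my part rather than a fixed method: `score_extraction` is a hypothetical helper and the sample values are made up.

```python
def score_extraction(expected: set[str], extracted: set[str]) -> dict[str, float]:
    """Compare a prompt's output with a hand-labelled answer set."""
    true_positives = len(expected & extracted)
    # Precision: how much of what the prompt returned is actually correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: how much of what should have been found was found.
    recall = true_positives / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}


# Made-up test data: the known answers and what a prompt returned.
expected = {"jane.doe@example.org", "j.smith@example.com"}
extracted = {"jane.doe@example.org", "info@example.net"}
print(score_extraction(expected, extracted))
```

Running the same test set after each change to a prompt shows whether the refinement actually helped or only shifted the errors.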
Combining prompts and automating prompts
The individual prompt examples can of course be combined into one larger prompt that extracts multiple data points in one run. This will save you more time. You can even have your prompt output the data in a specific format, such as a CSV or Excel file.
Again, this is all part of "crafting" your prompt in such a way that the AI model understands exactly what you want it to do.
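As a rough sketch of a combined prompt with a fixed output format, the Python below assembles one prompt and parses a CSV answer. Everything here is an assumption for illustration: the wording of the prompt, the text markers, the column names and the hand-written reply. No AI tool is actually called.

```python
import csv
import io

FIELDS = ["type", "value"]


def build_combined_prompt(text: str) -> str:
    """Assemble one prompt that asks for all four identifier types as CSV."""
    return (
        "Extract all usernames, email addresses, phone numbers and personal names "
        "from the text between the markers. Answer with CSV only, using the header "
        f"'{','.join(FIELDS)}' and one identifier per row.\n"
        f"---BEGIN TEXT---\n{text}\n---END TEXT---"
    )


def parse_csv_reply(reply: str) -> list[dict[str, str]]:
    """Parse a CSV answer, skipping rows where a field is missing or empty."""
    reader = csv.DictReader(io.StringIO(reply.strip()))
    return [row for row in reader if all(row.get(field) for field in FIELDS)]


# Hypothetical reply, hand-written here to show the parsing step.
reply = "type,value\nemail,jane.doe@example.org\nusername,@jane_doe\n"
print(parse_csv_reply(reply))
```

Asking for a strict header makes the answer easy to load into a spreadsheet and easy to check for rows the model left incomplete.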
There are also many plug-ins available that let you store your collected text blobs in cloud-hosted or local storage. You could automate your prompt to run as soon as new text blobs are uploaded to your preferred storage location and automatically write the output to a specified location.
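If you prefer to wire this up yourself instead of using a plug-in, the first building block is noticing which files are new. The sketch below assumes a local folder of .txt files; the function name `find_new_files` is hypothetical, and the step that sends each file to an AI tool is deliberately left out.

```python
import tempfile
from pathlib import Path


def find_new_files(inbox: Path, seen: set[str]) -> list[Path]:
    """Return .txt files in inbox that were not processed before, and mark them as seen."""
    new_files = [p for p in sorted(inbox.glob("*.txt")) if p.name not in seen]
    seen.update(p.name for p in new_files)
    return new_files


# Demo in a throwaway directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    seen: set[str] = set()
    (inbox / "dump1.txt").write_text("first blob")
    print([p.name for p in find_new_files(inbox, seen)])  # first run sees dump1.txt
    (inbox / "dump2.txt").write_text("second blob")
    print([p.name for p in find_new_files(inbox, seen)])  # second run sees only dump2.txt
```

Calling this on a schedule and passing each new file to your prompt gives a simple version of the automation described above.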
One could even think of building alerts into prompts: if you set a specific rule in your prompt, it could generate an email or SMS alert when a username, phone number, email or name that you want to know about immediately is found in the dataset.
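The rule behind such an alert can be as simple as a watchlist check on the extracted identifiers. This is a minimal sketch with a hypothetical helper, `find_watchlist_hits`; it only prints the hits, and sending the actual email or SMS is left to whatever service you use.

```python
def find_watchlist_hits(extracted: list[str], watchlist: set[str]) -> list[str]:
    """Return extracted identifiers on the watchlist, ignoring case and surrounding spaces."""
    normalised = {item.strip().lower() for item in watchlist}
    return [value for value in extracted if value.strip().lower() in normalised]


# Made-up identifiers, only here to show the matching step.
hits = find_watchlist_hits(["@Jane_Doe", "j.smith@example.com"], {"@jane_doe"})
for hit in hits:
    print(f"ALERT: watchlisted identifier found: {hit}")
```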
With critical thinking and the help of AI tools like ChatGPT and Google Bard, extracting specific data from large blobs of text has become more efficient and less time-consuming.
By following a structured methodology, you can create prompts that are capable of extracting valuable information for your OSINT investigations. Remember to understand the data structure, identify patterns, and use critical thinking to enhance the accuracy of your prompts. This is a never-ending process. Combine prompts to be more efficient and automate them for repetitive tasks.
And if you have created your own prompts that do a terrific job, share them with the public. Happy hunting! 🙂