Introduction

Minimum requirements

Twint and other tools apart, I'm assuming you know a bit about OSINT and friends; things like the ‘intelligence cycle’ (from now on just ‘cycle’) and basically how to make decisions (when to apply deduction or induction).

If you don't know much about OSINT, I suggest you read this post by Dutch OsintGuy.

The Twint Project can be found here.

Why this post?

I have seen various blog posts about Twint, but they simply explain how to use it to scrape some tweets; while there's nothing wrong with that, it's only one step of the cycle.

Here I'll put Twint through a few complete cycles, using it to its full potential, as every tool deserves to be used.

Before we start

Please consider that doing OSINT is almost like playing with LEGO: everyone can start with the same pieces, and the result is determined by what you build. There are no super-secret databases or tools; there is just you, your mind, and your tools.

This is what makes OSINT awesome.

Basic usage

Let's cover some examples to get an idea of how to deal with Twint.

Username + Hashtag

With this snippet we will get every tweet sent by noneprivacy that contains the #osint hashtag.

import twint

c = twint.Config()
c.Username = "noneprivacy"
c.Search = "#osint"

twint.run.Search(c)

Followers/Following

With this snippet we will scrape every follower of the twitter account; to scrape who twitter is following, just use twint.run.Following(c). Please consider that Twitter is getting pretty good at blocking scrapers, so you might not be able to get every follower/following. In that case there's nothing wrong with using the Twitter API to get the data.

Remember that during the recon phase you will most probably use different tools. If a tool doesn't return the expected results or doesn't work, just use another one that fits your needs!

import twint

c = twint.Config()
c.Username = "twitter"

twint.run.Followers(c)

Real World Scripts

Enough basic scripts for now; you can find other examples in the Wiki.

Let's say that we want to get the followers of a user whose handle is target, keeping only those that have at least 1000 followers.

We will first get the followers, then iterate over them and extract only the users that have at least 1000 followers.

import twint

# get the followers first
c = twint.Config()
c.Username = "target"
c.Store_object = True
c.User_full = True

twint.run.Followers(c)

# save them in a list
target_followers = twint.output.users_list

# iterate over them and save in a new list
K_followers = []

for user in target_followers:
    if user.followers >= 1000:
        K_followers.append(user)

# now we can save them in a CSV file, for example
with open('K_followers.csv', 'w') as output:
    output.write('id,username,followers,following\n')
    for u in K_followers:
        output.write('{},{},{},{}\n'.format(u.id, u.username, u.followers, u.following))

We covered a really simple example here; what makes it so “special” is that we followed the three basic steps of OSINT:

  • define a target, and the goals we want to achieve;
  • get the information that we need;
  • analyze and filter the obtained data.

At this point we should take a deep dive into those followers, for example by getting their tweets and extracting the top 10 hashtags they use. Let's see how!

Extract the top N hashtags

Let's continue from the previous example. In this case we already have the list of users; if it has fewer than 27 users we can use one single query. We could even spawn several processes to make the scraping faster.

import twint

custom_query = ""
hashtags = {}

with open('K_followers.csv', 'r') as input:
    # we can ignore the first row
    input.readline()
    line = input.readline()
    while line:
        user = line.split(',')[1]
        hashtags.update({user: {}})
        custom_query += "from:{} OR ".format(user)
        line = input.readline()
    custom_query = custom_query[:-4]

c = twint.Config()
c.Custom_query = custom_query
c.Store_object = True
c.Store_csv = True
c.Output = "tweets.csv"

# we want to hide the output, there will be a lot of tweets and the terminal might crash
c.Hide_output = True

twint.run.Search(c)

tweets = twint.output.tweets_list

# now we have all the tweets, let's elaborate the data

# first iterate over the tweets
for t in tweets:
    # then iterate over the hashtags of that single tweet
    for h in t.hashtags:
        # increment the count if the hashtag already exists, otherwise initialize it to 1
        try:
            hashtags[t.username][h] += 1
        except KeyError:
            hashtags[t.username].update({h: 1})

# now save the data
with open('hashtags.csv', 'w') as output:
    output.write('username,hashtag,count\n')
    for user in hashtags:
        for h in hashtags[user]:
            output.write('{},{},{}\n'.format(user, h, hashtags[user][h]))

We got the data and filtered it; now we have to analyze it to get more information about the target, and to categorize it. For this we could make a bar chart, for example, to see the percentage of every single hashtag for every single user. We can also look at the most shared hashtags to see if our targets are part of “communities”.
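
As a minimal sketch of this analysis, assuming pandas and matplotlib are available and using the hashtags.csv file produced above, we could plot each user's hashtag usage as percentages:

import pandas as pd
import matplotlib.pyplot as plt

# load the per-user hashtag counts produced above
df = pd.read_csv('hashtags.csv')

# turn raw counts into per-user percentages
df['pct'] = df.groupby('username')['count'].transform(lambda c: 100 * c / c.sum())

# one bar chart per user, showing the share of each hashtag
for user, group in df.groupby('username'):
    top = group.sort_values('pct', ascending=False).head(10)
    top.plot.bar(x='hashtag', y='pct', title=user, legend=False)
    plt.ylabel('% of hashtags used')
    plt.show()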

What communities are talking about?

To identify communities, hashtags aren't enough. You need interactions between users: who replied to whom and who mentioned whom, for example.

So let's say that we have a few users that seem to be part of the same community since they use almost the same hashtags. What we want to achieve is to enlarge the circle to other users.

import requests
import twint

custom_query = ""
mentioned = {}
replied = {}

mentions = ['mention1', 'mention2', 'mention3']
replies = ['reply1', 'reply2', 'reply3']

for m in mentions:
    mentioned.update({m: {}})
    custom_query += "@{} OR ".format(m)
custom_query = custom_query[:-3] # -3 because we want to leave a space

for r in replies:
    replied.update({r: {}})
    custom_query += "to:{} OR ".format(r)
custom_query = custom_query[:-4]

# Twint setup here
c = twint.Config()
c.Custom_query = custom_query
c.Store_object = True
c.Store_csv = True
c.Output = "tweets_mentions_replies.csv"

# we want to hide the output, there will be a lot of tweets and the terminal might crash
c.Hide_output = True

twint.run.Search(c)

tweets = twint.output.tweets_list

# now iterate over the tweets to do a bit of statistics
# we will determine the most mentioned users
# and the user that got the most replies
for t in tweets:
    # iterate over the mentioned users
    for m in t.mentions:
        try:
            mentioned[m]['count'] += 1
        except KeyError:
            mentioned.update({m: {'by': t.username, 'count': 1}})

    # the tweet object doesn't tell us which user was replied to,
    # but from the conversation ID we can get redirected to the original tweet's URL
    # and extract the username from that URL

    _reply_to = requests.get('https://twitter.com/statuses/{}'.format(t.conversation_id)).request.path_url.split('/')[1]
    try:
        replied[_reply_to]['count'] += 1
    except KeyError:
        replied.update({_reply_to: {'by': t.username, 'count': 1}})
    print('.', end='', flush=True)

# and now save to CSV for further analysis
actions = {'mentioned': mentioned, 'replied': replied}
for a in actions:
    action = actions[a]
    with open('{}.csv'.format(a), 'w') as output:
        output.write('author,{},count\n'.format(a))
        for user in action:
            output.write('{},{},{}\n'.format(action[user]['by'], user, action[user]['count']))

Now, looking into mentioned.csv and replied.csv, we will be able to see who mentioned the most and who got the most mentions, and the same for the replies. So if, for example, two users are both the most mentioned and the ones who mention the most, we can deduce that they are at the core of the discussions.
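
The two CSVs only keep aggregate counts, so as a rough sketch (assuming pandas) we can at least cross-reference the rankings: users that rank high both for received mentions and for received replies are good candidates for the core of the discussion.

import pandas as pd

# rankings built from the CSVs written above
mentioned = pd.read_csv('mentioned.csv')
replied = pd.read_csv('replied.csv')

# users that received the most mentions / replies
top_mentioned = set(mentioned.nlargest(10, 'count')['mentioned'])
top_replied = set(replied.nlargest(10, 'count')['replied'])

# users appearing in both rankings are likely central to the discussion
print(top_mentioned & top_replied)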

Consistency

The consistency of the data is given by the fact that all the tweets contain the specific hashtags and/or keywords we searched for.

How do we know our dataset is consistent? Because it is the data Twitter itself returns for our search query, so everything relies on our query being scoped correctly for our analysis goals. So, again, be sure that what you are looking for is what you need and vice versa.

Accuracy

The accuracy is given by the fact that the most used words are the ones that we looked for.

How is the accuracy of our dataset confirmed? By the fact that the words we were looking for are indeed the most used ones.

Yeah but, in a nutshell?

This is not a one-loop cycle; you could extract other commonly used words and re-run the script. ‘Most’ is relative here, don't get this wrong: you don't necessarily need to extract exactly the top three words, for example. You have to choose the words with ‘statistics in mind’: if the three most used words appear in only about 2% of the tweets, you should not add them, because their presence is too small (this case is quite rare, but I guess you get the point). Before doing this, you have to filter your dataset; for example, don't count articles, nouns, pronouns, and other ‘too general’ words. These are often called ‘stop words’. Do some testing, play with this a bit to get some confidence, plot the distribution of the words, let your mind go!
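
A minimal sketch of this word-counting idea, with a tiny hand-made stop-word list (hypothetical; in practice use a proper list, e.g. NLTK's):

from collections import Counter
import re

# a tiny, hypothetical stop-word list; in practice use a proper one (e.g. NLTK's)
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'to', 'of', 'in', 'is', 'it', 'for', 'on', 'this', 'that', 'with'}

def top_words(texts, n=20):
    # count the most used words in a list of tweet texts, skipping stop words
    counter = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                counter[word] += 1
    return counter.most_common(n)

# 'texts' would be the tweet texts collected with Twint above
print(top_words(['OSINT is fun', 'doing OSINT with Twint is fun']))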

Do some testing with Twitter Advanced Search to learn more about the operators used to exclude or forcefully include words; you will need to change the code accordingly. I highly suggest you get comfortable with Python scripting: you will rarely find the exact script that you need, or you will waste more time searching for it than it would take to build it.

How to build the queries

In the previous examples we always assumed we had the correct query, that is, the query that returns the data that we want. But how do we build a query?

Sometimes it might be enough to specify a couple of hashtags and/or keywords. For example, if we want tweets about OSINT, searching for ‘#OSINT’ would be quite enough. The same goes for other fields like infosec, privacy and security.

If we wanted to build the network of a community that is often identified by one or more hashtags (and assuming that we don't know anything else), we might first search for those hashtags using OR statements, extract the usernames in the top percentiles, and then search for tweets that mention those users or that are replies to them. In this way we will most probably identify the users playing the main roles.
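
The first step could look like this (a minimal sketch; the hashtags are placeholders), built exactly like the user OR queries above:

import twint

# placeholder hashtags that identify the community
community_hashtags = ['hashtag1', 'hashtag2', 'hashtag3']

c = twint.Config()
# builds "#hashtag1 OR #hashtag2 OR #hashtag3"
c.Custom_query = ' OR '.join('#{}'.format(h) for h in community_hashtags)
c.Store_csv = True
c.Output = "community_tweets.csv"
c.Hide_output = True

twint.run.Search(c)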

Going deeper in the analysis and studying the ‘directionality’ of the map, we can differentiate four main roles:

  • who creates new content;
  • who mentions whom;
  • who just replies;
  • who just retweets.

You can graph the interactions as a tree with four levels: at the top you place the users that create the content, and then you go down level by level to the bottom, where you place those who only retweet.
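
This isn't the four-level tree itself, but as a minimal sketch (assuming the networkx library and the mentioned.csv produced earlier) the raw interaction graph could be assembled like this, with the in/out degrees then used to bucket users into the four roles:

import csv
import networkx as nx

G = nx.DiGraph()

# one directed edge per "author mentioned user" relation, weighted by the count
with open('mentioned.csv') as f:
    for row in csv.DictReader(f):
        G.add_edge(row['author'], row['mentioned'], weight=int(row['count']))

# many incoming edges: the user gets mentioned a lot (likely a content creator);
# many outgoing edges: the user mostly mentions others
print(sorted(G.in_degree(), key=lambda x: x[1], reverse=True)[:10])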

We could go even deeper, but the aim of this post is not to explain social media/network analysis. Since here we only need to identify and construct networks of interactions, I think stopping the social media analysis at this point is a good compromise.

Queries useful for OSINT

Fortunately, in this case the queries are less complicated to build, since we (or at least I) usually search for granular information to:

  • connect that account to others on different social networks;
  • discover contact details and other juicy information.

1 - Connect a Twitter account to other socials

We could just search for tweets containing the domains in full, but since tweets only allow a few characters, shortened URLs are common, so we'll need to search for those as well. Then we analyze the data.

Useful list of full and shortened URLs:

  1. Facebook:
    • facebook.com;
    • fb.me;
    • on.fb.me.
  2. Youtube:
    • youtube.com;
    • youtu.be.
  3. Instagram:
    • instagram.com;
    • instagr.am;
    • instagr.com.
  4. Google:
    • google.com;
    • goo.gl.
  5. Linkedin:
    • linkedin.com;
    • lnkd.in.

Instagram case

import twint

# get the followers first
c = twint.Config()

c.Store_object = True
c.Username = "target"
c.Search = "\"instagram.com\"" # "instagram.com" = the tweet must contain this word
# we might add other keywords to cover everything in one search
# e.g. c.Search = "\"instagram.com\" OR \"fb.me\"", the tweet must contain at least one of the two

twint.run.Search(c)

# let's analyze the data
# I'll go straight to the content, and you'll see why
tweets = twint.output.tweets_list

links = []

for tweet in tweets:
    text = tweet.text.split(' ')
    for t in text:
        if t.startswith('instagram.com'):
            links.append(t)

ig_users = {}
for l in links:
    l = l.replace('instagram.com/', '') # clean up the data
    l = l.split('?')[0] # remove tracking codes, like ?igshid=1x89qxivizphf
    user = l.split('/')[0] # the first path segment is (usually) the username
    try:
        ig_users[user] += 1
    except KeyError:
        ig_users[user] = 1

# now let's get some rank and let's see the top 5 users
import operator

i = 0
sorted_users = dict(sorted(ig_users.items(), key=operator.itemgetter(1), reverse=True))

for s_user in sorted_users:
    if i == 5:
        break
    print('User: {} | Rank: {}'.format(s_user, sorted_users[s_user]))
    i += 1

Now if you take a look at the output, you'll see that p/ has a high count. It turns out that p/ is not a user; instead it represents a link to a specific post by a user. We can deduce that if a Twitter user shared a link that contains instagram.com/p/, he/she is sharing a post; otherwise he/she is sharing a link to an Instagram user.

Basically, if you want to search for Twitter users that shared an Instagram post, you will need to refine your query with something like c.Search = "\"instagram.com/p/\"". It follows that if you want to search for users that shared the link to a specific post, you will have to search for something like c.Search = "\"instagram.com/p/postID\"". Similarly, if you want to search for users that shared the link to a specific Instagram account, c.Search = "\"instagram.com/accountName\"".

You might search for ?igshid as well, or at least store these values, because with them you might be able to track shares across social networks and chats. Sharing ids are almost unique, so if two users shared a link to an Instagram post with the same sharing id, there might be a strong interaction between them. Or they could even be the same person with two different accounts.
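
A small sketch of that idea, reusing the tweets list scraped in the previous script:

# reusing the 'tweets' list scraped in the previous script
igshid_map = {}

for tweet in tweets:
    for token in tweet.text.split(' '):
        if 'igshid=' not in token:
            continue
        # grab whatever follows igshid= up to the next & (if any)
        sid = token.split('igshid=')[1].split('&')[0]
        igshid_map.setdefault(sid, set()).add(tweet.username)

# the same sharing id appearing under different accounts hints at a direct share
# (or the same person behind both accounts)
for sid, users in igshid_map.items():
    if len(users) > 1:
        print(sid, users)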

We started from a simple piece of information (the domain name) to get what we need, got the data, filtered and analyzed it. While analyzing it, we discovered that our initial, general-scope information splits into two categories: users and posts. Now that we are aware of this, we can enrich our query and request more granular data.

This method will work pretty well with Facebook, I'll let you play with it and discover what needs to be discovered!

Youtube case

YouTube is not like Instagram: it leaks more information when users connect their Twitter account to their YouTube one. You can search for tweets showing that a user liked a video, commented on it, or simply shared it.

Before jumping into the script, we first want to know exactly what we're searching for. So in this first stage we need to know the pattern of:

  • tweets of liked videos;
  • tweets of shared videos;
  • tweets of commented videos.

Liked video

We will search for something like "youtu.be" "youtube.com" "Liked on YouTube". At first look we see that after Liked on YouTube there is the title of the video. This information can be really useful because, if for some reason the video gets deleted, we will still know which video the user liked and can search for it on other platforms. We can apply reverse searches as well: for example, if we want to find users that liked a specific video for which we have the shortened URL, we can search for "https://youtu.be/wcLiJHz3JRc" "Liked on YouTube".
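
In Twint that could look like this (a minimal sketch; the target handle is a placeholder, the video URL is the one from the example above):

import twint

c = twint.Config()

# direct search: every "Liked on YouTube" tweet posted by a given target
c.Username = "target"
c.Search = "\"Liked on YouTube\""

# reverse search: everyone that liked one specific video
# c.Username = None
# c.Search = "\"https://youtu.be/wcLiJHz3JRc\" \"Liked on YouTube\""

twint.run.Search(c)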

Shared video

If we search just for the shortened URL, "https://youtu.be/wcLiJHz3JRc" for example, we see that a specific pattern emerges: some almost random text, the short URL and then ‘via @YouTube’. Since the text is almost random, we will search for the static part. Thus the query is "youtu.be" "youtube.com" via @YouTube. The reverse query is something like https://youtu.be/3z9sq9e5iu4 via @YouTube.

What if we don't know the URL of the video but we have a few keywords that could allow us to guess its name? Just place them in the query, but be careful here: searching for "keyword1 keyword2" is quite different from searching for keyword1 keyword2. I'll let you play a while with this, to get more confidence with queries and boolean algebra.

Commented video

Fortunately it's hard to find recent tweets about comments, at least with the query that I tried, which is via @youtube "Check out this comment". At first look, we see that the URL points to the comment posted, and not to the video. I expected to be redirected to the video, but the links that I tried returned 404. No luck here.

Facebook case

Fortunately, or not, I wasn't able to find a specific pattern for comments or shares. Maybe it's not even possible to share outside of Facebook what you commented on a post. Anyway, both direct and reverse searches work.

fb.me links end with something like a hash. No pattern, no luck.

facebook.com is where we will search for juicy information!

This second base URL has different sub-URLs:

  • userID/posts/postID (the same for videos etc.);
  • username/posts/postID (the same for videos etc.);
  • story.php?story_fbid=postID&id=userID;
  • permalink.php?story_fbid=postID&id=userID;
  • events/eventID for events.

And maybe a few others, but even those alone can be enough for most cases. If we want to search for tweets that match one of those URL patterns but we don't know the postID, we can just split the URL and search for something like "facebook.com/story.php" "&id=userID".
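
In Twint that query would simply be (userID being a placeholder for the numeric Facebook id we are after):

import twint

c = twint.Config()
# tweets linking to Facebook posts by a given profile, without knowing the postID
c.Search = "\"facebook.com/story.php\" \"&id=userID\""

twint.run.Search(c)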

Google and Linkedin

Google and LinkedIn usually append something like a hash at the end of the URL, so again no pattern here. But we can still do direct and reverse searches for links.

The ‘connect the socials’ part ends here; in general we can reuse almost the same code written for Instagram, with small changes.

2 - Discover contact information

Don't worry, this part is going to be shorter!

In this case you can use almost any kind of combo you want; here are a few (a minimal Twint example follows the list):

  • "@gmail.com", and/or other email services;
  • [at]gmail[dot]com, and/or other email services;
  • "contact us";
  • "hotline" "contact", people sometimes specify the contact for Whatsapp or other IM services;
  • various other and combos of previous examples.
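
A minimal example of one such combo (the handle and the email provider are placeholders):

import twint

c = twint.Config()
c.Username = "target"         # placeholder handle
c.Search = "\"@gmail.com\""   # tweets by the target containing a Gmail address
c.Store_csv = True
c.Output = "target_contacts.csv"

twint.run.Search(c)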

Keybase allows you to prove that you own a Twitter account; the proof consists of posting a customized tweet that Keybase gives you. Maybe it's not that popular, but it can be useful since on Keybase you can verify ownership of accounts on other social networks and add a PGP key, which may contain an email address. To extract the email address you can use this snippet, credits to Cody Zacharias (Twint Project co-founder and co-owner): link to the gist.

import requests, base64, re, sys

# fetch the target's public PGP key from Keybase (username passed as the first CLI argument)
r = requests.get("https://keybase.io/" + sys.argv[1] + "/key.asc")

# isolate the base64-encoded key material from the ASCII-armored block
body = r.text.split("\n\n")
key = body[1].split("-----")

# decode the key and print anything that looks like an email address between angle brackets
for email in re.findall(r' <(.*?)>', str(base64.b64decode(key[0]))):
    print(email)

Activity analysis

We are going to be a bit stalky here: we'll determine the daily and weekly activity of a user (or more than one). Given a timezone (ours, in this case), we can estimate the timezone of our target by seeing during which hours he/she is most active.

To make this easy, you can use either Excel or Kibana. In the first case you will have to “extract” the hour from the time field and the day of the week from the date field. I suggest using Kibana when you have to deal with a lot of tweets.
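
If you prefer to stay in Python, a small sketch with pandas (assuming the tweets were saved to tweets.csv with Store_csv as in the earlier scripts, which gives you date and time fields):

import pandas as pd

# tweets.csv was produced earlier with c.Store_csv and contains 'date' and 'time' fields
df = pd.read_csv('tweets.csv')

df['hour'] = pd.to_datetime(df['time']).dt.hour
df['weekday'] = pd.to_datetime(df['date']).dt.day_name()

# hourly and weekly activity profiles of the scraped account(s)
print(df.groupby('hour').size())
print(df.groupby('weekday').size())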

Conclusions

That's all for now; I'll update this post with more content later on!

Feel free to provide any feedback!