
Upvotocracy bot creation project (selenium + post api + webscraping)


Introduction to the project concept:

After a long time with machine learning, I recently got interested again in using web scraping for an exciting project. One of my friends, Anthony Ettinger, has started a new website called upvotocracy, a reddit clone with a number of interesting changes in voting, moderation (there is none in upvotocracy) and other things. So I planned to create a bot that posts automated content to his site to increase my karma. For now, this is allowed and is not considered illegal. My initial plan was to populate the FakeNews thread with news from The Onion using my bot, and later to make further progress in a similar direction.
A word of caution: this was a test project, and actually posting with this process ended up spamming the site unnecessarily, which got me heavily thrashed by the whole community. So don't try this on the actual platform. The goal of this post is to show how to use a basic API and how to scrape a number of different types of RSS feeds.

Project Idea:

The project has three steps.
The first step is to find the RSS feeds of well-known pages from which I want to create the posts. Also, since there is no documentation for the upvotocracy APIs, I have to figure out manually how its GET and POST APIs work for posting.
The second step is to create a function that scrapes RSS feeds. It has to handle more than one RSS template and retrieve the required elements such as title, description and images.
The third and last step is to build the JSON body for each post from the scraped data and hit the API with the proper format.
Once this is done, to make the posting fully automated, we need a program that pulls the relevant information for each category from a csv file, and a cron job that runs the script at a set time interval.
This blog post covers the first and most important parts in detail; the last parts are only explained, without explicit code.
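Although the project's code for this last part is not shown in this post, here is a minimal sketch of how the csv step could look. The column names category and feed_url, the second feed URL and the script path in the cron comment are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical csv contents: one row per thread category and its RSS feed.
# The column names and the second URL are assumptions, not the project's file.
FEEDS_CSV = """category,feed_url
marketing,https://blog.hubspot.com/marketing/rss.xml
javascript,https://example.com/js/rss.xml
"""


def load_feeds(csv_file):
    """Return a list of (category, feed_url) pairs read from an open csv file."""
    return [(row["category"], row["feed_url"]) for row in csv.DictReader(csv_file)]


feeds = load_feeds(io.StringIO(FEEDS_CSV))
for category, feed_url in feeds:
    print(category, "->", feed_url)

# A cron entry such as the following (hypothetical path) would run the
# script at minute 0 of every hour:
#   0 * * * * python3 /path/to/bot.py
```

With a real file you would pass open('feeds.csv', newline='') instead of the in-memory string.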

Basics and pre-requisites:

There are two pre-requisites you need for understanding this project completely. The first is web scraping, the technique of extracting data from websites programmatically. The second is basic knowledge of APIs and how to handle basic GET, PUT and POST requests. As we have already discussed web scraping in this web scraping post, here we will dive into the basic usage of APIs and the different forms of requests.

What is API?

API is the abbreviation of application programming interface. An API is created by an organization or service to provide data to developers or users in a programmatic way. Examples are the APIs provided by reddit, Facebook and many other sources.
For our purposes, an API is basically a URL which, when accessed through an HTTP request, returns data, or accepts data and performs some action on the server. Before using any API, I recommend first trying it either in your browser or in Postman. Postman is a platform for calling APIs with the proper headers and body and then inspecting the output; its whole purpose is to visualize and experiment with APIs. Therefore, we will use Postman for API visualization in this project.
So how does an API work for an end user? To answer that, we first have to know what types of requests there are.
From a user's perspective, the common request types (HTTP methods) of a REST API are (1) GET (2) PUT (3) POST. GET is used to fetch data from an origin server using an HTTP request. PUT and POST are used to create and replace resources on the server. There are a number of differences between PUT and POST, but that discussion is beyond the scope of this post; if you are interested, you can follow put-vs-post-in-rest.
For this post, it is enough to remember that a GET API is used to fetch data and a POST API is used to create new data on the server.
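To make the difference concrete, here is a small sketch that builds (but does not send) one GET and one POST request with Python's standard urllib module, so it runs without any network access. The URL and the field names are placeholders, not a real API.

```python
import json
from urllib.request import Request

API_URL = "https://example.com/api/posts"  # placeholder URL, not a real endpoint

# A GET request carries no body: it only asks the server for data.
get_request = Request(API_URL, method="GET")

# A POST request carries a body (JSON here) that the server uses to create something.
payload = {"title": "Hello", "url": "https://example.com/article"}
post_request = Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Nothing is sent; we only inspect what each request would contain.
print(get_request.get_method(), get_request.data)
print(post_request.get_method(), post_request.data)
```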

What is RSS feed and normal template to scrape?

In this project we are going to scrape RSS feeds. But before that, you need to know what an RSS feed is and what its usual structure looks like. We are going to visit 2-3 types of RSS pages, look at their structure and discuss how to scrape each of the items based on such templates.
A point to note is that I am using the Firefox browser to open these feeds, so some of them will appear in XML format, but that is not our concern.
The first and most easily scrapable format is the one on the page https://blog.hubspot.com/marketing/rss.xml. Such feeds are standard XML document trees, which is what you see once you open one.

On this page you can see a clear hierarchy of tags. We want to scrape the actual news article items from these feeds. If you look around a little, you can see that all the articles are located under item tags. That was one RSS feed, but if you look into a number of them, you will soon see that putting each article under an item tag is standard for RSS feeds.
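To show this hierarchy without opening a browser, here is a minimal sketch that parses a made-up feed with Python's built-in xml.etree.ElementTree (not the Selenium approach used in this project). The feed content is invented for illustration; only the rss, channel and item layout follows the usual RSS structure.

```python
import xml.etree.ElementTree as ET

# Invented sample feed that mimics the usual RSS layout.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.com/</link>
    <item>
      <title>First article</title>
      <link>https://example.com/first</link>
      <description>Summary of the first article.</description>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/second</link>
      <description>Summary of the second article.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)

# Every article lives under an <item> tag inside <channel>.
items = root.findall("./channel/item")
for item in items:
    print(item.findtext("title"), "->", item.findtext("link"))
```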
To scrape such a tag, you have to start the webdriver and then use the find_elements_by_tag_name function (recent Selenium 4 releases have removed this helper; the equivalent call there is driver.find_elements(By.TAG_NAME, 'item')) and write
AllItems = driver.find_elements_by_tag_name('item')
which stores all the item tags as session element objects. Here is a point to note: a session element object is alive only while the browser is open and selenium is accessing the page. That's why it is always necessary to extract the plain values (text, links and so on) from these objects before closing the browser.
A new problem sometimes comes up when you scrape the parts of an item, such as the title, link and description. It arises when the feed wraps these contents in a <![CDATA[ ... ]]> section. CDATA sections let a feed include characters such as < and & without escaping them, but the catch is that the .text property does not work on these tags. Therefore, to access this hidden text, you should use .get_attribute('textContent'). The textContent attribute retrieves all the text content of a tag, whether or not it is displayed. I got this little solution from here in stackoverflow.
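If you have not seen CDATA before, this small sketch shows what it does. It uses Python's built-in xml.etree.ElementTree rather than Selenium, and the item content is invented: the parser hands back the wrapped text exactly as written, including the & and the inner tag.

```python
import xml.etree.ElementTree as ET

# Invented item whose title is wrapped in a CDATA section.
snippet = "<item><title><![CDATA[Tom & Jerry <b>return</b>]]></title></item>"

# An XML parser treats everything inside CDATA as plain text, not as markup.
title = ET.fromstring(snippet).findtext("title")
print(title)
```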
So in short, I created a small function to scrape all the items in one RSS feed, put each item in a dict and collect all the items in a list of dicts.
This is the standard function I use to scrape a target page. The only input it needs is the url of the target page. I will describe it step by step now; if you already understand it, you can skip the next paragraph, otherwise keep reading.

We first initialize the webdriver using webdriver.Firefox(). Next we access the page via driver.get(page); to let the function work smoothly, we wrap this in a try/except block, so that a wrongly formatted or broken page just produces an error message.
Next, we collect all the session elements with the tag 'item'. Then, using the same find_element_by_tag_name and get_attribute('textContent') tricks, we get the link, title and thumbnail of each item.
A point to note is that, at the time of writing this post, my function scrapes specific RSS feeds and is not fully automated for any RSS feed; we will come to the reason later.
The last parts are very basic, with the list append and time.sleep. The only important thing we do before returning is driver.close(), which closes the browser window opened in the beginning. And with that, our scraping of a simple RSS feed is done.
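The steps described above can be sketched roughly as follows. This is not the exact function from the project: the names scrape_feed, hidden_text and scrape_with_firefox are made up here, the sketch uses the Selenium 4 style find_elements(by, value) call, and reading the thumbnail as the text of a 'thumbnail' tag is an assumption that will not hold for every feed. It also uses driver.quit(), which ends the whole browser session, where the original uses driver.close().

```python
import time

TAG_NAME = "tag name"  # the same string value as selenium's By.TAG_NAME


def hidden_text(parent, tag):
    """Return the text of the first <tag> child of parent, or '' if missing."""
    children = parent.find_elements(TAG_NAME, tag)
    if not children:
        return ""
    # textContent also returns text that .text misses (e.g. CDATA content)
    return (children[0].get_attribute("textContent") or "").strip()


def scrape_feed(driver, page, pause=1.0, thumb_tag="thumbnail"):
    """Scrape every <item> of an RSS page into a list of dicts."""
    try:
        driver.get(page)
    except Exception as error:  # broken or wrongly formatted page
        print(f"Could not open {page}: {error}")
        return []
    item_list = []
    for item in driver.find_elements(TAG_NAME, "item"):
        item_list.append({
            "title": hidden_text(item, "title"),
            "url": hidden_text(item, "link"),
            # ASSUMPTION: the thumbnail tag name differs between feeds
            "thumb": hidden_text(item, thumb_tag),
        })
    time.sleep(pause)  # short pause before handing the results back
    return item_list


def scrape_with_firefox(page):
    """Open Firefox, scrape the page and always shut the browser down."""
    from selenium import webdriver  # imported here so the helpers above work without it

    driver = webdriver.Firefox()
    try:
        return scrape_feed(driver, page)
    finally:
        driver.quit()
```

Passing the driver in as an argument keeps the scraping logic separate from the browser start-up, so the values are copied out of the session elements before the browser is closed.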

How to post contents through API?

In this part we go through the second part of the task, which is how to use the API. The important thing here is that there is no documentation for the APIs used in upvotocracy, so we first have to decode how the API works.

Specifically, the simple way I figured it out was to watch the network tab of the browser during a manual posting. When you post content, the POST API calls can be seen there in detail. That's how I figured out that for posting content to https://upvotocracy.com/api/1/posts, the json body needs three items: title, url and thumb. These are for posting link content to the threads, which generally has a title, a url and a thumbnail picture. So you need to provide a text title, a valid url and, in thumb, a valid thumbnail image url.

For posting to an API, I have used the requests library of python. Although I know only a bit about the requests library, it was enough to post the content. To post with this library, you use a call like the following:

req = requests.post(api_url, headers=login_header, json=actual_json)
Here the post function is used to call a POST API; api_url is passed as the url parameter, headers takes the request headers (including the login details), and json takes the actual json body of the request.
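Here is a minimal sketch of one such call, assuming the three body fields described above. The helper names build_post_body and send_post are made up for this example, and the Authorization header with a Bearer token is only a guess at the header format; check the real one in your browser's network tab.

```python
API_URL = "https://upvotocracy.com/api/1/posts"  # endpoint named in this post


def build_post_body(title, url, thumb):
    """Return the three fields the posting API was observed to expect."""
    return {"title": title, "url": url, "thumb": thumb}


def send_post(body, token):
    """POST one JSON body and return the requests.Response object."""
    import requests  # third-party library: pip install requests

    # ASSUMPTION: header name and "Bearer" scheme are guesses, not documented
    login_header = {"Authorization": f"Bearer {token}"}
    # json= makes requests serialize the dict and set the Content-Type header
    return requests.post(API_URL, headers=login_header, json=body, timeout=30)


body = build_post_body(
    "An example title",
    "https://example.com/article",
    "https://example.com/thumb.png",
)
print(body)  # send_post is deliberately not called in this example
```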

Similar approaches apply to GET APIs via the requests module, but that is outside the scope of the current discussion.

Finally, to post every item in the item list we created in the last section, I go through each of those dictionaries, create the proper json body and hit the API with it. The function for that is post_document.
In the post_document function, I go through each of the items in the dictionary list, add the additional fields required (figured out from the params sent in a manual posting session in upvotocracy, seen in the Network tab of the browser console, as described earlier), and finally post them using requests.
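A rough sketch of such a loop is shown here; it is not the project's actual post_document. The extra field name category is hypothetical (the real names come from the network tab), and the send argument stands for any callable that performs the actual request, for example a small wrapper around requests.post.

```python
def post_document(item_list, category, send):
    """Post every scraped item and return (title, status_code) pairs.

    `send` is any callable that takes the JSON body and returns an object
    with a `status_code` attribute, e.g. a wrapper around requests.post.
    """
    results = []
    for item in item_list:
        body = {
            "title": item["title"],
            "url": item["url"],
            "thumb": item["thumb"],
            # ASSUMPTION: hypothetical name for the thread-specific extra field
            "category": category,
        }
        response = send(body)
        results.append((item["title"], response.status_code))
    return results
```

Keeping the network call behind the send argument means the loop can be tried out with a stand-in function before it is pointed at a real server.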
Here you should ask, if you don't know it already: what does the post function return? It returns a response object that carries the HTTP status code of the request we sent and, for example with GET APIs, the data sent by the server. For a POST API, which is the case at hand, the status code is important to check, because a status outside the 200 series means something went wrong. This requires basic knowledge of HTTP status codes, so I will discuss a few common ones that I ended up getting in this project.
(1) 404 error: this comes when you are trying to access a non-existing page. If you are getting this status, please re-check your code, as you may have entered a wrong url after all.
(2) 401 error: this comes when the server cannot authenticate you, i.e. the credentials in your request are missing or invalid (a badly formed request in general gives a 400 error instead). It typically arises from missing or wrong elements in the headers. I personally didn't enter the authorization item properly the first few times and therefore wound up with a 401 error.
(3) 422 error: this comes when the request has reached the server and the server understands who you are and what is in your header, but it cannot process the json, i.e. the body of the request. In simpler language, your json is not written correctly or misses one or more items. Personally I got this error when I forgot to create json out of my dictionary and ended up sending the dictionary instead of json.
Even once you overcome these, you may still get
(4) 502/503/504 errors: these are not client-side but server-side errors. They basically mean that the server, or a gateway in front of it, is not responding properly at that moment, so your request fails.

When the request completes successfully, you get a 201 status code for the POST API and 200 for a GET API.
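As a small helper for reading these codes while debugging, here is a sketch that maps a status code to a short hint. The function name describe_status and the hint texts are made up for this post; the code ranges follow the standard HTTP classes.

```python
def describe_status(code):
    """Return a short human-readable hint for an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if code == 401:
        return "not authenticated: check the authorization header"
    if code == 404:
        return "not found: check the url"
    if code == 422:
        return "unprocessable body: check the json fields"
    if 400 <= code < 500:
        return "other client-side error: check the request"
    if 500 <= code < 600:
        return "server-side error: try again later"
    return "unexpected status"


for code in (201, 401, 404, 422, 503):
    print(code, "->", describe_status(code))
```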
In my code, I have used the json module to turn my dictionaries into json format. Other than that, the rest of the post_document function is basic python.
So, this was the last step of how we post to upvotocracy using an automated project. Next we will discuss other possible developments and why I got thrashed by the community despite creating such an awesome program. But before all of that, please take a look at the whole code for this project, which also contains some small functions doing small and cool stuff. The link for the github project is here.

On to the failures and learnings from this project:

As you can tell by this point, I wanted to populate the community with good quality posts in all the threads, and I was just starting up. With automatic posting in hand, all I had to do was assemble a bunch of similar RSS feeds and assign them their respective formats and category id (the thread-specific category id in upvotocracy), and boom: my free karma and great posts in the upvotocracy threads were on the way.
But while all of this was great, there were certain problems with the RSS feeds. I ended up taking CBN (Christian Broadcasting Network) as a news RSS feed, and that's where the community hatred started. The CBN feeds were full of religiously targeted news, some of it spreading communal hatred, which I totally didn't see at the beginning.
Another problem in this project was the quality of the posts. Some of the marketing posts I started to post were called too basic or ordinary for the marketing thread, and people did not like the fact that I was posting them all in the same fashion.
The same quality issue occurred with some of the posts I scraped from the echojs feed, which ended up with someone's low quality and wrong code from github being posted to the javascript thread. People heavily disliked that post.
And last but not least, the sheer look of spam in the project. The project was good, but people don't like the idea of one person posting 100 posts in one thread in under 3 seconds, especially when even one of them has a quality issue. So looking like spam was an issue too.

So in the end, as ambitious as the project was, I had to stop it in its current format and delete the CBN posts manually (I could have programmed it, but didn't want to). The reason I discussed these problems is that all of them lead to phase 2 of the project, which at the current time (15 May 2020) I have not started yet.
Phase 2 of the project will be more complicated than this. The goals are:
(1) to find racist or community-offending news via the description of a feed item and delete it from the item list before posting
(2) to automatically detect when an RSS feed changes and scrape it again when it does
(3) to find out how to measure the quality of a technical post [this can become a huge project in itself]
(4) to add human-like behaviour to the posting so that it does not look spammy.
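Since phase 2 has not been started, the following is only a sketch of how goals (1) and (4) might begin. The keyword blocklist is a crude stand-in for real offensive-content detection, the example keywords and the delay bounds are arbitrary placeholders, and the function names are made up.

```python
import random
import re

BLOCKLIST = {"hate", "riot"}  # hypothetical placeholder keywords


def is_acceptable(item, blocklist=BLOCKLIST):
    """Return False if the title or description contains a blocked word."""
    text = (item.get("title", "") + " " + item.get("description", "")).lower()
    # Compare whole words so that e.g. "whatever" does not match "hate".
    words = set(re.findall(r"[a-z]+", text))
    return not (words & set(blocklist))


def posting_delays(count, low=60.0, high=600.0, rng=random):
    """Return `count` random waits in seconds to spread posts over time."""
    return [rng.uniform(low, high) for _ in range(count)]


demo_items = [
    {"title": "Whatever works in marketing", "description": "Ten tips."},
    {"title": "Riot in the city", "description": "Breaking news."},
]
kept = [item for item in demo_items if is_acceptable(item)]
print([item["title"] for item in kept])
```

Each value from posting_delays could then be passed to time.sleep between two posts instead of sending everything at once.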
With that, I conclude this discussion. Thanks for reading. I plan to post the phase 2 updates later on. Till then, stay tuned!
