Being in-house at a publisher (which, by definition, publishes a lot), I was looking for a way to review everything released over the past X days and quickly scan the SEO titles.
The solution I came up with (and I’m giving you today) is:
- Crawl a list of URLs from Google Sheet
- Retrieve and scrape some useful data
- Export the result back to Google Sheet
It doesn’t cover:
- how to get the list into Google Sheets, since that method will vary depending on the platform you’re using
- how to run the script every day automatically (I will in a future post; spoiler alert, it’s a cron job running on AWS) (and for FREE 🙂 )
1 – The first step (and probably the least fun) is to set up your Google account
I’m using the excellent gspread Python library and the website is very clear so instead of copy/pasting I will just give you the right link: https://gspread.readthedocs.io/en/latest/oauth2.html#oauth-client-id
Summary:
- Go to Google developer console https://console.developers.google.com/project
- Create a project (or use an existing one)
- Activate Google Drive API and Google Sheets API
- APIs & Services > Credentials | Create credentials > Service account key | Create
- Actions>Create Key>JSON
Congratulations! You now have a JSON file on your computer. I renamed it ‘secret.json’ and placed it in the same directory as the Python code.
Copy the service account email from the Service accounts menu and share your Google Sheet with that email. It looks something like this:
{account name}@{project name}-266558842.iam.gserviceaccount.com
You can also find it in the JSON file you just downloaded, under the “client_email” key.
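If you want to double-check which email to share the spreadsheet with, here is a two-line sketch (assuming you kept the ‘secret.json’ name from above) that prints it straight from the file:
import json
# Prints the service account email you need to share the spreadsheet with
with open('secret.json') as f:
    print(json.load(f)['client_email'])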
2 – Now, the fun part where Python does all the work for you
Disclaimer: I’m not a professional developer so the code is probably (absolutely?) suboptimal but it works. I’m more than open to hearing your thoughts and optimizations!
I assume it’s not your first Python gig, so I won’t explain how to install Python or libraries. You’ll need:
- gspread https://gspread.readthedocs.io/en/latest/index.html
- requests https://requests.readthedocs.io/en/master/
- BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/
pip install requests gspread beautifulsoup4 oauth2client
In the code below, you just have to replace where it’s commented:
- json file name
- spreadsheet name
- source sheet name + column number
- destination sheet name
- if you don’t have a header you can remove the ‘pop’
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import requests
from bs4 import BeautifulSoup
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('secret.json',scope)
client = gspread.authorize(creds)
ssheet = client.open('SEO - Daily <title> check')
#name of your spreadsheet
sheet= ssheet.worksheet('urls source')
#name of your sheet
urls = sheet.col_values(7)
#I'm getting the URLs in the 7th column, there's a good chance it'll be different for you
urls.pop(0) #removes first item (i.e. header) if no header just remove that line
result = []
for url in urls:
    if len(url) > 0:
        print(str(url))
        try:
            page = requests.get(url)
            soup = BeautifulSoup(page.content, 'html.parser')
            result.append([url, page.status_code, soup.find('title').text])
            # I'm only getting URL + STATUS CODE + <TITLE> but your imagination (and bs4 here) is the limit
        except Exception as e:
            print("error", url, e)
ssheet.values_clear('data!A2:D')
#clear the destination sheet first ('data' is the name of my destination sheet)
ssheet.values_update(
'data!A2',
params={
'valueInputOption': 'USER_ENTERED'
#https://developers.google.com/sheets/api/reference/rest/v4/ValueInputOption
},
body={
'values': result
}
)
Voilà! The next step is to run a cron job every night so the data magically appears in the morning.
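If you want to scrape more than the <title>, you can extend the row built inside the try block. As a sketch (not part of the original script), here is how you might also grab the meta description with BeautifulSoup; the guard handles pages where the tag is missing:
# inside the try block, replacing the original result.append(...)
desc_tag = soup.find('meta', attrs={'name': 'description'})
description = desc_tag['content'] if desc_tag and desc_tag.has_attr('content') else ''
result.append([url, page.status_code, soup.find('title').text, description])
The cleared range 'data!A2:D' already spans four columns, so URL, status code, title and description fit without changing the export part.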
SEO/Data Enthusiast
I help international organizations and large-scale websites to grow intent-driven audiences on transactional content and to develop performance-based strategies.
Currently @ZiffDavis – Lifehacker, Mashable, PCMag
ex @DotdashMeredith, @FuturePLC